Sorting, Paging, and Deep Paging in Solr
NOTE: Solr Deep Paging with cursorMark functionality is not yet in a released Solr version, so to try it out you’ll need to download either a Heliosearch/Solr release or a nightly test build of Solr.
Basic Sorting
First let’s add 12 documents (in this case metadata about books) to Solr in CSV format (Comma Separated Values):
$ curl 'http://localhost:8983/solr/update?commitWithin=5000' -H 'Content-type:text/csv' -d '
id,cat,pubyear_i,title,author,series_s,sequence_i
book1,fantasy,2000,A Storm of Swords,George R.R. Martin,A Song of Ice and Fire,3
book2,fantasy,2005,A Feast for Crows,George R.R. Martin,A Song of Ice and Fire,4
book3,fantasy,2011,A Dance with Dragons,George R.R. Martin,A Song of Ice and Fire,5
book4,sci-fi,1987,Consider Phlebas,Iain M. Banks,The Culture,1
book5,sci-fi,1988,The Player of Games,Iain M. Banks,The Culture,2
book6,sci-fi,1990,Use of Weapons,Iain M. Banks,The Culture,3
book7,fantasy,1984,Shadows Linger,Glen Cook,The Black Company,2
book8,fantasy,1984,The White Rose,Glen Cook,The Black Company,3
book9,fantasy,1989,Shadow Games,Glen Cook,The Black Company,4
book10,sci-fi,2001,Gridlinked,Neal Asher,Ian Cormac,1
book11,sci-fi,2003,The Line of Polity,Neal Asher,Ian Cormac,2
book12,sci-fi,2005,Brass Man,Neal Asher,Ian Cormac,3
'
Now we can issue a query with the following parameters:
q=id:book* – Matches document ids that start with “book”.
sort=pubyear_i desc – Sorts matches in descending order by the year of publication.
fl=title,pubyear_i – “fl” stands for “field list”: the stored fields to return for the resulting matches.
(If Solr is up and running on the same box as your browser, you can issue this query straight from the browser’s address bar.) The response looks like this:
{
"responseHeader":{
"status":0,
"QTime":2,
"params":{
"fl":"title,pubyear_i",
"sort":"pubyear_i desc",
"q":"id:book*"}},
"response":{"numFound":12,"start":0,"docs":[
{
"pubyear_i":2011,
"title":["A Dance with Dragons"]},
{
"pubyear_i":2005,
"title":["A Feast for Crows"]},
{
"pubyear_i":2005,
"title":["Brass Man"]},
{
"pubyear_i":2003,
"title":["The Line of Polity"]},
{
"pubyear_i":2001,
"title":["Gridlinked"]},
{
"pubyear_i":2000,
"title":["A Storm of Swords"]},
{
"pubyear_i":1990,
"title":["Use of Weapons"]},
{
"pubyear_i":1989,
"title":["Shadow Games"]},
{
"pubyear_i":1988,
"title":["The Player of Games"]},
{
"pubyear_i":1987,
"title":["Consider Phlebas"]}]
}}
Note that we found 12 books (see numFound in the response above) but there are only 10 books in the response. This is because the rows parameter defaults to 10.
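The same query can also be assembled in code. Below is a minimal Python sketch; the base URL, the use of the /select handler with wt=json, and the helper name build_query_url are assumptions made for illustration, so adjust them to your own installation.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def build_query_url(q, sort, fl, rows=None, start=None, base=SOLR_SELECT_URL):
    """Return a query URL for the given Solr parameters (hypothetical helper)."""
    params = {"q": q, "sort": sort, "fl": fl, "wt": "json"}
    # rows and start are optional; Solr defaults them to 10 and 0.
    if rows is not None:
        params["rows"] = rows
    if start is not None:
        params["start"] = start
    # urlencode takes care of escaping spaces, colons, and asterisks.
    return base + "?" + urlencode(params)

url = build_query_url("id:book*", "pubyear_i desc", "title,pubyear_i")
print(url)
```

Passing rows=12 to this helper would ask for all twelve books in a single response instead of the default ten.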
Basic Paging
There are two parameters that control paging:
start – The starting offset into the ranked (sorted) list of documents. Defaults to 0.
rows – The maximum number of documents to return. Defaults to 10.
For example, if we add start=3 and rows=2 to the example query above, we should get the 4th and 5th books in the ranked document list.
{"response":{"numFound":12,"start":3,"docs":[
{
"pubyear_i":2003,
"title":["The Line of Polity"]},
{
"pubyear_i":2001,
"title":["Gridlinked"]}]
}}
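Conceptually, start and rows behave like a slice over the ranked list. The Python sketch below models that with the example data; it is only an in-memory illustration, not how Solr is implemented, and it relies on Python's stable sort to order the two 2005 titles, whereas Solr breaks such ties by internal document id.

```python
# (pubyear_i, title) pairs for the 12 example books, in indexing order.
BOOKS = [
    (2000, "A Storm of Swords"), (2005, "A Feast for Crows"),
    (2011, "A Dance with Dragons"), (1987, "Consider Phlebas"),
    (1988, "The Player of Games"), (1990, "Use of Weapons"),
    (1984, "Shadows Linger"), (1984, "The White Rose"),
    (1989, "Shadow Games"), (2001, "Gridlinked"),
    (2003, "The Line of Polity"), (2005, "Brass Man"),
]

def page(docs, start=0, rows=10):
    """Toy model of start/rows: rank by year descending, then slice."""
    ranked = sorted(docs, key=lambda d: -d[0])  # stable: ties keep indexing order
    return ranked[start:start + rows]

print(page(BOOKS, start=3, rows=2))
# [(2003, 'The Line of Polity'), (2001, 'Gridlinked')]
```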
Deep Paging
Deep paging refers to specifying a large start offset into the search results.
Basic paging can be inefficient with large start values. To return documents 1,000,001 through 1,000,010 in a sorted document list (only 10 documents), the search engine must find the top 1,000,010 documents and then take the last 10 to return to the user. Solr is smart enough to only retrieve the stored fields for the final 10 documents, but there is still the overhead of sorting the internal ids of the top 1,000,010 documents.
Deep paging via basic paging controls is even more inefficient for distributed searches (SolrCloud) since the sort values for the first 1,000,010 documents from each shard need to be returned and merged at an aggregator node in order to find the correct 10.
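To make the cost concrete, here is a small Python sketch of the work described above. It is a simplified model under stated assumptions, not Solr's actual code: offset_page and entries_merged are hypothetical names, and the distributed figure is simply start+rows entries per shard.

```python
import heapq

def offset_page(sort_values, start, rows):
    """Return one page of descending-sorted values the way an offset query must:
    collect the top start+rows entries, then throw away the first `start`."""
    top = heapq.nlargest(start + rows, sort_values)
    return top[start:]

def entries_merged(num_shards, start, rows):
    """Sort entries the aggregator has to merge if every shard sends start+rows."""
    return num_shards * (start + rows)

# 95 entries are collected just to return 5 of them.
print(offset_page(range(100), start=90, rows=5))   # [9, 8, 7, 6, 5]

# With 4 shards, a 10-row page at offset 1,000,000 merges about 4 million entries.
print(entries_merged(num_shards=4, start=1_000_000, rows=10))   # 4000040
```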
Deep Paging with a Cursor
The cursorMark parameter allows efficient iteration over a large result set. It works both on a single node and with distributed searches (SolrCloud mode).
Using cursorMark:
- sort must include a tie-breaker sort on the id field. This prevents tie-breaking by internal lucene document id (which can change).
- start must be 0 for all calls including a cursorMark.
- pass cursorMark=* for the first request.
- Solr will return a nextCursorMark in the response. Simply use this value for cursorMark on the next call to continue paging through the results.
First request:
{"response":{"numFound":12,"start":0,"docs":[
{
"pubyear_i":2011,
"title":["A Dance with Dragons"]},
{
"pubyear_i":2005,
"title":["A Feast for Crows"]},
{
"pubyear_i":2005,
"title":["Brass Man"]},
{
"pubyear_i":2003,
"title":["The Line of Polity"]},
{
"pubyear_i":2001,
"title":["Gridlinked"]}]
},
"nextCursorMark":"AoJRfSVib29rNw=="}
For the next request, simply set cursorMark to the value we received for nextCursorMark.
{"response":{"numFound":12,"start":0,"docs":[
{
"pubyear_i":2000,
"title":["A Storm of Swords"]},
{
"pubyear_i":1990,
"title":["Use of Weapons"]},
{
"pubyear_i":1989,
"title":["Shadow Games"]},
{
"pubyear_i":1988,
"title":["The Player of Games"]},
{
"pubyear_i":1987,
"title":["Consider Phlebas"]}]
},
"nextCursorMark":"AoJTfCVib29rNA=="}
And our final request, again setting cursorMark to the new value we received for nextCursorMark in the response.
{"response":{"numFound":12,"start":0,"docs":[
{
"pubyear_i":1984,
"title":["Shadows Linger"]},
{
"pubyear_i":1984,
"title":["The White Rose"]}]
},
"nextCursorMark":"AoJQfCZib29rMTE="}
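The request sequence above lends itself to a simple client loop. The following Python sketch is one possible way to write it; the base URL, the /select handler, and the names fetch_page and iter_docs are assumptions for illustration. The fetch function is passed in as a parameter so the loop can be exercised without a running Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed location of a local Solr instance; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def fetch_page(params, base=SOLR_SELECT_URL):
    """Send one query to Solr and return the decoded JSON response."""
    with urlopen(base + "?" + urlencode(params)) as resp:
        return json.load(resp)

def iter_docs(q, sort, rows, fl=None, fetch=fetch_page):
    """Yield every matching document, one cursor page at a time.

    `sort` must end with a tie-breaker on the id field, e.g. "pubyear_i desc,id asc".
    """
    params = {"q": q, "sort": sort, "rows": rows, "wt": "json", "cursorMark": "*"}
    if fl:
        params["fl"] = fl
    while True:
        result = fetch(dict(params))  # copy so the fetcher cannot mutate our state
        yield from result["response"]["docs"]
        next_mark = result["nextCursorMark"]
        if next_mark == params["cursorMark"]:
            break  # the cursor did not advance: end of the result set
        params["cursorMark"] = next_mark
```

With the example index, iterating over iter_docs("id:book*", "pubyear_i desc,id asc", rows=5, fl="title,pubyear_i") should issue the three requests shown above plus one more that returns no documents and an unchanged cursor, which is what ends the loop.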
Deep Paging cursorMark implementation notes
- The cursorMark parameter itself contains all the necessary state. There is no server-side state.
- The start parameter returned is always 0. It’s up to the client to figure out (or remember) what the position is for display purposes.
- There is no need to page to the end of the result set with cursorMark (since there is no server-side state kept). Stop wherever you want.
- You know you have reached the end of a result set when you do not get back the full number of rows requested, or when the nextCursorMark returned is the same as the cursorMark you sent (at which point, no documents will be in the returned list).
- Although start must always be 0, you can vary the number of rows for every call to vary the page size.
- You can re-use cursorMark values, changing other things like what stored fields are returned or what fields are faceted.
- A client can efficiently go back pages by remembering previous cursorMarks and re-submitting them.
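To illustrate how a cursor can carry all the state itself, here is a toy in-memory model in Python. It is purely an assumption-laden sketch: the mark here is a base64-encoded JSON copy of the last document's sort values, which is not Solr's real encoding, and toy_cursor_page is a hypothetical name. It does show why the id tie-breaker matters: without it, the two 1984 books would share a sort key and the "strictly after" test could not tell them apart.

```python
import base64
import json

# (id, pubyear_i) for the 12 example books.
BOOKS = [("book1", 2000), ("book2", 2005), ("book3", 2011), ("book4", 1987),
         ("book5", 1988), ("book6", 1990), ("book7", 1984), ("book8", 1984),
         ("book9", 1989), ("book10", 2001), ("book11", 2003), ("book12", 2005)]

def sort_key(doc):
    doc_id, year = doc
    return (-year, doc_id)  # pubyear_i desc, then id asc as the tie-breaker

def encode_mark(key):
    return base64.urlsafe_b64encode(json.dumps(key).encode()).decode()

def decode_mark(mark):
    return tuple(json.loads(base64.urlsafe_b64decode(mark)))

def toy_cursor_page(docs, rows, cursor_mark="*"):
    """Return (page, next_mark); the mark alone says where to resume."""
    ranked = sorted(docs, key=sort_key)
    if cursor_mark != "*":
        last = decode_mark(cursor_mark)
        # Keep only documents that sort strictly after the last one returned.
        ranked = [d for d in ranked if sort_key(d) > last]
    page = ranked[:rows]
    next_mark = encode_mark(sort_key(page[-1])) if page else cursor_mark
    return page, next_mark

mark, sizes = "*", []
while True:
    page, next_mark = toy_cursor_page(BOOKS, rows=5, cursor_mark=mark)
    sizes.append(len(page))
    if next_mark == mark:
        break  # empty page and unchanged mark: end of results
    mark = next_mark
print(sizes)  # [5, 5, 2, 0]
```

Because nothing is stored between calls, re-submitting an earlier mark to toy_cursor_page returns the same page again, which mirrors the note above about going back by remembering previous cursorMarks.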
