Sorting, Paging, and Deep Paging in Solr
NOTE: Solr Deep Paging with cursorMark functionality is not yet in a released Solr version, so to try it out you’ll either need to download a Heliosearch/Solr release, or a nightly test build of Solr.
Basic Sorting
First let’s add 12 documents (in this case metadata about books) to Solr in CSV format (Comma Separated Values):
$ curl http://localhost:8983/solr/update?commitWithin=5000 -H 'Content-type:text/csv' -d '
id,cat,pubyear_i,title,author,series_s,sequence_i
book1,fantasy,2000,A Storm of Swords,George R.R. Martin,A Song of Ice and Fire,3
book2,fantasy,2005,A Feast for Crows,George R.R. Martin,A Song of Ice and Fire,4
book3,fantasy,2011,A Dance with Dragons,George R.R. Martin,A Song of Ice and Fire,5
book4,sci-fi,1987,Consider Phlebas,Iain M. Banks,The Culture,1
book5,sci-fi,1988,The Player of Games,Iain M. Banks,The Culture,2
book6,sci-fi,1990,Use of Weapons,Iain M. Banks,The Culture,3
book7,fantasy,1984,Shadows Linger,Glen Cook,The Black Company,2
book8,fantasy,1984,The White Rose,Glen Cook,The Black Company,3
book9,fantasy,1989,Shadow Games,Glen Cook,The Black Company,4
book10,sci-fi,2001,Gridlinked,Neal Asher,Ian Cormac,1
book11,sci-fi,2003,The Line of Polity,Neal Asher,Ian Cormac,2
book12,sci-fi,2005,Brass Man,Neal Asher,Ian Cormac,3
'
Now we can issue a query with the following parameters:
q=id:book* – matches document ids that start with “book”
sort=pubyear_i desc – sorts matches in descending order by the year of publication
fl=title,pubyear_i – “fl” stands for “field list”, the stored fields to return for the resulting matches
(if Solr is up and running on the same box as your browser, you can issue this query directly from the browser)
{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "fl":"title,pubyear_i",
      "sort":"pubyear_i desc",
      "q":"id:book*"}},
  "response":{"numFound":12,"start":0,"docs":[
    { "pubyear_i":2011, "title":["A Dance with Dragons"]},
    { "pubyear_i":2005, "title":["A Feast for Crows"]},
    { "pubyear_i":2005, "title":["Brass Man"]},
    { "pubyear_i":2003, "title":["The Line of Polity"]},
    { "pubyear_i":2001, "title":["Gridlinked"]},
    { "pubyear_i":2000, "title":["A Storm of Swords"]},
    { "pubyear_i":1990, "title":["Use of Weapons"]},
    { "pubyear_i":1989, "title":["Shadow Games"]},
    { "pubyear_i":1988, "title":["The Player of Games"]},
    { "pubyear_i":1987, "title":["Consider Phlebas"]}]
  }}
Note that we found 12 books (see numFound in the response above) but there are only 10 books in the response. This is because the rows parameter defaults to 10.
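If you would rather build the request from a script than type it into a browser, the sketch below assembles the same three parameters into a URL. The /solr/select path and the wt=json parameter are assumptions about your setup (they are not taken from the example above), and solr_query_url is a made-up helper name.

```python
from urllib.parse import urlencode


def solr_query_url(params, base="http://localhost:8983/solr/select"):
    """Build a query URL; `base` and wt=json are assumptions, adjust as needed."""
    # urlencode escapes characters such as ':' and turns the space
    # in "pubyear_i desc" into '+'.
    return base + "?" + urlencode(dict(params, wt="json"))


url = solr_query_url({
    "q": "id:book*",            # ids starting with "book"
    "sort": "pubyear_i desc",   # newest first
    "fl": "title,pubyear_i",    # stored fields to return
})
print(url)
```

Adding a rows entry to the dictionary would override the default of 10 and return all 12 books in one response.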
Basic Paging
There are two parameters that control paging:
start – The starting offset into the ranked (sorted) list of documents. Defaults to 0.
rows – The maximum number of documents to return. Defaults to 10.
For example, if we add start=3 and rows=2 to the example query above, we should get the 4th and 5th books in the ranked document list.
{"response":{"numFound":12,"start":3,"docs":[
    { "pubyear_i":2003, "title":["The Line of Polity"]},
    { "pubyear_i":2001, "title":["Gridlinked"]}]
  }}
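To make the start/rows semantics concrete, here is a small sketch that mimics them with a plain Python slice over the ranked titles from the first response. It only illustrates the offset arithmetic; the page helper is hypothetical and says nothing about how Solr computes the ranking.

```python
# The ten titles from the sorted response above, in ranked order.
ranked_titles = [
    "A Dance with Dragons", "A Feast for Crows", "Brass Man",
    "The Line of Polity", "Gridlinked", "A Storm of Swords",
    "Use of Weapons", "Shadow Games", "The Player of Games",
    "Consider Phlebas",
]


def page(ranked, start=0, rows=10):
    """Return at most `rows` items beginning at zero-based offset `start`."""
    return ranked[start:start + rows]


print(page(ranked_titles, start=3, rows=2))
# ['The Line of Polity', 'Gridlinked']
```

Because start is zero-based, start=3 skips the first three documents, which is why the 4th and 5th books come back.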
Deep Paging
Deep paging refers to specifying a large start offset into the search results.
Basic paging can be inefficient with large start values. To return documents 1,000,001 through 1,000,010 of a sorted document list (only 10 documents), the search engine must find the top 1,000,010 documents and then take the last 10 to return to the user. Solr is smart enough to only retrieve the stored fields for the final 10 documents, but there is still the overhead of sorting the internal ids of the top 1,000,010 documents.
Deep paging via basic paging controls is even more inefficient for distributed searches (SolrCloud) since the sort values for the first 1,000,010 documents from each shard need to be returned and merged at an aggregator node in order to find the correct 10.
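The cost described above can be illustrated with a toy simulation. This is only a sketch of the idea (each shard contributes its top start + rows sort values, and the aggregator merges them and discards everything before the offset), not Solr's actual implementation; distributed_page and the integer "sort values" are made up for the illustration.

```python
import heapq
from itertools import islice


def distributed_page(shards, start, rows):
    """Simulate offset paging over shards of sort values, descending order."""
    needed = start + rows
    # Every shard has to hand over its own top `needed` values, because
    # any of them could belong to the requested page.
    per_shard = [sorted(values, reverse=True)[:needed] for values in shards]
    # The aggregator merges the sorted lists and skips the first `start`.
    merged = heapq.merge(*per_shard, reverse=True)
    page = list(islice(merged, start, needed))
    collected = sum(len(top) for top in per_shard)
    return page, collected


# Values 0..99 spread over three shards.
shards = [range(0, 100, 3), range(1, 100, 3), range(2, 100, 3)]
page, collected = distributed_page(shards, start=10, rows=5)
print(page, collected)
# [89, 88, 87, 86, 85] 45
```

Here 45 values travel to the aggregator to produce a page of 5. With start=1,000,000 the same arithmetic gives 1,000,010 values per shard.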
Deep Paging with a Cursor
The cursorMark parameter allows efficient iteration over a large result set. It works both on a single node and with distributed searches (SolrCloud).
NOTE: this functionality is not yet in a released Solr version, so to try it out you’ll need either a Heliosearch/Solr release or a nightly test build of Solr.
Using cursorMark:
- sort must include a tie-breaker sort on the id field. This prevents tie-breaking by internal Lucene document id (which can change).
- start must be 0 for all calls including a cursorMark.
- Pass cursorMark=* for the first request.
- Solr will return a nextCursorMark in the response. Simply use this value for cursorMark on the next call to continue paging through the results.
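The rules above can be captured in a small client-side helper. This is a minimal sketch: cursor_params is a hypothetical function, and the check simply looks for a sort clause on a field named id, which assumes id is the unique key as in the example documents.

```python
def cursor_params(query, sort, rows, cursor_mark="*", unique_key="id"):
    """Build request parameters for one cursorMark call."""
    # Collect the field name of each "field direction" sort clause.
    fields = [clause.split()[0] for clause in sort.split(",") if clause.strip()]
    if unique_key not in fields:
        raise ValueError("sort must include a tie-breaker on %r" % unique_key)
    return {
        "q": query,
        "sort": sort,
        "rows": rows,
        "start": 0,                 # must be 0 whenever cursorMark is used
        "cursorMark": cursor_mark,  # "*" for the first request
    }


first = cursor_params("id:book*", "pubyear_i desc,id asc", rows=5)
print(first["cursorMark"], first["start"])
# * 0
```

For every following request, pass the nextCursorMark from the previous response as cursor_mark and keep everything else unchanged.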
First request:
{"response":{"numFound":12,"start":0,"docs":[
    { "pubyear_i":2011, "title":["A Dance with Dragons"]},
    { "pubyear_i":2005, "title":["A Feast for Crows"]},
    { "pubyear_i":2005, "title":["Brass Man"]},
    { "pubyear_i":2003, "title":["The Line of Polity"]},
    { "pubyear_i":2001, "title":["Gridlinked"]}]
  },
  "nextCursorMark":"AoJRfSVib29rNw=="}
For the next request, simply set cursorMark to the value we received for nextCursorMark.
{"response":{"numFound":12,"start":0,"docs":[
    { "pubyear_i":2000, "title":["A Storm of Swords"]},
    { "pubyear_i":1990, "title":["Use of Weapons"]},
    { "pubyear_i":1989, "title":["Shadow Games"]},
    { "pubyear_i":1988, "title":["The Player of Games"]},
    { "pubyear_i":1987, "title":["Consider Phlebas"]}]
  },
  "nextCursorMark":"AoJTfCVib29rNA=="}
And our final request, again setting cursorMark to the new value we received for nextCursorMark in the response.
{"response":{"numFound":12,"start":0,"docs":[
    { "pubyear_i":1984, "title":["Shadows Linger"]},
    { "pubyear_i":1984, "title":["The White Rose"]}]
  },
  "nextCursorMark":"AoJQfCZib29rMTE="}
Deep Paging cursorMark implementation notes
- The cursorMark parameter itself contains all the necessary state. There is no server-side state.
- The start parameter returned is always 0. It’s up to the client to figure out (or remember) what the position is for display purposes.
- There is no need to page to the end of the result set with cursorMark (since there is no server-side state kept). Stop wherever you want.
- You know you have reached the end of a result set when you do not get back the full number of rows requested, or when the nextCursorMark returned is the same as the cursorMark you sent (at which point, no documents will be in the returned list).
- Although start must always be 0, you can vary the number of rows for every call to vary the page size.
- You can re-use cursorMark values, changing other things like what stored fields are returned or what fields are faceted.
- A client can efficiently go back pages by remembering previous cursorMarks and re-submitting them.
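Putting these notes together, a client loop might look like the sketch below. It assumes a caller-supplied fetch function (hypothetical here) that sends the parameters to Solr and returns the parsed JSON response; iter_pages is likewise a made-up name. The loop stops when the returned nextCursorMark equals the cursorMark that was sent, and it yields the cursor used for each page so a client can remember it and go back later.

```python
def iter_pages(fetch, params):
    """Yield (cursor_mark_sent, docs) for each non-empty page.

    `fetch` is an assumed callable: it takes a dict of request
    parameters and returns the parsed JSON response as a dict.
    """
    cursor = "*"  # first request
    while True:
        response = fetch(dict(params, start=0, cursorMark=cursor))
        docs = response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if docs:
            yield cursor, docs
        if next_cursor == cursor:
            break  # cursor did not advance: end of the result set
        cursor = next_cursor


def collect_titles(fetch, params):
    """Convenience wrapper: flatten every page into one list of titles."""
    titles = []
    for _cursor, docs in iter_pages(fetch, params):
        titles.extend(doc["title"][0] for doc in docs)
    return titles
```

Since all state lives in the cursor value, the generator can be abandoned at any point without any cleanup on the server. A client that wants fewer round trips could also stop as soon as a page comes back with fewer documents than rows.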