Solr’s New RankQuery API

Coming in Solr 4.9 is a new RankQuery API. Before diving into how the RankQuery API works, I’ll give a little background on how ranking works in Lucene/Solr.

A Lucene search can have three parts to it:

1) A Query: queries are used to find documents in the index.
2) A Filter: filters are document sets that limit the results that come back from the query.
3) A Collector: collectors define what is collected from documents that match the query and the filter.
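To make the division of labor concrete, here is a minimal sketch of how the three parts cooperate. It uses plain Java functional interfaces as stand-ins; SearchSketch and its methods are hypothetical names for illustration and are not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;
import java.util.function.IntPredicate;

// Simplified stand-ins for illustration only; these are NOT Lucene's classes.
final class SearchSketch {

    /** Offers every doc ID that passes both the query and the filter to the collector. */
    static void search(int maxDoc, IntPredicate query, IntPredicate filter, IntConsumer collector) {
        for (int doc = 0; doc < maxDoc; doc++) {
            if (query.test(doc) && filter.test(doc)) {
                collector.accept(doc);
            }
        }
    }

    /** Toy run: the query matches even doc IDs, the filter keeps IDs below 6. */
    static List<Integer> demo() {
        List<Integer> collected = new ArrayList<>();
        search(10, doc -> doc % 2 == 0, doc -> doc < 6, doc -> collected.add(doc));
        return collected;
    }
}
```

The collector is the only piece that decides what to keep from each match, which is why it is the natural place to hook in ranking.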

Solr uses two Lucene collectors to handle document ranking and sorting. One collector ranks by score (relevance ranking). The other handles sorting when sort criteria are specified.

A RankQuery allows you to inject a custom ranking collector. By injecting your own ranking collector you can take full control of the ranking process.

What’s the use case for this feature?

It’s a fairly common scenario for applications using Solr to pull a large set of data into the application and then rerank the documents. There are two things wrong with this scenario: first, the reranking process doesn’t get to see the full result set, and second, there are significant performance issues involved with pulling large result sets out of Solr.

RankQueries allow you to move this type of custom ranking logic directly into your own custom collector.

Let’s investigate how to go about writing your own RankQuery.

First, let’s look at the HTTP interface:

q=hello+word&rq={!myranker param1=a param2=3}

Notice the new “rq” parameter. This parameter points to a QParserPlugin that returns a Query object that extends the RankQuery class.
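Because the local-params syntax contains braces, spaces, and equals signs, a client normally has to URL-encode the rq value before sending it. A minimal sketch, assuming a hypothetical helper class RankQueryRequestSketch and the example "myranker" parser from above:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
final class RankQueryRequestSketch {

    /** Builds the query-string portion of a request with the main query and the rank query. */
    static String buildQueryString(String q, String rq) {
        return "q=" + enc(q) + "&rq=" + enc(rq);
    }

    private static String enc(String value) {
        try {
            // Form-encodes the value: spaces become '+', reserved characters become %XX.
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Calling buildQueryString("hello word", "{!myranker param1=a param2=3}") yields the encoded form of the request shown above.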

The RankQuery class has three abstract methods that need to be defined:

getTopDocsCollector(int len, SolrIndexSearcher.QueryCommand cmd, IndexSearcher searcher)
getMergeStrategy()
wrap(Query mainQuery)

getTopDocsCollector is where you return your custom ranking collector which must extend the Lucene TopDocsCollector class.
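To give a feel for what such a collector does, here is a rough, self-contained sketch of the core idea: keep only the best len hits, ordered by a custom score. TopNCollectorSketch and Hit are hypothetical stand-ins; a real implementation would extend Lucene's TopDocsCollector and compute the score from the index rather than receive it as an argument.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in; a real collector would extend Lucene's TopDocsCollector.
final class TopNCollectorSketch {

    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int len;
    // Min-heap on score, so the weakest retained hit is always at the head.
    private final PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));

    TopNCollectorSketch(int len) {
        if (len <= 0) throw new IllegalArgumentException("len must be positive");
        this.len = len;
    }

    /** Called once per matching document with its custom ranking score. */
    void collect(int doc, float customScore) {
        if (heap.size() < len) {
            heap.add(new Hit(doc, customScore));
        } else if (customScore > heap.peek().score) {
            heap.poll();                      // evict the current weakest hit
            heap.add(new Hit(doc, customScore));
        }
    }

    /** Returns the retained hits, best score first. */
    List<Hit> topDocs() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }
}
```

Because the collector sees every matching document, the custom score is applied to the full result set rather than to a page pulled out of Solr.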

getMergeStrategy allows you to inject a custom MergeStrategy class. In another blog I’ll explain merge strategies in detail, but for now you can simply know that they control how documents from the shards are merged during a distributed search. If you’re implementing your own collector, you may need to implement a MergeStrategy to ensure that the ranking algorithm is applied as the documents are merged from the shards.
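As a rough illustration of the merge step only, the sketch below combines per-shard hit lists and keeps the global top len by score. ShardMergeSketch and ShardHit are hypothetical names; this is not Solr's MergeStrategy interface, just the general idea of applying one ordering across shards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of merging shard results; NOT Solr's MergeStrategy API.
final class ShardMergeSketch {

    static final class ShardHit {
        final String shard;
        final int doc;
        final float score;
        ShardHit(String shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Merges all shard hits and returns the best len of them, highest score first. */
    static List<ShardHit> merge(List<List<ShardHit>> perShard, int len) {
        List<ShardHit> all = new ArrayList<>();
        for (List<ShardHit> hits : perShard) {
            all.addAll(hits);
        }
        // The same ranking order used on each shard must be applied again here.
        all.sort(Comparator.comparingDouble((ShardHit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(Math.max(len, 0), all.size())));
    }
}
```

If the custom collector ranks by something other than score, the comparator in the merge has to follow the same rule, otherwise the merged list will not match the per-shard ordering.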

wrap is called by Solr to wrap the RankQuery around the main Query. The RankQuery then acts as the cache key for the QueryResultCache. This means the RankQuery has to proxy the Query interface calls to the main query, and implement hashCode and equals so that the QueryResultCache works properly with the RankQuery. If the RankQuery affects scoring, you may also want to implement a Lucene Explanation to explain the new score.
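The cache-key requirement can be sketched as follows. MyRankQuerySketch is a hypothetical, self-contained stand-in (it does not extend the real RankQuery class, and the main query is typed as Object instead of Lucene's Query); the point is only that equals and hashCode must cover both the ranker's own parameters and the wrapped main query.

```java
import java.util.Objects;

// Hypothetical stand-in showing cache-key semantics; does not extend Solr's RankQuery.
final class MyRankQuerySketch {
    private final String param1;  // hypothetical ranker parameter
    private final int param2;     // hypothetical ranker parameter
    private Object mainQuery;     // stand-in for the wrapped Lucene Query

    MyRankQuerySketch(String param1, int param2) {
        this.param1 = param1;
        this.param2 = param2;
    }

    /** Mirrors the shape of wrap: remember the main query and return this. */
    MyRankQuerySketch wrap(Object mainQuery) {
        this.mainQuery = mainQuery;
        return this;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof MyRankQuerySketch)) return false;
        MyRankQuerySketch that = (MyRankQuerySketch) other;
        // Two rank queries are the same cache key only if parameters AND main query match.
        return param2 == that.param2
                && Objects.equals(param1, that.param1)
                && Objects.equals(mainQuery, that.mainQuery);
    }

    @Override
    public int hashCode() {
        return Objects.hash(param1, param2, mainQuery);
    }
}
```

Leaving the main query out of equals and hashCode would let two different searches with the same ranker parameters share a cache entry, which is the failure this contract guards against.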

Obviously, RankQueries are an advanced feature, and this doesn’t even touch on the implementation of your custom collector or merge strategy. To help people get started, I added the first RankQuery implementation for Solr. You can view the source of the ReRankQParserPlugin here. This nifty class uses the new RankQuery feature to hook in the Lucene QueryRescorer.

There will be blog posts coming soon explaining MergeStrategies and the ReRankQParserPlugin in detail, so stay tuned.