Solr Filter Caching

Solr’s filter caching features give precise control over how filter queries are handled, which helps maximize performance. You can specify whether a filter is cached, the order in which filters are evaluated, and whether a filter runs as a post filter.

Solr Filter Queries

Adding a filter expressed as a query to a Solr request is easy: simply add an additional fq parameter for each filter query.

   &fq=year:[2014 TO *]

By default, Solr resolves all of the filters before the main query. Each filter query is looked up individually in Solr’s filterCache (which is pretty advanced itself, supporting concurrent lookups, different eviction policies such as LRU or LFU, and auto-warming). Caching each filter query separately accelerates Solr’s query throughput by greatly improving cache hit rates since many types of filters tend to be reused across different requests.

Filters embedded in a query

Update: starting with Heliosearch 0.07, there is support built directly into the standard query parser for creating a filter query that uses the filter cache.

To Cache or not to Cache

The advanced filter control API adds the ability to *not* cache a filter. Some filters may see almost no reuse across different requests, and not caching them can lead to a smaller, more effective filterCache with a higher hit rate.

To tell Solr not to cache a filter, we use the same powerful local params DSL that adds metadata to query parameters and is used to specify different types of query syntaxes and query parsers. For a normal query that does not have any localParam metadata, simply prepend a local param of cache=false. For example:

 &fq={!cache=false}year:[2014 TO *]

To add cache=false to a filter query that already has localParams, simply add it right in with the rest of the params. For example, if we want to use Solr’s native spatial abilities to restrict our matches to locations within 50 km of Stanford, our filter query would look like:

 &fq={!geofilt sfield=location pt=37.42,-122.17 d=50} 

It’s easy to modify this filter to tell Solr not to cache it by adding cache=false in with the rest of the local parameters:

 &fq={!geofilt sfield=location pt=37.42,-122.17 d=50 cache=false} 
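When these filters are sent from client code, the braces, spaces and brackets need URL encoding. The following is a minimal sketch in Python using only the standard library; the helper name build_query_string and the q=*:* main query are assumptions made for illustration, not part of Solr.

```python
from urllib.parse import urlencode, parse_qs

def build_query_string(q, filters):
    """Return a URL query string with one fq parameter per filter query."""
    # A list of pairs (rather than a dict) lets the fq key repeat.
    params = [("q", q)] + [("fq", fq) for fq in filters]
    return urlencode(params)

# Both filters below opt out of the filterCache via cache=false.
qs = build_query_string(
    "*:*",
    [
        "{!cache=false}year:[2014 TO *]",
        "{!geofilt sfield=location pt=37.42,-122.17 d=50 cache=false}",
    ],
)
print(qs)
```

The resulting string can be appended to the select URL of whatever Solr core or collection you are querying.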

Leapfrog anyone?

When a filter isn’t generated up front and cached, it’s executed in parallel with the main query. First, the filter is asked about the first document id that it matches. The query is then asked about the first document that is equal to or greater than that document. The filter is then asked about the first document that is equal to or greater than that. The filter and the query play this game of leapfrog until they land on the same document and it’s declared a match, after which the document is collected and scored.
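The leapfrog game can be illustrated with two sorted lists of document ids. This is a conceptual sketch in Python, not Solr’s actual implementation (which works on Lucene iterators); the function names and the sample ids are made up for the example, and document ids are assumed to be non-negative integers.

```python
from bisect import bisect_left

def advance(doc_ids, target):
    """Return the first id in sorted doc_ids that is >= target, or None."""
    i = bisect_left(doc_ids, target)
    return doc_ids[i] if i < len(doc_ids) else None

def leapfrog(query_docs, filter_docs):
    """Intersect two sorted doc id lists by alternately advancing each side."""
    matches = []
    candidate = advance(filter_docs, 0)  # ask the filter for its first doc
    while candidate is not None:
        q = advance(query_docs, candidate)  # query jumps to >= candidate
        if q is None:
            break  # the query has no more documents
        if q == candidate:
            matches.append(candidate)  # both landed on the same doc
            candidate = advance(filter_docs, candidate + 1)
        else:
            candidate = advance(filter_docs, q)  # filter jumps to >= q
    return matches

print(leapfrog([1, 3, 5, 8, 13], [2, 3, 8, 9, 13, 21]))
```

Neither side ever enumerates documents the other side has already skipped past, which is why this approach works well when the filter can cheaply answer “what is your next match at or after this doc”.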

How much is that filter?

Advanced filtering adds even more fine-grained control by introducing the notion of cost. If there are multiple non-cached filters in a request, filters with a lower cost will be checked before those with a higher cost.

   &fq={!cache=false cost=10}year:[2014 TO *]
   &fq={!geofilt cache=false cost=20}

In the example above, the filter based on year has a lower cost and will thus always be checked before the spatial filter.

As an aside, notice how spatial queries fall back to global spatial request parameters when those parameters are not specified locally. This can make it even easier to construct requests containing spatial functions.

Expensive Filters

Some filters are slow enough that you don’t even want to run them in parallel with the query and other filters, even if they are consulted last, since asking them “what is the next doc you match on or after this given doc” is so expensive. For these types of filters, you really want to ask them “do you match this doc” only after the query and all other filters have been consulted. Solr has special support for this called “post filtering”.

Post filtering is triggered by filters that have a cost>=100 and have explicit support for it. If there are multiple post filters in a single request, they will be ordered by cost.
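The rules described so far can be summarized as a small planning step. The sketch below is an illustration in Python, not Solr code: the FilterSpec class, its field names, the default cost of 0, and the filter named "type" are all hypothetical, and it assumes that only non-cached filters are candidates for post filtering.

```python
from dataclasses import dataclass

@dataclass
class FilterSpec:
    name: str
    cache: bool = True                  # cached unless cache=false is given
    cost: int = 0                       # assumed default for this sketch
    supports_post_filter: bool = False  # e.g. frange, geofilt, bbox

def is_post_filter(f):
    """Post filter: not cached, cost >= 100, and explicit support."""
    return (not f.cache) and f.cost >= 100 and f.supports_post_filter

def plan(filters):
    """Split filters into cached, interleaved (leapfrog) and post groups."""
    cached = [f for f in filters if f.cache]
    interleaved = sorted(
        (f for f in filters if not f.cache and not is_post_filter(f)),
        key=lambda f: f.cost,  # lower cost is checked first
    )
    post = sorted(
        (f for f in filters if is_post_filter(f)),
        key=lambda f: f.cost,  # post filters are also ordered by cost
    )
    return cached, interleaved, post

filters = [
    FilterSpec("frange", cache=False, cost=200, supports_post_filter=True),
    FilterSpec("geofilt", cache=False, cost=20, supports_post_filter=True),
    FilterSpec("year", cache=False, cost=10),
    FilterSpec("type"),
    FilterSpec("geofilt-post", cache=False, cost=150, supports_post_filter=True),
]
cached, interleaved, post = plan(filters)
print([f.name for f in cached])
print([f.name for f in interleaved])
print([f.name for f in post])
```

Note that a filter with a cost of 100 or more but without post filter support stays in the interleaved group in this sketch, since post filtering requires explicit support.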

The frange qparser has post filter support and allows powerful queries specifying ranges over arbitrarily complex function queries.

For example, if we wanted to take the log of popularity, divide it by the square root of the distance, and filter out documents with a result less than 5, we could run this as a post filter using frange:

   &fq={!frange l=5 cache=false cost=200}div(log(popularity),sqrt(geodist()))

Post filtering support for the spatial filter queries bbox and geofilt has also been available since Solr 4.0. To execute our previous un-cached spatial filter as a post filter, simply set its cost to 100 or greater:

   &fq={!geofilt cache=false cost=150}

Custom Post Filters

If you have expensive custom logic you’d like to add as a post filter (say per-document custom security ACLs), you can implement your own QParserPlugin that returns Query objects that implement Solr’s PostFilter interface. You can set the default cost or hardcode a cost of 100 or higher if you want to only support post filtering. Then, you can use your custom parser as you would any other builtin query type via fq={!myqueryparser arg1=x arg2=y} and Solr will handle the rest!
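To see why this pays off for something like ACL checks, here is a conceptual sketch in Python. It is not Solr’s Java PostFilter API; the search function, the acl table and the user "alice" are invented for illustration. The point it shows is that the expensive check is only asked “do you match this doc” for documents that already passed the query and the cheaper filters.

```python
def search(candidates, cheap_filters, post_filters):
    """Yield doc ids that pass every filter, consulting post filters last."""
    for doc in candidates:
        # all() short-circuits, so post filters never see docs that a
        # cheaper filter has already rejected.
        if all(f(doc) for f in cheap_filters) and all(p(doc) for p in post_filters):
            yield doc

# Hypothetical per-document ACLs: doc id -> users allowed to see it.
acl = {1: {"alice"}, 2: {"bob"}, 3: {"alice", "bob"}, 4: {"alice"}, 5: {"bob"}, 6: {"alice"}}
calls = []

def acl_allows_alice(doc):
    calls.append(doc)  # record every expensive check that actually runs
    return "alice" in acl[doc]

in_2014 = {2, 3, 4}.__contains__  # stand-in for a cheap filter

results = list(search([1, 2, 3, 4, 5, 6], [in_2014], [acl_allows_alice]))
print(results, calls)
```

Here the ACL check runs for three of the six documents matched by the main query, because the cheap filter has already ruled out the others.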

Try it out!

In conclusion, hopefully this gives more insight into just one of many factors working under the hood to make Solr so fast.
To try out the absolute latest functionality, you can always get a nightly build of trunk. Feedback is always appreciated!