Solr 5 is on its way, and along with it comes a shiny new implementation of Collapse & Expand. The general functionality remains the same, but the performance characteristics have changed. This post is a guide to getting the best performance out of Collapse & Expand for your use case.
General Performance Enhancements
Optimized String DocValues
Solr 4.* versions of Collapse & Expand were optimized for use with the Lucene FieldCache. DocValues support was present, but collapsing on a DocValues String field was significantly slower than collapsing on a non-DocValues field.
In Solr 5, Collapse & Expand on a DocValues field has been optimized and is now roughly 3 times faster when collapsing on a large result set. If your application currently collapses on a DocValues field, you’ll see this performance improvement out of the box, with no changes necessary.
Faster Expand over large result sets
The Solr 4.* version of Expand uses a PostFilter to select the expanded documents for the groups in the page. PostFilters are applied to each document in the result set so they slow down as the number of documents in the result set rises.
The Solr 5 version of Expand applies a Lucene TermsFilter to select the expanded documents. This approach is much faster when the page size is small and the full result set is large, which is the case for many Expand use cases.
The result is a large performance improvement when expanding large result sets. The key to this improvement, though, is keeping the page size small. Once the page size grows beyond 200, Expand falls back to the PostFilter approach.
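As a rough illustration of keeping the page small, the sketch below builds the request parameters for a collapsed and expanded search. It is only a sketch under stated assumptions: the helper name `build_expand_query` is hypothetical, `fieldA` is a placeholder field, and treating the `rows` parameter as the "page size" is a simplification of the rule described above.

```python
from urllib.parse import urlencode

# Threshold taken from the text above: beyond this page size,
# Expand uses the PostFilter approach instead of the TermsFilter.
TERMS_FILTER_MAX_PAGE_SIZE = 200


def build_expand_query(q, collapse_field, rows=10):
    """Return a URL-encoded query string for a collapsed, expanded search.

    Hypothetical helper: it refuses page sizes that would lose the
    faster TermsFilter-based Expand described in the text.
    """
    if rows > TERMS_FILTER_MAX_PAGE_SIZE:
        raise ValueError("rows above 200 would use the slower PostFilter path")
    params = [
        ("q", q),
        ("fq", "{!collapse field=%s}" % collapse_field),  # collapse on the field
        ("expand", "true"),                               # turn on Expand
        ("rows", str(rows)),                              # keep the page small
    ]
    return urlencode(params)


print(build_expand_query("*:*", "fieldA", rows=20))
```

Raising an error is just one design choice; an application that genuinely needs large pages could log a warning instead and accept the slower path.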
Use Case Specific Performance Guide
-
Fastest Query
Solr 4.* always operated in this mode. In Solr 5 it is not the default, so you’ll need to switch it on. Two things are needed to achieve the fastest query speed.
-
Use the new “TOP_FC” hint
There is a new “hint” local parameter for the CollapsingQParserPlugin. If you set this parameter to TOP_FC, both Collapse & Expand will use a top level Lucene FieldCache instead of the default MultiDocValues caches.
Sample syntax:
fq={!collapse field=fieldA hint=TOP_FC}
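If the filter is assembled in client code, the hint can be made an optional argument so it is easy to switch on and off. The following is a minimal sketch, not part of Solr or any client library: `collapse_filter` and `encode_fq` are hypothetical helper names, and only the `field` and `hint` local parameters mentioned above are used.

```python
from urllib.parse import urlencode


def collapse_filter(field, hint=None):
    """Build the local-params string for a collapse filter query.

    Hypothetical helper: `hint` is appended only when given,
    for example hint="TOP_FC".
    """
    parts = ["field=%s" % field]
    if hint is not None:
        parts.append("hint=%s" % hint)
    return "{!collapse %s}" % " ".join(parts)


def encode_fq(field, hint=None):
    """URL-encode the filter so braces and spaces survive the HTTP request."""
    return urlencode([("fq", collapse_filter(field, hint))])


print(collapse_filter("fieldA", hint="TOP_FC"))
print(encode_fq("fieldA", hint="TOP_FC"))
```

Leaving `hint` unset produces the plain `{!collapse field=fieldA}` form, which keeps the Solr 5 default behaviour.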
-
Use a String Collapse Field
The TOP_FC hint only works on String collapse fields, so you’ll need to index your collapse fields as Strings.
Important: The added performance of a top level FieldCache comes with two downsides. First, this approach is the least real-time friendly (it is the slowest to warm). Second, it guarantees “insanity” if the collapse field is also used for sorting, faceting or function queries. Insanity, in Lucene terms, means that a field is cached in memory more than once. In practice, memory is wasted whenever the collapse field is also used for faceting and the like.
-
Real-Time Indexing
In Solr 4.* there were no specific implementations for collapsing on numeric fields; everything was treated as a String. In Solr 5 there are implementations for collapsing on numeric fields, which are slower at query time but extremely real-time indexing friendly.
For the most real-time-friendly Collapse & Expand, define your collapse field as an integer or float (64-bit numeric collapse fields are not supported in Solr 5). Indexing the collapse field with DocValues will further improve real-time performance without hurting query performance any further.
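As an illustration, a numeric collapse field with DocValues might be declared in the schema along the lines of the element generated below. This is a sketch under assumptions: the field name `group_id` is hypothetical, and the type name `int` is assumed to refer to a 32-bit integer field type already defined in your schema.

```python
import xml.etree.ElementTree as ET


def numeric_collapse_field(name, field_type="int"):
    """Return a schema <field> element for a numeric collapse field.

    Assumptions: `field_type` names a 32-bit numeric type defined in
    the schema; docValues is enabled as the text above suggests.
    """
    field = ET.Element("field", {
        "name": name,
        "type": field_type,
        "indexed": "true",
        "stored": "true",
        "docValues": "true",
    })
    return ET.tostring(field, encoding="unicode")


print(numeric_collapse_field("group_id"))
```

The same helper could emit a float field by passing the name of a float type defined in your schema as `field_type`.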
-
Balanced
A balance between query performance and real-time indexing performance can be achieved by simply indexing the collapse fields as Strings. For String collapse fields, if you don’t use the TOP_FC hint, an underlying implementation is chosen that provides good query performance and good real-time indexing performance.
This is the right approach for use cases that don’t have heavy requirements in either direction but need good performance for both.