The CollapsingQParserPlugin: Solr’s New High Performance Field Collapsing PostFilter

Solr has had full-featured Field Collapsing and Grouping since 2010. The CollapsingQParserPlugin, which was recently committed, provides an alternative approach with more specific design goals.

Design Goals

The initial CollapsingQParserPlugin requirements were driven by a Solr e-commerce user. The requirements were quickly boiled down to two main design goals:

  1. It needed to perform well when collapsing large result sets with a high number of distinct groups.
  2. It needed to work smoothly with other Solr features such as faceting, query elevation and sorting.

The first goal was the driving factor in the development, because this was where Solr’s existing grouping functionality was not performing well. Specifically, requesting the group count with the group.ngroups parameter while collapsing large result sets on a high cardinality field was slow.

The PostFilter Design

It was determined fairly quickly that changing Lucene/Solr’s standard grouping would be more difficult than starting from scratch with a new design. The main reason was that Lucene/Solr’s grouping covers a fairly large set of requirements that were added over the years. Trying to make changes to this code base would have meant ensuring that every change was consistent with every feature. This may have been possible, but I felt it would have been too restrictive. Landing the patch in a short period of time would also have been difficult, as this code base spanned both Solr and Lucene and would have involved approvals from lots of interested parties.

Adding the functionality as a PostFilter was a natural choice for a number of reasons:

  1. Field collapsing is a filter: it removes documents from the results. So it matched the PostFilter design.
  2. A PostFilter is pluggable, so if the patch wasn’t committable it could always be plugged in without having to have a patched code base.
  3. The narrow nature of the requirements fit nicely with PostFilters, which were designed to support case-specific filter implementations. If someone needed different collapse functionality they could plug in a different implementation.
  4. PostFilters fit into the normal search flow. So faceting, sorting and query elevation would naturally fit with a PostFilter.

SOLR-5020

Before a field collapsing PostFilter could be implemented, a change needed to be made to Solr’s DelegatingCollector class.

I note this here because this change is interesting to PostFilter developers and will likely be followed up with a blog post of its own. In a nutshell, SOLR-5020 added a finish() method to DelegatingCollector, which is called after the full search has completed.

Before this change, there wasn’t a signal to the DelegatingCollector that all the documents had been processed. In the case of field collapsing, you can’t delegate to the ranking collectors until all the documents have been considered, because the document that represents a group is only known once the whole group has been seen.

The finish() method made this possible and opens the door to other PostFilter algorithms that need to see all the documents before delegating to the ranking collectors.

How It Works

The CollapsingQParserPlugin has the following basic design and flow:

  • It is a PostFilter, so it wraps a DelegatingCollector around the ranking collectors and filters documents.
  • In the collect() method it collapses the documents based on a collapse field and collapse criteria. It does not delegate to the ranking collectors in the collect() method.
  • In the finish() method it sends the collapsed result set to the ranking collectors.
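
The flow above can be sketched in a few lines of plain Python. This is only an illustration of the collect-then-finish idea, not Solr code: the class names RankingCollector and CollapsingCollector, the in-memory collapse_values map and the sample scores are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


class RankingCollector:
    """Hypothetical stand-in for the downstream ranking collectors."""

    def __init__(self) -> None:
        self.docs: List[Tuple[int, float]] = []

    def collect(self, doc_id: int, score: float) -> None:
        self.docs.append((doc_id, score))

    def top(self, n: int) -> List[Tuple[int, float]]:
        # Rank by score descending, doc id ascending on ties.
        return sorted(self.docs, key=lambda d: (-d[1], d[0]))[:n]


class CollapsingCollector:
    """Keeps the highest scoring doc per collapse value and delegates only in finish()."""

    def __init__(self, delegate: RankingCollector, collapse_values: Dict[int, str]) -> None:
        self.delegate = delegate
        self.collapse_values = collapse_values  # doc id -> collapse field value
        self.group_heads: Dict[str, Tuple[int, float]] = {}

    def collect(self, doc_id: int, score: float) -> None:
        key = self.collapse_values[doc_id]
        head = self.group_heads.get(key)
        if head is None or score > head[1]:
            self.group_heads[key] = (doc_id, score)
        # Nothing is passed to the delegate here: the group head may still change.

    def finish(self) -> None:
        # All documents have been seen, so the group heads are final.
        for doc_id, score in sorted(self.group_heads.values()):
            self.delegate.collect(doc_id, score)


# Made-up sample data: five documents in three groups.
collapse_values = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c"}
ranking = RankingCollector()
collector = CollapsingCollector(ranking, collapse_values)
for doc_id, score in [(0, 1.0), (1, 3.0), (2, 2.5), (3, 0.5), (4, 2.0)]:
    collector.collect(doc_id, score)

seen_before_finish = len(ranking.docs)  # 0, nothing has been delegated yet
collector.finish()
collapsed = ranking.top(10)  # one document per group
```

In this sketch the ranking collector only ever sees one document per group, which is why everything downstream of it operates on the collapsed set.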

Here are examples of how to use it in a filter query:

Collapse based on the highest scoring document:

fq={!collapse field=field_name}
 
Collapse based on the min value of a numeric field:

fq={!collapse field=field_name min=field_name}
 
Collapse based on the max value of a numeric field:

fq={!collapse field=field_name max=field_name}
 
Collapse based on the max value of a function. The cscore() function works only in the context of the CollapsingQParserPlugin and returns the score of the current document being collapsed.

fq={!collapse field=field_name max=sum(cscore(),field(A))}
 
Collapse with a null policy:

fq={!collapse field=field_name nullPolicy=nullPolicy}

 
There are three null policies:

  • ignore : removes docs with a null value in the collapse field (default).
  • expand : treats each doc with a null value in the collapse field as a separate group.
  • collapse : collapses all docs with a null value into a single group using either highest score, or min/max.
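
To make the difference between the three policies concrete, here is a small Python sketch that applies each one to the same made-up documents, collapsing on highest score. The collapse() helper and the sample data are hypothetical and only mimic the behaviour described in the list above.

```python
from typing import Dict, List, Optional, Tuple

# (doc id, collapse field value or None, score)
Doc = Tuple[int, Optional[str], float]


def collapse(docs: List[Doc], null_policy: str = "ignore") -> List[int]:
    """Return the ids of the docs that survive a highest-score collapse."""
    if null_policy not in ("ignore", "expand", "collapse"):
        raise ValueError(f"unknown null policy: {null_policy}")
    heads: Dict[Optional[str], Tuple[int, float]] = {}
    survivors: List[int] = []
    for doc_id, value, score in docs:
        if value is None:
            if null_policy == "ignore":
                continue  # drop docs with a null collapse value
            if null_policy == "expand":
                survivors.append(doc_id)  # each null doc is its own group
                continue
            # "collapse": fall through, None acts as one more group key
        head = heads.get(value)
        if head is None or score > head[1]:
            heads[value] = (doc_id, score)
    survivors.extend(doc_id for doc_id, _ in heads.values())
    return sorted(survivors)


docs = [(0, "a", 1.0), (1, "a", 3.0), (2, None, 2.0), (3, None, 5.0), (4, "b", 0.5)]

ignored = collapse(docs, "ignore")      # [1, 4]
expanded = collapse(docs, "expand")     # [1, 2, 3, 4]
collapsed = collapse(docs, "collapse")  # [1, 3, 4]
```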

Interaction With Other Solr Features

The CollapsingQParserPlugin creates a collapsed result set that is forwarded to the ranking collectors. So all downstream features are applied to the collapsed set.

Special consideration was made for the QueryElevationComponent so that elevated documents are never collapsed out of the result set.

The concept of “ngroups” in the standard Solr grouping goes away because the numFound in the search results reflects the total number of groups matched.

Facets will be computed on the collapsed set by default. You can use the tag/exclude facet options to remove the CollapsingQParserPlugin filter query for specific facets if you want to see facet counts for the un-collapsed result set.
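
As a rough sketch of what such a request could look like, the Python snippet below builds the query string with the standard library. The field names group_id, brand and category and the tag name collapsed are made up for the example, and it assumes the collapse filter accepts the generic tag local parameter in the same way as other tagged filter queries.

```python
from urllib.parse import parse_qs, urlencode

params = [
    ("q", "*:*"),
    # Tag the collapse filter so individual facets can exclude it.
    ("fq", "{!collapse field=group_id tag=collapsed}"),
    ("facet", "true"),
    # Counted on the collapsed result set (the default).
    ("facet.field", "brand"),
    # Counted on the un-collapsed result set: the tagged filter is excluded.
    ("facet.field", "{!ex=collapsed}category"),
]

query_string = urlencode(params)   # URL-encoded, ready to append to /select?
decoded = parse_qs(query_string)   # round trip to check nothing was lost
```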

Distributed field collapse will work like a normal search, but you’ll need to keep all documents within the same group on the same shard.
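
One way to satisfy that constraint is to route each document by its collapse field value rather than by its unique id. The Python sketch below shows the idea with a plain hash; the shard_for() helper, the shard count and the sample documents are assumptions for illustration and this is not Solr's own routing code. In SolrCloud the same effect is usually achieved with a document router that takes a shard key from the document id.

```python
import hashlib


def shard_for(group_value: str, num_shards: int) -> int:
    """Pick a shard from the collapse field value, so a group never spans shards."""
    digest = hashlib.md5(group_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


# (doc id, collapse field value), made-up sample data
docs = [("doc1", "sku-100"), ("doc2", "sku-100"), ("doc3", "sku-200")]
placement = {doc_id: shard_for(group, 4) for doc_id, group in docs}
```

Because the shard is derived only from the collapse value, doc1 and doc2 always land on the same shard, and each shard can collapse its own groups independently.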

So, How Does It Perform?

For its main design goal, which is to perform well when collapsing large result sets on high cardinality fields, it performs very well.

For example, in one performance test, a search with 10 million full results and 1 million collapsed groups took:

  • Standard grouping with ngroups: 17 seconds.
  • CollapsingQParserPlugin: 300 milliseconds.

Release Information

Solr 4.6: Initial release (SOLR-5027)
Solr 4.6.1: Bug fix release (SOLR-5416, SOLR-5408)
Solr 4.7: Function value collapse criteria was added (SOLR-5536).