Starting with version 4.10, Solr has the ability to sort and export entire result sets. This new feature allows Solr to be used as a general-purpose distributed sorting engine.
Use Cases
This new capability opens Solr up to a wide variety of use cases that have typically been handled by aggregation engines such as Hadoop. Some of these use cases include:
Session Analysis
It’s a very common use case to perform analysis on user sessions from an application or website. By far the easiest and most scalable way to do this is to sort the log records by session ID. This allows you to process the sessions as a stream, gathering analytics as you iterate through the sessions.
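As a concrete sketch of this pattern, the plain-Java snippet below walks records that are already sorted by session ID, detects session boundaries, and accumulates a simple per-session page-view count. The LogRecord class and its fields are hypothetical stand-ins for whatever a real client reads from the export stream.

import java.util.Arrays;
import java.util.List;

// Sketch: per-session processing over records already sorted by session ID.
// LogRecord and its fields are made up for illustration.
public class SessionStream {

    static class LogRecord {
        final String sessionId;
        final String url;
        LogRecord(String sessionId, String url) { this.sessionId = sessionId; this.url = url; }
    }

    public static void main(String[] args) {
        List<LogRecord> sorted = Arrays.asList(            // already sorted by sessionId
            new LogRecord("s1", "/home"), new LogRecord("s1", "/cart"),
            new LogRecord("s2", "/home"));

        String currentSession = null;
        int pageViews = 0;
        for (LogRecord rec : sorted) {
            if (!rec.sessionId.equals(currentSession)) {   // session boundary
                if (currentSession != null) {
                    System.out.println(currentSession + ": " + pageViews + " page views");
                }
                currentSession = rec.sessionId;
                pageViews = 0;
            }
            pageViews++;
        }
        if (currentSession != null) {
            System.out.println(currentSession + ": " + pageViews + " page views");
        }
    }
}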
Large Scale Time Series Rollups
Often in data warehousing, it’s too expensive to retain all the raw time series data indefinitely. In these scenarios it’s common practice to roll up the raw data into time series aggregations for long term storage.
The simplest and most scalable approach is to return the log records sorted by timestamp. This allows the time series rollups to be processed as a stream.
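Here is a minimal sketch of that idea in plain Java. The record layout (an epoch-millisecond timestamp plus a metric value) and the hourly bucket size are assumptions made for illustration.

import java.util.Arrays;
import java.util.List;

// Sketch: roll records that arrive sorted by timestamp up into hourly sums.
public class TimeSeriesRollup {

    public static void main(String[] args) {
        // (timestampMillis, value) pairs, already sorted by timestamp.
        List<long[]> sorted = Arrays.asList(
            new long[]{1000L, 5}, new long[]{2000L, 7}, new long[]{3_600_500L, 2});

        long hourMillis = 60L * 60L * 1000L;
        long currentBucket = Long.MIN_VALUE;
        long sum = 0;
        for (long[] rec : sorted) {
            long bucket = rec[0] / hourMillis;             // hourly bucket
            if (bucket != currentBucket) {
                if (currentBucket != Long.MIN_VALUE) {
                    System.out.println("hour " + currentBucket + " total=" + sum);
                }
                currentBucket = bucket;
                sum = 0;
            }
            sum += rec[1];
        }
        if (currentBucket != Long.MIN_VALUE) {
            System.out.println("hour " + currentBucket + " total=" + sum);
        }
    }
}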
Sort Based Statistics
Certain statistics, such as the median, percentiles, quartiles, rankings, and dense rankings, rely on the data being sorted first. With Solr’s new sorting capability you can simply sort a numeric column and derive these sort-based stats while iterating through the results.
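For instance, a nearest-rank percentile can be read straight out of the sorted values with simple index arithmetic. The sketch below uses made-up numbers, and the percentile helper is just one common definition, not the only one.

// Sketch: rank-based statistics over a column that is already sorted ascending.
public class SortBasedStats {

    // Nearest-rank percentile, p in (0, 1].
    static double percentile(double[] sortedAsc, double p) {
        int rank = (int) Math.ceil(p * sortedAsc.length);
        return sortedAsc[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        double[] values = {1, 3, 3, 6, 7, 8, 9};   // already sorted ascending
        System.out.println("median = " + percentile(values, 0.50));
        System.out.println("p90    = " + percentile(values, 0.90));
    }
}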
Distributed Joins
It’s very common in data warehousing to have to perform large scale joins of different data sets. The easiest and most scalable approach to this is to sort the data sets on the join keys and then perform a merge join. Using Solr’s sorting feature you can make separate calls to different SolrCloud collections, sort on the join keys and perform the merge join while iterating through the result sets.
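The plain-Java sketch below shows the core of a merge join over two streams that are assumed to arrive already sorted on the join key; for brevity it only handles one-to-one matches, and the keys and values are made up.

import java.util.Arrays;
import java.util.List;

// Sketch: merge join of two (key, value) streams sorted by key.
public class MergeJoin {

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(               // sorted by key
            new String[]{"a", "L1"}, new String[]{"b", "L2"}, new String[]{"c", "L3"});
        List<String[]> right = Arrays.asList(              // sorted by key
            new String[]{"b", "R1"}, new String[]{"c", "R2"}, new String[]{"d", "R3"});

        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;                                       // advance the smaller side
            } else if (cmp > 0) {
                j++;
            } else {
                System.out.println(left.get(i)[0] + ": " + left.get(i)[1] + " <-> " + right.get(j)[1]);
                i++;
                j++;
            }
        }
    }
}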
High Cardinality Aggregations
Sometimes it’s necessary to perform aggregations on high cardinality fields. For example, you may want to see the top 100 terms searched on your site. Or you may want to capture the long tail of searches on your site.
The easiest and most scalable way to do this is to sort the results on the high cardinality field and perform the aggregation while iterating through the results. With this approach you can simply gather the results in a priority queue to capture the top or bottom results.
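As an illustration, the sketch below scans terms that arrive sorted, collapses each run of equal terms into a count, and keeps only the top N counts in a bounded min-heap. The terms and the value of N are hypothetical.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: top-N terms from a high cardinality field returned in sorted order.
public class TopTerms {

    static class TermCount {
        final String term;
        final long count;
        TermCount(String term, long count) { this.term = term; this.count = count; }
    }

    static void offer(PriorityQueue<TermCount> top, TermCount tc, int topN) {
        top.offer(tc);
        if (top.size() > topN) top.poll();                 // evict the smallest count
    }

    public static void main(String[] args) {
        List<String> sortedTerms =
            Arrays.asList("apple", "apple", "banana", "cherry", "cherry", "cherry");
        int topN = 2;

        // Min-heap on count, so the queue always holds the current top N.
        PriorityQueue<TermCount> top =
            new PriorityQueue<>(Comparator.comparingLong((TermCount tc) -> tc.count));

        String current = null;
        long count = 0;
        for (String term : sortedTerms) {
            if (!term.equals(current)) {                   // new run of equal terms
                if (current != null) offer(top, new TermCount(current, count), topN);
                current = term;
                count = 0;
            }
            count++;
        }
        if (current != null) offer(top, new TermCount(current, count), topN);

        for (TermCount tc : top) {                         // heap order, not sorted
            System.out.println(tc.term + " -> " + tc.count);
        }
    }
}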
How Does It Work
To ask Solr to export the full sorted result set, you use the new export request handler. The syntax is very simple:
/solr/collection/export?q=my-query&sort=fieldA desc, fieldB desc&fl=fieldA,fieldB,fieldC
Solr will then stream the full sorted result set back in JSON format. The sort criteria and field list (fl) are mandatory.
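A minimal client might look something like the plain-Java sketch below, which issues an export request and reads the streamed response. The host, collection, and field names are placeholders, and a real client would feed the stream to an incremental JSON parser rather than printing it.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: issue an export request and consume the streamed JSON response.
public class ExportClient {

    public static void main(String[] args) throws Exception {
        String q = URLEncoder.encode("*:*", StandardCharsets.UTF_8.name());
        String sort = URLEncoder.encode("fieldA desc,fieldB desc", StandardCharsets.UTF_8.name());
        String url = "http://localhost:8983/solr/collection/export"
                   + "?q=" + q + "&sort=" + sort + "&fl=fieldA,fieldB,fieldC";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);                  // hand off to a streaming parser in practice
            }
        } finally {
            conn.disconnect();
        }
    }
}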
Distributed Support
The initial release treats all queries as non-distributed requests, so the client is responsible for making the calls to each Solr instance and merging the results. Because each of the core use cases involves specialized merge logic, it makes sense for the client to handle the merge (at least in the initial release).
Using SolrJ’s CloudSolrServer as a model, you can build clients that automatically send requests to all the shards in a collection (or multiple collections) and then merge the sorted sets any way you wish.
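The merge itself can be as simple as a k-way merge over the per-shard streams. The sketch below models each shard’s export as an iterator over values already sorted in descending order and merges them with a priority queue; fetching the actual streams from Solr is left out.

import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: k-way merge of per-shard result streams, each sorted descending.
public class ShardMerge {

    static class Cursor {
        final Iterator<Long> it;
        long head;
        Cursor(Iterator<Long> it) { this.it = it; this.head = it.next(); }
        boolean advance() { if (it.hasNext()) { head = it.next(); return true; } return false; }
    }

    public static void main(String[] args) {
        List<List<Long>> shards = Arrays.asList(           // each list sorted descending
            Arrays.asList(90L, 40L, 10L),
            Arrays.asList(80L, 70L, 5L),
            Arrays.asList(100L, 60L));

        PriorityQueue<Cursor> pq =
            new PriorityQueue<>(Comparator.comparingLong((Cursor c) -> c.head).reversed());
        for (List<Long> shard : shards) {
            if (!shard.isEmpty()) pq.offer(new Cursor(shard.iterator()));
        }

        while (!pq.isEmpty()) {
            Cursor c = pq.poll();
            System.out.println(c.head);                    // globally sorted output
            if (c.advance()) pq.offer(c);
        }
    }
}

Because only the head of each stream is held in memory, this style of merge keeps memory usage flat no matter how large the exported result sets are.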
In the future, Solr will likely support some of the core use cases in distributed mode. For example, there could be special merge handlers that calculate a distributed median, percentiles, and so on, or perform general-case distributed joins.
Sorting
In the initial release you can sort on any of the following field types:
1) int
2) long
3) float
4) double
5) string
You can specify up to 4 sort fields in a single request. For example:
sort=fieldA+desc,fieldB+asc,fieldC+desc,fieldD+asc
You can mix and match field types and sort direction as you wish.
The sort fields must be single valued fields.
Sorting on scores is not supported in the initial release.
Exporting
You can export all the field types that can be sorted (int, long, float, double, string). The exported fields can be single-valued or multi-valued.
Exporting scores is not supported in the initial release.
Use Of Lucene DocValues
All the fields being sorted/exported must have docValues set to true (https://cwiki.apache.org/confluence/display/solr/DocValues).
You can choose among the different docValues formats to trade off memory usage and performance. The fastest is likely to be the “Direct” docValues format, as it is uncompressed and fully in-memory. The initial tests were performed with both the default Lucene410 docValues format and the “Direct” docValues format.
Lucene/Solr has traditionally used these types of column-oriented caches for internal sorting and faceting. The export engine uses them not only for sorting but also for exporting the data.
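For reference, a schema.xml field definition with docValues enabled might look like the snippet below. The field and type names are examples, and using a per-field docValuesFormat such as “Direct” typically requires the schema-aware codec factory (solr.SchemaCodecFactory) to be configured in solrconfig.xml.

<fieldType name="string_dv" class="solr.StrField" docValuesFormat="Direct"/>
<field name="fieldA" type="string_dv" indexed="true" stored="true" docValues="true"/>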
Future
This is just the initial release. Here are some features that are likely to follow:
1) Sorting and exporting scores.
2) Sorting and exporting DateField and Currency field types.
3) Support for common distributed use cases such as time series rollups and sort based stats.
4) Support for unsorted exports.
5) Support for Avro exports.
6) Improved throughput by using a separate thread for sorting and exporting results.