Several weeks ago Heliosearch v0.8 was released, which included an early release of the Streaming Aggregation API. One of the main design goals for the Streaming API was to provide the foundation for the Heliosearch SQL Sub-Project.
The Streaming API abstracts search results as streams of tuples or TupleStreams. It also provides a set of TupleStream decorators that support both streaming transformations (union, intersect, unique, join, group by etc…) and streaming aggregations (sum, avg, count, min, max etc…).
The Streaming API also provides a ParallelStream implementation that partitions streaming transformations and streaming aggregations across a set of parallel worker nodes.
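To make the TupleStream and decorator concepts concrete, here is a minimal, self-contained sketch of the pattern. The interface and class names below are purely illustrative and are not the actual Heliosearch classes:

// A sketch of the TupleStream decorator idea: a base stream emits tuples,
// and a decorator wraps it to transform the stream, here by dropping
// consecutive duplicates on a field.
import java.io.IOException;
import java.util.*;

interface SimpleTupleStream {
  void open() throws IOException;
  Map<String, Object> read() throws IOException; // null signals end-of-stream
  void close() throws IOException;
}

// Base stream backed by an in-memory list, standing in for a search result stream.
class ListStream implements SimpleTupleStream {
  private final List<Map<String, Object>> tuples;
  private Iterator<Map<String, Object>> it;
  ListStream(List<Map<String, Object>> tuples) { this.tuples = tuples; }
  public void open() { it = tuples.iterator(); }
  public Map<String, Object> read() { return it.hasNext() ? it.next() : null; }
  public void close() {}
}

// Decorator: wraps another stream and emits a tuple only when the value of the
// given field differs from the previously emitted one (assumes the underlying
// stream is sorted on that field).
class UniqueFieldStream implements SimpleTupleStream {
  private final SimpleTupleStream inner;
  private final String field;
  private Object last;
  UniqueFieldStream(SimpleTupleStream inner, String field) {
    this.inner = inner;
    this.field = field;
  }
  public void open() throws IOException { inner.open(); }
  public Map<String, Object> read() throws IOException {
    Map<String, Object> t;
    while ((t = inner.read()) != null) {
      Object v = t.get(field);
      if (!Objects.equals(v, last)) { last = v; return t; }
    }
    return null;
  }
  public void close() throws IOException { inner.close(); }
}

The real decorators follow the same shape: each one wraps another TupleStream and transforms the tuples as they flow through, and because decorators compose by nesting, a fully decorated stream is what the ParallelStream partitions across worker nodes.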
The SQL Sub-Project
The SQL Sub-Project provides a SQL Parser that compiles a SQL statement to a Streaming API TupleStream. During the compilation, the SQL Parser can apply the Streaming API’s ParallelStream to partition the TupleStream across N worker nodes. This provides parallel SQL capabilities across SolrCloud collections.
Here is a small example of how to use the SQL Parser:
Properties props = new Properties();
props.put("collection1.zkhost", "localhost:9983"); // Define the zkHost for collection1.
props.put("collection1.handler", "/export");       // Define the handler for collection1.
props.put("workers.num", "7");                     // Define the number of parallel workers.
props.put("workers.zkhost", "localhost:9983");     // Define the zkHost for the parallel workers.
props.put("workers.collection", "collection2");    // Define the collection for the workers.

String sql = "select fieldA, sum(fieldB) from collection1 where fieldC='hello' group by fieldA";

TupleStream tupleStream = SQLParser.parse(sql, props); // Parse the SQL statement.

tupleStream.open();
Tuple tuple = null;
while(true) {
  tuple = tupleStream.read();
  if(tuple.EOF) {
    break;
  }
  String fieldA = tuple.getString("fieldA");
  Double sum = tuple.getDouble("sum(fieldB)");
}
tupleStream.close();
Here is a quick review of the code snippet above:
- The first few lines of code define the properties for the SQLParser. The “collection1.*” properties provide a table definition for the collection1 table in the SQL statement; in the example they define the zkHost for the table and the Solr request handler. The “workers.*” properties define the parallel workers for the SQL statement.
- Then a SQL SELECT statement is defined and parsed by the SQLParser. The WHERE clause in the SELECT statement gets translated to a Lucene query, so the SELECT statement automatically supports full text search.
- Note the GROUP BY aggregate clause in the SQL SELECT statement. Under the covers this gets translated to a RollupStream. A RollupStream is a TupleStream decorator that rolls up aggregates over a set of dynamic buckets. The GROUP BY fields are translated to the buckets to be rolled up. A rough sketch of this kind of rollup appears after this list.
- Then the tupleStream is opened and the tuples are iterated. Each tuple contains a unique field value for fieldA and the rolled up aggregate for the function sum(fieldB).
- Also note that because parallel workers are defined in the props, the Stream is automatically partitioned across N worker nodes. In this case that means the GROUP BY aggregations are rolled up in parallel on 7 worker nodes. The SQLParser sets up the stream partitioning parameters under the covers, providing seamless parallel SQL execution.
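For a sense of what the RollupStream does conceptually, below is a rough, standalone sketch of a streaming GROUP BY rollup. It assumes, purely for the sketch, that tuples arrive sorted on the GROUP BY field, so each bucket's aggregate can be emitted as soon as the field value changes instead of holding every group in memory. It is illustrative only and not the actual RollupStream code:

// A standalone illustration of rolling up sum(fieldB) per fieldA bucket
// over tuples that are already sorted on fieldA.
import java.util.*;

public class GroupByRollupSketch {
  public static void main(String[] args) {
    // Stand-in for tuples streamed back sorted on fieldA.
    List<Map<String, Object>> sorted = List.of(
        Map.of("fieldA", "a", "fieldB", 1.0),
        Map.of("fieldA", "a", "fieldB", 2.0),
        Map.of("fieldA", "b", "fieldB", 5.0));

    String currentBucket = null;
    double sum = 0.0;
    for (Map<String, Object> tuple : sorted) {
      String bucket = (String) tuple.get("fieldA");
      if (currentBucket != null && !currentBucket.equals(bucket)) {
        // The bucket value changed, so the previous bucket is complete.
        System.out.println(currentBucket + " -> sum(fieldB)=" + sum);
        sum = 0.0;
      }
      currentBucket = bucket;
      sum += (Double) tuple.get("fieldB");
    }
    if (currentBucket != null) {
      // Emit the final bucket.
      System.out.println(currentBucket + " -> sum(fieldB)=" + sum);
    }
  }
}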
JDBC Driver
Once the SQLParser implementation starts to mature, a JDBC Driver will be added so that any Java application can treat Heliosearch like a SQL database. This will allow any visualization tool that supports JDBC to be used with Heliosearch.
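If the driver follows the standard JDBC contract, client code would look like ordinary JDBC. The driver URL scheme and connection details below are hypothetical; only the java.sql calls are standard:

// A hypothetical look at querying Heliosearch over JDBC once a driver exists.
import java.sql.*;

public class JdbcSketch {
  public static void main(String[] args) throws SQLException {
    // Hypothetical connection URL; the real driver would define its own scheme.
    String url = "jdbc:heliosearch://localhost:9983/collection1";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "select fieldA, sum(fieldB) from collection1 group by fieldA")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + " " + rs.getDouble(2));
      }
    }
  }
}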
helio_sql branch
The SQL sub-project has been kicked off in the helio_sql branch on GitHub. Currently there is support for a simple SELECT statement and a parallel GROUP BY with an aggregate function. Many more SQL constructs will be supported soon.