Solr 4.8 has been released. Here’s an overview of how to use some of the new features.
Complex Phrase Queries
The complexphrase query parser can produce phrase queries with embedded wildcards and boolean queries.
It works via multiple passes, parsing a query and then re-parsing any phrase queries for additional markup. At query execution time, span queries are generated to implement the complex phrase logic.
The simplest example is a phrase query containing a prefix query:
q={!complexphrase}"apple ip*"
This will match text containing either “apple ipod” or “apple ipad”.
One can specify inOrder=false as a localParam to also match “ipod apple” and “ipad apple”.
q={!complexphrase inOrder=false}"apple ip*"
One can also specify a different default field to search with the df localParam:
q={!complexphrase df=name}"john* smith"
This will match both “john smith” and “johnathan smith” in the name field. Of course one could always specify the field directly in the query as well:
q={!complexphrase}name:"john* smith"
Phrase slop works to specify the proximity of the clauses. For example, the following would also match a name of “johnathan q smith”:
q={!complexphrase}name:"john* smith"~1
And of course we can throw in parens, OR clauses, and other complex logic as well:
q={!complexphrase}name:"(aaa OR (bbb* OR ccc)) ddd -eee (fff~1 OR ggg)" AND text:"nnn? (ooo OR ppp) -qqq www"~3
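The q values above contain braces, quotes, spaces, and wildcards, all of which need URL encoding when sent over HTTP. The following is a minimal sketch of one way to do that with curl, assuming the example server at http://localhost:8983/solr/query used elsewhere in this post. The helper names complexphrase_q and run_complexphrase are made up for illustration and are not part of Solr.

```shell
# Build the q parameter for a complexphrase query.
# Arguments: field, phrase, optional slop. Hypothetical helper, not part of Solr.
complexphrase_q() {
  cp_q=$(printf '{!complexphrase}%s:"%s"' "$1" "$2")
  if [ -n "${3:-}" ]; then
    cp_q="${cp_q}~$3"
  fi
  printf '%s\n' "$cp_q"
}

# Send the query. -G makes curl append the URL-encoded pairs to the URL
# as a query string. Assumes a Solr instance at the address used in this post.
run_complexphrase() {
  curl -sG "http://localhost:8983/solr/query" \
    --data-urlencode "q=$(complexphrase_q "$@")"
}

complexphrase_q name 'john* smith' 1
# prints: {!complexphrase}name:"john* smith"~1
```

With a server running, `run_complexphrase name 'john* smith' 1` would issue the slop query shown earlier.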
Indexing Child Documents in JSON
Previously, one had to use XML or binary format (or SolrJ) to index nested child documents (needed for block join functionality). Support has now been added for JSON:
curl http://localhost:8983/solr/update/json?softCommit=true -H 'Content-type:application/json' -d '
[
 {
   "id": "chapter1",
   "title" : "Indexing Child Documents in JSON",
   "content_type": "chapter",
   "_childDocuments_": [
     {
       "id": "1-1",
       "content_type": "page",
       "text": "ho hum... this is page 1 of chapter 1"
     },
     {
       "id": "1-2",
       "content_type": "page",
       "text": "more text... this is page 2 of chapter 1"
     }
   ]
 }
]
'
Block Join Example
Now if we query on “ho hum”, we obviously get page 1 of chapter 1 back:
http://localhost:8983/solr/query?q="ho hum"
[...]
"response":{"numFound":1,"start":0,"docs":[
  {
    "id":"1-1",
    "content_type":["page"]}]
}
But if we wanted to select chapters based on matches in pages, we could utilize a parent block join:
http://localhost:8983/solr/query?q={!parent which='content_type:chapter'}"ho hum"
[...]
"response":{"numFound":1,"start":0,"docs":[
  {
    "id":"chapter1",
    "content_type":["chapter"]}]
}
A child block join can be used to restrict (or match) child pages based on matches in a chapter (parent). For example, the following request returns all pages for which the chapter title contains “Indexing”:
http://localhost:8983/solr/query?q={!child of=content_type:chapter}title:Indexing
[...]
"response":{"numFound":2,"start":0,"docs":[
  {
    "id":"1-1",
    "content_type":["page"]},
  {
    "id":"1-2",
    "content_type":["page"]}]
}
The query above would probably be more useful as a filter. For example, if we wanted to search for “hum” on all pages where the chapter had “Indexing” in the title:
http://localhost:8983/solr/query?q=hum&fq={!child of=content_type:chapter}title:Indexing
[...]
"response":{"numFound":1,"start":0,"docs":[
  {
    "id":"1-1",
    "content_type":["page"]}]
}
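The block join requests above can also be issued with curl, letting it handle the URL encoding of the local params. This is only a sketch: the function names parent_join, child_join, and search_pages_in_chapters are hypothetical, and the server address is the one assumed throughout this post.

```shell
# Hypothetical helpers that build block join query strings.
# $1 = query identifying the parent documents, $2 = query on the other side of the join.
parent_join() { printf '{!parent which="%s"}%s\n' "$1" "$2"; }
child_join()  { printf '{!child of="%s"}%s\n' "$1" "$2"; }

# Search pages ($1) restricted to chapters matching $2, using the child join as a filter.
# -G plus --data-urlencode lets curl encode braces, quotes, and spaces.
search_pages_in_chapters() {
  curl -sG "http://localhost:8983/solr/query" \
    --data-urlencode "q=$1" \
    --data-urlencode "fq=$(child_join 'content_type:chapter' "$2")"
}

child_join 'content_type:chapter' 'title:Indexing'
# prints: {!child of="content_type:chapter"}title:Indexing
```

Against the documents indexed earlier, `search_pages_in_chapters hum title:Indexing` corresponds to the filtered request shown above.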
Expand Component
The ExpandComponent can be used to expand parent/child relationships in Solr. Joel previously blogged about the Expand Component and gave an example of how it could be used to expand a block join.
Named Config Sets
This is more in the “configuration” category of features. SolrCloud has always allowed multiple collections to share configuration, and now that capability has been brought to Solr’s non-cloud mode.
Since collections can be created or destroyed, we obviously don’t want shared configuration for these collections to be under the collection itself. The default location for config sets is in the “configsets” directory under the solr home (the example solr server currently doesn’t have this directory by default).
Let’s create a configSet named “generic” and then create two new collections (single core) called “books” and “music”:
/heliosearch/solr/example$ mkdir -p solr/configsets/generic/conf/
/heliosearch/solr/example$ cp -r solr/collection1/conf/* solr/configsets/generic/conf/
/heliosearch/solr/example$ curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=books&configSet=generic'
/heliosearch/solr/example$ curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=music&configSet=generic'
Now you should be able to open the admin console and use the “Core Selector” at the bottom left to see the new cores/collections we just created.
Let’s inspect what was done from the command line:
/heliosearch/solr/example$ ls -F solr
README.txt bin/ books/ collection1/ configsets/ music/ solr.xml zoo.cfg
/heliosearch/solr/example$ ls -F solr/books
core.properties data/
/heliosearch/solr/example$ cat solr/books/core.properties
#Written by CorePropertiesLocator
#Thu Apr 24 21:12:33 EDT 2014
name=books
configSet=generic
So we can see that the new cores created only contain a data directory and lack a “conf” directory of their own. The core.properties file points to the correct named configSet.
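When several cores should share one configSet, the CREATE calls can be scripted. The sketch below assumes the same example server and CoreAdmin endpoint as above; create_core_url and create_cores are hypothetical helper names, not Solr commands.

```shell
SOLR_ADMIN="http://localhost:8983/solr/admin/cores"

# Print the CoreAdmin CREATE URL for a core that uses a named configSet.
# $1 = core name, $2 = configSet name.
create_core_url() {
  printf '%s?action=CREATE&name=%s&configSet=%s\n' "$SOLR_ADMIN" "$1" "$2"
}

# Create one core per name, all sharing the same configSet.
# $1 = configSet name, remaining arguments = core names.
create_cores() {
  cc_configset="$1"; shift
  for cc_name in "$@"; do
    curl -s "$(create_core_url "$cc_name" "$cc_configset")"
  done
}

create_core_url books generic
# prints: http://localhost:8983/solr/admin/cores?action=CREATE&name=books&configSet=generic
```

With the server running, `create_cores generic books music` would repeat the two CREATE requests from the walkthrough.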
Stopwords and Synonyms REST API
Stopwords and Synonyms may now be managed via a REST API!
The new analysis filter types are ManagedStopFilterFactory and ManagedSynonymFilterFactory.
The example schema.xml now contains a field type that uses these new analysis filters:
To test this out, let’s also change the dynamic field *_en to use managed_en:
Synonyms
After starting the example server, we can retrieve the current English synonyms:
curl "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english"
[...]
"managedMap":{
  "gb":["gib", "gigabyte"],
  "happy":["glad", "joyful"],
  "tv":["television"]}}}
Let’s add a new synonym:
curl -XPUT "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english" -H 'Content-type:application/json' --data-binary '{"mb":["MiB","megabyte"]}'
Before these changes are visible to the actual search or indexing code in Solr, we need to reload the Solr core:
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"
And now we can do a query on a field that matches the dynamicField we set up and can see the results of the new synonym:
curl "http://localhost:8983/solr/query?q=foo_en:mb&debugQuery=true"
[...]
"debug":{
  "rawquerystring":"foo_en:mb",
  "querystring":"foo_en:mb",
  "parsedquery":"(foo_en:megabyte foo_en:mib)/no_coord",
  "parsedquery_toString":"foo_en:megabyte foo_en:mib",
To delete the synonym mapping we just added:
curl -XDELETE "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english/mb"
Stopwords
To retrieve the list of stopwords:
curl "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english"
To add a new stopword:
curl -XPUT "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english" -H 'Content-type:application/json' --data-binary '["foo"]'
To delete the stopword we just added:
curl -XDELETE "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english/foo"
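The synonym and stopword endpoints follow the same URL pattern, so the calls above can be wrapped in a few small functions. This is a sketch under the assumptions of this post (example server, collection1 core, a managed resource named english); the function names are hypothetical. The reload step mirrors the one shown in the Synonyms section.

```shell
SOLR_CORE_URL="http://localhost:8983/solr/collection1"

# Print the endpoint for a managed resource.
# $1 = "synonyms" or "stopwords", $2 = resource name, $3 = optional entry to address.
managed_url() {
  mu_url="$SOLR_CORE_URL/schema/analysis/$1/$2"
  if [ -n "${3:-}" ]; then
    mu_url="$mu_url/$3"
  fi
  printf '%s\n' "$mu_url"
}

# PUT a JSON body ($3) to a managed resource, e.g. '["foo"]' for stopwords.
managed_put() {
  curl -s -X PUT "$(managed_url "$1" "$2")" \
    -H 'Content-type:application/json' --data-binary "$3"
}

# DELETE a single entry ($3) from a managed resource.
managed_delete() {
  curl -s -X DELETE "$(managed_url "$1" "$2" "$3")"
}

# Reload the core so that the changes become visible to search and indexing.
reload_core() {
  curl -s "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"
}

managed_url stopwords english foo
# prints: http://localhost:8983/solr/collection1/schema/analysis/stopwords/english/foo
```

For example, `managed_put stopwords english '["foo"]' && reload_core` would add a stopword and then reload the core.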
Other changes
There have been numerous SolrCloud changes, including:
- A new List collections and cluster status API which clients can use to read collection and shard information instead of reading data directly from ZooKeeper.
- Some long-running SolrCloud commands (like shard splitting) may now be run in “async” mode to avoid client timeouts.
- A new ADDREPLICA command in the Collections API.
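As a rough illustration of the Collections API calls behind the items above, the sketch below only builds the request URLs. It assumes a SolrCloud node at localhost:8983; the collection name books, the shard name shard1, and the request id 1000 are made-up values, and collections_url is a hypothetical helper.

```shell
COLLECTIONS_API="http://localhost:8983/solr/admin/collections"

# Print a Collections API URL.
# $1 = action, remaining arguments = extra key=value parameters.
collections_url() {
  cu_url="${COLLECTIONS_API}?action=$1"; shift
  for cu_param in "$@"; do
    cu_url="${cu_url}&${cu_param}"
  done
  printf '%s\n' "$cu_url"
}

# Read collection and shard state without talking to ZooKeeper directly.
collections_url LIST
collections_url CLUSTERSTATUS

# Start a shard split asynchronously, then poll for its status by request id.
collections_url SPLITSHARD collection=books shard=shard1 async=1000
# prints: http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard1&async=1000
collections_url REQUESTSTATUS requestid=1000

# Add a replica to an existing shard.
collections_url ADDREPLICA collection=books shard=shard1
```

Each printed URL can be passed to curl (quoted, because of the & characters) once a SolrCloud cluster is running.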
Other changes include:
- Solr 4.8 now requires Java 7!
- RegexReplaceProcessorFactory now supports pattern capture group substitution in the replacement string.
- A DocExpirationUpdateProcessorFactory that can mark documents as expired based on a TTL (time-to-live) and periodically delete expired documents.