What’s coming in Solr 5.2

A few days ago I learned via Anshum Gupta that the first Release Candidate for Solr 5.2 is almost out. So let’s take a peek at some of the new features coming in this new release of your favorite search engine.

Field value cardinality

The first feature I would like to highlight is efficient field value cardinality stats. I’m not going to deep-dive into how this (or any other) feature is implemented; Hoss has done a really good job of that in this post on the Lucidworks blog. Suffice it to say that this new stat added to the Stats Component is a big win if you’re using Solr as your analytics backend. And yes, this new feature works seamlessly with the Facet Component.

Solr has actually supported a version of this feature for quite some time, but it was a very naive implementation that performed very poorly, especially in distributed environments.

How do you use this feature? With a local param, of course. The curl command looks like this:

$ curl 'http://localhost:8983/solr/techproducts/query?rows=0&q=*:*&stats=true&stats.field={!count=true+cardinality=true}manu_id_s'

The response will contain all the well-known sections and, of course, something like:

"stats":{
    "stats_fields":{
    "manu_id_s":{
        "count":18,
        "cardinality":14}}}}

In the previous output you can see that 18 of the documents found have a value for the manu_id_s field, and that 14 of those values are unique. This is really, really helpful.

You might think this is rather trivial, right? Well, not quite. If you think about it, computing this on a single node with a not-very-large collection is almost a no-brainer, but what happens when you want to compute the cardinality of a field with a very high number of unique values (like Twitter usernames) over a distributed environment? Then things get a little messy: computing the unique values in each shard, sending all that data to the coordinating node, and aggregating it before finally sending a response is a really expensive operation. If you’re interested in how it’s done, take a look at HyperLogLog, the algorithm used by both Solr and Elasticsearch to compute field cardinality. In this post you’ll find a very instructive benchmark of the performance of this new feature, and also of the compromises you need to make in order to compute this value for very large collections.
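
As a rough sketch of that trade-off: according to the Lucidworks post, the cardinality local param also accepts a number between 0.0 and 1.0 to trade RAM for accuracy (higher values use more memory for a more accurate estimate). Reusing the techproducts example from above, that would look something like this; the exact tuning values here are just illustrations, not recommendations:

# cheaper, less accurate estimate (assumed tuning value, adjust to taste)
$ curl 'http://localhost:8983/solr/techproducts/query?rows=0&q=*:*&stats=true&stats.field={!cardinality=0.1}manu_id_s'

# more RAM, more accurate estimate
$ curl 'http://localhost:8983/solr/techproducts/query?rows=0&q=*:*&stats=true&stats.field={!cardinality=0.9}manu_id_s'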

I started with this feature because when you combine it with the Solr facet functions you get a whole new level of analytics capabilities out of Solr.

Solr facet functions

I was going to talk about this feature, but it turns out that Yonik Seeley has done a much better job, so please read his post instead.

Rule-based replica assignment for SolrCloud

If you’re running a large SolrCloud cluster, this feature will make your life a lot easier. Put very simply, it gives you fine-grained control over which nodes are selected when Solr needs to assign new nodes to a collection. Right now Solr will either use a random strategy to select the new node(s) or allow you to manually specify a list of node names, but keeping track of each node name in your cluster is hard, hence the “simplify your life”.

With this feature you will be able to specify a set of rules that automatically control how nodes are assigned to a collection. The rules are applied in the following scenarios:

  • Collection creation
  • Shard creation/splitting
  • Replica creation

If you want to know more about this really helpful feature, I recommend reading this post published by the folks at Lucidworks. In it you can find what exactly is defined as a rule, along with a lot of examples and some common use cases where this feature comes in very handy.
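
To give a feel for the syntax, here is a minimal sketch of a collection CREATE call using the new rule parameter. The collection name is made up, and the rule shown (don’t put two replicas of the same shard on one node) is one of the common examples from that post:

$ curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&rule=shard:*,replica:<2,node:*'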

Flatter request structure for the JSON Facet API

This feature is also designed to improve the ease of use of the new JSON Facet API introduced in Solr 5.1. Essentially it means that the facet type can be specified as a normal facet parameter. For example, take this request in Solr 5.1:

top_authors : { terms : {
  field : author,
  limit : 5
}}

In Solr 5.2 the facet type (terms) can be specified with a new "type" parameter:

top_authors : {
  type : terms,
  field : author,
  limit : 5
}
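
For context, these facets are sent to Solr through the json.facet request parameter. A minimal sketch of a full request using the 5.2 flat syntax, assuming a hypothetical books collection with an author field, would look something like this:

$ curl http://localhost:8983/solr/books/query -d 'q=*:*&rows=0&
json.facet={
  top_authors : {
    type : terms,
    field : author,
    limit : 5
  }
}'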

Range facet mincount

Range facets now support the mincount parameter to exclude range facet buckets that don’t meet a minimum document count.

prices : {
  type : range,
  field : price,
  mincount : 1,
  start : 0, end : 100, gap : 10
}

Streaming expressions

This is a new feature built around the Streaming API introduced in Solr 5.1. The new expression interface allows you to do very interesting stuff, for instance:

// merge two distinct searches together on common fields
merge(
  search(collection1, q="id:(0 3 4)", fl="id,a_s,a_i,a_f", sort="a_f asc, a_s asc"),
  search(collection2, q="id:(1 2)", fl="id,a_s,a_i,a_f", sort="a_f asc, a_s asc"),
  on="a_f asc, a_s asc")

// find top 20 unique records of a search
top(
  n=20,
  unique(
    search(collection1, q=*:*, fl="id,a_s,a_i,a_f", sort="a_f desc"),
    over="a_f desc"),
  sort="a_f desc")

These examples are taken from the JIRA issue.
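
If you want to try these out over HTTP, streaming expressions are sent to a collection’s /stream handler. Below is a sketch along the lines of the JIRA examples; note that the parameter name and exact syntax come from the work-in-progress issue and may change before the final release:

$ curl --data-urlencode 'stream=search(collection1, q="*:*", fl="id,a_s,a_i,a_f", sort="a_f desc")' 'http://localhost:8983/solr/collection1/stream'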

These are only some of the new features coming in Solr 5.2; there are a lot of other features (small and big), including some very interesting ones around the Facet API.

In my opinion, field value cardinality is one of the biggest features in this release, and I say this as a personal opinion and without any intent of downgrading the other features. It is based solely on the fact that this feature closes, a little more, the gap in analytics features between Solr and Elasticsearch, which means a lot if you’re using Solr as your analytics backend. Of course, this feature together with the new improvements in the Facet API and the SolrCloud-related features brings Solr to a whole new level.

Lastly, if you want to test this new version before it’s officially released, which is indeed a good practice to ensure that updating Solr doesn’t break anything, you can check out and build the source code from the lucene_solr_5_2 branch. And just to point out: even though Solr 5.2 isn’t out yet (officially), new features are already being added for the Solr 5.3 release; one major feature for that upcoming release is Cross Data Center Replication, as reported by Anshum Gupta.
