Fusion is a great piece of software, brought to us by the good people of Lucidworks, but if you’re not too deep in the search business one question pops into your mind: why do I need Fusion if the good old guys (and girls) of Lucidworks give away Solr for free?
Yes, it’s true: Solr is an amazing product, you can download it for free, and you can even add your own components. But it’s not all sunshine and rainbows. If you want to successfully integrate Solr into your site, application or whatever, you need to write code to interface your own application with Solr. Even more, tuning Solr to achieve the desired level of accuracy, or to accomplish that super-special use case of yours, can be tricky. Hey! Just think that companies like Lucidworks and OpenSource Connections make a living out of it.
If you go right now to the Solr website you can read this:
Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™
In the quote above I highlighted the words enterprise search platform. Why? Although Solr has a lot of features (well covered in a reference book of over 500 pages), I don’t think it is quite enterprise yet, meaning that between downloading it and effectively using it in your product you’ll need to take some steps, which of course can vary a lot depending on your use case.
Solr has improved a lot since the first time I used it (back in the 3.X series), but it still has a long path ahead, a path that has to be more focused on creating amazing new features and shipping more analyzers, more NLP sauce and more scalability features, and less on providing enterprise-ready, easy-to-use software. As a search engineer I’m comfortable using Solr, tweaking the schema of the data or the API endpoints, and writing components to achieve what my employer or client wants. I’m even comfortable writing code around Solr to make it easy for others in my team to write search applications without needing to know the inner workings of Solr. Hey! This is how I earn my beans.
But (here comes the but) if you’re the product manager of a not-so-big website and you don’t have search engineers on your payroll, Solr could be a little too overwhelming, unless you’re comfortable with your engineers spending time learning Solr (and no, PMs, one week is not enough).
Solr is BIG. In its guts lives a dragon called Lucene, which is very powerful but also a little bit scary, and it may take some time to get along with. As if this were not enough, Solr adds some layers of good code around the dragon that give it a more powerful roar and an even larger flare. So it’s not enough to know the dragon/Lucene; you need to know what Solr adds to the mix, which is a lot, from faceting code to scalability capabilities using Zookeeper (one more creature in the forest). If you want to do advanced things with Solr you’ll need to spend some time in the woods living with these creatures. The summary of this paragraph is that Solr is easy enough to get you started and give you a taste of full text search, but for more advanced use cases it is kind of challenging. For instance, perhaps you want to do some sort of collaborative filtering (you want your results to be influenced a little by what users select); this is not trivial with Solr.
Getting data into Solr
In order to use Solr you’ll first need to create your own data ingestion pipeline. Perhaps your data now lives in a SQL/NoSQL database, or you have it published on several websites, or in plain old files in your servers’ filesystems. Or you use some form of cloud-based storage like Google Drive or Dropbox.
Whatever the source of your information is, you need to write some code to take that data and put it in Solr. Sure, Solr comes with a very comprehensive client library, SolrJ, but using it comes with a price: your engineers will need to learn how to use the library, what is good practice and what cannot be done. Depending on the amount of data, you may also need multiple threads to do the ingestion and reduce the time required to load all your data.
One more option, if your current storage solution is a local filesystem or a SQL database, is to use the DataImportHandler, which has been around for quite a bit. Yes, it is usable and it is a great solution, but it has its tipping point: there comes a moment when it is not good enough, or fast enough, or easy enough, and then you’ll need to fall back to a custom client written in any language that has a library for Solr (such as SolrJ for Java).
In this particular aspect Fusion comes to the rescue with a big gun: “Data Sources & Connectors”. In a beautiful interface you configure your input, which could be almost anything: a filesystem, FTP, HDFS, S3, Azure or SMB, the Web (yes, Fusion comes with a very handy Web crawler), the SharePoint server in your company, any number of databases, a Hadoop cluster, or Twitter. And more connectors are coming with each release, such as Google Drive or Dropbox.
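To give an idea of what that custom client involves, here is a minimal sketch in Python using only the standard library. It splits documents into batches and builds one POST request per batch for Solr’s JSON update handler. The host, the collection name `products` and the sample rows are assumptions made up for illustration, and the requests are built but not sent.

```python
import json
from itertools import islice
from urllib.request import Request

# Assumed host and collection name; adjust to your own installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/products/update"


def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_update_request(batch, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request carrying one batch as a JSON array."""
    body = json.dumps(batch).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical rows, standing in for whatever your database or filesystem yields.
rows = [{"id": str(i), "title": f"Document {i}"} for i in range(5)]
requests_to_send = [build_update_request(b) for b in batches(rows, size=2)]
# Sending is left out on purpose; urllib.request.urlopen(req) would perform the POST.
```

Even this toy version leaves out the hard parts: retries, commits, error handling and parallel workers, which is exactly the code you end up owning.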
Transforming data before it goes into Solr
So up to this point, with Fusion we can choose our data source and configure it from a comfortable web interface, and it’s ready to go. But what if you don’t want all the data to be imported? No problem: with “Index Pipelines” you can easily apply transformations to your data prior to the indexing stage. As described in the documentation:
An Index Pipeline takes content and transforms it into a document suitable for indexing by Solr via a series of modularized operations called Index Stages. The objects sent from stage to stage are PipelineDocument objects.
I’m not going to give a detailed explanation of this feature; it’s enough to know that it is there and that you don’t need to write any code to use it.
In a previous post I talked about some features you need to have if search is actually important for you. One MUST-have is monitoring your search usage, meaning you need an analytics component on top of your search solution to get information about how your users are searching. Solr doesn’t provide any default way of doing this. Yes, sure, you can build one around Solr itself, even using Solr as your datastore and running queries against your Solr cluster, and everything will be great. You can even talk at conferences about how you and your team built it. But the bottom line is that you need to write the damn thing first, which can be tricky. This is one of those aspects where Fusion really excels. By default Fusion provides a Report feature, which gives you reports on:
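Just to illustrate the general idea of a document flowing through a series of stages, here is a minimal sketch in plain Python. It is not Fusion’s API: the stage functions, the field names and the sample documents are all hypothetical, and a real PipelineDocument carries more than a dict does.

```python
def drop_drafts(doc):
    """Stage: discard documents flagged as drafts by returning None."""
    return None if doc.get("status") == "draft" else doc


def normalize_title(doc):
    """Stage: trim and collapse whitespace in the title field."""
    return {**doc, "title": " ".join(doc.get("title", "").split())}


def run_pipeline(doc, stages):
    """Pass a document through each stage in order; stop if a stage drops it."""
    for stage in stages:
        doc = stage(doc)
        if doc is None:
            return None
    return doc


stages = [drop_drafts, normalize_title]
docs = [
    {"id": "1", "title": "  Fusion   and Solr ", "status": "published"},
    {"id": "2", "title": "Work in progress", "status": "draft"},
]
# Keep only the documents that survived every stage.
indexable = [d for d in (run_pipeline(doc, stages) for doc in docs) if d is not None]
```

With Fusion you configure the equivalent of `stages` from the UI instead of writing and maintaining functions like these.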
- Top queries: the queries that have been performed most often. This is based on the ‘topQueries’ report.
- Top clicked: the items that have been clicked most frequently. This requires that user click events have been stored in the system, and that they have been aggregated to get click boost data. This is based on the ‘topClicked’ report.
- Slow queries: shows queries that have been statistically slower than others. This report is based on the ‘histo’ report, which is a histogram of query times.
- Zero results queries: shows queries that have returned 0 results. This report is based on the ‘lessThanN’ report.
- Query rate (last 10 minutes): shows the query rate over the last 10 minutes. This report is based on the ‘dateHisto’ report.
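To show what building even the simplest of these reports yourself looks like, here is a small sketch in Python. The query log and its layout (query text, number of results, response time) are assumptions for illustration; it says nothing about how Fusion computes its own reports.

```python
from collections import Counter

# Hypothetical query log: (query text, number of results, response time in ms).
query_log = [
    ("solr", 120, 35),
    ("fusion", 40, 50),
    ("solr", 120, 30),
    ("lucnee", 0, 20),
    ("solr faceting", 15, 900),
    ("lucnee", 0, 25),
    ("solr", 120, 28),
]


def top_queries(log, n):
    """Return the n most frequent queries with their counts."""
    return Counter(q for q, _, _ in log).most_common(n)


def zero_result_queries(log):
    """Return the distinct queries that matched nothing, sorted alphabetically."""
    return sorted({q for q, hits, _ in log if hits == 0})


def slow_queries(log, threshold_ms):
    """Return queries whose response time exceeded the threshold."""
    return [q for q, _, ms in log if ms > threshold_ms]
```

The functions are the easy part. Collecting the log reliably, storing it and keeping the reports fresh is where the real work is.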
Also, if you are using SolrCloud you know that your configuration doesn’t exist only in your server’s conf folder; it also needs to be uploaded to Zookeeper (you know, that other forest animal we talked about at the beginning). This is also very easy with Fusion, which is a small gain after all.
As I was saying, you usually want to influence the ordering of your search results using some input from the user. This could be some sort of rating of each individual result, or some indirect measure of a “vote”, e.g. clicking: if a user clicks on a result, this could be interpreted as a positive vote from the user, which in turn can be used to boost this document higher next time. This has its ups and downs, which are outside the scope of this post, but the engineers at OpenSource Connections have written about the topic in depth.
The point here is that if you want to build this, you have a bunch of new problems on your hands. How are you going to count these “hits” or ratings? Is an integer a good choice? How is this integer going to affect the overall scoring formula? Should you use a multiplicative boost or an additive one? Should you use the logarithmic value instead? I think you get the picture. Well, you’ll be pleased to know that Fusion has got your back on this too. There is this huge thing in Fusion called Signals, Aggregations, and Recommendations that abstracts all the mathematical part of the problem for you. If you check out the documentation:
Signals can be pretty much anything. Typically, signals are events with timestamps which provide information relevant to search. For example, clickstream data provides a cascade of timestamped datapoints, such as: user A searched for term X, then user A clicked on document Y, then user A clicked on document Z. Raw signal data is generally a large set of small datapoints which in and of themselves aren’t informative and require further processing.
Aggregation is the “processing” part of signals processing. An aggregator reads in raw signals and returns interesting summaries, ranging from simple sums to sophisticated statistical functions.
Recommendations are Solr queries run against aggregated signals. Fusion provides Item-to-Query recommendations (improved query results), Query-to-Item recommendations (top queries which lead to an item), or Item-to-Item recommendations (e.g. “customers who bought this also bought that”).
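To make the questions about counting clicks and boosting a bit more concrete, here is a minimal sketch in Python. It aggregates raw click signals into a count per query and document, and then compares an additive boost with a multiplicative, logarithmic one. The signal layout, the `weight` value and both formulas are assumptions chosen for illustration; they are not what Fusion does internally.

```python
import math
from collections import Counter

# Hypothetical raw signals: (user, query, clicked document id).
signals = [
    ("a", "solr", "doc1"),
    ("b", "solr", "doc1"),
    ("c", "solr", "doc2"),
    ("a", "solr", "doc1"),
]


def aggregate_clicks(raw):
    """Aggregate raw click events into a count per (query, document) pair."""
    return Counter((query, doc) for _, query, doc in raw)


def multiplicative_boost(score, clicks):
    """Scale the base score by 1 + log(1 + clicks); zero clicks leaves it unchanged."""
    return score * (1.0 + math.log1p(clicks))


def additive_boost(score, clicks, weight=0.1):
    """Add a fixed amount per click; `weight` is an arbitrary illustrative value."""
    return score + weight * clicks


clicks = aggregate_clicks(signals)
```

The logarithm dampens the effect of very popular documents, so a result with thousands of clicks does not bury everything else. Picking between these options, and tuning them, is the part Fusion takes off your plate.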
Yes, perhaps your use case can’t benefit from this Fusion feature, but for most of the PMs out there this could be just enough, and it will get better over time. If you want to know more about this feature, check out this awesome post on the Lucidworks blog.
If you are serious about search you must know by now that you also need some solution for your monitoring needs. Basically, you need to know the state of your cluster, RAM/CPU usage, the state of each collection and many more metrics. The Solr Admin UI comes with some of this, but it may well not be enough, even if you don’t need very advanced metrics (in which case solutions like SPM are kind of overkill; great job anyway, Sematext). Once more, Fusion provides just the right amount of metrics for you to be comfortable with, and even a little more.
If you run a multitenant environment, you need to provide some kind of security. Perhaps you don’t want every engineer in your team to be able to fire a DELETE query and wipe out all the data, or you want fine control over what a user can or cannot do in your cluster. Solr has no built-in security mechanism, and the overall consensus is that you need to build your own around Solr, meaning adding security in your application layer or adding some security measures in the container where Solr is running. Fusion is bundled with Users & Roles capabilities, allowing you to use your own company LDAP.
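For a sense of what “security in your application layer” means when you build it yourself, here is a minimal sketch in Python of a role check that runs before any request reaches Solr. The role names and the permission table are hypothetical, and this is in no way how Fusion implements its Users & Roles.

```python
# Hypothetical role table: which Solr operations each role may perform.
ROLE_PERMISSIONS = {
    "admin": {"query", "update", "delete"},
    "indexer": {"query", "update"},
    "reader": {"query"},
}


def is_allowed(role, operation):
    """Return True if the role may perform the operation; unknown roles get nothing."""
    return operation in ROLE_PERMISSIONS.get(role, set())


def guard(role, operation):
    """Raise PermissionError before a forbidden request would reach Solr."""
    if not is_allowed(role, operation):
        raise PermissionError(f"role {role!r} may not perform {operation!r}")
```

A call such as `guard("indexer", "delete")` raises before the delete is ever sent, but only if every code path that talks to Solr goes through the guard, which is the hard thing to guarantee.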
Not all the features
Each of the features discussed previously deserves a post of its own, and I’m trying to find the time to write a more comprehensive review of Fusion, digging a little deeper into some features. This is just a short list of the things that Fusion can do for you out of the box. Even better, Fusion comes with a Solr instance inside, but it also integrates with your existing Solr installation. How cool is that? I haven’t covered all the features that Fusion puts on top of Solr; you should check out the product page and the documentation for a more comprehensive view. Some of my other favorites are Advanced Relevancy Tuning, the Enhanced Admin UI, Dashboards, etc.
In the end I haven’t answered the question raised in this post’s title: what is Fusion doing for Solr? My answer is quite simple: Fusion is adding the enterprise portion that Solr is selling in its description. Don’t get me wrong, Solr is AMAZING, but in my own personal opinion it’s not quite usable by the enterprise, and this is where Fusion comes along and acts as a great player. Enterprise clients want to download a product and start using it, mostly without needing to write code. Yes, I know, I said mostly, because not everyone is the same.
I would be glad to hear other people’s opinions about this.