Google Search Appliance is going away

TL;DR

Build your own search solution around your specific needs using one of the great open source alternatives out there (Apache Solr or Elasticsearch) and stop worrying about what your search vendor will, or will not, offer in the future.

The news

If you’re in any way involved in the enterprise search world you must have noticed by now that Google is discontinuing the Google Search Appliance (GSA) product, and that it will not be available for sale from 2017. Miles Kehoe and Martin White have written a very good analysis with some follow-up steps on what to do next. I think this proves once again the traction that open source search alternatives are gaining. I don’t agree with Laurent Fanichet of Sinequa that the end of GSA “seals the end of the era of commoditized search”; in my opinion this commoditization of search products is a consequence of the rapid growth of open source search options. It’s true that neither Apache Solr nor Elasticsearch is a drop-in replacement for GSA, but both solutions provide a solid foundation for any enterprise or home-grown solution.

In my opinion this situation is a very good reminder of what’s wrong with closed vendors/solutions like GSA, Endeca, and FAST ESP (for instance): you develop your business around these key technologies (yes, search is a critical feature) and you wake up one day to find that one of them is no longer available, supported, or actively developed. Sure, building search products is not easy, but sooner rather than later it will pay off. First, you’ll be constructing a solution around a technology that is not likely to disappear: Apache Solr has been around for ages and is developed under the Apache Software Foundation, which means a lot of great people are putting great ideas and code behind it, and the community around it will ensure that it never goes away (I’m not trying to be an absolutist here, but… you get the point). Both Apache Solr and Elasticsearch have companies supporting the development of the products (Lucidworks and Elastic, respectively), but the licensing options allow you to put your head on the pillow without any worries about what happens tomorrow: perhaps the company will go away (not likely), but the products as they exist today will remain available, and you could develop your own “fork” if you need to, or hire developers to work on your particular needs.

Migrating to something new is always a scary process, but there are a lot of great consultancy firms out there willing to help you: Open Source Connections and Flax, to name a few, and even the very companies behind the products themselves: Lucidworks, which is behind Solr development, and Elastic, which supports Elasticsearch development.

Also, know that you’re not alone; other companies have already made the switch, with happy results. Take, for instance, the case of CareerBuilder and other companies that, once they made the change, went on to develop stronger products, making search a more essential part of their business. If you want to see who is using Solr and for what, check out the videos of the great Lucene/Solr Revolution conference (from any year); this conference is a gold mine of Solr use cases.

One particular product developed by Lucidworks is Fusion, which is something like “Solr on steroids”. I have talked previously about Fusion and Solr, and the short version is that if you don’t want to build a full search team or search product to support your business, then you can use something like Fusion or DataStax, which provide enterprise features ready to use. The key difference is that these products are built around open source technology, so you won’t be tied to a particular vendor. A lot of options exist; I really advise you to use something that is backed by open source, and if you can build your own solution around these technologies, even better.

As for the future, I think we’ll see a couple of companies writing “connectors” or interfaces to ease the transition out of GSA into their own particular platforms.

So, wrapping things up: I’m not saying anything new, but people move from one closed vendor to the next and only remember that bad things can happen at times like this, when a product is discontinued. Moving to Apache Solr or Elasticsearch in 2016 is not that hard: consultancy options exist, the community is out there willing to help for free, and there are plenty of success stories to present to your stakeholders to convince them that the migration is worth the effort.

And finally, let me add my two cents of free marketing. :)

I’m currently working as a search engineer/consultant, so if you think I can help you, please get in touch.

Increasing performance of an XLSX import feature

In a Symfony2 application we’re developing for a client, a key feature is the import of some XLSX files containing a previous representation of the data (let’s call it legacy data) into the system.

When the team started to work on this feature the client provided us with test files to analyze the structure, the data that needed to be imported, etc. These files were rather small, only a few hundred kilobytes each (almost all of them were under 500 KB). So one of the developers on our team built the feature to import the data using the amazing PHPExcel library. And we were all happy for some time.

Continue reading “Increasing performance of an XLSX import feature”

Take-away points from Lucene/Solr Revolution 2014

Quite late, but I finally got my hands on the videos from the Lucene/Solr Revolution 2014 conference, and these are my take-away points from last year’s edition.

I’ve compiled some key points (from my own perspective of the conference) into a very short, summarized list that I want to share; keep in mind that these are my own opinions.

  • If search is key to your business, app, or website, then you must monitor how your users are using the search capabilities you’re exposing. This is true whether you’re using Solr, Elasticsearch, or even raw Lucene; at the conference we saw references to this in several presentations (Airbnb, Evernote, etc.). That data can be used to improve your search, but it can also provide very valuable insight into how your search is being used, basically letting you tell whether your “formula” is working or not.

  • Sometimes it’s OK to use Lucene and build your own custom search solution from the ground up. This is not for everybody, but in some cases it will be worth the effort (Twitter, LinkedIn); see the first sketch after this list for how small the core API really is.

  • Solr can be used for the most unexpected use cases. Yes, we know that when you have text you can use Solr, Elasticsearch, or Lucene to search on it, but did you know that you can even search images by color? Solr can also be used just to deduplicate content; how cool is that? Do you need an engine that you can feed data into and then run quick queries against? Then Solr is for you; the use cases are limited only by your imagination (or your own needs).

  • If search is a core feature for your use case, consider abstracting the inner workings of your search engine of choice away from the rest of your engineering team, i.e., provide a library that lets other members create amazing apps without wasting time learning how Solr and Lucene work or how you scaled your infrastructure; these are complex issues. In previous editions of this same conference we’ve seen good examples of this approach; the case of CareerBuilder comes to mind. (A sketch of such a facade appears after this list.)

  • Paired with the previous point, you must also provide tools that allow the other members of your engineering team to debug a query and do A/B testing, and bring in the people who know the content to judge result quality. Of course, none of this is easy to build or maintain, but the effort is well rewarded; the idea is to create an ecosystem that democratizes search in your organization.

  • The new Analytics component is a powerful addition to Solr, shipping in the recently released Solr 5.0 and available for previous versions in the form of a patch. This awesome feature was presented by Steve Bower from Bloomberg and is a big leap forward compared to the Stats component, that old friend some of us still use. I think that this brand new search component, combined with AnalyticsQuery and the introduction of the PostFilter interface (sketched below), is leading Solr down the path to becoming one of the most customizable analytics platforms, an area that in my personal opinion Elasticsearch attacked before Solr did.
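
To show how approachable the Lucene route is at its smallest, here is a minimal, self-contained sketch, assuming Lucene 5.x APIs (the field name and text are made up), that indexes one document and searches it back:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;

public class LuceneHelloWorld {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // index a single document with one analyzed, stored text field
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
    Document doc = new Document();
    doc.add(new TextField("body", "open source search is here to stay", Field.Store.YES));
    writer.addDocument(doc);
    writer.close();

    // search it back and print the stored field of each hit
    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    Query query = new QueryParser("body", analyzer).parse("search");
    for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
      System.out.println(searcher.doc(hit.doc).get("body"));
    }
    reader.close();
  }
}
```

What Twitter and LinkedIn run is of course many layers on top of this, but the core API really is this small.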
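On the abstraction point, here is a minimal sketch of what such a facade could look like, assuming SolrJ 5.x; the class name and the “title” and “posted_at” fields are hypothetical, not taken from any talk:

```java
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;

// Hypothetical facade: application code depends on search() and never sees
// a Solr class, a query syntax, or the shape of the cluster behind it.
public class JobSearchService {

  private final SolrClient solr;

  public JobSearchService(SolrClient solr) {
    this.solr = solr;
  }

  /** Returns matching job titles, newest first. */
  public List<String> search(String userText, int page, int pageSize)
      throws SolrServerException, IOException {
    SolrQuery query = new SolrQuery(userText)
        .setStart(page * pageSize)
        .setRows(pageSize)
        .addSort("posted_at", SolrQuery.ORDER.desc); // 'posted_at' is a made-up field
    return solr.query(query).getResults().stream()
        .map(doc -> (String) doc.getFieldValue("title")) // 'title' is made up too
        .collect(Collectors.toList());
  }
}
```

The rest of the team codes against search() and never needs to know how the engine underneath is configured or scaled.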
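And since the PostFilter interface came up, here is a minimal sketch of it in action, assuming Solr 4.x/5.x APIs. The per-document rule (keep every other match) is a placeholder just to show where your logic plugs in; a real implementation would also need proper equals() and hashCode() methods and a QParserPlugin to create it from a request:

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

// Hypothetical example: keeps every other matching document, just to show
// where per-document logic runs after the main query has matched.
public class EveryOtherDocFilter extends ExtendedQueryBase implements PostFilter {

  @Override
  public boolean getCache() {
    return false; // post filters must not be cached
  }

  @Override
  public int getCost() {
    // a cost of 100 or more is what marks a query as a post filter
    return Math.max(super.getCost(), 100);
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      private int seen = 0;

      @Override
      public void collect(int doc) throws IOException {
        if (seen++ % 2 == 0) {
          super.collect(doc); // hand the document to the next collector in the chain
        }
      }
    };
  }
}
```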

It’s always nice to see some more advanced solutions that use Lucene at their core, with their own layers on top; LinkedIn is a great example of this. Keep in mind, though, that this is not trivial: you’ll need very talented engineers to create this type of system, and in most cases it is not really required.

Use DocValues; there is no other way of saying this. If you want to do analytics or faceting on very large collections, you’ll have to use DocValues (declared with docValues="true" on the field definition in your schema): it improves memory usage a lot, and if you don’t trust me, you’ll hear exactly that in several talks from the conference given by more advanced folks.

I think that search engines are gaining a lot of traction, not in the traditional Google, Bing, or Yahoo! style, but as a technology that can power very interesting use cases, mostly analytics; because of this, people and companies have been looking for ways to run these products at an even larger scale. Take a look at the talks by Tomás Fernández Löbbe from the Amazon CloudSearch team and Jessica Mallet from Apple and you’ll think entirely differently about your own setup, trust me on this.

I’d love to have the opportunity to go to this conference again in the near future; it’s a real joy to share a room with the most talented engineers out there pushing search into the future. So, to finish this post, I just want to extend an invitation to the next Lucene/Solr Revolution event, which will be held in Austin, TX, October 13-16. Registration opens this spring, so if you want to stay informed about the event, visit the site or follow the @LuceneSolrRev or @Lucidworks Twitter accounts for the latest news.

Adding some safeguard measures to Solr with a SearchComponent

Just to be clear, I won’t be talking here about adding security to Solr; documentation on that topic can be found in several places. I’m going to talk about another kind of problem: the kind that comes up when a user of your application requests a huge number of documents from your Solr cluster. To clarify, perhaps your particular use case requires that your users be able to fetch a great deal of documents from Solr; in that case you should have planned your CPU and RAM requirements accordingly, taking into account your particular schema, caches, and analyzers, and even using DocValues to improve query speed. But in some cases you just want to put safeguard measures in place to prevent your users from fetching too many documents from Solr.

Out of the box, Solr comes without any option to deal with this case, and sure, you could implement it in your app, but if you have several “clients” of your Solr data (this was my case), then you have to spread that change into all of those clients, and every time you change something in one place you need to change it in all the other pieces of code as well.

Continue reading “Adding some safeguard measures to Solr with a SearchComponent”
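
(As a teaser, here is a minimal sketch of the idea behind the full post, assuming Solr 5.x APIs; the class name and the 100-row cap are made up for illustration. You would register it as a <searchComponent> in solrconfig.xml and add it to your request handler’s first-components list.)

```java
import java.io.IOException;

import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// Hypothetical safeguard: reject any request that asks for more than MAX_ROWS
// documents, no matter which client sent it.
public class RowLimitComponent extends SearchComponent {

  private static final int MAX_ROWS = 100; // made-up limit for illustration

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    int rows = rb.req.getParams().getInt(CommonParams.ROWS, 10);
    if (rows > MAX_ROWS) {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "rows must be <= " + MAX_ROWS);
    }
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // nothing to do here: the check happens in prepare(), before the query runs
  }

  @Override
  public String getDescription() {
    return "Rejects requests that ask for too many rows";
  }

  // Note: some Solr versions also require implementing getSource().
}
```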

MIME Type filtering with Nutch 1.x

As I’ve said before, Nutch is a great web crawler, and of course it provides options for you to configure the process as you want. In this particular case, if you only want to crawl certain files (e.g. HTML pages only, ignoring all the images) you can use one of several available mechanisms. Nutch basically provides a plugin type that allows you to filter your crawl based on the URLs, and it actually ships with a few of these plugins right out of the box, essentially the urlfilter-* family under the plugins directory. A minimal sketch of such a plugin appears below.
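
Here is a minimal sketch of what a custom URL filter plugin looks like, assuming the Nutch 1.x URLFilter interface; the class name and extension list are hypothetical, and a real plugin also needs the usual plugin.xml descriptor to be wired in:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Hypothetical filter: drop URLs that look like images so only pages get fetched.
public class NoImagesUrlFilter implements URLFilter {

  private Configuration conf;

  @Override
  public String filter(String urlString) {
    String lower = urlString.toLowerCase();
    if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")
        || lower.endsWith(".png") || lower.endsWith(".gif")) {
      return null; // returning null tells Nutch to discard the URL
    }
    return urlString; // returning the URL (possibly modified) keeps it in the crawl
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}
```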

Continue reading “MIME Type filtering with Nutch 1.x”