Google Search Appliance is going away

TL;DR

Build your own search solution around your specific needs using one of the great open source alternatives out there (Apache Solr or Elasticsearch), and stop worrying about what your search vendor will or won’t offer in the future.

The news

If you’re in any way connected to the enterprise search world, you must have noticed by now that Google is discontinuing the Google Search Appliance (GSA) product and that it will not be available for sale from 2017. Miles Kehoe and Martin White have written very good analyses, along with some follow-up steps on what to do next. I think this proves once again the traction that open source search alternatives are gaining. Laurent Fanichet of Sinequa says that the end of GSA “seals the end of the era of commoditized search”; in my opinion, this commoditization of search products is instead a consequence of the rapid growth of open source search options. It’s true that neither Apache Solr nor Elasticsearch is a drop-in replacement for GSA, but both solutions provide a solid foundation for any enterprise or home-cooked solution. Continue reading “Google Search Appliance is going away”


Increasing performance of an XLSX import feature

In a Symfony2 application we’re developing for a client, a key feature is the import of some XLSX files containing a previous representation of the data (let’s call it legacy data) into the system.

When the team started to work on this feature, the client provided us with test files to analyze the structure, the data that needed to be imported, etc. These files were rather small, only a few hundred kilobytes (almost all of them were under 500 KB). So one of the developers on our team built the feature to import the data using the amazing PHPExcel library. And we were all happy for some time.

Continue reading “Increasing performance of an XLSX import feature”

Take-away points from Lucene/Solr Revolution 2014

Quite late, but I finally got my hands on the videos from the Lucene/Solr Revolution 2014 conference, and these are my take-away points from last year’s edition.

I’ve compiled some key points (from my own perspective of the conference) into a very short, summarized list that I want to share. Keep in mind that these are my own opinions.

  • If search is key to your business, app, or website, then you must monitor how your users are using the search capabilities you’re exposing. This holds whether you’re using Solr, Elasticsearch, or even plain Lucene; several presentations at the conference touched on this (Airbnb, Evernote, etc.). This data can be used to improve your search, but it can also provide very valuable insight into how your search is being used, basically allowing you to tell whether your “formula” is working or not. Continue reading “Take-away points from Lucene/Solr Revolution 2014”
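None of those talks published their logging code, so as a minimal, hypothetical sketch of the idea: record each query together with its hit count, and derive a zero-result rate, one of the simplest signals that the search “formula” needs tuning. The class and method names below are mine, not from any presentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: accumulate per-search usage data so that
// zero-result queries (and slow ones) can be spotted later.
class SearchUsageLog {
    private final List<String> queries = new ArrayList<>();
    private final List<long[]> stats = new ArrayList<>(); // {hits, latencyMs}

    // Call this once per search the application performs.
    void record(String query, long hits, long latencyMs) {
        queries.add(query);
        stats.add(new long[] { hits, latencyMs });
    }

    // Fraction of searches that returned nothing at all -- a rising value
    // hints that analyzers, synonyms, or boosts may need adjusting.
    double zeroResultRate() {
        if (stats.isEmpty()) {
            return 0.0;
        }
        long zero = stats.stream().filter(s -> s[0] == 0).count();
        return (double) zero / stats.size();
    }
}
```

In a real deployment this data would more likely go to a log file or analytics store than stay in memory, but the point is the same: instrument every search, not just the failing ones.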

Adding some safeguard measures to Solr with a SearchComponent

Just to be clear, I won’t be talking here about adding security to Solr; documentation on that topic can be found in several places. I’m going to talk about another kind of problem: the problems that come up when a user of your application requests a huge number of documents from your Solr cluster. To clarify, perhaps your particular use case requires that your users be able to fetch a great deal of documents from Solr, and in that case you should have planned your CPU and RAM requirements accordingly, taking into account your particular schema, caches, and analyzers, and even using DocValues to improve query speed. But there are cases when you just want some safeguard measures in place to prevent your users from fetching too many documents from Solr. Out of the box, Solr comes without any option to deal with this. Sure, you could implement it in your app, but if you have several “clients” of your Solr data (as was my case), then you have to spread that change into all of those clients, and every time you change something in one place, it needs to be changed in all the other pieces of code. Continue reading “Adding some safeguard measures to Solr with a SearchComponent”
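To give a flavor of the approach before the full post: a custom SearchComponent can clamp the `rows` parameter for every request, no matter which client sent it. The sketch below is illustrative only (the class name and limit are mine, it needs solr-core on the classpath, some Solr versions require extra overrides such as `getSource()`, and the component still has to be registered in `solrconfig.xml`):

```java
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// Illustrative sketch: cap how many documents a single request can fetch.
public class RowLimitComponent extends SearchComponent {
    private static final int MAX_ROWS = 100; // illustrative limit

    @Override
    public void prepare(ResponseBuilder rb) {
        int rows = rb.req.getParams().getInt(CommonParams.ROWS, 10);
        if (rows > MAX_ROWS) {
            ModifiableSolrParams params =
                new ModifiableSolrParams(rb.req.getParams());
            params.set(CommonParams.ROWS, MAX_ROWS);
            rb.req.setParams(params);
        }
    }

    @Override
    public void process(ResponseBuilder rb) {
        // Nothing to do at process time; the clamp happened in prepare().
    }

    @Override
    public String getDescription() {
        return "Clamps the rows parameter to a safe maximum";
    }
}
```

Because the clamp lives inside Solr, all of the “clients” of your data pick it up automatically, which is exactly the duplication problem described above.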

MIME Type filtering with Nutch 1.x

As I’ve said before, Nutch is a great web crawler, and of course it provides options for you to configure the crawl process as you want. In this particular case, if you only want to crawl some files (e.g. HTML pages only, skipping all the images), you can use one of several available mechanisms. Nutch provides a plugin type that lets you filter your crawl based on the URLs, and it actually ships with a few of these plugins right out of the box: essentially the urlfilter-* family under the plugins directory.

Continue reading “MIME Type filtering with Nutch 1.x”
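As a concrete taste of the urlfilter-* family: the urlfilter-regex plugin reads patterns from `conf/regex-urlfilter.txt`, where a leading `-` rejects matching URLs and a leading `+` accepts them. The stock file that ships with Nutch uses entries along these lines (trimmed here; the real default lists many more suffixes, including uppercase variants):

```
# Reject URLs ending in common image/binary suffixes.
-\.(gif|jpg|jpeg|png|ico|css|zip|exe)$

# Accept anything else.
+.
```

Note that this filters by URL suffix, not by the actual MIME type the server reports, which is why the full post goes on to cover MIME type filtering proper.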