MIME Type filtering with Nutch 1.x

As I’ve said before, Nutch is a great Web crawler, and of course it lets you configure the process as you want. In this particular case, if you only want to crawl some file types (e.g. only HTML pages, skipping all the images), you can use one of several mechanisms available. Nutch provides a type of plugin that lets you filter your crawl based on URLs; in fact, Nutch comes with a few of these plugins right out of the box, essentially the urlfilter-* family under the plugins directory.

Continue reading MIME Type filtering with Nutch 1.x

Solr: Instant Apache Solr for Indexing Data How-to Book Review

Apache Solr Beginner's Guide

I’ve been writing this review for too long; sadly, some work-related issues kept me from finishing it sooner. This is why I want to formally apologize to the author, Alfredo Serafini, and to Punit Shetty, the “guy” from Packt Publishing: wonderful fellows who gave me the opportunity to write this review and provided access to the book, which I otherwise couldn’t afford.

When I start reading a book (any technical book, actually) I like to take a sneak peek through the index before actually reading it, and my first reaction to the index of this great book was: this is a BEGINNERS GUIDE? My first impression came from the fact that in the index I saw a section about the merging of segments and its impact on your indexes, and another section about writing Solr plugins. You don’t need to be a rocket scientist to understand any of those topics, and yet you don’t expect to see them covered in a book with the word “beginners” in the title. Nevertheless, it’s fair to say that this only increased my interest in reading the book. As I said, this was my first impression BEFORE reading the book, but when you start reading it you get hit by this sentence in the Acknowledgments:

Continue reading Solr: Instant Apache Solr for Indexing Data How-to Book Review

Indexing inlinks and outlinks with Nutch 1.x

Nutch is an amazing piece of software; it’s one of the most versatile Web crawlers out there. I’ve been playing with Nutch for quite some time now, since version 1.3 (if my memory isn’t playing tricks on me). Truth be told, Nutch does all the heavy lifting of running a crawl over the Internet, an intranet or just those sites you want to crawl. Nutch also gives the user the ability to change its behavior through configuration.

During the crawl life cycle Nutch needs to pay attention to the outlinks of any given page; this is how Nutch knows where to go next. Of course, these outlinks can be affected by a set of configuration options in your nutch-site.xml file, but those options are beyond the scope of this blog post.

So now we know that Nutch is capable of handling outlinks and inlinks, but how can we index these inlinks and outlinks into Solr? Depending on your use case, being able to index them could be just what you want. Sadly, Nutch doesn’t do this by default, but it’s no problem to add this custom logic through a plugin.

Nutch provides several extension points, which basically correspond to the kinds of plugins you can develop to extend Nutch beyond your wildest dreams. Generalizing our requirement, we could say that what we need to do is index (send to our persistence layer) some data that Nutch already has. So what we need is to implement an IndexingFilter plugin to carry out our task.

Important Note: If you just want to “grab and use”, here is the link to the GitHub repo where I’ve uploaded the full code (tests included), so you can just build it and add the jar to your Nutch installation.

If you’re still reading, let’s start with the “technical stuff”. What we need to do is grab the outlinks and inlinks of a web page before it is sent to Solr (or some other backend). As I said previously, what we need is an IndexingFilter plugin, which is basically just a custom class that implements the IndexingFilter interface. This interface has its own secret sauce, but that is well beyond the scope of this post, so to keep it simple let’s say that it forces us to implement three methods:

  • filter
  • setConf
  • getConf

Both setConf and getConf are pretty straightforward if you don’t want to provide any configuration options for your plugin. We won’t cover how to handle configuration options for custom plugins, but suffice it to say that the Configuration object provided by Hadoop is a no-brainer to use.

So let’s focus on the filter() method, which is the one with the most “condiment”. Basically, this method takes a few arguments:

public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) { … }

The arguments are:

  • doc: an instance of NutchDocument, the object that will eventually be passed to an indexer (Solr by default), which persists the documents.
  • parse: an instance of the Parse class, which holds all the data that Tika and the other parser plugins have extracted from the web page’s raw content.
  • url: the URL of the web page being indexed.
  • datum: an instance of the CrawlDatum class, which holds all the crawl-related information: status, fetch time, fetch interval, modified time and a lot more.
  • inlinks: last but not least, an instance of the Inlinks class.

Let’s take a moment to realize that we have just found the inlinks of any web page, which solves about half of our problem.

The outlinks can’t be far away, but let’s hold that thought for a moment and see how the filter method is supposed to work:

The filter method accepts a NutchDocument as a parameter and returns a NutchDocument, so it basically takes a document and adds or removes data from it (each piece of data in a NutchDocument is called a field, which is an instance of the NutchField class).

So the NutchDocument provides a way to store the important data, the information that we would like to persist in our backend, while staying totally agnostic about the persistence layer we want to use. All that an indexer plugin needs is a NutchDocument that holds all the data; later on, an actual indexer of your choosing will translate what is stored in the NutchDocument into your persistence layer, which could be Solr or Elasticsearch (both bundled with the default installation of Nutch 1.8 at the time of writing this blog post) or anything else you can think of, even a queue like Kafka or RabbitMQ. The next figure illustrates the process we just described. To summarize, an IndexingFilter plugin is only responsible for adding or removing fields from a NutchDocument.

 Nutch Indexing filters life cycle

As we saw previously, the filter method of the IndexingFilter interface receives all the inlinks of the web page (or URL) being processed at the moment. So all we need to do in our custom plugin is add each of them to a new inlinks field of the NutchDocument, using the document’s add method.
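To make the idea concrete, here is a minimal, self-contained sketch in plain Java. It does not use the real Nutch classes: Doc is a hypothetical stand-in for NutchDocument, the inlinks arrive as a plain list of source URLs instead of an Inlinks instance, and the field name inlinks is an assumption. The null check assumes that a page may reach the filter without any known inlinks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InlinkFieldSketch {

    /** Hypothetical stand-in for NutchDocument: each field name maps to a list of values. */
    static class Doc {
        private final Map<String, List<Object>> fields = new LinkedHashMap<>();

        /** Adding to a name that already exists appends, which makes the field multivalued. */
        void add(String name, Object value) {
            fields.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
        }

        /** Returns the values stored under a name, or an empty list if there are none. */
        List<Object> values(String name) {
            return fields.getOrDefault(name, Collections.emptyList());
        }
    }

    /** Copies the source URL of every inlink into the "inlinks" field and returns the document. */
    static Doc addInlinks(Doc doc, List<String> inlinkUrls) {
        if (inlinkUrls != null) { // assumption: there may be no known inlinks for a page
            for (String fromUrl : inlinkUrls) {
                doc.add("inlinks", fromUrl);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Doc doc = addInlinks(new Doc(),
                Arrays.asList("http://a.example/", "http://b.example/"));
        System.out.println(doc.values("inlinks"));
    }
}
```

In a real plugin the same loop would live inside filter() and iterate over the Inlinks argument instead of a list of strings.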

In this case the add method of NutchDocument handles multivalued fields: when needed, it “converts” a single-valued field into a multivalued one. Multivalued fields can store as many values as we want under the same key, which is what we need, since a web page can have several inlinks and several outlinks.

Note: You’ll need to add to your Solr configuration the fields required to store the inlinks and outlinks. The main requirement is that each field is multivalued, so it can hold all the links of one document; the analyzers, tokenizers, etc. that you configure for these fields depend on your specific use case.

So we have just solved half of our problem, but how do we deal with the outlinks? As we said at the beginning of this post, Nutch uses outlinks to keep surfing through the Web, so the outlinks must be stored somewhere. That place is the Parse information, which makes a lot of sense if you ask me: the outlinks are extracted from the raw content of the web page, and for an HTML document that means searching its content for all the tags that contain an outlink. So to index the outlinks, we need to read them from the Parse object and add them to the document in the same way.
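A minimal sketch of that step, again in plain Java with stand-in types rather than the real Nutch API: Link is a hypothetical substitute for Nutch’s Outlink class, and the array is assumed to come from the parse data of the page. Dropping repeated targets is an optional choice made for this sketch, not something Nutch requires.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class OutlinkFieldSketch {

    /** Hypothetical stand-in for Nutch's Outlink: a target URL plus its anchor text. */
    static class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    /**
     * Collects the target URLs that would become the values of an "outlinks" field.
     * A LinkedHashSet keeps the original order and drops repeated targets.
     */
    static List<String> outlinkValues(Link[] outlinks) {
        Set<String> targets = new LinkedHashSet<>();
        if (outlinks != null) {
            for (Link link : outlinks) {
                targets.add(link.toUrl);
            }
        }
        return new ArrayList<>(targets);
    }

    public static void main(String[] args) {
        Link[] found = {
            new Link("http://x.example/", "X"),
            new Link("http://x.example/", "X again"),
            new Link("http://y.example/", "Y")
        };
        // Each value would be passed to the document's add method under the same field name.
        for (String value : outlinkValues(found)) {
            System.out.println("outlinks = " + value);
        }
    }
}
```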

A working plugin: if you just want to grab and use a custom indexing plugin to index your inlinks and outlinks, you can clone this GitHub repo. The plugin relies on a few options that let you configure how you want to store your inlinks and outlinks. For instance, it allows you to filter them so that only links from a host different from that of the web page being processed are stored; this way you keep just the “external” links and forget about all the links from the same host.
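The “external links only” idea can be sketched with nothing but the Java standard library. The class and method names below are hypothetical and this is not the code of the plugin; the sketch only assumes that two links belong to the same host when java.net.URI reports the same host name for both.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ExternalLinkFilterSketch {

    /** Returns the lower-cased host of a URL, or null when the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Keeps only the links whose host differs from the host of the page being indexed. */
    static List<String> externalOnly(String pageUrl, List<String> links) {
        String pageHost = hostOf(pageUrl);
        List<String> external = new ArrayList<>();
        for (String link : links) {
            String host = hostOf(link);
            // Links without a recognizable host are dropped in this sketch.
            if (host != null && !host.equals(pageHost)) {
                external.add(link);
            }
        }
        return external;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://example.com/about",
                "http://other.org/x");
        System.out.println(externalOnly("http://example.com/page", links));
    }
}
```

Note that this comparison treats www.example.com and example.com as different hosts; whether that is desirable depends on your use case.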

If you want to know more about the plugin, you should read the README to see how you can configure it. I also encourage you to look into the code; it uses the same technique explained here.

So, that’s all folks. I hope you’ve enjoyed the journey through this very long post. If you have any questions, leave a comment or contact me on Twitter.

Reviewing Instant Apache Solr for Indexing Data How-to

Recently I had the chance to review a new book about Solr, Instant Apache Solr for Indexing Data How-to, thanks to the author, Alexandre Rafalovitch, who gave me the opportunity. It was a pleasant and really interesting read, so here are my impressions.

A remarkable point about this book is the approach used with the examples: they are first guaranteed to work and then explained in great detail without overwhelming the reader. It’s fair to say that I was already familiar with Solr, so I knew several of the concepts explained in the first chapters, although I had been using a previous version (3.6), so this book offers a very good perspective to those seeking a preliminary introduction to Solr 4. The author manages to present some very complicated topics in an elegant and yet simple form, smoothly driving you from the simpler topics to the advanced ones.

Continue reading Reviewing Instant Apache Solr for Indexing Data How-to

PyJupiter meet the World, World this is PyJupiter

For a few months now I’ve been using Jupiter to speed up certain tasks on my laptop (at home I usually work with an external monitor, so I’m constantly connecting and disconnecting the monitor from my Acer Aspire One D250; obviously it’s more comfortable to work on a 15.6″ display than on the built-in 10.1″ laptop display).

Jupiter is great, without any doubt, but for me it has some “issues”. The main one is that the GUI is developed in C#, which means it needs Mono installed to work. In Ubuntu 11.10 (the distribution I’m using right now) Mono comes installed by default, since applications like Banshee are also developed with it. In the near future (with Ubuntu 12.04) this will no longer be true.

Besides all this, in Ubuntu 11.10, installing Jupiter also means installing all these packages:

libmono-corlib2.0-cil libmono-data-tds2.0-cil libmono-i18n-west2.0-cil libmono-messaging2.0-cil libmono-posix2.0-cil libmono-security2.0-cil libmono-sharpzip2.84-cil libmono-sqlite2.0-cil libmono-system-data-linq2.0-cil libmono-system-data2.0-cil libmono-system-messaging2.0-cil libmono-system-web2.0-cil libmono-system2.0-cil libmono-wcf3.0-cil libmono2.0-cil

That is about 17MB of packages that I need to install just to use the applet. Besides all this, the original implementation uses an icon in the notification area; in Unity, an indicator provides better integration, and I really like keeping my system organized.

Keeping all this in mind, I started to rewrite the GUI part of Jupiter as something called PyJupiter. Essentially, I keep using all the bash scripts and just rewrote the GUI using Python and PyGTK, improving the integration with Unity by using an indicator. After some advice from the original developer, Andrew Wyatt, I also included the systray icon to keep compatibility for users who don’t use Unity. I also added some multithreading, so each submenu is rendered by a different thread, which helps keep the interface responsive. Some minor tweaks have also been made to the menu, using radio buttons or check boxes instead of the original asterisks to indicate the active element.

I’ve packaged it as a deb package, so installation on Ubuntu 11.10 is easy. For RPM-based distributions, the code is inside the deb file, so packaging for other systems should be easy.

So, without any more delay, I give you PyJupiter (Download).

Please share any comments/thoughts in the comments.

Mystery Solved! Rajesh Uses a Mac

For those who don’t know, I’m a big fan of the series The Big Bang Theory, currently broadcast by CBS; this show is simply out of this world.

Anyway, the main characters (like good geeks) have their laptops, and on the Internet you can already find references to the models they use. Rajesh Ramayan Koothrappali uses a MacBook Pro (I think), and although I always assumed he ran Mac OS on that laptop, the series never confirmed it, since in the first and second seasons he always appeared in a video conference with his parents. Now, in the third season, in episode 17, “The Precious Fragmentation”, when he is about to start a video chat (through iChat, I suppose), we can clearly see the distinctive marks of a Mac operating system, and judging by the wallpaper we could assume it is Mac OS X 10.5 Leopard in one of its updates. It seems our dear Rajesh hasn’t upgraded to Snow Leopard yet :-).

Anyway, here is a screenshot of the moment where you can see that the laptop has Mac OS installed.

Rajesh uses a Mac!

URL Validator in Rails

Some time ago I started studying Ruby on Rails, something I had put on hold for a long time. I come from using symfony in some projects I’ve been involved in, and I ended up getting quite used to it, because symfony is a great framework in itself and saves a huge amount of work, from the validators it offers to features like the Admin Generator, which are really awesome.

Well, the thing is that in Rails I needed to validate that a URL entered by the user was valid. In symfony, I simply defined this as follows:

$this->validatorSchema['url'] = new sfValidatorUrl(array('required'=>true));
$this->validatorSchema['url']->setMessage('invalid', 'Please enter a valid web address (http://google.com)');

I put this in the configure() method of the class for my form, and that’s it!

In Rails I found that this wasn’t so simple; in fact, searching Google I found several examples of people doing it with regular expressions. Although this is not strange at all, I really don’t like memorizing something that long, and while searching I found this other alternative, using validates_format_of:

validates_format_of :url, :with => URI::regexp(%w(http https))

The trick is that the regexp method of URI accepts as parameters the list of protocols we want to validate and generates the corresponding regular expressions from them, so it will validate everything that URI.parse is able to parse.