As I’ve said before, Nutch is a great web crawler, and of course it provides options to configure the crawl process however you want. In this particular case, if you only want to crawl certain files (e.g. HTML pages only, skipping all the images) you can use one of several available mechanisms. Nutch provides a plugin type that lets you filter your crawl based on URLs, and in fact it comes with a few of these plugins right out of the box, essentially the urlfilter-* family under the plugins directory.
So far so good: we can customize our entire crawl in great detail, as long as we can differentiate the resources we really want to grab by their URLs. This works great, no doubt about it, but in my experience I’ve encountered a couple of cases where it isn’t enough.
One problem I’ve encountered very frequently involves crawling RSS feeds. In my case we want to crawl a bunch of sites for pretty much all their URLs (not going into details here), which could include PDFs, DOCs, HTML, and a long etc. So far no issue with this, but what about RSS feeds? Many of these sites expose feeds in the form http://www.awesomesite.com/feed, so we can’t use the suffix plugin to restrict this case. Sure, we could set up a regex with the urlfilter-regex plugin for this one, but what about all the other websites with different URLs? Of course you could do a little research and check your index periodically to spot those cases and manually (or even automatically) add new regular expressions to the corresponding configuration file. We store the MIME type of each URL in our Solr index, so we could run a query to get those documents, perhaps group them by domain, extract the new RSS URLs, delete those documents from the index, add the regular expressions to our Nutch configuration, and then replicate this configuration change over our servers. But since we’re targeting a whole country, I really don’t think this is a good idea; besides, it adds a new moving part to our existing architecture. Even if you’re running Nutch on a Hadoop cluster, or even in semi-distributed mode, you’ll need to perform additional steps. Summarizing: it would be great if you could tell Nutch to only index a document if it matches some predefined MIME type you’re interested in; that said, the reverse case is also desirable, meaning you might want to allow everything except some specific MIME type that doesn’t interest you.
I faced one more problem when trying to adapt Nutch to my needs. In a different project we’re also using Nutch to fetch documents, but only PDF documents this time. These documents are hosted on several websites, and our seed file only contained the initial URL of each website, not the URLs of all those PDF files. So the behavior we wanted from Nutch was to fetch and parse the whole website (including the HTML pages), so the links to the PDF files could be discovered using the same built-in mechanism Nutch already has, but to index only the PDF files and not all the other content (the HTML pages).
So, let’s get down to business. Nutch provides several extension points that allow you to plug in whatever custom logic you need. As a matter of fact, almost everything in Nutch is a plugin. As we said before, it offers several plugins to “play” with the URLs, but in the cases laid out above these were nearly useless.
Taking into account both cases explained above, we didn’t want to stop a feed URL or an HTML page from being fetched or parsed, because in both cases new links could be found; but we did want control over what ends up in our Solr index. For this I wrote an IndexingFilter that lets you specify, in a configuration file, a bunch of regular expressions that are matched against the MIME type of the document before it gets indexed in your backend (Solr in my case). The configuration file looked like this:
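The original file isn’t reproduced here, but given the format described later in the post (a global mode on the first line, followed by regexes treated as exceptions), it would look something like this reconstruction:

```
+
text/html
```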
Basically, here we are saying that we want to allow everything except those documents that have text/html in the MIME type field. As usual, we want to take as much advantage as possible of the heavy lifting that Nutch does for us. The first blessing is that Nutch already extracts the MIME type of each document, so this is a plus and we don’t have to care about it: Tika has done the job for us. Although if you aren’t using any parser at all, you may need to write a parser that figures out the MIME type of the content.
So now we only need to write a custom class that implements the IndexingFilter interface and write the filter method, no? There is one more thing we need to do: if the document being indexed doesn’t fulfill the conditions we’ve established, we need to tell Nutch to skip it. Beforehand you may not know how to do this, but a simple walk through the Nutch source code provides the answer:
If you take a peek at IndexingFilters.java in the Nutch source code (the class that calls all the other indexing filter plugins) you’ll see a comment inside the filter method:
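From memory, the relevant loop looks roughly like this (check the source of your Nutch version for the exact code):

```java
// inside IndexingFilters.filter(...)
for (int i = 0; i < this.indexingFilters.length; i++) {
  doc = this.indexingFilters[i].filter(doc, parse, url, datum, inlinks);
  // break the loop if an indexing filter discards the doc
  if (doc == null) return doc;
}
```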
So the answer presents itself: all we need to do is return null to stop a document from being further processed by the remaining plugins, and also to prevent it from being indexed. The only part remaining is how to get the MIME type inside our class. This is not so hard either: the MIME type should come in the parse data and metadata that Tika extracted from the content, no? So essentially all we need to do is ask for it:
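A sketch of pulling the content type out of the parse metadata, inside our filter method; the exact accessor names can vary across Nutch versions, so treat this as an approximation:

```java
// filter(NutchDocument doc, Parse parse, Text url,
//        CrawlDatum datum, Inlinks inlinks)
String mimeType = parse.getData().getContentMeta()
                       .get(Response.CONTENT_TYPE); // the "Content-Type" value
```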
So far so good, but there are cases where this could be insufficient, for instance when Tika couldn’t detect the MIME type correctly. A workaround for this situation exists, which is essentially to take the MIME type out of the CrawlDatum object itself, accomplished with code like the following:
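Something along these lines (again a sketch; the metadata key used in the CrawlDatum may differ in your Nutch version):

```java
// fall back to the content type recorded in the CrawlDatum metadata
Writable typeValue = datum.getMetaData()
                          .get(new Text(Response.CONTENT_TYPE));
if (typeValue != null) {
  mimeType = typeValue.toString();
}
```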
One last shot, if neither of these approaches works, is to try to extract the MIME type from the URL itself, using a utility class (org.apache.nutch.util.MimeUtil).
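Roughly like this; the MimeUtil method names here are from memory, so verify them against your Nutch version:

```java
import org.apache.nutch.util.MimeUtil;

MimeUtil mimeUtil = new MimeUtil(getConf());
// guesses the type from the file extension in the URL, e.g. ".pdf"
String guessed = mimeUtil.getMimeType(url.toString());
```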
So if you still don’t have the MIME type, don’t give up! Define a default policy: accept or deny the document. One final step would be to normalize the MIME type, so we can rely (and put our minds to rest) on how the MIME type is written. Essentially this is as simple as using the MimeUtil class:
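Presumably via its static cleanup helper (hedged: I recall MimeUtil exposing a cleaner like this, but double-check the exact signature in your version):

```java
// strip parameters etc., e.g. "text/html; charset=UTF-8" -> "text/html"
String normalized = MimeUtil.cleanMimeType(mimeType);
```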
Now that we have the MIME type, we can match it against the rules defined in the configuration file. The full code of this plugin can be found in the GitHub repo; essentially, the code fragments I’ve placed in this post are taken from the source code published there.
One more comment about the format used by the published plugin: the format is pretty easy and very similar to the urlfilter-suffix plugin. You use a + or - to define a global mode (accept or deny, respectively), followed by regular expressions that act as exceptions to the global mode. So the following configuration:
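For instance (reconstructed from the description rather than copied from the repo):

```
-
image
```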
will block all documents except those containing “image” in the MIME type field. And of course, if you change the “-” to a “+”, it will allow everything except image documents.
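To make the matching behavior concrete, here is a minimal, self-contained sketch of that rule logic (a hypothetical helper written for this post, not the actual plugin code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the rule matching: a global mode ('+' accept / '-' deny)
// plus a list of regex exceptions that flip the decision.
class MimeTypeRules {
    private final boolean acceptByDefault;
    private final List<Pattern> exceptions = new ArrayList<>();

    MimeTypeRules(char mode, List<String> regexes) {
        this.acceptByDefault = (mode == '+');
        for (String r : regexes) {
            exceptions.add(Pattern.compile(r));
        }
    }

    // true if a document with this MIME type should be indexed
    boolean allows(String mimeType) {
        for (Pattern p : exceptions) {
            if (p.matcher(mimeType).find()) {
                return !acceptByDefault; // a match flips the global mode
            }
        }
        return acceptByDefault;
    }
}
```

With mode '-' and the single exception "image", allows("image/png") returns true and allows("text/html") returns false, matching the behavior described above.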
So far so good. I hope this helps, and even if you don’t want to code your own plugin, you can download and build the plugin from the GitHub repo and use it.