MIME Type filtering with Nutch 1.x

As I’ve sayed before Nutch is a great Web crawler, and of course it provides the options for you to configure the process as you want. In this particular case if you only want to crawl some files (i.e HTML pages only and obviate all the images) you could use one several mechanisms available. Nutch basically provides a type of plugin to allow you to filter your crawl based on the URLs, actually Nutch come with a few of this plugins right out of the box for you to use, essentially the urlfilter-* family under the plugins directory.

Continue reading “MIME Type filtering with Nutch 1.x”

Advertisements