Get Started with the web crawler Apache Nutch 1.x

by Adrian Mejia on February 04, 2012; last updated on February 06, 2012

Apache Nutch is an open source scalable Web crawler written in Java and based on Lucene/Solr for the indexing and search part. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering. [*]

By using Nutch, we can find web page hyperlinks in an automated manner, reduce a lot of maintenance work (for example, checking for broken links), and create a copy of all the visited pages for searching over. That’s where Apache Solr comes in. Solr is an open source full-text search framework; with Solr we can search the pages visited by Nutch. Luckily, integration between Nutch and Solr is pretty straightforward.
Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines. This also permits more control over the crawl process, and incremental crawling. It is important to note that whole-web crawling does not necessarily mean crawling the entire World Wide Web. We can limit a whole-web crawl to just a list of the URLs we want to crawl. This is done by using a filter just like the one we used when we did the crawl command. [*]
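As a rough illustration of how such a filter narrows a crawl, the sketch below mimics a single accept rule of the kind found in conf/regex-urlfilter.txt, using grep -E. The domain example.org, the sample URLs and the helper name filter_urls are all hypothetical, and grep only approximates the Java regular expressions Nutch itself uses.

```shell
#!/bin/sh
# Hypothetical accept pattern, in the spirit of a "+" rule in regex-urlfilter.txt.
# example.org is a placeholder domain, not one taken from this article.
ACCEPT='^http://([a-z0-9]*\.)*example\.org/'

# Print only the URLs (one per line on stdin) that the pattern accepts.
filter_urls() {
    grep -E "$ACCEPT"
}

# Sample input: two URLs inside the domain, one outside it.
kept=$(printf '%s\n' \
    'http://example.org/wiki/Main_Page' \
    'http://en.example.org/wiki/Nutch' \
    'http://other.example.com/page' | filter_urls)

echo "$kept"
```

Only the two example.org URLs survive the filter, which is the same effect a domain-restricted rule has on the links Nutch is allowed to follow.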
Some of the advantages of Nutch, when compared to a simple fetcher:
  • highly scalable and relatively feature rich crawler
  • features like politeness which obeys robots.txt rules
  • robust and scalable - you can run Nutch on a cluster of 100 machines
  • quality - you can bias the crawling to fetch “important” pages first

Basics about Nutch

First, you need to know that Nutch data is composed of:

  • The crawl database, or crawldb. This contains information about every url known to Nutch, including whether it was fetched, and, if so, when.
  • The link database, or linkdb. This contains the list of known links to each url, including both the source url and anchor text of the link.
  • A set of segments. Each segment is a set of urls that are fetched as a unit. Segments are directories with the following subdirectories:
  1. crawl_generate names a set of urls to be fetched
  2. crawl_fetch contains the status of fetching each url
  3. content contains the raw content retrieved from each url
  4. parse_text contains the parsed text of each url
  5. parse_data contains outlinks and metadata parsed from each url
  6. crawl_parse contains the outlink urls, used to update the crawldb
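
The segment layout above can be checked with a few lines of shell. This is a minimal sketch and not part of Nutch: the helper name missing_parts and the throwaway demo directory are made up for illustration, and the demo simply creates three of the six subdirectories to imitate a partially populated segment.

```shell
#!/bin/sh
# The six subdirectories a fully processed segment is described as having.
SEGMENT_PARTS="crawl_generate crawl_fetch content parse_text parse_data crawl_parse"

# Print the name of each expected subdirectory missing from segment dir $1.
missing_parts() {
    for part in $SEGMENT_PARTS; do
        [ -d "$1/$part" ] || echo "$part"
    done
}

# Demo on a throwaway directory imitating a partially populated segment.
# The timestamp-style name follows the segment name used later in this article.
demo=$(mktemp -d)
seg="$demo/segments/20120204004509"
mkdir -p "$seg/crawl_generate" "$seg/crawl_fetch" "$seg/content"

missing=$(missing_parts "$seg")
echo "$missing"
rm -rf "$demo"
```

Pointing missing_parts at a real segment directory under your crawl folder should print nothing once the segment has gone through every step.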

Nutch and Hadoop

As of the official Nutch 1.3 release, the source code architecture has been greatly simplified to allow us to run Nutch in one of two modes: local and deploy. By default, Nutch no longer comes with a Hadoop distribution; however, when run in local mode (i.e. running Nutch in a single process on one machine), it still uses Hadoop as a dependency. This may suit you fine if you have a small site to crawl and index, but most people choose Nutch because of its capability to run in deploy mode, within a Hadoop cluster. This gives you the benefit of a distributed file system (HDFS) and MapReduce-style processing. If you are interested in deploy mode, read here.

Getting your hands dirty with Nutch

1. Set up Nutch from the binary distribution

  1. Unzip your binary Nutch package to $HOME/nutch-1.3
  2. cd $HOME/nutch-1.3/runtime/local
  3. From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory.
2. Verify your Nutch installation
  1. run "bin/nutch"
  2. You can confirm a correct installation if you see the following:  Usage: nutch [-core] COMMAND
Some troubleshooting tips:
Run the following command if you are seeing "Permission denied":
chmod +x bin/nutch
Set JAVA_HOME if you are seeing "JAVA_HOME not set". You can run the command that matches your platform, or add it to ~/.bashrc:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home #mac
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk #linux
export NUTCH_HOME=/var/www/nutch-1.3/runtime/local
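
The two troubleshooting tips above can be rolled into a quick pre-flight check. The sketch below is only an illustration: check_nutch_env is a made-up helper, the demo runs against a throwaway directory with an empty stand-in for bin/nutch, and /placeholder/jdk is a dummy value rather than a real JDK path.

```shell
#!/bin/sh
# Report the two problems covered by the tips above for a Nutch runtime dir $1.
# Prints one line per problem and nothing when both checks pass.
check_nutch_env() {
    runtime_dir=$1
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
    fi
    if [ ! -x "$runtime_dir/bin/nutch" ]; then
        echo "bin/nutch is missing or not executable (try: chmod +x bin/nutch)"
    fi
}

# Demo against a throwaway directory holding an empty, non-executable stand-in.
demo=$(mktemp -d)
mkdir -p "$demo/bin"
: > "$demo/bin/nutch"
chmod -x "$demo/bin/nutch"

# First run reports the permission problem; after chmod +x nothing is reported.
before=$(JAVA_HOME=/placeholder/jdk check_nutch_env "$demo")
chmod +x "$demo/bin/nutch"
after=$(JAVA_HOME=/placeholder/jdk check_nutch_env "$demo")

echo "before: $before"
echo "after: $after"
rm -rf "$demo"
```

Against a real installation you would call check_nutch_env with your own runtime/local directory instead of the demo directory.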
Example of using Nutch to crawl wikipedia pages:
Here we try to crawl a page and its sublinks in the same domain.
  1. $ cd $NUTCH_HOME (which already points at runtime/local, as exported above)
  2. $ echo "" > urls
  3. add: `+^http://([a-z0-9]*\.)*` in conf/regex-urlfilter.txt
  4. $ bin/nutch crawl urls -dir crawl-wiki-ci -depth 2
  5. Statistics associated with the crawldb
    1. $ nutch readdb crawl-wiki-ci/crawldb/ -stats
      1. CrawlDb statistics start: crawl-wiki-ci/crawldb/
        Statistics for CrawlDb: crawl-wiki-ci/crawldb/
        TOTAL urls:     2727
        retry 0:     2727
        min score:     0.0
        avg score:     8.107811E-4
        max score:     1.341
        status 1 (db_unfetched):     2665
        status 2 (db_fetched):     61
        status 3 (db_gone):     1
        CrawlDb statistics: done
  6. Dump of the URLs from the crawldb
    1. $ nutch readdb crawl-wiki-ci/crawldb/ -dump crawl-wiki-ci/stats
      1. Version: 7
        Status: 1 (db_unfetched)
        Fetch time: Sat Feb 04 00:50:50 EST 2012
        Modified time: Wed Dec 31 19:00:00 EST 1969
        Retries since fetch: 0
        Retry interval: 2592000 seconds (30 days)
        Score: 1.9607843E-4
        Signature: null
  7. Top 10 highest-scoring links
    1. $ nutch readdb crawl-wiki-ci/crawldb/ -topN 10 crawl-wiki-ci/stats/top10/
      1. 1.3416613
  8. Dump of a Nutch segment
    1. $ nutch readseg -dump crawl-wiki-ci/segments/20120204004509/ crawl-wiki-ci/stats/segments
      1. CrawlDatum::
        Version: 7
        Status: 1 (db_unfetched)
        Fetch time: Sat Feb 04 00:45:03 EST 2012
        Modified time: Wed Dec 31 19:00:00 EST 1969
        Retries since fetch: 0
        Retry interval: 2592000 seconds (30 days)
        Score: 1.0
        Signature: null
        Metadata: _ngt_: 1328334307529

      2. Content::
        Version: -1
        contentType: application/xhtml+xml
        metadata: Content-Language=en Age=52614 Content-Length=29341 Last-Modified=Sat, 28 Jan 2012 17:27:22 GMT _fst_=33 Connection=close X-Cache-Lookup=MISS from Server=Apache X-Cache=MISS from X-Content-Type-Options=nosniff Cache-Control=private, s-maxage=0, max-age=0, must-revalidate Vary=Accept-Encoding,Cookie Date=Fri, 03 Feb 2012 15:08:18 GMT Content-Encoding=gzip nutch.crawl.score=1.0 Content-Type=text/html; charset=UTF-8
        <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">
        <html lang="en" dir="ltr" class="client-nojs" xmlns="">
        <title>Collective intelligence - Wikipedia, the free encyclopedia</title>
        <meta …. 
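
To get a feel for how far a crawl has progressed, the readdb -stats figures shown earlier can be reduced to a single line. The sketch below is one possible way to do that with awk; fetched_share is a made-up helper name, and the sample input is copied from the statistics printed in step 5.

```shell
#!/bin/sh
# Sample input: the relevant -stats lines printed earlier in this article.
stats='TOTAL urls:     2727
status 1 (db_unfetched):     2665
status 2 (db_fetched):     61
status 3 (db_gone):     1'

# Read "readdb -stats" style lines on stdin and print the fetched and total
# counts plus the fetched share as a percentage.
fetched_share() {
    awk '
        /TOTAL urls:/    { total = $NF }
        /\(db_fetched\)/ { fetched = $NF }
        END {
            if (total > 0)
                printf "%d of %d urls fetched (%.1f%%)\n", fetched, total, 100 * fetched / total
        }'
}

summary=$(printf '%s\n' "$stats" | fetched_share)
echo "$summary"
```

Assuming the statistics are written to standard output on your setup, the same function could be fed directly from the command in step 5, for example by piping its output into fetched_share. With a depth of 2 only a small share of the known urls has been fetched, which is why most entries still show db_unfetched.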




Tags: how-to, apache, nutch, search engines, web crawlers
