Get Started with the web crawler Apache Nutch 1.x

Apache Nutch is an open source scalable Web crawler written in Java and based on Lucene/Solr for the indexing and search part. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering. [*]

Motivation

With Nutch, we can discover web page hyperlinks automatically, cut down on maintenance work such as checking for broken links, and create a copy of every visited page to search over. That’s where Apache Solr comes in. Solr is an open source full-text search framework; with it, we can search the pages that Nutch has visited. Luckily, integration between Nutch and Solr is pretty straightforward.

Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines. It also permits more control over the crawl process, and incremental crawling. It is important to note that whole-web crawling does not necessarily mean crawling the entire World Wide Web. We can limit a whole-web crawl to just a list of the URLs we want to crawl. This is done with a URL filter, just like the one used with the crawl command later in this post. [*]
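To get a feel for how such a filter limits a crawl, here is a minimal sketch that checks URLs against the accept pattern used later in this post. It uses grep -E as a stand-in for Nutch's Java regex matching, which is only an approximation, and the function name url_accepted is hypothetical. In regex-urlfilter.txt, a leading "+" marks an accept rule and a leading "-" a reject rule; the first matching rule decides, and a URL that matches no rule is ignored.

```shell
#!/usr/bin/env bash
# Pattern copied from the regex-urlfilter.txt rule used later in this post,
# without the leading "+" that marks it as an accept rule.
FILTER_PATTERN='^http://([a-z0-9]*\.)*wikipedia.org/'

# url_accepted URL: succeeds when URL matches the accept pattern.
# Assumption: grep -E behaves like Nutch's regex engine for this simple pattern.
url_accepted() {
  printf '%s\n' "$1" | grep -Eq "$FILTER_PATTERN"
}

# Try one URL inside the domain and one outside it.
for url in \
  "http://en.wikipedia.org/wiki/Collective_intelligence" \
  "http://www.example.com/page"; do
  if url_accepted "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Checking a pattern this way before starting a long crawl can save a wasted run, since a filter that is too strict leaves the crawler with nothing to fetch.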

Some of the advantages of Nutch, when compared to a simple fetcher:

highly scalable and relatively feature-rich crawler
politeness features, such as obeying robots.txt rules
robust and scalable - you can run Nutch on a cluster of 100 machines
quality - you can bias the crawling to fetch “important” pages first

Basics about Nutch

First, you need to know that Nutch data is composed of:

The crawl database, or crawldb. This contains information about every url known to Nutch, including whether it was fetched, and, if so, when.
The link database, or linkdb. This contains the list of known links to each url, including both the source url and anchor text of the link.
A set of segments. Each segment is a set of urls that are fetched as a unit. Segments are directories with the following subdirectories:

crawl_generate names a set of urls to be fetched
crawl_fetch contains the status of fetching each url
content contains the raw content retrieved from each url
parse_text contains the parsed text of each url
parse_data contains outlinks and metadata parsed from each url
crawl_parse contains the outlink urls, used to update the crawldb
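As a quick way to see how far a segment has progressed, the following sketch lists which of the six subdirectories above are missing from a segment directory. The function name missing_segment_parts is hypothetical and not part of Nutch.

```shell
#!/usr/bin/env bash
# The six subdirectories listed above for a fetched and parsed segment.
SEGMENT_PARTS="crawl_generate crawl_fetch content parse_text parse_data crawl_parse"

# missing_segment_parts SEGMENT_DIR: prints the name of each expected
# subdirectory that does not exist; prints nothing for a complete segment.
missing_segment_parts() {
  local segment="$1" part
  for part in $SEGMENT_PARTS; do
    [ -d "$segment/$part" ] || echo "$part"
  done
}

# Example call (the segment path is the one used later in this post):
#   missing_segment_parts crawl-wiki-ci/segments/20120204004509
```

A segment that has been generated but not yet fetched or parsed would be expected to report most of these names as missing.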

Nutch and Hadoop

As of the official Nutch 1.3 release, the source code architecture has been greatly simplified to allow us to run Nutch in one of two modes: local and deploy. By default, Nutch no longer comes with a Hadoop distribution; however, when run in local mode, i.e. running Nutch in a single process on one machine, Hadoop is used as a dependency. This may suit you fine if you have a small site to crawl and index, but most people choose Nutch because of its capability to run in deploy mode, within a Hadoop cluster. This gives you the benefit of a distributed file system (HDFS) and MapReduce processing style. If you are interested in deploy mode, read here.

Getting your hands dirty with Nutch

Set up Nutch from a binary distribution

Unzip your binary Nutch package to $HOME/nutch-1.3
cd $HOME/nutch-1.3/runtime/local
From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory.

Verify your Nutch installation

Run "bin/nutch"
The installation is correct if you see the following: Usage: nutch [-core] COMMAND

Some troubleshooting tips:

Run the following command if you are seeing "Permission denied":

chmod +x bin/nutch

Set JAVA_HOME if you see "JAVA_HOME not set". On a Mac, you can run the following command or add it to ~/.bashrc:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home #mac

Ubuntu:

export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk

export NUTCH_HOME=/var/www/nutch-1.3/runtime/local
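Java install paths differ between machines, so it can help to verify the variable before running Nutch. This is a small sketch, not part of Nutch itself; the function name check_java_home is hypothetical.

```shell
#!/usr/bin/env bash
# check_java_home: succeeds when JAVA_HOME is set and contains an executable
# bin/java; otherwise prints a hint to stderr and fails.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "No executable found at $JAVA_HOME/bin/java" >&2
    return 1
  fi
  echo "JAVA_HOME looks usable: $JAVA_HOME"
}

# Example call:
#   check_java_home && bin/nutch
```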

Example of using Nutch to crawl Wikipedia pages:

Here we try to crawl http://en.wikipedia.org/wiki/Collective_intelligence and its sublinks in the same domain.

$ cd $NUTCH_HOME
$ echo "http://en.wikipedia.org/wiki/Collective_intelligence" > urls
Add `+^http://([a-z0-9]*\.)*wikipedia.org/` to conf/regex-urlfilter.txt
$ bin/nutch crawl urls -dir crawl-wiki-ci -depth 2
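The one-shot crawl command bundles several steps. As a rough sketch, assuming the step-by-step commands described in the Nutch tutorial (inject, generate, fetch, parse, updatedb, invertlinks), one round of the cycle could be scripted as follows. The function name crawl_round, its arguments, and the NUTCH variable are hypothetical conveniences for this example, not part of Nutch.

```shell
#!/usr/bin/env bash
# Path to the nutch launcher; override NUTCH to point elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# crawl_round CRAWL_DIR SEED_DIR: runs one inject/generate/fetch/parse/update
# round and then rebuilds the link database. Stops at the first failing step.
crawl_round() {
  local crawl_dir="$1" seed_dir="$2" segment
  # Add the seed URLs to the crawldb.
  "$NUTCH" inject "$crawl_dir/crawldb" "$seed_dir" || return 1
  # Select URLs due for fetching; this creates a new timestamped segment.
  "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1
  # Pick the newest segment (names are timestamps, so they sort by time).
  segment=$(ls -d "$crawl_dir"/segments/2* | tail -n 1)
  "$NUTCH" fetch "$segment" || return 1
  "$NUTCH" parse "$segment" || return 1
  # Feed fetch results and newly found outlinks back into the crawldb.
  "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment" || return 1
  # Rebuild the linkdb from all segments.
  "$NUTCH" invertlinks "$crawl_dir/linkdb" -dir "$crawl_dir/segments" || return 1
}

# Example call, roughly one level of depth per round:
#   crawl_round crawl-wiki-ci urls
```

Running the steps separately is what gives the extra control mentioned earlier: each round can be inspected, repeated, or stopped before the next one starts.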
Statistics associated with the crawldb:
$ bin/nutch readdb crawl-wiki-ci/crawldb/ -stats
    CrawlDb statistics start: crawl-wiki-ci/crawldb/
    Statistics for CrawlDb: crawl-wiki-ci/crawldb/
    TOTAL urls:     2727
    retry 0:     2727
    min score:     0.0
    avg score:     8.107811E-4
    max score:     1.341
    status 1 (db_unfetched):     2665
    status 2 (db_fetched):     61
    status 3 (db_gone):     1
    CrawlDb statistics: done
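The statistics show that most known URLs are still unfetched after a depth-2 crawl. A small sketch like the one below can turn that output into a single number. It assumes the output layout shown above, with the count as the last field on each line; the function name fetched_percent is hypothetical.

```shell
#!/usr/bin/env bash
# fetched_percent: reads readdb -stats output on stdin and prints the share
# of known URLs with status db_fetched, as a percentage with one decimal.
fetched_percent() {
  awk '
    /TOTAL urls:/      { total = $NF }     # last field is the count
    /\(db_fetched\):/  { fetched = $NF }
    END { if (total > 0) printf "%.1f\n", 100 * fetched / total }
  '
}

# Sample input copied from the statistics shown above.
fetched_percent <<'EOF'
    TOTAL urls:     2727
    status 1 (db_unfetched):     2665
    status 2 (db_fetched):     61
EOF
# prints 2.2
```

In a live setup the function could read from the command directly, for example by piping the readdb -stats output into it.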
Dump of the URLs from the crawldb:
$ bin/nutch readdb crawl-wiki-ci/crawldb/ -dump crawl-wiki-ci/stats
    http://en.wikipedia.org/wiki/Special:RecentChangesLinked/MIT_Center_for_Collective_Intelligence Version: 7
    Status: 1 (db_unfetched)
    Fetch time: Sat Feb 04 00:50:50 EST 2012
    Modified time: Wed Dec 31 19:00:00 EST 1969
    Retries since fetch: 0
    Retry interval: 2592000 seconds (30 days)
    Score: 1.9607843E-4
    Signature: null
    Metadata:
    ….
Top 10 highest-scoring links:
$ bin/nutch readdb crawl-wiki-ci/crawldb/ -topN 10 crawl-wiki-ci/stats/top10/
    1.3416613     http://en.wikipedia.org/wiki/Collective_intelligence
    0.030499997     http://en.wikipedia.org/wiki/Howard_Bloom
    0.02763889     http://en.wikipedia.org/wiki/Groupthink
    0.02591739     http://en.wikipedia.org/wiki/Wikipedia
    0.024347823     http://en.wikipedia.org/wiki/Pierre_L%C3%A9vy_(philosopher)
    0.023733648     http://en.wikipedia.org/wiki/Wikipedia:Citation_needed
    0.017142152     http://en.wikipedia.org/w/opensearch_desc.php
    0.016599996     http://en.wikipedia.org/wiki/Artificial_intelligence
    0.016499996     http://en.wikipedia.org/wiki/Consensus_decision_making
    0.015199998     http://en.wikipedia.org/wiki/Group_selection
Dump of a Nutch segment:
$ bin/nutch readseg -dump crawl-wiki-ci/segments/20120204004509/ crawl-wiki-ci/stats/segments
    CrawlDatum::
    Version: 7
    Status: 1 (db_unfetched)
    Fetch time: Sat Feb 04 00:45:03 EST 2012
    Modified time: Wed Dec 31 19:00:00 EST 1969
    Retries since fetch: 0
    Retry interval: 2592000 seconds (30 days)
    Score: 1.0
    Signature: null
    Metadata: _ngt_: 1328334307529
    Content::
    Version: -1
    url: http://en.wikipedia.org/wiki/Collective_intelligence
    base: http://en.wikipedia.org/wiki/Collective_intelligence
    contentType: application/xhtml+xml
    metadata: Content-Language=en Age=52614 Content-Length=29341 Last-Modified=Sat, 28 Jan 2012 17:27:22 GMT _fst_=33 nutch.segment.name=20120204004509 Connection=close X-Cache-Lookup=MISS from sq72.wikimedia.org:80 Server=Apache X-Cache=MISS from sq72.wikimedia.org X-Content-Type-Options=nosniff Cache-Control=private, s-maxage=0, max-age=0, must-revalidate Vary=Accept-Encoding,Cookie Date=Fri, 03 Feb 2012 15:08:18 GMT Content-Encoding=gzip nutch.crawl.score=1.0 Content-Type=text/html; charset=UTF-8
    Content:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html lang="en" dir="ltr" class="client-nojs" xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>Collective intelligence - Wikipedia, the free encyclopedia</title>
    <meta ….

References:

http://wiki.apache.org/nutch/NutchTutorial
http://en.wikipedia.org/wiki/Nutch
