
University of Cyprus, Department of Computer Science

EPL 660: Lab 6: Introduction to Nutch

Andreas Kamilaris

Overview

• Complete Web search engine
– Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins (e.g. parsing) + MapReduce & Distributed FS (Hadoop)

• Java-based

• Open source

Reasons to run your own search engine

• Transparency: Nutch is open source, so anyone can see how its ranking algorithms work.
– Google allows rankings to be based on payments.
– Nutch can be used by academic and governmental organizations, where fairness of rankings may be very important.

• Understanding: see how a large-scale search engine works.
– Google's source code is not available.

• Extensibility: Nutch can be customized and incorporated into your application.

Nutch in Practice

• Nutch installations typically operate at one of three scales:
– Local filesystem: reliable (no network errors, and caching is unnecessary).
– Intranet-level.
– Whole Web: whole-Web crawling is difficult.

• Many crawling-oriented challenges arise when building a complete Web search engine:
– Which pages do we start with?
– How do we partition the work between a set of crawlers?
– How often do we re-crawl?
– How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content?

Nutch vs Lucene

• Nutch is built on top of Lucene.

• "Should I use Lucene or Nutch?"
– Use Lucene if you don't need a web crawler.
• e.g. you want to make a database searchable (see the sketch below).

• Nutch is a better fit for sites where you don't have direct access to the underlying data, or where it comes from disparate sources.
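To make the "database searchable" case concrete, here is a minimal sketch of indexing records directly with Lucene, assuming a Lucene 3.x-era API (contemporary with Nutch 1.2); the field names and sample values are made up for illustration:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class DbIndexer {
  public static void main(String[] args) throws Exception {
    // Open (or create) an index directory on the local filesystem.
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("db-index")),
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);

    // In a real application each Document would come from a database row.
    Document doc = new Document();
    doc.add(new Field("title", "A sample row",
        Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("body", "Text taken from the database",
        Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);

    writer.close();  // commit; a Lucene searcher can now query db-index
  }
}

No crawler is involved: the application already has direct access to its data, which is exactly the situation where Lucene alone suffices.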

Nutch Architecture

• Nutch = crawler + searcher.

• Crawler: fetches pages and creates an inverted index.

• Searcher: uses the inverted index to answer queries.

• The Crawler and the Searcher are highly decoupled, enabling independent scaling on separate hardware platforms.

Nutch Crawler

• It consists of four main components:
– WebDB
– Segments
– Index
– Crawl tool

Web Database (WebDB)

• A persistent data structure that mirrors the structure and properties of the Web graph being crawled.

• Used only by the crawler (not used during searching).

• The WebDB stores two types of entities:
– Pages: pages on the Web.
– Links: the set of links from one page to other pages.

• In the WebDB's Web graph, the nodes are pages and the edges are links.

Segments

• A segment is a collection of pages fetched and indexed by the crawler in a single run.
– Limited lifespan (named by the date and time created).

• The fetchlist of a segment is the list of URLs for the crawler to fetch.

Index

• Nutch uses Lucene for indexing.

• An inverted index of all of the pages the system has retrieved.
– Each segment has its own index.

• A (global) inverted index is created by merging all of the individual segment indexes.

Crawl tool

• Crawling is a cyclical process:
1. The crawler generates a set of fetchlists from the WebDB.
2. A set of fetchers downloads the content from the Web.
3. The crawler updates the WebDB with new links that were found.
4. The crawler generates a new set of fetchlists (for links that haven't been fetched for a given period, including the new links found in the previous cycle).
5. The cycle repeats.

Steps in a Crawl+Index cycle

1. Create a new WebDB (admin db -create).
2. Inject root URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch content from URLs in the fetchlist (fetch).
5. Update the WebDB with links from fetched pages (updatedb).
6. Repeat steps 3-5 until the required depth is reached.
7. Update segments with scores and links from the WebDB (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
10. Merge the indexes into a single index for searching (merge).
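As a rough sketch, one pass through these steps on the command line looks like the following, using the Nutch 0.7-era tool names from Tom White's tutorial (see References); argument forms changed in later releases such as 1.2, so consult the usage printed by bin/nutch before running:

bin/nutch admin db -create              # 1. create a new WebDB
bin/nutch inject db -urlfile urls       # 2. inject root URLs
bin/nutch generate db segments          # 3. generate a fetchlist segment
s1=`ls -d segments/2* | tail -1`        #    newest segment directory
bin/nutch fetch $s1                     # 4. fetch its content
bin/nutch updatedb db $s1               # 5. update the WebDB; repeat 3-5
bin/nutch updatesegs db segments        # 7. push scores/links to segments
bin/nutch index $s1                     # 8. index the fetched pages
bin/nutch dedup segments dedup.tmp      # 9. eliminate duplicates
bin/nutch merge index segments/*        # 10. merge into a single index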

Nutch as a Crawler

[Diagram: the Crawl tool drives the crawl. The Injector writes the initial URLs into the WebDB; the Generator reads the WebDB and generates a fetchlist in a Segment; the Fetcher gets webpages/files from the Web and writes them to the Segment; the Parser extracts content and links; the WebDB is read/written on each cycle and updated with the newly found links.]

Nutch as a complete Web Search Engine

[Diagram: the Segments, WebDB and LinkDB feed the Indexer (Lucene), which builds the Index; the Searcher (Lucene) answers queries against the Index and serves results to the GUI, running in Tomcat.]

Running a Crawl

• The site we are going to crawl is a tiny intranet site on the server keaton, with pages A, B, C and C-duplicate.

• echo 'http://keaton/tinysite/A.html' > urls
– The urls file contains the root URL from which to populate the initial fetchlist (page A).

• The Crawl tool uses a filter to decide which URLs go into the WebDB.
– Restrict the domain to the server on the intranet (keaton); see the filter sketch below.

• bin/nutch crawl urls -dir crawl-tinysite -depth 3 >& crawl.log
– The Crawl tool uses the root URLs in the urls file to start the crawl.
– The results go to the directory crawl-tinysite.
– The -depth flag tells the crawler how many generate/fetch/update cycles to carry out to get full page coverage.
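A minimal sketch of the filter involved, in the one-regex-per-line format of conf/crawl-urlfilter.txt (the file shipped with your release contains more rules; these two lines are illustrative):

# accept everything on the intranet server keaton...
+^http://keaton/
# ...and skip everything else (e.g. the links out to Wikipedia)
-.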

Examine Results (File System)

• Directories and files created after running the Crawl tool:
– segments (the fetched pages): Crawl created three segments in timestamped subdirectories, and each segment has its own index.
– the WebDB: contains pages A, B, C and C-duplicate; the links to Wikipedia are not in the WebDB, because the URL filter was used.
– the (merged) Lucene index.

Examine Results (Pages & Links)

• [Screenshot: dumping the pages and links stored in the WebDB.]
• Note: the arguments changed into -stats in release 1.2.
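For example, under the release 1.2 layout (where the crawl directory contains a crawldb rather than a WebDB), overall statistics can be printed with the readdb tool; the directory name below assumes the crawl-tinysite output from the earlier slides:

bin/nutch readdb crawl-tinysite/crawldb -stats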

Examine Results (Segments)

• The Crawl tool created three segments in timestamped subdirectories.

• PARSED column:
– Useful when running fetchers with parsing turned off, so that parsing can be run later as a separate process.

• STARTED and FINISHED columns:
– Indicate the times when fetching started and finished.
– Invaluable for bigger crawls, when tracking down why crawling is taking a long time.

• COUNT column:
– Shows the number of fetched pages in the segment.
– E.g. the last segment has two entries, corresponding to pages C and C-duplicate.

• Note: the segment-reading command changed into readseg in release 1.2.
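Under release 1.2 the equivalent segment listing can be produced with readseg, e.g. (assuming the crawl-tinysite output directory used above):

bin/nutch readseg -list -dir crawl-tinysite/segments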

Examine Results (Index & Search)

• Command-line searching through NutchBean:

bin/nutch org.apache.nutch.searcher.NutchBean <keyword>

where keyword is the search term.

• [Screenshot: search results.]
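NutchBean locates the index through the searcher.dir property, whose default value is a directory named crawl; so either run it from a directory whose crawl subdirectory holds the index, or set searcher.dir in conf/nutch-site.xml. An illustrative invocation (the keyword is arbitrary):

# assumes searcher.dir points at crawl-tinysite (or ./crawl holds the index)
bin/nutch org.apache.nutch.searcher.NutchBean nutch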

Examine Results (Index & Search)

• GUI-based searching with Luke.

• Luke is the Lucene Index Toolbox.

• It accesses existing Lucene indexes and allows you to display and modify their contents.

• You can browse by document number, view documents, execute searches, analyze search results, retrieve ranked lists, etc.

Download from: http://code.google.com/p/luke/

Nutch Distributed File System (NDFS)

• NDFS stores the crawl data and indexes.
• Data is divided into blocks.
• Blocks can be copied and replicated.
• Namenode vs Datanodes:
– Datanodes hold and serve blocks.
– Namenode holds the metainfo:
• filename → list of blocks
• block → datanode location

• Datanodes report to the namenode every few seconds.

Nutch & Hadoop

• Hadoop is used in Nutch to manage the data obtained from the crawling process.

• MapReduce is used for indexing, parsing, WebDB construction, and even fetching.

Plugins

• Plugins provide extensions to extension-points.
• Each extension point defines an interface that must be implemented by the extension.

• Some core extension points:
– IndexingFilter: add meta-data to indexed fields.
– Parser: parse a new type of document.
– NutchAnalyzer: language-specific analyzers.
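As a rough illustration of the IndexingFilter extension point, a Nutch 1.x-era filter looks roughly like the sketch below; the interface has changed across releases and the "lab" field is a made-up example, so treat this as a shape rather than a recipe:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class LabIndexingFilter implements IndexingFilter {
  private Configuration conf;

  // Called once per document at indexing time: add a custom field.
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                              CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    doc.add("lab", "epl660");   // hypothetical meta-data field
    return doc;                 // returning null would drop the document
  }

  // Boilerplate required because IndexingFilter is Configurable.
  public Configuration getConf() { return conf; }
  public void setConf(Configuration conf) { this.conf = conf; }
}

A real plugin additionally needs a plugin.xml descriptor declaring the extension, and must be listed in the plugin.includes property before Nutch will load it.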

Get Started with Nutch

1. Download the latest Apache Nutch release (release 1.2) from: http://www.apache.org/dyn/closer.cgi/nutch/
2. Set NUTCH_JAVA_HOME to the root of your JVM installation. (* You also need to set JAVA_HOME for it to work.)
3. Open the conf/nutch-default.xml file, search for http.agent.name, and give it the value "MYNAME Spider".
4. Create a urls file containing a list of root URLs.
5. You can filter the crawling by editing the file conf/crawl-urlfilter.txt, replacing MY.DOMAIN.NAME with the name of the domain you wish to crawl. (* If you don't do this, the crawl will not work!)
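For step 3, http.agent.name is an ordinary Hadoop-style property in conf/nutch-default.xml; after editing, the entry looks like this (with MYNAME replaced by your own name):

<property>
  <name>http.agent.name</name>
  <value>MYNAME Spider</value>
</property>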

Installing in Tomcat

1. You need to put the Nutch war file into your servlet container.
2. Assuming you've unpacked Tomcat as ~/local/tomcat, the Nutch war file may be installed with the commands:

mkdir ~/local/tomcat/webapps/nutch
cp nutch*.war ~/local/tomcat/webapps/nutch/
jar xvf ~/local/tomcat/webapps/nutch/nutch.war
rm nutch-1.1.war

3. The webapp finds its indexes in ./crawl, relative to where you start Tomcat. Start Tomcat using a command like:

~/local/tomcat/bin/catalina.sh start

4. Then visit: http://localhost:8080/nutch/

Crawl Command vs Whole-Web Crawling

• The Crawl command is more appropriate when you intend to crawl up to around one million pages on a handful of Web servers.

• Whole-Web crawling is designed to handle very large crawls, which may take weeks to complete, running on multiple machines.
– More control over the crawl process.
– Incremental crawling.

References

• Nutch Web site: http://nutch.apache.org/
• Nutch Docs: http://lucene.apache.org/nutch/
• Nutch Wiki: http://wiki.apache.org/nutch/ (support, mailing lists, tutorials, presentations)

• Prasad Pingali, CLIA consortium, Nutch Workshop, 2007.

• Tom White, "Introduction to Nutch", java.net: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html