EPL 660: Lab 6 Introduction to Nutch - CS-UCY
University of Cyprus, Department of Computer Science
EPL 660: Lab 6 – Introduction to Nutch
Andreas Kamilaris
Overview
• Complete Web search engine
– Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins (e.g. parsing) + MapReduce & Distributed FS (Hadoop)
• Java-based
• Open source
Reasons to run your own search engine
• Transparency: Nutch is open source, so anyone can see how its ranking algorithms work.
– Google allows rankings to be influenced by payments.
– Nutch can be used by academic and governmental organizations, where fairness of rankings may be very important.
• Understanding: see how a large-scale search engine works.
– Google's source code is not available.
• Extensibility: Nutch can be customized and incorporated into your own application.
Nutch in Practice
• Nutch installations typically operate at one of three scales:
– Local filesystem: reliable (no network errors, caching is unnecessary).
– Intranet.
– Whole Web: whole-Web crawling is difficult.
• Many crawling-oriented challenges arise when building a complete Web search engine:
– Which pages do we start with?
– How do we partition the work between a set of crawlers?
– How often do we re-crawl?
– How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content?
Nutch vs. Lucene
• Nutch is built on top of Lucene.
• "Should I use Lucene or Nutch?"
– Use Lucene if you don't need a web crawler, e.g. you want to make a database searchable.
– Nutch is a better fit for sites where you don't have direct access to the underlying data, or where it comes from disparate sources.
Nutch Architecture
• Nutch = crawler + searcher.
• Crawler: fetches pages and creates an inverted index.
• Searcher: uses the inverted index to answer queries.
• Crawler and searcher are highly decoupled, enabling independent scaling on separate hardware platforms.
Nutch Crawler
• It consists of four main components:
– WebDB
– Segments
– Index
– Crawl tool
Web Database (WebDB)
• Persistent data structure mirroring the structure and properties of the Web graph being crawled.
• Used only by the crawler (not used during searching).
• The WebDB stores two types of entities:
– Pages: pages on the Web.
– Links: the set of links from one page to other pages.
• In the WebDB's Web graph, the nodes are pages and the edges are links.
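The page/link structure above can be pictured with a toy in-memory sketch. This is not Nutch's actual API (the real WebDB is a persistent on-disk structure); the class and field names here are illustrative only:

```python
# Illustrative sketch of a WebDB-like structure: pages are nodes,
# links are directed edges. Names are hypothetical, not Nutch's API.

class WebDB:
    def __init__(self):
        self.pages = {}     # url -> page metadata (e.g. fetch status)
        self.links = set()  # (from_url, to_url) directed edges

    def add_page(self, url, **meta):
        self.pages.setdefault(url, {}).update(meta)

    def add_link(self, from_url, to_url):
        # Discovering a link also registers the target as known-but-unfetched.
        self.links.add((from_url, to_url))
        self.pages.setdefault(to_url, {"fetched": False})

    def outlinks(self, url):
        return [t for (f, t) in self.links if f == url]

db = WebDB()
db.add_page("http://keaton/tinysite/A.html", fetched=True)
db.add_link("http://keaton/tinysite/A.html", "http://keaton/tinysite/B.html")
print(db.outlinks("http://keaton/tinysite/A.html"))
```

Keeping pages and links as separate entity types, as Nutch does, makes it cheap to answer both "what do we know about this page?" and "what does this page point at?".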
Segments
• A segment is a collection of pages fetched and indexed by the crawler in a single run.
– Limited lifespan (named by the date and time of creation).
• The fetchlist of a segment is the list of URLs for the crawler to fetch.
Index
• Nutch uses Lucene for indexing.
• Inverted index of all of the pages the system has retrieved.
– Each segment has its own index.
• A (global) inverted index is created by merging all the individual segment indexes.
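The merge idea can be sketched with plain dictionaries. In real Nutch the per-segment indexes are Lucene indexes and the merge is done by Lucene, not by hand; this is only a conceptual model:

```python
# Conceptual sketch: each segment index maps term -> set of documents;
# the global index is the union of the per-segment postings.
from collections import defaultdict

def merge_indexes(segment_indexes):
    merged = defaultdict(set)
    for index in segment_indexes:
        for term, docs in index.items():
            merged[term] |= docs  # union the posting sets
    return merged

seg1 = {"nutch": {"A.html"}, "lucene": {"A.html", "B.html"}}
seg2 = {"nutch": {"C.html"}}
global_index = merge_indexes([seg1, seg2])
print(sorted(global_index["nutch"]))  # ['A.html', 'C.html']
```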
Crawl tool
• Crawling is a cyclical process:
1. The crawler generates a set of fetchlists from the WebDB.
2. A set of fetchers downloads the content from the Web.
3. The crawler updates the WebDB with new links that were found.
4. The crawler generates a new set of fetchlists (for links that haven't been fetched for a given period, including the new links found in the previous cycle).
5. The cycle repeats.
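The five steps above can be sketched as a toy generate/fetch/update loop, using dicts and sets in place of Nutch's on-disk WebDB and segments (function names here are illustrative, not Nutch's):

```python
# Toy generate/fetch/update cycle. webdb maps url -> set of discovered
# outlinks, with None meaning "known but not yet fetched".

def crawl(webdb, fetch_page, depth):
    for _ in range(depth):
        # 1. Generate a fetchlist from the WebDB (unfetched URLs).
        fetchlist = [u for u, links in webdb.items() if links is None]
        if not fetchlist:
            break
        for url in fetchlist:
            # 2. A fetcher downloads the content.
            outlinks = fetch_page(url)
            # 3. Update the WebDB with the new links that were found.
            webdb[url] = set(outlinks)
            for link in outlinks:
                webdb.setdefault(link, None)
        # 4./5. The next iteration generates a new fetchlist and repeats.

graph = {"A": ["B", "C"], "B": ["C"], "C": []}
webdb = {"A": None}
crawl(webdb, lambda u: graph.get(u, []), depth=3)
print(sorted(webdb))  # ['A', 'B', 'C']
```

Note how `depth` bounds the number of cycles, matching the `-depth` flag of the Crawl tool used later in this lab.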
Steps in a Crawl+Index cycle
1. Create a new WebDB (admin db -create).
2. Inject root URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch content from URLs in the fetchlist (fetch).
5. Update the WebDB with links from fetched pages (updatedb).
6. Repeat steps 3–5 until the required depth is reached.
7. Update segments with scores and links from the WebDB (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
10. Merge the indexes into a single index for searching (merge).
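Step 9 (dedup) is worth a small illustration. A common way to detect duplicate content is to hash each page's content and drop repeats; this sketch does that in memory, whereas Nutch's real dedup operates on the Lucene indexes:

```python
# Conceptual dedup: keep the first occurrence of each URL and of each
# content hash, drop the rest. In-memory sketch, not Nutch's dedup tool.
import hashlib

def dedup(entries):
    """entries: list of (url, content) pairs gathered from all segments."""
    seen_urls, seen_hashes, kept = set(), set(), []
    for url, content in entries:
        digest = hashlib.md5(content.encode()).hexdigest()
        if url in seen_urls or digest in seen_hashes:
            continue  # duplicate URL or duplicate content
        seen_urls.add(url)
        seen_hashes.add(digest)
        kept.append((url, content))
    return kept

pages = [("C.html", "same text"), ("C-duplicate.html", "same text")]
print(dedup(pages))  # only C.html survives
```

This mirrors what you will see in the tiny crawl later: page C and C-duplicate carry the same content, so only one copy should remain in the searchable index.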
Nutch as a Crawler
[Diagram: the Crawl tool drives the cycle. The Injector injects the initial URLs into the WebDB; the Generator reads the WebDB to generate a fetchlist in a Segment; the Fetcher gets webpages/files from the Web; the Parser parses them into the Segment; newly found links update the WebDB (read/write throughout).]
Nutch as a complete Web Search Engine
[Diagram: the Segments, WebDB, and LinkDB feed the Indexer (Lucene), which builds the Index; the Searcher (Lucene) queries the Index and serves results through a GUI running in Tomcat.]
Running a Crawl
• The site structure of the site we are going to crawl: [diagram not reproduced in this transcript]
• echo 'http://keaton/tinysite/A.html' > urls
– The urls file contains the root URL from which to populate the initial fetchlist (page A).
• The Crawl tool uses a filter to decide which URLs go into the WebDB.
– Restrict the domain to the server on the intranet (keaton).
• bin/nutch crawl urls -dir crawl-tinysite -depth 3 >& crawl.log
– The Crawl tool uses the root URLs in the urls file to start the crawl.
– The results go to the directory crawl-tinysite.
– The -depth flag tells the crawler how many generate/fetch/update cycles to carry out to get full page coverage.
Examine Results (File System)
• Directories and files created after running the Crawl tool:
– segments (pages): the crawl created three segments in timestamped subdirectories; each segment has its own index.
– WebDB
– Lucene index
• The crawl fetched pages A, B, C, and C-duplicate; links to Wikipedia are not in the WebDB (the filter was used).
Examine results (Segments)
• The Crawl tool created three segments in timestamped subdirectories.
• PARSED column
– Useful when running fetchers with parsing turned off, to be run later as a separate process.
• STARTED and FINISHED columns
– Indicate the times when fetching started and finished.
– Invaluable for bigger crawls, when tracking down why crawling is taking a long time.
• COUNT column
– Shows the number of fetched pages in the segment.
– E.g. the last segment has two entries, corresponding to pages C and C-duplicate.
• Note: the segment-reading command was changed to readseg in release 1.2.
Examine results (Index & Search)
• Command-line searching through NutchBean:
bin/nutch org.apache.nutch.searcher.NutchBean <keyword>
where keyword is the search term; the matching search results are printed.
Examine results (Index & Search)
• GUI-based searching with Luke.
• Luke is the Lucene Index Toolbox.
• It accesses existing Lucene indexes and lets you display and modify their contents.
• You can browse by document number, view documents, execute searches, analyze search results, retrieve ranked lists, etc.
• Download from: http://code.google.com/p/luke/
Nutch Distributed File System (NDFS)
• NDFS stores the crawl data and indexes.
• Data is divided into blocks.
• Blocks can be copied and replicated.
• Namenode vs. Datanodes:
– Datanodes hold and serve blocks.
– Namenode holds metainfo:
• Filename → block list
• Block → datanode location
• Datanodes report to the namenode every few seconds.
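The two metadata mappings the namenode keeps can be sketched as plain tables. All file, block, and node names below are made up for illustration; the real namenode tracks this on disk and in memory, not in Python dicts:

```python
# Namenode metadata sketch: filenames map to ordered block lists, and
# each block maps to the datanodes holding a replica. Names are invented.

file_blocks = {                      # filename -> ordered block list
    "crawl/segments/part-0": ["blk_1", "blk_2"],
}
block_locations = {                  # block -> datanodes with a replica
    "blk_1": ["datanode-a", "datanode-b"],   # replicated on two nodes
    "blk_2": ["datanode-b"],
}

def locate(filename):
    """Which datanodes serve each block of this file, in order?"""
    return [(blk, block_locations[blk]) for blk in file_blocks[filename]]

print(locate("crawl/segments/part-0"))
```

Because the namenode holds only this metadata while datanodes hold the bytes, a client asks the namenode once for locations and then streams block data directly from the datanodes.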
Nutch & Hadoop
• Hadoop is used in Nutch to manage the data obtained from the crawling process.
• MapReduce is used for indexing, parsing, WebDB construction, even fetching.
Plugins
• Provide extensions to extension-points.
• Each extension point defines an interface that must be implemented by an extension.
• Some core extension points:
– IndexingFilter: add metadata to indexed fields.
– Parser: parse a new type of document.
– NutchAnalyzer: language-specific analyzers.
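The extension-point pattern can be sketched as follows. Nutch's real plugin system is Java with XML plugin descriptors; this Python sketch only shows the shape of the idea, and the `SiteFilter` class and its field names are invented for illustration:

```python
# Extension-point pattern sketch: an abstract interface (the extension
# point) plus concrete implementations (the extensions). Not Nutch's API.
from abc import ABC, abstractmethod

class IndexingFilter(ABC):          # the extension point: an interface
    @abstractmethod
    def filter(self, doc: dict, url: str) -> dict:
        """Add metadata fields to a document before it is indexed."""

class SiteFilter(IndexingFilter):   # one extension implementing it
    def filter(self, doc, url):
        doc["site"] = url.split("/")[2]  # naive host extraction
        return doc

filters = [SiteFilter()]            # registry of active extensions
doc = {"title": "Page A"}
for f in filters:
    doc = f.filter(doc, "http://keaton/tinysite/A.html")
print(doc["site"])  # keaton
```

The host application iterates over whatever extensions are registered, so new behaviour is added by dropping in a new implementation rather than editing the core.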
Get Started with Nutch
1. Download the latest Apache Nutch release (release 1.2) from: http://www.apache.org/dyn/closer.cgi/nutch/
2. Set NUTCH_JAVA_HOME to the root of your JVM installation. (* You also need to set JAVA_HOME for it to work.)
3. Open the conf/nutch-default.xml file, search for http.agent.name and give it the value "MYNAME Spider".
4. Create a urls file containing a list of root URLs.
5. You can filter the crawl by editing the file conf/crawl-urlfilter.txt, replacing MY.DOMAIN.NAME with the name of the domain you wish to crawl. (* If you don't do this, the crawl will not work!)
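For reference, the line to edit in conf/crawl-urlfilter.txt looks like the following in the standard Nutch tutorial; check your own release's file, as the exact pattern may differ:

```
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
```

Lines starting with + accept matching URLs and lines starting with - reject them, so only URLs on your domain survive the filter.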
Installing in Tomcat
1. You need to put the Nutch war file into your servlet container.
2. Assuming you've unpacked Tomcat as ~/local/tomcat, the Nutch war file may be installed with the commands:
mkdir ~/local/tomcat/webapps/nutch
cp nutch*.war ~/local/tomcat/webapps/nutch/
jar xvf ~/local/tomcat/webapps/nutch/nutch.war
rm nutch-1.1.war
3. The webapp finds its indexes in ./crawl, relative to where you start Tomcat. Start Tomcat using a command like:
~/local/tomcat/bin/catalina.sh start
4. Then visit: http://localhost:8080/nutch/
Crawl Command vs. Whole-Web Crawling
• The crawl command is more appropriate when you intend to crawl up to around one million pages on a handful of Web servers.
• Whole-Web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines.
– More control over the crawl process.
– Incremental crawling.
References
• Nutch Web site: http://nutch.apache.org/
• Nutch Docs: http://lucene.apache.org/nutch/
• Nutch Wiki: http://wiki.apache.org/nutch/ (support, mailing lists, tutorials, presentations)
• Prasad Pingali, CLIA consortium, Nutch Workshop, 2007.
• Tom White, "Introduction to Nutch", java.net: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html