Taming the Elephant
Debanjan Mahata
Department of Information Science
University of Arkansas at Little Rock
Overview
• The Journey of Hadoop the Elephant
• Hadoop the Elephant Fundamentals
• Story of Sam – how he tamed Hadoop
• Map and Reduce
• Python and the Elephant
• Pig
Birth of the Elephant
• Doug Cutting, creator of Apache Nutch, reads the "MapReduce" paper (2004): "It is an important technique!"
• He extends Nutch with the idea
• Joins Yahoo! in 2006
• The great journey begins…
Naming the Elephant
"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term." -- Doug Cutting, Hadoop project creator
Born to be Scaled
• How do you scale up applications?
  – Run jobs processing hundreds of terabytes of data
  – It takes 11 days to read that much on 1 computer
• Need lots of cheap computers
  – Fixes the speed problem (15 minutes on 1,000 computers), but…
  – Reliability problems:
    • In large clusters, computers fail every day
    • Cluster size is not fixed
• Need common infrastructure
  – Must be efficient and reliable
The Environment
• Open-source Apache project
• Hadoop Core includes:
  – Distributed file system – distributes data
  – Map/Reduce – distributes the application
• Written in Java
• Runs on:
  – Linux, Mac OS X, Windows, and Solaris
  – Commodity hardware
As it Grew
• Yahoo! deployed large-scale science clusters in 2007.
• Tons of Yahoo! research papers emerged: WWW, CIKM, SIGIR, VLDB, …
• Yahoo! began running major production jobs in 2008.
• Nowadays…
Nowadays…
• Ads optimization
• Content optimization
• Search index
• Content feed processing
• Machine learning (e.g. spam filters)
• When you visit Yahoo!, you are interacting with data processed with Hadoop!
Nowadays…
• Yahoo! has ~20,000 machines running Hadoop
• The largest clusters are currently 2,000 nodes
• Several petabytes of user data (compressed, unreplicated)
• Yahoo! runs hundreds of thousands of jobs every month
Nowadays… Who Uses Hadoop?
• Amazon/A9
• AOL
• Facebook
• Fox Interactive Media
• Google
• IBM
• New York Times
• PowerSet (now Microsoft)
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
• More at http://wiki.apache.org/hadoop/PoweredBy
HDFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.

Hadoop Distributed File System – Goals:
• Store large data sets
• Cope with hardware failure
• Emphasize streaming data access
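The per-cluster defaults for the two knobs mentioned above – block size and replication factor – live in `hdfs-site.xml`. A minimal sketch of that fragment (the property names `dfs.blocksize` and `dfs.replication` are the standard HDFS keys; the values shown are just common defaults):

```xml
<!-- hdfs-site.xml: cluster-wide defaults for block size and
     replication factor; both can still be overridden per file -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```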
Typical Hadoop Structure
• Commodity hardware
  – Linux PCs with 4 local disks
• Typically a 2-level architecture
  – 40 nodes/rack
  – Uplink from rack is 8 gigabit
  – Rack-internal is 1 gigabit all-to-all
Hadoop Structure
• Single namespace for the entire cluster
  – Managed by a single namenode
  – Files are single-writer and append-only
  – Optimized for streaming reads of large files
• Files are broken into large blocks
  – Typically 128 MB
  – Replicated to several datanodes for reliability
• Clients talk to both the namenode and datanodes
  – Data is not sent through the namenode
  – File-system throughput scales nearly linearly with the number of nodes
• Access from Java, C, or the command line
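The block layout described above can be sketched in a few lines of Python. This is purely illustrative – the real namenode computes block boundaries server-side and also assigns each block to several datanodes for replication:

```python
# Sketch: how a file is split into fixed-size blocks, HDFS-style.
# Illustrative only; real HDFS does this on the namenode and
# replicates each block to several datanodes.
BLOCK_SIZE = 128 * 1024 * 1024  # typical HDFS block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # every block is full-size except possibly the last one
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file: two full 128 MB blocks plus one smaller final block
blocks = split_into_blocks(300 * 1024 * 1024)
```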
(map '( … ))  →  ( … )
Sam applied his invention to all the fruits he could find in the fruit basket.

Next Day
(reduce '( … ))

Classical Notion of MapReduce in Functional Programming
A list of values mapped into another list of values, which gets reduced into a single value.
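Sam's routine can be written down directly in Python, where this classical map/reduce pattern is built in (the fruit sizes below are made up for illustration):

```python
from functools import reduce

# Classical functional notion: map a list of values into another list,
# then reduce that list into a single value.
fruits = [3, 1, 4, 1, 5]  # sizes of Sam's fruits (made-up values)
pieces = list(map(lambda size: size * 2, fruits))  # the knife: cut each fruit
juice = reduce(lambda a, b: a + b, pieces)         # the blender: one value out
```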
18 Years Later
Sam got his first job at JuiceRUs for his talent in making juice.
Now, it's not just one basket but a whole container of fruits.
Also, they produce a list of juice types separately.
But Sam had just ONE knife and ONE blender – NOT ENOUGH!!
Large data, and a list of values for output.
Wait!
Brave Sam implemented a parallel version of his innovation.
Each input to a map is a list of <key, value> pairs: (<a, > , <o, > , <p, > , …)
Each output of a map is a list of <key, value> pairs: (<a', > , <o', > , <p', > , …), grouped by key
Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism), e.g. <a', ( … )>, reduced into a list of values

The Idea of MapReduce in Data-Intensive Computing
A list of <key, value> pairs is mapped into another list of <key, value> pairs, which gets grouped by key and reduced into a list of values.
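The key/value flow above is exactly the classic word-count example. A minimal single-machine sketch in plain Python (the function names are ours, not Hadoop's API): map emits <word, 1> pairs, the pairs are grouped by key, and reduce collapses each value-list into a count.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a <word, 1> pair for every word in the input."""
    return [(word, 1) for word in document.split()]

def group_by_key(pairs):
    """The shuffle: collect all values emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse one <key, value-list> into a single <key, count>."""
    return (key, sum(values))

pairs = map_phase("the elephant tamed the elephant")
grouped = group_by_key(pairs)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
```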
Sam realized:
◦ To create his favorite mixed fruit juice he can use a combiner (in Hadoop, a combiner is a local reducer run on each mapper's output before the shuffle)
◦ If several <key, value-list> groups fall into the same partition (based on the grouping/hashing algorithm), the blender (reducer) is used separately on each of them
◦ The knife (mapper) and blender (reducer) should not contain residue after use – they must be side-effect free
◦ In general, the reduce function should be associative and commutative
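A minimal sketch of the combiner idea, assuming the word-count setting (names here are illustrative): because the reduce function is associative and commutative, it can be applied locally to one mapper's output before anything crosses the network, and the final reduce over the partial sums gives the same answer.

```python
from collections import Counter

def combine(mapper_output):
    """Locally pre-reduce one mapper's <word, count> pairs.

    Summing is associative and commutative, so reducing these
    partial sums later equals reducing the raw pairs directly.
    """
    counts = Counter()
    for word, count in mapper_output:
        counts[word] += count
    return sorted(counts.items())

local_pairs = [("apple", 1), ("orange", 1), ("apple", 1)]
combined = combine(local_pairs)
```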
Afterwards
Now Utsav introduces you to our star elephant.
"Hey, so all this for me? I am really honored. Thanks a lot."
Problems That I Wish To Solve With Hadoop
o Calculating tf-idf scores
o Calculating feature vectors for a large number of documents, blogs, and users
o Sorting indexes of a large number of documents
o Building a crawler for collecting event-related data from social media
o Building classifiers
o Entity resolution
o The list is endless…
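The first problem on the list, tf-idf, sketched in plain single-machine Python. This is a hypothetical illustration, not the author's implementation: on Hadoop the same term and document counts would come from map/reduce passes over the collection, and the unsmoothed formula below is just one common variant.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_index: {term: tf-idf}} for a list of token lists.

    tf  = term count / document length
    idf = log(number of docs / docs containing the term)
    """
    n_docs = len(docs)
    # document frequency: in how many docs each term appears
    df = Counter(term for doc in docs for term in set(doc))
    scores = {}
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        scores[i] = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

docs = [["hadoop", "elephant"], ["hadoop", "pig"]]
scores = tf_idf(docs)
```

A term appearing in every document ("hadoop" here) gets an idf of zero, which is exactly why tf-idf down-weights ubiquitous words.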