HADOOP BASIC CONCEPTS AND HDFS

Hadoop Basic Concepts and HDFS

In this chapter you will learn:

• What Hadoop is

• What features the Hadoop Distributed File System (HDFS) provides

Hadoop Basic Concepts and HDFS

• The Hadoop Project and Hadoop Components

• The Hadoop Distributed File System (HDFS)

• Hands-On Exercise: Using HDFS

• Conclusion

Hadoop Components

Core Components: HDFS and MapReduce

• HDFS (Hadoop Distributed File System)

– Stores data on the cluster

• MapReduce

– Processes data on the cluster

A Simple Hadoop Cluster

• A Hadoop cluster is a group of machines working together to store and process data

• Any number of ‘slave’ or ‘worker’ nodes

– HDFS to store data

– MapReduce to process data

• Two ‘master’ nodes

– NameNode: manages HDFS

– JobTracker: manages MapReduce

HDFS Basic Concepts

• HDFS is a filesystem written in Java

– Based on Google’s GFS

• Sits on top of a native filesystem

– Such as ext3, ext4, or xfs

• Provides redundant storage for massive amounts of data

– Using readily available, industry-standard computers

How Files Are Stored

• Data files are split into blocks and distributed at load time

• Each block is replicated on multiple data nodes (default 3x)

• The NameNode stores metadata
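
The three points above can be illustrated with a toy model. This is a minimal sketch, not HDFS code: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made for illustration. Real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut the file contents into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, data_nodes, replication=3):
    """Build a toy version of the NameNode's metadata: block index -> nodes holding a replica.

    Round-robin placement is an assumption of this sketch, not the real HDFS policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block_index in range(num_blocks):
        placement[block_index] = [
            data_nodes[(block_index + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement


def read_file(blocks, placement, live_nodes):
    """Reassemble the file, requiring at least one live replica for every block."""
    contents = bytearray()
    for block_index, block in enumerate(blocks):
        if not any(node in live_nodes for node in placement[block_index]):
            raise IOError("no live replica for block %d" % block_index)
        contents += block
    return bytes(contents)


if __name__ == "__main__":
    # Hypothetical four-node cluster and a 10-byte "file" with 4-byte blocks.
    nodes = ["node1", "node2", "node3", "node4"]
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    placement = place_blocks(len(blocks), nodes)
    for block_index, holders in placement.items():
        print(block_index, holders)
    # The file is still readable after one node is lost.
    print(read_file(blocks, placement, live_nodes={"node2", "node3", "node4"}))
```

In this model, losing a single node never makes the file unreadable, because every block has replicas on two other nodes. The placement table stands in for the metadata held by the NameNode.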

Example: Storing and Retrieving Files (slides 1–4, diagrams only)

HDFS NameNode Availability

• The NameNode daemon must be running at all times

– If the NameNode stops, the cluster becomes inaccessible

• High Availability mode (in CDH4 and later)

– Two NameNodes: Active and Standby

• Classic mode

– One NameNode

– One “helper” node called the SecondaryNameNode

– The SecondaryNameNode does bookkeeping; it is not a backup NameNode
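
The “bookkeeping” done in classic mode is checkpointing: the SecondaryNameNode periodically merges the NameNode’s log of recent changes into a saved image of the filesystem metadata, so the log does not grow without bound. The sketch below is a toy model of that merge under simplifying assumptions: the metadata is reduced to a set of paths, and the edit format and function names are hypothetical.

```python
def apply_edit(namespace, edit):
    """Apply one logged change, given as an (operation, path) pair, to a set of paths."""
    operation, path = edit
    if operation == "create":
        namespace.add(path)
    elif operation == "delete":
        namespace.discard(path)
    else:
        raise ValueError("unknown operation: %r" % operation)


def checkpoint(image, edit_log):
    """Merge the edit log into a copy of the image.

    Returns the new image and an empty log. The old image is left untouched.
    """
    merged = set(image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return frozenset(merged), []


if __name__ == "__main__":
    # Hypothetical starting state: one file in the image, two changes in the log.
    image = frozenset({"/data/a.txt"})
    edit_log = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
    image, edit_log = checkpoint(image, edit_log)
    print(sorted(image), edit_log)
```

Nothing in this merge lets the helper serve client requests, which is why it cannot take over when the NameNode stops.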

Options for Accessing HDFS

• FsShell command line: hadoop fs

• Java API

• Ecosystem projects

– Flume: collects data from network sources (e.g., system logs)

– Sqoop: transfers data between HDFS and RDBMS

– Hue: web-based interactive UI; can browse, upload, download, and view files
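
As a rough illustration of the command-line option, the sketch below builds `hadoop fs` command lines for the -put, -ls, -cat, and -get subcommands without running them. The helper name `hadoop_fs` and the file paths are hypothetical, and the sketch only covers the simplest argument forms.

```python
import shlex

# Subcommands covered by this sketch; hadoop fs supports many more.
SUPPORTED = {"put", "ls", "cat", "get"}


def hadoop_fs(subcommand, *args):
    """Return the argument list for `hadoop fs -<subcommand> <args...>` without running it."""
    if subcommand not in SUPPORTED:
        raise ValueError("subcommand not covered by this sketch: %r" % subcommand)
    return ["hadoop", "fs", "-" + subcommand] + list(args)


if __name__ == "__main__":
    # Hypothetical paths: copy a local file into HDFS, list it, view it, copy it back.
    commands = [
        hadoop_fs("put", "access.log", "/user/training/access.log"),
        hadoop_fs("ls", "/user/training"),
        hadoop_fs("cat", "/user/training/access.log"),
        hadoop_fs("get", "/user/training/access.log", "access_copy.log"),
    ]
    for argv in commands:
        print(shlex.join(argv))
```

On a machine with Hadoop installed, each argument list could be passed to `subprocess.run(argv, check=True)` to run the command.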

Hands-On Exercise: Using HDFS

• In this Hands-On Exercise you will begin to get acquainted with the Hadoop tools. You will manipulate files in HDFS, the Hadoop Distributed File System.

• Please refer to the Hands-On Exercise Manual.

Key Points

• The core components of Hadoop

– Data storage: Hadoop Distributed File System (HDFS)

– Data processing: MapReduce

• How HDFS works

– Files are divided into blocks

– Blocks are replicated across nodes

• Command-line access to HDFS

– FsShell: hadoop fs

– Subcommands: -get, -put, -ls, -cat, etc.