Outlines
• Big Data Overview.• Big Data Management Systems
• Hadoop• Cassandra• Datastax
• Big Data Query languauges• Hive QL• CQL
What is Big Data ?
• Big data is a popular term used to describe the exponential growth, availability and use of information, both structured and unstructured. It can serve as the basis for innovation, differentiation and growth.
Big DataIt is a collection of data sets too large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Opportunities for a New Approach to Analytics New Applications Driving Data Volume
2000’s(CONTENT & DIGITAL ASSET MANAGEMENT)
1990’s(RDBMS & DATA WAREHOUSE)
2010’s(NO-SQL & KEY/VALUE)
VOLU
ME O
F INFO
RMATIO
N
LARGE
SMALL
MEASURED INTERABYTES1TB = 1,000GB
MEASURED INPETABYTES1PB = 1,000TB
WILL BE MEASURED INEXABYTES1EB = 1,000PB
What do we Mean by Hadoop
• A framework for performing big data analytics– An implementation of the MapReduce paradigm
– Hadoop glues the storage and analytics together and provides reliability, scalability, and management
Structure of Apache hadoop platform (Hadoop ecosystem) cont.
consists of • JobTracker• TaskTracker • NameNode • DataNode• A slave or worker node acts as both a DataNode and TaskTracker.
File System
Hadoop Distributed File SystemHDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework .Name node Data node Secondary namenodeData awareness
HDFS ArchitectureNamenode
Breplication
Rack1 Rack2Client
Blocks
Datanodes Datanodes
Client
Write
Read
Metadata ops Metadata(Name, replicas..)(/home/foo/data,6. ..
Block ops
Putting it all Together: MapReduce and HDFS
a
Task TrackerTask Tracker Task Tracker
Job Tracker
Hadoop Distributed File System (HDFS)
Client/Dev
Large Data Set(Log files, Sensor Data)
Map Job
Reduce Job
Map Job
Reduce Job
Map Job
Reduce Job
Map Job
Reduce Job
Map Job
Reduce Job
Map Job
Reduce Job2
1
3
4
MapReduce
Cat
Bat
Dog
Other Words(size:TByte)
map
map
map
map
split
split
split
split
combine
combine
combinereduce
reduce
reduce
part0
part1
part2
Big Data Management Systems Apache Cassandra
Apache Cassandra is a massively scalable NoSQL database. Cassandra is architected to handle real-time big data workloads across multiple data centers with no single point of failure, providing enterprises with extremely high database performance and continuous availability.
Cassandra
Key features of Apache Cassandra
Cassandra is built for enterprise-class, real-time database deployments.Its feature set includes:• Big data scalability.• Peer-to-peer.• Fast linear performance• Elastic scalability • No single point of failure.• Multi-data center/cloud capable
Key features of Apache Cassandra cont.• Location independence.• Dynamic schema. • Tunable data consistency.• Data compression.• Cloud-ready.• SQL-like language (CQL).• Memory efficient.• Easy setup.
What is Data stax enterprise?
DataStax Enterprise is a big data platform built on Apache Cassandra that manages real-time, analytics, and enterprise search data.
DataStax Enterprise
Key Features of DataStax Enterprise• Production Certified Cassandra • No Single Point of Failure• Reserve Job Tracker • Multiple Job Trackers • Hadoop MapReduce using Multiple Cassandra File Systems.
• Analytics Without ETL • Elastic Workload Re-provisioning.
Key Features of DataStax Enterprise cont’• Streamlined Setup and Operations • Hive Support.• Pig Support.• Enterprise Search Capabilities • Migration of RDBMS data.• Runtime Logging.• Support for Mahout• Full Integration with DataStax OpsCenter
What’s Hive?
Apache Hive is a data warehousing solution which is built over Hadoop. It is powered by HiveQL which is a declarative SQL language compiled directly into Map Reduce jobs which are executed over the underlying Hadoop architecture.
Apache Hive
Advantages Of Using APACHE HIVE
• Fits the low level interface requirement of Hadoop perfectly.
• Supports external tables which make it possible to process data without actually storing in HDFS.
• It has a rule based optimizer for optimizing logical plans.
• Supports partitioning of data at the level of tables to improve performance.
• Metastore or Metadata store is a big plus in the architecture which makes the lookup easy.
Disadvantages Of Using APACHE HIVE• No support for update and delete.
• No support for singleton inserts. Data is required to be loaded from a file using LOAD command.
• No access control implementation.
• Correlated sub queries are not supported.
What's New at Hive QL
At Data Units: • Partitions• Buckets (or Clusters)Complex Types•Structs: the elements within the type can be accessed using the DOT (.) notation.
• Maps (key-value tuples)• Arrays (indexable lists)
Cassandra Query Language
Effectively a structured, Attempting to push as much server-side as possible, familiar syntax, User friendly API, SQL like query language.
CQL
Top Related