Apache Spark – The Game Changer

Author: Mohan Krishna Mannava

Apache Spark – Introduction:

Apache Spark™ is an engine for large-scale data processing that lets you run batch jobs, query data interactively, and process incoming information as it enters the system. Spark became popular because of its speed, ease of use, generality, and ability to run everywhere.

Spark was started by Matei Zaharia at the UC Berkeley AMPLab in 2009 and open sourced in 2010. In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became an Apache Top-Level Project, and in November 2014 the engineering team at Databricks used Spark to set a new world record in large-scale sorting.

The Big Data Problem:

Data in the real world is growing fast, even faster than computational speeds; the sources of this traffic include the web, mobile devices, scientific computing, and so on. Storage is also getting cheaper, with capacity roughly doubling every 18 months. But CPU speeds are not keeping pace, and there are many bottlenecks in getting data in and out of this massive storage.

The big data problem is that a single machine can no longer process, or even hold, all the data required for analysis.

The solution is to distribute data across multiple machines that are part of large clusters. But commodity big data hardware (CPUs, memory chips, and hard drives) is unreliable and much harder to manage: some hard drives fail every year, network speeds are far lower than local memory and disk speeds, and performance is uneven across machines.

Challenges in Cluster Computing:

The main challenges in cluster computing are splitting work across machines and dealing with failures. When work is shared across multiple machines and one of them fails, the simplest solution is to relaunch the failed task, either on the same machine or on a different one.

If there are slow tasks, a slow task should be killed and a new task launched on a different machine, because the original machine may be about to fail.

Map Reduce:

The concept behind Map Reduce is to divide the data into multiple pieces that can be handled by the machines in a cluster, map those pieces to different machines, perform the desired operations on them, and then send the results on to the same or other machines in the cluster.

Partitioning the data across the machines in a cluster and then bringing the partial results back together is what is known as Map Reduce.
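As a minimal illustration of the pattern (not any particular framework's API), the sketch below runs the map, shuffle, and reduce phases of a word count sequentially in plain Python; the input documents are hypothetical.

```python
from collections import defaultdict

# Hypothetical input, imagined as pieces split across machines.
documents = ["spark is fast", "hadoop map reduce", "spark map reduce"]

# Map phase: each piece is turned into (key, value) pairs independently.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: pairs with the same key are routed to the same place.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: the values for each key are aggregated.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # e.g. {'spark': 2, 'is': 1, 'fast': 1, 'hadoop': 1, 'map': 2, 'reduce': 2}
```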

A Map Reduce implementation handles:

- execution of Map and Reduce tasks on many machines
- shuffling data between the Map and Reduce functions
- automatic recovery from machine failures and slow machines

Map Reduce and Disk I/O

Map Reduce follows a distributed execution model in which every stage goes through the hard drives. The Map step reads data from disk, processes it, and writes the output back to disk before the shuffle sends data to the Reduce step. The Reduce step again reads data from disk, processes it, and writes the results back to disk.

Iterative Map Reduce jobs therefore involve a lot of disk I/O on every repetition.

Apache Spark Motivation:

Using Map Reduce for complex jobs, interactive queries, and online processing involves a lot of disk I/O, and for interactive jobs that disk I/O is painfully slow. The natural move is to shift the working data from hard drives into memory.

Memory I/O is roughly 10–100 times faster than network and disk I/O, and keeping more data in memory is the main motivation behind developing a new distributed engine called "Spark".

Apache Spark provides an abstraction called “Resilient Distributed Datasets (RDDs)”.

Resilient Distributed Datasets (RDDs):

RDDs are partitioned collections of objects spread across a cluster and stored in memory or on disk. All Spark programs are written in terms of RDDs. They are built and manipulated through a diverse set of parallel transformations (map, filter, join, etc.) and actions (count, collect, save, etc.). RDDs are automatically rebuilt in case of machine failures, and they are immutable: once created, an RDD cannot be changed, only transformed into a new RDD.

The Spark computing framework provides a programming abstraction and a parallel runtime that hide the complexities of fault tolerance and slow machines.
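As a rough sketch (using the PySpark API and purely local data for illustration), the snippet below builds an RDD, applies transformations that return new immutable RDDs, and runs actions to bring results back to the driver:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDBasics")

# Build an RDD from a local collection; on a cluster it would be partitioned across workers.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations return new, immutable RDDs; `numbers` itself is never modified.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions materialize results on the driver.
print(evens.collect())   # [4, 16]
print(squares.count())   # 5

sc.stop()
```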

Spark Tools:

Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language for manipulating SchemaRDDs in Scala, Java, or Python, and it also provides SQL language support, with command-line interfaces and an ODBC/JDBC server. In Spark 1.3, SchemaRDD was renamed to DataFrame.
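A minimal sketch of the idea, using the SparkSession entry point of Spark 2.x+ rather than the Spark 1.x SQLContext/SchemaRDD described above, with hypothetical rows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Hypothetical structured data.
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# The DataFrame DSL and plain SQL are interchangeable.
people.filter(people.age > 40).show()

people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()
```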

Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, on a single engine.
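As a sketch of the mini-batch model, assuming a hypothetical text source on localhost:9999 and the classic DStream API:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second mini-batches

# Hypothetical source: lines of text arriving on a socket.
lines = ssc.socketTextStream("localhost", 9999)

# The same RDD-style transformations used in batch code run on each mini-batch.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```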

MLlib is a distributed machine learning framework on top of Spark. Because of Spark's distributed, memory-based architecture, MLlib is, according to benchmarks done by its developers on the Alternating Least Squares (ALS) implementations, about nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface), and it even scales better than Vowpal Wabbit. It implements many common machine learning and statistical algorithms to simplify large-scale machine learning pipelines, including the following (a short usage sketch is given after the list):

- summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
- classification and regression: SVMs, logistic regression, linear regression, decision trees, naive Bayes
- collaborative filtering: alternating least squares (ALS)
- clustering: k-means, Latent Dirichlet Allocation (LDA)
- dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
- feature extraction and transformation
- optimization primitives: stochastic gradient descent, limited-memory BFGS (L-BFGS)
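As promised above, here is a minimal sketch using the RDD-based pyspark.mllib k-means API on a few hypothetical 2-D points:

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "KMeansExample")

# Hypothetical 2-D points forming two obvious clusters.
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Train a k-means model with k=2 clusters.
model = KMeans.train(points, k=2, maxIterations=10)

print(model.clusterCenters)       # the two learned centers
print(model.predict([0.5, 0.5]))  # cluster index for a new point

sc.stop()
```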

GraphX is a distributed graph processing framework on top of Spark. It provides an API for expressing graph computation that can model the Pregel abstraction. It also provides an optimized runtime for this abstraction.

GraphX started initially as a research project at the UC Berkeley AMPLab and Databricks, and was later donated to the Spark project.

Why Spark? Why not Hadoop MapReduce?

Spark has many advantages over Map Reduce, such as lower overhead when starting jobs, less expensive shuffles, lazy evaluation, and more generalized computation patterns.

Spark is often faster than a traditional Map Reduce implementation because it keeps intermediate results in memory rather than writing them to disk, so results do not need to be serialized and written out between stages.

Programming with Spark:

Spark programs can be written in several languages, including R, Java, Scala, and Python. All operations are performed on Spark's core abstraction, the RDD.

A Spark program is a combination of two programs:

Driver program: runs on the driver machine.

Worker program: runs on cluster nodes or in local threads.

RDDs are distributed across the workers.

A Spark program first creates a SparkContext object, which tells Spark how and where to access a cluster.

The master parameter of the SparkContext determines which type and size of cluster to use.
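A minimal sketch, assuming a local run with four worker threads (a real deployment would instead pass a cluster URL such as a Spark standalone master or "yarn"):

```python
from pyspark import SparkConf, SparkContext

# The master parameter selects the cluster: "local[4]" means four local threads;
# "spark://host:7077" or "yarn" would point the driver at a real cluster.
conf = SparkConf().setMaster("local[4]").setAppName("HelloSpark")
sc = SparkContext(conf=conf)

print(sc.parallelize([1, 2, 3]).sum())  # 6
sc.stop()
```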

Operations on RDDs:

The two types of operations that can be performed on RDDs are transformations and actions.

Transformations are lazy: they are not computed immediately. Instead, Spark remembers the set of transformations applied to the base dataset, and the transformed RDD is only computed when an action runs on it. An action causes Spark to execute the applied transformations on the source data and is how results are obtained from Spark.
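A small sketch of lazy evaluation (local PySpark, illustrative only):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "LazyEvaluation")
rdd = sc.parallelize(range(1_000_000))

# This returns immediately: Spark only records the transformation in the
# RDD's lineage; no computation happens yet.
doubled = rdd.map(lambda x: x * 2)

# The action below triggers the actual distributed computation.
print(doubled.reduce(lambda a, b: a + b))

sc.stop()
```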

Some common Spark transformations are map, filter, flatMap, distinct, and reduceByKey.

Some common Spark actions are collect, count, take, reduce, and saveAsTextFile.

Similar to Map Reduce, Spark supports key-value pairs. Each element of a pair RDD is a tuple. Some of the key-value transformations are reduceByKey, groupByKey, sortByKey, and join; a short sketch follows.
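The sketch below (hypothetical data) shows two of these key-value transformations, reduceByKey and join:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "PairRDDs")

# Each element of a pair RDD is a (key, value) tuple.
sales = sc.parallelize([("apple", 3), ("banana", 2), ("apple", 5)])
prices = sc.parallelize([("apple", 1.0), ("banana", 0.5)])

# reduceByKey aggregates the values that share a key.
totals = sales.reduceByKey(lambda a, b: a + b)   # ("apple", 8), ("banana", 2)

# join matches up the keys of two pair RDDs.
print(totals.join(prices).collect())             # [("apple", (8, 1.0)), ("banana", (2, 0.5))]

sc.stop()
```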

Spark Program Lifecycle:

1. Creating RDDs from external data, or by parallelizing a collection in the driver program

2. Lazily transforming them into new RDDs

3. Caching some RDDs for reuse

4. Performing actions to execute parallel computation and produce results
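The four steps above, sketched end to end with hypothetical in-memory data (a real job would usually start from external storage such as HDFS):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "Lifecycle")

# 1. Create an RDD, here by parallelizing a collection in the driver program.
lines = sc.parallelize(["spark is fast", "spark is general"])

# 2. Lazily transform it into new RDDs.
words = lines.flatMap(lambda line: line.split())

# 3. Cache an RDD that will be reused by several actions.
words.cache()

# 4. Perform actions to execute the parallel computation and produce results.
print(words.count())             # first action: computes and caches `words`
print(words.distinct().count())  # second action: reuses the cached data

sc.stop()
```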

Spark Advantages:

Speed: Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

[Figure: logistic regression running time in Hadoop vs. Spark]

Ease of Use: Spark lets you write applications quickly in R, Java, Scala, and Python. It offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, and R shells.

Generality: Spark combines SQL, streaming, and complex analytics. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. One can combine these libraries seamlessly in the same application.

Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.

References:

https://en.wikipedia.org/wiki/Apache_Spark

https://spark.apache.org/

http://www.spark-stack.org/getting-started/

Introduction to Big Data with Apache Spark – UC Berkeley – edX

https://en.wikiversity.org/wiki/Big_Data/Spark