An Analysis of Linear Regression Algorithm Implemented on Hadoop Using Machine Learning Techniques
Abstract: Big data processing is becoming increasingly important in the modern era due to the
continuous growth in the amount of data generated by fields such as physics, genomics and
earth observation, so there is a pressing need to analyse it. Big data analytics comprises advanced
techniques for examining huge data sets. Because these data sets span many terabytes of data,
advanced tools and systems are required to capture, store, manage and analyze them within an
acceptable time frame. Such performance is provided by Hadoop, the open-source implementation
of the parallel distributed programming framework MapReduce. Hadoop allows distributed
processing of large data sets across clusters of computers using a simple MapReduce programming
model, and it uses the distributed file system HDFS for processing large data sets. Our work
compares a system with and without Hadoop while adopting a machine learning technique. This
paper adapts Google's MapReduce paradigm to parallelize and speed up the linear regression algorithm
from the machine learning community and to distinguish individual processors' performance.
Keywords: MapReduce, Hadoop, HDFS, Machine Learning, Linear Regression, Big Data, Big Data
Analytics
Dr. Ananthi Sheshasaayee
MCA, M Phil, Ph D, PGDET, Research Supervisor
Research Department of Computer Science & Applications
Quaid - e - Millath Government College for Women
Chennai, India
Mrs. J V N Lakshmi
MCA, M Sc (Statistics)
Research scholar
SCSVMV University
Kanchipuram, India
ISSN 2319-9725
July, 2014 www.ijirs.com Vol3 Issue 7
International Journal of Innovative Research and Studies Page 109
1. Introduction:
Every day, 2.5 quintillion bytes of data are created. This data comes from digital pictures,
videos, posts to social media sites, intelligent sensors, purchase transaction records and cell
phone GPS signals, to name a few sources. This is Big Data, and there is great interest in it
in both the commercial and the research communities.
Big Data is a new label given to a diverse field of data-intensive informatics in which the
datasets are so large that they become hard to work with effectively [15]. The term has mainly
been used in two contexts: first as a technological challenge when dealing with data-intensive
domains such as high-energy physics, astronomy or internet search, and second as
a sociological problem when data about us is collected and mined by companies such as
Facebook, Google, mobile phone operators, retail chains and governments.
Organizations rely on huge-scale data processing: modern Internet services process petabytes of
structured, unstructured and semi-structured data for day-to-day operations [15]. Data comes
from diverse technological domains and sociological problems, and Big Data usually
includes data sets whose sizes are beyond the ability of common software tools to analyse,
manage and process within acceptable time frames. Big data analytics is the process of
examining large amounts of data of different types, or big data, in an effort to uncover hidden
patterns, unknown correlations and other useful information [3]. Big Data tools are being
developed to handle various aspects of large quantities of data. MapReduce is a generic
parallel programming model for processing such large data sets [14]. Hadoop is an open-source
framework that supports running applications on large clusters; it implements the
computational paradigm named MapReduce and uses the distributed storage system HDFS.
This article suggests the use of Hadoop as a scalable platform for machine learning
algorithms that process big data.
Many machine learning algorithms can be written as MapReduce programs. We discuss the
characteristics of such MapReduce programs using linear regression as an example. First, if
the algorithm is iterative, one MapReduce job corresponds to one iteration, and MapReduce
is executed repeatedly until convergence. Second, the Reduce function is frequently
a summation. Third, the results of the Reduce function are broadcast to all the nodes that
execute the Map function [18].
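These three characteristics can be sketched in a few lines of plain Python (a single-process simulation, not Hadoop code; all function names and the toy data are ours): each "mapper" computes a partial gradient over its data shard, the "reducer" is a summation, and the updated parameters are broadcast to start the next iteration.

```python
# Illustrative sketch: one MapReduce round per gradient-descent iteration
# for linear regression. Each mapper emits a partial gradient over its
# shard; the reducer sums the partials; the driver updates the parameters
# and "broadcasts" them for the next iteration.

def map_partial_gradient(shard, theta):
    """Mapper: partial gradient of the squared error over one data shard."""
    g = [0.0] * len(theta)
    for x, y in shard:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g

def reduce_sum(partials):
    """Reducer: element-wise summation of the mapper outputs."""
    return [sum(col) for col in zip(*partials)]

def fit(shards, n_features, lr=0.1, iterations=200):
    theta = [0.0] * n_features
    m = sum(len(s) for s in shards)
    for _ in range(iterations):                       # one MapReduce per iteration
        partials = [map_partial_gradient(s, theta) for s in shards]
        grad = reduce_sum(partials)                   # the Reduce is a summation
        theta = [t - lr * g / m for t, g in zip(theta, grad)]
    return theta                                      # broadcast to all nodes

# Toy data: y = 2x, split across two shards
shards = [[([1.0], 2.0), ([2.0], 4.0)], [([3.0], 6.0), ([4.0], 8.0)]]
theta = fit(shards, n_features=1)
```

Because the gradient is a sum over data points, splitting the sum across shards changes nothing in the result, which is what makes the algorithm MapReduce-friendly.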
This paper is organized as follows. Section 2 reviews the related work underlying the study.
Section 3 compares the system with and without HDFS using a machine learning technique.
Section 4 applies the linear regression algorithm from the ML community. Section 5
discusses the time complexity analysis, and Section 6 concludes the paper.
2. Related Work:
2.1. Map Reduce – Data Flow Model:
MapReduce is a data flow paradigm for such applications [4]. Its simple, explicit data flow
programming model is favored over traditional high-level database approaches. The MapReduce
paradigm parallelizes processing of huge data sets using clusters or grids. A MapReduce program
comprises two functions: a Map() procedure and a Reduce() procedure.
The Map() procedure performs filtering and sorting, while the Reduce() procedure
summarizes the results. The surrounding system, often referred to as the "infrastructure" or
"framework", runs the various tasks in parallel on distributed servers and manages data
transfer across the parts of the system.
This model [8] is inspired by the map and reduce functions used in functional programming.
The reducers combine information from different records having the same key, aggregating
the intermediate results of the mappers, and the results are stored back to the
distributed file system. Figure 1 depicts the flow of data between the Map phase and the
Reduce phase.
Figure 1: A Map Reduce Programming Model
As Figure 1 shows, data from various sources is taken as input and mapped by calling the
Map() procedure; the mapped output is then reduced by calling the Reduce() procedure,
which produces the final output data.
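The flow of Figure 1 can be illustrated with word counting, the standard MapReduce example, in framework-free Python (the function names are ours, not part of the Hadoop API):

```python
# A minimal, framework-free sketch of the Figure 1 data flow: input
# records are mapped to (key, value) pairs, the pairs are grouped by key,
# and each group is reduced to a final value.
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in the input record
    return [(word, 1) for word in record.split()]

def reduce_phase(key, values):
    # Combine all values that share the same key
    return (key, sum(values))

records = ["big data on hadoop", "hadoop map reduce", "big data analytics"]

# Map: apply the Map() procedure to every input record
mapped = [pair for r in records for pair in map_phase(r)]

# Group intermediate pairs by key (done by the framework between phases)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: apply the Reduce() procedure once per key
output = dict(reduce_phase(k, v) for k, v in groups.items())
```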
2.1.1. Map - Shuffle – Reduce [14]:
1. Prepare the Map() input – the "MapReduce system" designates Map processors,
assigns the K1 input key value each processor would work on, and provides that
processor with all the input data associated with that key value.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value,
generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – the MapReduce system
designates Reduce processors, assigns the K2 key value each processor would work
on, and provides that processor with all the Map-generated data associated with that
key value.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key
value produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce output, and
sorts it by K2 to produce the final outcome.
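The five steps above can be sketched as a generic driver in plain Python, where K1 keys index the input splits and K2 keys organize the intermediate data (the example task, maximum reading per sensor, is purely illustrative):

```python
# Generic driver for the five Map-Shuffle-Reduce steps listed above.
# K1 keys index input splits; K2 keys organize intermediate data.
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Steps 1-2: run Map() exactly once per K1 key
    intermediate = []
    for k1, v1 in inputs.items():
        intermediate.extend(map_fn(k1, v1))
    # Step 3: shuffle - group the Map output by K2 key
    shuffled = defaultdict(list)
    for k2, v2 in intermediate:
        shuffled[k2].append(v2)
    # Step 4: run Reduce() exactly once per K2 key
    # Step 5: collect the final output, sorted by K2
    return {k2: reduce_fn(k2, vs) for k2, vs in sorted(shuffled.items())}

# K1 = input split id; each value is a list of (sensor, reading) records
inputs = {
    "split-0": [("s1", 20), ("s2", 31)],
    "split-1": [("s1", 25), ("s2", 28), ("s3", 19)],
}
result = run_mapreduce(
    inputs,
    map_fn=lambda k1, records: [(sensor, r) for sensor, r in records],
    reduce_fn=lambda k2, readings: max(readings),
)
```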
The popular MapReduce data processing framework nevertheless leads to several problems
in practice [8]. First, it does not directly support the complex n-step data flows that often
arise in practice. Second, it lacks support for processing multiple data sets together, as
required in knowledge discovery. Finally, filtering and aggregation must be hand-coded [10].
Consequently, users move their workloads to other data flow processing systems [6]. These
limitations of the MapReduce paradigm slow down data analytics, make processing of data
difficult and impede automated optimization. As an alternative way to store and process
extremely large data sets, the popular open-source MapReduce implementation Hadoop was
developed.
2.2. Hadoop – High Availability Distributed Object Oriented Platform:
Hadoop is an open-source framework that supports running applications on large
clusters. Hadoop implements a computational paradigm named MapReduce, in which the
application is divided into many small fragments of work, each of which may be executed on
any node in the cluster. It provides a distributed file system that stores data on the nodes,
providing very high aggregate bandwidth across the cluster.
Two major components of Hadoop are:
1. Hadoop Distributed File System (HDFS) for distributed storage.
2. MapReduce for parallel processing.
2.3. HDFS – Hadoop Distributed File System
Doug Cutting states that Hadoop creates clusters of machines that coordinate work, and is
built so that it operates with no loss of data and no interruption of work [5]. HDFS manages
storage on the cluster by breaking files into pieces called blocks and storing each block
redundantly across the pool of servers. Hadoop maintains two kinds of node: a master node
(the Job Tracker) and slave nodes (Task Trackers).
These clusters scale to many computing nodes and can easily be expanded with new
nodes. They are fault-tolerant: if one of the computing nodes fails during the execution of the
program, the work of the other nodes is not affected or discarded; the records that were
being processed on the failing node simply have to be processed again by another node. This
fault tolerance in particular supports running Hadoop on commodity hardware [9].
Russom agrees that HDFS has an advantage over a DBMS. As a file system, HDFS can handle any
file or document type containing data that ranges from structured to unstructured, says
Russom. “When HDFS and Map Reduce are combined, Hadoop easily parses and indexes the
full range of data types. Furthermore, as a distributed system, HDFS scales well and has a
certain amount of fault tolerance based on data replication, even when deployed atop
commodity hardware. For these reasons, HDFS and Map Reduce can complement existing
Data Warehousing systems that focus on structured, relational data”.
Source : [16] HDFS Architecture
Figure 2: Hadoop Distributed File System Architecture
In Figure 2, the Job Tracker communicates with the name node and assigns parts of a job to
Task Trackers. A Task Tracker runs on each data node; a task is a single map or reduce
operation over a piece of data. Hadoop divides the input to Map or Reduce jobs into equal
splits, and the Job Tracker reschedules any failed tasks.
1. Data is organized into files and directories, and files are divided into uniformly sized blocks.
2. Blocks are replicated and distributed to handle hardware failures.
3. HDFS exposes block placement so that computation can be distributed [13].
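Points 1-3 can be illustrated with a simplified sketch in Python; the block size and round-robin placement policy below are stand-ins chosen for illustration, not the actual HDFS implementation:

```python
# Illustrative sketch of the HDFS properties listed above: a file is cut
# into fixed-size blocks, and each block is replicated across distinct
# nodes so hardware failures can be tolerated.

def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 300                       # a 300-byte "file"
blocks = split_into_blocks(data, block_size=128)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```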
3. Hadoop Using Machine Learning:
This section presents a model with and without using a Hadoop system, and depicts an
analysis of time complexity using Hadoop.
A serial implementation without the Hadoop distributed file system follows the model
below [1]. This model is less productive than the Hadoop model.
3.1. A Model On Machine Learning Algorithm Without Hadoop:
Source : [1]
This model depicts a serial machine learning implementation without Hadoop; we intend
to measure its time efficiency. Using Hadoop, machine learning can be parallelized,
achieving a linear speed-up in execution and performance.
The success of the Hadoop platform has led developers to build powerful tools that data
scientists and engineers can exploit to extract value from massive amounts of data. Developing
a parallel programming technique for multicore processors speeds up machine learning
applications more broadly than searching for specialized optimizations [11]. Machine
learning techniques capture statistical regularities that can be distilled into models
capable of making future predictions [7].
Tom M. Mitchell suggests that machine learning has enormous application in the field of data
warehousing; it is used explicitly on huge data sets, whose algorithms are parallelized to run
within certain time frames. A general machine learning programming approach shares memory
across machines [17].
(Model flow: Input Data → Serial Implementation Using ML Algorithm → Finding Efficiency)
3.2. A Model On Machine Learning Algorithm With Hadoop:
Source : [1]
This model depicts machine learning using the MapReduce paradigm with Hadoop; again we
intend to measure time efficiency. Many standard machine learning algorithms follow one of
a few canonical data processing patterns [2], and a large subset of them can be phrased as
MapReduce tasks, illuminating the benefits the MapReduce framework offers to the machine
learning community. This paper chooses the linear regression algorithm (in summation form)
for analysis.
4. Linear Regression Algorithm:
The algorithm calculates its gradient by summing over data points, which makes it easy to
distribute over multiple cores [11]: divide the data set into as many pieces as there are
cores, give each core its share of the data to compute the partial sums, and aggregate the
results at the end. This form of algorithm is referred to as the "summation form" [11].
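The summation form can be sketched as follows (a single-process simulation; the partial sums over the pieces are the step that would run on separate cores, and the helper names are ours):

```python
# Single-process sketch of the "summation form": split the data set into
# as many pieces as there are cores, let each piece compute its partial
# sum, then aggregate the partials at the end.

def partition(data, num_cores):
    """Divide the data into num_cores nearly equal pieces."""
    k, r = divmod(len(data), num_cores)
    pieces, start = [], 0
    for i in range(num_cores):
        end = start + k + (1 if i < r else 0)
        pieces.append(data[start:end])
        start = end
    return pieces

def summation_form(data, per_item_fn, num_cores):
    """Each 'core' sums per_item_fn over its piece; partials are aggregated."""
    partials = [sum(per_item_fn(x) for x in piece)
                for piece in partition(data, num_cores)]   # parallelizable step
    return sum(partials)                                   # aggregation step

data = list(range(1, 101))
total = summation_form(data, per_item_fn=lambda x: x, num_cores=4)
```

Since addition is associative and commutative, the partitioned sum is exactly equal to the sequential sum regardless of how the pieces are cut.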
When fitting model parameters for classification or regression, the Map step computes over
subsets of the data given the current model parameters; similarly, a subset of the parameters
from previous iterations can be used for inference. The Reduce step typically involves
summing the partial results as the parameters change.
Each data point is an n-dimensional vector xi = (xi1, xi2, …, xin) associated with a
real-valued target label yi. A data set D = {(xi, yi)} of m such data points defines an
m × n matrix X and an m-dimensional vector y.
(Model flow: Input Data → Storing Data on HDFS → Machine Learning Algorithm Using MapReduce Paradigm → Finding Efficiency)
The linear regression parameter vector θ* is defined by solving the least-squares problem
for the design matrix X, whose rows contain the training instances [12]:

y = θ^T x

θ* = (X^T X)^(-1) X^T y

Compute X^T y = Σ x_i y_i and X^T X = Σ x_i x_i^T, summed over i = 1, 2, …, m.

The algorithm thus reduces to an ordinary least-squares fit.
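Under these definitions, the closed form can be checked in pure Python: each shard contributes partial sums x_i x_i^T and x_i y_i (the Map step), the partials are added and the resulting system is solved (the Reduce step). We use a two-parameter model (intercept and slope) so the 2×2 inverse is explicit; the data below is ours, purely for illustration.

```python
# Sketch of theta* = (X^T X)^(-1) X^T y with the sums X^T X = sum x_i x_i^T
# and X^T y = sum x_i y_i accumulated shard by shard (Map), then combined
# and solved (Reduce). Design-matrix rows are [1, x] (intercept + slope).

def shard_sums(shard):
    """Map step: partial X^T X (2x2) and X^T y (2-vector) over one shard."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for x, y in shard:
        row = (1.0, x)
        for i in range(2):
            b[i] += row[i] * y
            for j in range(2):
                A[i][j] += row[i] * row[j]
    return A, b

def solve_normal_equations(shards):
    """Reduce step: add the partial sums, then solve the 2x2 system."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for pA, pb in map(shard_sums, shards):
        for i in range(2):
            b[i] += pb[i]
            for j in range(2):
                A[i][j] += pA[i][j]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    theta0 = (b[0] * A[1][1] - b[1] * A[0][1]) / det   # Cramer's rule
    theta1 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return theta0, theta1

# y = 3 + 2x exactly, split across two shards
shards = [[(0.0, 3.0), (1.0, 5.0)], [(2.0, 7.0), (3.0, 9.0), (4.0, 11.0)]]
theta = solve_normal_equations(shards)
```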
5. Time Complexity Analysis:
Let n be the dimension of the inputs, m the number of training examples and P the number
of cores. The complexity of the iterative algorithm is analysed, and the running time is
slower without HDFS. The running time for linear regression is O(mn^2 + n^3) on a single
core, and on multicore it is O(mn^2/P + n^3/P' + n^2 log(P)) [11]. The multicore setting
therefore delivers faster performance at lower cost.
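As a rough numerical check of these expressions (the values of m, n and P below are chosen by us purely for illustration, taking P' = P):

```python
# Back-of-the-envelope comparison of the dominant operation counts:
# O(m n^2 + n^3) on a single core vs O(m n^2/P + n^3/P + n^2 log P)
# on P cores, with illustrative values of m, n and P.
import math

def single_core_ops(m, n):
    return m * n**2 + n**3

def multicore_ops(m, n, p):
    return m * n**2 / p + n**3 / p + n**2 * math.log2(p)

m, n, p = 1_000_000, 10, 8            # 1M examples, 10 features, 8 cores
speedup = single_core_ops(m, n) / multicore_ops(m, n, p)
```

For these values the m·n² term dominates, so the predicted speed-up approaches the number of cores, with the n² log P shuffle term costing almost nothing.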
6. Conclusion:
In this paper, by adopting the summation form of the linear regression algorithm in a
MapReduce framework, we obtained a 1.9× speed-up on a dual processor. The speed-up is
achieved by using HDFS together with the machine learning algorithm. The time complexity
of linear regression on a legacy system is compared with that on the Hadoop distributed
system, and the algorithm was found to perform productively for Big Data; hence time
efficiency is achieved.
References:
1. Asha T., Shravanthi U. M., Nagashree N. and Monika M. Building Machine Learning
Algorithms on Hadoop for Big Data. Department of Information Science & Engineering,
Bangalore Institute of Technology, Bangalore, India, 2013.
2. Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. Twitter, Inc., 2012.
3. Sijie Guo, Jin Xiong and Weiping Wang. Mastiff: A MapReduce-based System for
Time-Based Big Data Analytics. 2012.
4. J. Lin, D. Ryaboy and K. Weil. Full-Text Indexing for Optimizing Selection Operations
in Large-Scale Data Analytics. MapReduce Workshop, 2011.
5. Doug Cutting et al. About Hadoop. http://lucene.apache.org/hadoop/about.html.
6. J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Morgan &
Claypool, 2010.
7. Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas
Sindhwani, Shirish Tatikonda, Yuanyuan Tian and Shivakumar Vaithyanathan. SystemML:
Declarative Machine Learning on MapReduce. IBM Watson Research Center and IBM
Almaden Research Center, 2010.
8. MapReduce: A Flexible Data Processing Tool. Communications of the ACM, Volume 53,
pp. 72–77, 2010.
9. Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M.
Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan and Utkarsh
Srivastava. Building a High-Level Dataflow System on top of MapReduce: The Pig
Experience. Yahoo!, Inc., 2009.
10. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D. J., Rasin, A. and Silberschatz, A.
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for
Analytical Workloads. PVLDB 2(1), 2009.
11. Cheng-Tao Chu et al. Map-Reduce for Machine Learning on Multicore. In NIPS, 2007.
12. Whitney Newey. Linear Regression. Course materials for 14.386 New Econometric
Methods, Spring 2007. MIT OpenCourseWare (http://ocw.mit.edu), Massachusetts
Institute of Technology.
13. Apache Hadoop. http://wiki.apache.org/hadoop.
14. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large
Clusters. In OSDI, 2004.
15. Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung. The Google File System.
In 19th Symposium on Operating Systems Principles, pages 29–43, Lake George,
New York, 2003.
16. Hadoop: Open-source implementation of MapReduce. http://hadoop.apache.org.
17. Gunnar Rätsch. A Brief Introduction into Machine Learning. Friedrich Miescher
Laboratory.
18. Walisa Romsaiyud and Wichian Premchaiswadi. An Adaptive Machine Learning on
Map-Reduce Framework for Improving Performance of Large-Scale Data Analysis on EC2.