An Analysis of Linear Regression Algorithm Implemented on Hadoop Using Machine Learning Techniques


Abstract: Big data processing is becoming increasingly important due to the continuous growth in the amount of data generated by fields such as physics, genomics and earth observation, and this data must be analysed. Big data analytics is an advanced technique for very large data sets. Because such data sets run to many terabytes, advanced tools and systems are required to capture, store, manage and analyse them within an acceptable time frame. Greater performance can be obtained by using Hadoop, an open source implementation of the parallel distributed programming framework MapReduce. Hadoop allows distributed processing of large data sets across clusters of computers using the simple MapReduce programming model, and it uses a distributed file system, HDFS, for storing and processing large data sets. Our work compares the system with and without Hadoop while applying a machine learning technique. This paper adapts Google's MapReduce paradigm to parallelize a linear regression algorithm from the machine learning community and to measure the resulting speed-up across processors.

Keywords: MapReduce, Hadoop, HDFS, Machine Learning, Linear Regression, Big Data, Big Data Analytics

Dr. Ananthi Sheshasaayee

MCA, M Phil, Ph D, PGDET, Research Supervisor

Research Department of Computer Science & Applications

Quaid-e-Millath Government College for Women

Chennai, India

Mrs. J V N Lakshmi

MCA, M Sc (Statistics)

Research scholar

SCSVMV University

Kanchipuram, India

ISSN 2319-9725

International Journal of Innovative Research and Studies, Vol. 3, Issue 7, July 2014 (www.ijirs.com)

1. Introduction:

Every day, 2.5 quintillion bytes of data are created. This data comes from digital pictures, videos, posts to social media sites, intelligent sensors, purchase transaction records and cell phone GPS signals, to name a few. This is Big Data, and there is great interest in it in both the commercial and the research communities.

Big Data is a new label given to a diverse field of data-intensive informatics in which the datasets are so large that they become hard to work with effectively [15]. The term has mainly been used in two contexts: firstly as a technological challenge when dealing with data-intensive domains such as high energy physics, astronomy or internet search, and secondly as a sociological problem when data about us is collected and mined by companies such as Facebook, Google, mobile phone companies, retail chains and governments.

Organizations rely on huge-scale data processing: the modern Internet processes petabytes of structured, unstructured and semi-structured data for day-to-day operations [15]. Data comes from diverse technological domains and sociological sources, and Big Data usually denotes data sets whose sizes are beyond the ability of common software tools to analyse, manage and process within a reasonable time frame. Big data analytics is the process of examining large amounts of data of different types, or big data, in an effort to uncover hidden patterns, unknown correlations and other useful information [3]. Big Data tools are being developed to handle various aspects of these large quantities of data. MapReduce is a generic parallel programming model for processing such large data sets [14], and Hadoop is an open source framework that supports running applications on large clusters. Hadoop implements the MapReduce computational paradigm and uses a distributed storage system, HDFS. This article suggests the use of Hadoop as a scalable platform for Machine Learning algorithms to process big data.

Many machine learning algorithms can be written as MapReduce programs. We discuss the characteristics of such MapReduce programs, with linear regression as an example. First, if the algorithm is iterative, one MapReduce job corresponds to one iteration, and MapReduce is executed repeatedly until convergence. Second, the Reduce function is frequently a summation. Third, the results of the Reduce function are broadcast to all the nodes that execute the Map function [18].
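As an illustration of this pattern, the following is a minimal Python sketch, not the paper's implementation and not actual Hadoop code: every name in it (map_partial, reduce_sum, run) is hypothetical, and it runs in a single process purely to show one MapReduce per iteration, a summation in the Reduce step, and the broadcast of updated parameters back to the mappers.

```python
import numpy as np

def map_partial(partition, params):
    # Map: each mapper computes a partial statistic (here a partial gradient)
    # over its own split of the data, given the current model parameters.
    X, y = partition
    return X.T @ (X @ params - y)

def reduce_sum(partials):
    # Reduce: frequently just a summation of the mappers' outputs.
    return sum(partials)

def run(partitions, params, lr=1e-3, max_iter=100, tol=1e-6):
    for _ in range(max_iter):                                   # one MapReduce job per iteration
        grads = [map_partial(p, params) for p in partitions]    # Map phase
        grad = reduce_sum(grads)                                 # Reduce phase
        new_params = params - lr * grad
        if np.linalg.norm(new_params - params) < tol:
            return new_params
        params = new_params                                      # "broadcast" back to the mappers
    return params
```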


This paper is organized as follows. Section 2 gives the related work underpinning the study. Section 3 compares the system with HDFS and without HDFS using a Machine Learning technique. Section 4 describes the Linear Regression algorithm from the ML community. Section 5 discusses the analysis of time complexity. The paper is concluded in Section 6.

2. Related Work:

2.1. Map Reduce – Data Flow Model:

MapReduce is a data flow paradigm for such applications [4]. Its simple, explicit data flow programming model is favored over traditional high-level database approaches, and the MapReduce paradigm parallelizes the processing of huge data sets using clusters or grids. A MapReduce program comprises two functions, a Map() procedure and a Reduce() procedure. The Map() procedure performs filtering and sorting, while the Reduce() procedure performs a summary operation. The surrounding MapReduce system, often referred to as the "infrastructure" or "framework", runs the various tasks in parallel on distributed servers and manages data transfer across the parts of the system.

This model [8] is inspired by the map and reduce functions used in functional programming: the mappers emit intermediate records, and the reducers combine the records that share the same key and aggregate the mappers' intermediate results. The results are stored back to the distributed file system. Figure 1 depicts the flow of data between the Map phase and the Reduce phase.

Figure 1: A Map Reduce Programming Model


As Figure 1 depicts, data from various sources is taken as input and mapped by calling the Map() procedure. The mapped output is then reduced by calling the Reduce() procedure, which produces the final output data.

2.1.1. Map - Shuffle – Reduce [14]:

1. Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the K1 input key value each processor will work on, and provides that processor with all the input data associated with that key value.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor will work on, and provides that processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome.
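To make the five steps concrete, here is a minimal in-memory Python simulation of the Map → Shuffle → Reduce flow, using word counting as the classic example. It illustrates the data flow only and is not Hadoop API code; all function names are ours.

```python
from collections import defaultdict

def map_fn(k1, line):
    # Step 2: Map() runs once per (K1, value) input record and emits
    # intermediate (K2, value) pairs -- here (word, 1).
    for word in line.split():
        yield word, 1

def reduce_fn(k2, values):
    # Step 4: Reduce() runs once per K2 key, over all values for that key.
    return k2, sum(values)

def map_reduce(records):
    # Step 1: feed each (K1, value) record to a mapper.
    intermediate = defaultdict(list)
    for k1, value in records:
        for k2, v in map_fn(k1, value):
            intermediate[k2].append(v)        # Step 3: shuffle groups by K2
    # Steps 4-5: run Reduce per key and collect the output sorted by K2.
    return sorted(reduce_fn(k2, vs) for k2, vs in intermediate.items())

print(map_reduce([(0, "map reduce map"), (1, "reduce shuffle")]))
# [('map', 2), ('reduce', 2), ('shuffle', 1)]
```

Here K1 is the input record identifier and K2 is the word; step 3 is where all records sharing the same K2 are brought together before the Reduce step runs.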

In practice, however, the popular MapReduce data processing framework leads to several problems [8]. Firstly, it does not directly support the complex n-step data flows that often arise in practice. Secondly, it is weak at processing multiple data sets together for knowledge discovery, and finally filtering and aggregation must be hand-coded [10].

Consequently, users move their workloads to other data-flow processing systems [6]. These limitations of the basic MapReduce paradigm slow down data analytics, make processing of data difficult and impede automated optimization. As an alternative way to store and process extremely large data sets, the popular open source implementation Hadoop MapReduce was developed.

2.2. Hadoop – High Availability Distributed Object Oriented Platform:

Hadoop is an open source framework that supports the running of applications on large clusters. Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. It also provides a distributed file system that stores data on the nodes, providing very high aggregate bandwidth across the cluster.

The two major components of Hadoop are:

1. Hadoop Distributed File System (HDFS) for distributed storage.
2. MapReduce for parallel processing.

2.3. HDFS – Hadoop Distributed File System

Doug Cutting states that Hadoop creates clusters of machines that coordinate the work, built in such a way that there is no loss of data or interruption of work [5]. HDFS manages storage on the cluster by splitting files into pieces called blocks and storing each block redundantly across the pool of servers. The cluster maintains a master node (the NameNode, with the Job Tracker for MapReduce jobs) and slave nodes (DataNodes running Task Trackers).

Hadoop clusters are scalable to many computing nodes and are easily expanded with new nodes. They are also fault-tolerant: if one of the computing nodes fails during the execution of a program, the work of the other nodes is not affected or discarded; the records that were being processed on the failing node simply have to be processed again by another node. This fault tolerance in particular supports running Hadoop on commodity hardware [9].

Russom argues that HDFS has an advantage over a DBMS: as a file system, HDFS can handle any file or document type containing data that ranges from structured to unstructured. "When HDFS and MapReduce are combined, Hadoop easily parses and indexes the full range of data types. Furthermore, as a distributed system, HDFS scales well and has a certain amount of fault tolerance based on data replication, even when deployed atop commodity hardware. For these reasons, HDFS and MapReduce can complement existing Data Warehousing systems that focus on structured, relational data."


Figure 2: Hadoop Distributed File System Architecture (Source: [16])

In Figure 2, the Job Tracker communicates with the NameNode and assigns parts of a job to Task Trackers; a Task Tracker runs on each data node. A task is a single map or reduce operation over a piece of data, and Hadoop divides the input of Map or Reduce jobs into equal-sized splits. The Job Tracker reschedules the work of any failed Task Tracker.

1. Data is organized into files and directories, and files are divided into uniform-sized blocks.
2. Blocks are replicated and distributed to handle hardware failures.
3. HDFS exposes block placement so that computation can be distributed [13].
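As a toy illustration of the splitting and replication just described (and not HDFS's actual placement policy), the sketch below divides a file into fixed-size blocks and assigns each block to several data nodes. The 64 MB block size and replication factor of 3 echo common HDFS defaults; all names here are ours.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # bytes, mirroring a common HDFS default
REPLICATION = 3                 # each block stored on three nodes

def place_blocks(file_size, data_nodes):
    n_blocks = -(-file_size // BLOCK_SIZE)       # ceiling division
    nodes = itertools.cycle(data_nodes)          # naive round-robin placement
    placement = {}
    for block_id in range(n_blocks):
        placement[block_id] = [next(nodes) for _ in range(REPLICATION)]
    return placement

# A 200 MB file on a 5-node cluster -> 4 blocks, each replicated on 3 nodes.
print(place_blocks(200 * 1024 * 1024, ["node1", "node2", "node3", "node4", "node5"]))
```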

3. Hadoop Using Machine Learning:

This section presents a model with and without the use of a Hadoop system; the analysis of time complexity when using Hadoop is given later. A serial implementation without the Hadoop distributed file system follows the model below [1]; this model has lower productivity when compared with Hadoop.

3.1. A Model On Machine Learning Algorithm Without Hadoop:

Source : [1]

This model depicts a serial Machine Learning implementation without Hadoop: the input data is processed by a serial implementation of the ML algorithm and its time efficiency is measured. Using Hadoop, the same Machine Learning algorithm can instead be parallelized across the cores of the system, achieving a near-linear speed-up in execution and performance.

The success of the Hadoop platform encourages developers to build powerful tools that data scientists and engineers can exploit to extract insight from massive amounts of data. Developing a parallel programming technique for multicore processors speeds up machine learning applications in general, rather than searching for specialized per-algorithm optimizations [11]. Focusing on Machine Learning techniques allows statistical regularities to be distilled into models that are capable of making future predictions [7].

Tom M. Mitchell suggests that Machine Learning has enormous application in the field of data warehousing. It is used explicitly for huge data sets, where the algorithms are parallelized so that they run within certain time frames; a typical Machine Learning programming approach shares memory across machines [17].



3.2. A Model On Machine Learning Algorithm With Hadoop:

Source : [1]

This model depicts Machine Learning with Hadoop, using the MapReduce paradigm: the input data is stored on HDFS, the learning algorithm is expressed as MapReduce jobs, and the time efficiency is again measured. Many standard Machine Learning algorithms follow one of a small number of canonical data-processing patterns [2], and a large subset of these can be phrased as MapReduce tasks, illuminating the benefits the MapReduce framework offers to the Machine Learning community. This paper chooses the linear regression algorithm (in summation form) for the analysis.

4. Linear Regression Algorithm:

The algorithm computes its gradient (or sufficient statistics) as a sum over data points, which makes it easy to distribute over multiple cores [11]. Dividing the data set into as many pieces as there are cores, each core sums the equations over its share of the data, and the partial results are aggregated at the end. This form of algorithm is referred to as the "summation form" [11].

When fitting model parameters for classification or regression, the Map stage works on a subset of the data given the current model parameters (or, for inference, the parameters obtained from previous iterations). The Reduce stage then typically involves summing the partial results and applying the parameter update.

Each data point is an n-dimensional vector x_i = (x_i1, x_i2, …, x_in) associated with a real-valued target label y_i. A data set D = {(x_i, y_i)} of m such data points defines an m × n matrix X and an m-dimensional target vector y.


The linear regression parameter vector \theta^* is defined by solving the normal equations for the design matrix X whose rows contain the training instances [12]:

$$y = \theta^T x, \qquad \theta^* = (X^T X)^{-1} X^T y.$$

In summation form, compute

$$X^T y = \sum_{i=1}^{m} x_i y_i \quad\text{and}\quad X^T X = \sum_{i=1}^{m} x_i x_i^T, \qquad i = 1, 2, \dots, m.$$

This algorithm reduces to the case of an ordinary least squares fit.
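A compact sketch of this summation form follows. It is an illustrative NumPy version under our own assumptions, not the authors' Hadoop implementation: each partition (one per core or mapper) computes its partial X^T X and X^T y, and the reduce step sums the partials and solves the normal equations.

```python
import numpy as np

def map_partials(partition):
    X, y = partition
    return X.T @ X, X.T @ y            # per-partition sums over its data points

def reduce_and_solve(partials):
    XtX = sum(p[0] for p in partials)  # X^T X = sum_i x_i x_i^T
    Xty = sum(p[1] for p in partials)  # X^T y = sum_i x_i y_i
    return np.linalg.solve(XtX, Xty)   # theta* = (X^T X)^{-1} X^T y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    theta_true = np.arange(1.0, 6.0)
    y = X @ theta_true + rng.normal(scale=0.1, size=10_000)

    # "Map" over as many pieces as there are cores, then "Reduce".
    partitions = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
    print(reduce_and_solve([map_partials(p) for p in partitions]))
    # approximately [1. 2. 3. 4. 5.], matching an ordinary least-squares fit
```

Replacing the list comprehension with, for example, a multiprocessing pool map distributes the partitions over the cores, which is precisely the serial-versus-parallel comparison measured in Sections 3 and 5.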

5. Time Complexity Analysis:

Let n be the dimension of the inputs, m the number of training examples and P the number of cores. Analysing the complexity of the algorithm, the running time without HDFS (the serial case) is slower than the distributed case. The running time for linear regression is O(mn^2 + n^3) serially, while on a multicore system the complexity is O(mn^2 / P + n^3 / P' + n^2 log(P)) [11]. The multicore implementation therefore delivers faster performance at lower cost.
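As a back-of-the-envelope illustration of these expressions, the snippet below computes the two operation counts and their ratio for assumed example values (m = 10^6, n = 10, P = P' = 4); these numbers are not taken from the paper.

```python
import math

# Assumed example values, not figures from the paper.
m, n = 1_000_000, 10            # training examples, input dimension
P = P_prime = 4                 # cores used for the sums and the matrix solve

serial   = m * n**2 + n**3                                        # O(mn^2 + n^3)
parallel = m * n**2 / P + n**3 / P_prime + n**2 * math.log2(P)    # multicore bound

print(serial, parallel, serial / parallel)
# 100001000 25000450.0 ~4.0  -> the speed-up approaches P when m dominates
```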

6. Conclusion:

In this paper, by adopting the summation form of the linear regression algorithm in a MapReduce framework, we obtained a speed-up of roughly 1.9 times on a dual processor. The speed-up is achieved by using HDFS together with the Machine Learning algorithm. The time complexity of linear regression on a legacy (serial) system is compared with that on the Hadoop distributed system, and the algorithm is found to perform productively for Big Data; time efficiency is thereby achieved.


References:

1. Asha T, Shravanthi U. M, Nagashree N and Monika M. Building Machine Learning Algorithms on Hadoop for Bigdata. Department of Information Science & Engineering, Bangalore Institute of Technology, Bangalore, India, 2013.
2. Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. Twitter, Inc., 2012.
3. Sijie Guo, Jin Xiong and Weiping Wang. Mastiff: A MapReduce-based System for Time-based Big Data Analytics. 2012.
4. Lin, D. Ryaboy and K. Weil. Full-text indexing for optimizing selection operations in large-scale data analytics. MAPREDUCE Workshop, 2011.
5. Doug Cutting et al. About Hadoop. http://lucene.apache.org/hadoop/about.html.
6. J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Morgan & Claypool, 2010.
7. Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian and Shivakumar Vaithyanathan. SystemML: Declarative Machine Learning on MapReduce. IBM Watson Research Center and IBM Almaden Research Center, 2010.
8. MapReduce: A Flexible Data Processing Tool. Communications of the ACM, Volume 53, pp. 72–77, 2010.
9. Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan and Utkarsh Srivastava. Building a High-Level Dataflow System on top of MapReduce: The Pig Experience. Yahoo!, Inc., 2009.
10. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D. J., Rasin, A. and Silberschatz, A. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB 2(1), 2009.
11. Cheng-Tao Chu et al. Map-Reduce for Machine Learning on Multicore. In NIPS, 2007.
12. Whitney Newey. Linear Regression. Course materials for 14.386 New Econometric Methods, Spring 2007. MIT OpenCourseWare (http://ocw.mit.edu), Massachusetts Institute of Technology.
13. Apache Hadoop. http://wiki.apache.org/hadoop.
14. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
15. Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung. The Google File System. In 19th Symposium on Operating Systems Principles, pages 29–43, Lake George, New York, 2003.
16. Hadoop: Open-source implementation of MapReduce. http://hadoop.apache.org.
17. Gunnar Rätsch. A Brief Introduction into Machine Learning. Friedrich Miescher Laboratory.
18. Walisa Romsaiyud and Wichian Premchaiswadi. An Adaptive Machine Learning on Map-Reduce Framework for Improving Performance of Large-Scale Data Analysis on EC2.