AVAILABILITY OF THE JOBTRACKER MACHINE

IN

HADOOP/MAP-REDUCE

IMPLEMENTATIONS

A Thesis Submitted to the Graduate Faculty in Partial Fulfilment of the Requirement for the

Award of Degree of Master of Science in Computer Science

by

Mensah Kwabena Patrick

Academic Advisor: Dr. Ekpe Okorafor

November 27, 2011

Declaration

I hereby declare with honesty that the material in this thesis is the result of a study carried out by me on the topic "Availability of the JobTracker Machine in Hadoop/MapReduce Implementations" under the supervision of Dr. Ekpe Okorafor at the African University of Science and Technology, and that it has not been submitted elsewhere for the award of a degree. When and wherever the work is based on other scientific findings, references are made and duly acknowledged.

——————————————-
Mensah Kwabena Patrick


Acknowledgement

Sincere thanks go to God Almighty for keeping me through this thesis. I am grateful to my supervisor, Dr. Ekpe Okorafor, for his guidance and corrections. I also acknowledge the support and concern of my family, especially my sister, Ruth Yaa Menka, and everyone who contributed in diverse ways to the successful completion of this work. May God richly bless you all.


Contents

Declaration
Acknowledgement
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Problem Statement
  1.2 Objectives
  1.3 Thesis Organization

2 Cloud Computing and Fault Tolerance
  2.1 Cloud Computing
  2.2 Types of Clouds
  2.3 Virtualization in the Cloud
    2.3.1 Advantages of virtualization
  2.4 Fault, Error and Failure
    2.4.1 Fault Types
  2.5 Fault Tolerance
    2.5.1 Fault-tolerance Properties
    2.5.2 K Fault Tolerant Systems
    2.5.3 Hardware Fault Tolerance
    2.5.4 Software Fault Tolerance
  2.6 Properties of a Fault Tolerant Cloud
    2.6.1 Availability
    2.6.2 Reliability
    2.6.3 Scalability

3 Hadoop/MapReduce Architecture
  3.1 Hadoop/MapReduce
  3.2 MapReduce
  3.3 Hadoop/MapReduce versus other Systems
    3.3.1 Relational Database Management Systems (RDBMS)
    3.3.2 Grid Computing
    3.3.3 Volunteer Computing
  3.4 Features of MapReduce
    3.4.1 Automatic Parallelization and Distribution of Work
    3.4.2 Fault Tolerance in Hadoop/MapReduce
    3.4.3 Cost Efficiency
    3.4.4 Simplicity
  3.5 Limitations of Hadoop/MapReduce
  3.6 Apache's ZooKeeper
    3.6.1 ZooKeeper Data Model
    3.6.2 ZooKeeper Guarantees
    3.6.3 ZooKeeper Primitives
    3.6.4 ZooKeeper Fault Tolerance
  3.7 Related Work

4 Availability Model
  4.1 JobTracker Availability Model
    4.1.1 Related Work
  4.2 Model Assumptions
  4.3 Markov Model for a Multi-Host System
    4.3.1 The Parameter λs(t)
  4.4 Markov Model for a Three-Host (N = 3) Hadoop/MapReduce Cluster Using ZooKeeper as Coordinating Service
  4.5 Numerical Solution to the System of Differential Equations
    4.5.1 Interpretation of the Availability Plot of the JobTracker
  4.6 Discussion of Results
    4.6.1 Sensitivity Analysis

5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work

Appendix
  Appendix A: Differential Equations for Boundary Conditions
  Appendix B: Differential Equations for a Cluster of Four Servers (N = 4)
  Appendix C: Differential Equations for a Cluster of Five Servers (N = 5)
  Appendix D: MATLAB Solution to the (N = 3) System of Kolmogorov Differential Equations
  Appendix E: How to Set Up Hadoop/MapReduce on Ubuntu 11.04
    Single-Node Hadoop Cluster
    Multi-Node (N = 3) Hadoop Cluster
    Submitting a Word Count Job to the Hadoop/MapReduce Cluster
  Appendix F: How to Install and Run ZooKeeper on Ubuntu 11.04
    Deploying a ZooKeeper Ensemble on a Single Machine
    Deploying a ZooKeeper Ensemble across a Network

References

List of Tables

1.1 Outages in different cloud services [7]

4.1 Values of k for multi-host differential equations
4.2 Probabilities of the first solution to the ten differential equations
4.3 Probabilities (Continued)


List of Figures

1.1 Data growth at Facebook [3]

2.1 Domino effect for two processes
2.2 Livelock
2.3 N-modular redundancy voter system (N is odd)
2.4 Redundant voter system

3.1 The Hadoop Framework
3.2 Google's implementation of MapReduce [4]
3.3 Automatic Parallelism of MapReduce job
3.4 Zookeeper hierarchical namespace [24]
3.5 Zookeeper Leader Election Service [25]
3.6 JobTracker state transition diagram [28]

4.1 N-hosts HDSHS general model [33]
4.2 The proposed cluster
4.3 Markov model for N = 3
4.4 Availability plot of the JobTracker
4.5 Availability of a k Fault Tolerant system
4.6 The effect of i on Availability
4.7 The effect of Initial Fault on Availability
4.8 Effect of λ on Availability
4.9 Effect of λh on Availability
4.10 Effect of µh on Availability
4.11 Effect of µs on Availability
4.12 Effect of N on Availability

5.1 Model for P′0,0(t)
5.2 Model for P′i,0(t)
5.3 Model for P′0,N(t)
5.4 Model for P′N,0(t)
5.5 Model for P′0,j(t)
5.6 Markov State Diagram for a Cluster of Four Servers
5.7 Markov State Diagram for a Cluster of Five Servers
5.8 MATLAB Solution to the Differential Equations
5.9 Starting Hadoop
5.10 Submitting a job to the Hadoop cluster
5.11 Output of WordCount job
5.12 Zookeeper ensemble


Abstract

Due to the growing demand for Cloud Computing services, the need for and importance of Distributed Systems cannot be overstated. However, it is difficult to use the traditional Message Passing Interface (MPI) approach to implement synchronization and coordination, and to prevent deadlocks, in distributed systems. This difficulty is lessened by the use of Apache's Hadoop/MapReduce and ZooKeeper to provide Fault Tolerance in a Homogeneously Distributed Hardware/Software environment.

In this thesis, a mathematical model for the availability of the JobTracker in Hadoop/MapReduce using ZooKeeper's Leader Election Service is examined. Though the availability is less than what is expected of a k Fault Tolerant system for higher values of the hardware failure rate, this approach makes coordination and synchronization easy, reduces the effect of crash failures, and provides Fault Tolerance for distributed systems.

The availability model starts with a Markov state diagram for a general case of N ZooKeeper servers, followed by specific cases of 3, 4, and 5 servers. Both software and hardware faults are considered, in addition to the effect of hardware and software repair rates. Comparisons show that the system availability changes with the number of ZooKeeper servers, with 3 servers having the highest availability.

The model presented in this study can be used to decide how many servers are optimal for maximum availability and from which vendor they must be purchased. It can also help determine when to use a ZooKeeper-coordinated Hadoop cluster to perform critical tasks.


Chapter 1

Introduction

The effectiveness of most modern information (data) processing involves the ability to process huge datasets in parallel to meet stringent time constraints and organizational needs. A major challenge facing organizations today is the ability to organize and process the large data generated by customers. According to Nielsen Online [1] there are more than 1,733,993,741 internet users. How much data these users are generating, and how it is processed, largely determines the success of the organization concerned. Consider the social networking site Facebook: as at August 2011, it had over 750 million active users [2] who spend 700 billion minutes per month on the network. They install over 20 million applications every day and interact with 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) each month. Since April 2010, when social plugins were launched, an average of 10,000 new websites have integrated with Facebook. The amount of data generated in Facebook is estimated as follows [3]:

• 12 TB of compressed data added per day

• 800 TB of compressed data scanned per day

• 25,000 map-reduce jobs per day

• 65 million files in HDFS

• 30,000 simultaneous clients to the HDFS NameNode

It was a similar demand to process large datasets at Google that inspired its engineers to introduce MapReduce [4]. At Google, MapReduce is used to build the index for Google Search, to cluster articles for Google News, and to perform statistical machine translation. At Yahoo!, it is used to build the index for Yahoo! Search and for spam detection. At Facebook, MapReduce is used for data mining, ad optimization, and spam detection [5]. MapReduce is designed to use commodity nodes (it runs on cheaper machines) that can fail at any time. Its performance does not reduce significantly due to


Figure 1.1: Data growth at Facebook [3]

network latency. It exhibits high fault tolerance and is easy to use by programmers who have no prior experience in parallel programming. Apache's Hadoop [6] is an open source implementation of Google's MapReduce. It is made up of MapReduce and the Hadoop Distributed File System (HDFS). A client submits a Job to the Master node. The Master node consists of the NameNode and JobTracker daemons, running on a single machine or different machines depending on the size of the cluster. The JobTracker distributes the client job to selected slave machines that are running the TaskTracker and DataNode daemons. Each slave node must periodically send a heartbeat signal to the JobTracker machine. If the Master does not receive a heartbeat signal from a slave, it assumes the slave is down and must consequently re-schedule the task assigned to the dead slave node to another node that is idle. Hadoop/MapReduce makes it possible to process huge datasets in good time compared to other platforms such as Database Management Systems; slow improvements in drive seek time make applications that need to analyze a whole dataset for batch processing experience high latency in a DBMS.

1.1 Problem Statement

The availability of cloud computing services can be enhanced if proper Fault Tolerance mechanisms are implemented in the Data Centers. Cloud reliability problems can have serious consequences for both the provider and the customer when time is money. Every year, many cloud service providers battle with service outages (Table 1.1). A major concern is how to minimize service


Table 1.1: Outages in different cloud services [7]

down-time for a cloud provider such as Amazon or Facebook that has thousands of clients connected at any given point in time. Hadoop/MapReduce was developed to achieve maximum performance, high fault tolerance, availability, and transparency as much as possible. However, these objectives can be elusive if the following issues are left unattended:

1. Hadoop/MapReduce is currently implemented as a Master-Slave architecture; this makes both the Hadoop Distributed File System Master node (NameNode) and the MapReduce Master node (JobTracker) single points of failure. The failure of a Slave node (DataNode or TaskTracker) does not pose a serious challenge, since the Master node simply re-assigns the tasks that were to be processed by the failed node to another node. This implies that the failure of either the JobTracker or the NameNode makes the service unavailable until they are up and running again. However, Hadoop provides a standby NameNode implementation called the AvatarNode, which can be run as a Primary avatar or a Secondary avatar. During failover, for instance, the Primary AvatarNode on machine M1 is killed and the Standby AvatarNode on machine M2 is instructed to assume Primary avatar status. This is practically instantaneous, making the recovery a matter of a few seconds. Failure of the JobTracker machine, on the other hand, makes the service unavailable until it is restarted. Between the time of failure and restart, clients must be made to wait, which is undesirable.


2. How available is the solution proposed for problem 1 above?

Our concern, then, is how to make the cluster available when the active JobTracker goes down, and also to determine mathematically how much availability the JobTracker has. This is needed to avoid unnecessary down-time.

1.2 Objectives

The major objective of this Thesis is to determine the availability of a proposed automatic fail-over (recovery) mechanism to address the issue of the Hadoop/MapReduce JobTracker being a single point of failure. This implementation is based on the Leader Election Framework mentioned in ZooKeeper [8]. That is:

• Providing an automatic failover mechanism for the JobTracker.

• Maintaining only one active JobTracker in the cluster.

• Letting only the active JobTracker serve JobClients and TaskTrackers.

• Facilitating redirection for JobClient and TaskTracker to the new active JobTracker.

• Determining the Availability of the JobTracker.

• Determining how sensitive the Availability of the JobTracker is to changes in model parameters.

1.3 Thesis Organization

This Thesis is organized as follows: Chapter One introduces the topic under discussion and defines the problem at stake; it also clarifies what this thesis aims to achieve. Chapter Two starts with a literature review on concepts of Cloud Computing and Fault Tolerance; these are areas where Availability is vital for high performance. Chapter Three introduces Hadoop/MapReduce and ZooKeeper, which are used to implement the proposed cluster. Chapter Four is a mathematical model of the cluster proposed in Chapter Three; the model is aimed at determining how available the cluster is. The work is concluded in Chapter Five, and proposed future areas of interest are given.


Chapter 2

Cloud Computing and Fault Tolerance

2.1 Cloud Computing

Cloud Computing involves the provisioning of Information Technology related capabilities as a Service through the Internet [9]. Public Clouds generally provide everything as a service (XaaS): Platform as a service (PaaS), Infrastructure as a service (IaaS), Hardware as a service (HaaS), Development/Database/Desktop as a service (DaaS), Organization as a service (OaaS), Business as a service (BaaS), Storage as a Service (SaaS), and Software as a Service (SaaS). Users of cloud services stand to benefit from numerous advantages, among which are low in-house cost of hardware due to low capital investment, low use-based cost, high speed of deployment, access to the latest technology in terms of hardware and software, and the use of non-pirated software, amongst others. However, security of user data remains a major concern to users, although most cloud providers have stringent measures in place to protect user information. According to Google [10], there are six key properties of Cloud Computing:

1. User-centric: Once a user logs on to the cloud, whatever is available there belongs to the user. Services on the cloud are also tuned to satisfy the computing requirements of most users, and improved to the extent that individual pages are customized to conform to the taste of the user (e.g. Amazon EC).

2. Task-centric: When a user is in the Cloud, s/he need not worry about which application can better serve their purpose; that is a task for the cloud provider. The user must only worry about how to accomplish their task effectively. This implies that focus is on the user's task and not on applications (e.g. spreadsheets, email, etc.), which are becoming less important than the documents they create.


3. Cloud Computing is Powerful: Harnessing the resources of thousands of computers connected together in the cloud yields a wealth of power that is difficult to provide with a single supercomputer, not to mention a single desktop computer. The processing power of this collection is enormous.

4. Accessibility: The resources of thousands of computers are made available to the user through the cloud. Users can have access to their data 24 hours a day from wherever they are.

5. Intelligence of the Cloud: With all the numerous computers in the cloud, information can be retrieved intelligently by means of data mining analysis.

6. Cost Effectiveness: Cost effectiveness means that cloud computing is less expensive than solutions deployed in traditional data centers, where hardware, software, and human resources have to be maintained.

2.2 Types of Clouds

When a cloud provides utility services in a pay-as-you-go manner to users, it is referred to as a public cloud (e.g. Amazon EC, Google App Engine, and Windows Azure). In private clouds, the data centres are for internal organizational use only and are not made available to the public. Data and processes are managed within the organization, and the restrictions on bandwidth, legal requirements, and security are limited in a private cloud. In Hybrid Clouds, the environment consists of multiple internal and/or external service providers [7]. The services provided by the cloud are called SaaS, while the data centre hardware and software together form the cloud. Companies (users or SaaS providers) can deploy SaaS on the cloud without necessarily owning a data centre; in so doing, it becomes the responsibility of the data centre owner (Cloud Provider) to ensure auto-scaling of these services. Most of the computing applications used today provide a great opportunity for the growth of cloud computing [11]. Mobile interactive applications need services that must respond in real time, as provided by the parallel batch processing found in cloud infrastructures. The rise of business analytics is shifting computing priority from compute-intensive processing to business analytics, where computing resources are used to study customer behaviour trends to help organizations take decisions towards satisfying their customers.

For all the advantages and opportunities for the growth of cloud computing, Michael Armbrust et al. [11] list a number of obstacles that will potentially hinder its growth. These include availability of service at all times, user data lock-in that makes it difficult for users to


migrate their data from one cloud platform to another, data confidentiality, auditability and security, data transfer bottlenecks due to the high cost of bandwidth, scalable storage, performance unpredictability, and software licensing issues, among others.

2.3 Virtualization in the Cloud

Virtualization abstracts the logical coupling between hardware and the Operating System [7]. Server virtualization is an example of virtualization that can help improve agility and flexibility in cloud environments.

2.3.1 Advantages of virtualization

Virtualization saves cost and is also good for program development. However, it has problems with security, scalability, and bleed-over, where the contents of one server affect another.

2.4 Fault, Error and Failure

A major concern in cloud computing is the ability to satisfy customers by ensuring that faults are handled transparently to the user. The cloud must be available, scalable, secure, reliable, and interoperable, and must have low latency, among other qualities. To build a good cloud, the above qualities must be achieved together with an effective fault tolerance system. According to V. Cortellessa et al. [12], Laprie defines failure, error and fault in the following terms:

A system failure occurs when the delivered service deviates from fulfilling the system function, the latter being what the system is intended for. An error is that part of the system state which is liable to lead to subsequent failure; an error affecting the service is an indication that a failure occurs or has occurred. The adjudged or hypothesized cause of an error is a fault.

The behaviour of a given system is known; when the system deviates from this known behaviour, we say a failure has occurred in the system. This failure is itself caused by an error in the system. Per the system state specifications, if a system assumes an invalid state, one that is not found in its specifications, we say an error has occurred in the system. This error is in turn caused by a system fault or defect. An error therefore signifies the presence of a fault. Practically, system faults are a result of one or many of the following: network failures, application bugs, human errors, operating system bugs, disk failures, memory failures, power supply failures, malicious inputs [13], and loose connections. The presence of errors or faults does not


necessarily imply system failure [14], and a fault may (or may not) result in (multiple) errors. An active fault is one that eventually results in an error, and the time between the occurrence of the fault and its initial activation as an error is called its latency.

2.4.1 Fault Types

System faults may fall under one of the following categories based on:

1. Duration of the Fault:

• Permanent Faults: The fault stays until the system is repaired. It is the easiest fault to diagnose and repair. A common method of resolving this fault after diagnosis is to replace the faulty component with a working one.

• Intermittent Faults: The fault dies away and reappears with time. This may be the result of a loose connection, among other causes. It is the most difficult form of fault to diagnose.

• Transient Faults: After a given time period, the fault dies away. Such faults are mostly caused by a combination of environmental factors such as temperature, radiation, and so on.

2. Cause of the Fault:

• Design Faults: These are faults that are caused by a flaw in the design of the system. As much as we want to design an efficient system, we cannot attain 100% efficiency, thus leaving room for errors. For instance, it is not possible to develop 100% foolproof, reliable software; no matter how many resources we put into the development process, there is always the possibility that something can go wrong. One way to deal with such faults is to design the system in such a way that it will be able to tolerate such faults should they occur.

• Operational Faults: These are a result of physical causes such as human operator mistakes, disk or processor errors, etc.

3. Behaviour of Faulty Components:

• Partition Faults: These occur when two or more processes that are supposed to communicate with each other are not able to do so due to a break in the connection that links them. This break may be a result of network congestion or a broken communication wire.

• Omission Faults: The component fails completely to perform the functions assigned to it. This fault may be a result of another fault, such as a design fault.


• Timing Faults: The component does not meet its time requirements for completing a service. This may lead to so many time-outs that the efficiency of the system is effectively reduced (assuming a TCP implementation, many re-transmissions will occur). Such faults are not good for mission-critical tasks, as they will always cause disaster.

• Byzantine Faults: These are faults of an arbitrary nature that corrupt data. They have the potential to cause erroneous output for the entire system if not detected and corrected early. This type of fault can be masked to avert its negative effect.

2.5 Fault Tolerance

Fault tolerance is the property of a system which enables it to provide services in the presence of faulty components [15]. Depending on the implementation, services provided under faulty conditions may be at a reduced level in terms of latency; however, efficiency and availability cannot be compromised under any circumstances. In a fault-tolerant cloud, errors are bound to occur, but the occurrence of such errors must not corrupt data, time the service out, or degrade performance.

2.5.1 Fault-tolerance Properties

Error Detection

Error Detection is the first and most important step in the fault-tolerance process. The system must be able to determine that an on-going activity is not part of the system specifications, and hence can be classified as an error. Replication is one technique used to detect errors: copies of the same job are passed through a common exit and their outcomes compared; should there be differences between the outputs, an error is indicated. In situations where the outputs are from more than two sources, the verification is done through voting.

Another error detection mechanism is the use of time-outs. The time within which a task must be completed is known, and a module carrying out this task is monitored. Should the allocated time expire without the module producing any output, an error is suspected as the cause. A variant of this technique is the watchdog timer [13].

Diagnosis is yet another error detection mechanism. It is applied to an entire system; due to cost and time limitations it is not advisable to apply it to individual modules in the system. Tests whose results are already known are fed into the system, and the outcome of the system is compared to the known results to determine any differences.
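As a concrete illustration of time-out based detection, the following minimal Java sketch runs a task under a known deadline and treats a deadline overrun as a suspected error. It uses only the standard java.util.concurrent library; the class and method names are illustrative, not taken from the thesis.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public final class TimeoutDetector {

        // Runs the task; a deadline overrun is reported as a suspected error.
        public static <T> T runWithDeadline(Callable<T> task, long deadlineMillis)
                throws Exception {
            ExecutorService executor = Executors.newSingleThreadExecutor();
            Future<T> result = executor.submit(task);
            try {
                // Block until the module produces output or the deadline expires.
                return result.get(deadlineMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                result.cancel(true); // the overrun signals a suspected error
                throw new IllegalStateException("module exceeded its deadline", e);
            } finally {
                executor.shutdownNow();
            }
        }
    }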


Figure 2.1: Domino effect for two processes

Error Recovery

Recovery is the basis for fault-tolerance algorithms. Once a fault has been detected, the ability to recover from the fault without loss or corruption of data is an essential part of a fault-tolerance system. Fault-tolerance systems can perform either backward or forward recovery, depending on the requirements of the system.

1. Backward Recovery (Fail-stop) or Checkpoint/Restart: Most of the fault-tolerance algorithms mentioned above operate on the basic principle of checkpoint/restart. In checkpoint/restart algorithms, consistent successful states of the system are saved in history to enable roll-back to the last successful state in case a fault occurs in the system. Backward recovery systems are transaction based, implying that their operations are atomic (they either complete or fail totally). Completed operations are saved onto persistent storage to allow the server to recover according to an all-or-nothing semantics. There are two main methods of check-pointing (local and global). In the first method, each process takes checkpoints independently and commits the outcome to permanent storage; in cases where one or more of the processes fail, they need to communicate to find a consistent state to which they can all roll back. In the second method, every process needs to communicate with all other processes before taking a checkpoint; this ensures consistency in the system at any point in time. Lack of synchronization in local checkpoint creation may result in inconsistent roll-backs causing Livelock or the Domino effect. Livelock (Figure 2.2) and the Domino effect (Figure 2.1) are described by Priya Venkitakrishnan [16] as follows:

When process P1 fails at point k, it rolls back to checkpoint


Figure 2.2: Livelock

chp13, but process P2 has a record of the receipt of the message whose effect has been undone in P1. So P2 will also roll back, to chp22, to undo the effect of the undone message; this leaves P1 with an undone process, which triggers P1 to roll back again, and the process keeps repeating until both processes roll back to their initial checkpoints, resulting in a consistent global state. This Domino effect causes unnecessary delays in the total completion time of the system.

Livelock is caused in systems where there is no synchronization between processes during check-pointing. If a process fails and rolls back to its checkpoint, and then requests all affected processes to follow suit, a livelock can occur. Consider the following scenario:

(a) P1 sends a message m1 and fails before it receives the reply message m2 from P2.

(b) P1 rolls back to its recent checkpoint and recovers; then it receives m2 from P2 and sends m3 to P2.

(c) P1 has no record of sending m1, while P2 has a record of receiving m1. Hence P2 also rolls back and notifies P1.

(d) The global state is inconsistent, as P2 now has no record of sending m2 while P1 has a record of its receipt.

Under this circumstance, the processes may be forced to roll back forever even though they may not be faulty.

Preventing livelock and the domino effect: Priya Venkitakrishnan [16] describes an algorithm by Richard Koo and Sam Toueg by which processes can be synchronized to avoid the above-mentioned problems. In this


Figure 2.3: N-modular redundancy voter system (N is odd)

algorithm, processes are made to send messages, and one of the processes serves as the coordinator. The coordinator takes a checkpoint and requests all other processes to do the same. Each of these processes sends its decision on whether or not it accepts to take a checkpoint. Based on their responses, the coordinator decides either to keep its checkpoint or to discard it. The coordinator then sends its final decision to the rest of the processes, which in turn use it to determine whether to discard or keep their checkpoints. However, checkpoint/restart is becoming less efficient due to increasing error rates and memory sizes that are not matched by a corresponding increase in I/O capabilities [17]. A possible solution to this problem is the use of process migration in proactive fault-tolerance implementations. Proactive fault tolerance uses the concept of preventive maintenance to avoid the overhead incurred by restarts from consistent states.

2. Forward Recovery: Backward recovery, also called fail-stop, has a limitation in that it is not able to meet the strict timing requirements which are characteristic of real time systems. In forward recovery, the error is masked by the outputs of other components in a redundancy system running parallel components. The outputs of all the components are voted on, and the majority is taken as the correct output (Figure 2.3). This enables the system to neglect/mask a component that produces an erroneous output.

3-modular redundancy is the most commonly used. It is possible that a voter can fail; to avoid this scenario, redundant voters can be provided to mask the effect of a single voter failure (Figure 2.4). However, voting takes time.
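To make the voting step concrete, the following minimal Java sketch implements the voter of an N-modular redundancy scheme (N odd): it collects the outputs of the N replicated components and returns the value produced by a strict majority, thereby masking up to (N − 1)/2 faulty components. The class and method names are illustrative only.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    public final class MajorityVoter {

        // Returns the majority output, or empty if no strict majority exists.
        public static <T> Optional<T> vote(List<T> outputs) {
            Map<T, Integer> counts = new HashMap<>();
            for (T output : outputs) {
                counts.merge(output, 1, Integer::sum);
            }
            int majority = outputs.size() / 2 + 1; // strict majority threshold
            return counts.entrySet().stream()
                    .filter(e -> e.getValue() >= majority)
                    .map(Map.Entry::getKey)
                    .findFirst();
        }

        public static void main(String[] args) {
            // Three replicas; the single erroneous result (17) is masked.
            System.out.println(vote(List.of(42, 42, 17))); // prints Optional[42]
        }
    }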

2.5.2 K Fault Tolerant Systems

A system is k fault tolerant if it can tolerate k faulty components and still meet its correct output specifications. For a fail-stop fault tolerant system, k fault


Figure 2.4: Redundant voter system

tolerance can be provided if we have k+1 components; in other words, if all k components are faulty, the output from the one component left can be used. To address crash failures, 2k+1 components are required to achieve k fault tolerance.
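As a worked illustration of these counts: to tolerate k = 2 faults, a fail-stop system needs 2 + 1 = 3 components, since the single surviving component can still supply the output, while masking two crash failures by majority voting requires 2(2) + 1 = 5 components, so that the three correct outputs outvote the two faulty ones.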

2.5.3 Hardware Fault Tolerance

Real time clouds must function well under hardware fault conditions. This is made possible by several techniques that are used to reduce the effect of hardware faults on the system. Among the numerous techniques, redundancy is the most widely used: it is the provision of multiple identical instances of the same system and the ability to switch to one of the remaining instances in case of a failure (fail-over). Types of Redundancy:

• Hardware redundancy: Spare parts are provided for hardware components that are capable of failure. Failed components are replaced by their spares, with system functionality receiving little or no interruption.

• Time redundancy: The timing of the system is allotted such that there is enough time to perform retrials or re-executions on failed components before timing out. However, such systems may not be viable for time-critical tasks, since they tend to use too much time performing the same operation in case of failures, thereby impacting adversely on latency.

• Software redundancy: The N-versions software technique is a form of software redundancy. It must be noted that running N copies of the same software on N processors does not provide any form of software


redundancy, since any error will be propagated through the N copies of the software.

• Data redundancy: Data can be replicated and copies stored on different servers or clients in different locations. This ensures that when one or many copies of the data get corrupted, there is still a chance of finding intact copies of the same data. This technique is widely used in distributed and cloud computing systems (for example, the Google File System [18, 19]) to protect client data from getting lost.

Redundancy may be implemented in one of the following ways:

• One-for-One Redundancy: A single spare instance of the hardware component is maintained and made to monitor the component at all times; as soon as it senses a failure in the working component, it is activated to take over from the failed component. This technique maintains a higher level of availability; however, each component must exist in duplicate, increasing the cost of hardware.

• N+X Redundancy (where N > X): X components are used to back up N functional components. A higher-level module is made to monitor the health of the system; in case of any failure, the higher-level component selects one of the X components to replace the failed component. One-for-One Redundancy is a special case of N+X redundancy. Hardware cost is minimized in this case, since we do not need each individual component to be backed up. However, in case of multiple failures, the system becomes less available.

• Load Sharing: A higher-order module distributes tasks/load to individual sub-components to perform; it also monitors the health of the system. Once a fault occurs and a given component ceases to function well, the monitoring component starts to distribute the load to the rest of the functional components. Extra hardware cost is not an issue, since no backup components are deployed.

• Fault Masking: A component or circuit can be triplicated and all three copies allowed to execute the same function and produce results, which are voted on to eliminate (mask) errors created by a faulty component. A clear deterrent is the cost of triplication; otherwise it makes the system very reliable.

2.5.4 Software Fault Tolerance

This is the ability of software to tolerate design faults. Software faults may be in the form of bugs or programmer errors. Techniques commonly employed to handle such faults include the following:


• Recovery Blocks: Multiple implementations of the same algorithm are written in the form of primary, secondary, and exception handler code, together with a piece of code (the coordinator) that determines which of the implemented algorithms should be executed. On entering a unit, the coordinator first executes the primary alternate among N alternates. If the primary block fails, the coordinator tries to roll back the state of the system and tries the secondary alternate. Should the coordinator refuse to accept the results of any of the alternates, it then invokes the exception handler (a sketch of this pattern appears at the end of this section).

• N-Version Programming: Different versions of the same software are used to execute a task. The results of their outputs are voted on; this enables the system to mask faults that occur in the system [12]. N-Version programming can be implemented in one of two ways:

1. All N versions run in parallel and voting is performed on their outputs.

2. Only one version out of the N versions runs; when it fails, another version takes over after recovery.

It is worth emphasizing that replicating the same software N times and using it on N processors does not provide any form of software redundancy, since any error will be propagated through the N copies of the software.

Preventive maintenance, also called Proactive Fault Tolerance, has the advantage of preventing future occurrences of faults through routine maintenance procedures. This approach saves the organization precious time and cost.
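The recovery-block pattern mentioned above can be sketched in a few lines of Java. This is a minimal illustration under assumed names (RecoveryBlock, the alternates list, the acceptance test, and the rollback action are all hypothetical), not the implementation of any particular library: the coordinator tries the alternates in order, validates each result with the acceptance test, rolls back the state after a rejected or failed alternate, and falls through to the exception handler when all alternates are exhausted.

    import java.util.List;
    import java.util.function.Predicate;
    import java.util.function.Supplier;

    public final class RecoveryBlock<T> {

        private final List<Supplier<T>> alternates;  // primary first, then secondaries
        private final Predicate<T> acceptanceTest;   // validates each alternate's result

        public RecoveryBlock(List<Supplier<T>> alternates, Predicate<T> acceptanceTest) {
            this.alternates = alternates;
            this.acceptanceTest = acceptanceTest;
        }

        public T execute(Runnable rollback) {
            for (Supplier<T> alternate : alternates) {
                try {
                    T result = alternate.get();
                    if (acceptanceTest.test(result)) {
                        return result;        // result accepted by the acceptance test
                    }
                } catch (RuntimeException ignored) {
                    // fall through: restore state and try the next alternate
                }
                rollback.run();               // roll back to the saved consistent state
            }
            // All alternates rejected or failed: the exception handler takes over.
            throw new IllegalStateException("all alternates failed");
        }
    }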

2.6 Properties of a Fault Tolerant Cloud

2.6.1 Availability

Availability of a system ensures that access to data is consistent 24 hours a day, 7 days a week. There are two main variants of availability: planned and unplanned availability.

In planned availability, the focus is on continuous operations. For example, should a cloud be scheduled for maintenance, workload shifting can be done by starting instances of servers that are shut down on other machines, to provide continuous service to clients. Load balancing can also be employed to reduce cloud latency by eliminating excessive load on a given server: new server instances are started and the load is shared among them (the opposite is also true; when the load reduces, idle server instances can be killed and the load shared among the remaining machines; this technique is called Autoscaling).

Unplanned availability, also called high availability, focuses on


fault-tolerance, recovery from disaster, and data integrity. Availability solutions partially solve security problems, since one of the major security concerns is the availability of company information, which can be replicated, recovered, and restored during disasters. The percentage availability is given by:

\%\text{Availability} = \frac{\text{System Usage Time}}{\text{Scheduled Time}} \times 100

where
System Usage Time = Scheduled Time − System Downtime,
Scheduled Time = 24 hours per day, and
System Downtime = time used for maintenance and repair of major failures that are not transparent to the user.
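As a worked illustration (the downtime figure is assumed for the example, not taken from the thesis): a scheduled day of 24 hours (1440 minutes) with 30 minutes of downtime gives

\%\text{Availability} = \frac{1440 - 30}{1440} \times 100 \approx 97.92\%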

2.6.2 Reliability

In [15], Dimitris N. Chorafas defines reliability as:

The probability that a given system will operate without failure over a predetermined time period, under environmental and operational conditions defined in advance.

When a system fails, it must fail well; in other words, the failure of the system must not impact adversely on its performance, and for cloud environments, this failure must be transparent to the user in terms of performance. Reliability also speaks of the intrinsic failure rates of the system hardware and will impact availability, but it remains only a fractional variable in the entire availability solution. Planning the availability of a system will include reliability as a factor. According to the Weibull distribution [20], reliability based on weapon system research [15] is given by:

R = e^{-t/T}

where
T = mean time between failures (MTBF), which is based on statistics (facts, not hypotheses),
t = projected operational time for which reliability is computed, and
e = the base of the natural (Naperian) logarithm.

The constituents of system reliability include hardware, software, and operational factors. Cloud providers are well aware that anything less than 99.99% reliability is not good for providing cloud services.
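As a worked illustration (the figures are assumed for the example): for hardware with an MTBF of T = 1000 hours, the reliability over a projected operational time of t = 24 hours is

R = e^{-24/1000} = e^{-0.024} \approx 0.976

that is, roughly a 97.6% probability of operating without failure for the day.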


2.6.3 Scalability

Scalability allows cloud users to increase or decrease their usage on demand. Performance, throughput, and efficiency must not suffer due to an increase in demand, or vice versa. It would be a waste of resources to introduce a service with thousands of servers at a go; the optimal solution is to start with a few resources and allow the system to scale up as demand grows. Auto-scaling is the ultimate goal for public cloud providers. The ability to increase or decrease usage of available resources in step with the respective increase or decrease in user demand is a major advantage for cloud providers.


Chapter 3

Hadoop/MapReduce Architecture

3.1 Hadoop/MapReduce

Hadoop is an open source framework which allows parallel programmers to write and run distributed applications that process large amounts of data. Hadoop is hosted by the Apache Software Foundation, which provides open source software projects such as [21]:

• Apache HTTP Server: Used for web applications that make use of an HTTP server

• Avro: A serialization system for cross-language Remote Procedure Calls and persistent data storage

• MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines

• HDFS: A distributed file system that runs on large clusters of commodity machines

• Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

• Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data

• HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads)


Figure 3.1: The Hadoop Framework

• ZooKeeper: A distributed, available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications

Hadoop is implemented with the capability to run MapReduce programs written in Java, C++, Python, Ruby, etc. on top of the Hadoop Distributed File System (HDFS). MapReduce is used to divide user applications into small blocks, which are then replicated and stored on multiple nodes in the cluster by means of HDFS. These applications are executed in parallel by MapReduce on individual nodes that are assigned a map or reduce task. Hadoop is made up of several daemons, which either exist on the same machine or on different servers, depending on the mode in which Hadoop is being implemented. These daemons are the major constituents of the Hadoop framework, as shown in Figure 3.1. We explain the functions of each of the daemons below.

NameNode: Hadoop has a Master-Slave architecture and uses a distributed file system called the Hadoop Distributed File System (HDFS). The NameNode is the Master of the distributed file system. Its task is to direct the slave DataNode daemons on how to carry out low-level input/output (I/O) tasks. The NameNode monitors and controls the storage, use, and health of the HDFS in a cluster (a typical cluster contains 4000 machines). It keeps track of the file metadata: which files are currently in the system and how each file is broken down into file blocks. The task of the NameNode is I/O and memory intensive; hence the machine on which it resides does not double as a DataNode and a NameNode at the same time.

Secondary NameNode: Each Hadoop cluster contains one Secondary


NameNode that serves as an assistant daemon to the NameNode. Its function is to constantly monitor the NameNode and keep a checkpoint of HDFS metadata at given intervals. When a NameNode fails, the file system can be recovered from the checkpoints of the Secondary NameNode (recovery is not automatic; the cluster needs to be reconfigured in order to use the Secondary NameNode as the primary NameNode).

DataNode: The DataNode daemon is hosted by each slave machine in the cluster. It reads and writes the HDFS blocks to actual files on the local file system. If a client wants to read or write an HDFS file, it must ask the NameNode for the location (i.e., which DataNode is hosting that file block), and then communicate directly with that DataNode. It is also possible for DataNodes to communicate with each other when replication of data is needed for redundancy. Upon initialization, each DataNode reports the status of its file system blocks to the NameNode; after that, the DataNode constantly polls the NameNode to inform it of new changes and to receive instructions on how to proceed with delete, modify, or write instructions that affect local persistent storage.

JobTracker: The JobTracker daemon works in a Master-Slave architecture. It is a single point of failure in a typical Hadoop framework. Client jobs are submitted to the JobTracker (as jar files). It is the responsibility of the JobTracker to determine which job must be executed first (the default uses FIFO) and which task to assign to which machine, and it must monitor the progress of the execution on each of the slave machines. Should a TaskTracker daemon currently executing a job fail, the JobTracker needs to launch an instance of a TaskTracker on a different machine and resubmit the failed job for execution. This is transparent to the user and must also auto-scale.

TaskTracker: The TaskTracker daemon oversees the execution of the individual tasks handed over to it by the JobTracker. Although there is a single TaskTracker per slave machine, each TaskTracker can spawn multiple Java Virtual Machines (JVMs) to handle many map or reduce tasks in parallel [22]. The JobTracker and TaskTracker machines together are Hadoop's implementation of MapReduce.

3.2 MapReduce

There are many different possible implementations of MapReduce. Google [4] implements MapReduce according to Figure 3.2 as follows:

1. The user job contains a MapReduce library that splits the input files into 64 MB pieces (this size can be controlled by the user via an optional parameter). It then spawns many copies of the program on different machines in the cluster.


2. One of the copies of the program is made the Master, and the others are made slaves that will be assigned work by the Master. There are M map tasks and R reduce tasks that the Master must assign to (idle) slaves.

3. A slave that is assigned a map task must read the input (key, value) pairs and pass them to the corresponding user-defined Map function. The intermediate key/value pairs (the output of the Combiner) are saved in memory.

4. Periodically, the saved key/value pairs are written to permanent storage (local disk) that is partitioned into R regions by the partitioning function. The locations of these pairs on the disk are passed back to the Master, enabling the Master to forward these locations to the reduce workers.

5. When a reduce worker is notified by the Master of the locations of the key/value pairs, it uses remote procedure calls (RPC) to read the pairs from the disk, after which it sorts and groups all the values that belong to the same key. Should the amount of data be too large to fit into memory, an external sort must be used.

6. For each unique intermediate key, the reduce worker iterates over its sorted values and passes the key and its corresponding values to the user-defined Reduce function. The output of the Reduce function is appended to a final output file for that particular reduce partition.

7. When all Map and Reduce tasks complete, the Master wakes up the user process, and the MapReduce call in the user program returns to the user code.

3.3 Hadoop/MapReduce versus other Systems

3.3.1 Relational Database Management Systems (RDBMS)

Due to slow improvements in drive seek-time, applications that need to analyze a whole dataset for batch processing tend to experience high latency in a DBMS. MapReduce operates like a batch query processor, with the ability to run a query on the whole dataset and obtain results in good time. However, for an indexed dataset, a DBMS performs better than Hadoop/MapReduce. MapReduce is suitable for situations where data is written once and read many times, whereas a DBMS is good for datasets that are continually updated. MapReduce works well on semi-structured and unstructured datasets, since it is the person analyzing the data who chooses


Figure 3.2: Google’s implementation of MapReduce [4]

the key/value pairs. An RDBMS only works well on structured datasets (structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema [21]). In the near future, the line between MapReduce and DBMS will be blurred, since RDBMSs will start incorporating some features of MapReduce and vice versa (e.g., Pig and Hive are built on MapReduce and are able to use some features of a DBMS).

3.3.2 Grid Computing

High Performance Computing (HPC) and Grid Computing use the Message Passing Interface (MPI) to perform large-scale data processing. The critical resource in such platforms is network bandwidth. HPC tries to distribute the job on a cluster of machines which access a shared file system on a Storage Area Network (SAN). For a compute-intensive job everything goes well, but for data-intensive jobs, bandwidth becomes a limitation. Hadoop/MapReduce solves this problem by localizing the data on the computing nodes to avoid the use of unnecessary bandwidth. In contrast to MapReduce, programmers have total control over MPI applications: their programs control and manage their own checkpointing and recovery, which makes them more complex to write than MapReduce programs. The complexity of writing MPI applications is manifested in that the programmer is


Figure 3.3: Automatic Parallelism of MapReduce job

required to implement fault tolerance algorithms, whereas such algorithms are already implemented in the Hadoop/MapReduce framework.

3.3.3 Volunteer Computing

Volunteer Computing is an approach in which volunteers make their CPU time available during idle computer periods to analyse data. The problem being solved is broken down into chunks called work units, which are sent out into the world for idle volunteered CPUs to process. A few megabytes of work may take a typical home computer hours or days to analyze, after which it returns the results to the server. Volunteer Computing has no data locality like Hadoop/MapReduce; it also performs its tasks on un-trusted, non-dedicated machines, which may number as many as 50,000 or more.

3.4 Features of MapReduce

3.4.1 Automatic Parallelization and Distribution of Work

This feature of Hadoop/MapReduce provides a convenient platform to process large sets of data within a short period of time on cheap commodity machines. The user of the MapReduce library must express the computations as two functions: Map and Reduce (Figure 3.3). The Map function is written by the user to accept an input key/value pair and produce intermediate key/value pairs. The Reduce function is also written by the user to accept an intermediate key and a set of values that are associated with that key. These values are merged together to form a smaller set of values and are supplied to the Reduce function through an iterator. This allows the MapReduce framework to handle sets of values that are too large to fit into memory.
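The classic word-count job (the same example set up in Appendix E) illustrates the two functions. The following minimal sketch uses Hadoop's Java MapReduce API: Map emits an intermediate (word, 1) pair for every word of its input, and Reduce sums the counts associated with each word. The driver code that configures and submits the job is omitted for brevity.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: for each word in the input line, emit the pair (word, 1).
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);  // intermediate key/value pair
                }
            }
        }

        // Reduce: sum all the counts associated with the same word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : values) {
                    sum += count.get();
                }
                context.write(key, new IntWritable(sum));  // final (word, total) pair
            }
        }
    }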

3.4.2 Fault Tolerance in Hadoop/MapReduce

Since a Hadoop/MapReduce cluster can contain 4000 commodity machines, failure of a component can occur at any time. The framework should be able to accommodate such failures gracefully.


Worker Failure: When a worker fails, it will not be able to respond to the Master's ping message, which is sent periodically to worker nodes. If a given period of time expires without the worker responding to the Master, the Master will consider that worker as failed. If the worker failed at a point when the Master was assigning map and reduce tasks to slaves, the task meant for the failed worker is rescheduled and assigned to a different worker node that is in good condition. If a worker fails after completing a map task, the task is rescheduled and assigned to a new worker, since the output of the map task was saved to the local drive of the failed worker. Similarly, if a worker fails during the progress of a map or reduce task, the task must be rescheduled and reassigned. However, rescheduling is not necessary if a node fails after completing a reduce task, since the output of the reduce phase is saved to a global file. All workers are notified of the new locations of rescheduled tasks (a sketch of this failure-detection loop follows at the end of this section).

Handling Bugs in User Code: Each worker process contains a signal handler that is used to catch bus errors and segmentation violations. The MapReduce library stores the sequence number of the argument of the user Map and Reduce operations in a global variable before invoking them. If the user code generates a signal, the signal handler will send a UDP packet (last gasp) containing the sequence number to the Master. When the Master finds out that it has seen this failure from a particular record more than once, it will instruct the worker to skip the record the next time a Map or Reduce task is to be carried out on it.

Master Failure: The Master writes periodic checkpoints to enable it to restart when it eventually fails. However, before the Master restarts upon failure, all map/reduce tasks are aborted, and clients need to retry.
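The worker-failure handling above can be sketched as a periodic sweep over heartbeat timestamps. The following minimal Java sketch is an illustration of the idea only, not Hadoop's or Google's actual implementation; the class name, the ten-minute timeout, and the task-queue types are all assumptions.

    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;

    public final class WorkerMonitor {

        private static final long TIMEOUT_MS = 10 * 60 * 1000;  // assumed: 10 min of silence

        private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
        private final Queue<String> pendingTasks;          // tasks awaiting (re)assignment
        private final Map<String, String> taskByWorker;    // task currently on each worker

        public WorkerMonitor(Queue<String> pendingTasks, Map<String, String> taskByWorker) {
            this.pendingTasks = pendingTasks;
            this.taskByWorker = taskByWorker;
        }

        // Called whenever a worker responds to the Master's periodic ping.
        public void onHeartbeat(String workerId) {
            lastHeartbeat.put(workerId, System.currentTimeMillis());
        }

        // Called periodically: declare silent workers failed and requeue their tasks.
        public void sweep() {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
                if (now - entry.getValue() > TIMEOUT_MS) {
                    String task = taskByWorker.remove(entry.getKey());
                    if (task != null) {
                        pendingTasks.add(task);   // reschedule on a healthy worker
                    }
                    lastHeartbeat.remove(entry.getKey());
                }
            }
        }
    }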

3.4.3 Cost Efficiency

Hadoop/MapReduce employs commodity machines that are cheap but unreliable. It makes use of the locality of the distributed file system to reduce excessive use of network bandwidth (that is, most input data is read locally from the HDFS; otherwise, it is read from the closest machine with the available data).

MapReduce offers automatic fault-tolerance, hence reducing the cost of maintenance and downtime. Since it is easier to program and use, fewer programmers and administrators may be required to maintain a good working system.

3.4.4 Simplicity

Hadoop allows programmers to quickly write efficient parallel code for the processing of large jobs (e.g., Hadoop/MapReduce is used at Facebook to process huge user tasks).


3.5 Limitations of Hadoop/MapReduce

Despite all the advantages of Hadoop/MapReduce, some limitations exist. The current implementation of the framework does not allow the programmer to control the order in which the maps or reductions are run. A factor of sequential execution also comes into play, since no Reduce operation can take place until all Map operations have failed, been skipped, or completed. The Master node is a single point of failure; upon failure, all queued and running jobs are killed, hence jobs have to be resubmitted and rescheduled. Master and job restart is not simple and can be very tricky due to their complex states. The JobTracker machine is currently coarsely synchronized: it acts as the cluster resource manager and at the same time manages the application life cycle. Clearly, a split of these two functions would improve MapReduce further.

Hadoop/MapReduce has a scalability problem. The number of nodes per cluster for current implementations is 4000, with a maximum of 40000 concurrent tasks. Any load beyond this limit will impact negatively on the performance and efficiency of a Hadoop/MapReduce implementation.

Hadoop/MapReduce does not support alternate paradigms such as database management systems. Iterative applications developed with MapReduce can be ten times slower (e.g., a database with an index will always be faster than a MapReduce job on un-indexed data). MapReduce lacks a wire-compatibility protocol, since clients and cluster must be of the same version, and applications and workflows cannot migrate to different clusters (see Appendix E on how to install and run Hadoop/MapReduce).

3.6 Apache’s ZooKeeper

ZooKeeper is a high-performance, available, and scalable open-source software package that is used to coordinate distributed systems. It is a stripped-down file system that exposes some primitives on which distributed systems can build to implement higher-level services such as synchronization, naming, and configuration management. It is difficult to implement coordination in distributed systems, especially if these coordinating services are to be written from scratch. For instance, it is difficult and time-consuming to write services that cope with deadlocks, network partitions, configuration changes, and other failures as part of the distributed system development process. According to the ZooKeeper documentation [23], coordinating distributed systems is a zoo, hence the name ZooKeeper (see Appendix F on how to install and run ZooKeeper).


Figure 3.4: Zookeeper hierarchical namespace [24]

3.6.1 ZooKeeper Data Model

ZooKeeper is organized as a hierarchical tree of nodes called znodes. Znodes are created by clients. Each znode can have children and data associated with it. The size of the data is limited to 1 MB per node, since ZooKeeper nodes are not meant for data storage; rather, they keep data in memory to achieve high throughput and low latency. However, there are no renames, no soft or hard links, and no append semantics. Hunt et al. [24] defined the terms client, server, and znode as follows:

A client denotes a user of the ZooKeeper service, a server denotes a process providing the ZooKeeper service, and a znode denotes an in-memory data node in the ZooKeeper data, which is organized in a hierarchical namespace referred to as the data tree.

The root node is the parent of all other nodes. A node can be either ephemeral or regular (persistent). Nodes can also be sequential, in which case an incrementing number is appended to the name so that each node carries a unique number for purposes of ordering. All znodes can have children except ephemeral znodes, which disappear as soon as the client that created them closes its session. A client may manipulate a regular node by creating and/or deleting it explicitly, as opposed to ephemeral nodes, where the client can either decide to delete the node or let the service remove it automatically when the client's session expires or a failure occurs. All nodes are seen by all clients, and apart from ephemeral nodes, all nodes can be deleted by any client, including the client that created the node. Each znode is associated with an Access Control List (ACL), which determines the rights a given client has over a particular node. Znodes maintain version numbers for data modifications, such as timestamps and ACL changes, to allow cache validations and coordinated updates. For instance, any time a znode's data changes, its version number is incremented, and any client that accesses the node must take note of the updated version number. Applications manipulate znodes to coordinate the actions of their processes. User applications store data directly in nodes, or use the names of nodes to indicate some event of the application.
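To make the data model concrete, the following minimal sketch uses the ZooKeeper Java API to create a regular znode, read its data and version number, and perform a version-checked update. The connect string, path, and payloads are assumed values chosen for illustration:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        // Connect to the service (3 s session timeout, no default watcher).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // Create a regular (persistent) znode; a znode holds at most 1 MB.
        zk.create("/app", "config-v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read the data back; Stat carries the version number used for
        // cache validation and coordinated updates.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app", false, stat);
        System.out.println(new String(data) + " @ version " + stat.getVersion());

        // Conditional update: succeeds only while the version matches,
        // after which ZooKeeper increments the znode's version number.
        zk.setData("/app", "config-v2".getBytes(), stat.getVersion());

        zk.close();
    }
}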

3.6.2 Zookeeper Guarantees

Zookeeper provides some features that make it suitable for coordinating distributed systems. It facilitates loosely coupled interactions, where processes in a cluster do not need to know of each other's existence before communicating or interacting with each other (rendezvous). It is simple to use, yet provides a rich library of APIs that can be used to build a large class of coordination data structures and protocols. The following, among others, are some guarantees of Zookeeper:

1. Atomicity: Atomicity governs Zookeeper data access. A client must either obtain the entire data stored at a node or nothing at all; there is no such thing as a partial read in Zookeeper. If a client reads from a server that has failed and is hence not up to date, the server must either block the client to force it to reconnect to an up-to-date server, or delay the client until the server updates itself. When a client writes to a node, the write must either replace the entire data at the node or fail. This design maintains consistency.

2. Sequential Consistency: Zookeeper uses a First In, First Out (FIFO) policy to execute requests that update the state of the service. This means that requests from clients are executed in the order in which they were received. Zookeeper uses this property to maintain order and to ensure that all client requests are attended to.

3. Single System Image: No matter which server a client connects to, Zookeeper guarantees that it sees the same data. Any modified data that is not committed to all the servers in the service is not given to a requesting client, and any server that is out of date with the other servers must update itself before serving clients.

4. Reliability and Availability: Zookeeper is designed to work on multiple machines, with the ability of a client to reconnect to another server should the server originally serving it go down. In Leader Election implementations of Zookeeper, a server is guaranteed to be available, since leader election takes place as soon as the current leader goes down.


Figure 3.5: Zookeeper Leader Election Service [25]

5. Timeliness: Updates and other operations in Zookeeper are propagated to all concerned clients and servers in real time. This property prevents servers that did not fail from lagging behind. Any changes made to a node by a client are immediately seen by all other clients.

3.6.3 Zookeeper Primitives

Programmers take advantage of the capabilities of Zookeeper's API to implement some important algorithms that manage distributed systems.

1. Leader Election: Currently, Zookeeper leader activation includes Leader Election and Fast Leader Election. It is important that when a Leader goes down, a Follower rises to the position of the new Leader to carry on processing client requests. Each Zookeeper cluster must have 2f + 1 nodes to achieve failover, since Zookeeper is fault-tolerant only if a majority of nodes are up and running (f is the number of servers whose failure the cluster can tolerate). One of the nodes is made active during start-up and is said to be the Leader; all other servers are Followers. Nodes register with the Leader Election Service to get a notification when a server goes down; as soon as this happens, a quorum is formed and a new Leader is elected (Figure 3.5). It is important to have a Leader because write requests from clients can only be processed by the Leader, whereas read requests can be processed by any of the Zookeeper servers. The elected Leader waits for Followers to connect to it and then syncs with them by sending any updates they are missing. Should a Follower be missing too many updates, the Leader sends a snapshot of the updates to that particular Follower. A practical way of implementing Leader Election in Zookeeper is to use the sequential and ephemeral flags when creating the znodes that represent client proposals. With the sequence flag set, Zookeeper appends a sequence number greater than any sequence number previously appended under the parent node. The process that created the znode with the smallest sequence number is the Leader, and every other process registers to watch the znode with the next lower sequence number, so that it is notified when that node goes away; because the nodes are ephemeral, a znode is deleted as soon as the session of the process that created it (for instance, the Leader) fails. In a nutshell, once a Zookeeper Leader Election service contains 2f + 1 servers, the failure of f servers will be tolerated, since the remaining servers constitute a majority; anything short of this will cause the unavailability of Zookeeper. A sketch of this recipe is given below. Other primitives provided by the Zookeeper API include Rendezvous, Watches, and Queues.
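The following minimal sketch of this recipe uses the ZooKeeper Java API (written with Java 8 lambdas for brevity). It assumes a persistent parent znode /election already exists; the parent path and the n_ prefix are conventional choices, not part of the API:

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class LeaderElection {
    private final ZooKeeper zk;
    private String myNode;   // e.g. /election/n_0000000003

    public LeaderElection(ZooKeeper zk) { this.zk = zk; }

    public void volunteer() throws Exception {
        // SEQUENTIAL appends a number greater than any used before under
        // the parent; EPHEMERAL removes the proposal if our session dies.
        myNode = zk.create("/election/n_", new byte[0],
                           ZooDefs.Ids.OPEN_ACL_UNSAFE,
                           CreateMode.EPHEMERAL_SEQUENTIAL);
        checkLeadership();
    }

    private void checkLeadership() throws Exception {
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        int rank = children.indexOf(myNode.substring("/election/".length()));
        if (rank == 0) {
            System.out.println("Elected leader: " + myNode);
            return;
        }
        // Watch only the immediate predecessor, so that the failure of the
        // leader does not wake up every candidate at once.
        String predecessor = "/election/" + children.get(rank - 1);
        Stat s = zk.exists(predecessor, event -> {
            try { checkLeadership(); } catch (Exception ignored) { }
        });
        if (s == null) checkLeadership();  // predecessor already vanished
        else System.out.println("Following; watching " + predecessor);
    }
}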

3.6.4 Zookeeper Fault Tolerance

The failure of a Zookeeper client does not negatively affect the service's performance; Zookeeper will continue to serve the remaining clients from which it receives requests. A disconnected Zookeeper client will always try to reconnect until it is successful. Practically, the client is supplied with almost all the server addresses and ports to enable it to reconnect to another server should the original connection fail. There are two possibilities should a Zookeeper server fail. First, if the failed server is the Leader in a Leader Election Service, then the rest of the servers must come together to elect a new leader. This is only possible if the working Zookeeper servers form a quorum. In the second case, if the failed server is a Follower, the Leader will try to connect with it; if this fails, the Leader assumes that particular Follower is down, and the Leader Election Service must then determine whether the rest of the servers form a quorum. If not, the entire Zookeeper service will go down. This implies that the only time a Zookeeper coordinated cluster will go down, apart from unforeseen circumstances, is when the remaining working servers do not form a majority of the servers in the cluster.
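This reconnection behaviour is visible directly in the client API: the client is constructed with a connect string that lists several ensemble members, and the library fails over among them on its own, surfacing session events to a watcher. A brief sketch with illustrative hostnames:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;

public class FailoverClient {
    public static void main(String[] args) throws Exception {
        // The connect string lists every server in the ensemble; if the
        // currently connected server fails, the client library reconnects
        // to another one automatically, firing Disconnected and then
        // SyncConnected session events along the way.
        ZooKeeper zk = new ZooKeeper(
            "server1:2181,server2:2181,server3:2181", 3000,
            (WatchedEvent e) -> System.out.println("session event: " + e.getState()));
        // ... use zk as usual; failover is transparent to this code.
        zk.close();
    }
}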

3.7 Related Work

Fabrizio Marozzo et al. [26] proposed a peer-to-peer MapReduce architecture in which each node can act either as a master or as a slave. The role assigned to a node can change dynamically and depends on the characteristics of the node at that point in time. Each master node can act as a back-up for other master nodes. Slave nodes must check periodically for the existence of at least one master in the network. In case no masters are found, a slave must promote itself to the master role. This means that the first node joining the network always assumes a master role, as does the last node remaining in the network.

In his 2008 presentation, Francesco Salbaroli [27] proposed a fault-tolerant Hadoop JobTracker, in which he advocates the addition of a library (JGroups) to the Hadoop source code to aid in making the JobTracker highly available.


Figure 3.6: JobTracker state transition diagram [28]

His proposal aimed at maintaining the current Master-Slave implementation of Hadoop/MapReduce, since it has a relatively small overhead and reduced coordination complexities. Though this implementation adds a little overhead (negligible due to the number of replicated JobTrackers), it can accommodate new features without modifying its current behaviour. The logical model of the replicated JobTracker is made up of a coordinating protocol that resides on the Distributed File System (DFS) and performs coordination between replicated JobTrackers and Slave machines. JGroups performs the discovery of members, health checking, implementation of the election protocol, and communication between components. If a master fails, a Fault Tolerant Manager discovers its failure and triggers a new election accordingly.

In 2011, Devarajulu K. [28] proposed an implementation of Hadoop/MapReduce that will make the JobTracker highly available. His implementation was based on the Leader Election framework suggested in Zookeeper. In the Zookeeper Leader Election framework, 2f + 1 Zookeeper servers are started, with only one of them acting as the Master and the rest as Followers. This service can tolerate the failure of f servers, since the remaining servers must form a quorum (majority). According to [28], the JobTracker can be in one of three states: ACTIVE, STAND-BY, or NEUTRAL. The transition of a given JobTracker to any of the three states is dictated by the Leader Election Protocol. Based on which JobTracker is started in ACTIVE or STAND-BY mode, the Leader Election Service can trigger the transition methods between JobTracker states shown in Figure 3.6. A JobTracker can go to any of the above states depending on the given scenario; for instance, when multiple JobTrackers are started together, the Leader Election Service must initiate the election of one of the JobTrackers as master, and the rest become followers. The master thus implements the startInActive transition method. The NEUTRAL state is assumed when the Zookeeper service becomes unavailable. This approach ([28]) was used to make the JobTracker available in this thesis.


Chapter 4

Availability Model

4.1 JobTracker Availability Model

There has been growing interest in designing redundant computer systems that are highly available, especially for performing safety-critical tasks. Such systems require availability analysis not only for purposes of correct functionality but also to help identify components and issues that have the potential to reduce availability. The availability of a system component is the probability that the component is still operating at time t, given that it was operating at time zero. In this section, we propose a mathematical model based on Markov chains to analyze the availability of the JobTracker in a Hadoop/MapReduce cluster.

4.1.1 Related Work

Much research on the availability modelling of redundant systems has been carried out. Oliveira et al. [29] proposed the development of a Markov-based model for the analysis of the availability of a telecommunications management system. The model was used to define the availability of parts of the system, identify hardware and software components responsible for reducing the availability of the system, and define actions that can mitigate the low availability.

Laprie [30] proposed a dependability model for software systems during their operational life. He used a Markov model for a single machine that considered the dependability of both software and hardware on the system. His model considered the system as being partitioned into a software part and a hardware part, each of which can fail and can be repaired. This research demonstrated that availability models can be generalized to encompass both software and hardware faults.

Goel and Soenjoto [31] used stochastic models to analyze the performance of a combined software and hardware system. Dai et al. [32] proposed a model for determining the reliability and availability of centralized heterogeneous systems. They presented a model that implements a system availability function for a virtual machine, along with an application example to illustrate their method and demonstrate its feasibility. They defined centralized heterogeneous distributed systems as distributed systems consisting of M servers (M ≥ 1) that can support a virtual machine, which in turn can manage and control programs and data from heterogeneous sub-systems through virtual nodes. These heterogeneous sub-distributed systems are made up of different types of computers with different operating systems connected by different topologies of networks. Lai et al. [33] also proposed a Markov model for determining the availability of a k Fault Tolerance (f + 1) homogeneously distributed software/hardware system (HDSHS). They define an HDSHS as a distributed system in which all hosts are of the same type, such as machines from the same vendor. It should be noted that this definition of HDSHS does not take into account the topology and configuration of the distributed system.

4.2 Model Assumptions

1. Each machine in the cluster runs a copy of the same software. This software has a failure rate λ_s(t) determined by the Jelinski-Moranda model [34].

2. All failures in the cluster (either hardware or software failures) are mutually independent. If more than one failure occurs at the same time, they can either be considered a single failure or independent failures with a time interval of zero.

3. All the machines in the cluster have a hardware failure rate λ_h resulting from an exponential distribution.

4. A fault is corrected instantaneously without introducing new faults into the software/hardware. The correction time follows an exponential distribution with parameter µ_s for software failures and µ_h for hardware failures (the corresponding densities are written out after this list).

5. Both the software and hardware have only two states: a working state and a faulty state.
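For concreteness, assumptions 3 and 4 state that the hardware time to failure and both correction times are exponentially distributed. Written out with the rates defined above, the corresponding densities are

f_h(t) = λ_h e^(−λ_h t),    g_s(t) = µ_s e^(−µ_s t),    g_h(t) = µ_h e^(−µ_h t),    t ≥ 0,

where f_h is the hardware time-to-failure density, and g_s and g_h are the software and hardware repair-time densities, respectively.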

4.3 Markov Model for a Multi-Host System

The Markov model shown in Figure 4.1 describes a homogeneous distributed hardware/software system with N hosts [33]. We let (i, j) be a state in which i hosts suffer hardware failure and j hosts suffer software failure. The corresponding Kolmogorov differential equation is given, for i, j ≠ 0, N and i + j ≤ N, by:


Figure 4.1: N-hosts HDSHS general model [33]

P′_{i,j}(t) = µ_h P_{i+1,j}(t) + (N − i − j + 1) λ_h P_{i−1,j}(t) + (N − i − j + 1) λ_s(t) P_{i,j−1}(t) + µ_s P_{i,j+1}(t) − X_{i,j} P_{i,j}(t)    (4.1)

where X_{i,j} = µ_s + (N − i − j) λ_h + (N − i − j) λ_s(t) + µ_h.

The initial conditions are P_{0,0}(0) = 1 and P_{i,j}(0) = 0 for i, j ≠ 0. The boundary conditions are (see Appendix A on how these boundary conditions are derived):

P′_{0,0}(t) = µ_h P_{1,0}(t) + µ_s P_{0,1}(t) − N[λ_s(t) + λ_h] P_{0,0}(t)    (4.2)

P′_{0,j}(t) = µ_s P_{0,j+1}(t) + µ_h P_{1,j}(t) + (N − j + 1) λ_s(t) P_{0,j−1}(t) − [µ_s + (N − j)(λ_h + λ_s(t))] P_{0,j}(t),  for j = 1, 2, 3, ..., N − 1    (4.3)

P′_{i,0}(t) = µ_s P_{i,1}(t) + µ_h P_{i+1,0}(t) + (N − i + 1) λ_h P_{i−1,0}(t) − [µ_h + (N − i)(λ_h + λ_s(t))] P_{i,0}(t),  for i = 1, 2, 3, ..., N − 1    (4.4)

P′_{N,0}(t) = λ_h P_{N−1,0}(t) − µ_h P_{N,0}(t)    (4.5)

P′_{0,N}(t) = λ_s(t) P_{0,N−1}(t) − µ_s P_{0,N}(t)    (4.6)


4.3.1 The Parameter λs(t)

One of the earliest proposed software reliability models, which is still in widespread use, is the Jelinski-Moranda model [34]. In this model, the elapsed time between software failures is taken to follow an exponential distribution with a parameter that is proportional to the number of remaining faults in the software. The model was developed under the following assumptions:

1. The total number of faults is finite (N_0).

2. No fault is introduced while correcting detected faults: each detected fault is corrected before new executions.

3. Faults are independent of each other, and their manifestation rate is constant.

4. All failures are observed.

From the assumptions the following parameters and relation are defined:

N_0 = total number of faults
λ = fault manifestation rate
λ(i) = failure rate of the i-th failure

λ(i) = λ[N_0 − (i − 1)],  1 ≤ i ≤ N_0    (4.7)

The quantity λ is the proportionality constant, and N_0 is the total number of faults in the software from the initial point in time when the software is monitored. N_0 − (i − 1) is the number of remaining software faults in the system; it is a decreasing function of time, since after every debugging one or more faults are corrected, eventually reducing the number of faults remaining. (i − 1) is a function of time, since the (i − 1)-th fault occurs at some point in time.
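As a worked instance, using the values λ = 0.06 and N_0 = 10 that are adopted in Section 4.5, the failure rate steps down by λ = 0.06 each time a fault is corrected:

λ(1) = 0.06[10 − 0] = 0.60,   λ(2) = 0.06[10 − 1] = 0.54,   ...,   λ(10) = 0.06[10 − 9] = 0.06.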

4.4 Markov Model for a Three-Host (N = 3) Hadoop/MapReduce Cluster Using Zookeeper as Coordinating Service

This model uses [33] as the basis for proposing a Markov-based availability model for the JobTracker machine in a Hadoop/MapReduce implementation. The following reasons reinforce the choice of [33]:

1. The proposed system is a Fault-Tolerant system.


Figure 4.2: The proposed cluster

Figure 4.3: Markov model for N = 3

2. A cluster of the proposed system is made up of at least three servers from the same vendor (for example, HP) that have the same system specifications.

3. The same type of distributed application software (Hadoop/MapReduce and Zookeeper) will be run on similar computers/servers in the cluster.

Figure 4.2 is a graphical representation of the proposed Hadoop/MapReduce fault-tolerant cluster. The corresponding Markov state transition diagram for the three JobTracker machines is shown in Figure 4.3. A host is functional only if both its software and hardware are in working states. At state 0, all software and hardware are in working state and functioning normally. At state 1, one software is down and the system continues to function with the remaining hosts. At state 2, two softwares are down, causing the system to go down, since Zookeeper needs a majority to form a quorum in order to implement Leader Election. The same scenario applies to states 4 and 7, but with respect to hardware. At state 5, one hardware and one software are down, leaving only one functional host and hence leading to system failure. At states 6 and 8, the cluster has already stopped working, since three failures (hardware and software combined) have occurred. If we let the probability of the system being in state i at time t be P_i(t), then the corresponding set of Kolmogorov differential equations for N = 3 can be obtained from the Markov model for the multi-host system as follows (see Appendix B and Appendix C for N = 4 and N = 5, respectively):

Table 4.1: Values of k for multi-host differential equations

State    0      1      2      3      4      5      6      7      8      9
k      (0,0)  (0,1)  (0,2)  (0,3)  (1,0)  (1,1)  (1,2)  (2,0)  (2,1)  (3,0)

Let k be the set of all possible pairs (i, j) for N = 3, where i, j ∈ {0, 1, 2, 3}. This gives

k = {(0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3), (2,0), (2,1), (2,2), (2,3), (3,0), (3,1), (3,2), (3,3)}

Applying the condition i + j ≤ N, k reduces to:

k = {(0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (2,0), (2,1), (3,0)}

Table 4.1 is generated by assigning values of k to the differential equations in the Markov model for the multi-host system. We can then generate the following differential equations:

State 0
Applying equation 4.2 to k = (0, 0), we obtain the Kolmogorov differential equation for state 0:

P′_{0,0}(t) = µ_h P_{1,0}(t) + µ_s P_{0,1}(t) − 3[λ_s(t) + λ_h] P_{0,0}(t)

From Table 4.1, k = (1, 0) is state 4, k = (0, 1) is state 1, and k = (0, 0) is state 0. By substitution, we have

P′_0(t) = µ_h P_4(t) + µ_s P_1(t) − 3[λ_s(t) + λ_h] P_0(t)    (4.8)


State 1
Applying equation 4.3 to k = (0, 1), we obtain:

P′_{0,1}(t) = µ_s P_{0,2}(t) + µ_h P_{1,1}(t) + (3 − 1 + 1) λ_s(t) P_{0,0}(t) − [µ_s + (3 − 1)(λ_h + λ_s(t))] P_{0,1}(t)

P′_{0,1}(t) = µ_s P_{0,2}(t) + µ_h P_{1,1}(t) + 3 λ_s(t) P_{0,0}(t) − [µ_s + 2(λ_h + λ_s(t))] P_{0,1}(t)

From Table 4.1, k = (1, 1) is state 5, k = (0, 1) is state 1, k = (0, 2) is state 2, etc. By substitution, we have

P′_1(t) = µ_s P_2(t) + µ_h P_5(t) + 3 λ_s(t) P_0(t) − [µ_s + 2(λ_h + λ_s(t))] P_1(t)    (4.9)

State 2
Applying equation 4.3 to k = (0, 2), we obtain:

P′_{0,2}(t) = µ_s P_{0,3}(t) + µ_h P_{1,2}(t) + (3 − 2 + 1) λ_s(t) P_{0,1}(t) − [µ_s + (3 − 2)(λ_h + λ_s(t))] P_{0,2}(t)

P′_{0,2}(t) = µ_s P_{0,3}(t) + µ_h P_{1,2}(t) + 2 λ_s(t) P_{0,1}(t) − [µ_s + λ_h + λ_s(t)] P_{0,2}(t)

From Table 4.1, we obtain equation 4.10 by substitution:

P′_2(t) = µ_s P_3(t) + µ_h P_6(t) + 2 λ_s(t) P_1(t) − [µ_s + λ_h + λ_s(t)] P_2(t)    (4.10)

State 3
Applying equation 4.6 to k = (0, 3), we obtain:

P′_{0,3}(t) = λ_s(t) P_{0,2}(t) − µ_s P_{0,3}(t)

From Table 4.1, we obtain equation 4.11 by substitution:

P′_3(t) = λ_s(t) P_2(t) − µ_s P_3(t)    (4.11)

State 4
Applying equation 4.4 to k = (1, 0), we obtain:

P′_{1,0}(t) = µ_h P_{2,0}(t) + (3 − 1 + 1) λ_h P_{0,0}(t) + µ_s P_{1,1}(t) − [µ_h + (3 − 1)(λ_s(t) + λ_h)] P_{1,0}(t)

P′_{1,0}(t) = µ_h P_{2,0}(t) + 3 λ_h P_{0,0}(t) + µ_s P_{1,1}(t) − [µ_h + 2(λ_s(t) + λ_h)] P_{1,0}(t)

P′_4(t) = µ_h P_7(t) + 3 λ_h P_0(t) + µ_s P_5(t) − [µ_h + 2(λ_s(t) + λ_h)] P_4(t)    (4.12)


State 5
Applying equation 4.1 to k = (1, 1), we obtain:

P′_{1,1}(t) = µ_h P_{2,1}(t) + (3 − 1 − 1 + 1) λ_h P_{0,1}(t) + (3 − 1 − 1 + 1) λ_s(t) P_{1,0}(t) + µ_s P_{1,2}(t) − X_{1,1} P_{1,1}(t)

P′_{1,1}(t) = µ_h P_{2,1}(t) + 2 λ_h P_{0,1}(t) + 2 λ_s(t) P_{1,0}(t) + µ_s P_{1,2}(t) − X_{1,1} P_{1,1}(t)

where

X_{1,1} = µ_s + (3 − 1 − 1) λ_h + (3 − 1 − 1) λ_s(t) + µ_h = µ_s + λ_h + λ_s(t) + µ_h

Therefore we have:

P′_{1,1}(t) = µ_h P_{2,1}(t) + 2 λ_h P_{0,1}(t) + 2 λ_s(t) P_{1,0}(t) + µ_s P_{1,2}(t) − [µ_s + λ_h + λ_s(t) + µ_h] P_{1,1}(t)

P′_5(t) = µ_h P_8(t) + 2 λ_h P_1(t) + 2 λ_s(t) P_4(t) + µ_s P_6(t) − [µ_s + λ_h + λ_s(t) + µ_h] P_5(t)    (4.13)

State 6
Applying equation 4.1 to k = (1, 2), we obtain:

P′_{1,2}(t) = µ_h P_{2,2}(t) + (3 − 1 − 2 + 1) λ_h P_{0,2}(t) + (3 − 1 − 2 + 1) λ_s(t) P_{1,1}(t) + µ_s P_{1,3}(t) − X_{1,2} P_{1,2}(t)

P′_{1,2}(t) = µ_h P_{2,2}(t) + λ_h P_{0,2}(t) + λ_s(t) P_{1,1}(t) + µ_s P_{1,3}(t) − X_{1,2} P_{1,2}(t)

From i + j ≤ N (the states (2, 2) and (1, 3) do not exist, so their terms vanish), we have:

P′_{1,2}(t) = λ_h P_{0,2}(t) + λ_s(t) P_{1,1}(t) − X_{1,2} P_{1,2}(t)

where

X_{1,2} = µ_s + (3 − 1 − 2) λ_h + (3 − 1 − 2) λ_s(t) + µ_h = µ_s + µ_h

Therefore we have:

P′_{1,2}(t) = λ_h P_{0,2}(t) + λ_s(t) P_{1,1}(t) − [µ_s + µ_h] P_{1,2}(t)

P′_6(t) = λ_h P_2(t) + λ_s(t) P_5(t) − [µ_s + µ_h] P_6(t)    (4.14)


State 7
Applying equation 4.4 to k = (2, 0), we obtain:

P′_{2,0}(t) = µ_h P_{3,0}(t) + (3 − 2 + 1) λ_h P_{1,0}(t) + µ_s P_{2,1}(t) − [µ_h + (3 − 2)(λ_s(t) + λ_h)] P_{2,0}(t)

P′_{2,0}(t) = µ_h P_{3,0}(t) + 2 λ_h P_{1,0}(t) + µ_s P_{2,1}(t) − [µ_h + λ_s(t) + λ_h] P_{2,0}(t)

P′_7(t) = µ_h P_9(t) + 2 λ_h P_4(t) + µ_s P_8(t) − [µ_h + λ_s(t) + λ_h] P_7(t)    (4.15)

State 8
Applying equation 4.1 to k = (2, 1), we obtain:

P′_{2,1}(t) = µ_h P_{3,1}(t) + (3 − 2 − 1 + 1) λ_h P_{1,1}(t) + (3 − 2 − 1 + 1) λ_s(t) P_{2,0}(t) + µ_s P_{2,2}(t) − X_{2,1} P_{2,1}(t)

P′_{2,1}(t) = µ_h P_{3,1}(t) + λ_h P_{1,1}(t) + λ_s(t) P_{2,0}(t) + µ_s P_{2,2}(t) − X_{2,1} P_{2,1}(t)

From i + j ≤ N we have:

P′_{2,1}(t) = λ_h P_{1,1}(t) + λ_s(t) P_{2,0}(t) − X_{2,1} P_{2,1}(t)

where

X_{2,1} = µ_s + (3 − 2 − 1) λ_h + (3 − 2 − 1) λ_s(t) + µ_h = µ_s + µ_h

Therefore we have:

P′_{2,1}(t) = λ_h P_{1,1}(t) + λ_s(t) P_{2,0}(t) − [µ_s + µ_h] P_{2,1}(t)

P′_8(t) = λ_h P_5(t) + λ_s(t) P_7(t) − [µ_s + µ_h] P_8(t)    (4.16)

State 9
Applying equation 4.4 to k = (3, 0), we obtain:

P′_{3,0}(t) = µ_h P_{4,0}(t) + (3 − 3 + 1) λ_h P_{2,0}(t) + µ_s P_{3,1}(t) − [µ_h + (3 − 3)(λ_s(t) + λ_h)] P_{3,0}(t)

P′_{3,0}(t) = µ_h P_{4,0}(t) + λ_h P_{2,0}(t) + µ_s P_{3,1}(t) − µ_h P_{3,0}(t)

However, i + j ≤ N, which implies that

P′_{3,0}(t) = λ_h P_{2,0}(t) − µ_h P_{3,0}(t)

P′_9(t) = λ_h P_7(t) − µ_h P_9(t)    (4.17)


The system of these ten Kolmogorov differential equations (equations 4.8 to 4.17) can be solved numerically with the initial conditions P_0(0) = 1 and P_i(0) = 0 for i = 1, 2, 3, ..., 9. These initial conditions mean that at time zero the system is in state 0, where all hardware and software are in good working condition. Since states 0, 1, and 4 are the only working states of this model, we can calculate the availability A(t) by solving the system of Kolmogorov differential equations using the given initial conditions:

A(t) = P_0(t) + P_1(t) + P_4(t)    (4.18)

4.5 Numerical Solution to the System of Differential Equations

The system of ten Kolmogorov differential equations generated for N = 3 can be solved numerically using MATLAB. The parameter of interest is the software failure rate, which is described by equation 4.7. In that equation, the quantity (i − 1) is a function of time, but in solving the differential equations we assume that the software failure rate is independent of time, since according to the Jelinski-Moranda model the failure rate remains unchanged for each interval in which the number of software faults remains unchanged. Therefore, we can make λ_s(t) constant for each time interval, and use the result as the initial condition for the next interval (see Appendix D for the MATLAB ODE function used to solve the system of Kolmogorov differential equations 4.8 to 4.17). We let λ = 0.06, N_0 = 10, and solve the differential equations over a time interval of 50 per solution. Tables 4.2 and 4.3 show portions of the probabilities generated from the solution to the differential equations. The next solution will use the last row

[0.1515 0.2260 0.2235 0.1100 0.0913 0.0910 0.0452 0.0363 0.0182 0.0071]

as the initial condition to solve the system of differential equations over the time interval 50-100. This process is repeated continuously until the number of remaining faults N_0 − (i − 1) reaches zero. Notice that the quantity N_0 − (i − 1) in this particular solution is decremented in steps of one; that is, faults are corrected one at a time. The availability A(t) = P_0(t) + P_1(t) + P_4(t) is plotted against time as shown in Figure 4.4.
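As an illustration of this interval-by-interval procedure, the following self-contained Java sketch integrates equations 4.8 to 4.17 with a fixed-step fourth-order Runge-Kutta scheme and prints A(t) at the end of each 50-unit interval. The values λ = 0.06 and N_0 = 10 follow the text; the rates λ_h, µ_s, and µ_h are assumed demonstration values, since they are free inputs to the model:

public class JobTrackerAvailability {
    // lambda and N0 follow Section 4.5; LH, MU_S, MU_H are assumed values.
    static final double LAMBDA = 0.06, LH = 0.1, MU_S = 1.0, MU_H = 0.5;
    static final int N0 = 10;

    // Right-hand side of equations 4.8-4.17; p[0..9] hold P0(t)..P9(t).
    static double[] rhs(double[] p, double ls) {
        double[] d = new double[10];
        d[0] = MU_H*p[4] + MU_S*p[1] - 3*(ls + LH)*p[0];                     // 4.8
        d[1] = MU_S*p[2] + MU_H*p[5] + 3*ls*p[0] - (MU_S + 2*(LH + ls))*p[1]; // 4.9
        d[2] = MU_S*p[3] + MU_H*p[6] + 2*ls*p[1] - (MU_S + LH + ls)*p[2];     // 4.10
        d[3] = ls*p[2] - MU_S*p[3];                                           // 4.11
        d[4] = MU_H*p[7] + 3*LH*p[0] + MU_S*p[5] - (MU_H + 2*(ls + LH))*p[4]; // 4.12
        d[5] = MU_H*p[8] + 2*LH*p[1] + 2*ls*p[4] + MU_S*p[6]
               - (MU_S + LH + ls + MU_H)*p[5];                                // 4.13
        d[6] = LH*p[2] + ls*p[5] - (MU_S + MU_H)*p[6];                        // 4.14
        d[7] = MU_H*p[9] + 2*LH*p[4] + MU_S*p[8] - (MU_H + ls + LH)*p[7];     // 4.15
        d[8] = LH*p[5] + ls*p[7] - (MU_S + MU_H)*p[8];                        // 4.16
        d[9] = LH*p[7] - MU_H*p[9];                                           // 4.17
        return d;
    }

    static double[] axpy(double[] p, double[] d, double a) {
        double[] r = new double[10];
        for (int s = 0; s < 10; s++) r[s] = p[s] + a*d[s];
        return r;
    }

    public static void main(String[] args) {
        double[] p = new double[10];
        p[0] = 1.0;                       // initial condition: state 0
        double h = 0.1;                   // RK4 step size
        for (int i = 1; i <= N0; i++) {   // one 50-unit interval per fault fixed
            double ls = LAMBDA * (N0 - (i - 1));  // Jelinski-Moranda, eq. 4.7
            for (double t = 0; t < 50; t += h) {
                double[] k1 = rhs(p, ls);
                double[] k2 = rhs(axpy(p, k1, h/2), ls);
                double[] k3 = rhs(axpy(p, k2, h/2), ls);
                double[] k4 = rhs(axpy(p, k3, h), ls);
                for (int s = 0; s < 10; s++)
                    p[s] += h/6 * (k1[s] + 2*k2[s] + 2*k3[s] + k4[s]);
            }
            double availability = p[0] + p[1] + p[4];   // equation 4.18
            System.out.printf("t = %4d   A(t) = %.4f%n", 50*i, availability);
        }
    }
}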

4.5.1 Interpretation of Availability plot of the JobTracker

It can be seen from Figure 4.4 that the availability of this Hadoop/MapReduce implementation is at its highest point immediately after the cluster is started. A little while after start-up, the availability falls to its lowest point. This behavior can be attributed to the fact that when the cluster is launched for the first time, a large number of faults are detected due to, say, initial start-up problems and problems of coordination between Hadoop/MapReduce and Zookeeper, among others.


Table 4.2: Probabilities of the first solution to the ten differential equations

Table 4.3: Probabilities (Continued)


Figure 4.4: Availability plot of the JobTracker

At some point in time (t = 50) the availability of the cluster starts rising. This is because detected faults are fixed gradually until a point (say t > 600) when the cluster becomes bug free, causing the cluster availability to approach a certain value less than one. From Figure 4.4, it is strongly advised that the cluster not be used to solve critical tasks before the time at which it reaches its lowest availability mark. This period should be earmarked as a testing period for the cluster, possibly to help identify the number of remaining faults in the cluster.

The Zookeeper Leader Election Service requirement that only a majority of servers may form a quorum (2f + 1) has a negative impact on the availability of the cluster. For instance, in the above Hadoop/MapReduce cluster, when two Zookeeper servers go down, the system becomes unavailable, since the one remaining server does not form a majority out of the 3. However, in k Fault Tolerant systems, the remaining server may continue functioning without any problem, though it may not have a back-up. For k Fault Tolerant systems, the equivalent of equation 4.18 for N = 3 becomes:

A(t) = P_0(t) + P_1(t) + P_2(t) + P_4(t) + P_6(t) + P_7(t)

The resulting availability plot is shown in Figure 4.5. It is very clear that the availability in Figure 4.5 is much more desirable than that of Figure 4.4. The flaw in the availability of Figure 4.4 is due to the 2f + 1 operational principle of Zookeeper.


Figure 4.5: Availability of a k Fault Tolerant system.


4.6 Discussion of Results

4.6.1 Sensitivity Analysis

Assuming that a given fault leads to a failure which in turn leads to another failure, then from the second assumption we may consider these failures as independent failures that occur at the same time, with a time interval of zero. However, it is obvious that the first failure resulted in the second failure, implying that when the first failure is corrected, the second failure will also be corrected as a result. In Figure 4.6, as we start correcting the faults in the cluster (at t = 50), the availability begins to rise. If the remaining faults in the system are strictly independent, then we assume no fault corrected will result in the correction of other faults, and we obtain the curve i = 1 (that is, faults are corrected one at a time, and their correction does not have any effect on the remaining faults). In practice, however, faults may result in other faults; the correction of the first fault will lead to the automatic correction of the second fault. This scenario is depicted by the curves i = 2, i = 3, and i = 9. The ability to correct faults that have the potential of causing other faults pushes the availability of the cluster to maximum


Figure 4.6: The effect of i on Availability

within the shortest possible time.

The initial fault count (the number of faults remaining in the cluster) has an effect on the availability of the cluster (Figure 4.7). For initial faults greater than 20, the availability falls below 10% at the early stages. It rises gradually until it reaches the maximum availability mark for the cluster. It is only natural that a system with many faults is not expected to be available until those faults are corrected. For lower initial fault counts, the availability rises quickly as it approaches maximum. A possible application of this scenario might be that Zookeeper clusters should be started with a small number of servers, and the number increased gradually as we become more conversant with the type of faults that can occur due to cluster installation and management. It is obvious, though, that the number of initial faults may or may not depend on the number of servers in the cluster.

In Figure 4.8, N_0 = 10, with each fault contributing a different λ to the software failure rate λ_s(t) of the cluster. As the quantity λ increases, the resulting availability falls at the early stages. This is so because the software failure rate must decrease as faults are diagnosed and corrected in order for the availability to increase. However, λ is a direct proportionality constant of λ_s(t), implying that any increase in λ will in turn increase λ_s(t) and hence decrease the availability. After the initial fall in availability, the cluster begins to recover with each fault corrected, and after some time the availability converges to maximum irrespective of the initial value of λ.


Figure 4.7: The effect of Initial Fault on Availability

Figure 4.8: Effect of λ on Availability


Figure 4.9: Effect of λh on Availability

Figure 4.9 shows how the availability of the cluster responds to changes in the hardware failure rate. Each server in the cluster is assumed to originate from the same vendor. Each of these servers contributes λ_h to the hardware failure rate for the cluster of a given size. As the hardware failure rate increases, the resulting availability falls. This can be applied in the selection of machines for the cluster; in other words, if the failure rate for the machines of a given vendor is high, they are not a good choice as servers for a Hadoop/MapReduce implementation that uses ZooKeeper as a coordinating service. For λ_h = 0, the availability approaches 100%. This phenomenon can be attributed to the fact that software uses hardware as a platform to run, hence the failure of the hardware in most cases may result in the failure of the software running on the failed hardware.

Assuming we have maintenance personnel on stand-by to correct any hardware or software faults in the cluster, the effects of the intensity of hardware and software fault correction on availability are depicted in Figures 4.10 and 4.11. If the hardware faults in the cluster are left uncorrected, the availability decreases rapidly until it reaches zero; this is true for both software and hardware faults. However, as we increase the rate of maintenance of both hardware and software faults, the availability increases in response. For hardware faults, there will come a time when a change in the rate of repair no longer has a significant effect on availability, since most of the hardware faults will have been removed by regular and consistent maintenance. It will also be noted (Figure 4.10) that the effect of increasing the hardware repair rate is very significant to the behavior of the availability.


Figure 4.10: Effect of µh on Availability

Figure 4.11: Effect of µs on Availability


Figure 4.12: Effect of N on Availability

This could be due to the fact that highly available hardware is the first step toward obtaining better-functioning software. The above scenario applies to Figure 4.11, except that the effect of increasing the software repair rate does not have as much impact on availability as the hardware repair rate would if increased at the same rate.

Figure 4.12 shows the availability for different numbers of servers in a Hadoop/MapReduce cluster using Zookeeper (see Figures 5.6 and 5.7 for their corresponding Markov state diagrams). For all the different values of N, the availability initially falls until it approaches t = 50, where it starts rising. For N = 2, the cluster achieves the least availability with respect to the others. Naturally, one would expect that as the number of Zookeeper servers increases, the availability of the cluster will increase. However, this is not the case, since N = 3 has the highest availability compared to N = 4 and N = 5. This may be attributed to the fact that, with the Zookeeper service requirement of 2f + 1, out of 5 servers, 2 are allowed to fail; the remaining 3 operate as if they were a single k Fault Tolerant server, since none of them is allowed to fail in order to keep the cluster going. The 3 remaining servers have no redundancy and operate as one, hence their contribution to availability is equivalent to that of one k Fault Tolerant server, but their contribution to both hardware and software failure rates is thrice that of a single server. This phenomenon decreases the availability of the cluster. It is therefore advisable from this model that only 3 Zookeeper servers be used when we require high availability for a Hadoop/MapReduce cluster. There is also the added advantage of reducing cost.


Chapter 5

Conclusion and Future Work

5.1 Conclusion

Based on the availability model for the Hadoop/MapReduce implementation using the Zookeeper Leader Election service in this thesis, the following conclusions are drawn:

1. The availability of a Zookeeper coordinated cluster is lower than that of a k Fault Tolerance system for higher values of the hardware failure rate. This can be attributed to the 2f + 1 property of Zookeeper, which is necessary to prevent the effect of crash failures.

2. Much effort must be put into fault fixing at the early stages of the cluster's life; as time goes on and most of the faults are fixed, management can decide where to channel resources meant for maintenance in order to help reduce cost.

3. Much of the maintenance effort must be channelled into fixing hardware faults, as their failure has an adverse effect on the availability of the cluster. There should exist a policy that determines how long a given piece of hardware can be used, and how many times it may be repaired and still be used. This is because, as hardware grows older, its failure rate increases, which in turn reduces its availability.

4. The hardware failure rate must be used as a criterion for purchasing new hardware for the cluster. Each piece of hardware comes with its own failure rate, and this rate differs between vendors. Vendors with low hardware failure rates must be favoured over those with high hardware failure rates, irrespective of cost.

5. The optimal number of servers for a Hadoop/MapReduce implementation that uses Zookeeper as a coordinating service is 3. One of the servers is used as the Leader and the other two as Followers in a Zookeeper Leader Election Service.

5.2 Future Work

Although Hadoop/MapReduce implemented with Zookeeper makes it easier to improve synchronization, eliminate deadlocks, and implement fault tolerance automatically in distributed systems, its availability is lower than that of a k Fault Tolerance system for higher values of λ_h. It is important to investigate this anomaly and try to fix it. It would also be worthwhile to reduce the load on the JobTracker, since it doubles as the cluster resource manager and the application life-cycle manager at the same time; this may help reduce latency during the execution of MapReduce jobs. In addition, the model assumption that failures are independent does not hold in practice; for example, a hardware fault may lead to the failure of another hardware or software component. It is important to investigate the detailed effect on the model should we assume that failures are not independent.


Appendix

A: Derivation of the Boundary-Condition Differential Equations for the Markov Multi-Host Model

The general Kolmogorov differential equation for a multi-host system is only true if i, j ≠ 0, N and i + j ≤ N. There may be situations when the above conditions cannot be met (boundary conditions), for instance, when (i = 0, j = 0), (i, 0), (0, j), (0, N), and (N, 0). In such situations, we must consider the use of the following equations to obtain the appropriate Kolmogorov differential equation. We show below how these differential equations (boundary conditions) are derived from the Markov multi-host system.

P′_{0,0}(t)

From Figure 5.1, we can derive the differential equation for P′_{0,0}(t) as follows:

P′_{0,0}(t) = µ_h P_{1,0}(t) + µ_s P_{0,1}(t) − (N − i − j) λ_s(t) P_{0,0}(t) − (N − i − j) λ_h P_{0,0}(t)

but i = 0 and j = 0, so

P′_{0,0}(t) = µ_h P_{1,0}(t) + µ_s P_{0,1}(t) − N[λ_s(t) + λ_h] P_{0,0}(t)

Figure 5.1: Model for P′_{0,0}(t)


Figure 5.2: Model for P′_{i,0}(t)

Figure 5.3: Model for P′_{0,N}(t)

P′_{i,0}(t)

From Figure 5.2, we can derive the differential equation for P′_{i,0}(t) as follows:

P′_{i,0}(t) = µ_s P_{i,1}(t) + µ_h P_{i+1,0}(t) + (N − i − j + 1) λ_h P_{i−1,0}(t) − [µ_h + (N − i − j) λ_h + (N − i − j) λ_s(t)] P_{i,0}(t)

but j = 0, so

P′_{i,0}(t) = µ_s P_{i,1}(t) + µ_h P_{i+1,0}(t) + (N − i + 1) λ_h P_{i−1,0}(t) − [µ_h + (N − i)(λ_h + λ_s(t))] P_{i,0}(t)

for i = 1, 2, 3, ..., N − 1.


Figure 5.4: Model for P′_{N,0}(t)

Figure 5.5: Model for P′_{0,j}(t)

P′_{0,N}(t)

From Figure 5.3, we can derive the differential equation for P′_{0,N}(t) as follows:

P′_{0,N}(t) = λ_s(t) P_{0,N−1}(t) − µ_s P_{0,N}(t)

P′_{N,0}(t)

From Figure 5.4, we can derive the differential equation for P′_{N,0}(t) as follows:

P′_{N,0}(t) = λ_h P_{N−1,0}(t) − µ_h P_{N,0}(t)

P′_{0,j}(t)

From Figure 5.5, we can derive the differential equation for P′_{0,j}(t) as follows:

P′_{0,j}(t) = µ_s P_{0,j+1}(t) + µ_h P_{1,j}(t) + (N − i − j + 1) λ_s(t) P_{0,j−1}(t) − [µ_s + (N − i − j) λ_h + (N − i − j) λ_s(t)] P_{0,j}(t)

but i = 0, so

P′_{0,j}(t) = µ_s P_{0,j+1}(t) + µ_h P_{1,j}(t) + (N − j + 1) λ_s(t) P_{0,j−1}(t) − [µ_s + (N − j)(λ_h + λ_s(t))] P_{0,j}(t)

for j = 1, 2, 3, ..., N − 1.

Figure 5.6: Markov State Diagram for a Cluster of Four Servers

B: Differential Equations for a Cluster of Four Servers (N = 4)

Figure 5.6 shows the Markov state diagram from which the following 15 Kolmogorov differential equations can be obtained:

P′_0(t) = µ_h P_5(t) + µ_s P_1(t) − 4[λ_h + λ_s(t)] P_0(t)    (5.1)

P′_1(t) = µ_h P_6(t) + µ_s P_2(t) + 4 λ_s(t) P_0(t) − [µ_s + 3(λ_h + λ_s(t))] P_1(t)    (5.2)

P′_2(t) = µ_h P_7(t) + µ_s P_3(t) + 3 λ_s(t) P_1(t) − [µ_s + 2(λ_h + λ_s(t))] P_2(t)    (5.3)

P′_3(t) = µ_h P_8(t) + µ_s P_4(t) + 2 λ_s(t) P_2(t) − [µ_s + λ_h + λ_s(t)] P_3(t)    (5.4)

P′_4(t) = λ_s(t) P_3(t) − µ_s P_4(t)    (5.5)

P′_5(t) = µ_h P_9(t) + µ_s P_6(t) + 4 λ_h P_0(t) − [µ_h + 3(λ_h + λ_s(t))] P_5(t)    (5.6)

P′_6(t) = µ_h P_{10}(t) + µ_s P_7(t) + 3 λ_h P_1(t) + 3 λ_s(t) P_5(t) − [µ_s + µ_h + 2(λ_h + λ_s(t))] P_6(t)    (5.7)

P′_7(t) = µ_h P_{11}(t) + µ_s P_8(t) + 2 λ_h P_2(t) + 2 λ_s(t) P_6(t) − [µ_s + µ_h + λ_h + λ_s(t)] P_7(t)    (5.8)

P′_8(t) = λ_h P_3(t) + λ_s(t) P_7(t) − [µ_s + µ_h] P_8(t)    (5.9)

P′_9(t) = µ_h P_{12}(t) + µ_s P_{10}(t) + 3 λ_h P_5(t) − [µ_h + 2(λ_h + λ_s(t))] P_9(t)    (5.10)

P′_{10}(t) = µ_h P_{13}(t) + µ_s P_{11}(t) + 2 λ_h P_6(t) + 2 λ_s(t) P_9(t) − [µ_s + µ_h + λ_h + λ_s(t)] P_{10}(t)    (5.11)

P′_{11}(t) = λ_h P_7(t) + λ_s(t) P_{10}(t) − [µ_h + µ_s] P_{11}(t)    (5.12)

P′_{12}(t) = µ_h P_{14}(t) + µ_s P_{13}(t) + 2 λ_h P_9(t) − [µ_h + λ_h + λ_s(t)] P_{12}(t)    (5.13)

P′_{13}(t) = λ_h P_{10}(t) + λ_s(t) P_{12}(t) − [µ_h + µ_s] P_{13}(t)    (5.14)

P′_{14}(t) = λ_h P_{12}(t) − µ_h P_{14}(t)    (5.15)

The availability for the cluster using Zookeeper as the coordinating service is given by:

A(t) = P_0(t) + P_1(t) + P_5(t)    (5.16)

C: Differential Equations for a Cluster of Five Servers (N = 5)

Figure 5.7 shows the Markov state diagram for a cluster of 5 servers, from which the following 21 Kolmogorov differential equations can be obtained:

Figure 5.7: Markov State Diagram for a Cluster of Five Servers

P′_0(t) = µ_h P_6(t) + µ_s P_1(t) − 5[λ_h + λ_s(t)] P_0(t)    (5.17)

P′_1(t) = µ_h P_7(t) + µ_s P_2(t) + 5 λ_s(t) P_0(t) − [µ_s + 4(λ_h + λ_s(t))] P_1(t)    (5.18)

P′_2(t) = µ_h P_8(t) + µ_s P_3(t) + 4 λ_s(t) P_1(t) − [µ_s + 3(λ_h + λ_s(t))] P_2(t)    (5.19)

P′_3(t) = µ_h P_9(t) + µ_s P_4(t) + 3 λ_s(t) P_2(t) − [µ_s + 2(λ_h + λ_s(t))] P_3(t)    (5.20)

P′_4(t) = µ_h P_{10}(t) + µ_s P_5(t) + 2 λ_s(t) P_3(t) − [µ_s + λ_h + λ_s(t)] P_4(t)    (5.21)

P′_5(t) = λ_s(t) P_4(t) − µ_s P_5(t)    (5.22)

P′_6(t) = µ_h P_{11}(t) + µ_s P_7(t) + 5 λ_h P_0(t) − [µ_h + 4(λ_h + λ_s(t))] P_6(t)    (5.23)

P′_7(t) = µ_h P_{12}(t) + µ_s P_8(t) + 4 λ_h P_1(t) + 4 λ_s(t) P_6(t) − [µ_h + µ_s + 3(λ_h + λ_s(t))] P_7(t)    (5.24)

P′_8(t) = µ_h P_{13}(t) + µ_s P_9(t) + 3 λ_h P_2(t) + 3 λ_s(t) P_7(t) − [µ_h + µ_s + 2(λ_h + λ_s(t))] P_8(t)    (5.25)

P′_9(t) = µ_h P_{14}(t) + µ_s P_{10}(t) + 2 λ_h P_3(t) + 2 λ_s(t) P_8(t) − [µ_h + µ_s + λ_h + λ_s(t)] P_9(t)    (5.26)

P′_{10}(t) = λ_h P_4(t) + λ_s(t) P_9(t) − [µ_h + µ_s] P_{10}(t)    (5.27)

P′_{11}(t) = µ_h P_{15}(t) + µ_s P_{12}(t) + 4 λ_h P_6(t) − [µ_h + 3(λ_h + λ_s(t))] P_{11}(t)    (5.28)

P′_{12}(t) = µ_h P_{16}(t) + µ_s P_{13}(t) + 3 λ_h P_7(t) + 3 λ_s(t) P_{11}(t) − [µ_h + µ_s + 2(λ_h + λ_s(t))] P_{12}(t)    (5.29)

P′_{13}(t) = µ_h P_{17}(t) + µ_s P_{14}(t) + 2 λ_h P_8(t) + 2 λ_s(t) P_{12}(t) − [µ_h + µ_s + λ_h + λ_s(t)] P_{13}(t)    (5.30)

P′_{14}(t) = λ_h P_9(t) + λ_s(t) P_{13}(t) − [µ_h + µ_s] P_{14}(t)    (5.31)

P′_{15}(t) = µ_h P_{18}(t) + µ_s P_{16}(t) + 3 λ_h P_{11}(t) − [µ_h + 2(λ_h + λ_s(t))] P_{15}(t)    (5.32)

P′_{16}(t) = µ_h P_{19}(t) + µ_s P_{17}(t) + 2 λ_h P_{12}(t) + 2 λ_s(t) P_{15}(t) − [µ_h + µ_s + λ_h + λ_s(t)] P_{16}(t)    (5.33)

P′_{17}(t) = λ_h P_{13}(t) + λ_s(t) P_{16}(t) − [µ_h + µ_s] P_{17}(t)    (5.34)

P′_{18}(t) = µ_h P_{20}(t) + µ_s P_{19}(t) + 2 λ_h P_{15}(t) − [µ_h + λ_h + λ_s(t)] P_{18}(t)    (5.35)

P′_{19}(t) = λ_h P_{16}(t) + λ_s(t) P_{18}(t) − [µ_h + µ_s] P_{19}(t)    (5.36)

P′_{20}(t) = λ_h P_{18}(t) − µ_h P_{20}(t)    (5.37)

The availability for the cluster using Zookeeper as the coordinating service is given by:

A(t) = P_0(t) + P_1(t) + P_2(t) + P_6(t) + P_7(t) + P_{11}(t)    (5.38)


Figure 5.8: MATLAB Solution to the Differential Equations

D: MATLAB Solution to the (N = 3) System of Kolmogorov Differential Equations

The MATLAB ODE function that was used to solve the system of ten Kolmogorov differential equations is shown in Figure 5.8. From the command line:

>> xx = [1 0 0 0 0 0 0 0 0 0];  % initial conditions for first iteration
>> tspan = [0 : 50];
>> [t, x] = ode45(@Host3, tspan, xx)
>> At = x(:, 1) + x(:, 2) + x(:, 5)  % A(t) = P0 + P1 + P4
>> plot(t, At, 'Linewidth', 3);
>> xlabel('Time');
>> ylabel('Availability');
>> title('Availability plot');
>> hold on

Notice that for the next iteration, we use the last row from the results of the solution above as the initial condition:

>> xx = [0.0066 0.0495 0.2469 0.6164 0.0040 0.0201 0.0502 0.0017 0.0042 0.0004];
>> tspan = [50 : 100];  % etc.


Figure 5.9: Starting Hadoop

E: How to Set Up Hadoop/MapReduce on Ubuntu 11.04

Single Node Hadoop Cluster

For complete lessons on setting up Hadoop, there is a step-by-step tutorial by Michael Noll on how to set up Hadoop on Ubuntu 10.04. The following steps describe how to set up hadoop-0.20.203.tar.gz on Ubuntu Linux 11.04:

1. Install sun-java6-jdk or java-6-openjdk.

2. Open a terminal (notice that the prompt # is for root and $ is for a normal user), create a new group, create a new user hadoop, and add the new user to the group hadoop:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop
$ sudo adduser hadoop admin

3. Switch user to hadoop and generate an SSH key:
$ su - hadoop
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

4. Test the SSH setup:
$ ssh localhost

5. Disable IPv6 by opening the file /etc/sysctl.conf and adding these lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

6. Restart your computer and type the following command; if the output is 1, then IPv6 has been disabled:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

7. Download hadoop-0.20.203.tar.gz; extract, rename, and change ownership by:
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.203.tar.gz
$ sudo mv hadoop-0.20.203 hadoop
$ sudo chown -R hadoop:hadoop hadoop

8. Add the Java installation path to the JAVA_HOME parameter in the hadoop-env.sh file located in /usr/local/hadoop/conf/. To be able to open and edit this file as root, execute
$ gksudo nautilus
and then open the file from the ensuing pop-up window:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/

9. Create the directory /usr/local/hadoop-datastore/hadoop as the base for Hadoop's temporary directories, where HDFS files will be stored:
$ sudo mkdir /usr/local/hadoop-datastore
$ sudo mkdir /usr/local/hadoop-datastore/hadoop
$ sudo chown hadoop:hadoop /usr/local/hadoop-datastore/hadoop
$ sudo chmod 750 /usr/local/hadoop-datastore/hadoop


10. Copy and paste the following into the hadoop/conf/core-site.xml file in between the <configuration></configuration> tags:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop-datastore/hadoop</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>

11. Copy and paste the following into the hadoop/conf/mapred-site.xml file in between the <configuration></configuration> tags:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>

12. Copy and paste the following into the hadoop/conf/hdfs-site.xml file in between the <configuration></configuration> tags:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>The number of HDFS replications to use.</description>
</property>

13. Log on to the hadoop account and format the NameNode:
$ su - hadoop
/usr/local/hadoop$ bin/hadoop namenode -format

14. Start Hadoop (the output is shown in Figure 5.9):
/usr/local/hadoop$ bin/start-all.sh


15. Stop Hadoop by:
/usr/local/hadoop$ bin/stop-all.sh

When Hadoop is running, we can access the web interfaces of the JobTracker(s), TaskTracker(s), and NameNode(s) through their respective links with a browser: JobTracker, TaskTracker, NameNode.

Multi-Node (N = 3) Hadoop Cluster

1. Configure and run each machine as if it were a single-node cluster.

2. Shut down each machine with bin/stop-all.sh.

3. Network all the machines with a single hub or switch and assign IP addresses 192.168.0.1 to master, 192.168.0.2 to slave1, and 192.168.0.3 to slave2.

4. Update the file /etc/hosts on all machines with the following lines:
# /etc/hosts for master and slaves
192.168.0.1 master
192.168.0.2 slave1
192.168.0.3 slave2

5. Configure SSH on the master to allow the user on the master node to connect to its own user account and that of the slaves:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave

6. Test the SSH setup by connecting from master to master:
hadoop@master: $ ssh master
and then from master to slave:
hadoop@master: $ ssh slave

7. On the master, update conf/masters to:
master
and on the slaves, update conf/slaves to:
master
slave1
slave2


8. On all machines, change the fs.default.name variable in conf/core-site.xml to host master and port 54310:
<value>hdfs://master:54310</value>

9. Change the mapred.job.tracker variable in conf/mapred-site.xml to host master and port 54311:
<value>master:54311</value>
Note that in conf/mapred-site.xml, the variable mapred.local.dir is the directory in which written MapReduce data is stored, mapred.map.tasks is the number of TaskTrackers in use (the default is 10 times the number of slaves), and mapred.reduce.tasks is 2 times the number of slave processors.

10. Again on all the machines, change the dfs.replication variable in conf/hdfs-site.xml to 3:
<value>3</value>

11. On the master, format the NameNode:
/usr/local/hadoop $ bin/hadoop namenode -format

12. To start the cluster, we must first start the HDFS (NameNode) on the master; this will start all the DataNode daemons on all the slaves. Next, we can start the MapReduce JobTracker daemon on the master, which will also start all the TaskTrackers on the slaves.
Start the NameNode on the master:
/usr/local/hadoop $ bin/start-dfs.sh
Examine the success or failure of this command in the slave's log file logs/hadoop-hadoop-datanode-slave.log. To determine the processes that are currently running on the cluster, execute the following on the master:
/usr/local/hadoop $ jps
The NameNode, DataNode, SecondaryNameNode, and Jps processes must all be running.
Start the JobTracker on the master:
/usr/local/hadoop $ bin/start-mapred.sh
Examine the success or failure of this command in the slave's log file logs/hadoop-hadoop-tasktracker-slave.log. Execute jps on the master to determine the Java processes that are currently running on the cluster. The NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker, and Jps processes must all be running.

13. To submit a job to the cluster, follow the steps in the WordCount subsection below.


Figure 5.10: Submitting a job to the Hadoop cluster

14. In stopping the cluster, we must first stop the JobTracker and its TaskTrackers, followed by the HDFS (NameNode and DataNodes). Issue the following commands on the master:
/usr/local/hadoop $ bin/stop-mapred.sh
/usr/local/hadoop $ bin/stop-dfs.sh

Submitting a WordCount Job to the Hadoop/MapReduce Cluster

WordCount is a Java-based MapReduce program that is executed by Hadoop/MapReduce to count the number of words found in the input file(s) supplied to MapReduce during execution (Figure 5.10). Compile the WordCount code using the NetBeans IDE to obtain the WordCount.jar file.

1. Copy WordCount.jar to the hadoop installation directory (/usr/local/hadoop).

2. Create a directory in the hadoop installation directory to be used to store the text files whose words Hadoop/MapReduce will be counting. Call the directory /usr/local/hadoop/inputFile.

Figure 5.11: Output of WordCount job

3. Copy the files (file1.txt and file2.txt) whose words will be counted into the directory /usr/local/hadoop/inputFile and change their ownership to the hadoop user. Grant the hadoop user 755 access rights to the directory and its contents.

4. Start Hadoop.

5. Supply the contents of the directory inputFile as input to the Hadoop cluster:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop fs -put inputFile input
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls

6. Execute the job by:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar WordCount.jar input output
The last command will create a directory called output and store the output of the job there. To run the same job again with a different output name, type:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar WordCount.jar input output2

7. To view the output files (Figure 5.11) on the distributed filesystem inside the terminal:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop fs -get output output
hadoop@ubuntu:/usr/local/hadoop$ cat output/*

F: How to Install and Run Zookeeper on Ubuntu 11.04

Zookeeper provides simple primitives that distributed applications can build on to implement the higher-level synchronization, configuration, and maintenance necessary to cope with race conditions, network partitions, and deadlock.

Deploying Zookeeper Ensemble on a Single Machine

1. Download a stable Zookeeper release from Apache Zookeeper

2. Extract it into 3 directories and rename them as:
/usr/local/zookeeper1
/usr/local/zookeeper2
/usr/local/zookeeper3

Figure 5.12: Zookeeper ensemble

3. Modify the conf/zoo.cfg file for each server as follows:
Server1: /usr/local/zookeeper1/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper1
clientPort=2184
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
Server2: /usr/local/zookeeper2/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper2
clientPort=2185
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
Server3: /usr/local/zookeeper3/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper3
clientPort=2186
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

4. Create a server id file (myid) for each server to enable the server to identify itself at start-up:
in /var/zookeeper1/myid write 1
in /var/zookeeper2/myid write 2
in /var/zookeeper3/myid write 3

5. To start Zookeeper, open 3 terminals and cd into the 3 Zookeeper server directories; for example:
cd /usr/local/zookeeper1
bin/zkServer.sh start
Do this for all 3 servers. Figure 5.12 shows that Server2 is the Leader (LEADING) and the rest are Followers (FOLLOWING).

6. To connect as a client to any of the servers, execute the following command from a Follower:
bin/zkCli.sh -server localhost:2184
where 2184 is the client port. We can now create nodes and perform other operations.

7. To disconnect the client from the server, type the following on the client machine:
quit

8. To stop Zookeeper, execute the following command on each of the 3 servers:
bin/zkServer.sh stop

Deploying Zookeeper Ensemble across a Network

1. Download and unpack Zookeeper on each machine.

2. On each server, create and configure the conf/zoo.cfg file. A sample is shown:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.1=192.168.145.110:2888:3888
server.2=192.168.145.111:2888:3888
server.3=192.168.145.112:2888:3888

3. Create the myid file in the dataDir directory of each server as follows:
At 192.168.145.110, /var/zookeeper/myid contains 1
At 192.168.145.111, /var/zookeeper/myid contains 2
At 192.168.145.112, /var/zookeeper/myid contains 3

4. To start Zookeeper, execute the following on each machine:
cd /usr/local/zookeeper-3.3.1
bin/zkServer.sh start

5. Connect to Zookeeper by:
bin/zkCli.sh -server <hostname>:<clientPort>
or, more specifically:
bin/zkCli.sh -server 192.168.145.110:2181

6. We can disconnect from Zookeeper using quit, and stop Zookeeper using bin/zkServer.sh stop.


References

[1] http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html, http://www.nielsen.com/us/en.html

[2] http://www.facebook.com/press/info.php?statisticshttp://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html

[3] Matei Zaharia, A Presentation on Cloud Computing with MapReduceand Hadoop, UC Berkeley AMP Lab, 2010.

[4] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified DataProcessing on Large Clusters, Google, Inc. 2004.

[5] Grant Mackey, Saba Sehrish, John Bent, Julio Lopez, Salman Habib,Jun Wang. Introducing Map-Reduce to High End Computing, Uni-versity of Central Florida, Los Alamos National Lab, Carnegie MelonUniversity.

[6] APAche Hadoop, http://hadoop.apache.org/

[7] Ian Lumb, Eunmi Choi, Bhaskar Prasad Rimal; A Taxonomy andSurvey of Cloud Computing Systems, 2009.

[8] Flavio Junqueira, Benjamin Reed, Zookeeper Tutorial, Yahoo! Research, Eurosys 2011. http://cwiki.apache.org/confluence/display/ZOOKEEPER/EurosysTutorial

[9] Roger Jennings, Cloud Computing with the Windows Azure Platform, Wiley Publishing, Inc., 2009.

[10] Michael Miller, Cloud Computing: Web-Based Applications That Change the Way You Work and Collaborate Online, Que Publishing, August 2008.

[11] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia, Above the Clouds: A Berkeley View of Cloud Computing, Electrical Engineering and Computer Sciences, University of California at Berkeley, February 10, 2009.

[12] V. Cortellessa, Relational Characterizations of System Fault Tolerance, June 7, 2004.

[13] Bruno L. C. Ramos, Challenging Malicious Inputs with Fault Tolerance Techniques, Black Hat Europe, 2007.

[14] Naima Aksu, Fault Tolerance in Distributed Systems, Term Project.

[15] Dimitris N. Chorafas, Cloud Computing Strategies, Taylor and Francis Group, 2011.

[16] Priya Venkitakrishnan, Rollback and Recovery Mechanisms in Distributed Systems, Department of Computer Science, University of Texas at Arlington.

[17] Antonina Litvinova, Christian Engelmann, Stephen L. Scott, A Proactive Fault-tolerance Framework for High-Performance Computing.

[18] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, The Google File System, Bolton Landing, New York, USA, October 19-22, 2003.

[19] Naushad UzZaman, Survey on Google File System, Fall 2007.

[20] http://docs.google.com/viewer?a=v&q=cache:kbrCrmMU5IAJ:www.math.uah.edu/stat/special/Weibull.pdf+Weibull+distribution

[21] Tom White, Hadoop: The Definitive Guide, Second Edition, O'Reilly Media, Inc., 2011.

[22] Chuck Lam, Hadoop In Action, Manning Publications Co., 2011.

[23] http://zookeeper.apache.org/doc/r3.3.3/

[24] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, Benjamin Reed, ZooKeeper: Wait-free Coordination for Internet-scale Systems, Yahoo! Grid, 2010.

[25] http://zookeeper.sourceforge.net/index.sf.shtml

[26] Fabrizio Marozzo, Domenico Talia, and Paolo Trunfio, Adapting MapReduce for Dynamic Environments Using a Peer-to-Peer Model, DEIS, University of Calabria, Via P. Bucci 41C, 87036 Rende, Italy.


[27] Francesco Salbaroli, Enhancing the Hadoop MapReduce Framework by Adding Fault Tolerant Capabilities: Proposal for a Fault Tolerant Hadoop JobTracker, IBM Innovation Centre, 2008.

[28] Devaraju K, High Availability for JobTracker, https://issues.apache.org/jira/browse/MAPREDUCE, June 2011.

[29] Patrícia A. Oliveira, José Marcos Nogueira, Germán Goldszmidt, Availability in Telecommunication Management Distributed Systems, IBM Research, Hawthorne, NY, USA.

[30] J. C. Laprie, Dependability Evaluation of Software Systems in Operation, IEEE Transactions on Software Engineering SE-10(6), 1984.

[31] A. L. Goel, J. Soenjoto, Models for Hardware/Software Operational Performance Evaluation, IEEE Transactions on Reliability R-30 (1981) 232-239.

[32] Y. S. Dai, M. Xie, K. L. Poh, G. Q. Liu, A Study of Service Reliability and Availability for Distributed Systems, Department of Industrial and Systems Engineering, National University of Singapore, Kent Ridge Crescent, Singapore 119260, 2002.

[33] C. D. Lai, M. Xie, K. L. Poh, Y. S. Dai, P. Yang, A Model for Availability Analysis of Distributed Software/Hardware Systems, Northern Telecom, Toronto, Canada, 2002.

[34] Z. Jelinski, P. B. Moranda, Software Reliability Research, in: W. Freiberger (Ed.), Statistical Computer Performance Evaluation, Academic Press, New York, 1972.
