A New Analytical Technique for Designing

Provably Efficient MapReduce Schedulers

Yousi Zheng∗, Ness B. Shroff∗†, Prasun Sinha†

∗Department of Electrical and Computer Engineering

†Department of Computer Science and Engineering

The Ohio State University, Columbus, OH 43210

Abstract

With the rapid increase in size and number of jobs that are being processed in the MapReduce framework, efficiently scheduling jobs under this framework is becoming increasingly important. We consider the problem of minimizing the total flow-time of a sequence of jobs in the MapReduce framework, where the jobs arrive over time and need to be processed through both Map and Reduce procedures before leaving the system. We show that for this problem, no on-line algorithm can achieve a constant competitive ratio (defined as the ratio between the completion time of the online algorithm and the completion time of the optimal non-causal off-line algorithm). We then construct a slightly weaker metric of performance called the efficiency ratio. An online algorithm is said to achieve an efficiency ratio of $\gamma$ when the flow-time incurred by that scheduler divided by the minimum flow-time achieved over all possible schedulers is almost surely less than or equal to $\gamma$. Under some weak assumptions, we then show a surprising property that for the flow-time problem any work-conserving scheduler has a constant efficiency ratio in both preemptive and non-preemptive scenarios. More importantly, we develop an online scheduler with a very small efficiency ratio (2), and through simulations we show that it outperforms the state-of-the-art schedulers.

I. INTRODUCTION

MapReduce is a framework designed to process massive amounts of data in a cluster of machines [1]. Although it was first proposed by Google [1], today, many other companies including Microsoft, Yahoo, and Facebook also use this framework. Currently this framework is widely used for applications such as search indexing, distributed searching, web statistics generation, and data mining.

MapReduce has two elemental processes: Map and Reduce. For the Map tasks, the inputs are divided into several small sets and processed by different machines in parallel. The output of the Map tasks is a set of pairs in <key, value> format. The Reduce tasks then operate on this intermediate data, possibly running the operation on multiple machines in parallel to generate the final result.

Each arriving job consists of a set of Map tasks and Reduce tasks. The scheduler is centralized and is responsible for deciding which task will be executed at what time and on which machine. The key metric considered in this paper is the total delay in the system per job, which includes both the processing delay and the delay in waiting before the first task in the job begins to get processed.


A critical consideration for the design of the scheduler is the dependence between the Map and Reduce tasks. For each job, the Map tasks need to be finished before starting any of its Reduce tasks¹ [1], [3]. Various scheduling solutions have been proposed within the MapReduce framework [2]–[6], but analytical bounds on performance have been derived in only some of these works [2], [3], [6]. However, rather than focusing directly on the flow-time, for deriving performance bounds, [2], [3] have considered a slightly different problem of minimizing the total completion time, and [6] has assumed speed-up of the machines. Further discussion of these schedulers is given in Section II.

In this paper, we directly analyze the total delay (flow-time) in the system. In an attempt to minimize it, we introduce a new metric to analyze the performance of schedulers, called the efficiency ratio. Based on this new metric, we analyze and design several schedulers that can guarantee the performance of the MapReduce framework.

The contributions of this paper are as follows:

• For the problem of minimizing the total delay (flow-time) in the MapReduce framework, we show that no on-line algorithm can achieve a constant competitive ratio. (Sections III and IV)

• To directly analyze the total delay in the system, we propose a new metric to measure the performance of schedulers, which we call the efficiency ratio. (Section IV)

• Under some weak assumptions, we then show a surprising property that for the flow-time problem any work-conserving scheduler has a constant efficiency ratio in both preemptive and non-preemptive scenarios (precise definitions are provided in Section III). (Section V)

• We present an online scheduling algorithm called ASRPT (Available Shortest Remaining Processing Time) with a very small (less than 2) efficiency ratio (Section VI), and show that it outperforms state-of-the-art schedulers through simulations (Section VII).

II. RELATED WORK

In Hadoop [7], the most widely used implementation, the default scheduling method is First In First Out (FIFO). FIFO suffers from the well-known head-of-line blocking problem, which is mitigated in [4] by using the Fair scheduler.

In the case of the Fair scheduler [4], one problem is that jobs stick to the machines on which they are initially scheduled, which could result in significant performance degradation. The solution to this problem given by [4] is delayed scheduling. However, the Fair scheduler can cause a starvation problem (see [4], [5]). In [5], the authors propose a Coupling scheduler to mitigate this problem, and analyze its performance.

In [3], the authors assume that all Reduce tasks are non-preemptive. They design a scheduler that minimizes the weighted sum of the job completion times by determining the ordering of the tasks on each processor. The authors show that this problem is NP-hard even in the offline case, and propose approximation algorithms that work within a factor of 3 of the optimal. However, as the authors point out in the article, they ignore the dependency between Map and Reduce tasks, which is a critical property of the MapReduce framework. Based on the work of [3], the authors in [2] add a precedence graph to describe the precedence between Map and Reduce tasks, and consider the effect of the Shuffle phase between the Map and Reduce tasks. They break the structure of the MapReduce framework from the job level to the task level using the Shuffle phase, and seek to minimize the total completion time of tasks instead of jobs. In both [3] and [2], the schedulers use an LP-based lower bound which needs to be recomputed frequently. Moreover, the practical cost corresponding to the delay (or storage) of jobs is directly related to the total flow-time, not the completion time. Although the optimal solution is the same for these two optimization problems, a performance ratio obtained for the total flow-time is much looser than the corresponding ratio obtained for the total completion time. For example, if the flow-time of each job increases with its arrival time, then the total flow-time (delay) does not have a constant efficiency ratio, but the completion time may have one. That means that even if the total completion time has a constant efficiency ratio, this cannot guarantee the performance in terms of real delay.

¹Here, we consider the most popular case in reality, without the Shuffle phase. For a discussion of the Shuffle phase, see Section II and [2].

In [6], the authors study the problem of minimizing the total flow-time of all jobs. They propose an $O(1/\epsilon^5)$-competitive algorithm with $(1+\epsilon)$ speed for the online case, where $0 < \epsilon \le 1$. However, speed-up of the machines is necessary in the algorithm; otherwise, there is no guarantee on the competitive ratio (if $\epsilon$ decreases to 0, the competitive ratio increases to infinity correspondingly).

III. SYSTEM MODEL

Consider a data center with $N$ machines. We assume that each machine can only process one job at a time. A machine could represent a processor, a core in a multi-core processor, or a virtual machine. Assume that there are $n$ jobs arriving into the system. We assume the scheduler periodically collects information on the state of jobs running on the machines, which is used to make scheduling decisions. Such a time-slotted structure can efficiently reduce the performance degradation caused by data locality (see [4], [8]). We assume that the distribution of job arrivals in each time slot is i.i.d., and the arrival rate is $\lambda$.

Each job $i$ brings $M_i$ units of workload for its Map tasks and $R_i$ units of workload for its Reduce tasks. Each Map task has 1 unit of workload²; however, each Reduce task can have multiple units of workload. Time is slotted and each machine can run one unit of workload in each time slot. Assume that $\{M_i\}$ are i.i.d. with expectation $M$, and $\{R_i\}$ are i.i.d. with expectation $R$. We assume that the traffic intensity $\rho < 1$, i.e., $\lambda < \frac{N}{M+R}$. Assume that the moment generating function of the workload of arriving jobs in a time slot is finite in some neighborhood of 0. In time slot $t$, $m_{i,t}$ and $r_{i,t}$ machines are scheduled for the Map and Reduce tasks of job $i$, respectively. Each job contains several tasks: we assume that job $i$ contains $K_i$ tasks, and that the workload of Reduce task $k$ of job $i$ is $R_i^{(k)}$. Thus, for any job $i$, $\sum_{k=1}^{K_i} R_i^{(k)} = R_i$. In time slot $t$, $r_{i,t}^{(k)}$ machines are scheduled for Reduce task $k$ of job $i$. As each Reduce task may consist of multiple units of workload, it can be processed in either a preemptive or a non-preemptive fashion, based on the type of scheduler.

Definition 1. A scheduler is called preemptive if Reduce tasks belonging to the same job can run in parallel on multiple machines, can be interrupted by any other task, and can be rescheduled to different machines in different time slots.

A scheduler is called non-preemptive if each Reduce task can only be scheduled on one machine and, once started, it must keep running without any interruption.

²Because the Map tasks are independent and have small workload [5], this assumption is valid.

In any time slot $t$, the number of assigned machines must be less than or equal to the total number of machines $N$, i.e., $\sum_{i=1}^{n} (m_{i,t} + r_{i,t}) \le N, \ \forall t$.

Let the arrival time of job $i$ be $a_i$, the time slot in which its last Map task finishes execution be $f_i^{(m)}$, and the time slot in which all of its Reduce tasks are completed be $f_i^{(r)}$.

For any job $i$, the workload of its Map tasks should be processed by the assigned machines between time slots $a_i$ and $f_i^{(m)}$, i.e., $\sum_{t=a_i}^{f_i^{(m)}} m_{i,t} = M_i, \ \forall i \in \{1,...,n\}$. For any job $i$, if $t < a_i$ or $t > f_i^{(m)}$, then $m_{i,t} = 0$. Similarly, for any job $i$, the workload of its Reduce tasks should be processed by the assigned machines between time slots $f_i^{(m)}+1$ and $f_i^{(r)}$, i.e., $\sum_{t=f_i^{(m)}+1}^{f_i^{(r)}} r_{i,t} = R_i, \ \forall i \in \{1,...,n\}$. Since no Reduce task of job $i$ can start before the time slot in which its Map tasks finish, if $t < f_i^{(m)}+1$ or $t > f_i^{(r)}$, then $r_{i,t} = 0$.

The waiting and processing time $S_{i,t}$ of job $i$ in time slot $t$ is represented by the indicator function $\mathbb{1}_{\{a_i \le t \le f_i^{(r)}\}}$. Thus, we define the delay cost as $\sum_{t=1}^{\infty}\sum_{i=1}^{n} S_{i,t}$, which is equal to $\sum_{i=1}^{n} \big(f_i^{(r)} - a_i + 1\big)$. The flow-time $F_i$ of job $i$ is equal to $f_i^{(r)} - a_i + 1$. The objective of the scheduler is to determine the assignment of jobs in each time slot such that the cost of delaying the jobs, or equivalently the flow-time, is minimized.
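Both forms of the delay cost are easy to compute in a discrete-time simulation. The sketch below (with hypothetical arrival and finishing slots, not data from the paper) checks that the indicator-sum definition and the per-job form $\sum_i (f_i^{(r)} - a_i + 1)$ agree:

```python
# Hypothetical example jobs: (arrival slot a_i, finish slot f_i^(r)).
jobs = [(1, 4), (2, 3), (5, 9)]

# Per-job form: F_i = f_r - a + 1.
flow_time = sum(f_r - a + 1 for a, f_r in jobs)

# Indicator form: sum over slots t of the number of jobs with a <= t <= f_r.
horizon = max(f_r for _, f_r in jobs)
indicator_sum = sum(
    sum(1 for a, f_r in jobs if a <= t <= f_r)
    for t in range(1, horizon + 1)
)

assert flow_time == indicator_sum  # the two definitions agree
print(flow_time)
```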

For the preemptive scenario, the problem definition is as follows:

$$\begin{aligned}
\min_{m_{i,t},\, r_{i,t}}\quad & \sum_{i=1}^{n} \left(f_i^{(r)} - a_i + 1\right) \\
\text{s.t.}\quad & \sum_{i=1}^{n} (m_{i,t} + r_{i,t}) \le N,\quad r_{i,t} \ge 0,\ m_{i,t} \ge 0,\ \forall t, \\
& \sum_{t=a_i}^{f_i^{(m)}} m_{i,t} = M_i,\quad \sum_{t=f_i^{(m)}+1}^{f_i^{(r)}} r_{i,t} = R_i,\quad \forall i \in \{1,...,n\}.
\end{aligned} \tag{1}$$

In the non-preemptive scenario, the Reduce tasks cannot be interrupted by other jobs. Once a Reduce task begins execution on a machine, it has to keep running on that machine without interruption until all of its workload is finished. The optimization problem in this scenario is similar to Eq. (1), with additional constraints representing the non-preemptive nature, as shown below:

$$\begin{aligned}
\min_{m_{i,t},\, r_{i,t}^{(k)}}\quad & \sum_{i=1}^{n} \left(f_i^{(r)} - a_i + 1\right) \\
\text{s.t.}\quad & \sum_{i=1}^{n} \Big(m_{i,t} + \sum_{k=1}^{K_i} r_{i,t}^{(k)}\Big) \le N,\ \forall t, \\
& \sum_{t=a_i}^{f_i^{(m)}} m_{i,t} = M_i,\quad m_{i,t} \ge 0,\quad \forall i \in \{1,...,n\}, \\
& \sum_{t=f_i^{(m)}+1}^{f_i^{(r)}} r_{i,t}^{(k)} = R_i^{(k)},\quad \forall i \in \{1,...,n\},\ \forall k \in \{1,...,K_i\}, \\
& r_{i,t}^{(k)} \in \{0, 1\},\quad r_{i,t}^{(k)} = 1 \ \text{if}\ 0 < \sum_{s=0}^{t-1} r_{i,s}^{(k)} < R_i^{(k)}.
\end{aligned} \tag{2}$$

We show that the offline scheduling problem is strongly NP-hard by reducing 3-PARTITION to the decision version of the scheduling problem.

Consider an instance of 3-PARTITION: given a set $A$ of $3m$ elements $a_i \in \mathbb{Z}^+$, $i = 1,...,3m$, and a bound $N \in \mathbb{Z}^+$, such that $N/4 < a_i < N/2, \ \forall i$ and $\sum_i a_i = mN$, the problem is to decide whether $A$ can be partitioned into $m$ disjoint sets $A_1,...,A_m$ such that $\sum_{a_i \in A_k} a_i = N, \ \forall k$. Note that, by the range of the $a_i$'s, every such $A_k$ must contain exactly 3 elements.

Given an instance of 3-PARTITION, we consider the following instance of the scheduling problem. There are $N$ machines in the system (without loss of generality, we assume $N \bmod 3 = 0$), and $3m$ jobs, $J_1,...,J_{3m}$, where each job arrives at time 0, and $M_i = a_i$, $R_i = \frac{2}{3}N - a_i$. Furthermore, let $\omega = 3m^2 + 3m$. The problem is to decide whether there is a feasible schedule with a flow-time of no more than $\omega$. This mapping can clearly be done in polynomial time.
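The mapping in the reduction can be written out mechanically. The following sketch (illustrative; the instance values are made up) constructs the scheduling instance from a 3-PARTITION instance $(A, N)$:

```python
from fractions import Fraction

def scheduling_instance(A, N):
    """Map a 3-PARTITION instance (elements A, bound N) to the scheduling
    instance used in the reduction: N machines, one job per element, all
    arriving at time 0, with M_i = a_i and R_i = (2/3)N - a_i, and the
    flow-time budget omega = 3m^2 + 3m."""
    m = len(A) // 3
    jobs = [(a, Fraction(2, 3) * N - a) for a in A]  # (M_i, R_i) pairs
    omega = 3 * m * m + 3 * m
    return jobs, omega

# Hypothetical 3-PARTITION instance with m = 2, N = 12:
# every element lies in (N/4, N/2) = (3, 6), and they sum to m*N = 24.
A = [4, 4, 4, 4, 4, 4]
jobs, omega = scheduling_instance(A, 12)
assert all(M + R == Fraction(2, 3) * 12 for M, R in jobs)  # M_i + R_i = 2N/3
print(omega)  # the flow-time threshold for m = 2
```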

Let $S$ denote an arbitrary feasible schedule for the above problem, and $F(S)$ its flow-time. Let $n_k^S$ denote the number of jobs that finish in the $k$-th time slot in schedule $S$. We then have $F(S) = \sum_{k=1}^{\infty} k \times n_k^S$. We drop the superscript in the following when there is no confusion. It is clear that $n_1 = 0$. Let $N_k = \sum_{j=1}^{k} n_j$ denote the number of jobs finished by time $k$. The following lemma provides an upper bound on $N_k$.

Lemma 1. (1) $N_{2k} \le 3k$ and $N_{2k+1} \le 3k+1$, $\forall k$. (2) $\sum_{k=1}^{2m} N_k \le 3m^2$, and the upper bound is attained only if the following condition holds:

$$N_{2k} = 3k,\ \forall k \in \{1,...,m\},\qquad N_{2k+1} = 3k,\ \forall k \in \{1,...,m-1\}. \tag{3}$$

Proof: Since $M_i + R_i = 2N/3$, at most $3k$ jobs can finish by time slot $2k$; hence $N_{2k} \le 3k$. Similarly, $N_{2k+1} \le \lfloor 3(2k+1)/2 \rfloor = 3k+1$. Now consider the second part. First, if Eq. (3) holds, we have $\sum_{k=1}^{2m} N_k = \sum_{k=1}^{m} 3k + \sum_{k=1}^{m-1} 3k = 3m^2$. Now assume $N_{2k+1} = 3k+1$ for some $k \in \{1,...,m-1\}$. Then we must have $N_{2k} < 3k$: otherwise, the first $2k$ time slots would be fully occupied by the first $3k$ jobs (by the fact that $M_i + R_i = 2N/3$), and hence both the $M$-task and the $R$-task of the $(3k+1)$-th job would have to be done in the $(2k+1)$-th time slot, a contradiction. By a similar argument, we also have $N_{2k+2} < 3(k+1)$. Hence, if $N_{2k+1} = 3k+1$ for one or more $k \in \{1,...,m-1\}$, we have $\sum_{k=1}^{2m} N_k < 3m^2$. Therefore, $3m^2$ is an upper bound on $\sum_{k=1}^{2m} N_k$, which is attained only if Eq. (3) holds.

Lemma 2. $F(S) \ge 3m^2 + 3m$, and the lower bound is attained only if (3) holds.

Proof: We note that
$$F(S) = \sum_{k=1}^{\infty} k \times n_k = \sum_{k=1}^{\infty} [(2m+1) - (2m+1-k)] \times n_k \ge (2m+1)\sum_{k=1}^{\infty} n_k - \sum_{k=1}^{2m} (2m+1-k) n_k = (2m+1) \cdot 3m - \sum_{k=1}^{2m} N_k \ge 6m^2 + 3m - 3m^2 = 3m^2 + 3m,$$
where the last inequality follows from Lemma 1. The lower bound is attained only if $n_k = 0$ for $k > 2m$ and $\sum_{k=1}^{2m} N_k = 3m^2$, which holds only when Eq. (3) holds.

Theorem 1. The scheduling problem (both preemptive and non-preemptive) is NP-complete in the strong sense.

Proof: The scheduling problem is clearly in NP. We reduce 3-PARTITION to the scheduling problem using the above mapping. If the answer to the partition problem is 'yes', then $A$ can be partitioned into disjoint sets $A_1,...,A_m$, each containing 3 elements. Consider the schedule $S$ where, for $k = 1,...,m$, the $M$-tasks of the three jobs corresponding to $A_k$ are scheduled in the $(2k-1)$-th time slot, and their $R$-tasks are scheduled in the $2k$-th time slot. The schedule is clearly feasible. Furthermore, (3) holds in this schedule. Therefore, $F(S) = 3m^2 + 3m$ by Lemma 2. On the other hand, if there is a schedule $S$ for the scheduling problem such that $F(S) = 3m^2 + 3m$, then (3) holds by Lemma 2. An induction on $k$ then shows that, for each $k = 1,...,m$, the $(2k-1)$-th time slot must be fully occupied by the $M$-tasks of three jobs, and the $2k$-th time slot must be fully occupied by the corresponding $R$-tasks, which corresponds to a feasible partition of $A$. Since 3-PARTITION is strongly NP-complete, so is the scheduling problem.
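As a quick arithmetic check of the 'yes' direction (illustrative, not part of the proof): in the schedule built from a partition, the three jobs of $A_k$ finish in slot $2k$, so $n_{2k} = 3$ and every other $n_k = 0$, and the flow-time sums to exactly $3m^2 + 3m$:

```python
# In the schedule built from a partition A_1, ..., A_m, the three jobs of
# A_k run their M-tasks in slot 2k-1 and their R-tasks in slot 2k,
# so exactly 3 jobs finish in each even slot 2k.
for m in range(1, 50):
    n = {2 * k: 3 for k in range(1, m + 1)}   # n_k: jobs finishing in slot k
    F = sum(k * nk for k, nk in n.items())    # F(S) = sum_k k * n_k
    assert F == 3 * m * m + 3 * m             # the bound of Lemma 2, attained
print("F(S) = 3m^2 + 3m verified for m = 1..49")
```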

IV. EFFICIENCY RATIO

The competitive ratio is often used as a measure of performance in a wide variety of scheduling problems. For our problem, a scheduling algorithm $S$ has a competitive ratio of $c$ if, for any total time $T$, any number of arrivals $n$ in the time $T$, any arrival time $a_i$ of each job $i$, and any workloads $M_i$ and $R_i$ of the Map and Reduce tasks of each arriving job $i$, the total flow-time $F_S(T, n, \{a_i, M_i, R_i; i = 1...n\})$ of scheduling algorithm $S$ satisfies the following:

$$\frac{F_S(T, n, \{a_i, M_i, R_i; i = 1...n\})}{F^*(T, n, \{a_i, M_i, R_i; i = 1...n\})} \le c, \tag{4}$$

where

$$F^*(T, n, \{a_i, M_i, R_i; i = 1...n\}) = \min_{S} F_S(T, n, \{a_i, M_i, R_i; i = 1...n\}) \tag{5}$$

is the minimum flow-time, achieved by an optimal off-line algorithm.

For our problem, we construct an arrival pattern such that no scheduler can achieve a constant competitive ratio $c$, as shown in the following example.

Consider the simplest example of the non-preemptive scenario, in which the data center has only one machine, i.e., $N = 1$. There is a job with 1 Map task and 1 Reduce task. The workloads of the Map and Reduce tasks are $1$ and $2Q+1$, respectively. Without loss of generality, we assume that this job arrives in time slot 1. We can show that, for any given constant $c_0$ and any scheduling algorithm, there exists a special sequence of future arrivals and workloads such that the competitive ratio $c$ is greater than $c_0$.

Fig. 1. The quick example: (a) Case I, scheduler $S$; (b) Case I, scheduler $S_1$; (c) Case II, scheduler $S$; (d) Case II, scheduler $S_1$.

For any scheduler $S$, we assume that the Map task of the job is scheduled in time slot $H+1$, and the Reduce task is scheduled starting in the $(H+L+1)$-st time slot ($H, L \ge 0$). The Reduce task will then last for $2Q+1$ time slots, because the Reduce task cannot be interrupted. This scheduler's operation is given in Fig. 1(a). Observe that any arbitrary scheduler's operation can be represented by choosing appropriate values for $H$ and $L$. There are two possibilities: $H + L > 2(c_0-1)(Q+1) + 1$ or $H + L \le 2(c_0-1)(Q+1) + 1$.

Now let us construct an arrival pattern $A$ and a scheduler $S_1$. When $H + L > 2(c_0-1)(Q+1) + 1$, the arrival pattern $A$ has only this unique job arriving in the system. When $H + L \le 2(c_0-1)(Q+1) + 1$, $A$ has $Q$ additional jobs arriving in time slots $H+L+2, H+L+4, ..., H+L+2Q$, and the Map and Reduce workloads of each such arrival are both 1. The scheduler $S_1$ schedules the Map task in the first time slot and the Reduce task in the second time slot when $H + L > 2(c_0-1)(Q+1) + 1$. When $H + L \le 2(c_0-1)(Q+1) + 1$, $S_1$ schedules the Map task of the first arrival in the same time slot as $S$, and schedules the last $Q$ arrivals before scheduling the Reduce task of the first arrival. The operation of the scheduler $S_1$ is depicted in Figs. 1(b) and 1(d).

Case I: $H + L > 2(c_0-1)(Q+1) + 1$.

In this case, the schedulers $S$ and $S_1$ are shown in Figs. 1(a) and 1(b), respectively. The flow-time of $S$ is $H + L + 2Q + 1$, and the flow-time of $S_1$ is $2Q+2$. So the competitive ratio $c$ of $S$ must satisfy the following:

$$c \ge \frac{H + L + 2Q + 1}{2Q + 2} > c_0. \tag{6}$$

Case II: $H + L \le 2(c_0-1)(Q+1) + 1$.

In this case, for the scheduler $S$, the total flow-time is greater than $(H + L + 2Q + 1) + (2Q+1)Q$, no matter how the last $Q$ jobs are scheduled. The schedulers $S$ and $S_1$ are shown in Figs. 1(c) and 1(d), respectively.

Then, for the scheduler $S$, the competitive ratio $c$ satisfies the following:

$$c \ge \frac{(H + L + 2Q + 1) + (2Q+1)Q}{2Q + (H + L + 2Q + 2Q + 1)} \ge \frac{(2Q+1)(Q+1)}{6Q + 1 + (2(c_0-1)(Q+1) + 1)} > \frac{(2Q+1)(Q+1)}{2(c_0+2)(Q+1)} > \frac{Q}{c_0+2}. \tag{7}$$

By selecting $Q > c_0^2 + 2c_0$ and using Eq. (7), we get $c > c_0$.

Thus, we have shown that for any constant $c_0$ and scheduler $S$, there are sequences of arrivals and workloads such that the competitive ratio $c$ is greater than $c_0$. In other words, in this scenario, there does not exist a constant competitive ratio $c$. This is because the scheduler does not know the information of future arrivals, i.e., it can only make causal decisions. In fact, even if the scheduler knows a limited amount of future information, it can still be shown that no constant competitive ratio holds, by increasing the value of $Q$.
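The Case I bound can be checked numerically. The sketch below uses hypothetical values of $H$, $L$, $Q$, $c_0$ and the flow-times read off Figs. 1(a)-(b):

```python
def ratio_case_one(H, L, Q):
    # Scheduler S: the job (Map workload 1, Reduce workload 2Q+1) arrives
    # in slot 1; its Map task runs in slot H+1 and its non-preemptive
    # Reduce task occupies slots H+L+1, ..., H+L+2Q+1.
    flow_S = H + L + 2 * Q + 1
    # Scheduler S1: Map in slot 1, Reduce in slots 2, ..., 2Q+2.
    flow_S1 = 2 * Q + 2
    return flow_S / flow_S1

# Hypothetical values: with c0 = 10 and Q = 5, Case I applies whenever
# H + L > 2(c0 - 1)(Q + 1) + 1 = 109; take H + L = 110.
c0, Q = 10, 5
assert ratio_case_one(110, 0, Q) > c0  # the ratio already exceeds c0
```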

We now introduce a slightly weaker notion of performance, called the efficiency ratio.

Definition 2. We say that the scheduling algorithm $S$ has an efficiency ratio $\gamma$ if the total flow-time $F_S(T, n, \{a_i, M_i, R_i; i = 1...n\})$ of scheduling algorithm $S$ satisfies the following:

$$\lim_{T \to \infty} \frac{F_S(T, n, \{a_i, M_i, R_i; i = 1...n\})}{F^*(T, n, \{a_i, M_i, R_i; i = 1...n\})} \le \gamma, \quad \text{with probability } 1. \tag{8}$$

Later, we will show that for the quick example, a constant efficiency ratio $\gamma$ can still exist (e.g., in the non-preemptive scenario with light-tailed distributed Reduce workload in Section V).

V. WORK-CONSERVING SCHEDULERS

In this section, we analyze the performance of work-conserving schedulers in both preemptive and non-preemptive scenarios.

A. Preemptive Scenario

We first study the case in which the workload of the Reduce tasks of each job is bounded by a constant, i.e., there exists a constant $R_{max}$ such that $R_i \le R_{max}, \ \forall i$.

Consider the total number of scheduled machines in each time slot. If all $N$ machines are scheduled in a time slot, we call this time slot a "developed" time slot; otherwise, we call it a "developing" time slot. We call a time slot a "no-arrival" time slot if there is no arrival in this time slot. The probability that a time slot is a no-arrival time slot depends only on the distribution of arrivals. Since the job arrivals in each time slot are i.i.d., the probability that a time slot is a no-arrival time slot is equal to $p_0$, with $p_0 < 1$. We define the $j$th "interval" to be the interval between the $(j-1)$th developing time slot and the $j$th developing time slot. (We define the first interval as the interval from the first time slot to the first developing time slot.) Thus, the last time slot of each interval is the only developing time slot in the interval. Let $K_j$ be the length of the $j$th interval, as shown in Fig. 2.
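The developed/developing classification is easy to visualize in a toy simulation. The sketch below is a deliberately simplified, single-stage illustration: it ignores the Map/Reduce precedence constraint and treats all arriving workload as one backlog served by $N$ machines under a work-conserving policy, with hypothetical arrival and workload distributions:

```python
import math
import random

def poisson(lam):
    """Sample a Poisson random variable (Knuth's method)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

random.seed(0)
N = 4           # machines
T = 10_000      # simulated time slots
lam = 1.0       # mean number of job arrivals per slot (assumed Poisson)
# Each job brings 1..5 workload units (mean 3), so rho = 1 * 3 / 4 < 1.

backlog = 0
interval_lengths, current = [], 0
for t in range(T):
    backlog += sum(random.randint(1, 5) for _ in range(poisson(lam)))
    served = min(N, backlog)   # work-conserving: serve as much as possible
    backlog -= served
    current += 1
    if served < N:             # a "developing" slot ends the current interval
        interval_lengths.append(current)
        current = 0

assert interval_lengths        # stable system: developing slots keep occurring
print(sum(interval_lengths) / len(interval_lengths))  # empirical mean of K_j
```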

First, we show that every moment of the interval length $K_j$ is bounded by a constant, as shown in Lemma 3. Then, we give expressions for the first and second moments of $K_j$ in Corollary 1. For an artificial system with dummy Reduce workload, similar results can be obtained, as shown in Corollary 2. Moreover, the interval lengths are i.i.d. in the artificial system, as shown in Remark 1. Based on all of the above results, we obtain Theorem 2, which characterizes the performance of any work-conserving scheduler.

Lemma 3. If the algorithm is work-conserving, then for any given number $H$, there exists a constant $B_H$ such that $E[K_j^H] < B_H$, $\forall j$.

Proof: Since $K_j$ is a positive random variable, $E[K_j^H] \le 1$ if $H \le 0$. Thus, we only focus on the case $H > 0$. In the $j$th interval, consider the $t$th time slot of this interval. (Assume $t = 1$ is the first time slot in this interval.) Assume that the total workload of the arrivals in the $t$th time slot of this interval is $W_{j,t}$. Also assume that the unavailable workload of the Reduce functions at the end of the $t$th time slot of this interval is $R_{j,t}$, i.e., $R_{j,t}$ is the workload of the Reduce functions whose corresponding Map functions are finished in the $t$th time slot of this interval. The remaining Reduce workload left over from the previous interval is $R_{j,0}$. Then, the distribution of $K_j$ can be written as follows:

$$\begin{aligned}
P(K_j = k) = P\Big(& W_{j,1} + R_{j,0} - R_{j,1} \ge N, \\
& W_{j,1} + W_{j,2} + R_{j,0} - R_{j,2} \ge 2N, \\
& \quad\vdots \\
& \sum_{t=1}^{k-1} W_{j,t} + R_{j,0} - R_{j,k-1} \ge (k-1)N, \\
& \sum_{t=1}^{k} W_{j,t} + R_{j,0} - R_{j,k} < kN\Big).
\end{aligned} \tag{9}$$

By the assumptions, $R_i \le R_{max}$ for every job $i$. Also, the unit of allocation is one machine; thus, in the last slot of the previous interval, the maximum number of Map functions that can finish is $N-1$. Therefore, $R_{j,0} \le (N-1)R_{max}$. Now let us consider the case $k > 1$:

$$\begin{aligned}
P(K_j = k) &\le P\Big(\sum_{t=1}^{k-1} W_{j,t} + R_{j,0} - R_{j,k-1} \ge (k-1)N\Big) \\
&\le P\Big(\sum_{t=1}^{k-1} W_{j,t} + (N-1)R_{max} \ge (k-1)N\Big) \\
&= P\left(\frac{\sum_{t=1}^{k-1} W_{j,t}}{k-1} \ge N - \frac{(N-1)R_{max}}{k-1}\right).
\end{aligned} \tag{10}$$

By the assumptions, the total (both Map and Reduce) arriving workload $W_{j,t}$ in each time slot is i.i.d., and its distribution does not depend on the interval index $j$ or the index $t$ of the time slot within the interval. Then, for any interval $j$, we have

$$P(K_j = k) \le P\left(\frac{\sum_{s=1}^{k-1} W_s}{k-1} \ge N - \frac{(N-1)R_{max}}{k-1}\right), \tag{11}$$

where $\{W_s\}$ is a sequence of i.i.d. random variables with the same distribution as $W_{j,t}$.

Since the traffic intensity satisfies $\rho < 1$, we have $E[W_s] = \lambda(M+R) < N$.

For any $\epsilon > 0$, there exists a $k_0$ such that for any $k > k_0$, $\frac{(N-1)R_{max}}{k-1} \le \epsilon$. Then, for any $k > k_0$, we have

$$P(K_j = k) \le P\left(\frac{\sum_{s=1}^{k-1} W_s}{k-1} \ge N - \epsilon\right). \tag{12}$$

We take $\epsilon < N - E[W_s]$ and the corresponding $k_0 = 1 + \left\lceil \frac{(N-1)R_{max}}{\epsilon} \right\rceil$. For any $k > k_0$, by Chernoff's theorem in large deviations theory [9], we obtain

$$P\left(\frac{\sum_{s=1}^{k-1} W_s}{k-1} \ge N - \epsilon\right) \le e^{-k\, l(N-\epsilon)}, \tag{13}$$

where the rate function $l(a)$ is given by

$$l(a) = \sup_{\theta \ge 0} \left(\theta a - \log E[e^{\theta W_s}]\right). \tag{14}$$

Since the moment generating function of the workload in a time slot is finite in some neighborhood of 0, for any $0 < \epsilon < N - \lambda(M+R)$ we know that $l(N-\epsilon) > 0$. Thus, we obtain

$$\begin{aligned}
E[K_j^H] &= \sum_{k=0}^{\infty} k^H P(K_j = k) \\
&= \sum_{k=0}^{k_0} k^H P(K_j = k) + \sum_{k=k_0+1}^{\infty} k^H P(K_j = k) \\
&\le k_0^H + \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)} \\
&= \left(1 + \left\lceil \frac{(N-1)R_{max}}{\epsilon} \right\rceil\right)^H + \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)}.
\end{aligned} \tag{15}$$

Since $l(N-\epsilon) > 0$ in this scenario, $\sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)}$ is bounded for a given $H$.

Thus, for a given $H$, $E[K_j^H]$ is bounded by a constant $B_H$ for any $j$, where $B_H$ is given by

$$B_H \triangleq \min_{\epsilon \in (0,\, N-\lambda(M+R))} \left[\left(1 + \left\lceil \frac{(N-1)R_{max}}{\epsilon} \right\rceil\right)^H + \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)}\right]. \tag{16}$$
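For a concrete workload distribution, the rate function in Eq. (14) can be evaluated numerically. The sketch below is an illustration only: it assumes the total workload $W_s$ per slot is Poisson with mean 3 (not a distribution used in the paper), so that $\log E[e^{\theta W_s}] = \lambda(e^{\theta}-1)$ and the supremum has the known closed form $l(a) = a\log(a/\lambda) - a + \lambda$:

```python
import math

lam = 3.0  # E[W_s]; assumed Poisson, so log E[e^{theta W}] = lam*(e^theta - 1)

def rate_function(a, theta_max=10.0, steps=100_000):
    """l(a) = sup_{theta >= 0} (theta*a - log E[e^{theta W_s}]),
    evaluated by grid search over theta in [0, theta_max]."""
    best = 0.0
    for i in range(steps + 1):
        theta = theta_max * i / steps
        best = max(best, theta * a - lam * (math.exp(theta) - 1.0))
    return best

a = 4.0  # a = N - epsilon must exceed E[W_s] for l(a) > 0
closed_form = a * math.log(a / lam) - a + lam  # optimum at theta = log(a/lam)
assert abs(rate_function(a) - closed_form) < 1e-4
assert rate_function(a) > 0  # ensures the tail sum in Eq. (15) converges
```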

Corollary 1. $E[K_j]$ is bounded by a constant $B_1$, and $E[K_j^2]$ is bounded by a constant $B_2$, for any $j$. The expressions for $B_1$ and $B_2$ are given below, where the rate function $l(a)$ is defined in Eq. (14):

$$B_1 = \min_{\epsilon \in (0,\, N-\lambda(M+R))} \left[1 + \left\lceil \frac{(N-1)R_{max}}{\epsilon} \right\rceil + \frac{2e^{l(N-\epsilon)} - 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^2}\right], \tag{17}$$

$$B_2 = \min_{\epsilon \in (0,\, N-\lambda(M+R))} \left[\left(1 + \left\lceil \frac{(N-1)R_{max}}{\epsilon} \right\rceil\right)^2 + \frac{4e^{2l(N-\epsilon)} - 3e^{l(N-\epsilon)} + 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^3}\right]. \tag{18}$$

Proof: By substituting $H = 1$ and $H = 2$ in Eq. (16) and evaluating the resulting series in closed form, we directly obtain Eqs. (17) and (18).
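The series terms in Eqs. (17) and (18) are the closed forms of the tail sums from Eq. (15) with $H = 1, 2$: writing $x = e^{-l(N-\epsilon)}$, one has $\sum_{k\ge 2} k x^k = \frac{2e^l - 1}{e^l(e^l-1)^2}$ and $\sum_{k\ge 2} k^2 x^k = \frac{4e^{2l} - 3e^l + 1}{e^l(e^l-1)^3}$. A quick numeric check (illustrative value of $l$):

```python
import math

l = 0.5  # any positive rate-function value; illustrative
x = math.exp(-l)

# Truncated numeric sums (x < 1, so the tails are negligible at 2000 terms).
s1 = sum(k * x**k for k in range(2, 2000))
s2 = sum(k * k * x**k for k in range(2, 2000))

e = math.exp(l)
closed1 = (2 * e - 1) / (e * (e - 1) ** 2)              # series term in Eq. (17)
closed2 = (4 * e * e - 3 * e + 1) / (e * (e - 1) ** 3)  # series term in Eq. (18)

assert abs(s1 - closed1) < 1e-9
assert abs(s2 - closed2) < 1e-9
```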

We now add some dummy Reduce workload in the first time slot of each interval, such that the remaining workload from the previous interval is equal to $(N-1)R_{max}$. The added workload is only scheduled in the last time slot of each interval of the original system. Then, in the new artificial system, the total flow-time of the real jobs does not change. However, the total number of intervals may decrease, because the added workload may merge some adjacent intervals.

Corollary 2. In the artificial system, the lengths of the new intervals $\{\bar{K}_j, j = 1, 2, ...\}$ also satisfy the results of Lemma 3 and Corollary 1.

Proof: The proof is similar to the proofs of Lemma 3 and Corollary 1.

Remark 1. In the artificial system constructed in Corollary 2, the length of each interval has no effect on the lengths of the other intervals; it depends only on the job arrivals and the workloads of the jobs in this interval. Based on our assumptions, the job arrivals in each time slot are i.i.d., and the workloads of the jobs are also i.i.d. Thus, the random variables $\{\bar{K}_j, j = 1, 2, ...\}$ are i.i.d.

Theorem 2. In the preemptive scenario, any work-conserving scheduler has a constant efficiency ratio

$$\frac{B_2 + B_1^2}{\max\left\{2,\ \frac{1-p_0}{\lambda},\ \frac{\rho}{\lambda}\right\} \max\left\{1,\ \frac{1}{N(1-\rho)}\right\}},$$

where $p_0$ is the probability that no job arrives in a time slot, and $B_1$ and $B_2$ are given in Eqs. (19) and (20):

$$B_1 = \min_{\epsilon \in (0,\, N-\lambda(M+R))} \left[1 + \left\lceil \frac{(N-1)R_{max}}{\epsilon} \right\rceil + \frac{2e^{l(N-\epsilon)} - 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^2}\right], \tag{19}$$

$$B_2 = \min_{\epsilon \in (0,\, N-\lambda(M+R))} \left[\left(1 + \left\lceil \frac{(N-1)R_{max}}{\epsilon} \right\rceil\right)^2 + \frac{4e^{2l(N-\epsilon)} - 3e^{l(N-\epsilon)} + 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^3}\right], \tag{20}$$

where the rate function $l(a)$ is defined as

$$l(a) = \sup_{\theta \ge 0} \left(\theta a - \log E[e^{\theta W_s}]\right). \tag{21}$$

Proof: Consider the artificial system constructed in Corollary 2. Since there is more (dummy) Reduce workload in the artificial system, the total flow-time $\bar{F}$ of the artificial system is greater than or equal to the total flow-time $F$ of the work-conserving scheduling policy. (The total flow-time is defined in Section III.)

Observe that, in each interval, all the jobs arriving in this interval must finish their Map functions within this interval, and all of their Reduce functions finish before the end of the next interval. In other words, for all the arrivals in the $j$th interval, the Map functions are finished within $K_j$ time slots, and the Reduce functions are finished within $K_j + K_{j+1}$ time slots, as shown in Fig. 2. (The vertical axis represents the total number of scheduled machines in each time slot.) This is also true for the artificial system with dummy Reduce workload, i.e., all the arrivals in the $j$th interval of the artificial system are finished within $\bar{K}_j + \bar{K}_{j+1}$ time slots.

Fig. 2. An example schedule of a work-conserving scheduler.

Let $A_j$ be the number of arrivals in the $j$th interval. Since the algorithm is work-conserving, all the available Map functions that arrived before the interval must have finished before this interval, and all the Reduce functions generated by the arrivals in this interval must finish before the end of the last time slot of the next interval. So the total flow-time $F_j$ of the work-conserving algorithm for all the arrivals in the $j$th interval is less than $A_j(K_j + K_{j+1})$. Similarly, the total flow-time $\bar{F}_j$ for all the arrivals in the $j$th interval of the artificial system is less than $\bar{A}_j(\bar{K}_j + \bar{K}_{j+1})$, where $\bar{A}_j$ is the number of arrivals in the $j$th interval of the artificial system.

$T \triangleq \sum_{j=1}^{m} K_j$ is equal to the finishing time of all the $n$ jobs. For any scheduling policy, the total flow time is at least the number of time slots $T$ in which the number of scheduled machines is nonzero. Thus, for the optimal scheduling method, its total flow time $F^*$ is not less than $T$.

For any scheduling policy, each job needs at least 1 time slot to schedule its Map function and 1 time slot to schedule its Reduce function. Thus, the lower bound on each job's flow time is 2, and for the optimal scheduling method, its total flow time $F_j^*$ in the $j$th interval is not less than $2A_j$.

Let $F$ be the total flow time of any work-conserving algorithm, and $F^*$ be the total flow time of the optimal scheduling policy. Assume that there are a total of $m$ intervals for the $n$ arrivals. Correspondingly, there are a total of $\bar m$ intervals in the artificial system. Also, based on the description of the artificial system, $\bar m \ge m$ and $\bar F = \sum_{j=1}^{\bar m} \bar F_j$.

Since $E[M_i + R_i] > 0$ and $\rho < 1$, we have $\lim_{n\to\infty} m = \infty$ and $\lim_{n\to\infty} \bar m = \infty$.


$$\lim_{n\to\infty} \frac{F}{F^*} \le \lim_{n\to\infty} \frac{\bar F}{F^*} = \lim_{n\to\infty} \frac{\sum_{j=1}^{\bar m} \bar F_j}{\sum_{j=1}^{m} F_j^*} \le \lim_{\bar m\to\infty} \frac{\sum_{j=1}^{\bar m} \bar A_j(\bar K_j + \bar K_{j+1})}{T} \le \frac{\displaystyle\lim_{\bar m\to\infty} \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_j}{\bar m} + \lim_{\bar m\to\infty} \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_{j+1}}{\bar m}}{\displaystyle\lim_{\bar m\to\infty} \frac{T}{\bar m}}. \quad (22)$$

In the artificial system, $\bar A_j$ is the number of arrivals in the $j$th interval. Since $\{\bar K_j, j = 1, 2, \ldots\}$ are i.i.d. random variables, $\{\bar A_j \bar K_j, j = 1, 2, \ldots\}$ are also i.i.d. Based on the Strong Law of Large Numbers (SLLN), we obtain

$$\lim_{\bar m\to\infty} \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_j}{\bar m} \stackrel{w.p.1}{=} E\left[\bar A_j \bar K_j\right]. \quad (23)$$

The terms of $\{\bar A_j \bar K_{j+1}, j = 1, 2, \ldots\}$ are identically distributed but not independent. However, all the odd terms are mutually independent, and all the even terms are mutually independent. Based on the SLLN applied to each subsequence, we obtain

$$\begin{aligned}
\lim_{\bar m\to\infty} \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_{j+1}}{\bar m}
&= \lim_{\bar m\to\infty} \frac{\sum_{\{j\le \bar m,\, j \text{ odd}\}} \bar A_j \bar K_{j+1} + \sum_{\{j\le \bar m,\, j \text{ even}\}} \bar A_j \bar K_{j+1}}{\bar m} \\
&= \lim_{\bar m\to\infty} \frac{\sum_{\{j\le \bar m,\, j \text{ odd}\}} \bar A_j \bar K_{j+1}}{\lceil \bar m/2 \rceil} \cdot \frac{\lceil \bar m/2 \rceil}{\bar m}
 + \lim_{\bar m\to\infty} \frac{\sum_{\{j\le \bar m,\, j \text{ even}\}} \bar A_j \bar K_{j+1}}{\lfloor \bar m/2 \rfloor} \cdot \frac{\lfloor \bar m/2 \rfloor}{\bar m} \\
&\stackrel{w.p.1}{=} \tfrac{1}{2} E\left[\bar A_j \bar K_{j+1}\right] + \tfrac{1}{2} E\left[\bar A_j \bar K_{j+1}\right] = E\left[\bar A_j \bar K_{j+1}\right]. 
\end{aligned} \quad (24)$$
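The odd/even argument above can be checked numerically. The following Monte Carlo sketch (illustrative parameters only, not from the paper) builds an identically distributed but 1-dependent sequence $A_j K_{j+1}$ and verifies that its time average still converges to $E[A_j]E[K_{j+1}]$:

```python
import math
import random

random.seed(0)
lam = 2.0       # hypothetical arrival rate per time slot
m = 200_000     # number of intervals simulated

def poisson(mu):
    """Sample a Poisson(mu) variate (Knuth's multiplication method)."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# K_j: i.i.d. interval lengths, uniform on {1, 2, 3}, so E[K] = 2
K = [random.randint(1, 3) for _ in range(m + 1)]
# A_j | K_j ~ Poisson(lam * K_j): arrivals during an interval of length K_j
A = [poisson(lam * K[j]) for j in range(m)]

# {A_j K_{j+1}} is identically distributed but only 1-dependent; the
# odd/even split of Eq. (24) still gives convergence of the time average
# to E[A_j K_{j+1}] = E[A_j] E[K_{j+1}] = lam * E[K]^2 = 8
avg = sum(A[j] * K[j + 1] for j in range(m)) / m
print(avg)
```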

Based on Eqs. (23) and (24), we obtain

$$\begin{aligned}
\lim_{\bar m\to\infty} \left(\frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_j}{\bar m} + \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_{j+1}}{\bar m}\right)
&\stackrel{w.p.1}{=} E\left[\bar A_j \bar K_j\right] + E\left[\bar A_j \bar K_{j+1}\right] \\
&= E\left[\bar A_j \bar K_j\right] + E\left[\bar A_j\right] E\left[\bar K_{j+1}\right] \\
&= E\left[E\left[\bar A_j \bar K_j \mid \bar K_j\right]\right] + E\left[E\left[\bar A_j \mid \bar K_j\right]\right] E\left[\bar K_{j+1}\right] \\
&= E\left[\bar K_j E\left[\bar A_j \mid \bar K_j\right]\right] + E\left[E\left[\bar A_j \mid \bar K_j\right]\right] E\left[\bar K_{j+1}\right] \\
&= E\left[\bar K_j (\lambda \bar K_j)\right] + E\left[\lambda \bar K_j\right] E\left[\bar K_{j+1}\right] \\
&= \lambda\left(E\left[\bar K_j^2\right] + E\left[\bar K_j\right]^2\right).
\end{aligned} \quad (25)$$

For these $n$ jobs, there are $m$ developing time slots before the time $T$, and in each developing time slot at most $N - 1$ machines are assigned. Then, we have

$$N \sum_{j=1}^{m} (K_j - 1) + (N-1)m \ge \sum_{s \in \{\text{Arrivals before } T\}} W_s. \quad (26)$$

Thus,

$$\left(N - \frac{\sum_{s \in \{\text{Arrivals before } T\}} W_s}{T}\right) \frac{\sum_{j=1}^{m} K_j}{m} \ge 1. \quad (27)$$

Since the workloads $W_s$ of the jobs are i.i.d., by the SLLN we have

$$\left(N - \lambda(\bar M + \bar R)\right) \lim_{m\to\infty} \frac{\sum_{j=1}^{m} K_j}{m} \ge 1 \quad w.p.1, \quad (28)$$

i.e.,

$$\lim_{m\to\infty} \frac{\sum_{j=1}^{m} K_j}{m} \ge \frac{1}{N(1-\rho)} \quad w.p.1. \quad (29)$$

Thus, since each $K_j \ge 1$,

$$\lim_{m\to\infty} \frac{T}{m} = \lim_{m\to\infty} \frac{\sum_{j=1}^{m} K_j}{m} \ge \max\left\{1, \frac{1}{N(1-\rho)}\right\} \quad w.p.1. \quad (30)$$

For any work-conserving scheduler, $\bar T - T$ is less than the total number of no-arrival time slots, where $\bar T \triangleq \sum_{j=1}^{\bar m} \bar K_j$ is the finishing time in the artificial system, so we have

$$\lim_{\bar m\to\infty} \frac{T}{\bar m} \ge (1 - p_0) \lim_{\bar m\to\infty} \frac{\bar T}{\bar m} \quad w.p.1. \quad (31)$$

At the same time, in each time slot in which the number of scheduled machines is greater than 0, at most $N$ machines are scheduled. Then, we get

$$NT \ge \sum_{i=1}^{n} (M_i + R_i). \quad (32)$$


Then, we have

$$\lim_{\bar m\to\infty} \frac{T}{\bar m} \ge \frac{\lambda(\bar M + \bar R)}{N} \lim_{\bar m\to\infty} \frac{\bar T}{\bar m} = \rho \lim_{\bar m\to\infty} \frac{\bar T}{\bar m} \quad w.p.1. \quad (33)$$

Thus,

$$\lim_{n\to\infty} \frac{F}{F^*} \le \frac{\lambda\left(E\left[\bar K_j^2\right] + E\left[\bar K_j\right]^2\right)}{\max\{\rho,\, 1 - p_0\} \max\left\{1, \frac{1}{N(1-\rho)}\right\}} \le \frac{\lambda\left(B_2 + B_1^2\right)}{\max\{\rho,\, 1 - p_0\} \max\left\{1, \frac{1}{N(1-\rho)}\right\}} \quad w.p.1. \quad (34)$$

Similarly,

$$\lim_{n\to\infty} \frac{F}{F^*} = \lim_{n\to\infty} \frac{\sum_{j=1}^{\bar m} \bar F_j}{\sum_{j=1}^{m} F_j^*} \le \frac{\displaystyle\lim_{\bar m\to\infty} \sum_{j=1}^{\bar m} \bar A_j(\bar K_j + \bar K_{j+1})}{\displaystyle\lim_{m\to\infty} \sum_{j=1}^{m} 2A_j} \le \frac{\displaystyle\lim_{\bar m\to\infty} \frac{1}{\bar m}\sum_{j=1}^{\bar m} \bar A_j(\bar K_j + \bar K_{j+1})}{\displaystyle\lim_{\bar m\to\infty} \frac{1}{\bar m}\sum_{j=1}^{m} 2A_j}. \quad (35)$$

By the Strong Law of Large Numbers (SLLN), we obtain

$$\lim_{m\to\infty} \frac{\sum_{j=1}^{m} A_j}{m} = \lim_{m\to\infty} \left(\frac{\sum_{j=1}^{m} A_j}{\sum_{j=1}^{m} K_j} \cdot \frac{\sum_{j=1}^{m} K_j}{m}\right) \ge \lambda \max\left\{1, \frac{1}{N(1-\rho)}\right\} \quad w.p.1. \quad (36)$$

Thus,

$$\lim_{n\to\infty} \frac{F}{F^*} \le \frac{\lambda\left(B_2 + B_1^2\right)}{2\lambda \max\left\{1, \frac{1}{N(1-\rho)}\right\}} \quad w.p.1. \quad (37)$$

Thus, based on Eqs. (34) and (37), the efficiency ratio of any work-conserving scheduling policy is not greater than
$$\frac{B_2 + B_1^2}{\max\left\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\right\} \max\left\{1, \frac{1}{N(1-\rho)}\right\}}$$
with probability 1, when $n$ goes to infinity.

B. Non-preemptive Scenario

The definitions of $B_1$, $B_2$, $p_0$, $F$, $\bar F$, $F^*$, $F_j$, $\bar F_j$, $K_j$, $\bar K_j$, $A_j$, $\bar A_j$, $W_{j,t}$, and $R_{j,t}$ in this section are the same as in Section V-A.

Lemma 4. Lemma 3, Corollary 1, Corollary 2 and Remark 4 in Section V-A are also valid in the non-preemptive Reduce scenario.

Proof: Unlike the migratable Reduce scenario in Section V-A, the composition of $R_{j,t}$ is different, because it contains two parts. First, it contains the unavailable Reduce functions that have just been released by the Map functions finishing in the current time slot, the last time slot of each interval. Second, it also contains the workload of Reduce functions that cannot be processed in this time slot, even if there are idle machines, because the Reduce functions cannot migrate among machines.

However, we can still show that the remaining workload of Reduce functions from the previous interval satisfies $R_{j,0} \le (N-1)R_{\max}$, because the total number of remaining unfinished Reduce functions and newly available Reduce functions in this time slot is less than $N$. All the following steps are the same.

Theorem 3. If the Reduce functions are not allowed to migrate among different machines, then for any work-conserving scheduling policy, the efficiency ratio is not greater than
$$\frac{B_2 + B_1^2 + B_1(R_{\max} - 1)}{\max\left\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\right\} \max\left\{1, \frac{1}{N(1-\rho)}\right\}}$$
with probability 1, when $n$ goes to infinity. $B_1$ and $B_2$ can be obtained from Corollary 1, and $p_0$ is the probability that a time slot is a no-arrival time slot.

Proof: In the non-preemptive Reduce scenario, in each interval, all the Map functions of the jobs arriving in this interval are also finished in this interval, and all their Reduce functions will be started before the end of the next interval. In other words, for all the arrivals in the $j$th interval, the Map functions are finished in $K_j$ time slots, and the Reduce functions are finished in $K_j + K_{j+1} + R_{\max} - 1$ time slots, as shown in Fig. 2. (The vertical axis represents the total number of scheduled machines in each time slot.) This also holds for the artificial system with dummy Reduce workload, i.e., all the arrivals in the $j$th interval of the artificial system can be finished in $\bar K_j + \bar K_{j+1} + R_{\max} - 1$ time slots.

We again let $A_j$ be the number of arrivals in the $j$th interval. Since the algorithm is work-conserving, all the available Map functions that arrived before the interval are already finished before this interval, and all the Reduce functions generated by the arrivals in this interval can be finished before the end of the time slot next to this interval plus $R_{\max} - 1$. So the total flow time $F_j$ of the work-conserving algorithm in the $j$th interval is less than $A_j(K_j + K_{j+1} + R_{\max} - 1)$. Similarly, the total flow time $\bar F_j$ in the $j$th interval of the artificial system is less than $\bar A_j(\bar K_j + \bar K_{j+1} + R_{\max} - 1)$, where $\bar A_j$ is the number of arrivals in the $j$th interval of the artificial system.

Compared to Section V-A, the flow time of each arrival has one more term $R_{\max} - 1$ because of the non-preemption of Reduce functions in this section. Thus, we can use the same proof technique as in Theorem 2 to obtain

$$\begin{aligned}
\lim_{n\to\infty} \frac{F}{F^*} &\le \lim_{\bar m\to\infty} \frac{\sum_{j=1}^{\bar m} \bar A_j(\bar K_j + \bar K_{j+1}) + \sum_{i=1}^{n} (R_i - 1)}{\sum_{j=1}^{m} (K_j - S_j)} \\
&= \frac{\displaystyle\lim_{\bar m\to\infty} \frac{1}{\bar m}\sum_{j=1}^{\bar m} \bar A_j(\bar K_j + \bar K_{j+1}) + \lim_{\bar m\to\infty} \frac{n}{\bar m} \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} (R_i - 1)}{\displaystyle\lim_{m\to\infty} \frac{1}{m}\sum_{j=1}^{m} (K_j - S_j)} \\
&\le \frac{\lambda\left[B_2 + B_1^2 + B_1(\bar R - 1)\right]}{\max\{\rho,\, 1 - p_0\} \max\left\{1, \frac{1}{N(1-\rho)}\right\}} \quad w.p.1,
\end{aligned} \quad (38)$$

and

$$\lim_{n\to\infty} \frac{F}{F^*} \le \frac{\lambda\left[B_2 + B_1^2 + B_1(\bar R - 1)\right]}{2\lambda \max\left\{1, \frac{1}{N(1-\rho)}\right\}} \quad w.p.1. \quad (39)$$

Thus, based on Eqs. (38) and (39), any work-conserving scheduler has an efficiency ratio of at most
$$\frac{B_2 + B_1^2 + B_1(\bar R - 1)}{\max\left\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\right\} \max\left\{1, \frac{1}{N(1-\rho)}\right\}}.$$

Theorem 4. In the non-preemptive scenario, any work-conserving scheduler has a constant efficiency ratio
$$\frac{B_2 + B_1^2 + B_1(\bar R - 1)}{\max\left\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\right\} \max\left\{1, \frac{1}{N(1-\rho)}\right\}},$$
where $p_0$ is the probability that no job arrives in a time slot, and $B_1$ and $B_2$ are given in Eqs. (19) and (20).

Proof: In this scenario, for all the arrivals in the $j$th interval, the Map functions are finished in $K_j$ time slots, and the Reduce functions are finished in $K_j + K_{j+1} + \bar R - 1$ time slots, as shown in Fig. 3. Since the number of arrivals in each interval is independent of the workload of the Reduce functions, differently from Eqs. (38) and (39), we get

$$\lim_{n\to\infty} \frac{F}{F^*} \le \lim_{\bar m\to\infty} \frac{\sum_{j=1}^{\bar m} \bar A_j(\bar K_j + \bar K_{j+1} + \bar R - 1)}{\sum_{j=1}^{m} (K_j - S_j)} \le \frac{\lambda\left[B_2 + B_1^2 + B_1(\bar R - 1)\right]}{\max\{\rho,\, 1 - p_0\} \max\left\{1, \frac{1}{N(1-\rho)}\right\}} \quad w.p.1, \quad (40)$$

and

$$\lim_{n\to\infty} \frac{F}{F^*} \le \frac{\lambda\left[B_2 + B_1^2 + B_1(\bar R - 1)\right]}{2\lambda \max\left\{1, \frac{1}{N(1-\rho)}\right\}} \quad w.p.1. \quad (41)$$

Thus, the efficiency ratio of any work-conserving scheduling policy is not greater than
$$\frac{B_2 + B_1^2 + B_1(\bar R - 1)}{\max\left\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\right\} \max\left\{1, \frac{1}{N(1-\rho)}\right\}}$$
with probability 1, when $n$ goes to infinity. Similarly, if the workload of the Reduce functions is light-tailed, then the efficiency ratio also holds if we use $B_1'$ and $B_2'$ instead of $B_1$ and $B_2$.

Remark 2. In Theorems 2 and 4, we can relax the assumption of boundedness for each Reduce job and allow the Reduce workloads to follow a light-tailed distribution, i.e., a distribution on $R_i$ such that there exists $r_0$ with $P(R_i \ge r) \le \alpha \exp(-\beta r)$, $\forall r \ge r_0$, $\forall i$, where $\alpha, \beta > 0$ are two constants. We obtain results similar to Theorems 2 and 4 with different expressions.

Lemma 5. If the scheduling algorithm is work-conserving, then for any given number $H$, there exists a constant $B_H$ such that $E[K_j^H] < B_H$, $\forall j$.

Proof: Similarly to Eqs. (9) and (10), we obtain

$$P(K_j = k) \le P\left(\sum_{t=1}^{k-1} W_{j,t} + R_{j,0} \ge (k-1)N\right). \quad (42)$$

Fig. 3. The job $i$, which arrives in the $j$th interval, finishes its Map tasks in $K_j$ time slots and its Reduce tasks in $K_j + K_{j+1} + R_i - 1$ time slots.

Let $R'$ be the maximal remaining workload of jobs in the previous interval; then

$$\begin{aligned}
P(K_j = k) &\le P\left(\sum_{t=1}^{k-1} W_{j,t} + (N-1)R' \ge (k-1)N\right) \\
&= \sum_{r=0}^{\infty} \left[P\left(\sum_{t=1}^{k-1} W_{j,t} + (N-1)R' \ge (k-1)N \,\Big|\, R' = r\right) P(R' = r)\right] \\
&= \sum_{r=0}^{\infty} \left[P\left(\frac{\sum_{t=1}^{k-1} W_{j,t}}{k-1} \ge N - \frac{(N-1)r}{k-1}\right) P(R' = r)\right].
\end{aligned} \quad (43)$$

Thus, we obtain

$$\begin{aligned}
E[K_j^H] &= \sum_{k=0}^{\infty} k^H P(K_j = k) \\
&\le \sum_{k=0}^{\infty} \left\{k^H \sum_{r=0}^{\infty} \left[P\left(\frac{\sum_{t=1}^{k-1} W_{j,t}}{k-1} \ge N - \frac{(N-1)r}{k-1}\right) P(R' = r)\right]\right\} \\
&= \sum_{r=0}^{\infty} \left\{P(R' = r) \sum_{k=0}^{\infty} \left[k^H P\left(\frac{\sum_{t=1}^{k-1} W_{j,t}}{k-1} \ge N - \frac{(N-1)r}{k-1}\right)\right]\right\}.
\end{aligned}$$

The next steps are similar to the proof of Lemma 3, so we skip the details and directly show the following result. For any $0 < \epsilon < N - \lambda(\bar M + \bar R)$, we know that $l(N-\epsilon) > 0$. Thus, we obtain

$$\begin{aligned}
E[K_j^H] &\le \sum_{r=0}^{\infty} \left\{P(R' = r)\left[\left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^H + \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)}\right]\right\} \\
&= \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)} + \sum_{r=0}^{\infty} \left\{P(R' = r)\left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^H\right\} \\
&\le \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)} + \sum_{r=0}^{\infty} \left\{P(R' \ge r)\left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^H\right\} \\
&\le \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)} + \sum_{r=0}^{\infty} \left\{P(R \ge r)\left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^H\right\} \\
&\le \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)} + \sum_{r=0}^{r_0-1} \left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^H + \sum_{r=r_0}^{\infty} \alpha e^{-\beta r}\left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^H.
\end{aligned}$$

Since $l(N-\epsilon) > 0$ and $\beta > 0$ in this scenario, the sums $\sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)}$, $\sum_{r=r_0}^{\infty} \alpha e^{-\beta r}\left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^H$, and $\sum_{r=0}^{r_0-1} \left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^H$ are bounded for any given $H$.

Thus, given $H$, $E[K_j^H]$ is bounded by a constant $B_H$ for any $j$, where $B_H$ is given below:

$$B_H \triangleq \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \left\{\sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)} + \sum_{r=0}^{r_0-1} \left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^H + \sum_{r=r_0}^{\infty} \alpha e^{-\beta r}\left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^H\right\}. \quad (44)$$

Corollary 3. $E[K_j]$ is bounded by a constant $B_1'$ and $E[K_j^2]$ is bounded by a constant $B_2'$, for any $j$. The expressions for $B_1'$ and $B_2'$ are shown below, where the rate function $l(a)$ is defined in Eq. (21).

$$B_1' = \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \left\{2r_0 + \frac{(N-1)r_0(r_0-1)}{2\epsilon} + \frac{2\alpha}{e^{\beta} - 1} + \frac{\alpha(N-1)}{4\epsilon}\,\mathrm{csch}^2\!\left(\frac{\beta}{2}\right) + 2\alpha + \frac{2e^{l(N-\epsilon)} - 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^2}\right\}, \quad (45)$$

$$\begin{aligned}
B_2' = \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \Bigg\{ &4r_0 + \frac{2(N-1)r_0(r_0-1)}{\epsilon} + 4\alpha + \frac{(N-1)^2(r_0-1)r_0(2r_0-1)}{6\epsilon^2} \\
&+ \frac{\alpha(N-1)^2}{8\epsilon^2}\sinh(\beta)\,\mathrm{csch}^4\!\left(\frac{\beta}{2}\right) + \frac{4\alpha}{e^{\beta} - 1} + \frac{\alpha(N-1)}{\epsilon}\,\mathrm{csch}^2\!\left(\frac{\beta}{2}\right) \\
&+ \frac{4e^{2l(N-\epsilon)} - 3e^{l(N-\epsilon)} + 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^3}\Bigg\}. \quad (46)
\end{aligned}$$

Proof: By substituting $H = 1$ and $H = 2$ into Eq. (44), we directly obtain Eqs. (47) and (48):

$$B_1 = \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \left\{\sum_{r=0}^{r_0-1} \left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right) + \sum_{r=r_0}^{\infty} \alpha e^{-\beta r}\left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right) + \frac{2e^{l(N-\epsilon)} - 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^2}\right\}, \quad (47)$$

$$B_2 = \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \left\{\sum_{r=0}^{r_0-1} \left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^2 + \sum_{r=r_0}^{\infty} \alpha e^{-\beta r}\left(1 + \left\lceil \frac{(N-1)r}{\epsilon}\right\rceil\right)^2 + \frac{4e^{2l(N-\epsilon)} - 3e^{l(N-\epsilon)} + 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^3}\right\}. \quad (48)$$

To get a simple expression without summations, we relax the right-hand side of Eq. (47) as below:

$$\begin{aligned}
B_1 &< \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \left\{\sum_{r=0}^{r_0-1} \left(2 + \frac{(N-1)r}{\epsilon}\right) + \sum_{r=0}^{\infty} \alpha e^{-\beta r}\left(2 + \frac{(N-1)r}{\epsilon}\right) + \frac{2e^{l(N-\epsilon)} - 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^2}\right\} \\
&= \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \left\{2r_0 + \frac{(N-1)r_0(r_0-1)}{2\epsilon} + \frac{2\alpha}{e^{\beta} - 1} + \frac{\alpha(N-1)}{4\epsilon}\,\mathrm{csch}^2\!\left(\frac{\beta}{2}\right) + 2\alpha + \frac{2e^{l(N-\epsilon)} - 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^2}\right\} \triangleq B_1'.
\end{aligned} \quad (49)$$

Similarly,

$$\begin{aligned}
B_2 < \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \Bigg\{ &4r_0 + \frac{2(N-1)r_0(r_0-1)}{\epsilon} + 4\alpha + \frac{(N-1)^2(r_0-1)r_0(2r_0-1)}{6\epsilon^2} \\
&+ \frac{\alpha(N-1)^2}{8\epsilon^2}\sinh(\beta)\,\mathrm{csch}^4\!\left(\frac{\beta}{2}\right) + \frac{4\alpha}{e^{\beta} - 1} + \frac{\alpha(N-1)}{\epsilon}\,\mathrm{csch}^2\!\left(\frac{\beta}{2}\right) \\
&+ \frac{4e^{2l(N-\epsilon)} - 3e^{l(N-\epsilon)} + 1}{e^{l(N-\epsilon)}\left(e^{l(N-\epsilon)} - 1\right)^3}\Bigg\} \triangleq B_2'.
\end{aligned} \quad (50)$$
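The closed forms in Eqs. (45), (46), (49) and (50) follow from standard geometric-series identities such as $\sum_{r\ge 0} e^{-\beta r} = 1 + \frac{1}{e^{\beta}-1}$ and $\sum_{r\ge 0} r e^{-\beta r} = \frac{1}{4}\mathrm{csch}^2(\beta/2)$. A quick numerical check of the $r_0 = 0$ piece used in Eq. (49), with arbitrary test values for $\alpha$, $\beta$, $N$ and $\epsilon$:

```python
import math

alpha, beta, N, eps = 1.5, 0.7, 100, 3.0   # arbitrary test values

# Left-hand side: truncate the series far enough for the tail to vanish
lhs = sum(alpha * math.exp(-beta * r) * (2 + (N - 1) * r / eps)
          for r in range(2000))

# Right-hand side: the closed form appearing inside Eq. (49) (r0 = 0 case)
csch = lambda x: 1.0 / math.sinh(x)
rhs = (2 * alpha + 2 * alpha / (math.exp(beta) - 1)
       + alpha * (N - 1) / (4 * eps) * csch(beta / 2) ** 2)

print(lhs, rhs)
```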

Remark 3. Following the same proof steps, we can obtain Corollary 2, Remark 1, Theorem 2 and Remark 4 as in Section V-A. The difference is that we need to use $B_1'$ and $B_2'$ instead of $B_1$ and $B_2$ in this part.

Remark 4. Although any work-conserving scheduler has a constant efficiency ratio, the constant may be large (because the result holds for "any" work-conserving scheduler). We further discuss algorithms that tighten the constant efficiency ratio in Section VI.

VI. AVAILABLE-SHORTEST-REMAINING-PROCESSING-TIME (ASRPT) ALGORITHM AND ANALYSIS

In the previous sections we have shown that any work-conserving algorithm has a constant efficiency ratio, but the constant can be large, as it is related to the size of the jobs. In this section, we design an algorithm with a much tighter bound that does not depend on the size of the jobs. Although the tight bound is proved for the preemptive case, we also show via numerical results that our algorithm works well in the non-preemptive case.

Before presenting our solution, we first describe a known algorithm called SRPT (Shortest Remaining Processing Time) [10]. SRPT assumes that Map and Reduce tasks from the same job can be scheduled simultaneously in the same slot. In each slot, SRPT picks the jobs with the minimum total remaining workload, i.e., including Map and Reduce tasks, to schedule. Observe that the SRPT scheduler may produce an infeasible solution, as it ignores the dependency between Map and Reduce tasks.
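For concreteness, the relaxed SRPT rule described above can be sketched as a slot-by-slot simulation (an illustrative sketch, not the paper's implementation): in each slot, the $N$ machine-slots are given to the jobs with the least total remaining workload, ignoring the Map-Reduce dependency.

```python
def srpt_total_flow_time(arrivals, N):
    """arrivals: list of (arrival_slot, total_workload) pairs, one per job.
    Each slot assigns N units of service to the jobs with the least
    remaining workload (SRPT with the Map/Reduce dependency relaxed).
    Returns the total flow time, counting every slot from arrival to
    finish inclusive, as in f_i^(r) - a_i + 1."""
    remaining = {}                                   # job id -> remaining work
    arrival = {i: a for i, (a, _) in enumerate(arrivals)}
    flow, t = 0, 0
    while t < max(a for a, _ in arrivals) + 1 or remaining:
        for i, (a, w) in enumerate(arrivals):        # admit this slot's jobs
            if a == t:
                remaining[i] = w
        capacity = N
        for i in sorted(remaining, key=remaining.get):   # smallest first
            served = min(capacity, remaining[i])
            remaining[i] -= served
            capacity -= served
        for i in [j for j, w in remaining.items() if w == 0]:
            flow += t - arrival[i] + 1               # job i finishes in slot t
            del remaining[i]
        t += 1
    return flow

# Two jobs arriving at slot 0 with workloads 2 and 4, N = 2 machines:
# job 0 finishes in slot 0 (flow 1), job 1 in slot 2 (flow 3), total 4.
print(srpt_total_flow_time([(0, 2), (0, 4)], N=2))
```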

Lemma 6. Without the requirement that Reduce tasks can be processed only after the corresponding Map tasks are finished, the total flow-time $F_S$ of the Shortest-Remaining-Processing-Time (SRPT) algorithm is a lower bound on the total flow-time in the MapReduce framework.

Proof: Without the requirement that Reduce tasks can be processed only after the corresponding Map tasks are finished, the optimization problem in the preemptive scenario becomes:

$$\begin{aligned}
\min_{m_{i,t},\, r_{i,t}} \quad & \sum_{i=1}^{n} \left(f_i^{(r)} - a_i + 1\right) \\
\text{s.t.} \quad & \sum_{i=1}^{n} (m_{i,t} + r_{i,t}) \le N, \quad r_{i,t} \ge 0, \quad m_{i,t} \ge 0, \quad \forall t, \\
& \sum_{t=a_i}^{f_i^{(r)}} (m_{i,t} + r_{i,t}) = M_i + R_i, \quad \forall i \in \{1, \ldots, n\}.
\end{aligned} \quad (51)$$

The reader can easily check that

$$\bigcup_{i=1}^{n} \left\{\sum_{t=a_i}^{f_i^{(r)}} m_{i,t} + r_{i,t} = M_i + R_i\right\} \supseteq \bigcup_{i=1}^{n} \left\{\sum_{t=a_i}^{f_i^{(m)}} m_{i,t} = M_i\right\} \cup \left\{\sum_{t=f_i^{(m)}+1}^{f_i^{(r)}} r_{i,t} = R_i\right\}. \quad (52)$$

Thus, the constraints in Eq. (51) are weaker than the constraints in Eq. (1). Hence, the optimal value of Eq. (51) is no larger than the optimal value of Eq. (1). Since the SRPT algorithm achieves the optimal solution of Eq. (51) [10], its total flow-time $F_S$ is a lower bound for any scheduling method in the MapReduce framework. Since the optimization problem in the non-preemptive scenario has more constraints, the lower bound also holds in the non-preemptive scenario.

A. ASRPT Algorithm

Based on the SRPT scheduler, we present our greedy scheme called ASRPT. Our design rests on a simple intuition: by ensuring that the Map tasks under ASRPT finish as early as under SRPT, and by assigning Reduce tasks with the smallest amount of remaining workload first, we can hope to reduce the overall flow-time. However, care must be taken to ensure that if the Map tasks of a job are scheduled in a slot, then its Reduce tasks are scheduled after that slot.

ASRPT works as follows. It uses the schedule computed by SRPT to determine its own schedule; in other words, it runs SRPT in a virtual fashion and keeps track of how SRPT would have scheduled jobs in any given slot. At the beginning of each slot, the list of unfinished jobs $J$ and the scheduling list $S$ of the SRPT scheduler are updated. The jobs scheduled in the previous time slot are updated, and the newly arriving jobs are added to the list $J$, which is kept in non-decreasing order of available workload in this slot. In the list $J$, we also keep the number of schedulable tasks in this time slot. For a job that has unscheduled Map tasks, its available workload in this slot is the number of units of unfinished workload of both Map and Reduce tasks, while its schedulable tasks are the unfinished Map tasks. Otherwise, its available workload is the unfinished Reduce workload, while its schedulable tasks are the unfinished workload of the Reduce tasks (preemptive scenario) or the number of unfinished Reduce tasks (non-preemptive scenario), respectively. Then, the algorithm assigns machines to tasks in the following priority order (from high to low): the previously scheduled Reduce tasks that are not yet finished (only in the non-preemptive scenario), the Map tasks that are scheduled in the list $S$, the available Reduce tasks, and the available Map tasks. For each priority group, the algorithm greedily assigns machines to the corresponding tasks by traversing the sorted available-workload list $J$.

(a) SRPT (b) DSRPT (c) ASRPT

Fig. 4. The construction of the schedules for ASRPT and DSRPT based on the schedule for SRPT. Observe that the Map tasks are scheduled in the same way; but, roughly speaking, the Reduce tasks are delayed by one slot.

The pseudo-code of the algorithm is shown in Algorithm 1.
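The per-slot priority assignment of Algorithm 1 can be sketched compactly for the preemptive case as follows (an illustrative simplification with hypothetical data structures; the virtual SRPT Map loads are taken as a given input):

```python
def asrpt_slot(jobs, srpt_map_load, N):
    """One scheduling slot of ASRPT, preemptive case (simplified sketch).
    jobs: list of dicts with remaining 'map' and 'reduce' workloads and
          'available_reduce' (Reduce units whose Maps already finished).
    srpt_map_load: dict job index -> Map units the virtual SRPT schedule
          assigns this slot (the list S of Algorithm 1).
    Returns dict job index -> (map_units, reduce_units) assigned."""
    # sort by available workload, as in the sorted list J of Algorithm 1
    order = sorted(range(len(jobs)),
                   key=lambda i: jobs[i]['map'] + jobs[i]['reduce'])
    assign = {i: [0, 0] for i in range(len(jobs))}
    d = N  # number of idle machines
    # priority 1: Map tasks scheduled by the virtual SRPT run
    for i in order:
        units = min(srpt_map_load.get(i, 0), jobs[i]['map'], d)
        assign[i][0] += units; d -= units
    # priority 2: available Reduce tasks (their Maps already finished)
    for i in order:
        units = min(jobs[i]['available_reduce'], d)
        assign[i][1] += units; d -= units
    # priority 3: remaining available Map tasks
    for i in order:
        units = min(jobs[i]['map'] - assign[i][0], d)
        assign[i][0] += units; d -= units
    return {i: tuple(v) for i, v in assign.items()}

jobs = [{'map': 2, 'reduce': 3, 'available_reduce': 0},
        {'map': 0, 'reduce': 4, 'available_reduce': 4}]
print(asrpt_slot(jobs, {0: 2}, N=5))
```

The non-preemptive variant of Algorithm 1 would add a priority-0 pass that keeps machines already bound to unfinished Reduce tasks.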

B. Efficiency Ratio Analysis of the ASRPT Algorithm

We first prove a lower bound on its performance. The lower bound is based on SRPT. For the SRPT algorithm, we assume that in time slot $t$, the numbers of machines scheduled to Map and Reduce tasks are $M_t^S$ and $R_t^S$, respectively.

Now we construct a scheduling method called Delayed-Shortest-Remaining-Processing-Time (DSRPT) for the MapReduce framework based on the SRPT algorithm. We keep the scheduling method for the Map tasks exactly the same as in SRPT. The Reduce tasks are scheduled in the same order but from the next time slot compared to the SRPT schedule. An example showing the operation of SRPT, DSRPT and ASRPT is shown in Fig. 4.

Theorem 5. In the preemptive scenario, DSRPT has an efficiency ratio of 2.

Proof: We construct a queueing system to represent the DSRPT algorithm. In each time slot $t \ge 2$, there is an arrival with workload $R_{t-1}^S$, which is the total workload of the delayed Reduce tasks of time slot $t-1$. The service capacity for the Reduce tasks is $N - M_t^S$, and the service policy for the Reduce tasks is First-Come-First-Served (FCFS). The Reduce tasks that are scheduled in previous time slots by the SRPT algorithm are delayed in the DSRPT algorithm by up to the time needed to finish the remaining workload of the delayed tasks. Also, the remaining workload $W_t$ in the DSRPT algorithm is processed first, because the scheduling policy is FCFS. Let $D_t$ be the largest delay of the Reduce tasks that are scheduled in time slot $t-1$ in SRPT.

In the construction of the DSRPT algorithm, the remaining workload $W_t$ comes from the delayed tasks scheduled in previous time slots by the SRPT algorithm, and $M_{s \ge t}^S$ is the workload of Map tasks scheduled in future time slots by the SRPT algorithm. Hence, they are independent. We assume that $D_t'$ is the first time such that $W_t \le N - M_s$, where

Algorithm 1 Available-Shortest-Remaining-Processing-Time (ASRPT) Algorithm for the MapReduce Framework
Input: List of unfinished jobs (including new arrivals in this slot) $J$
Output: Scheduled machines for Map tasks, scheduled machines for Reduce tasks, updated remaining jobs $J$

1: Update the scheduling list $S$ of the SRPT algorithm without task dependency. For job $i$, $S(i).MapLoad$ is the corresponding scheduling of the Map tasks in SRPT.
2: Update the list of jobs with the new arrivals and the machine assignments of the previous slot, maintaining non-decreasing order of the available total workload $J(i).AvailableWorkload$. Also keep the number of schedulable tasks $J(i).SchedulableTasksNum$ for each job in the list.
3: $d \leftarrow N$;
4: if the scheduler is a non-preemptive scheduler then
5:     Keep the assignment of machines already scheduled to Reduce tasks previously and not yet finished;
6:     Update $d$, the number of idle machines;
7: end if
8: for $i = 1 \to |J|$ do
9:     if job $i$ has Map tasks scheduled in $S$ then
10:        if $J(i).SchedulableTasksNum \ge S(i).MapLoad$ then
11:            if $S(i).MapLoad \le d$ then
12:                Assign $S(i).MapLoad$ machines to the Map tasks of job $i$;
13:            else
14:                Assign $d$ machines to the Map tasks of job $i$;
15:            end if
16:        else
17:            if $J(i).SchedulableTasksNum \le d$ then
18:                Assign $J(i).SchedulableTasksNum$ machines to the Map tasks of job $i$;
19:            else
20:                Assign $d$ machines to the Map tasks of job $i$;
21:            end if
22:        end if
23:        Update the status of job $i$ in the list of jobs $J$;
24:        Update $d$, the number of idle machines;
25:    end if
26: end for
27: for $i = 1 \to |J|$ do
28:    if job $i$ still has unfinished Reduce tasks then
29:        if $J(i).SchedulableTasksNum \le d$ then
30:            Assign $J(i).SchedulableTasksNum$ machines to the Reduce tasks of job $i$;
31:        else
32:            Assign $d$ machines to the Reduce tasks of job $i$;
33:        end if
34:        Update $d$, the number of idle machines;
35:    end if
36:    if $d = 0$ then
37:        RETURN;
38:    end if
39: end for
40: for $i = 1 \to |J|$ do
41:    if job $i$ still has unfinished Map tasks then
42:        if $J(i).SchedulableTasksNum \le d$ then
43:            Assign $J(i).SchedulableTasksNum$ machines to the Map tasks of job $i$;
44:        else
45:            Assign $d$ machines to the Map tasks of job $i$;
46:        end if
47:        Update $d$, the number of idle machines;
48:    end if
49:    if $d = 0$ then
50:        RETURN;
51:    end if
52: end for

$s \ge t$. Then $D_t \le D_t'$.

If a time slot $t_0 - 1$ has no workload left over for the future (except $R_{t_0-1}$, which is delayed to the next time slot), we call $t_0$ the first time slot of the interval it belongs to. (Note that this definition of interval differs from Section V.) Assume that $P(W_{t_0} \le N - M_s) = P(R_{t_0-1} \le N - M_s) = p$, where $s \ge t_0$. Since $R_s \le N - M_s$, we have $p \ge P(R_{t_0-1} \le R_s) \ge 1/2$.

Thus, we get

$$E[D_{t_0}] \le E[D_{t_0}'] = \sum_{k=1}^{\infty} k\, p\, (1-p)^{k-1} = \frac{1}{p} \le 2. \quad (53)$$

Note that $N - M_s \ge R_s$ for all $s$, so the current remaining workload satisfies $W_t \le W_{t_0}$. Then, $E[D_t] \le E[D_{t_0}] \le 2$.
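The bound in Eq. (53) is just the mean of a geometric random variable with success probability $p \ge 1/2$; a quick Monte Carlo check (illustrative, at the worst case $p = 1/2$):

```python
import random

random.seed(1)
p = 0.5            # worst case allowed by p >= 1/2
trials = 200_000

def geometric(p):
    """Number of slots until the first success (support 1, 2, ...)."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

mean_delay = sum(geometric(p) for _ in range(trials)) / trials
print(mean_delay)   # should be close to 1/p = 2
```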

Then, for all the Reduce tasks, the expected delay compared to the SRPT algorithm is not greater than 2. Let $D$ have the same distribution as $D_t$. Thus, the total flow-time $F_D$ of the DSRPT algorithm satisfies

$$\lim_{n\to\infty} \frac{F_D}{F_S} \le \frac{\displaystyle\lim_{n\to\infty} \frac{F_S}{n} + E[D]}{\displaystyle\lim_{n\to\infty} \frac{F_S}{n}} = 1 + \frac{E[D]}{\displaystyle\lim_{n\to\infty} \frac{F_S}{n}} \le 1 + E[D] \le 3 \quad w.p.1. \quad (54)$$

For the flow time $F$ of any feasible scheduler in the MapReduce framework, we have

$$\lim_{n\to\infty} \frac{F_D}{F} \le \frac{\displaystyle\lim_{n\to\infty} \frac{F_S}{n} + E[D]}{\displaystyle\lim_{n\to\infty} \frac{F}{n}} \le 1 + \frac{E[D]}{\displaystyle\lim_{n\to\infty} \frac{F}{n}} \le 1 + \frac{E[D]}{2} \le 2 \quad w.p.1. \quad (55)$$

Corollary 4. From the proof of Theorem 5, the total flow-time of DSRPT is not greater than 3 times the lower bound given by Lemma 6 with probability 1, when $n$ goes to infinity.

Corollary 5. In the preemptive scenario, the ASRPT scheduler has an efficiency ratio of 2.

Proof: In each time slot, all the Map tasks finished by DSRPT are also finished by ASRPT. For the Reduce tasks, jobs with smaller remaining workload finish earlier than jobs with larger remaining workload. Hence, based on the optimality of SRPT, ASRPT can be viewed as an improvement of DSRPT. Thus, the total flow-time of ASRPT is not greater than that of DSRPT, and the efficiency ratio of DSRPT also holds for ASRPT.

Note that the total flow-time of ASRPT is not greater than 3 times the lower bound given by Lemma 6 with probability 1, when $n$ goes to infinity. Also, the performance of ASRPT is analyzed in the preemptive scenario in Corollary 5. We will show via simulations in the next section that ASRPT performs better than the other schedulers in both the preemptive and non-preemptive scenarios.

VII. SIMULATION RESULTS

A. Simulation Setting

We evaluate the efficacy of our algorithm ASRPT in both the preemptive and non-preemptive scenarios. We consider a data center with $N = 100$ machines, and the job arrival process is Poisson with rate $\lambda = 2$ jobs per time slot. We choose the uniform distribution and the exponential distribution as examples of bounded and light-tailed workloads, respectively. For short, we use $Exp(\mu)$ to denote an exponential distribution with mean $\mu$, and $U[a, b]$ to denote a uniform distribution on $\{a, a+1, \ldots, b-1, b\}$. We run for a total of $T = 800$ time slots, and the number of tasks in each job is up to 10.

We compare 3 typical schedulers to the ASRPT scheduler:

The FIFO scheduler: the default scheduler in Hadoop. All jobs are scheduled in their order of arrival.

The Fair scheduler: a widely used scheduler in Hadoop. Machines are assigned to all waiting jobs in a fair manner. However, if some jobs need fewer machines than others in a time slot, the remaining machines are assigned to the other jobs, to avoid resource wastage and to keep the scheduler work-conserving.

The LRPT scheduler: jobs with larger unfinished workload are always scheduled first. Roughly speaking, the performance of this scheduler indicates how poorly even some work-conserving schedulers can perform.

B. Efficiency Ratio

In the simulations, the efficiency ratio of a scheduler is computed as the total flow-time of the scheduler divided by the lower bound on the total flow-time over $T$ time slots. Thus, the true efficiency ratio is smaller than the efficiency ratio reported in the simulations; however, the relative ordering of the schedulers remains the same.

First, we evaluate exponentially distributed workloads. We choose the Map workload distribution of each job as $Exp(5)$ and the Reduce workload distribution as $Exp(40)$. The efficiency ratios of the different schedulers are shown in Fig. 5. For a different workload, we choose the Map workload distribution as $Exp(30)$ and the Reduce workload distribution as $Exp(15)$. The efficiency ratios of the schedulers are shown in Fig. 6.

Then, we evaluate uniformly distributed workloads. We choose the Map workload distribution of each job as $U[1, 9]$ and the Reduce workload distribution as $U[10, 70]$. The efficiency ratios of the different schedulers are shown in Fig. 7. To evaluate a smaller Reduce workload, we choose the Map workload distribution as $U[10, 50]$ and the Reduce workload distribution as $U[10, 20]$. The efficiency ratios of the different schedulers are shown in Fig. 8.

As an example, we show the convergence of the efficiency ratios in Figs. 9-12 for the 4 different scenarios.

From Figures 5-8, we can see that the total flow-time of ASRPT is much smaller than that of all the other schedulers. Also, as a "bad" work-conserving scheduler, LRPT still has a constant (though possibly relatively large) efficiency ratio, as seen from Figs. 9-12.

C. Cumulative Distribution Function (CDF)

For the same settings and parameters, the CDFs of the flow-times are shown in Figs. 13-16. We plot the CDF only for flow-times up to 100 units. From these figures, we can see that the ASRPT scheduler has a very light tail in the CDF of flow-time,


[Bar charts of efficiency ratio for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 5. Efficiency Ratio (Exponential Distribution, Large Reduce)

[Bar charts of efficiency ratio for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 6. Efficiency Ratio (Exponential Distribution, Small Reduce)

compared to the FIFO and Fair schedulers. In other words, the fairness of the ASRPT scheduler is similar to that of the FIFO and Fair schedulers. However, the LRPT scheduler has a long tail in the CDF of flow time; in other words, the fairness of the LRPT scheduler is not as good as that of the other schedulers.


[Bar charts of efficiency ratio for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 7. Efficiency Ratio (Uniform Distribution, Large Reduce)

[Bar charts of efficiency ratio for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 8. Efficiency Ratio (Uniform Distribution, Small Reduce)

VIII. CONCLUSION

In this paper, we study the problem of minimizing the total flow-time of a sequence of jobs in the MapReduce framework, where the jobs arrive over time and need to be processed through Map and Reduce procedures before leaving the system. We show that no on-line algorithm can achieve a constant competitive ratio. We define a weaker metric of performance called the efficiency ratio and propose a corresponding technique to analyze on-line schedulers. Under some weak assumptions, we then show a surprising property: for the flow-time problem, any work-conserving scheduler has a constant efficiency ratio


[Plots of efficiency ratio vs. total time $T$ (up to 800) for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 9. Convergence of Efficiency Ratios (Exponential Distribution, Large Reduce)

[Plots of efficiency ratio vs. total time $T$ (up to 800) for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 10. Convergence of Efficiency Ratios (Exponential Distribution, Small Reduce)

[Plots of efficiency ratio vs. total time $T$ (up to 800) for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 11. Convergence of Efficiency Ratios (Uniform Distribution, Large Reduce)


[Plots of efficiency ratio vs. total time $T$ (up to 800) for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 12. Convergence of Efficiency Ratios (Uniform Distribution, Small Reduce)

[Plots of the CDF of flow time (up to 100 units) for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 13. CDF of Schedulers (Exponential Distribution, Large Reduce)

[Plots of the CDF of flow time (up to 100 units) for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 14. CDF of Schedulers (Exponential Distribution, Small Reduce)


[Plots of the CDF of flow time (up to 100 units) for ASRPT, Fair, FIFO and LRPT: (a) Preemptive Scenario; (b) Non-Preemptive Scenario]

Fig. 15. CDF of Schedulers (Uniform Distribution, Large Reduce)

[Figure omitted: CDF of flow time for the ASRPT, Fair, FIFO, and LRPT schedulers; (a) Preemptive Scenario, (b) Non-Preemptive Scenario.]

Fig. 16. CDF of Schedulers (Uniform Distribution, Small Reduce)

in both preemptive and non-preemptive scenarios. More importantly, we have developed an online scheduler, ASRPT, with a very small efficiency ratio. The simulation results show that the efficiency ratio of ASRPT is much smaller than that of the other schedulers, while the fairness of ASRPT is as good as that of the other schedulers.
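The empirical efficiency ratios plotted in Figs. 9–12 can be computed directly from per-job flow times. As a minimal sketch (the function name and sample values below are illustrative, not from the paper), the ratio is the total flow time accumulated by a scheduler divided by a lower bound on the optimal total flow time, such as the flow time of the idealized SRPT schedule used as a benchmark:

```python
def efficiency_ratio(flow_times, bound_flow_times):
    """Ratio of a scheduler's total flow time to a lower bound on the
    optimal total flow time over the same job sequence (>= 1 when the
    bound is valid)."""
    return sum(flow_times) / sum(bound_flow_times)

# Illustrative per-job flow times for a scheduler vs. a lower bound.
scheduler_flows = [4.0, 6.5, 5.2, 7.1]
bound_flows = [3.5, 5.0, 4.8, 6.0]

ratio = efficiency_ratio(scheduler_flows, bound_flows)
assert ratio >= 1.0
```

As the total simulation time T grows, this sample ratio converges to the almost-sure efficiency ratio, which is the convergence behavior shown in the figures above.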

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI), pp. 137–150, December 2004.

[2] F. Chen, M. Kodialam, and T. Lakshman, "Joint scheduling of processing and shuffle phases in MapReduce systems," in Proceedings of IEEE INFOCOM, March 2012.

[3] H. Chang, M. Kodialam, R. R. Kompella, T. V. Lakshman, M. Lee, and S. Mukherjee, "Scheduling in MapReduce-like systems for fast completion time," in Proceedings of IEEE INFOCOM, pp. 3074–3082, March 2011.

[4] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job scheduling for multi-user MapReduce clusters," tech. rep., University of California, Berkeley, April 2009.

[5] J. Tan, X. Meng, and L. Zhang, "Performance analysis of coupling scheduler for MapReduce/Hadoop," in Proceedings of IEEE INFOCOM, March 2012.

[6] B. Moseley, A. Dasgupta, R. Kumar, and T. Sarlos, "On scheduling in map-reduce and flow-shops," in Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 289–298, June 2011.


[7] "Hadoop," http://hadoop.apache.org/.

[8] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys), pp. 265–278, April 2010.

[9] A. Shwartz and A. Weiss, Large Deviations for Performance Analysis: Queues, Communication and Computing. New York, NY, USA: Chapman and Hall, 1995.

[10] K. R. Baker and D. Trietsch, Principles of Sequencing and Scheduling. Hoboken, NJ, USA: John Wiley & Sons, 2009.

ACKNOWLEDGEMENTS

We gratefully acknowledge Dr. Zizhan Zheng for his helpful suggestions.