A New Analytical Technique for Designing Provably Efficient MapReduce Schedulers
Yousi Zheng∗, Ness B. Shroff∗†, Prasun Sinha†
∗Department of Electrical and Computer Engineering
†Department of Computer Science and Engineering
The Ohio State University, Columbus, OH 43210
Abstract
With the rapid increase in size and number of jobs that are being processed in the MapReduce framework, efficiently scheduling jobs under this framework is becoming increasingly important. We consider the problem of minimizing the total flow-time of a sequence of jobs in the MapReduce framework, where the jobs arrive over time and need to be processed through both Map and Reduce procedures before leaving the system. We show that for this problem, no on-line algorithm can achieve a constant competitive ratio (defined as the ratio of the completion time of the online algorithm to the completion time of the optimal non-causal off-line algorithm). We then construct a slightly weaker metric of performance called the efficiency ratio. An online algorithm is said to achieve an efficiency ratio of γ when the flow-time incurred by that scheduler divided by the minimum flow-time achieved over all possible schedulers is almost surely less than or equal to γ. Under some weak assumptions, we then show a surprising property that for the flow-time problem any work-conserving scheduler has a constant efficiency ratio in both preemptive and non-preemptive scenarios. More importantly, we are able to develop an online scheduler with a very small efficiency ratio (2), and through simulations we show that it outperforms the state-of-the-art schedulers.
I. INTRODUCTION
MapReduce is a framework designed to process massive amounts of data in a cluster of machines [1]. Although it was first
proposed by Google [1], today, many other companies including Microsoft, Yahoo, and Facebook also use this framework.
Currently this framework is widely used for applications such as search indexing, distributed searching, web statistics generation,
and data mining.
MapReduce has two elemental processes: Map and Reduce. For the Map tasks, the inputs are divided into several small sets, and processed by different machines in parallel. The output of the Map tasks is a set of pairs in <key, value> format. The Reduce tasks then operate on this intermediate data, possibly running the operation on multiple machines in parallel to generate the final result.
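As a concrete illustration of the two phases, here is the classic word-count example (our own minimal sketch, not taken from this paper), run sequentially although both loops are parallelizable:

```python
from collections import defaultdict

def map_task(chunk):
    # Each Map task processes one small input split independently
    # and emits intermediate <key, value> pairs.
    return [(word, 1) for word in chunk.split()]

def reduce_task(key, values):
    # Each Reduce task aggregates all intermediate values sharing one key.
    return (key, sum(values))

splits = ["a rose is", "a rose"]          # inputs divided into small sets
intermediate = defaultdict(list)
for split in splits:                      # Map phase (parallel in practice)
    for key, value in map_task(split):
        intermediate[key].append(value)
result = dict(reduce_task(k, v) for k, v in intermediate.items())
print(result)   # {'a': 2, 'rose': 2, 'is': 1}
```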
Each arriving job consists of a set of Map tasks and Reduce tasks. The scheduler is centralized and is responsible for making decisions on which task will be executed at what time and on which machine. The key metric considered in this paper is the total delay in the system per job, which includes both the processing delay and the delay in waiting before the first task in the job begins to get processed.
A critical consideration for the design of the scheduler is the dependence between the Map and Reduce tasks. For each job, the Map tasks need to be finished before starting any of its Reduce tasks¹ [1], [3]. Various scheduling solutions have been proposed within the MapReduce framework [2]–[6], but analytical bounds on performance have been derived only in some of these works [2], [3], [6]. However, rather than focusing directly on the flow-time, for deriving performance bounds, [2], [3] have considered a slightly different problem of minimizing total completion time, and [6] has assumed speed-up of the machines. Further discussion of these schedulers is given in Section II.
In this paper, we directly analyze the performance of total delay (flow-time) in the system. In an attempt to minimize this, we introduce a new metric to analyze the performance of schedulers, called the efficiency ratio. Based on this new metric, we analyze and design several schedulers, which can guarantee the performance of the MapReduce framework.
The contributions of this paper are as follows:
• For the problem of minimizing the total delay (flow-time) in the MapReduce framework, we show that no on-line algorithm can achieve a constant competitive ratio. (Sections III and IV)
• To directly analyze the total delay in the system, we propose a new metric to measure the performance of schedulers, which we call the efficiency ratio. (Section IV)
• Under some weak assumptions, we then show a surprising property that for the flow-time problem any work-conserving scheduler has a constant efficiency ratio in both preemptive and non-preemptive scenarios (precise definitions provided in Section III). (Section V)
• We present an online scheduling algorithm called ASRPT (Available Shortest Remaining Processing Time) with a very
small (less than 2) efficiency ratio (Section VI), and show that it outperforms state-of-the-art schedulers through simulations
(Section VII).
II. RELATED WORK
In Hadoop [7], the most widely used implementation, the default scheduling method is First In First Out (FIFO). FIFO suffers from the well-known head-of-line blocking problem, which is mitigated in [4] by using the Fair scheduler.
In the case of the Fair scheduler [4], one problem is that jobs stick to the machines on which they are initially scheduled, which could result in significant performance degradation. The solution to this problem given by [4] is delayed scheduling. However, the Fair scheduler could cause a starvation problem (see [4], [5]). In [5], the authors propose a Coupling scheduler to mitigate this problem, and analyze its performance.
In [3], the authors assume that all Reduce tasks are non-preemptive. They design a scheduler in order to minimize the
weighted sum of the job completion times by determining the ordering of the tasks on each processor. The authors show that
this problem is NP-hard even in the offline case, and propose approximation algorithms that work within a factor of 3 of the
optimal. However, as the authors point out in the article, they ignore the dependency between Map and Reduce tasks, which
is a critical property of the MapReduce framework. Based on the work of [3], the authors in [2] add a precedence graph to
¹Here, we consider the most popular case in reality, without the Shuffle phase. For a discussion of the Shuffle phase, see Section II and [2].
describe the precedence between Map and Reduce tasks, and consider the effect of the Shuffle phase between the Map and Reduce tasks. They break the structure of the MapReduce framework from the job level to the task level using the Shuffle phase, and seek to minimize the total completion time of tasks instead of jobs. In both [3] and [2], the schedulers use an LP-based lower bound which needs to be recomputed frequently. However, the practical costs corresponding to the delay (or storage) of jobs are directly related to the total flow-time, not the completion time. Although the optimal solution is the same for these two optimization problems, the efficiency ratio obtained from minimizing the total flow-time will be much looser than the efficiency ratio obtained from minimizing the total completion time. For example, if the flow time of each job is increasing with its arrival time, then the total flow time (delay) does not have a constant efficiency ratio, but the completion time may have a constant efficiency ratio. That means that even if the total completion time has a constant efficiency ratio, it cannot guarantee the performance of the real delay.
In [6], the authors study the problem of minimizing the total flow-time of all jobs. They propose an $O(1/\epsilon^5)$-competitive algorithm with $(1+\epsilon)$ speed for the online case, where $0 < \epsilon \le 1$. However, speed-up of machines is necessary in the algorithm; otherwise, there is no guarantee on the competitive ratio (if $\epsilon$ decreases to 0, the competitive ratio increases to infinity correspondingly).
III. SYSTEM MODEL
Consider a data center withN machines. We assume that each machine can only process one job at a time. A machine
could represent a processor, a core in a multi-core processor or a virtual machine. Assume that there aren jobs arriving into
the system. We assume the scheduler periodically collects the information on the state of jobs running on the machines, which
is used to make scheduling decisions. Such time slot structure can efficiently reduce the performance degeneration caused by
data locality (see [4] [8]). We assume that the distributionof job arrivals in each time slot is i.i.d., and the arrival rate is λ.
Each jobi bringsMi units of workload for its Map tasks andRi units of workload for its Reduce tasks. Each Map task has 1
unit of workload2, however, each Reduce task can have multiple units of workload. Time is slotted and each machine can run
one unit of workload in each time slot. Assume that{Mi} are i.i.d. with expectationM , and{Ri} are i.i.d. with expectation
R. We assume that the traffic intensityρ < 1, i.e.,λ < N
M+R. Assume the moment generating function of workload of arriving
jobs in a time slot has finite value in some neighborhood of 0. In time slott for job i, mi,t andri,t machines are scheduled for
Map and Reduce tasks, respectively. As we know, each job contains several tasks. We assume that jobi containsKi tasks, and
the workload of the Reduce taskk of job i is R(k)i . Thus, for any jobi,
Ki∑k=1
R(k)i = Ri. In time slott for job i, r
(k)i,t machines
are scheduled for the Reduce taskk. As each Reduce task may consist of multiple units of workload, it can be processed in
either preemptive or non-preemptive fashion based on the type of scheduler.
Definition 1. A scheduler is called preemptive if Reduce tasks belonging to the same job can run in parallel on multiple machines, can be interrupted by any other task, and can be rescheduled to different machines in different time slots.
²Because the Map tasks are independent and have small workload [5], such an assumption is valid.
A scheduler is called non-preemptive if each Reduce task can only be scheduled on one machine and, once started, it must keep running without any interruption.
In any time slot $t$, the number of assigned machines must be less than or equal to the total number of machines $N$, i.e.,
$$\sum_{i=1}^{n} (m_{i,t} + r_{i,t}) \le N, \quad \forall t.$$
Let the arrival time of job $i$ be $a_i$, the time slot in which the last Map task finishes execution be $f_i^{(m)}$, and the time slot in which all Reduce tasks are completed be $f_i^{(r)}$.
For any job $i$, the workload of its Map tasks should be processed by the assigned machines between time slots $a_i$ and $f_i^{(m)}$, i.e.,
$$\sum_{t=a_i}^{f_i^{(m)}} m_{i,t} = M_i, \quad \forall i \in \{1, ..., n\}.$$
For any job $i$, if $t < a_i$ or $t > f_i^{(m)}$, then $m_{i,t} = 0$. Similarly, for any job $i$, the workload of its Reduce tasks should be processed by the assigned machines between time slots $f_i^{(m)} + 1$ and $f_i^{(r)}$, i.e.,
$$\sum_{t=f_i^{(m)}+1}^{f_i^{(r)}} r_{i,t} = R_i, \quad \forall i \in \{1, ..., n\}.$$
Since any Reduce task of job $i$ cannot start before the finishing time slot of the Map tasks, if $t < f_i^{(m)} + 1$ or $t > f_i^{(r)}$, then $r_{i,t} = 0$.
The waiting and processing time $S_{i,t}$ of job $i$ in time slot $t$ is represented by the indicator function $\mathbf{1}_{\{a_i \le t \le f_i^{(r)}\}}$. Thus, we define the delay cost as $\sum_{t=1}^{\infty} \sum_{i=1}^{n} S_{i,t}$, which is equal to $\sum_{i=1}^{n} \big( f_i^{(r)} - a_i + 1 \big)$. The flow-time $F_i$ of job $i$ is equal to $f_i^{(r)} - a_i + 1$. The objective of the scheduler is to determine the assignment of jobs in each time slot, such that the cost of delaying the jobs, or equivalently the flow-time, is minimized.
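To make the objective concrete, the following sketch (ours, not the paper's) computes the total flow-time $\sum_i F_i$ from the per-job arrival slots and Reduce allocations, using the convention that slots are 1-indexed:

```python
def flow_time(arrivals, r_alloc):
    """Total flow-time sum_i (f_i^(r) - a_i + 1).

    arrivals[i] = a_i; r_alloc[i] maps a slot t to the number of machines
    running Reduce work of job i in that slot.
    """
    total = 0
    for i, a_i in enumerate(arrivals):
        f_r = max(r_alloc[i])      # last slot with Reduce work: f_i^(r)
        total += f_r - a_i + 1     # flow-time F_i
    return total

# Hypothetical single-machine schedule: both jobs arrive in slot 1;
# job 0 runs Map in slot 1 and Reduce in slot 2 (F_0 = 2),
# job 1 runs Map in slot 3 and Reduce in slot 4 (F_1 = 4).
print(flow_time([1, 1], [{2: 1}, {4: 1}]))   # 6
```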
For the preemptive scenario, the problem definition is as follows:
$$\begin{aligned}
\min_{m_{i,t},\, r_{i,t}} \quad & \sum_{i=1}^{n} \big( f_i^{(r)} - a_i + 1 \big) \\
\text{s.t.} \quad & \sum_{i=1}^{n} (m_{i,t} + r_{i,t}) \le N, \; r_{i,t} \ge 0, \; m_{i,t} \ge 0, \; \forall t, \\
& \sum_{t=a_i}^{f_i^{(m)}} m_{i,t} = M_i, \quad \sum_{t=f_i^{(m)}+1}^{f_i^{(r)}} r_{i,t} = R_i, \quad \forall i \in \{1, ..., n\}.
\end{aligned} \tag{1}$$
In the non-preemptive scenario, the Reduce tasks cannot be interrupted by other jobs. Once a Reduce task begins execution on a machine, it has to keep running on that machine without interruption until all its workload is finished. The optimization problem in this scenario is similar to Eq. (1), with additional constraints representing the non-preemptive nature, as shown below:
$$\begin{aligned}
\min_{m_{i,t},\, r_{i,t}^{(k)}} \quad & \sum_{i=1}^{n} \big( f_i^{(r)} - a_i + 1 \big) \\
\text{s.t.} \quad & \sum_{i=1}^{n} \Big( m_{i,t} + \sum_{k=1}^{K_i} r_{i,t}^{(k)} \Big) \le N, \; \forall t, \\
& \sum_{t=a_i}^{f_i^{(m)}} m_{i,t} = M_i, \; m_{i,t} \ge 0, \; \forall i \in \{1, ..., n\}, \\
& \sum_{t=f_i^{(m)}+1}^{f_i^{(r)}} r_{i,t}^{(k)} = R_i^{(k)}, \; \forall i \in \{1, ..., n\}, \; \forall k \in \{1, ..., K_i\}, \\
& r_{i,t}^{(k)} \in \{0, 1\}, \quad r_{i,t}^{(k)} = 1 \text{ if } 0 < \sum_{s=0}^{t-1} r_{i,s}^{(k)} < R_i^{(k)}.
\end{aligned} \tag{2}$$
We show that the offline scheduling problem is strongly NP-hard by reducing 3-PARTITION to the decision version of the scheduling problem.
Consider an instance of 3-PARTITION: Given a set $A$ of $3m$ elements $a_i \in \mathbb{Z}^+$, $i = 1, ..., 3m$, and a bound $N \in \mathbb{Z}^+$, such that $N/4 < a_i < N/2$, $\forall i$, and $\sum_i a_i = mN$. The problem is to decide if $A$ can be partitioned into $m$ disjoint sets $A_1, ..., A_m$ such that $\sum_{a_i \in A_k} a_i = N$, $\forall k$. Note that by the range of the $a_i$'s, every such $A_k$ must contain exactly 3 elements.
Given an instance of 3-PARTITION, we consider the following instance of the scheduling problem. There are $N$ machines in the system (without loss of generality, we assume $N \bmod 3 = 0$), and $3m$ jobs, $J_1, ..., J_{3m}$, where each job arrives at time 0, and $M_i = a_i$, $R_i = \frac{2}{3}N - a_i$. Furthermore, let $\omega = 3m^2 + 3m$. The problem is to decide if there is a feasible schedule with a flow time no more than $\omega$. This mapping can clearly be done in polynomial time.
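The mapping itself is mechanical. A sketch of the construction, run on a toy 3-PARTITION instance we chose so that every $a_i$ lies in $(N/4, N/2)$:

```python
from fractions import Fraction

def build_instance(A, N):
    """From a 3-PARTITION instance (A, N), build the scheduling instance:
    3m jobs arriving at time 0 with M_i = a_i, R_i = (2/3)N - a_i,
    and the flow-time threshold omega = 3m^2 + 3m."""
    m = len(A) // 3
    jobs = [(a, Fraction(2 * N, 3) - a) for a in A]   # (M_i, R_i); M_i + R_i = 2N/3
    omega = 3 * m * m + 3 * m
    return jobs, omega

# Toy instance: m = 2, N = 12, and every a_i in (N/4, N/2) = (3, 6).
jobs, omega = build_instance([4, 4, 4, 4, 4, 4], N=12)
assert all(M + R == Fraction(2 * 12, 3) for M, R in jobs)
print(omega)   # 3*m^2 + 3m = 18
```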
Let $S$ denote an arbitrary feasible schedule to the above problem, and $F(S)$ its flow-time. Let $n_k^S$ denote the number of jobs that finish in the $k$-th time slot in schedule $S$. We then have $F(S) = \sum_{k=1}^{\infty} k \times n_k^S$. We drop the superscript in the following when there is no confusion. It is clear that $n_1 = 0$. Let $N_k = \sum_{j=1}^{k} n_j$ denote the number of jobs finished by time $k$. The following lemma provides an upper bound on $N_k$.
Lemma 1. (1) $N_{2k} \le 3k$ and $N_{2k+1} \le 3k+1$, $\forall k$. (2) $\sum_{k=1}^{2m} N_k \le 3m^2$, and the upper bound is attained only if the following condition holds:
$$N_{2k} = 3k, \; \forall k \in \{1, ..., m\}, \qquad N_{2k+1} = 3k, \; \forall k \in \{1, ..., m-1\}. \tag{3}$$
Proof: Since $M_i + R_i = 2N/3$, at most $3k$ jobs can finish by time slot $2k$, hence $N_{2k} \le 3k$. Similarly, $N_{2k+1} \le \lfloor 3(2k+1)/2 \rfloor = 3k+1$. Now consider the second part. First, if Eq. (3) holds, we have $\sum_{k=1}^{2m} N_k = \sum_{k=1}^{m} 3k + \sum_{k=1}^{m-1} 3k = 3m^2$. Now assume $N_{2k+1} = 3k+1$ for some $k \in \{1, ..., m-1\}$. Then we must have $N_{2k} < 3k$. Otherwise, the first $2k$ time slots would be fully occupied by the first $3k$ jobs by the fact that $M_i + R_i = 2N/3$, and hence both the $M$-task and the $R$-task of the $(3k+1)$-th job would have to be done in the $(2k+1)$-th time slot, a contradiction. By a similar argument, we also have $N_{2k+2} < 3(k+1)$. Hence, if $N_{2k+1} = 3k+1$ for one or more $k \in \{1, ..., m-1\}$, we have $\sum_{k=1}^{2m} N_k < 3m^2$. Therefore, $3m^2$ is an upper bound of $\sum_{k=1}^{2m} N_k$, which is attained only if Eq. (3) holds.
Lemma 2. $F(S) \ge 3m^2 + 3m$, and the lower bound is attained only if (3) holds.
Proof: We note that $F(S) = \sum_{k=1}^{\infty} k \times n_k = \sum_{k=1}^{\infty} [(2m+1) - (2m+1-k)] \times n_k \ge (2m+1) \sum_{k=1}^{\infty} n_k - \sum_{k=1}^{2m} (2m+1-k) n_k = (2m+1) \cdot 3m - \sum_{k=1}^{2m} N_k \ge 6m^2 + 3m - 3m^2 = 3m^2 + 3m$, where the last inequality follows from Lemma 1. The lower bound is attained only if $n_k = 0$ for $k > 2m$ and $\sum_{k=1}^{2m} N_k = 3m^2$, which holds only when Eq. (3) holds.
Theorem 1. The scheduling problem (both preemptive and non-preemptive) is NP-complete in the strong sense.
Proof: The scheduling problem is clearly in NP. We reduce 3-PARTITION to the scheduling problem using the above mapping. If the answer to the partition problem is 'yes', then $A$ can be partitioned into disjoint sets $A_1, ..., A_m$, each containing 3 elements. Consider the schedule $S$ where, for $k = 1, ..., m$, the $M$-tasks of the three jobs corresponding to $A_k$ are scheduled in the $(2k-1)$-th time slot, and their $R$-tasks are scheduled in the $2k$-th time slot. The schedule is clearly feasible. Furthermore, (3) holds in this schedule. Therefore, $F(S) = 3m^2 + 3m$ by Lemma 2. On the other hand, if there is a schedule $S$ for the scheduling problem such that $F(S) = 3m^2 + 3m$, then (3) holds by Lemma 2. An induction on $k$ then shows that for each $k = 1, ..., m$, the $(2k-1)$-th time slot must be fully occupied by the $M$-tasks of three jobs, and the $2k$-th time slot must be fully occupied by the corresponding $R$-tasks, which corresponds to a feasible partition of $A$. Since 3-PARTITION is strongly NP-complete, so is the scheduling problem.
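As a sanity check on the 'yes' direction: if the three jobs of triple $A_k$ run their $M$-tasks in slot $2k-1$ and their $R$-tasks in slot $2k$, each of the three contributes flow-time $2k$, and the total telescopes to exactly $3m^2 + 3m$:

```python
# Verify sum_{k=1}^{m} 3 * (2k) = 3m^2 + 3m for a range of m.
for m in range(1, 50):
    total = sum(3 * 2 * k for k in range(1, m + 1))   # 3 jobs finish in slot 2k
    assert total == 3 * m * m + 3 * m
print("verified for m = 1..49")
```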
IV. EFFICIENCY RATIO
The competitive ratio is often used as a measure of performance in a wide variety of scheduling problems. For our problem, the scheduling algorithm $S$ has a competitive ratio of $c$ if, for any total time $T$, any number of arrivals $n$ in the time $T$, any arrival time $a_i$ of each job $i$, and any workloads $M_i$ and $R_i$ of the Map and Reduce tasks of each arriving job $i$, the total flow-time $F_S(T, n, \{a_i, M_i, R_i; i = 1...n\})$ of scheduling algorithm $S$ satisfies the following:
$$\frac{F_S(T, n, \{a_i, M_i, R_i; i = 1...n\})}{F^*(T, n, \{a_i, M_i, R_i; i = 1...n\})} \le c, \tag{4}$$
where
$$F^*(T, n, \{a_i, M_i, R_i; i = 1...n\}) = \min_{S} F_S(T, n, \{a_i, M_i, R_i; i = 1...n\}) \tag{5}$$
is the minimum flow time of an optimal off-line algorithm.
For our problem, we construct an arrival pattern such that no scheduler can achieve a constant competitive ratio $c$, as shown in the following example.
Consider the simplest example of the non-preemptive scenario, in which the data center only has one machine, i.e., $N = 1$. There is a job with 1 Map task and 1 Reduce task. The workloads of the Map and Reduce tasks are $1$ and $2Q + 1$, respectively. Without loss of generality, we assume that this job arrives in time slot 1. We can show that, for any given constant $c_0$ and any scheduling algorithm, there exists a special sequence of future arrivals and workloads such that the competitive ratio $c$ is greater than $c_0$.
Fig. 1. The quick example: (a) Case I, scheduler $S$; (b) Case I, scheduler $S_1$; (c) Case II, scheduler $S$; (d) Case II, scheduler $S_1$.
For any scheduler $S$, we assume that the Map task of the job is scheduled in time slot $H + 1$, and the Reduce task is scheduled in the $(H + L + 1)$-st time slot ($H, L \ge 0$). Then the Reduce task will last for $2Q + 1$ time slots, because the Reduce task cannot be interrupted. This scheduler's operation is given in Fig. 1(a). Observe that any arbitrary scheduler's operation can be represented by choosing appropriate values for $H$ and $L$. There are two possibilities: $H + L > 2(c_0 - 1)(Q + 1) + 1$ or $H + L \le 2(c_0 - 1)(Q + 1) + 1$.
Now let us construct an arrival pattern $A$ and a scheduler $S_1$. When $H + L > 2(c_0 - 1)(Q + 1) + 1$, the arrival pattern $A$ only has this unique job arrive in the system. When $H + L \le 2(c_0 - 1)(Q + 1) + 1$, $A$ has $Q$ additional jobs arrive in time slots $H + L + 2, H + L + 4, ..., H + L + 2Q$, and the Map and Reduce workload of each arrival is 1. The scheduler $S_1$ schedules the Map task in the first time slot, and schedules the Reduce task in the second time slot, when $H + L > 2(c_0 - 1)(Q + 1) + 1$. When $H + L \le 2(c_0 - 1)(Q + 1) + 1$, $S_1$ schedules the Map function of the first arrival in the same time slot as $S$, and schedules the last $Q$ arrivals before scheduling the Reduce task of the first arrival. The operation of the scheduler $S_1$ is depicted in Figs. 1(b) and 1(d).
Case I: $H + L > 2(c_0 - 1)(Q + 1) + 1$.
In this case, the schedulers $S$ and $S_1$ are shown in Figs. 1(a) and 1(b), respectively. The flow-time of $S$ is $H + L + 2Q + 1$, and the flow-time of $S_1$ is $2Q + 2$. So, the competitive ratio $c$ of $S$ must satisfy the following:
$$c \ge \frac{H + L + 2Q + 1}{2Q + 2} > c_0. \tag{6}$$
Case II: $H + L \le 2(c_0 - 1)(Q + 1) + 1$.
Then for the scheduler $S$, the total flow-time is greater than $(H + L + 2Q + 1) + (2Q + 1)Q$, no matter how the last $Q$ jobs are scheduled. In this case, the schedulers $S$ and $S_1$ are shown in Figs. 1(c) and 1(d), respectively.
Then, for the scheduler $S$, the competitive ratio $c$ satisfies the following:
$$\begin{aligned}
c &\ge \frac{(H + L + 2Q + 1) + (2Q + 1)Q}{2Q + (H + L + 2Q + 2Q + 1)} \\
&\ge \frac{(2Q + 1)(Q + 1)}{6Q + 1 + (2(c_0 - 1)(Q + 1) + 1)} \\
&> \frac{(2Q + 1)(Q + 1)}{2(c_0 + 2)(Q + 1)} > \frac{Q}{c_0 + 2}.
\end{aligned} \tag{7}$$
By selecting $Q > c_0^2 + 2c_0$ and using Eq. (7), we get $c > c_0$.
Thus, we have shown that for any constant $c_0$ and scheduler $S$, there are sequences of arrivals and workloads such that the competitive ratio $c$ is greater than $c_0$. In other words, in this scenario, there does not exist a constant competitive ratio $c$. This is because the scheduler does not know the information of future arrivals, i.e., it only makes causal decisions. In fact, even if the scheduler only knows a limited amount of future information, it can still be shown that no constant competitive ratio holds, by increasing the value of $Q$.
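The final step of the counterexample reduces to elementary algebra; a quick numeric check that the choice $Q > c_0^2 + 2c_0$ indeed pushes the Case II bound $Q/(c_0 + 2)$ above $c_0$:

```python
# For each target c0, pick the smallest integer Q exceeding c0^2 + 2c0 and
# confirm that the Case II lower bound Q / (c0 + 2) already exceeds c0.
for c0 in [1, 2, 5, 10, 100]:
    Q = c0 * c0 + 2 * c0 + 1
    assert Q / (c0 + 2) > c0     # hence the competitive ratio c > c0
print("bound exceeds c0 for all tested c0")
```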
We now introduce a slightly weaker notion of performance, called the efficiency ratio.
Definition 2. We say that the scheduling algorithm $S$ has an efficiency ratio $\gamma$ if the total flow-time $F_S(T, n, \{a_i, M_i, R_i; i = 1...n\})$ of scheduling algorithm $S$ satisfies the following:
$$\lim_{T \to \infty} \frac{F_S(T, n, \{a_i, M_i, R_i; i = 1...n\})}{F^*(T, n, \{a_i, M_i, R_i; i = 1...n\})} \le \gamma, \quad \text{with probability 1.} \tag{8}$$
Later, we will show that for the quick example, a constant efficiency ratio $\gamma$ can still exist (e.g., in the non-preemptive scenario with light-tailed distributed Reduce workload in Section V).
V. WORK-CONSERVING SCHEDULERS
In this section, we analyze the performance of work-conserving schedulers in both preemptive and non-preemptive scenarios.
A. Preemptive Scenario
We first study the case in which the workload of the Reduce tasks of each job is bounded by a constant, i.e., there exists a constant $R_{\max}$ such that $R_i \le R_{\max}$, $\forall i$.
Consider the total scheduled number of machines in all the time slots. If all the $N$ machines are scheduled in a time slot, we call this time slot a "developed" time slot; otherwise, we call this time slot a "developing" time slot. We call a time slot a "no-arrival" time slot if there is no arrival in this time slot. The probability that a time slot is a no-arrival time slot only depends on the distribution of arrivals. Since the job arrivals in each time slot are i.i.d., the probability that a time slot is a no-arrival time slot is equal to $p_0$, and $p_0 < 1$. We define the $j$-th "interval" to be the interval between the $(j-1)$-th developing time slot and the $j$-th developing time slot. (We define the first interval as the interval from the first time slot to the first developing time slot.) Thus, the last time slot of each interval is the only developing time slot in the interval. Let $K_j$ be the length of the $j$-th interval, as shown in Fig. 2.
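A Monte Carlo sketch may help visualize this interval structure. The parameters below are our own illustrative choices, and the simulation collapses all work into a single backlog (ignoring the Map-before-Reduce precedence), so it only approximates a work-conserving schedule:

```python
import math, random

def poisson(lam):
    # Knuth's method for sampling a Poisson(lam) random variate.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

random.seed(0)
N, T = 10, 100_000
lam, M, R = 2.0, 2, 2          # assumed parameters; rho = lam*(M+R)/N = 0.8 < 1

backlog, lengths, length = 0, [], 0
for _ in range(T):
    backlog += poisson(lam) * (M + R)   # i.i.d. arrivals bring Map+Reduce workload
    served = min(N, backlog)            # work-conserving: never idle with work left
    backlog -= served
    length += 1
    if served < N:                      # a "developing" slot ends the current interval
        lengths.append(length)
        length = 0

print(round(sum(lengths) / len(lengths), 2))   # empirical estimate of E[K_j]
```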
First, we show that every moment of the interval length $K_j$ is bounded by a constant, as shown in Lemma 3. Then, we give the expressions of the first and second moments of $K_j$ in Corollary 1. For the artificial system with dummy Reduce workload, similar results can be achieved, as shown in Corollary 2. Moreover, the interval lengths are i.i.d. in the artificial system, as shown in Remark 1. Based on all the above results, we can get Theorem 2, which shows the performance of any work-conserving scheduler.
Lemma 3. If the algorithm is work-conserving, then for any given number $H$, there exists a constant $B_H$ such that $E[K_j^H] < B_H$, $\forall j$.
Proof: Since $K_j$ is a positive random variable, $E[K_j^H] \le 1$ if $H \le 0$. Thus, we only focus on the case that $H > 0$.
In the $j$-th interval, consider the $t$-th time slot of this interval. (Assume $t = 1$ is the first time slot in this interval.) Assume that the total workload of the arrivals in the $t$-th time slot of this interval is $W_{j,t}$. Also assume that the unavailable workload of the Reduce functions at the end of the $t$-th time slot in this interval is $R_{j,t}$, i.e., the workload of the Reduce functions whose corresponding Map functions are finished in the $t$-th time slot of this interval is $R_{j,t}$. The remaining workload of the Reduce functions left by the previous interval is $R_{j,0}$. Then, the distribution of $K_j$ can be written as below:
$$\begin{aligned}
P(K_j = k) = P\Big( & W_{j,1} + R_{j,0} - R_{j,1} \ge N, \\
& W_{j,1} + W_{j,2} + R_{j,0} - R_{j,2} \ge 2N, \\
& ..., \\
& \sum_{t=1}^{k-1} W_{j,t} + R_{j,0} - R_{j,k-1} \ge (k-1)N, \\
& \sum_{t=1}^{k} W_{j,t} + R_{j,0} - R_{j,k} < kN \Big). \tag{9}
\end{aligned}$$
By the assumptions, for each job, $R_i \le R_{\max}$, for all $i$. Also, the unit of allocation is one machine; thus, in the last slot of the previous interval, the maximum number of finished Map functions is $N - 1$. Thus, $R_{j,0} \le (N-1)R_{\max}$. Now, let us consider the case of $k > 1$:
$$\begin{aligned}
P(K_j = k) &\le P\Big( \sum_{t=1}^{k-1} W_{j,t} + R_{j,0} - R_{j,k-1} \ge (k-1)N \Big) \\
&\le P\Big( \sum_{t=1}^{k-1} W_{j,t} + (N-1)R_{\max} \ge (k-1)N \Big) \\
&= P\Big( \frac{\sum_{t=1}^{k-1} W_{j,t}}{k-1} \ge N - \frac{(N-1)R_{\max}}{k-1} \Big).
\end{aligned} \tag{10}$$
By the assumptions, the total (both Map and Reduce) arriving workload $W_{j,t}$ in each time slot is i.i.d., and the distribution does not depend on the interval index $j$ or the index $t$ of the time slot in this interval. Then, for any interval $j$, we know that
$$P(K_j = k) \le P\Big( \frac{\sum_{s=1}^{k-1} W_s}{k-1} \ge N - \frac{(N-1)R_{\max}}{k-1} \Big), \tag{11}$$
where $\{W_s\}$ is a sequence of i.i.d. random variables with the same distribution as $W_{j,t}$.
Since the traffic intensity $\rho < 1$, we have $E[W_s] = \lambda(\bar M + \bar R) < N$.
For any $\epsilon > 0$, there exists a $k_0$ such that for any $k > k_0$, $\frac{(N-1)R_{\max}}{k-1} \le \epsilon$. Then for any $k > k_0$, we have
$$P(K_j = k) \le P\Big( \frac{\sum_{s=1}^{k-1} W_s}{k-1} \ge N - \epsilon \Big). \tag{12}$$
We take $\epsilon < N - E[W_s]$ and the corresponding $k_0$. For any $k > k_0 = 1 + \lceil \frac{(N-1)R_{\max}}{\epsilon} \rceil$, by Chernoff's theorem in large deviations theory [9], we obtain
$$P\Big( \frac{\sum_{s=1}^{k-1} W_s}{k-1} \ge N - \epsilon \Big) \le e^{-k\, l(N-\epsilon)}, \tag{13}$$
where the rate function $l(a)$ is given by
$$l(a) = \sup_{\theta \ge 0} \big( \theta a - \log(E[e^{\theta W_s}]) \big). \tag{14}$$
Since the moment generating function of the workload in a time slot has finite value in some neighborhood of 0, for any $0 < \epsilon < N - \lambda(\bar M + \bar R)$, we know that $l(N - \epsilon) > 0$. Thus, we obtain
$$\begin{aligned}
E[K_j^H] &= \sum_{k=0}^{\infty} k^H P(K_j = k) \\
&= \sum_{k=0}^{k_0} k^H P(K_j = k) + \sum_{k=k_0+1}^{\infty} k^H P(K_j = k) \\
&\le k_0^H + \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)} \\
&= \Big( 1 + \Big\lceil \frac{(N-1)R_{\max}}{\epsilon} \Big\rceil \Big)^H + \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)}.
\end{aligned} \tag{15}$$
Since we know that $l(N - \epsilon) > 0$ in this scenario, $\sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)}$ is bounded for given $H$.
Thus, given $H$, $E[K_j^H]$ is bounded by a constant $B_H$, for any $j$, where $B_H$ is given as below:
$$B_H \triangleq \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \Bigg[ \Big( 1 + \Big\lceil \frac{(N-1)R_{\max}}{\epsilon} \Big\rceil \Big)^H + \sum_{k=2}^{\infty} k^H e^{-k\, l(N-\epsilon)} \Bigg]. \tag{16}$$
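The rate function in Eq. (14) is straightforward to evaluate numerically. As an illustration with an assumed workload distribution, not one from the paper: if Poisson($\lambda$) jobs arrive per slot and each brings $\bar M + \bar R = 4$ units, then $W = 4 \cdot \text{Poisson}(\lambda)$ and $\log E[e^{\theta W}] = \lambda(e^{4\theta} - 1)$:

```python
import math

lam, per_job, N = 2.0, 4, 10       # assumed parameters: E[W] = lam*per_job = 8 < N

def log_mgf(theta):
    # log E[e^{theta W}] for W = per_job * Poisson(lam).
    return lam * (math.exp(theta * per_job) - 1.0)

def rate(a):
    # l(a) = sup_{theta >= 0} [theta*a - log E[e^{theta W}]], via grid search.
    return max(theta * a - log_mgf(theta) for theta in (t / 1000 for t in range(500)))

print(rate(N - 1) > 0)   # True, since a = 9 > E[W] = 8; so e^{-k l(N-eps)} decays
```

A strictly positive rate at $a = N - \epsilon$ is exactly what makes the tail sum in Eq. (15) finite.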
Corollary 1. $E[K_j]$ is bounded by a constant $B_1$, and $E[K_j^2]$ is bounded by a constant $B_2$, for any $j$. The expressions of $B_1$ and $B_2$ are given below, where the rate function $l(a)$ is defined in Eq. (21):
$$B_1 = \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \Bigg[ 1 + \Big\lceil \frac{(N-1)R_{\max}}{\epsilon} \Big\rceil + \frac{2e^{l(N-\epsilon)} - 1}{e^{l(N-\epsilon)} \big( e^{l(N-\epsilon)} - 1 \big)^2} \Bigg], \tag{17}$$
$$B_2 = \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \Bigg[ \Big( 1 + \Big\lceil \frac{(N-1)R_{\max}}{\epsilon} \Big\rceil \Big)^2 + \frac{4e^{2l(N-\epsilon)} - 3e^{l(N-\epsilon)} + 1}{e^{l(N-\epsilon)} \big( e^{l(N-\epsilon)} - 1 \big)^3} \Bigg]. \tag{18}$$
Proof: By substituting $H = 1$ and $H = 2$ in Eq. (16) and evaluating the resulting geometric-type sums in closed form, we directly obtain Eqs. (17) and (18).
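The closed-form tails in Eqs. (17) and (18) correspond to the sums $\sum_{k \ge 2} k x^k$ and $\sum_{k \ge 2} k^2 x^k$ with $x = e^{-l(N-\epsilon)}$; a quick numerical check of both identities:

```python
import math

# For x = e^{-l} with l > 0, verify:
#   sum_{k>=2} k   x^k = (2e^l - 1)            / (e^l (e^l - 1)^2)
#   sum_{k>=2} k^2 x^k = (4e^{2l} - 3e^l + 1)  / (e^l (e^l - 1)^3)
for l in (0.1, 0.5, 1.0, 2.0):
    u, x = math.exp(l), math.exp(-l)
    s1 = sum(k * x**k for k in range(2, 2000))        # truncation error ~ x^2000
    s2 = sum(k * k * x**k for k in range(2, 2000))
    assert math.isclose(s1, (2*u - 1) / (u * (u - 1)**2), rel_tol=1e-9)
    assert math.isclose(s2, (4*u*u - 3*u + 1) / (u * (u - 1)**3), rel_tol=1e-9)
print("closed forms verified")
```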
We add some dummy Reduce workload in the beginning time slot of each interval, such that the remaining workload from the previous interval is equal to $(N-1)R_{\max}$. The added workload is only scheduled in the last time slot of each interval of the original system. Then, in the new artificial system, the total flow time of the real jobs does not change. However, the total number of intervals may decrease, because the added workload may merge some adjacent intervals.
Corollary 2. In the artificial system, the lengths of the new intervals $\{\bar K_j, j = 1, 2, ...\}$ also satisfy the results of Lemma 3 and Corollary 1.
Proof: The proof is similar to the proofs of Lemma 3 and Corollary 1.
Remark 1. In the artificial system constructed in Corollary 2, the length of each interval has no effect on the lengths of the other intervals, and it only depends on the job arrivals and the workload of each job in this interval. Based on our assumptions, the job arrivals in each time slot are i.i.d. Also, the workloads of the jobs are i.i.d. Thus, the random variables $\{\bar K_j, j = 1, 2, ...\}$ are i.i.d.
Theorem 2. In the preemptive scenario, any work-conserving scheduler has a constant efficiency ratio
$$\frac{B_2 + B_1^2}{\max\big\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\big\} \max\big\{1, \frac{1}{N(1-\rho)}\big\}},$$
where $p_0$ is the probability that no job arrives in a time slot, and $B_1$ and $B_2$ are given in Eqs. (19) and (20):
$$B_1 = \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \Bigg[ 1 + \Big\lceil \frac{(N-1)R_{\max}}{\epsilon} \Big\rceil + \frac{2e^{l(N-\epsilon)} - 1}{e^{l(N-\epsilon)} \big( e^{l(N-\epsilon)} - 1 \big)^2} \Bigg], \tag{19}$$
$$B_2 = \min_{\epsilon \in (0,\, N - \lambda(\bar M + \bar R))} \Bigg[ \Big( 1 + \Big\lceil \frac{(N-1)R_{\max}}{\epsilon} \Big\rceil \Big)^2 + \frac{4e^{2l(N-\epsilon)} - 3e^{l(N-\epsilon)} + 1}{e^{l(N-\epsilon)} \big( e^{l(N-\epsilon)} - 1 \big)^3} \Bigg], \tag{20}$$
where the rate function $l(a)$ is defined as
$$l(a) = \sup_{\theta \ge 0} \big( \theta a - \log(E[e^{\theta W_s}]) \big). \tag{21}$$
Proof: Consider the artificial system constructed in Corollary 2. Since we have more (dummy) Reduce workload in the artificial system, the total flow time $\bar F$ of the artificial system is equal to or greater than the total flow time $F$ of the work-conserving scheduling policy. (The total flow time is defined in Section III.)
Observe that in each interval, all the job arrivals in this interval must finish their Map functions in this interval, and all their Reduce functions will finish before the end of the next interval. In other words, for all the arrivals in the $j$-th interval,
Fig. 2. An example schedule of a work-conserving scheduler
the Map functions are finished within $K_j$ time slots, and the Reduce functions are finished within $K_j + K_{j+1}$ time slots, as shown in Fig. 2. (The vertical axis represents the total number of scheduled machines in each time slot.) This is also true for the artificial system with dummy Reduce workload, i.e., all the arrivals in the $j$-th interval of the artificial system can be finished within $\bar K_j + \bar K_{j+1}$ time slots.
Let $A_j$ be the number of arrivals in the $j$-th interval. Since the algorithm is work-conserving, all the available Map functions which arrived before the interval must have finished before this interval. And all the Reduce functions which are generated by the arrivals in this interval must finish before the end of the last time slot of the next interval. So the total flow time $F_j$ of the work-conserving algorithm for all the arrivals in the $j$-th interval is less than $A_j(K_j + K_{j+1})$. Similarly, the total flow time $\bar F_j$ for all the arrivals in the $j$-th interval of the artificial system is less than $\bar A_j(\bar K_j + \bar K_{j+1})$, where $\bar A_j$ is the number of arrivals in the $j$-th interval of the artificial system.
$T \triangleq \sum_{j=1}^{m} K_j$ is equal to the finishing time of all the $n$ jobs. For any scheduling policy, the total flow time is not less than the number of time slots in which the number of scheduled machines is not equal to 0. Thus, for the optimal scheduling method, the total flow time $F^*$ should be no less than $T$.
For any scheduling policy, each job needs at least 1 time slot to schedule its Map function and 1 time slot to schedule its Reduce function. Thus, the lower bound of each job's flow time is 2. Thus, for the optimal scheduling method, its total flow time $F_j^*$ for the arrivals in the $j$-th interval should be no less than $2A_j$.
Let $F$ be the total flow time of any work-conserving algorithm, and $F^*$ be the total flow time of the optimal scheduling policy. Assume that there are a total of $m$ intervals for the $n$ arrivals. Correspondingly, there are a total of $\bar m$ intervals in the artificial system. Also, based on the description of the artificial system, $m \ge \bar m$ and $F = \bar F = \sum_{j=1}^{\bar m} \bar F_j$.
Since $E[M_i + R_i] > 0$ and $\rho < 1$, we have $\lim_{n \to \infty} m = \infty$ and $\lim_{n \to \infty} \bar m = \infty$.
$$\begin{aligned}
\lim_{n \to \infty} \frac{F}{F^*} &= \lim_{n \to \infty} \frac{\bar F}{F^*} = \lim_{n \to \infty} \frac{\sum_{j=1}^{\bar m} \bar F_j}{\sum_{j=1}^{m} F_j^*} \le \lim_{\bar m \to \infty} \frac{\sum_{j=1}^{\bar m} \bar A_j (\bar K_j + \bar K_{j+1})}{T} \\
&\le \frac{\displaystyle \lim_{\bar m \to \infty} \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_j}{\bar m} + \lim_{\bar m \to \infty} \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_{j+1}}{\bar m}}{\displaystyle \lim_{\bar m \to \infty} \frac{T}{\bar m}}.
\end{aligned} \tag{22}$$
In the artificial system, $\bar A_j$ is the number of arrivals in the $j$-th interval. Since $\{\bar K_j, j = 1, 2, ...\}$ are i.i.d. random variables, $\{\bar A_j \bar K_j, j = 1, 2, ...\}$ are also i.i.d. Based on the Strong Law of Large Numbers (SLLN), we obtain
$$\lim_{\bar m \to \infty} \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_j}{\bar m} \overset{\text{w.p.1}}{=} E[\bar A_j \bar K_j]. \tag{23}$$
The random variables $\{\bar A_j \bar K_{j+1}, j = 1, 2, ...\}$ are identically distributed, but not independent. However, all the odd terms are independent of each other, and all the even terms are independent of each other. Based on the SLLN, we obtain
$$\begin{aligned}
\lim_{\bar m \to \infty} \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_{j+1}}{\bar m} &= \lim_{\bar m \to \infty} \frac{\sum_{\{j \le \bar m,\, j \text{ odd}\}} \bar A_j \bar K_{j+1} + \sum_{\{j \le \bar m,\, j \text{ even}\}} \bar A_j \bar K_{j+1}}{\bar m} \\
&= \lim_{\bar m \to \infty} \frac{\sum_{\{j \le \bar m,\, j \text{ odd}\}} \bar A_j \bar K_{j+1}}{\lceil \bar m/2 \rceil} \cdot \frac{\lceil \bar m/2 \rceil}{\bar m} + \lim_{\bar m \to \infty} \frac{\sum_{\{j \le \bar m,\, j \text{ even}\}} \bar A_j \bar K_{j+1}}{\lfloor \bar m/2 \rfloor} \cdot \frac{\lfloor \bar m/2 \rfloor}{\bar m} \\
&\overset{\text{w.p.1}}{=} \frac{1}{2} E[\bar A_j \bar K_{j+1}] + \frac{1}{2} E[\bar A_j \bar K_{j+1}] = E[\bar A_j \bar K_{j+1}].
\end{aligned} \tag{24}$$
Based on Eqs. (23) and (24), we obtain
$$\begin{aligned}
\lim_{\bar m \to \infty} \Bigg[ \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_j}{\bar m} + \frac{\sum_{j=1}^{\bar m} \bar A_j \bar K_{j+1}}{\bar m} \Bigg] &\overset{\text{w.p.1}}{=} E[\bar A_j \bar K_j] + E[\bar A_j \bar K_{j+1}] \\
&= E[\bar A_j \bar K_j] + E[\bar A_j] E[\bar K_{j+1}] \\
&= E[E[\bar A_j \bar K_j \mid \bar K_j]] + E[E[\bar A_j \mid \bar K_j]] E[\bar K_{j+1}] \\
&= E[\bar K_j E[\bar A_j \mid \bar K_j]] + E[E[\bar A_j \mid \bar K_j]] E[\bar K_{j+1}] \\
&= E[\bar K_j (\lambda \bar K_j)] + E[\lambda \bar K_j] E[\bar K_{j+1}] \\
&= \lambda \big( E[\bar K_j^2] + E[\bar K_j]^2 \big).
\end{aligned} \tag{25}$$
For these $n$ jobs, there are $\bar m$ developing time slots before time $\bar T$, and in each developing time slot at most $N - 1$ machines are assigned. Then, we have
$$N \sum_{j=1}^{\bar m} (\bar K_j - 1) + (N-1) \bar m \ge \sum_{s \in \{\text{arrivals before } \bar T\}} W_s. \tag{26}$$
Thus,
$$\Bigg( N - \frac{\sum_{s \in \{\text{arrivals before } \bar T\}} W_s}{\bar T} \Bigg) \frac{\sum_{j=1}^{\bar m} \bar K_j}{\bar m} \ge 1. \tag{27}$$
Since the workloads $W_s$ of the jobs are i.i.d., by the SLLN, we have
$$\big( N - \lambda(\bar M + \bar R) \big) \lim_{\bar m \to \infty} \frac{\sum_{j=1}^{\bar m} \bar K_j}{\bar m} \ge 1 \quad \text{w.p.1}, \tag{28}$$
i.e.,
$$\lim_{\bar m \to \infty} \frac{\sum_{j=1}^{\bar m} \bar K_j}{\bar m} \ge \frac{1}{N(1-\rho)} \quad \text{w.p.1}. \tag{29}$$
Thus, since $\bar K_j \ge 1$ for every $j$,
$$\lim_{\bar m \to \infty} \frac{\bar T}{\bar m} = \lim_{\bar m \to \infty} \frac{\sum_{j=1}^{\bar m} \bar K_j}{\bar m} \ge \max\Big\{ 1, \frac{1}{N(1-\rho)} \Big\} \quad \text{w.p.1}. \tag{30}$$
For any work-conserving scheduler, $\bar{T} - T$ is at most the total number of no-arrival time slots, so we have
\[
\lim_{\bar{m}\to\infty}\frac{T}{\bar{m}} \ge (1-p_0)\lim_{\bar{m}\to\infty}\frac{\bar{T}}{\bar{m}} \quad w.p.1. \qquad (31)
\]
At the same time, in each time slot in which the number of scheduled machines is greater than zero, at most $N$ machines are scheduled. Then, we get
\[
NT \ge \sum_{i=1}^{n}(M_i + R_i). \qquad (32)
\]
Then, since $(1/n)\sum_{i=1}^{n}(M_i+R_i) \to \bar{M}+\bar{R}$ and $n/\bar{T} \to \lambda$ w.p.1, we have
\[
\lim_{\bar{m}\to\infty}\frac{T}{\bar{m}} \ge \frac{\lambda(\bar{M}+\bar{R})}{N}\lim_{\bar{m}\to\infty}\frac{\bar{T}}{\bar{m}}
= \rho\lim_{\bar{m}\to\infty}\frac{\bar{T}}{\bar{m}} \quad w.p.1. \qquad (33)
\]
Thus,
\[
\lim_{n\to\infty}\frac{F}{F^*}
\le \frac{\lambda\big(E[\bar{K}_j^2] + E[\bar{K}_j]^2\big)}{\max\{\rho,\,1-p_0\}\max\{1, \frac{1}{N(1-\rho)}\}}
\le \frac{\lambda\big(B_2 + B_1^2\big)}{\max\{\rho,\,1-p_0\}\max\{1, \frac{1}{N(1-\rho)}\}} \quad w.p.1. \qquad (34)
\]
Similarly, since every job needs at least one Map slot and one Reduce slot, $F^* = \sum_{j=1}^{m} F^*_j \ge \sum_{j=1}^{m} 2A_j$, so
\[
\lim_{n\to\infty}\frac{F}{F^*}
= \lim_{n\to\infty}\frac{\sum_{j=1}^{m}F_j}{\sum_{j=1}^{m}F^*_j}
\le \lim_{m\to\infty}\frac{\sum_{j=1}^{m}A_j(K_j+K_{j+1})}{\sum_{j=1}^{m}2A_j}
\le \frac{\lim_{m\to\infty}\frac{1}{m}\sum_{j=1}^{m}A_j(K_j+K_{j+1})}{\lim_{m\to\infty}\frac{1}{m}\sum_{j=1}^{m}2A_j}. \qquad (35)
\]
By the Strong Law of Large Numbers (SLLN), since $\sum_{j=1}^{m}A_j = n$ and $\sum_{j=1}^{m}K_j = T$, we obtain
\[
\lim_{m\to\infty}\frac{\sum_{j=1}^{m}A_j}{m}
= \lim_{m\to\infty}\frac{\sum_{j=1}^{m}A_j}{\sum_{j=1}^{m}K_j}\cdot\lim_{m\to\infty}\frac{\sum_{j=1}^{m}K_j}{m}
= \lambda\lim_{m\to\infty}\frac{\sum_{j=1}^{m}K_j}{m}
\ge \lambda\max\Big\{1, \frac{1}{N(1-\rho)}\Big\} \quad w.p.1, \qquad (36)
\]
where the last inequality follows as in Eq. (30).
Thus,
\[
\lim_{n\to\infty}\frac{F}{F^*} \le \frac{\lambda(B_2 + B_1^2)}{2\lambda\max\{1, \frac{1}{N(1-\rho)}\}} \quad w.p.1. \qquad (37)
\]
Thus, based on Eqs. (34) and (37), the efficiency ratio of any work-conserving scheduling policy is not greater than
\[
\frac{B_2 + B_1^2}{\max\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\}\max\{1, \frac{1}{N(1-\rho)}\}}
\]
with probability 1, when $n$ goes to infinity.
B. Non-preemptive Scenario
The definitions of $B_1$, $B_2$, $p_0$, $F$, $\bar{F}$, $F^*$, $F_j$, $\bar{F}_j$, $K_j$, $\bar{K}_j$, $A_j$, $\bar{A}_j$, $W_{j,t}$, and $R_{j,t}$ in this section are the same as in Section V-A.
Lemma 4. Lemma 3, Corollary 1, Corollary 2, and Remark 4 in Section V-A are also valid in the non-preemptive Reduce scenario.
Proof: Different from the migratable Reduce scenario of Section V-A, $R_{j,t}$ here consists of two parts. First, it contains the workload of the unavailable Reduce functions that are released by the Map functions finishing in the current time slot, the last time slot of each interval. Second, it also contains the workload of Reduce functions that cannot be processed in this time slot, even if there are idle machines, because the Reduce functions cannot migrate among machines.

However, the remaining workload of Reduce functions from the previous interval still satisfies $R_{j,0} \le (N-1)R_{\max}$, because the total number of remaining unfinished Reduce functions and newly available Reduce functions in this time slot is less than $N$. All the following steps are the same.
Theorem 3. If the Reduce functions are not allowed to migrate among different machines, then for any work-conserving scheduling policy, the efficiency ratio is not greater than
\[
\frac{B_2 + B_1^2 + B_1(R_{\max}-1)}{\max\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\}\max\{1, \frac{1}{N(1-\rho)}\}}
\]
with probability 1, when $n$ goes to infinity. $B_1$ and $B_2$ can be obtained from Corollary 1, and $p_0$ is the probability that a time slot is a no-arrival time slot.
Proof: In the non-preemptive Reduce scenario, in each interval, all the Map functions of the jobs arriving in this interval are finished within the interval, and all of their Reduce functions are started before the end of the next interval. In other words, for all the arrivals in the $j$th interval, the Map functions are finished in $K_j$ time slots, and the Reduce functions are finished in $K_j + K_{j+1} + R_{\max} - 1$ time slots, as shown in Fig. 2. (The vertical axis represents the total number of scheduled machines in each time slot.) This is also true for the artificial system with dummy Reduce workload, i.e., all the arrivals in the $j$th interval of the artificial system can be finished in $\bar{K}_j + \bar{K}_{j+1} + R_{\max} - 1$ time slots.

We again let $A_j$ denote the number of arrivals in the $j$th interval. Since the algorithm is work-conserving, all the available Map functions that arrive before the interval are already finished before this interval, and all the Reduce functions generated by the arrivals in this interval are finished within $R_{\max} - 1$ slots after the end of the next interval. So the total flow time $F_j$ of the work-conserving algorithm in the $j$th interval is less than $A_j(K_j + K_{j+1} + R_{\max} - 1)$. Similarly, the total flow time $\bar{F}_j$ in the $j$th interval of the artificial system is less than $\bar{A}_j(\bar{K}_j + \bar{K}_{j+1} + R_{\max} - 1)$, where $\bar{A}_j$ is the number of arrivals in the $j$th interval of the artificial system.

Compared to Section V-A, the flow time of each arrival has one additional term, $R_{\max} - 1$, because the Reduce functions are non-preemptive in this section. Thus, using the same proof method as in Theorem 2, we obtain
\[
\lim_{n\to\infty}\frac{F}{F^*}
\le \lim_{\bar{m}\to\infty}\frac{\sum_{j=1}^{\bar{m}}\bar{A}_j(\bar{K}_j+\bar{K}_{j+1})+\sum_{i=1}^{n}(R_i-1)}{\sum_{j=1}^{\bar{m}}(\bar{K}_j-S_j)}
= \frac{\lim_{\bar{m}\to\infty}\frac{1}{\bar{m}}\sum_{j=1}^{\bar{m}}\bar{A}_j(\bar{K}_j+\bar{K}_{j+1})
       + \lim_{\bar{m}\to\infty}\frac{n}{\bar{m}}\,\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}(R_i-1)}
      {\lim_{\bar{m}\to\infty}\frac{1}{\bar{m}}\sum_{j=1}^{\bar{m}}(\bar{K}_j-S_j)}
\le \frac{\lambda\big[B_2+B_1^2+B_1(\bar{R}-1)\big]}{\max\{\rho,\,1-p_0\}\max\{1,\frac{1}{N(1-\rho)}\}} \quad w.p.1, \qquad (38)
\]
and
\[
\lim_{n\to\infty}\frac{F}{F^*} \le \frac{\lambda\big[B_2+B_1^2+B_1(\bar{R}-1)\big]}{2\lambda\max\{1, \frac{1}{N(1-\rho)}\}} \quad w.p.1. \qquad (39)
\]
Thus, based on Eqs. (38) and (39), any work-conserving scheduler has an efficiency ratio not greater than
\[
\frac{B_2+B_1^2+B_1(\bar{R}-1)}{\max\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\}\max\{1, \frac{1}{N(1-\rho)}\}};
\]
since $\bar{R} \le R_{\max}$, the bound in the theorem statement follows.
Theorem 4. In the non-preemptive scenario, any work-conserving scheduler has a constant efficiency ratio
\[
\frac{B_2+B_1^2+B_1(\bar{R}-1)}{\max\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\}\max\{1, \frac{1}{N(1-\rho)}\}},
\]
where $p_0$ is the probability that no job arrives in a time slot, and $B_1$ and $B_2$ are given in Eqs. (19) and (20).
Proof: In this scenario, for all the arrivals in the $j$th interval, the Map functions are finished in $K_j$ time slots, and the Reduce functions of job $i$ are finished in $K_j + K_{j+1} + R_i - 1$ time slots, as shown in Fig. 3. Since the number of arrivals in each interval is independent of the workload of the Reduce functions, instead of Eqs. (38) and (39) we can get
\[
\lim_{n\to\infty}\frac{F}{F^*}
\le \lim_{\bar{m}\to\infty}\frac{\sum_{j=1}^{\bar{m}}\bar{A}_j(\bar{K}_j+\bar{K}_{j+1}+\bar{R}-1)}{\sum_{j=1}^{\bar{m}}(\bar{K}_j-S_j)}
\le \frac{\lambda\big[B_2+B_1^2+B_1(\bar{R}-1)\big]}{\max\{\rho,\,1-p_0\}\max\{1,\frac{1}{N(1-\rho)}\}} \quad w.p.1, \qquad (40)
\]
and
\[
\lim_{n\to\infty}\frac{F}{F^*} \le \frac{\lambda\big[B_2+B_1^2+B_1(\bar{R}-1)\big]}{2\lambda\max\{1, \frac{1}{N(1-\rho)}\}} \quad w.p.1. \qquad (41)
\]
Thus, the efficiency ratio of any work-conserving scheduling policy is not greater than
\[
\frac{B_2+B_1^2+B_1(\bar{R}-1)}{\max\{2, \frac{1-p_0}{\lambda}, \frac{\rho}{\lambda}\}\max\{1, \frac{1}{N(1-\rho)}\}}
\]
with probability 1, when $n$ goes to infinity. Similarly, if the workload of the Reduce functions is light-tailed, the efficiency ratio still holds if we use $B'_1$ and $B'_2$ instead of $B_1$ and $B_2$.
Remark 2. In Theorems 2 and 4, we can relax the boundedness assumption on each Reduce job and allow the workload to follow a light-tailed distribution, i.e., a distribution on $R_i$ such that there exists $r_0$ with $P(R_i \ge r) \le \alpha\exp(-\beta r)$ for all $r \ge r_0$ and all $i$, where $\alpha, \beta > 0$ are two constants. We obtain results similar to Theorems 2 and 4 with different expressions.
Lemma 5. If the scheduling algorithm is work-conserving, then for any given number $H$, there exists a constant $B_H$ such that $E[K_j^H] < B_H$ for all $j$.
Proof: Similarly to Eqs. (9) and (10), we obtain
\[
P(K_j = k) \le P\bigg(\sum_{t=1}^{k-1} W_{j,t} + R_{j,0} \ge (k-1)N\bigg). \qquad (42)
\]
Fig. 3. Job $i$, which arrives in the $j$th interval, finishes its Map tasks in $K_j$ time slots and its Reduce tasks in $K_j + K_{j+1} + R_i - 1$ time slots.
Let $R'$ be the maximal remaining workload of the jobs in the previous interval. Then
\[
P(K_j = k) \le P\bigg(\sum_{t=1}^{k-1} W_{j,t} + (N-1)R' \ge (k-1)N\bigg)
= \sum_{r=0}^{\infty}\bigg[P\bigg(\sum_{t=1}^{k-1} W_{j,t} + (N-1)R' \ge (k-1)N \,\bigg|\, R' = r\bigg)P(R'=r)\bigg]
= \sum_{r=0}^{\infty}\bigg[P\bigg(\frac{\sum_{t=1}^{k-1} W_{j,t}}{k-1} \ge N - \frac{(N-1)r}{k-1}\bigg)P(R'=r)\bigg]. \qquad (43)
\]
Thus, we obtain
\[
E[K_j^H] = \sum_{k=0}^{\infty} k^H P(K_j = k)
= \sum_{k=0}^{\infty}\bigg\{k^H \sum_{r=0}^{\infty}\bigg[P\bigg(\frac{\sum_{t=1}^{k-1} W_{j,t}}{k-1} \ge N - \frac{(N-1)r}{k-1}\bigg)P(R'=r)\bigg]\bigg\}
= \sum_{r=0}^{\infty}\bigg\{P(R'=r)\sum_{k=0}^{\infty}\bigg[k^H P\bigg(\frac{\sum_{t=1}^{k-1} W_{j,t}}{k-1} \ge N - \frac{(N-1)r}{k-1}\bigg)\bigg]\bigg\}.
\]
The next steps are similar to the proof of Lemma 3; we skip the details and directly state the result. For any $0 < \epsilon < N - \lambda(\bar{M}+\bar{R})$, we know that $l(N-\epsilon) > 0$. Thus, we obtain
\[
E[K_j^H]
\le \sum_{r=0}^{\infty}\bigg\{P(R'=r)\bigg[\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big)^H + \sum_{k=2}^{\infty} k^H e^{-k\,l(N-\epsilon)}\bigg]\bigg\}
= \sum_{k=2}^{\infty} k^H e^{-k\,l(N-\epsilon)} + \sum_{r=0}^{\infty} P(R'=r)\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big)^H
\le \sum_{k=2}^{\infty} k^H e^{-k\,l(N-\epsilon)} + \sum_{r=0}^{\infty} P(R'\ge r)\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big)^H
\le \sum_{k=2}^{\infty} k^H e^{-k\,l(N-\epsilon)} + \sum_{r=0}^{\infty} P(R\ge r)\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big)^H
\le \sum_{k=2}^{\infty} k^H e^{-k\,l(N-\epsilon)} + \sum_{r=0}^{r_0-1}\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big)^H + \sum_{r=r_0}^{\infty}\alpha e^{-\beta r}\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big)^H.
\]
Since $l(N-\epsilon) > 0$ and $\beta > 0$ in this scenario, the sums $\sum_{k=2}^{\infty} k^H e^{-k\,l(N-\epsilon)}$, $\sum_{r=r_0}^{\infty}\alpha e^{-\beta r}\big(1 + \lceil\frac{(N-1)r}{\epsilon}\rceil\big)^H$, and $\sum_{r=0}^{r_0-1}\big(1 + \lceil\frac{(N-1)r}{\epsilon}\rceil\big)^H$ are all bounded for any given $H$.
Thus, given $H$, $E[K_j^H]$ is bounded by a constant $B_H$ for any $j$, where $B_H$ is given as below:
\[
B_H \triangleq \min_{\epsilon\in(0,\,N-\lambda(\bar{M}+\bar{R}))}\bigg[\sum_{k=2}^{\infty} k^H e^{-k\,l(N-\epsilon)} + \sum_{r=0}^{r_0-1}\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big)^H + \sum_{r=r_0}^{\infty}\alpha e^{-\beta r}\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big)^H\bigg]. \qquad (44)
\]
Corollary 3. $E[K_j]$ is bounded by a constant $B'_1$ and $E[K_j^2]$ is bounded by a constant $B'_2$, for any $j$. The expressions of $B'_1$ and $B'_2$ are given below, where the rate function $l(a)$ is defined in Eq. (21).
\[
B'_1 = \min_{\epsilon\in(0,\,N-\lambda(\bar{M}+\bar{R}))}\bigg[2r_0 + \frac{(N-1)r_0(r_0-1)}{2\epsilon} + \frac{2\alpha}{e^{\beta}-1} + \frac{\alpha(N-1)}{4\epsilon}\operatorname{csch}^2\Big(\frac{\beta}{2}\Big) + 2\alpha + \frac{2e^{l(N-\epsilon)}-1}{e^{l(N-\epsilon)}\big(e^{l(N-\epsilon)}-1\big)^2}\bigg], \qquad (45)
\]
\[
B'_2 = \min_{\epsilon\in(0,\,N-\lambda(\bar{M}+\bar{R}))}\bigg[4r_0 + \frac{2(N-1)r_0(r_0-1)}{\epsilon} + 4\alpha + \frac{(N-1)^2(r_0-1)r_0(2r_0-1)}{6\epsilon^2} + \frac{\alpha(N-1)^2}{8\epsilon^2}\sinh(\beta)\operatorname{csch}^4\Big(\frac{\beta}{2}\Big) + \frac{4\alpha}{e^{\beta}-1} + \frac{\alpha(N-1)}{\epsilon}\operatorname{csch}^2\Big(\frac{\beta}{2}\Big) + \frac{4e^{2l(N-\epsilon)}-3e^{l(N-\epsilon)}+1}{e^{l(N-\epsilon)}\big(e^{l(N-\epsilon)}-1\big)^3}\bigg]. \qquad (46)
\]
Proof: By substituting $H = 1$ and $H = 2$ in Eq. (44), we directly obtain Eqs. (47) and (48):
\[
B_1 = \min_{\epsilon\in(0,\,N-\lambda(\bar{M}+\bar{R}))}\bigg[\sum_{r=0}^{r_0-1}\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big) + \sum_{r=r_0}^{\infty}\alpha e^{-\beta r}\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big) + \frac{2e^{l(N-\epsilon)}-1}{e^{l(N-\epsilon)}\big(e^{l(N-\epsilon)}-1\big)^2}\bigg], \qquad (47)
\]
\[
B_2 = \min_{\epsilon\in(0,\,N-\lambda(\bar{M}+\bar{R}))}\bigg[\sum_{r=0}^{r_0-1}\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big)^2 + \sum_{r=r_0}^{\infty}\alpha e^{-\beta r}\Big(1 + \Big\lceil\tfrac{(N-1)r}{\epsilon}\Big\rceil\Big)^2 + \frac{4e^{2l(N-\epsilon)}-3e^{l(N-\epsilon)}+1}{e^{l(N-\epsilon)}\big(e^{l(N-\epsilon)}-1\big)^3}\bigg]. \qquad (48)
\]
To get a simple expression free of summations, we relax the right-hand side of Eq. (47) using $\lceil y\rceil \le 1 + y$:
\[
B_1 < \min_{\epsilon\in(0,\,N-\lambda(\bar{M}+\bar{R}))}\bigg[\sum_{r=0}^{r_0-1}\Big(2 + \frac{(N-1)r}{\epsilon}\Big) + \sum_{r=0}^{\infty}\alpha e^{-\beta r}\Big(2 + \frac{(N-1)r}{\epsilon}\Big) + \frac{2e^{l(N-\epsilon)}-1}{e^{l(N-\epsilon)}\big(e^{l(N-\epsilon)}-1\big)^2}\bigg]
= \min_{\epsilon\in(0,\,N-\lambda(\bar{M}+\bar{R}))}\bigg[2r_0 + \frac{(N-1)r_0(r_0-1)}{2\epsilon} + \frac{2\alpha}{e^{\beta}-1} + \frac{\alpha(N-1)}{4\epsilon}\operatorname{csch}^2\Big(\frac{\beta}{2}\Big) + 2\alpha + \frac{2e^{l(N-\epsilon)}-1}{e^{l(N-\epsilon)}\big(e^{l(N-\epsilon)}-1\big)^2}\bigg]
\triangleq B'_1. \qquad (49)
\]
Similarly,
\[
B_2 < \min_{\epsilon\in(0,\,N-\lambda(\bar{M}+\bar{R}))}\bigg[4r_0 + \frac{2(N-1)r_0(r_0-1)}{\epsilon} + 4\alpha + \frac{(N-1)^2(r_0-1)r_0(2r_0-1)}{6\epsilon^2} + \frac{\alpha(N-1)^2}{8\epsilon^2}\sinh(\beta)\operatorname{csch}^4\Big(\frac{\beta}{2}\Big) + \frac{4\alpha}{e^{\beta}-1} + \frac{\alpha(N-1)}{\epsilon}\operatorname{csch}^2\Big(\frac{\beta}{2}\Big) + \frac{4e^{2l(N-\epsilon)}-3e^{l(N-\epsilon)}+1}{e^{l(N-\epsilon)}\big(e^{l(N-\epsilon)}-1\big)^3}\bigg]
\triangleq B'_2. \qquad (50)
\]
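Since the closed forms in Eqs. (45)-(50) are minima of elementary functions over $\epsilon$, they are easy to evaluate numerically. The sketch below grid-searches $B'_1$ from Eq. (45); all parameter values and the quadratic (Gaussian-style) rate function $l(\cdot)$ are illustrative assumptions of ours, since the actual $l(\cdot)$ comes from Eq. (21) and the workload distribution.

```python
import math

N, lam, Mbar, Rbar = 100, 2.0, 5.0, 40.0   # assumed cluster/workload parameters
alpha, beta, r0 = 1.0, 0.1, 10             # assumed light-tail constants
wbar = lam * (Mbar + Rbar)                 # mean total workload per slot
sigma2 = 50.0                              # assumed variance proxy for the rate function

def l(a):
    # Illustrative quadratic rate function: positive exactly when a > wbar.
    return (a - wbar) ** 2 / (2.0 * sigma2)

def csch(x):
    return 1.0 / math.sinh(x)

def b1_term(eps):
    # One evaluation of the bracketed expression in Eq. (45).
    el = math.exp(l(N - eps))
    return (2 * r0
            + (N - 1) * r0 * (r0 - 1) / (2 * eps)
            + 2 * alpha / (math.exp(beta) - 1)
            + alpha * (N - 1) / (4 * eps) * csch(beta / 2) ** 2
            + 2 * alpha
            + (2 * el - 1) / (el * (el - 1) ** 2))

# Grid search over epsilon in (0, N - lambda(Mbar + Rbar)) = (0, 10).
eps_grid = [k / 100 for k in range(1, int(100 * (N - wbar)))]
B1_prime = min(b1_term(e) for e in eps_grid)
```

The two boundary behaviors are visible in the formula: small $\epsilon$ blows up the $1/\epsilon$ terms, while $\epsilon \to N - \lambda(\bar{M}+\bar{R})$ drives $l(N-\epsilon) \to 0$ and blows up the last term, so the minimum sits in the interior.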
Remark 3. Following the same proof steps, we can obtain Corollary 2, Remark 1, Theorem 2, and Remark 4 as in Section V-A. The difference is that we need to use $B'_1$ and $B'_2$ instead of $B_1$ and $B_2$ in this part.
Remark 4. Although any work-conserving scheduler has a constant efficiency ratio, the constant may be large (because the result holds for "any" work-conserving scheduler). We further discuss algorithms that tighten the constant efficiency ratio in Section VI.
VI. AVAILABLE-SHORTEST-REMAINING-PROCESSING-TIME (ASRPT) ALGORITHM AND ANALYSIS
In the previous sections we have shown that any work-conserving algorithm has a constant efficiency ratio, but the constant can be large, as it depends on the size of the jobs. In this section, we design an algorithm with a much tighter bound that does not depend on the size of the jobs. Although the tight bound is proved for the preemptive scenario, we also show via numerical results that our algorithm works well in the non-preemptive case.
Before presenting our solution, we first describe a known algorithm called SRPT (Shortest Remaining Processing Time) [10]. SRPT assumes that Map and Reduce tasks from the same job can be scheduled simultaneously in the same slot. In each slot, SRPT picks the job with the minimum total remaining workload, i.e., including both Map and Reduce tasks, to schedule. Observe that the SRPT scheduler may produce an infeasible solution, as it ignores the dependency between Map and Reduce tasks.
Lemma 6. Without considering the requirement that Reduce tasks can be processed only if the corresponding Map tasks are finished, the total flow-time $F_S$ of the Shortest-Remaining-Processing-Time (SRPT) algorithm is a lower bound on the total flow-time in the MapReduce framework.
Proof: Without the requirement that Reduce tasks can be processed only if the corresponding Map tasks are finished, the optimization problem in the preemptive scenario becomes:
\[
\min_{m_{i,t},\,r_{i,t}} \sum_{i=1}^{n}\big(f_i^{(r)}-a_i+1\big)
\quad \text{s.t.} \quad
\sum_{i=1}^{n}(m_{i,t}+r_{i,t})\le N,\ \ r_{i,t}\ge 0,\ \ m_{i,t}\ge 0,\ \ \forall t,
\qquad
\sum_{t=a_i}^{f_i^{(r)}}(m_{i,t}+r_{i,t})=M_i+R_i,\ \ \forall i\in\{1,\ldots,n\}. \qquad (51)
\]
The reader can easily check that
\[
\bigcap_{i=1}^{n}\Big\{\sum_{t=a_i}^{f_i^{(r)}}(m_{i,t}+r_{i,t})=M_i+R_i\Big\}
\supset
\bigcap_{i=1}^{n}\Big[\Big\{\sum_{t=a_i}^{f_i^{(m)}}m_{i,t}=M_i\Big\}\cap\Big\{\sum_{t=f_i^{(m)}+1}^{f_i^{(r)}}r_{i,t}=R_i\Big\}\Big]. \qquad (52)
\]
Thus, the constraints in Eq. (51) are weaker than the constraints in Eq. (1). Hence, the optimal value of Eq. (51) is not greater than the optimal value of Eq. (1). Since the SRPT algorithm achieves the optimal solution of Eq. (51) [10], its total flow-time $F_S$ is a lower bound for any scheduling method in the MapReduce framework. Since the optimization problem in the non-preemptive scenario has more constraints, the lower bound also holds for the non-preemptive scenario.
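The SRPT relaxation of Lemma 6 is straightforward to simulate slot by slot: ignore the Map-before-Reduce dependency and always serve the jobs with the least total remaining workload. A minimal sketch follows; the job instance and the machine-unit accounting are our own illustration, not the paper's simulator.

```python
def srpt_flow_time(jobs, N):
    """jobs: list of (arrival_slot, total_workload M_i + R_i).
    Returns the total flow time, counted as finish_slot - arrival_slot + 1
    as in the paper. N is the number of machine units per slot."""
    remaining = {i: w for i, (a, w) in enumerate(jobs)}
    finish = {}
    t = 0
    while remaining:
        t += 1
        # Jobs that have arrived and are unfinished, smallest remaining first.
        active = sorted((w, i) for i, w in remaining.items() if jobs[i][0] <= t)
        capacity = N
        for w, i in active:
            if capacity == 0:
                break
            served = min(w, capacity)
            capacity -= served
            remaining[i] -= served
            if remaining[i] == 0:
                del remaining[i]
                finish[i] = t
    return sum(finish[i] - jobs[i][0] + 1 for i in range(len(jobs)))

# Illustrative instance: three jobs, N = 2 machine units per slot.
total = srpt_flow_time([(1, 2), (1, 4), (2, 1)], N=2)
```

For this instance, SRPT finishes the three jobs with flow times 1, 4, and 1, for a total of 6; any feasible MapReduce schedule for the same instance has a total flow time at least this large.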
A. ASRPT Algorithm
Based on the SRPT scheduler, we present our greedy scheme, ASRPT. The design rests on a simple intuition: by ensuring that the Map tasks under ASRPT finish as early as under SRPT, and by assigning Reduce tasks with the smallest amount of remaining workload first, we can hope to reduce the overall flow-time. However, care must be taken to ensure that if the Map tasks of a job are scheduled in a slot, then its Reduce tasks are scheduled only after that slot.
ASRPT works as follows. It uses the schedule computed by SRPT to determine its own schedule; in other words, it runs SRPT in a virtual fashion and keeps track of how SRPT would have scheduled jobs in any given slot. At the beginning of each slot, the list of unfinished jobs $J$ and the scheduling list $S$ of the SRPT scheduler are updated. The jobs scheduled in the previous time slot are updated, and the newly arriving jobs are added to the list $J$, which is kept in non-decreasing order of available workload in this slot. In the list $J$, we also keep the number of schedulable tasks in this time slot. For a job that has unscheduled Map tasks, its available workload in this slot is the number of units of unfinished workload of both Map and Reduce tasks, while its schedulable tasks are the unfinished Map tasks. Otherwise, its available workload is the unfinished Reduce workload, while its schedulable tasks are the unfinished workload of the Reduce tasks (preemptive scenario) or the number of unfinished Reduce tasks (non-preemptive scenario), respectively. The algorithm then assigns machines to tasks in the following priority order (from high to low): the previously scheduled Reduce tasks that are not yet finished (only in the non-preemptive scenario), the Map tasks scheduled in the list $S$, the available Reduce tasks, and the available Map tasks. For each priority group, the algorithm greedily assigns machines to the corresponding tasks by traversing the sorted available-workload list $J$.

Fig. 4. The construction of the schedules for DSRPT and ASRPT based on the schedule for SRPT: (a) SRPT; (b) DSRPT; (c) ASRPT. Observe that the Map tasks are scheduled in the same way, but, roughly speaking, the Reduce tasks are delayed by one slot.

The pseudo-code of the algorithm is shown in Algorithm 1.
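To make the priority order concrete, the following is a simplified one-slot sketch of the assignment step in the preemptive case. The field names mirror Algorithm 1 ($S(i).MapLoad$ becomes `srpt_map_load`, the sorted list $J$ becomes `jobs`), but the data layout and the instance are our own illustration, and the non-preemptive carry-over group is omitted.

```python
def asrpt_assign_slot(jobs, srpt_map_load, N):
    """jobs: list of dicts with 'map_left' (units of schedulable Map work) and
    'reduce_avail' (units of available Reduce work).
    Returns (map_assign, reduce_assign) machine counts per job."""
    d = N  # number of idle machines
    map_assign = [0] * len(jobs)
    reduce_assign = [0] * len(jobs)
    # Process jobs in non-decreasing order of available total workload, as in J.
    order = sorted(range(len(jobs)),
                   key=lambda i: jobs[i]['map_left'] + jobs[i]['reduce_avail'])
    # Priority group 1: Map tasks that the virtual SRPT run scheduled this slot.
    for i in order:
        give = min(srpt_map_load[i], jobs[i]['map_left'], d)
        map_assign[i] += give
        d -= give
    # Priority group 2: available Reduce tasks.
    for i in order:
        give = min(jobs[i]['reduce_avail'], d)
        reduce_assign[i] += give
        d -= give
    # Priority group 3: any remaining available Map tasks.
    for i in order:
        give = min(jobs[i]['map_left'] - map_assign[i], d)
        map_assign[i] += give
        d -= give
    return map_assign, reduce_assign

# Illustrative slot: SRPT would give job 0 two Map machines; job 1 only has
# Reduce work available; five machines in total.
maps, reds = asrpt_assign_slot(
    [{'map_left': 3, 'reduce_avail': 0}, {'map_left': 0, 'reduce_avail': 4}],
    srpt_map_load=[2, 0], N=5)
```

In this slot, job 0 receives the two Map machines dictated by the virtual SRPT schedule before job 1's Reduce work consumes the remaining capacity, which is exactly the ordering that lets ASRPT's Map tasks keep pace with SRPT.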
B. Efficiency Ratio Analysis of ASRPT Algorithm
We first prove a lower bound on its performance; the lower bound is based on SRPT. For the SRPT algorithm, we assume that in time slot $t$ the numbers of machines scheduled to Map and Reduce tasks are $M^S_t$ and $R^S_t$, respectively.

Now we construct a scheduling method called Delayed-Shortest-Remaining-Processing-Time (DSRPT) for the MapReduce framework based on the SRPT algorithm. The scheduling of the Map tasks is kept exactly the same as in SRPT. The Reduce tasks are scheduled in the same order as in SRPT, but starting from the next time slot. An example showing the operation of SRPT, DSRPT, and ASRPT is given in Fig. 4.
Theorem 5. In the preemptive scenario, DSRPT has an efficiency ratio of 2.
Proof: We construct a queueing system to represent the DSRPT algorithm. In each time slot $t \ge 2$, there is an arrival with workload $R^S_{t-1}$, the total workload of the delayed Reduce tasks of time slot $t-1$. The service capacity for the Reduce tasks is $N - M^S_t$, and the service policy for the Reduce tasks is First-Come-First-Served (FCFS). The Reduce tasks that were scheduled in previous time slots by the SRPT algorithm are delayed in the DSRPT algorithm until the remaining workload of the delayed tasks is finished. Also, the remaining workload $W_t$ in the DSRPT algorithm is processed first, because the scheduling policy is FCFS. Let $D_t$ be the largest delay of the Reduce tasks that are scheduled in time slot $t-1$ in SRPT.

In the construction of the DSRPT algorithm, the remaining workload $W_t$ comes from the delayed tasks scheduled in previous time slots by the SRPT algorithm, while $M^S_s$, $s \ge t$, is the workload of Map tasks scheduled in future time slots by the SRPT algorithm. Hence, they are independent. Let $D'_t$ be the first time such that $W_t \le N - M_s$, where $s \ge t$. Then $D_t \le D'_t$.

Algorithm 1 Available-Shortest-Remaining-Processing-Time (ASRPT) Algorithm for the MapReduce Framework
Input: List of unfinished jobs (including new arrivals in this slot) J
Output: Scheduled machines for Map tasks, scheduled machines for Reduce tasks, updated remaining jobs J
1: Update the scheduling list S of the SRPT algorithm without task dependency. For job i, S(i).MapLoad is the corresponding scheduling of the Map tasks in SRPT.
2: Update the list of jobs with the new arrivals and the machine assignments of the previous slot to maintain a non-decreasing order of the available total workload J(i).AvailableWorkload. Also keep the number of schedulable tasks J(i).SchedulableTasksNum for each job in the list.
3: d ← N
4: if the scheduler is a non-preemptive scheduler then
5:   Keep the assignment of machines already scheduled to Reduce tasks previously and not finished yet
6:   Update d, the number of idle machines
7: end if
8: for i = 1 → |J| do
9:   if job i has Map tasks which are scheduled in S then
10:    if J(i).SchedulableTasksNum ≥ S(i).MapLoad then
11:      if S(i).MapLoad ≤ d then
12:        Assign S(i).MapLoad machines to the Map tasks of job i
13:      else
14:        Assign d machines to the Map tasks of job i
15:      end if
16:    else
17:      if J(i).SchedulableTasksNum ≤ d then
18:        Assign J(i).SchedulableTasksNum machines to the Map tasks of job i
19:      else
20:        Assign d machines to the Map tasks of job i
21:      end if
22:    end if
23:    Update the status of job i in the list of jobs J
24:    Update d, the number of idle machines
25:  end if
26: end for
27: for i = 1 → |J| do
28:  if job i still has unfinished Reduce tasks then
29:    if J(i).SchedulableTasksNum ≤ d then
30:      Assign J(i).SchedulableTasksNum machines to the Reduce tasks of job i
31:    else
32:      Assign d machines to the Reduce tasks of job i
33:    end if
34:    Update d, the number of idle machines
35:  end if
36:  if d = 0 then
37:    RETURN
38:  end if
39: end for
40: for i = 1 → |J| do
41:  if job i still has unfinished Map tasks then
42:    if J(i).SchedulableTasksNum ≤ d then
43:      Assign J(i).SchedulableTasksNum machines to the Map tasks of job i
44:    else
45:      Assign d machines to the Map tasks of job i
46:    end if
47:    Update d, the number of idle machines
48:  end if
49:  if d = 0 then
50:    RETURN
51:  end if
52: end for
If a time slot $t_0 - 1$ leaves no workload for the future (except $R_{t_0-1}$, which is delayed to the next time slot), we call $t_0$ the first time slot of the interval it belongs to. (Note that this definition of an interval differs from Section V.) Assume that $P(W_{t_0} \le N - M_s) = P(R_{t_0-1} \le N - M_s) = p$, where $s \ge t_0$. Since $R_s \le N - M_s$, we have $p \ge P(R_{t_0-1} \le R_s) \ge 1/2$. Thus, we get
\[
E[D_{t_0}] \le E[D'_{t_0}] = \sum_{k=1}^{\infty} k\,p(1-p)^{k-1} = \frac{1}{p} \le 2. \qquad (53)
\]
Note that $N - M_s \ge R_s$ for all $s$, so the current remaining workload satisfies $W_t \le W_{t_0}$. Then $E[D_t] \le E[D_{t_0}] \le 2$.
Hence, for all the Reduce tasks, the expected delay relative to the SRPT schedule is not greater than 2. Let $D$ have the same distribution as $D_t$. Since every job has a flow time of at least one slot, $\lim_{n\to\infty} F_S/n \ge 1$, and the total flow-time $F_D$ of the DSRPT algorithm satisfies
\[
\lim_{T\to\infty}\frac{F_D}{F_S}
\le \frac{\lim_{n\to\infty}\frac{F_S}{n} + E[D]}{\lim_{n\to\infty}\frac{F_S}{n}}
\overset{w.p.1}{=} 1 + \frac{E[D]}{\lim_{n\to\infty}\frac{F_S}{n}}
\le 1 + E[D] \le 3. \qquad (54)
\]
For the flow time $F$ of any feasible scheduler in the MapReduce framework, every job needs at least one Map slot and one Reduce slot, so $\lim_{n\to\infty} F/n \ge 2$ and
\[
\lim_{T\to\infty}\frac{F_D}{F}
\le \frac{\lim_{n\to\infty}\frac{F_S}{n} + E[D]}{\lim_{n\to\infty}\frac{F}{n}}
\overset{w.p.1}{\le} 1 + \frac{E[D]}{\lim_{n\to\infty}\frac{F}{n}}
\le 1 + \frac{E[D]}{2} \le 2. \qquad (55)
\]
Corollary 4. From the proof of Theorem 5, the total flow-time of DSRPT is not greater than 3 times the lower bound given by Lemma 6 with probability 1, when $n$ goes to infinity.
Corollary 5. In the preemptive scenario, the ASRPT scheduler has an efficiency ratio of 2.
Proof: In each time slot, all the Map tasks finished by DSRPT are also finished by ASRPT. For the Reduce tasks, jobs with smaller remaining workload are finished earlier than jobs with larger remaining workload. Hence, based on the optimality of SRPT, ASRPT can be viewed as an improvement over DSRPT. Thus, the total flow-time of ASRPT is not greater than that of DSRPT, and the efficiency ratio of DSRPT also holds for ASRPT.
Note that the total flow-time of ASRPT is not greater than 3 times the lower bound given by Lemma 6 with probability 1, when $n$ goes to infinity; we demonstrate this performance in the next section. Corollary 5 analyzes ASRPT in the preemptive scenario; via simulations in the next section, we show that ASRPT outperforms the other schedulers in both the preemptive and non-preemptive scenarios.
VII. SIMULATION RESULTS
A. Simulation Setting
We evaluate the efficacy of our algorithm ASRPT in both the preemptive and non-preemptive scenarios. We consider a data center with $N = 100$ machines and model job arrivals as a Poisson process with rate $\lambda = 2$ jobs per time slot. We choose the uniform distribution and the exponential distribution as examples of bounded and light-tailed workloads, respectively. For brevity, we use Exp($\mu$) to denote an exponential distribution with mean $\mu$, and U$[a, b]$ to denote a uniform distribution on $\{a, a+1, \ldots, b-1, b\}$. We set the total number of time slots to $T = 800$, and the number of tasks in each job is up to 10.
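For reference, the following sketch generates a synthetic trace matching the stated setting (Poisson arrivals with $\lambda = 2$, $T = 800$, up to 10 tasks per job). How the per-job workload is split across tasks is not specified above, so the generator only draws per-job totals; all structure beyond the stated constants is our own assumption.

```python
import math
import random

random.seed(0)
LAM, T, MAX_TASKS = 2.0, 800, 10

def poisson(mu):
    # Knuth's method, adequate for small means such as lambda = 2.
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def exp_workload(mu):
    # Integer-valued approximation of Exp(mu); every job needs at least 1 unit.
    return max(1, round(random.expovariate(1.0 / mu)))

def make_jobs(map_draw, reduce_draw):
    jobs = []
    for t in range(1, T + 1):
        for _ in range(poisson(LAM)):
            jobs.append({'arrival': t,
                         'num_tasks': random.randint(1, MAX_TASKS),
                         'map_work': map_draw(),
                         'reduce_work': reduce_draw()})
    return jobs

# The "large Reduce" exponential setting: Map ~ Exp(5), Reduce ~ Exp(40).
jobs = make_jobs(lambda: exp_workload(5), lambda: exp_workload(40))
```

Swapping the two `lambda` draws for `random.randint(a, b)` calls yields the uniform settings U$[1, 9]$/U$[10, 70]$ and U$[10, 50]$/U$[10, 20]$ used below.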
We compare three typical schedulers to the ASRPT scheduler:

The FIFO scheduler: the default scheduler in Hadoop. All jobs are scheduled in their order of arrival.

The Fair scheduler: a widely used scheduler in Hadoop. Machines are assigned to all waiting jobs in a fair manner. However, if some jobs need fewer machines than others in a time slot, the remaining machines are assigned to the other jobs, to avoid resource wastage and to keep the scheduler work-conserving.

The LRPT scheduler: jobs with the largest unfinished workload are always scheduled first. Roughly speaking, the performance of this scheduler indicates how poorly even a work-conserving scheduler can perform.
B. Efficiency Ratio
In the simulations, the efficiency ratio of a scheduler is obtained by dividing its total flow-time by the lower bound on the total flow-time over the $T$ time slots. Thus, the true efficiency ratio is smaller than the efficiency ratio reported in the simulations; the relative ordering of the schedulers, however, remains the same.
First, we evaluate exponentially distributed workloads. We choose the Map workload distribution for each job as Exp(5) and the Reduce workload distribution as Exp(40). The efficiency ratios of the different schedulers are shown in Fig. 5. For a different workload mix, we choose the Map workload distribution as Exp(30) and the Reduce workload distribution as Exp(15); the efficiency ratios are shown in Fig. 6.
Then, we evaluate uniformly distributed workloads. We choose the Map workload distribution for each job as U$[1, 9]$ and the Reduce workload distribution as U$[10, 70]$; the efficiency ratios of the different schedulers are shown in Fig. 7. To evaluate a smaller Reduce workload, we choose the Map workload distribution as U$[10, 50]$ and the Reduce workload distribution as U$[10, 20]$; the efficiency ratios are shown in Fig. 8.
As an example, we show the convergence of the efficiency ratios in Figs. 9-12 for the 4 different scenarios. From Figs. 5-8, we can see that the total flow-time of ASRPT is much smaller than that of all the other schedulers. Also, as a deliberately poor work-conserving policy, the LRPT scheduler still has a constant (though relatively large) efficiency ratio, as seen in Figs. 9-12.
C. Cumulative Distribution Function (CDF)
For the same settings and parameters, the CDFs of the flow-times are shown in Figs. 13-16. We plot the CDF only for flow-times up to 100 units. From these figures, we can see that the ASRPT scheduler has a very light tail in the CDF of flow-time,
Fig. 5. Efficiency Ratio (Exponential Distribution, Large Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. Bars compare ASRPT, Fair, FIFO, and LRPT.
Fig. 6. Efficiency Ratio (Exponential Distribution, Small Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. Bars compare ASRPT, Fair, FIFO, and LRPT.
compared to the FIFO and Fair schedulers. In other words, the fairness of the ASRPT scheduler is similar to that of the FIFO and Fair schedulers. However, the LRPT scheduler has a long tail in the CDF of flow time; in other words, its fairness is not as good as that of the other schedulers.
Fig. 7. Efficiency Ratio (Uniform Distribution, Large Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. Bars compare ASRPT, Fair, FIFO, and LRPT.
Fig. 8. Efficiency Ratio (Uniform Distribution, Small Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. Bars compare ASRPT, Fair, FIFO, and LRPT.
VIII. CONCLUSION
In this paper, we study the problem of minimizing the total flow-time of a sequence of jobs in the MapReduce framework, where jobs arrive over time and need to be processed through Map and Reduce procedures before leaving the system. We show that no on-line algorithm can achieve a constant competitive ratio. We define a weaker metric of performance called the efficiency ratio and propose a corresponding technique to analyze on-line schedulers. Under some weak assumptions, we then show a surprising property: for the flow-time problem, any work-conserving scheduler has a constant efficiency ratio
Fig. 9. Convergence of Efficiency Ratios (Exponential Distribution, Large Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. Efficiency ratio vs. total time $T$, with curves for ASRPT, Fair, FIFO, and LRPT.
Fig. 10. Convergence of Efficiency Ratios (Exponential Distribution, Small Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. Efficiency ratio vs. total time $T$, with curves for ASRPT, Fair, FIFO, and LRPT.
Fig. 11. Convergence of Efficiency Ratios (Uniform Distribution, Large Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. Efficiency ratio vs. total time $T$, with curves for ASRPT, Fair, FIFO, and LRPT.
Fig. 12. Convergence of Efficiency Ratios (Uniform Distribution, Small Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. Efficiency ratio vs. total time $T$, with curves for ASRPT, Fair, FIFO, and LRPT.
Fig. 13. CDF of Schedulers (Exponential Distribution, Large Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. CDF vs. flow time, with curves for ASRPT, Fair, FIFO, and LRPT.
Fig. 14. CDF of Schedulers (Exponential Distribution, Small Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. CDF vs. flow time, with curves for ASRPT, Fair, FIFO, and LRPT.
Fig. 15. CDF of Schedulers (Uniform Distribution, Large Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. CDF vs. flow time, with curves for ASRPT, Fair, FIFO, and LRPT.
Fig. 16. CDF of Schedulers (Uniform Distribution, Small Reduce): (a) Preemptive Scenario; (b) Non-Preemptive Scenario. CDF vs. flow time, with curves for ASRPT, Fair, FIFO, and LRPT.
in both the preemptive and non-preemptive scenarios. More importantly, we find an online scheduler, ASRPT, with a very small efficiency ratio. The simulation results show that the efficiency ratio of ASRPT is much smaller than that of the other schedulers, while its fairness is as good as theirs.
REFERENCES
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proceedings of the Sixth Symposium on Operating System Design and Implementation, OSDI, pp. 137-150, December 2004.
[2] F. Chen, M. Kodialam, and T. Lakshman, "Joint scheduling of processing and shuffle phases in MapReduce systems," in Proceedings of IEEE Infocom, March 2012.
[3] H. Chang, M. Kodialam, R. R. Kompella, T. V. Lakshman, M. Lee, and S. Mukherjee, "Scheduling in MapReduce-like systems for fast completion time," in Proceedings of IEEE Infocom, pp. 3074-3082, March 2011.
[4] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job scheduling for multi-user MapReduce clusters," tech. rep., University of California, Berkeley, April 2009.
[5] J. Tan, X. Meng, and L. Zhang, "Performance analysis of coupling scheduler for MapReduce/Hadoop," in Proceedings of IEEE Infocom, March 2012.
[6] B. Moseley, A. Dasgupta, R. Kumar, and T. Sarlos, "On scheduling in map-reduce and flow-shops," in Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA, pp. 289-298, June 2011.
[7] "Hadoop, http://hadoop.apache.org/."
[8] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems, EuroSys, pp. 265-278, April 2010.
[9] A. Shwartz and A. Weiss, Large Deviations for Performance Analysis: Queues, Communication and Computing. New York, NY, USA: Chapman and Hall, 1995.
[10] K. R. Baker and D. Trietsch, Principles of Sequencing and Scheduling. Hoboken, NJ, USA: John Wiley & Sons, 2009.
ACKNOWLEDGEMENTS
We gratefully acknowledge Dr. Zizhan Zheng for his helpful suggestions.