Approximation Algorithms for the Multiorganization Scheduling Problem
Efficient Approximation Algorithms for Scheduling Moldable ...
-
Upload
khangminh22 -
Category
Documents
-
view
5 -
download
0
Transcript of Efficient Approximation Algorithms for Scheduling Moldable ...
Efficient Approximation Algorithms for SchedulingMoldable TasksXIAOHU WU, Nanyang Technological University, SingaporePATRICK LOISEAU, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France
We study the problem of scheduling π independent moldable tasks on π processors that arises in large-scaleparallel computations. When tasks are monotonic, the best known result is a ( 32 + π)-approximation algorithmfor makespan minimization with a complexity linear in π and polynomial in logπ and 1
π where π is arbitrarilysmall. We propose a new perspective of the existing speedup models: the speedup of a task ππ is linear whenthe number π of assigned processors is small (up to a threshold πΏ π ) while it presents monotonicity when π
ranges in [πΏ π , π π ]; the bound π π indicates an unacceptable overhead when parallelizing on too many processors.
For a given integer πΏ β₯ 5, let π’ =
β2βπΏ
ββ 1. In this paper, we propose a 1
\ (πΏ) (1 + π)-approximation algorithm
for makespan minimization with a complexity O(π log(π/π) logπ) where \ (πΏ) = π’+1π’+2
(1 β π
π
)(π β« π). As
a by-product, we also propose a \ (πΏ)-approximation algorithm for throughput maximization with a commondeadline with a complexity O(π2 logπ).
Additional Key Words and Phrases: Scheduling, approximation algorithms, moldable tasks
1 INTRODUCTIONMost computations nowadays are done in a parallelized way on large computers containing many
processors. Optimizing the use of processors leads to the problem of scheduling parallel tasks basedon their characteristics. In certain cases, the number of processors assigned to a task is predefinedby its owner and is said to be rigid. However, in many cases, the scheduler can decide this numberbefore the task execution: if this number cannot be changed during the task execution, the task issaid to be moldable; otherwise, it is said to be malleable1. Moldable tasks are easier to implementand manage than malleable tasks; the latter require additional system support for task migrations andpreemptions [7].
1.1 General Problem DescriptionWe consider the problem of scheduling π independent moldable tasks T = {π1,π2, Β· Β· Β· , ππ} on
π identical processors; all tasks are available at time zero. For every task ππ β T , its executiontime π‘ π,1 on one processor is given, as well as the speedup [ π,π when assigned π > 1 processors;the execution time of ππ on π processors is π‘ π,π =
π‘ π,1[ π,π
and its workload is π· π,π = π‘ π,ππ. ππ can berepresented by a rectangle in the processors Γ time space. Like [14, 22, 23], given a real number π ,we define a parameter πΎ ( π, π) as the minimum number of processors needed to finish task ππ by timeπ; by convention, we set πΎ ( π, π) = +β to imply the case that ππ cannot be finished by time π on anynumber of processors.
We often hope to finish all tasks as soon as possible. Sometimes, a task ππ also has a value π£ π thatcan be obtained if it is finished by a deadline π ; then we hope to finish by time π the most valuabletasks. We will propose algorithms that generate schedules for different objectives: (i) minimize themakespan, i.e., the maximum completion time of all tasks of T or (ii) choose a subset of tasks andfinish them on the π processors by a deadline π to maximize the throughput, i.e., the aggregate
1In the earlier literature, moldable tasks was also called malleable tasks. Now, malleable tasks refer to another type of paralleltasks [7].
Authorsβ addresses: Xiaohu Wu, Nanyang Technological University, Singapore, [email protected]; Patrick Loiseau,Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France, [email protected].
arX
iv:1
609.
0858
8v11
[cs
.DS]
20
Nov
202
1
2 Wu and Loiseau
value of tasks finished by time π . For each task to be executed, a schedule will define the numberof processors assigned to it and the time interval in which it is finished. For our minimization (resp.maximization) problem, an algorithm is a π-approximation if it produces a schedule whose makespanis at most π times the optimal makespan where π β₯ 1 (resp. whose throughput is at least π times theoptimal throughput where π β€ 1). It is always desired to have performance bound π closer to one,while keeping algorithms simple to run efficiently.
1.2 Typical Speedup Models, and MotivationFor moldable tasks, a key aspect that conditions scheduling is the relation between the task
execution time and the number of assigned processors. Now, we introduce several typical speedupmodels in literature and the most related works under each model, as well as the main motivation fordeveloping the ideas of this paper. Like ours, these works consider offline scheduling of independentmoldable tasks for makespan minimization, except [6, 10, 12, 26, 28].Linear-Speedup Model. An ideal speedup model is linear, i.e., [ π,π = π, when π does not exceeda threshold πΏ π [7]; the workload of π π is independent of π since π· π,π =
π‘ π,1[ π,π
π = π‘ π,1. This modelapplies to embarrassingly parallel tasks [4, 29]: each task consists of many independent subtasksthat can simultaneously be executed on multiple processors without any communication overhead.Benoit et al. propose a list-based scheduling algorithm, called LPA-LIST, for failure-prone platformswith additional constraints and show that it is a 2-approximation [3]. When there are precedenceconstraints among moldable tasks, Wang et al. propose a
(3 β 2
π
)-approximation algorithm [26].
Besides, the case of scheduling independent malleable tasks has well been studied [6, 12, 28].Communication Models. When communication is needed among the subtasks of a task ππ , it istypical of parallel applications that the speedup is still linear when π is within a relatively small πΏ π ;afterwards, the speedup is sublinear (i.e., [ π,π < π) and assigning more than πΏ π processors to executeππ becomes less efficient due to the communication overheads [6]. Such tasks can be simplifiedand still treated as tasks with linear speedup by limiting that the maximum number of processorsassigned toππ is πΏ π , although it may be worth exploring the opportunity of assigning more processorsto get better resource efficiency. Another typical speedup function is as follows: π‘ π,π =
π‘ π,1π+ (π β 1)π π ;
the term (π β 1)π π implies that, as more processors are assigned, the overhead and workload π· π,π
increase; if π is too large, π‘ π,π will not decrease as π increases. Then, Havill et al. propose an onlinealgorithm with an approximation ratio 4(πβ1)
πfor even π β₯ 2 and 4π
π+1 for odd π β₯ 3 [10]; Benoit etal. also show that LPA-LIST is a 3-approximation for failure-prone platforms [3]. As elaborated later,these speedup modes are also validated empirically and consistent.Monotonic Model. To date, the best algorithms for our problem are designed by simply using ageneral monotonic assumption: the execution time of ππ is non-increasing and its workload is non-decreasing in π, where [ π,π β€ π. Belkhale et al. give a 2
1+1/π -approximation algorithm [2]. MouniΓ©
et al. propose a (β3+ π)-approximation algorithm and then a ( 32 + π)-approximation algorithm with
a complexity O(ππ log ππ) where π is arbitrarily small [22, 23]. Jansen et al. achieve an improved
complexity polynomial in logπ and 1π
and linear in π, although their algorithm is still a ( 32 + π)-approximation [14]. In the special case where π β₯ 8π
π, they also give a fully polynomial time
approximation scheme (FPTAS) with a complexity O(π log2π(logπ + log 1π)). The FPTAS requires
a specific relation between π andπ.In the case where π is independent ofπ, we hope to develop a π-approximation algorithm with
π < 32 and will revisit the related speedup models. Under the monotonic assumption, we have the
following bounds of the execution time π‘ π,πΎ ( π,π) when a task ππ is assigned πΎ ( π, π) processors, which
Efficient Approximation Algorithms for Scheduling Moldable Tasks 3
will also hold in this paper:
π β₯ π‘ π,πΎ ( π,π) >πΎ ( π, π) β 1πΎ ( π, π) π. (1)
By the definition of πΎ ( π, π), π· π,πΎ ( π,π) is the minimum workload needed to complete ππ by timeπ. Suppose that an algorithm produces a schedule of a makespan π. We observe that it is a 1
\-
approximation to makespan minimization if every task ππ β T has the minimum workload and theaggregate workload processed on theπ processors in [0, π] is β₯ \ππ where \ is a lower bound ofthe processor utilization. Our objective is to make \ large (e.g., \ > 2
3 ). For each task ππ with largeπΎ ( π, π) (e.g., πΎ ( π, π) β₯ 4), we have by Inequality (1) that executing it on πΎ ( π, π) processors alone canmake these processors achieve a high utilization in [0, π]. One main challenge comes from tasks withsmaller πΎ ( π, π). Then, a more precise speedup description than monotonicity could help, which isfortunately available in literature; it allows quantitatively characterizing the execution time reductionwhile keeping the workload constant, when the number of processors assigned to ππ changes fromπΎ ( π, π) to a larger value. We can thus obtain some desired properties and design a schedule underwhich the π processors achieve a high overall utilization in [0, π] under some additional constraints(see Section 3).
In fact, the speedup function in the communication model above is validated on widely used NASparallel benchmarks and HPLinpack, which contain multiple computations with typical communi-cation patterns for evaluating the performance of parallel systems [18]. The benchmarking resultsof [8] indicate that π π is far smaller than π‘ π,1 such that when π is small, the effect of (π β 1)π π onπ‘ π,π is negligible compared with the term π‘ π,1
π, and the speedup coincides quite accurately with the
linear-speedup model. Thus, we associate such a task ππ with two thresholds πΏ π and π π to distinguishthe speedup modes of ππ when π is in different ranges where πΏ π β€ π π ; then, we make the followingdefinition on which we will base the algorithmic design of this paper.
DEFINITION 1. A task ππ β T is said to be (πΏ π , π π )-monotonic if it is moldable and satisfies(1) when π β [1, πΏ π ], its workload remains constant and the speedup is linear, i.e., π· π,1 = π· π,π =
π‘ π,ππ;(2) If πΏ π < π π , its workload is increasing while its execution time is decreasing in π β [πΏ π , π π ], i.e.,
π· π,π < π· π,π+1 and π‘ π,π > π‘ π,π+1 for π β [πΏ π , π π β 1].(3) The parameter π π is a parallelism bound, i.e., the maximum number of processors allowed to
be assigned to ππ .
Particularly, if π π = πΏ π , then a (πΏ π , π π )-monotonic task is simply a task whose speedup is linearwithin a parallelism bound πΏ π . In Definition 1, the second point implies that assigning more than πΏ πfor executing ππ is less efficient but its execution time is decreasing in π β [πΏ π , π π ]. The third point isused to reflect that when π > π π , the workload begins to increase to an unacceptable extent such thatthe execution time does not decrease any more (i.e., [ π,π π
β₯ [ π,π ), which implies that assigning morethan π π processors to ππ cannot bring any benefit. The parameters πΏ π and π π are fixed; as reported in[8], πΏ π and π π can range in [25, 150] and [250, 512], depending on the types of computation embodiedin the tasks of T . The numberπ of processors is large since our problem arises in large-scale parallelsystems, e.g., supercomputers can haveπ = 216 processors inside [1, 27].
1.3 Algorithmic ResultsConsider a set T in which each task ππ is (πΏ π , π π )-monotonic. We denote by πΏ the minimum
linear-speedup threshold of all tasks of T and by π the maximum parallelism bound of all tasks of T ,i.e.,
πΏ = minππ βT{πΏ π } and π = maxππ βT{π π }. (2)
4 Wu and Loiseau
Let π’ =
β2βπΏ
ββ 1, that is, π’ is the unique integer such that πΏ β [π’2 + 1, (π’ + 1)2]. In this paper,
for any πΏ β₯ 5, the main algorithmic result is a 1\ (πΏ) (1 + π)-approximation algorithm for makespan
minimization with a complexity O(π log(π/π) logπ) where
\ (πΏ) = π’ + 1π’ + 2
(1 β π
π
).
The algorithm achieves an approximation ratio \ (πΏ) close to π’+2π’+1 since π β« π. Typically, the
minimum linear-speedup threshold πΏ has an effective range of [25, 150] [8]. In the worst case thatπΏ = 25, \ (πΏ) is close to 6
5 ; when πΏ = 150, π’+2π’+1 = 14
13 β 1.077, which is close to 1. The larger thelinear-speedup threshold πΏ , the better the proposed algorithm. For throughput maximization with agiven deadline π , we assume that every task ππ β T can be finished by time π , i.e., πΎ ( π, π) β [1,π].As a by-product, another algorithmic result of this paper is a \ (πΏ)-approximation algorithm with acomplexity O(π2 logπ) to maximize the throughput with a deadline π . To the best of our knowledge,we are the first to address this scheduling objective for moldable tasks, while this objective has beenaddressed for other types of parallel tasks in the literature of scheduling theory [9, 17].
The rest of this paper is organized as follows. In Section 2, we give more related work. In Section 3,we give an overview of the ideas developed in this paper. The following two sections are used toelaborate these ideas. In particular, in Section 4, we propose an algorithm that produces a schedulewith several features described in Section 3. In Section 5, we show the application of this schedule tothe objectives of makespan minimization and throughput maximization with a deadline respectively.Finally, we conclude this paper in Section 6.
2 RELATED WORK2.1 Makespan Minimization
The problem of scheduling moldable tasks to minimize the makespan is strongly NP-hard whenπ β₯ 5 [7]. There is a long history of study with continuous improvements to the approximation ratio ortime complexity. Turek et al. consider moldable tasks without monotonicity and propose a two-phasesapproach [25]: (i) determine the number of processors assigned to each task and (ii) solve the resultingstrip packing problem; the latter has well been studied, e.g., we can directly use the 2-approximationalgorithm of Steinberg [24]. Further, the authors show that any _-approximation algorithm of acomplexity O(π (π,π)) for strip packing can be transformed into a _-approximation algorithm ofa complexity O(πππ (π,π)) for our problem. In the special case of monotonic tasks, Ludwig et al.improve the transformation complexity to O(π log2π + π (π,π)) [21]. Jansen et al. formulate theoriginal problem as a linear program [15]. They propose a polynomial time approximation scheme(PTAS) when the number π of processors is constant; here, the complexity is exponential in π.Further, Jansen et al. propose a PTAS when π is polynomially bounded in the number π of tasks[16]. In the case of an arbitrary number of processors, Jansen et al. also propose a polynomial time( 32 + π)-approximation algorithm for any fixed π [13]. In the special case of π identical tasks, Deckeret al. give a 5
4 -approximation algorithm [5].As introduced in Section 1, of the great relevance to our work are [14, 22, 23] that use similar
techniques for monotonic tasks. For example, in [23], MouniΓ© et al. apply the dual approximationtechnique [11]: it takes a real number π as an input, and either outputs a schedule of a makespanβ€ 3
2π or answers correctly that π is a lower bound of the optimal makespan. To realize this, tasksare mainly classified into two subsets T1 and T2 whose tasks are respectively assigned πΎ ( π, π) andπΎ ( π, π2 ) processors; the classification aims at minimizing the total workloadπ of T1 and T2 whileguaranteeing that the total number of processors assigned to T1 is β€ π, which is formulated as aknapsack problem. If the optimalπ exceeds the processing capacity of theπ processors, there exists
Efficient Approximation Algorithms for Scheduling Moldable Tasks 5
no schedule with a makespan < π. Otherwise, the total number of processors assigned to T2 mayexceedπ and a series of reductions to the numbers of processors assigned to the tasks of T1 and T2 istaken to get a feasible schedule: the tasks are assigned to different parts of processors respectively inthe time intervals [0, 32π], [0, π] and [π, 32π].
2.2 Throughput MaximizationSeveral works have considered scheduling other types of parallel tasks to maximize the throughput.Jansen et al. and Fishkin et al. consider scheduling rigid tasks with a common deadline [9, 17],
e.g., they apply the theory of knapsack problem and linear programming to propose an ( 12 + π)-approximation algorithm. For malleable tasks with individual deadlines whose speedup is linear,a parameter π is used to characterize the tasksβ delay-tolerance: each ππ has to be finished in atime window [π π , π π ]; it has the minimum execution time πππ π when assigned πΏ π processors; π isthe minimum ratio of π π β π π to πππ π among all tasks. For offline scheduling, Jain et al. proposea greedy πβπΏ
ππ β1π
-approximation algorithm where πΏ is the maximum πΏ π of all tasks [12]. Wu et al.prove that the best approximation ratio that this type of greedy algorithms can achieve is π β1
π and
propose such an algorithm; they also show a sufficient and necessary condition under which a setof malleable tasks with deadlines can feasibly be scheduled on a fixed number of processors andpropose an exact algorithm by dynamic programming [28]. For online scheduling, Lucier et al.propose a 2+O
(1/( 3βπ β 1)2
)-approximation algorithm [20]. In cluster scheduling, many applications
are delay-tolerant where π β« 1 andπ β« πΏ . Thus, their algorithms achieve good approximation ratiosin practical settings.
3 OVERVIEW OF THE APPROACHESCentral to our algorithm design is an algorithm ππβππ that aims to schedule a set T of tasks on the
π processors in a time interval [0, π] and achieves a processor utilization β₯ \ (πΏ) on the conditionsthat (i) each scheduled task ππ has a workload π· π,πΎ ( π,π) , which is the minimum workload to finishππ by time π, and (ii) there exists some task rejected to be scheduled due to the insufficiency ofidle processors (see Section 4). We establish the connection of ππβππ with our two problems in thefollowing ways.
For makespan minimization, we need to schedule all tasks, while ππβππ can play a role only whena part of tasks are scheduled. We apply a binary search procedure to find two parameters π and πΏ
such that ππβππ can schedule all tasks by time π but only a part of tasks by time πΏ, with the relationπ β€ πΏ(1 + π) (see Section 5.1). Let πβ denote the optimal makespan. We can establish the relationbetween π and πβ via πΏ and prove π /πβ β€ 1
\ (πΏ) (1 + π), thus showing that the resulting algorithm isa 1
\ (πΏ) (1 + π)-approximation. Specifically, in the case that πβ β [πΏ,π ], we have π /πβ β€ (1 + π)/\trivially. In the case that πβ < πΏ, we have that the total workload of all tasks of T in an optimalschedule is β€ ππβ but β₯ the total workload processed when ππβππ manages to schedule a part oftasks of T by time πΏ. Thus, we haveππβ β₯ π\πΏ β₯ π\π /(1 + π) and π β€ πβ (1 + π)/\ .
For throughput maximization, π£ π/π· π,πΎ ( π,π) is the maximum possible value obtained from processinga unit of workload of ππ , called its value density. Let us accept the maximum number of tasks in thenon-increasing order of their value densities until ππβππ cannot produce a feasible schedule by timeπ ; then, the feature of ππβππ leads to that the utilization \ (πΏ) will be the approximation ratio of theresulting algorithm (see Section 5.2).
Finally, the design of ππβππ relies on the properties of the speedup model in Definition 1 to classifythe tasks of T . The minimum linear-speedup threshold πΏ in Equation (2) is a fixed parameter and wehave the following property by Definition 1.
6 Wu and Loiseau
PROPERTY 3.1. If a task ππ is (πΏ π , π π )-monotonic, we have that (i) the workload π· π,π is non-decreasing and the execution time π‘ π,π is non-increasing in the number π of assigned processors whenπ β [1, π π ] and (ii) the speedup is linear when π β [1, πΏ], i.e., π‘ π,π =
π‘ π,1π
.
The classification process mainly uses three integer variables a , π» and πΏ β². a and π» are fordistinguishing tasks with different πΎ ( π, π) where a < π» : a task ππ is said to have a large, medium orsmall πΎ ( π, π) if πΎ ( π, π) is β₯ π» , in [a, π» β 1] or β€ a β 1. In ππβππ, each task ππ will be executed oneither πΎ ( π, π) or πΏ β² processors. Let π = π»β1
π». We will use ππ and (1 β π )π to distinguish tasks with
different execution times. The first class of tasks, denoted by A β², includes every task that has a largeexecution time β₯ ππ when assigned a group of πΎ ( π, π) processors (see Equation (5)), e.g., every taskwith large πΎ ( π, π) has such a feature by Inequality (1).
For the remaining tasks with medium or small πΎ ( π, π), we will maintain several relations among a ,π» , πΏ β² and πΏ . For example, by letting π» β 1 β€ πΏ β² β€ πΏ , the speedup is linear and the workload keepsconstant when the number π of assigned processors ranges in [πΎ ( π, π), πΏ β²]. These relations finallyenable the following properties:
β’ For the tasks with small πΎ ( π, π) whose execution times are < ππ when assigned πΎ ( π, π) proces-sors, they are denoted by Baβ1 and their execution times will decrease remarkably (by a factorat least πΏβ²
aβ1 ) to a small value < (1 β π )π when assigned πΏ β² processors (see Equation (6) andLemma 4.2). Executing as many such tasks as possible on a group of πΏ β² processors in [0, π]will lead to an aggregate execution time β₯ ππ.β’ Let β be an integer in [a, π» β 1]. For the tasks with πΎ ( π, π) = β whose execution times areβ₯ (1βπ )π when assigned πΏ β² processors and < ππ when assigned β processors, they are denotedby Aβ and there exists a positive integer π₯β such that the aggregate execution time is in [ππ, π]when π₯β such tasks are sequentially executed on a group of πΏ β² processors (see Equation (10)and Proposition 4.2).
Finally, each group of πΎ ( π, π) or πΏ β² assigned processors described above can achieve a utilization β₯ π
in [0, π]. The overall utilization \ of the π processors can be derived when some task is rejecteddue to the insufficiency of processors, with at most π β 1 processors idle. The task classification andmaintained relations are formally described in Section 4.1, with other related issues solved. Thescheduling algorithm ππβππ is given in Section 4.2.
4 THE ALGORITHM ππβππ
In this section, we consider the case that every task ππ β T can be finished by time π (i.e.,πΎ ( π, π) β [1,π]).
LEMMA 4.1. For every (πΏ π , π π )-monotonic task ππ β T , Inequality (1) holds.
PROOF. We have π‘ π,πΎ ( π,π) β€ π and π‘ π,πΎ ( π,π)β1 > π by the definition of πΎ ( π, π). By Property 3.1,π· π,πΎ ( π,π) β₯ π· π,πΎ ( π,π)β1. Further, we have πΎ ( π, π)π‘ π,πΎ ( π,π) β₯ (πΎ ( π, π) β 1)π‘ π,πΎ ( π,π)β1 > (πΎ ( π, π) β 1)π.Hence, Inequality (1) holds. β‘
4.1 Task ClassificationAll tasks of T are divided into different classesA β²,A β²β²,Aπ»β1, Β· Β· Β· ,Aπ£ . Following the high-level
ideas in Section 3, we now begin to elaborate the task classification. For ease of reference, we firstsummarize the maintained relations between the integer variables π» , a , πΏ β², π₯a , Β· Β· Β· , π₯π»β1 and the
Efficient Approximation Algorithms for Scheduling Moldable Tasks 7
Fig. 1. Task Classification.
fixed parameter πΏ where π = π»β1π»
:
1 β€ a β€ π» β 1 β€ πΏ β² β€ πΏ (3a)πa
πΏ β²β₯ 1 β π (3b)
π (a β 1)πΏ β²
< 1 β π (3c)
and for all β β [a, π» β 1]
πβ
πΏ β²π₯β β€ 1, (4a)
max{1 β π, β β 1
πΏ β²
}π₯β β₯ π . (4b)
As we classify tasks and prove their properties, we can gradually perceive the underlying reasonswhy these relations are established to get the desired properties. At the end of this subsection, wewill give a feasible solution of π» , a , πΏ β², π₯a , Β· Β· Β· , π₯π»β1 that satisfy the relations (3a)-(4b).
Figure 1 summarizes how to classify a task ππ β T according to its value of πΎ ( π, π) and itsexecution time on πΎ ( π, π) or πΏ β² processors. Specifically, the first class of tasks contains all tasks whoseexecution times π‘ π,πΎ ( π,π) are β₯ ππ when assigned πΎ ( π, π) processors and is defined as
Aβ² = {ππ β T |πΎ ( π, π) β₯ π» }
βͺ{ππ β T |πΎ ( π, π) β [1, π» β 1], π‘ π,πΎ ( π,π) β₯ ππ
} (5)
A β² also includes a part of tasks with smaller πΎ ( π, π) but they have π‘ π,πΎ ( π,π) β₯ ππ. Except A β², theremaining tasks have medium or small πΎ ( π, π) and each has an execution time π‘ π,πΎ ( π,π) < ππ . Among
8 Wu and Loiseau
these tasks, let Baβ1 denote all tasks with πΎ ( π, π) β€ a β 1, i.e.,
Baβ1 = {ππ β T | πΎ ( π, π) β€ a β 1, π‘ π,πΎ ( π,π) < ππ}; (6)
let Bπ»β1 denote all tasks that satisfy πΎ ( π, π) β [a, π» β 1] and π‘ π,πΏβ² < (1 β π )π , i.e.,
Bπ»β1 = {ππ β T |πΎ ( π, π) β [a, π» β 1], π‘ π,πΏβ² < (1 β π )π, π‘ π,πΎ ( π,π) < ππ}. (7)
The second class of tasks is defined as
Aβ²β² = Baβ1 βͺ Bπ»β1 . (8)
For each task ππ with πΎ ( π, π) β€ π» β 1, the relation (3a) ensures by Property 3.1 that the speedup islinear when the number of processors assigned to ππ changes from πΎ ( π, π) to πΏ β²; when assigned πΏ β²
processors, its execution time π‘ π,πΏβ² satisfies
π‘ π,πΏβ² = π‘ π,πΎ ( π,π)πΎ ( π, π)πΏ β²
. (9)
LEMMA 4.2. For each task ππ β Baβ1, we have π‘ π,πΏβ² < (1 β π )π .
PROOF. The execution time of ππ satisfies
π‘ π,πΏβ²(π)= π‘ π,πΎ ( π,π)
πΎ ( π, π)πΏ β²
(π)<
a β 1πΏ β²
ππ(π)< (1 β π )π
where the above (a), (b) and (c) are due to Equation (9), Equation (6) and the relation (3c) respectively.β‘
PROPOSITION 4.1. For every task ππ β A β²β², we have π‘ π,πΏβ² < (1 β π )π .
PROOF. It follows from Lemma 4.2 and the definition of Bπ»β1 in Equation (7). β‘
Finally, the remaining are all tasks with πΎ ( π, π) β [a, π» β 1] and each has an execution timeπ‘ π,πΎ ( π,π) < ππ when assigned πΎ ( π, π) processors and π‘ π,πΏβ² β₯ (1 β π )π when assigned πΏ β² processors. Foreach β β [a, π» β 1], a single class of tasks Aβ is defined to contain all such tasks with πΎ ( π, π) = β,i.e.,
Aπ = {ππ β T |πΎ ( π, π) = β, π‘ π,β < ππ, π‘ π,πΏβ² β₯ (1 β π )π}. (10)
PROPOSITION 4.2. When assigned πΏ β² processors, we have that(i) for every task ππ β Aβ , its execution time π‘ π,πΏβ² is < πβπ where πβ = β
πΏβ² π ;(ii) the aggregate sequential execution time of any π₯β tasks of Aβ is in [ππ, π].PROOF. The relation (3b) implies that a is the maximum possible integer such that the relation
(3c) can hold. Let us consider every task ππ β Aβ and by the definition of Aβ in Equation (10), wehave
π‘ π,πΎ ( π,π) < ππ (11)
π‘ π,πΏβ² β₯ (1 β π )π. (12)
where πΎ ( π, π) = β. We have by Lemma 4.1 that the execution time of this task ππ satisfies
π‘ π,πΎ ( π,π) >β β 1β
π. (13)
Thus, by Equation (9), we have
π‘ π,πΏβ² = π‘ π,ββ
πΏ β²(π)<
β
πΏ β²ππ (14)
π‘ π,πΏβ² = π‘ π,ββ
πΏ β²(π)>(β β 1)π
β
β
πΏ β²=β β 1πΏ β²
π (15)
Efficient Approximation Algorithms for Scheduling Moldable Tasks 9
where the above (d) is due to Inequality (11), and (e) is due to Inequality (13). By Inequalities (12),(14) and (15), we have for any task ππ β Aβ that
π‘ π,πΏβ² β[max
{1 β π, β β 1
πΏ β²
}π,
β
πΏ β²ππ
].
While sequentially executing any π₯β tasks of Aβ on πΏ β² processors, the relations (4a) and (4b) ensurethat their aggregate execution time is in [ππ, π]. Together with Inequality (14), Proposition 4.2 thusholds. β‘
Proposition 4.1 and 4.2 enable us to design good schedules. Executing as many tasks from A β²β² aspossible on πΏ β² processors by time π can lead to that these processors have a utilization β₯ π in [0, π].This also holds for the tasks of Aβ where β β [a, π» β 1] since at least π₯β tasks can be finished bytime π .
PROPOSITION 4.3. For a given linear-speedup threshold πΏ β₯ 5, let π’ =
β2βπΏ
ββ 1 where π’ β₯ 2 and
πΏ β [π’2 + 1, (π’ + 1)2]. A feasible solution that satisfies Inequalities (3a)-(4b) is as follows:π» = π’ + 2πΏ β² = π’2 + 1a = π’
π₯β = 2π’ + 1 β β for all β β {a, π» β 1}
(16)
where π = π’+1π’+2 .
PROOF. We can easily verify that the setting in Equation (16) satisfies the relation (3a). We have
πa
πΏ β²=π’ + 1π’ + 2
π’
π’2 + 1(π)β₯ 1
π’ + 2 = 1 β π
where the above (a) is due to π’ (π’ + 1) β₯ π’2 + 1; thus, the relation (3b) is satisfied. We haveπ (a β 1)
πΏ β²=π’ + 1π’ + 2
π’ β 1π’2 + 1 <
1π’ + 2 = 1 β π ;
thus, the relation (3c) is satisfied.We have β β [a, π» β 1] = [π’,π’ + 1] by Equation (16). In the following, we first prove that the
relations (4a) and (4b) hold when β = π’. We have
ππ’
πΏ β²π₯π’ =
π’ + 1π’ + 2
π’
π’2 + 1 (π’ + 1)(π)β€ 1 (17)
where (b) is due to that (π’ + 1)2π’ β (π’ + 2) (π’2 + 1) = β2 < 0. Thus, the relation (4a) holds whenβ = π’. We have
max{1 β π, π’ β 1
πΏ β²
}= max
{1
π’ + 2 ,π’ β 1π’2 + 1
}(π)=
{1
π’+2 if 2 β€ π’ β€ 3π’β1π’2+1 if π’ β₯ 4
(18)
where (c) is due to that (π’2 + 1) β (π’ + 2) (π’ β 1) = 3 β π’. Further, if 2 β€ π’ β€ 3, we have1
π’ + 2π₯π’ =π’ + 1π’ + 2 β€ 1. (19)
If π’ β₯ 4, we can easily verify thatπ’ β 1π’2 + 1π₯π’ =
(π’ β 1) (π’ + 1)π’2 + 1 β€ 1. (20)
By Inequalities (18), (19) and (20), the relation (4b) holds when β = π’.
10 Wu and Loiseau
Next, we prove that the relations (4a) and (4b) hold when β = π’ + 1.
ππ’ + 1πΏ β²
π₯π’+1 =π’ + 1π’ + 2
π’ + 1π’2 + 1π’
(π)β€ 1 (21)
where (d) is again due to that (π’ + 1)2π’ β (π’ + 2) (π’2 + 1) = β2 < 0. Thus, the relation (4a) holdswhen β = π’ + 1. We have
max{1 β π, π’
πΏ β²
}= max
{1
π’ + 2 ,π’
π’2 + 1
}(π)=
π’
π’2 + 1
where (e) is due to that (π’ + 2)π’ β (π’2 + 1) = 2π’ β 1 > 0. Further, we can easily verify that
π’
π’2 + 1π₯π’+1 =π’2
π’2 + 1 β€ 1.
Thus, the relation (4b) holds when β = π’ + 1. β‘
In the rest of this paper, we will set the parameter values in the way described in Proposition 4.3.Since a = π’ and π» = π’ + 2, the tasks of T are finally classified as A β²,Aπ’+1,Aπ’,A β²β². Finally, weshow the time complexity while classifying the tasks of T .
PROPOSITION 4.4. The time complexity of task classification is O(π logπ).
PROOF. As illustrated in Figure 1, each task ππ β T is classified according to the values of πΎ ( π, π),π‘ π,πΎ ( π,π) and π‘ π,πΏβ² . Given the speedup [ π,π , πΎ ( π, π) can be found in time O(logπ) by binary search;π‘ π,πΎ ( π,π) and π‘ π,πΏβ² can be found in time O(1) [14]. Thus, the time complexity of classifying the π tasksof T is O(π logπ). β‘
4.2 Algorithm DescriptionNow, we give the scheduling algorithm ππβππ, presented as Algorithm 1, to assign the tasks ofT onto processors. Initially, all processors are idle. T is partitioned into A β², Aπ’+1, Aπ’ , A β²β², andthese sets are also sorted and assigned in this order where the tasks in the same set are chosen in anarbitrary order. Following this order, ππβππ assigns tasks in the following way until all tasks of T areassigned or there are not enough idle processors:
(i) For each unassigned taskππ β A β², assign it onto πΎ ( π, π) idle processors. Then, these processorsbecome occupied (lines 4-6).
(ii) For the remaining idle processors, put every πΏ β² processors into an individual group. For eachgroup, get as many unassigned tasks ofAπ’+1 βͺAπ’ βͺA β²β² as possible such that their sequentialexecution time on πΏ β² processors is β€ π; assign these tasks onto the group of processors (lines9-24).
Algorithm 1 ends with two conditions: (i) the idle processors are not enough (πβ² < π in line 7 orπβ² < πΏ β² in line 10), or (ii) all tasks of T have been assigned.
4.2.1 Example. We give a toy example where πΏ = 5 to illustrate ππβππ. By Proposition 4.3, wehave π’ = a = 2, π» = 4, πΏ β² = 5, π₯3 = 2, π₯2 = 3, and π = 3
4 ; then, T is divided into 4 subsets A β², A3,A2, and A β²β². Suppose that we are given π = 33, A β² = {π1}, A3 = {π2,π3,π4}, A2 = {π5, Β· Β· Β· ,π9},A β²β² = {π10, Β· Β· Β· ,π18} and πΎ (1, π) = 4 for π1. From the π processors, we get six groups. The firstgroup has πΎ (1, π) processors, each of the remaining groups has πΏ β² processors, and there are also 4ungrouped processors. As illustrated in Figure 2, ππβππ assigns tasks in the following way:
(1) The only task π1 of A β² is assigned onto the 1st group.(2) π₯3 = 2 tasks of A3 are assigned onto the 2nd group.
Efficient Approximation Algorithms for Scheduling Moldable Tasks 11
Algorithm 1: ππβππ
1 Set the parameters in the way of Proposition 4.3, and classify the tasks of T into A β², Aπ’+1,Aπ’ and A β²β², as illustrated in Figure 1;
2 Xβ²β A β², Xπ’+1 β Aπ’+1, Xπ’ β Aπ’ , Xπ’β1 β A β²β²; // π β², Xπ’+1, Xπ’ , Xπ’β1 are used to
record the currently unassigned tasks
3 πβ²βπ; // record the number of currently idle processors
/* First, assign the tasks of Xβ² */4 while Xβ² β β and π β€ πβ² do5 Get an arbitrary task ππ off Xβ²: Xβ²β Xβ² β {ππ };6 Assign ππ onto πΎ ( π, π) idle processors:πβ²βπβ² β πΎ ( π, π);7 if Xβ² β β andπβ² < π then8 exit; // exit when πβ² is smaller than the parallelism bound π and there are
still unassigned tasks in Aβ²
/* Second, assign the tasks of Xπ’+1, Xπ’ and Xπ’β1; initially, Xπ’+1 may be empty
or non-empty */
9 Let π be the maximum integer in {π’ + 1, π’,π’ β 1} such that Xπ β β ;10 while Xπ’+1 βͺ Xπ’ βͺ Xπ’β1 β β and πΏ β² β€ πβ² do11 Get πΏ β² idle processors:πβ²βπβ² β πΏ β²;12 TπΏβ² β β , π‘ β 0; // TπΏβ²: the tasks currently chosen to be executed on the πΏβ²
processors by time π; π‘: the aggregate execution time of TπΏβ²13 while Xπ β β do14 Get an arbitrary task ππ from Xπ ;15 if π‘ + π‘ π,πΏβ² β€ π then16 π‘ β π‘ + π‘ π,πΏβ² , TπΏβ² β TπΏβ² βͺ {ππ }, Xπ β Xπ β {ππ };17 else18 break; // go to line 24
19 if Xπ = β and π > π’ β 1 then20 if there exists an integer π β {π’ + 1, π’,π’ β 1} such that Xπ β β then21 Let π be the maximum such integer; // go to line 13
22 else23 π β π’ β 1; // Xπ’β1 = β
24 Sequentially execute the tasks of TπΏβ² on the πΏ β² idle processors;
(3) One remaining task of A3 and one task of A2 are assigned onto the 3rd group. The secondtask of A2 cannot be added and completed by time π and will be rejected to be executed onthis group.
(4) π₯2 = 3 tasks of A2 are assigned onto the 4th group.(5) One remaining task of A2 and three tasks of A β²β² are assigned onto the 5th group.(6) Five tasks of A β²β² are assigned onto the 6th group.
By the definition of A β² in Equation (5) and Propositions 4.1 and 4.2, the 1st-2nd and 4th-6thgroups have an execution time in [ππ, π]. The 3rd group of πΏ β² processors executes a mix of the tasksof A3 and A2; the aggregate execution time of tasks is < ππ but β₯ (1 β π2)π since the rejected task
12 Wu and Loiseau
Fig. 2. Task assignment when πΏ = 5 where each colored rectangle represents a task of some type.
of A2 has an execution time β€ π2π = 3π/10 by Proposition 4.2. Finally, there is one unassigned taskof A β²β² and the number of idle processors is at most πΏ β² β 1. The total number of processors whoseexecution time is < ππ is πΏ β² + (πΏ β² β 1) = 2πΏ β² β 1. The total workload processed by theπ processorsin [0, π] is at least
π€ β² = (π β 2πΏ β² + 1)ππ + πΏ β²(1 β π2)π,and the overall processor utilization in [0, π] is at least
π€ β²
ππ= π β π (2πΏ β² β 1)
π+ (1 β π2)πΏ
β²
π=34β 3.25
π.
4.2.2 Algorithm Analysis. Now, we prove the features of ππβππ . Recall that the parameters π» , a ,πΏ β², π₯a , Β· Β· Β· , π₯π»β1 are set in the way described in Proposition 4.3.
PROPOSITION 4.5. Suppose ππβππ cannot schedule all tasks of T on the π processors by time π .We have that ππβππ achieves a processor utilization of at least
\ (πΏ) = π β ππ
π.
where π = π’+1π’+2 β (0, 1).
PROOF. After executing ππβππ , theπ processors may be divided into three parts:(i) the first part executes the tasks ofA β² (lines 4-6), e.g, the 1st group of processors in the example
above,(ii) the second part executes the tasks of Aπ’+1, Aπ’ , A β²β² (lines 9-24), e.g., the 2nd-6th groups in
the example,(iii) the third part is idle and not assigned any task.
ππβππ ends with two cases: (i) πβ² < π (lines 7-8), or (ii) πβ² < πΏ β² (line 10). Different parts exist ineach case. Our analysis proceeds by showing (a) which parts of processors exist in each case and (b)the utilization of each part. The worst-case utilization will be a lower bound of the ratio of the totalworkload processed by different parts toππ .
Efficient Approximation Algorithms for Scheduling Moldable Tasks 13
First, we analyze the utilization of the three parts. We have that (i) the first part of processors has autilization β₯ π in [0, π] by the definition of A β², and (ii) the utilization of the third part is zero. Thesecond part can be divided into several groups, each with πΏ β² processors. Let Aπ’β1 = A β²β² for ease ofexposition. By lines 9-24, each group of the second part executes
(1) the tasks purely from a single set Aβ where β β [π’ β 1, π’ + 1], or(2) a mix of the tasks of multiple sets Aβ , Aββ1, Β· Β· Β· , Aββ² where π’ + 1 β₯ β > ββ² β₯ π’ β 1 and
ββ² β {π’,π’ β 1}.In the former, we have that each group has an execution time β₯ ππ by Propositions 4.1 and 4.2. Inthe latter, there exists a task ππ of Aββ² that cannot be completed by time π:
(2.a) if ββ² = π’, the group may have an execution time < ππ but β₯ (1 β ππ’)π since π‘ π,πΏβ² < ππ’π
by Proposition 4.2 (see the third group in the example); by Proposition 4.3, the processedworkload is at least
π€ = πΏ β²(1 β ππ’)π = (π’2 + 1)(1 β π’
π’2 + 1π’ + 1π’ + 2
)π =
π’3 + π’2 + 2π’ + 2 π. (22)
(2.b) if ββ² = π’ β 1, the group has an execution time β₯ ππ since π‘ π,πΏβ² < (1 β π )π (see the fifth group inthe example).
To sum up, in the second part, there are at most πΏ β² processors whose utilization is < π in [0, π] andon which the amount of processed workload is β₯ π€ .
Next, we analyze which parts of processors exist in each case when ππβππ ends. In the first case,ππβππ ends at lines 7-8) and the first and third parts may exist. The third part has at most π β 1 idleprocessors. Thus, the average utilization of theπ processors is at least
π1 =(π β π + 1)ππ
ππ= π β π π β 1
π.
In the second case, ππβππ ends at line 10. All the three parts of processors may exist and the third parthas at most πΏ β² β 1 processors. For the second part, there are at most πΏ β² processors whose utilization is< π . Thus, the average utilization of theπ processors is at least
π2 =π€ + (π β πΏ β² β (πΏ β² β 1))ππ
ππ
(π)= π β 1
π
( (2π’2 + 1
) π’ + 1π’ + 2 β
π’3 + π’2 + 2π’ + 2
)= π β 1
π
π’3 + π’2 + π’ β 1π’ + 2
(π)= π β 1
π
(πΏ β²π β 2
π’ + 2
)where the above (a) and (b) are due to Equation (22) and Proposition 4.3. Finally, when ππβππ ends,a lower bound of the processor utilization is min{π1, π2}, i.e.,
\ (πΏ) = π βmax{π (π β 1)
π,1π
(πΏ β²π β 2
π’ + 2
)}(π)β₯ π β ππ
π(23)
where (c) is because πΏ β² β€ πΏ β€ π by Inequality (3a). β‘
PROPOSITION 4.6. The time complexity of ππβππ is O(π logπ).
PROOF. The time complexity of task classification is O(π logπ) by Proposition 4.4 (line 1).Afterwards, the π tasks are sequentially assigned to processors one by one (lines 5, 14) and ππβππ
stops until all tasks are assigned or there are not enough processors to assign the remaining tasks,which has a time complexity of O(π). Hence, ππβππ has a time complexity of O(π logπ). β‘
Given a set of tasks T , let S denote the subset of tasks accepted by Algorithm 1.
14 Wu and Loiseau
LEMMA 4.3. In Algorithm 1, we have for every task ππ β S that its workload is π· π,πΎ ( π,π) , which isthe minimum workload needed to be processed to complete ππ by time π .
PROOF. πΎ ( π, π) is the minimum number of processors needed to complete ππ by time π. ByProperty 3.1, π· π,πΎ ( π,π) is the minimum workload needed to be processed to complete ππ by time π . InAlgorithm 1, the number of processors used to simultaneously execute a task is either πΎ ( π, π) for A β²or no more than πΏ for Aπ’+1, Aπ’ , and A β²β². For the latter, by Inequality (3a), we have for each task ππthat πΎ ( π, π) β€ π» β 1 β€ πΏ β² β€ πΏ; by Property 3.1, the workload of ππ keeps constant when the numberof assigned processors varies in [0, πΏ] and we have in Algorithm 1 that the workload of ππ equalsπ· π,πΎ ( π,π) . Thus, the lemma holds. β‘
With Proposition 4.5 and Lemma 4.3, we have completed the design of the scheduling algorithmππβππ described in Section 3.
5 APPLICATION TO TWO OBJECTIVESIn this section, we apply ππβππ to respectively minimize the makespan and maximize the through-
put with a common deadline π .
5.1 Makespan MinimizationBuilt on ππβππ, we propose a binary search procedure to find a feasible schedule of all tasks
of T , called the OMS algorithm (Optimized MakeSpan). At its beginning, let π and πΏ be suchthat ππβππ can produce a feasible schedule of T by time π but fails to do so by time πΏ, e.g.,π = π(πΏ + 2)maxππ βT{π‘ π,1} and πΏ = 0; we explain the reason why such π is feasible in A. TheOMS algorithm will repeatedly execute the following operations, and stop when π β€ (1 + π)πΏ whereπ β (0, 1) is arbitrarily small:
(1) π β π +πΏ2 ;
(2) if there exists a task ππ β T that cannot be completed by time π on any of theπ processors(i.e., πΎ ( π, π) = +β) or ππβππ fails to produce a feasible schedule of all tasks of T by time π ,set πΏ β π;
(3) otherwise, set π β π , and ππβππ produces a feasible schedule of T by time π .When the algorithm stops, we have that (i) ππβππ can produce a feasible schedule of T with amakespan β€ π but > πΏ, and, (ii) π β€ (1 + π)πΏ.
In the rest of this subsection, we analyze the approximation ratio and complexity of the algorithm.As shown below, for a task ππ , the larger the value of π , the smaller the value of πΎ ( π, π).
LEMMA 5.1. If π β² < π β²β², then we have πΎ ( π, π β²) β₯ πΎ ( π, π β²β²).
PROOF. We prove this by contradiction. Suppose πΎ ( π, π β²) < πΎ ( π, π β²β²); then we have by Property 3.1that π‘ π,πΎ ( π,πβ²) β₯ π‘ π,πΎ ( π,πβ²β²) . Since π‘ π,πΎ ( π,πβ²) β€ π β² < π β²β², the minimum number of processors needed tocomplete ππ by time π β²β² is no greater than πΎ ( π, π β²), which contradicts the assumption that πΎ ( π, π β²) <πΎ ( π, π β²β²). β‘
Let πβ denote the optimal makespan. In an optimal schedule, let π·βπ denote the workload of a taskππ and π·β denote the total workload of all tasks of T to be processed on theπ processors in [0, πβ]where we have
ππβ β₯ π·β . (24)
When the OMS algorithm ends, if πΎ ( π, πΏ) β [1,π] for every task ππ β T , only a part of tasks arescheduled by ππβππ by time πΏ and we have by Proposition 4.5 that \ (πΏ) is a lower bound of the
Efficient Approximation Algorithms for Scheduling Moldable Tasks 15
processor utilization in [0, πΏ]; we denote by π·πΏπ the workload of a scheduled task ππ and by π·πΏ the
total workload of all the scheduled tasks; here, we have
ππΏ β₯ π·πΏ β₯ \ (πΏ)ππΏ. (25)
LEMMA 5.2. When the OMS algorithm ends, if πβ < πΏ, then we have that (i) π·β β₯ π·πΏ and (ii)πΎ ( π, πΏ) β [1,π] for every task ππ β T .
PROOF. For every ππ β T , if πβ < πΏ, we have πΎ ( π, πΏ) β [1,π] since ππ can be finished by πβ.By Lemma 5.1, if π β² < π β²β², we have πΎ ( π, π β²) β₯ πΎ ( π, π β²β²). Since πβ < πΏ, we have in an optimalschedule that the number of processors assigned to a task ππ is β₯ πΎ ( π, πβ), which is β₯ πΎ ( π, πΏ). ByProperty 3.1, we have π·βπ β₯ π· π,πΎ ( π,πβ) β₯ π· π,πΎ ( π,πΏ) . By Lemma 4.3, we have π· π,πΎ ( π,πΏ) = π·πΏ
π . Finally,we have π·βπ β₯ π·πΏ
π and π·β β₯ π·πΏ . β‘
PROPOSITION 5.1. The OMS algorithm gives a 1\ (πΏ) (1 + π)-approximation to the makespan
minimization problem with a complexity of O(π log(π/π) logπ).
PROOF. We simply denote \ (πΏ) by \ . For the approximation ratio, it suffices to show π /πβ β€(1 + π)/\ where \ β (0, 1]. When the OMS algorithm ends, we have
π β€ (1 + π)πΏ. (26)
Obviously, πβ β€ π . In the case that πβ β [πΏ,π ], we have
π
πββ€ 1 + π β€ 1
\(1 + π).
In the other case that πβ < πΏ, we have by Inequalities (24), (25) and (26) and Lemma 5.2 that
ππβ β₯ π·β β₯ π·πΏ β₯ π\πΏ β₯ π\π
1 + π .
Further, we have π /πβ β€ (1 + π)/\ .Finally, the initial values of π and πΏ are π(πΏ + 2)maxππ βT{π‘ π,1} and 0. The binary search stops
when π β€ πΏ(1 + π) and the number of iterations is O(log(π/π)). At each iteration, ππβππ is run andhas a time complexity O(π logπ) by Proposition 4.6. Finally, the OMS algorithm has a complexityO(π log(π/π) logπ). β‘
5.2 Throughput Maximization with a Common DeadlineLet π£ β²π = π£ π/π· π,πΎ ( π,π) , and it is the maximum possible value obtained from processing a unit
of workload of ππ , referred to as the (maximum) value density of ππ . We assume without loss ofgenerality that
π£ β²1 β₯ π£ β²2 β₯ Β· Β· Β· β₯ π£ β²π .
We propose a greedy algorithm called GreedyAlgo, presented in Algorithm 2: it considers tasks inthe non-increasing order of their value densities π£ β²π and finally finds the maximum π β² such that ππβππcan output a feasible schedule by time π for the first π β² tasks, denoted by Sπβ² , but fails to do so for thefirst π β² + 1 tasks. The throughput of GreedyAlgo is
βπβ²π=1 π£ π .
PROPOSITION 5.2. GreedyAlgo gives a \ (πΏ)-approximation to the throughput maximizationproblem with a common deadline and it has a complexity of O(π2 logπ).
In the rest of this subsection, we give an overview of the proof of Proposition 5.2. By Proposi-tion 4.5, \ (πΏ) is a lower bound of the processor utilization when ππβππ schedules Sπβ² in [0, π]. LetOPT denote the optimal throughput of our problem. The proof of Proposition 5.2 has two parts:
16 Wu and Loiseau
Algorithm 2: GreedyAlgo
1 initialize Sπ = {π1,π2, Β· Β· Β· ,ππ } for all π β [1, π];2 for π β 1 to π do3 if ππβππ produces a feasible schedule of all tasks of Sπ by time π then4 π β²β π;
5 else6 exit;
(i) We give an upper bound of OPT , denoted by OPT , i.e.,
OPT β₯ OPT (27)
where OPT will be specified in (29).(ii) We show that \ (πΏ) is a lower bound of the ratio of the throughput of GreedyAlgo to the upper
bound, i.e., βπβ²π=1 π£ π
OPTβ₯ \ (πΏ). (28)
Then, we have by Inequalities (27) and (28) thatβπβ²π=1 π£ π
OPT β₯βπβ²
π=1 π£ π
OPTβ₯ \ (πΏ).
Thus, the throughputβπβ²
π=1 π£ π of GreedyAlgo is at least \ (πΏ) times the optimal throughput OPT andGreedyAlgo is a \ (πΏ)-approximation algorithm.
For the first part, let us consider a fractional knapsack problem [19] and there is a knapsack of sizeππ and π divisible items. With abuse of notation, each item is still denoted by ππ , with a fixed sizeπ· π,πΎ ( π,π) and a value π£ π . Its optimal solution is packing into the knapsack the first π items, denoted bySβ², with the highest value densities such that their total size equals ππ:
πβ1βοΈπ=1
π· π,πΎ ( π,π) + πΌπ·π,πΎ (π,π) = ππ
where πΌ β (0, 1] and the π-th item may be partially packed. The following lemma completes thedescription of the first part.
LEMMA 5.3. An upper bound of OPT is
OPT =
πβ1βοΈπ=1
π£ π + πΌπ£π , (29)
which is the optimal value of the knapsack problem.
PROOF. GreedyAlgo chooses a subset of tasks Sπβ² = {π1,π2, Β· Β· Β· ,ππβ²} and uses ππβππ to scheduleSπβ² on the π processors in [0, π]. We will show that any solution to the problem of this papercorresponds to a feasible solution to the above knapsack problem, where the same tasks/itemsare chosen and the two solutions have the same total value of tasks/items; the lemma thus holds.Specifically, when a task ππ β Sπβ² is chosen in our problem and assigned π π processors, we cancorrespondingly pack an item ππ with a size π· π,πΎ ( π,π) into the above knapsack. By Lemma 4.3, π· π,π π
Efficient Approximation Algorithms for Scheduling Moldable Tasks 17
= π· π,πΎ ( π,π) andβ
ππ βSπβ²π· π,πΎ ( π,π) β€ ππ; thus, the items π1, π2, Β· Β· Β· , ππβ² can successfully be packed into
the knapsack. β‘
For the second part, the detailed proof of (28) will be provided in B. Below, we provide theunderlying intuition while proving (28). The workload of each taskππ β Sπβ² accepted by GreedyAlgois also π· π,πΎ ( π,π) by Lemma 4.3. Sπβ² and Sβ² contain the first π β² and π tasks with the highest valuedensities respectively. We have π β² β€ π since in GreedyAlgo the utilization of the π processors in[0, π] is β€ 1. Thus, the average value density of Sπβ² is no smaller than the average value density ofSβ², i.e., βπβ²
π=1 π£ πβπβ²π=1 π· π,πΎ ( π,π)
β₯ OPTππ
.
Further, we can prove (28): βπβ²π=1 π£ π
OPTβ₯
βπβ²π=1 π· π,πΎ ( π,π)
ππβ₯ \ (πΏ).
Finally, Algorithm 2 considers S1, S2, Β· Β· Β· , Sπ one by one (line 2). Whenever ππβππ attempts toschedule tasks on π processor by time π (line 3), it has a time complexity O(π logπ) by Proposi-tion 4.6. Hence, the complexity of GreedyAlgo is O(π2 logπ).
6 CONCLUSIONSWe study the problem of scheduling π independent moldable tasks on π processors that arises
in large-scale parallel computations. When tasks are monotonic, the best known result is a ( 32 + π)-approximation algorithm for makespan minimization with a complexity linear in π and polynomialin logπ and 1
πwhere π is arbitrarily small. We propose a new perspective of the existing speedup
models: the speedup of a task ππ is linear when the number π of assigned processors is small (upto a threshold πΏ π ) while it presents monotonicity when π ranges in [πΏ π , π π ]; the bound π π indicatesan unacceptable overhead when parallelizing on too many processors. For a given integer πΏ β₯ 5,let π’ =
β2βπΏ
ββ 1. In this paper, we propose a 1
\ (πΏ) (1 + π)-approximation algorithm for makespan
minimization with a complexity O(π log(π/π) logπ) where \ (πΏ) = π’+1π’+2
(1 β π
π
)(π β« π). As a
by-product, we also propose a \ (πΏ)-approximation algorithm for throughput maximization with acommon deadline with a complexity O(π2 logπ).
ACKNOWLEDGEMENTSThe work of Patrick Loiseau was supported by the French National Research Agency (ANR)
through the βInvestissements dβavenirβ program (ANR-15-IDEX- 02), and by the Alexander vonHumboldt Foundation.
REFERENCES[1] ARIDOR, Y., DOMANY, T., GOLDSHMIDT, O., KLITEYNIK, Y., MOREIRA, J., AND SHMUELI, E. (2005). Open jobmanagement architecture for the blue gene/l supercomputer. In Proceedings of the 11th Workshop on Job SchedulingStrategies for Parallel Processing. Springer, 91β107.
[2] BELKHALE, K. P. AND BANERJEE, P. (1990). An approximate algorithm for the partitionable independent task schedulingproblem. In Proceedings of the 1990 International Conference on Parallel Processing. Pennsylvania State University Press,72β75.
[3] BENOIT, A., LE FΓVRE, V., PEROTIN, L., RAGHAVAN, P., ROBERT, Y., AND SUN, H. (2020). Resilient scheduling ofmoldable jobs on failure-prone platforms. In Proceedings of the 22nd IEEE International Conference on Cluster Computing.IEEE, 81β91.
18 Wu and Loiseau
[4] BERG, B., VESILO, R., AND HARCHOL-BALTER, M. (2019). hesrpt: Optimal scheduling of parallel jobs with knownsizes. ACM SIGMETRICS Performance Evaluation Review 47, 2, 18β20.
[5] DECKER, T., LΓCKING, T., AND MONIEN, B. (2006). A 54 -approximation algorithm for scheduling identical malleable
tasks. Theoretical Computer Science 361, 2-3, 226β240.[6] DROZDOWSKI, M. (1996). Real-time scheduling of linear speedup parallel tasks. Information processing letters 57, 1,35β40.
[7] DROZDOWSKI, M. (2004). Scheduling Parallel Tasks β Algorithms and Complexity. CRC Press.[8] DUTTON, R. A., MAO, W., CHEN, J., AND WATSON III, W. (2008). Parallel job scheduling with overhead: A benchmarkstudy. In Proceedings of the IEEE International Conference on Networking, Architecture, and Storage. IEEE, 326β333.
[9] FISHKIN, A. V., GERBER, O., JANSEN, K., AND SOLIS-OBA, R. (2005). Packing weighted rectangles into a square. InProceedings of the 30th International Symposium on Mathematical Foundations of Computer Science. Springer, 352β363.
[10] HAVILL, J. T. AND MAO, W. (2008). Competitive online scheduling of perfectly malleable jobs with setup times.European Journal of Operational Research 187, 3, 1126β1142.
[11] HOCHBAUM, D. S. AND SHMOYS, D. B. (1987). Using dual approximation algorithms for scheduling problemstheoretical and practical results. Journal of the ACM 34, 1, 144β162.
[12] JAIN, N., MENACHE, I., NAOR, J. S., AND YANIV, J. (2015). Near-optimal scheduling mechanisms for deadline-sensitive jobs in large computing clusters. ACM Transactions on Parallel Computing 2, 1, 3.
[13] JANSEN, K. (2012). A ( 32 + π) approximation algorithm for scheduling moldable and non-moldable parallel tasks. InProceedings of the 24th annual ACM symposium on Parallelism in algorithms and architectures. ACM, 224β235.
[14] JANSEN, K. AND LAND, F. (2018). Scheduling monotone moldable jobs in linear time. In Proceedings of the IEEEInternational Parallel and Distributed Processing Symposium. IEEE, 172β181.
[15] JANSEN, K. AND PORKOLAB, L. (2002). Linear-time approximation schemes for scheduling malleable parallel tasks.Algorithmica 32, 3, 507β520.
[16] JANSEN, K. AND THΓLE, R. (2010). Approximation algorithms for scheduling parallel jobs. SIAM Journal onComputing 39, 8, 3571β3615.
[17] JANSEN, K. AND ZHANG, G. (2007). Maximizing the total profit of rectangles packed into a rectangle. Algorith-mica 47, 3, 323β342.
[18] JOHN, L. K. AND EECKHOUT, L. (2018). Performance evaluation and benchmarking. CRC Press.[19] KORTE, B. AND VYGEN, J. (2018). The Knapsack Problem. Springer, Berlin, Heidelberg, 471β487.[20] LUCIER, B., MENACHE, I., NAOR, J. S., AND YANIV, J. (2013). Efficient online scheduling for deadline-sensitive jobs.In Proceedings of the 25th ACM symposium on Parallelism in Algorithms and Architectures. ACM, 305β314.
[21] LUDWIG, W. AND TIWARI, P. (1994). Scheduling malleable and nonmalleable parallel tasks. In Proceedings of the fifthannual ACM-SIAM symposium on Discrete algorithms. ACM, 167β176.
[22] MOUNIΓ, G., RAPINE, C., AND TRYSTRAM, D. (1999). Efficient approximation algorithms for scheduling malleabletasks. In Proceedings of the 11th ACM symposium on Parallel algorithms and architectures. ACM, 23β32.
[23] MOUNIΓ, G., RAPINE, C., AND TRYSTRAM, D. (2007). A 32 -approximation algorithm for scheduling independent
monotonic malleable tasks. SIAM Journal on Computing 37, 2, 401β412.[24] STEINBERG, A. (1997). A strip-packing algorithm with absolute performance bound 2. SIAM Journal on Comput-ing 26, 2, 401β409.
[25] TUREK, J., WOLF, J. L., AND YU, P. S. (1992). Approximate algorithms scheduling parallelizable tasks. In Proceedingsof the fourth annual ACM symposium on Parallel algorithms and architectures. ACM, 323β332.
[26] WANG, Q. AND CHENG, K.-H. (1992). A heuristic of scheduling parallel tasks and its analysis. SIAM Journal onComputing 21, 2, 281β294.
[27] WIKIPEDIA. (2018). Supercomputer architechture. [accessed online, June 17, 2018],https://en.wikipedia.org/wiki/Supercomputer_architecture#21st-century_architectural_trends.
[28] WU, X. AND LOISEAU, P. (2015). Algorithms for scheduling deadline-sensitive malleable tasks. In Proceedings of the53rd Annual Allerton Conference on Communication, Control, and Computing. IEEE, 530β537.
[29] WU, X., LOISEAU, P., AND HYYTIΓ, E. (2019). Toward designing cost-optimal policies to utilize iaas clouds withonline learning. IEEE Transactions on Parallel and Distributed Systems 31, 3, 501β514.
A THE INITIAL VALUE OF π
The initial value of π is set as
π = π(πΏ + 2)maxππ βT{π‘ π,1},which is at least πΏ + 2 times the total execution time of all tasks when every task is assigned oneprocessor. πΎ ( π,π ) is the minimum number of processors needed to complete ππ by time π , and we
Efficient Approximation Algorithms for Scheduling Moldable Tasks 19
have πΎ ( π,π ) = 1 for allππ β T . We have by Inequality (3a) that 1 β€ π» β 1 β€ πΏ β² β€ πΏ . By Property 3.1,we have for every task ππ β T that
π‘ π,πΏβ² =π‘ π,1
πΏ β²β€ π
π(πΏ + 2)πΏ β² β€π
πΏ + 2 <π
π»= (1 β π )π
where π β₯ 1 and π = π»β1π»
. Every task of T has an execution time < (1 β π )π when assigned πΏ β²
processors. Thus, all tasks of T are in the class A β²β², and the other classes A β², Aπ»β1, Β· Β· Β· , Aa areempty. Now, we show that ππβππ can produce a feasible schedule for all tasks of T by time π .All tasks of T constitute A β²β² and will be sequentially executed on πΏ β² processors (see lines 9-24 ofAlgorithm 1); the total execution time of T is β€ πmaxππ βT{π‘ π,1} β€ π .
B THE DETAILED PROOF OF THE SECOND PARTBelow, we formally prove Inequality (28). GreedyAlgo accepts the first π β² tasks with the highest
value densities π£ β²π , and the achieved throughput isβπβ²
π=1 π£ π . ππβππ is used to schedule the π β² tasks, andeach accepted task ππ has a workload π· π,πΎ ( π,π) by Lemma 4.3. We denote by π β [0, 1] the actualutilization of theπ processors in [0, π] achieved by GreedyAlgo, i.e.,
πβ²βοΈπ=1
π· π,πΎ ( π,π) = πππ.
By Proposition 4.5, \ (πΏ) is a lower bound of the processor utilization and we have
π β₯ \ (πΏ) β (0, 1] . (30)
Since π β€ 1, we have
π β² β€ π. (31)
LEMMA B.1. The throughputπβ²βπ=1
π£ π achieved by GreedyAlgo is at least \ (πΏ)OPT where OPT
is given in Equation (29).
PROOF. By Inequality (31), we will analyze two cases that π β² = π and π β² < π respectively. In thecase that π β² = π , we have
πβ²βοΈπ=1
π£ π β€ OPT =
πβ1βοΈπ=1
π£ π + πΌπ£π
by Lemma 5.3. Thus, we have πΌ = 1 and the lemma holds.In the case that π β² < π , let
π1 =1π
πβ²βοΈπ=1
π· π,πΎ ( π,π) βπβ²βοΈπ=1
π· π,πΎ ( π,π)
π2 =
πβ1βπ=πβ²+1
π· π,πΎ ( π,π) + πΌπ·π,πΎ ( π,π) if π β² < π β 1
πΌπ·π,πΎ ( π,π) if π β² = π β 1
π =
πβ1βπ=πβ²+1
π£ π + πΌπ£π if π β² < π β 1
πΌπ£π if π β² = π β 1
20 Wu and Loiseau
Recall
ππ =1π
πβ²βοΈπ=1
π· π,πΎ ( π,π) =πβ1βοΈπ=1
π· π,πΎ ( π,π) + πΌπ·π,πΎ (π,π) .
We thus have π1 = π2 since
ππ β π1 =
πβ²βοΈπ=1
π· π,πΎ ( π,π) = ππ β π2 .
The total value obtained by GreedyAlgo isπβ²βπ=1
π£ π and we have
πβ²βπ=1
π£ π
πππ
(π)=
πβ²βπ=1
π£ β²π( 1ππ· π,πΎ ( π,π) β π· π,πΎ ( π,π) + π· π,πΎ ( π,π)
)ππ
(π)β₯
πβ²βπ=1
π£ π + π£ β²πβ²π1
ππ
(π)=
πβ²βπ=1
π£ π + π£ β²πβ²π2
ππ
(π)β₯
πβ²βπ=1
π£ π + π
ππ=OPTππ
.
(32)
Here, in Equation (a), π£ π = π£ β²ππ· π,πΎ ( π,π) ; Inequalities (b) and (d) are due to that π£ β²1 β₯ Β· Β· Β· β₯ π£ β²πβ² β₯ Β· Β· Β· β₯
π£ β²π; Equation (c) is due to that π1 = π2. Due to Inequality (32), we haveπβ²βπ=1
π£ π β₯ πOPT ; further, by
Inequality (30), the lemma holds. β‘