

J Grid Computing (2008) 6:159–175. DOI 10.1007/s10723-006-9061-5

A Launch-time Scheduling Heuristics for Parallel Applications on Wide Area Grids

Ranieri Baraglia · Renato Ferrini · Nicola Tonellotto · Laura Ricci · Ramin Yahyapour

Received: 26 June 2006 / Accepted: 4 November 2006 / Published online: 17 March 2007
© Springer Science + Business Media B.V. 2007

Abstract Large and dynamic computational Grids, generally known as wide-area Grids, are characterized by a large availability and heterogeneity of computational resources, and by a high variability of their status over time. Such Grid infrastructures require appropriate scheduling mechanisms in order to satisfy the application performance requirements (QoS). In this paper we propose a launch-time heuristics to schedule component-based parallel applications on this kind of Grid. The goal of the proposed heuristics is threefold: to meet the minimal task computational requirement, to maximize the throughput between communicating tasks, and to evaluate on-the-fly the resource availability in order to minimize the aging effect on the resources state. We evaluate the proposed heuristics by simulations, applying it to a suite of task graphs and Grid platforms

R. Baraglia · R. Ferrini · N. Tonellotto (B)
Information Science and Technologies Institute, ISTI-CNR, Via G. Moruzzi 1, Pisa, Italy
e-mail: [email protected]

L. Ricci
Department of Computer Science, University of Pisa, Largo B. Pontecorvo 3, Pisa, Italy

R. Yahyapour
Institute for Robotics Research, University of Dortmund, Otto-Hahn Straße 8, Dortmund, Germany

randomly generated. Moreover, a further test was conducted by scheduling a real application on a real Grid. Experimental results show that the proposed solution is a viable one.

Keywords Grid computing · Wide-area Grids · Performance requirements · Static scheduling · Quality of service

1 Introduction

A Grid is a dynamic, seamless, integrated computational and collaborative environment. Grids integrate networking, computation and information to provide a virtual platform for computing and data management. Resources in a Grid are typically grouped into autonomous administrative domains communicating through high-speed communication links [17, 18].

A Grid Resource Management System (RMS) is fundamental for the operation of a Grid; its basic functions are to accept requests for resources and to assign specific resources to a request from the overall pool of Grid resources for which a user has access permission. The Grid RMS comprises the middleware, tools and services that disseminate resource information, discover suitable resources, and schedule resources for job execution. The design of a Grid RMS must consider aspects such as:


site autonomy, heterogeneity, extensibility, allocation/co-allocation, scheduling, and online management [9, 21, 25, 28]. Regarding the scheduling policy adopted within a Grid RMS, we can distinguish between system-oriented and application-oriented ones. Local job schedulers usually adopt system-oriented policies, trying to optimize system throughput. On the other hand, application schedulers try to optimize application-specific metrics such as completion time.

Even if, in the broadest sense, any application can be a Grid application, in practice the applications that can take most advantage of the Grid are loosely-coupled, computationally intensive, component-based, and, often, multidisciplinary. The use of components provides suitable means to easily build efficient Grid applications [19, 29]. Components can be sequential or parallel, written in different programming languages, and may be distributed across the computational resources of the Grid.

Complex distributed application schedulers are needed to reach application goals such as the desired application performance and scalability. The problem with such applications is that scheduling decisions taken at launch-time (static scheduling) may not be sufficient to maintain the application performance requirements, and may need to be modified (dynamic re-scheduling) at run-time. Static scheduling is used before an application starts to determine how its components should be initially scheduled on the available Grid resources. When the characteristics of a parallel application (e.g. task computational cost, amount of data exchanged among tasks, task dependencies) are known before the application execution, a static scheduling approach can be profitably exploited. Static scheduling usually does not add overhead to the execution time of the scheduled application, and, therefore, more complex scheduling solutions than the dynamic ones can be adopted. However, in the case of non-dedicated production Grids, the scheduling policy must consider the aging effect on the resources state. High-complexity solutions could lead to wrong scheduling decisions, due to load variations in the system or dynamic changes of the resource availability. On the other hand, re-scheduling requires the ability

to modify the schedule of the application tasks during their execution, in order to cope with dynamic changes of the resource availability, to sustain the required performance, or to improve the current application/system performance. Re-scheduling can include changing the machines on which the application processes are executing (migration) and/or changing the scheduling of data to those machines (dynamic load balancing). Dynamic task replication is also a popular technique used to circumvent runtime scheduling problems [7, 11].

To this end, performance monitoring and component performance models should be exploited to deal with dynamic changes in the resources' capacities and availability. Moreover, the system should exhibit the ability to completely freeze an application and continue its execution on a different set of resources (checkpointing/migration). Such a technique is useful not only when the scheduled resources are no longer able to guarantee the required application performance contract, but also in the case of resource failure.

When performance falls below expectations, the re-scheduling mechanism must determine whether re-scheduling is profitable and, if so, what new schedule should be used. Since re-scheduling can be a very expensive operation, the launch-time scheduling has to be "optimized" in order to minimize the re-scheduling activity.

In this paper we investigate the issues related to the launch-time scheduling of parallel applications on wide-area Grids. We propose a heuristics, called WASE (Wide Area Scheduling Heuristics), that exploits a TIG (Task Interaction Graph) [26] as application model and considers large computational Grids made up of non-dedicated computational resources. The goal of such heuristics is threefold: to meet the minimal task computational requirement (soft QoS), to maximize the throughput between communicating tasks, and to evaluate on-the-fly the resource availability to minimize the aging effect on the resources state.

The rest of this paper is organized as follows. Section 2 highlights current efforts in the area of scheduling multi-component parallel applications. Section 3 gives a description of the problem, Section 4 describes our heuristics, and Section 5 outlines and evaluates the proposed heuristics


through simulation and by scheduling a real application on a real Grid. Finally, conclusions and future work are given in Section 6.

2 Related Work

Two important current Grid scheduling efforts in the area of multi-component, resource-intensive parallel applications are the AppLeS (Application Level Scheduling) [4] and GrADS (Grid Application Development Software) [10] projects. AppLeS proposes a methodology for adaptive application scheduling on heterogeneous computing platforms to meet application-specific metrics, such as the minimization of the application completion time. The AppLeS approach exploits static and dynamic resource information, performance predictions, application- and user-specific information, and scheduling techniques that adapt the application execution on-the-fly. In order to ease the integration of a scheduling agent into an application, templates for parameter sweep applications (APST), master/worker applications (AMWAT), and moldable jobs on space-shared parallel supercomputers (SA) were provided. The goal of GrADS is to realize a Grid system by providing tools, such as problem solving environments, Grid compilers, schedulers, and performance monitors, to manage all the stages of application development and execution. Exploiting GrADS, users can concentrate on high-level application design without paying attention to the peculiarities of the Grid computing platform used. The scheduler is a key component of the GrADS system. In GrADS, scheduling decisions are taken by exploiting application characteristics and requirements in order to obtain the best application execution time. The scheduler is structured according to three distinct phases: launch-time scheduling, re-scheduling and meta-scheduling. At launch-time the application execution is started on the selected Grid resources, and a real-time performance monitor is used to track program performance and to detect violations of performance guarantees. Performance guarantees are formalized in a performance contract. In the case of a performance contract violation, the re-scheduler is invoked to evaluate alternative schedules. Meta-scheduling involves the coordination of schedules for multiple applications running on the same Grid at once. It implements scheduling policies for balancing the interests of different applications. The Monitoring and Discovery Service (MDS) [9], Network Weather Service (NWS), ClassAds/Matchmaking approach [24], Globus Toolkit [16] and GridFTP for file transfer [3] are the existing software components used to implement the GrADS system.

Moreover, in the past, a number of static scheduling heuristics for heterogeneous computing platforms have been proposed to approximate optimal scheduling solutions. These exploit several techniques such as list scheduling (e.g. [20, 27]), clustering (e.g. [5, 15]), evolutionary methods (e.g. [23, 30]) and task duplication (e.g. [2, 8]). The objective of almost all of these algorithms is to minimize the overall application completion time, and, in general, such methods consider dedicated Grids with a limited number of nodes. Moreover, the in-depth exploration of the solution space to refine the approximation of the optimal solution may require heuristics execution times that make these algorithms not exploitable on non-dedicated wide-area Grids. This is due, in general, to the high variability of the Grid resources status over time, which can make the found scheduling solution inefficient. In practice it is not required to find an initial schedule that best approximates the optimal one; it is sufficient to find a schedule that guarantees the required application performance.

3 Problem Description

3.1 The Application Model

WASE assumes that a parallel application is composed of multiple tasks in continuous execution, and that the communications between tasks are bidirectional and may occur at any time during the application execution. It is also assumed that the task computation and communication costs are known before the application execution. This


kind of application requires the co-allocation of the parallel tasks before starting the execution.

A weighted Task Interaction Graph (TIG) (Fig. 1), denoted as GApp = (N, V), is adopted to model a parallel application. In GApp a node ni ∈ N, with 1 ≤ i ≤ |N|, models a task, and an undirected edge aij ∈ V, with i ≠ j, models a communication occurring between node ni and node nj. Nodes have an associated weight representing the amount of computation required to execute the corresponding task, and edges have an associated weight representing the amount of data exchanged between two nodes. Application costs can be obtained either by static program analysis or by executing the code on a reference system.

In general, the practical evaluation of computation and communication costs is not an easy task. Moreover, computational cost, power and load should be represented by means of multi-dimensional quantities, in which each dimension represents one aspect of the computation (floating point operations, memory accesses, I/O operations). In order to simplify this evaluation we propose to exploit both qualitative and quantitative information on the application. The qualitative one is represented by the Relative Computational Requirements, RCR(ni), and the Relative Quality of Communications, RQoC(aij). Both are uniformly

defined for nodes and edges, respectively; e.g. RCR(nh) = 3 and RCR(nk) = 1.5 denote that nh has a minimal computational requirement twice that of nk. Such values specify the required computational work and communication bandwidth in relative terms only. We assume that at least one node and one edge exist with RCR = 1 and RQoC = 1, respectively.

The quantitative information is represented by the value MCR(GApp), which specifies the Minimal Computational Requirement of the application. It is expressed in MFlop/s, as is the computational power provided by the Grid machines, and it represents the minimal computational power required to efficiently run the tasks with RCR = 1.

Given the MCR(GApp) value, it is possible to determine the MCR(ni) value for each task of the application:

MCR(ni) = MCR(GApp) · RCR(ni) (1)
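As a concrete illustration of Eq. 1, the following sketch (ours, with hypothetical RCR values; the function name is an assumption, not the paper's code) maps relative requirements onto absolute per-task requirements in MFlop/s:

```python
# Illustrative sketch of Eq. 1: MCR(n_i) = MCR(G_App) * RCR(n_i).
# The task names and RCR values below are hypothetical.

def per_task_mcr(mcr_app, rcr):
    """Map relative requirements (RCR) to absolute MFlop/s requirements."""
    return {task: mcr_app * r for task, r in rcr.items()}

rcr = {"n1": 1.0, "n2": 1.5, "n3": 3.0}   # n1 is the reference task (RCR = 1)
mcr = per_task_mcr(100.0, rcr)            # assume MCR(G_App) = 100 MFlop/s
print(mcr)                                # {'n1': 100.0, 'n2': 150.0, 'n3': 300.0}
```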

3.2 Platform Model

A hierarchical graph (i.e. a dendrogram), denoted as GGrid = (AN, NL), is adopted to model the Grid.

Fig. 1 Example of a weighted task interaction graph


AN is the set of nodes representing sets of computational resources of the Grid, and NL abstracts the interconnection networks. We model three different kinds of networks: Wide Area Networks (WAN), Metropolitan Area Networks (MAN) and Local Area Networks (LAN). The root of the graph is a "virtual node" grouping together all the WANs. Every node in the dendrogram represents a set of resources. A LAN node contains the resources to be exploited for task execution, while higher-level nodes contain all the resources of the lower-level nodes. Figure 2 shows a dendrogram including two MANs.

A basic assumption of our approach is that intra-LAN communications are characterized by higher bandwidth and a lower level of congestion with respect to any other network communication. It is assumed that, for each computational resource mij ∈ ani ∈ AN, its nominal computational power P^max_ij and its percentage of current computational load C^load_ij are known. The current computational power of a resource, P^cur_ij, is computed as follows:

P^cur_ij = P^max_ij · (1 − C^load_ij) (2)

We suppose that each computational resource supports a multiprogramming environment.

The current and average computational power of a LAN ani containing |ani| hosts are computed as follows:

P^cur_i = Σ_{j=1..|ani|} P^cur_ij (3)

P̄^cur_i = P^cur_i / |ani| (4)

The current computational availability of a MAN and of a WAN can be defined in a similar way.
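The platform-model formulas above can be sketched as follows. This is a minimal illustration with assumed data structures and function names; the loads and nominal powers are hypothetical:

```python
# Sketch of Eqs. 2-4 of the platform model.
# Eq. 2: current power of a resource from its nominal power and load.
# Eq. 3: total current power of a LAN.  Eq. 4: mean current power of a LAN.

def current_power(p_max, load_fraction):
    """Eq. 2: P^cur = P^max * (1 - C^load)."""
    return p_max * (1.0 - load_fraction)

def lan_power(machines):
    """machines: list of (nominal MFlop/s, current load in [0, 1]).
    Returns (total current power, mean current power)."""
    powers = [current_power(p, load) for p, load in machines]
    total = sum(powers)            # Eq. 3
    mean = total / len(machines)   # Eq. 4
    return total, mean

# A hypothetical two-machine LAN: 1000 MFlop/s at 25% load, 800 MFlop/s at 50% load.
total, mean = lan_power([(1000.0, 0.25), (800.0, 0.5)])
print(total, mean)  # 1150.0 575.0
```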

Finally, we also suppose that the current bandwidth BW(ani, anj) between two sibling nodes ani and anj is known.

Both the structure of the dendrogram and the characteristics of the resources can be obtained through tools like the Network Weather Service [31] and TopoMon [13]. The former returns the network and computational resource performances, while the latter returns an abstraction of the network topology.

4 Algorithm Architecture

The goal of WASE is to provide soft QoS support by fulfilling the application computational requirements and exploiting regions of the Grid

Fig. 2 Grid network topology represented as a dendrogram


network with high communication bandwidths, trying to qualitatively maximize the application communication throughput. The algorithm takes as input the GApp and GGrid graphs, which represent respectively a parallel application and a Grid. The algorithm is structured in two steps.

Step 1 (Clustering): the goal of this preprocessing step is to elaborate GApp to find a set of subgraphs (called clusters) S = (S1, . . . , Sk) grouping the tasks with high inter-communication costs. The clustering is carried out using the ε-approximation correlation clustering algorithm proposed in [12]. The goal of the algorithm is to partition the application graph into clusters exploiting a similarity degree associated to each edge (i.e. we try to put into the same subgraph nodes characterized by a higher communication cost among them). A positive similarity degree between two nodes means that the algorithm should try to put the two linked nodes in the same cluster, while a negative one will force the two nodes into different clusters. Basically, the clustering problem is formulated as an integer linear programming problem with boolean variables. This problem is then relaxed and solved with standard LP methods, and the solution is rounded to an integer solution exploiting the Region Growing technique [1]. This clustering algorithm does not need any meta-parameter (e.g. the number of clusters), and the similarity measure is defined as:

wij = log( RQoC(aij) / (RQoCmax − RQoC(aij)) ) (5)

where RQoCmax is the maximum value of RQoC in the application graph. The higher the communication cost between two tasks (i.e. the RQoC value), the higher the confidence that the tasks are similar. In this way, the number of clusters obtained is independent of the Grid size and topology. As a result, tasks within the same cluster need higher communication bandwidth than those belonging to different clusters. Therefore, according to the heuristics' assumption that a LAN offers better communication quality than MANs and WANs, WASE attempts to allocate all the tasks within a cluster onto the machines of the same LAN.
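A short sketch of the edge weight of Eq. 5 (our own code, not the paper's). Note that an edge whose RQoC equals RQoCmax makes the denominator zero; clamping such edges to a large positive weight is our assumption, not a rule stated by the paper:

```python
import math

# Sketch of the similarity measure of Eq. 5 driving correlation clustering:
#   w_ij = log( RQoC(a_ij) / (RQoC_max - RQoC(a_ij)) )
# Positive weights favor co-clustering, negative weights favor separation.

def similarity(rqoc, rqoc_max, cap=50.0):
    if rqoc >= rqoc_max:
        return cap  # assumed clamp for the degenerate RQoC == RQoC_max case
    return math.log(rqoc / (rqoc_max - rqoc))

print(similarity(9.0, 10.0))  # strongly communicating edge -> positive weight
print(similarity(1.0, 10.0))  # weakly communicating edge  -> negative weight
```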

Step 2 (Scheduling): the tasks within the clusters have to be scheduled onto Grid machines. First, the clusters in S are arranged in descending order with respect to their minimal computational requirement MCR(Si):

MCR(Si) = MCR(GApp) · Σ_{nk∈Si} RCR(nk) (6)

Then, the clusters are scheduled onto the machines of suitable LANs by adopting a priority-based policy.

Selecting a cluster and starting from the user LAN (LANu), a Closest First Search (CFS) is applied to GGrid in order to find a suitable LAN to host all the tasks within the selected cluster. The CFS first visits the local LANs (i.e. the LANs in the user MAN) in descending order of current link bandwidth, measured from the user LAN. Then it moves upward to the user MAN and visits the MANs in the user WAN, in decreasing order of current link bandwidth measured from the user MAN. Finally, it moves upward visiting the WANs in the usual bandwidth order. Note that, in general, the search does not visit all the nodes of the dendrogram, but ends as soon as enough resources to satisfy the application requirements are found. With a certain bias, this heuristics tries to minimize the communication overhead, and therefore the performance degradation due to MAN and WAN network links, by putting strongly communicating nodes within single LANs.
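The CFS visit order can be sketched roughly as follows. The tree encoding (parent/children dictionaries) and the bandwidth table are our assumptions; the paper only requires BW() between sibling nodes to be known (e.g. via NWS/TopoMon):

```python
# Rough sketch of the Closest First Search (CFS) order over the dendrogram.

def cfs_order(start_lan, bandwidth, parent, children):
    """Yield nodes in CFS order: the user LAN first, then at each ancestor
    level the sibling subtrees in descending current-bandwidth order."""
    node = start_lan
    yield node
    while parent.get(node) is not None:
        up = parent[node]
        siblings = [s for s in children[up] if s != node]
        # bandwidth is measured from the current node (user LAN, then user MAN, ...)
        for s in sorted(siblings, key=lambda s: bandwidth[(node, s)], reverse=True):
            yield s  # each sibling subtree would then be searched for suitable LANs
        node = up

# Toy dendrogram: WAN "W" -> MANs "M1", "M2"; "M1" -> LANs "L1" (user), "L2", "L3".
parent = {"L1": "M1", "L2": "M1", "L3": "M1", "M1": "W", "M2": "W", "W": None}
children = {"M1": ["L1", "L2", "L3"], "W": ["M1", "M2"]}
bw = {("L1", "L2"): 5, ("L1", "L3"): 9, ("M1", "M2"): 3}
print(list(cfs_order("L1", bw, parent, children)))  # ['L1', 'L3', 'L2', 'M2']
```

In a full implementation the search would stop as soon as a visited LAN passes the suitability test, rather than enumerating the whole order.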

The LANj suitability with respect to cluster Si is computed according to the following formula:

Suitability(Si, LANj) = P̄^cur_j[Si] / (M̄CR(Si) + σ_MCR(Si)) (7)

where P̄^cur_j[Si] is the mean computational power offered by the min(|Si|, |anj|) most powerful machines of the LAN,¹ and M̄CR(Si) is the mean minimal computational power requested by the cluster:

M̄CR(Si) = MCR(Si) / |Si| (8)

¹Note that if the number of tasks is greater than the number of machines, P̄^cur_j[Si] corresponds to the mean computational power of the LAN, as in Eq. 4.


and σ_MCR(Si) is the standard deviation of the cluster computational requirements:

σ_MCR(Si) = sqrt( Σ_{nk∈Si} (MCR(nk) − M̄CR(Si))² / |Si| ) (9)

LANj is suitable for Si if and only if Suitability(Si, LANj) ≥ 1. The tasks within a cluster are scheduled onto the machines of the first suitable LAN in CFS order.
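The suitability test of Eqs. 7-9 can be sketched as follows (our naming and data layout; the inputs are the per-task MCR values of a cluster and the current powers of a LAN's machines):

```python
import math

# Sketch of Eqs. 7-9: LAN suitability for a cluster.

def suitability(task_mcrs, machine_powers):
    k = min(len(task_mcrs), len(machine_powers))
    # mean power of the k most powerful machines (footnote 1: the whole LAN if k = |LAN|)
    p_mean = sum(sorted(machine_powers, reverse=True)[:k]) / k
    mcr_mean = sum(task_mcrs) / len(task_mcrs)                   # Eq. 8
    sigma = math.sqrt(sum((m - mcr_mean) ** 2 for m in task_mcrs)
                      / len(task_mcrs))                          # Eq. 9
    return p_mean / (mcr_mean + sigma)                           # Eq. 7

# Hypothetical cluster requiring {100, 300} MFlop/s on a 3-machine LAN:
print(suitability([100.0, 300.0], [500.0, 250.0, 100.0]))  # 1.25 -> suitable (>= 1)
```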

The tasks are arranged in descending order with respect to their computational cost (i.e. the RCR(ni) value), and then they are scheduled onto the suitable LAN's machines by adopting a priority-based policy. The machine onto which a task is scheduled is selected according to the following parameter:

Affinity(ni, mjk) = P^cur_jk / MCR(ni) (10)

The machine mjk is suitable for a task ni if and only if Affinity(ni, mjk) ≥ 1. Every task is scheduled on the machine with the highest affinity rank in the suitable LAN. The allocation of the task

1.  Order clusters Si ∈ S into CPQ queue in decreasing order of MCR(Si) values;
2.  while (!CPQ.empty()) {
3.    Sl = CPQ.removeFirst();
4.    Get first LANi ∈ GGrid such that Suitability(LANi, Sl) ≥ 1 by visiting GGrid from LANu in a CFS fashion;
5.    if (LANi ≠ null)
6.      TBA = SIMPLE-TASKMAP(LANi, Sl);
7.    else
8.      Exit with ERROR;
9.    if (TBA ≠ ∅)
10.     PROXY-TASKMAP(LANi, TBA);
11. }
12. Exit with SUCCESS;

Algorithm 1: LAN-SELECT(S, GGrid)

1.  Order tasks ni ∈ Sl into TPQ queue in decreasing order of MCR(ni) values;
2.  while (!TPQ.empty()) {
3.    ni = TPQ.removeFirst();
4.    Get resource mjk ∈ ANj with the highest value of Affinity(ni, mjk);
5.    if (Affinity(ni, mjk) ≥ 1) {
6.      Allocate task ni onto resource mjk;
7.      P^cur_jk = P^cur_jk − MCR(ni);
8.      if (P^cur_jk = 0)
9.        GGrid.remove(mjk);
10.   } else
11.     Put task ni into TBA queue;
12. }
13. return TBA;

Algorithm 2: SIMPLE-TASKMAP(ANj, Sl)

ni on the machine mjk causes a reduction of P^cur_jk equal to MCR(ni).

Since statistical information is used to select a suitable LAN, the chosen LAN may not be able to host all the tasks inside a cluster: some tasks may not find any suitable machine. To overcome this problem, unallocated tasks are scheduled onto the machines of a LAN as close as possible to the other tasks, in terms of locality and available bandwidth in the network tree.

This process is repeated until all the cluster's tasks are allocated or until the root of the GGrid graph is reached. In the latter case the scheduling algorithm fails, ending without a valid allocation.

In Algorithms 1, 2 and 3, the pseudo-code of the heuristics is shown.

Through the identification of the most communicating regions of GApp, and the exploitation of priority-based allocation techniques, the algorithm is able to reduce both the aging and stealing effects on the resources used.

5 Performance Evaluation

In this section we present the evaluation of the scheduling solutions carried out by WASE. The


1.  Order all ANh brothers of ANi (with h ≠ i) into ANPQ queue in decreasing order of BW(ANi, ANh);
2.  while (!TBA.empty() ∧ !ANPQ.empty()) {
3.    ANh = ANPQ.removeFirst();
4.    TBA = SIMPLE-TASKMAP(ANh, TBA);
5.  }
6.  if (!TBA.empty())
7.    PROXY-TASKMAP(ANi.father(), TBA);
8.  if (!TBA.empty())
9.    Exit with ERROR;
10. Exit with SUCCESS;

Algorithm 3: PROXY-TASKMAP(ANi, TBA)

objective is to investigate the effect of the application computational and communication requirements on reaching the WASE goals. The evaluation was conducted by simulations, applying WASE to a suite of randomly generated task graphs and Grid platforms. Moreover, a more thorough test was conducted by using a real-life rendering application.

5.1 Simulation Environment

To obtain a set of meaningful statistical information to evaluate the proposed heuristics, a large number of simulations were conducted. The following results are first steps in analyzing the behavior and quality of the presented approach. As there is no existing information on parameterizing the algorithm, we assume several parameter sets as first examples. Thus, it cannot be concluded that these parameters are optimal or feasible for other job scenarios. Future work will have to further evaluate the influence of these parameters on other workload and Grid configurations.

The simulation environment used to conduct the tests includes the Cabri-graphs [6], Tiers [14] and Network Weather Service (NWS) [31] tools. Cabri-graphs was used to generate application graphs. It generates connected and undirected graphs (hence compatible with the TIG model) having a number of nodes varying from 2 to 256. We generated TIGs with 8, 16, 32, 64 and 128 nodes with a variable number of edges. Such sizes seem reasonable because most parallel applications follow the "power-of-two" law and their complexity does not exceed 128 nodes [22]. Both the RCRs for the nodes and the RQoCs for the edges were randomly generated from a uniform distribution U(1, 10).

In order to evaluate the effect of the tasks' communication cost on the LAN configurations, the TIGs were clustered by changing the correlation function to obtain 1–3, 4–6 and 7–9 tasks per cluster for each generated TIG. The higher the value of the correlation degree, the higher the communication similarity value, and therefore the greater the number of tasks inside a cluster. Moreover, in each test, each clustered TIG was run adopting four different values of MCR: 25, 50, 75 and 100.

Fig. 3 Percentage of scheduling failures


Fig. 4 Percentage of scheduling failures

The Grid platforms have been generated to satisfy the following constraints:

1. Grids are structured as real wide-area Grids by adopting specific topologies and realistic features for the computational machines and communication links;

2. The Grid topology has to be organized with a hierarchical structure (i.e. WANs, MANs and LANs);

3. The values of the bandwidth and the latency of each link must reflect those of real Grid environments, especially considering the different communication bandwidths among LAN, MAN and WAN networks.

The previous requirements are satisfied by the Tiers tool. Tiers allows the generation of a large number of Grid systems with different topologies, having a number of hosts varying from 16 to 1,024.

The loads of machines and links have been obtained by referring to the traces stored on the Network Weather Service (NWS) [31] web site, where information about several hosts connected to different networks is available.

To run all the tests, 240 application TIGs and 15 Grids were generated. All the 240 TIGs were generated with a random number of nodes (varying from 8 to 128) and a random number of edges. The values representing the task computation and edge communication requirements have been randomly generated as well. We generated Grids with 32, 64, 128, 256 and 512 machines, grouped in LANs with 4, 8 and 16 machines. Then, a value representing the computational power has been associated to each host on the basis of

Fig. 5 Comparison of percentages of scheduling failures


Fig. 6 Average scheduling time

a uniform distribution with values from 600 to 1,300 MFlop/s. In all the tests, about 3,600 trials were executed, and the average values for each test (about 720 trials) are shown.

5.2 Performance Metrics

To validate WASE we implemented a greedy best-fit scheduling algorithm and applied it to every test case. To evaluate the schedules carried out by WASE we exploited different criteria: the percentage of mapping failures, the task-machine affinity, the LAN hit ratio, and the intra-cluster "distance" among tasks allocated on different ani.

The percentage of scheduling failures indicates the ratio of test applications that both algorithms failed to schedule. This can be due to the fact that it was impossible to find a schedule for the application, or that the selected algorithm failed to find one of the existing solutions. In the following criteria, the average values are computed discarding the simulations resulting in scheduling failures.

The task-machine affinity represents how many times the power of the machine on which a task is allocated is greater than the task's minimal computational requirement. Hence, this parameter gives us an accurate estimation of the satisfaction level obtained by WASE in allocating a task.

The LAN hit ratio tells us how many tasks of a cluster are allocated onto the machines of the suitable LAN. According to the WASE hypothesis, the more the LAN hit ratio increases, the more the throughput between communicating tasks increases.

Fig. 7 Comparison of average scheduling times


Fig. 8 Scheduling of TIG (four hosts per LAN)

Finally, the intra-cluster "distance" represents a scheduling distribution used to estimate the reduction of the communication quality among the tasks of a cluster when some of them are not allocated in the suitable LAN.

These results have been obtained running the algorithm on a Pentium 4 processor with a clock frequency of 3.0 GHz and 1 GB of memory.

5.3 Evaluation for Synthetic Grid Scenarios

To evaluate the schedules carried out by WASE, we first evaluate the percentage of scheduling failures. Figures 3 and 4 show the percentage of scheduling failures with respect to the number of tasks per TIG and to the number of tasks per cluster, respectively. As expected, an increase in the number of tasks or in the size of a cluster causes an exponential increase

of the number of failures. In Fig. 5 WASE iscompared with the scheduling carried out by agreedy scheduling algorithm. This algorithm sim-ply orders the task queue in decreasing order ofRCR and the machine queue in decreasing orderof computational power and it performs a best fitscheduling. As can be seen, in the simulations, thepercentage of failures of the greedy algorithms islower than the percentage failure of WASE.
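The greedy baseline described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: tasks and machines are given as hypothetical name-to-value maps, and each machine is treated as an exclusive slot for a single task.

```python
# Sketch of the greedy best-fit baseline: tasks sorted by decreasing
# RCR, machines by decreasing computational power; each task is placed
# on the least powerful free machine that still satisfies it.
# The dict-based interface is an assumption for illustration.

def greedy_best_fit(tasks, machines):
    """tasks: {name: RCR}; machines: {name: power}.
    Returns {task: machine}, or None on a scheduling failure."""
    task_queue = sorted(tasks, key=tasks.get, reverse=True)
    machine_queue = sorted(machines, key=machines.get, reverse=True)
    free = set(machine_queue)
    schedule = {}
    for t in task_queue:
        # Feasible machines, ordered from most to least powerful.
        candidates = [m for m in machine_queue
                      if m in free and machines[m] >= tasks[t]]
        if not candidates:
            return None  # no machine can satisfy this task
        schedule[t] = candidates[-1]  # best fit: weakest feasible machine
        free.remove(schedule[t])
    return schedule
```

The best-fit choice (weakest feasible machine) keeps the most powerful machines free for the most demanding tasks still in the queue, which is why this baseline fails less often than WASE at the cost of a longer search per task.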

In the following evaluations, the average values are computed discarding the simulations resulting in scheduling failures.

Figure 6 shows the average scheduling times of WASE needed for conducting the simulations on the aforementioned machine. The scheduling time grows exponentially with the number of tasks per TIG. In Fig. 7 we compare the average scheduling time of WASE with that obtained by the greedy algorithm on the same simulations. It can be seen

Fig. 9 Average LAN Hit Ratio (WASE)


Fig. 10 Average LAN Hit Ratio (Greedy)

that WASE presents better values than the greedy heuristics. This is mainly due to the fact that the greedy algorithm must find the “best fit” on the queue of machines for each task, while WASE just tries to exploit the user LAN resources.

This is shown in Fig. 8. For realistic Grid systems such as the ones generated by Tiers, WASE was always able to schedule tasks in the first suitable LAN or in LANs belonging to the same MAN, efficiently exploiting the Grid communication resources.

The LAN hit ratio is the percentage of the tasks of a cluster that are allocated into the suitable LAN. Since the allocation of the tasks in the same LAN yields better communication quality, higher values of the LAN hit ratio mean an improvement of the application throughput.

Figures 9 and 10 show the average LAN hit ratio obtained by WASE and by the greedy algorithm, respectively. In such tests we can observe that the LAN hit ratio of WASE is around 90%, while that of the greedy algorithm spans from 20 to 60%. This result demonstrates that WASE is able to allocate the tasks of a cluster in the same LAN, which avoids a degradation of the communication quality.

Finally, Fig. 11 shows a limit of the proposed solution: changing the size of the Grid, the minimum task-machine affinity does not change. This is due to the fact that the algorithm always searches for good resources starting from the user LAN and moving to nearby LANs/MANs. This behavior prevents the algorithm from exploring distant LANs, even if such LANs are able to host a whole cluster

Fig. 11 Minimum Task-Machine Affinity (TIGs with 8, 16, 32, 64 and 128 tasks)


Fig. 12 Graph of the render-encode application

of tasks. Future improvements of our solution are under investigation to find a better compromise between the average scheduling time and the minimum task-machine affinity.

5.4 Evaluation for a Parallel Rendering Application

Figure 12 shows the structure and the RCR and RQoC values of the real application used in the last test: a rendering application structured according to the pipeline program paradigm. The first, sequential stage requests the rendering of a sequence of scenes. The second one is a data-parallel module composed of an emitter, a collector and five workers that render each scene (exploiting the PovRay rendering engine) by interpreting a script describing the 3D model of the objects, their positions and their motion. The third stage collects the images rendered by the second one, and builds Groups of Pictures (GOPs) that are sent to the fourth stage, a data-parallel module composed of two workers performing DivX compression. The last stage collects the DivX-compressed pieces and stores them in an AVI output file. RCR and RQoC values have been collected through a short profiling execution of the application on a reference architecture.
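The stage layout described above can be written down as a task interaction graph (TIG), with one node per task and one edge per communicating pair. The sketch below is a hypothetical encoding: the task names are illustrative, and the profiled RCR/RQoC values from Fig. 12 are deliberately omitted.

```python
# Hypothetical TIG for the render-encode pipeline: stage 1 (scene
# source), stage 2 (render farm: emitter + 5 workers + collector),
# stage 3 (GOP builder), stage 4 (two DivX workers), stage 5 (AVI
# collector). Task names are illustrative, not from the paper.

tasks = (["scene_source"]
         + ["render_emitter"]
         + [f"render_w{i}" for i in range(1, 6)]
         + ["render_collector"]
         + ["gop_builder"]
         + ["divx_w1", "divx_w2"]
         + ["avi_collector"])

edges = ([("scene_source", "render_emitter")]
         + [("render_emitter", f"render_w{i}") for i in range(1, 6)]
         + [(f"render_w{i}", "render_collector") for i in range(1, 6)]
         + [("render_collector", "gop_builder")]
         + [("gop_builder", w) for w in ("divx_w1", "divx_w2")]
         + [(w, "avi_collector") for w in ("divx_w1", "divx_w2")])
```

Under this encoding the application has 12 tasks and 16 communication edges; the render farm forms the densely connected region that the clustering phase would tend to keep inside a single LAN.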

Figure 13 shows the testbed exploited for the algorithm evaluation. It is a Grid composed of 23 heterogeneous computational resources (seven workstations and two clusters) distributed over three LANs connected through 1 Gb/s optical links. The nominal computational power of each machine is listed in Table 1. The average initial load on each one has been set to 30%.

We have applied the clustering algorithm with three different similarity thresholds to obtain

Fig. 13 Graph of the real Grid testbed


Table 1 Computational power of resources in the Grid testbed

Machine      Pi (Mflop/s)    Machine    Pi (Mflop/s)
rubentino       1,000        w01            650
sangiovese        950        w02            625
verduzzo        1,000        w03            700
n01             1,350        w04            650
n02               950        w05          1,250
n03               950        w06          1,300
n04               950        w07          1,100
n05               950        w08            700
n06               950        fermi        1,350
n07               950        galileo      1,350
n08               950        rubbia       1,300
                             segre        1,325

different clusters, as shown in Fig. 14. Then, WASE has been applied with three different MCR values. In all cases, we have considered ISTI as the user LAN. The allocation results, obtained as a function of the number of clusters, are shown in Table 2.

We can observe that an increase of the MCR value causes a decrease in the average task-machine affinity. This is due to the increase of the computational power requested by the tasks within a cluster. Moreover, the more the number of clusters increases, the more the inter-LAN communication increases (more clusters, more edges on the cuts). This is a kind of optimality parameter for the results of the allocation phase. In fact, let us consider the clustering result as the best one for the application. If the scheduling phase is able to schedule every single cluster completely

Fig. 14 Different clustering results of the test application


Table 2 Allocation results for the use case application

              MCR   LAN ISTI   LAN CNIT   LAN IET   Average    Inter-LAN comm.    Inter-LAN comm.
                                                    affinity   (clustering) (%)   (scheduling) (%)
3 clusters     50      14          0          0      13.35          0.12               0.00
              100      14          0          0       5.63                             0.00
              150       8          1          5       4.32                            20.81
5 clusters     50      14          0          0      13.93          3.98               0.00
              100      14          0          0       6.72                             0.00
              150       6          1          7       4.48                            24.48
10 clusters    50      14          0          0      14.75         30.27               0.00
              100      14          0          0       7.00                             0.00
              150       9          1          4       4.82                            26.04

onto a single LAN, we cannot obtain an inter-LAN communication value worse than the one produced by the clustering phase. During the scheduling phase, the inter-LAN communication values are influenced by two effects:

1. A single cluster may be too “large” to be scheduled on a single LAN, so it must be split across several LANs. This happens with few, big clusters, and this effect worsens the final inter-LAN communication value;

2. Several clusters may be scheduled on the same LAN, if there are available resources, improving the final inter-LAN communication value.

Both effects are visible in the test results. In the case of three clusters, the scheduling algorithm is forced to split the big cluster across several LANs, while in the case of 10 clusters the heuristics is able to completely allocate the largest cluster on a single LAN and to allocate several small clusters on the same LANs.
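The inter-LAN communication percentage can be computed from an allocation as the share of the communication volume exchanged between tasks placed in different LANs. A minimal sketch, assuming hypothetical edge-volume and allocation maps:

```python
# Sketch of the inter-LAN communication metric: the percentage of
# total communication volume crossing LAN boundaries. Edge weights
# and the allocation map are hypothetical illustrations.

def inter_lan_communication(edges, allocation):
    """edges: {(task_a, task_b): volume}; allocation: {task: lan}.
    Returns the percentage of volume between tasks in different LANs."""
    total = sum(edges.values())
    crossing = sum(v for (a, b), v in edges.items()
                   if allocation[a] != allocation[b])
    return 100.0 * crossing / total if total else 0.0
```

The 0.00% scheduling values in Table 2 correspond to allocations where every communicating pair ends up in the same LAN, so the crossing volume is zero.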

6 Conclusions and Future Work

In this paper a new heuristics to schedule parallel applications on Grid platforms has been proposed. The heuristics is oriented to large computational Grids made up of dynamic, non-dedicated resources, generally known as wide-area Grids. The allocation of a parallel application is conducted so as to satisfy the minimal computational requirements (soft QoS) of its tasks, and to exploit intra-LAN communications in order to maximize the throughput between communicating tasks. A further goal of the heuristics is to quickly allocate a task by evaluating on-the-fly the resource availability, minimizing the aging effect on the resource state. The heuristics exploits a technique based on three elements: task computation and communication costs, clustering of the application graph to find regions with high communication costs, and a dendrogram representing a Grid whose resources are hierarchically grouped in LANs, MANs and WANs. The allocation strategy is carried out by exploiting a priority-based technique to satisfy the


task minimum computational power requirement. Furthermore, in order to maximize the throughput between communicating tasks, the heuristics tries to allocate the tasks of a given cluster onto the machines within a single LAN, where the best communication quality is assumed.

In order to evaluate the quality of the scheduling algorithm, several tests were conducted by simulating the execution of a large number of parallel applications on a number of randomly generated Grids. Moreover, a further test was conducted by using a real application on a real Grid.

The performance analysis highlights that the scheduling quality is satisfactory even when the number of application tasks and the number of hosts increase.

The presented evaluation examples indicate that the WASE algorithm is well suited for Grid use. However, in future work more thorough simulations and evaluations should be conducted to further analyze the behavior and parametrization for more Grid structures and application examples.

Moreover, the algorithm can be improved in several directions. A first development is to refine the suitability function to better choose the LAN suitable for a cluster. This improvement will reduce the allocation of tasks belonging to the same cluster onto different LANs. Another development is the addition of a soft QoS based on a minimum requirement on the link bandwidth, in order to minimize the communication times requested for the data transfers between two interacting tasks.

Finally, the last improvement is the support of the dynamic spawning of new tasks by a running application. In fact, a large number of parallel applications are able, at run time, to carry out dynamic load balancing by creating new components or by splitting a big component into several sub-components. With this support the functionalities of the heuristics could be extended to new classes of parallel applications.

Acknowledgements This work has been supported by: the Italian MIUR FIRB Grid.it project, No. RBNE01KNFP, on High-performance Grid platforms and tools, and the European CoreGRID NoE (European Research Network on Foundations, Software Infrastructures and Applications for Large Scale, Distributed, GRID and Peer-to-Peer Technologies, contract no. IST-2002-004265).
