
Journal of Systems Architecture 52 (2006) 105–116


The master–slave paradigm on heterogeneous systems: A dynamic programming approach for the optimal mapping

Francisco Almeida, Daniel González, Luz Marina Moreno

Dpto. Estadística, I.O. y Computación, La Laguna University, La Laguna, Tenerife 38271, Spain

Received 30 June 2004; accepted 20 October 2004; available online 23 March 2005

doi:10.1016/j.sysarc.2004.10.005

Abstract

We study the master–slave paradigm on heterogeneous systems. Building on an analytical model, we develop a dynamic programming algorithm that computes the optimal mapping for this paradigm. Our proposal considers heterogeneity due both to computation and to communication. The optimization strategy used also yields the set of processors for an optimal computation. The computational results show that taking heterogeneity in communication into account as well increases the performance of the parallel algorithm.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Master–slave; Heterogeneous system; Optimal mapping; Dynamic programming

This work has been partially supported by the EC (FEDER) and the Spanish MCyT (Plan Nacional de I+D+I, TIC2002-04498-C05-05 and TIC2002-04400-C03-03).

1. Introduction

Cluster systems are playing an important role in High Performance Computing due to the low cost and availability of networks of commodity PCs. A typical resource pool comprises heterogeneous commodity PCs and complex servers. The standard parallel libraries make it easy to port parallel programs to such clusters, which has also contributed to the growing adoption of these architectural solutions. However, a disappointing handicap appears in the behavior of the real running time of the applications. Most programs have been developed under the assumption of a homogeneous target architecture, and this assumption has also been followed by most analytical and performance models [1,2]. These models cannot be applied directly to heterogeneous environments.

In this paper we revisit the mapping problem for master–slave algorithms in this new heterogeneous context. Some authors [3] address the same problem but consider heterogeneity due only to the difference in processing speeds of the workstations. However, it is a well-established fact that heterogeneity in workstation speed can have a significant impact on the send/recv communication overhead [4], even when the communication capabilities of the processors are the same. Our proposal considers heterogeneity due both to computation and to communication, which provides a very general view of the system and a wide range of applicability.

The master–slave paradigm has been extensively studied and modeled on homogeneous architectures [5–8]. However, modeling this paradigm on heterogeneous platforms is still an open question. A very good study of the scheduling of parallel loops at compile time on heterogeneous systems was presented in [9]. The authors cover both homogeneous and heterogeneous systems, and both regular and irregular computations, giving optimal or near-optimal solutions. An analytical model and a strategy for an optimal distribution of tasks in a master–slave paradigm were presented in [10]; however, the accuracy of the predictions made with their model is still to be proved. In [3] the authors provide effective models in the framework of heterogeneous computing resources for the master–slave paradigm. They formulate the problem and approach different variants, but they consider heterogeneity due only to differences in the processing speed of the workstations. In [11] a very elegant structural model was presented to predict the performance of master–slave algorithms on heterogeneous systems. The model is quite general and covers most of the situations appearing in practical problems. The structural model can be viewed as a skeleton in which the models for master–slave applications with different implementations or environments must be instantiated. The main drawback of the approach is that instantiating the model for particular cases is not a trivial task, and it does not find an optimal distribution of work among the processors. In [12] we presented an analytical model that predicts the execution time of master–slave algorithms on heterogeneous systems. The model includes heterogeneity due both to computation and to communication. Our proposal covers some of the variants studied in [3,11]. The computational results prove the efficiency of our scheme and the accuracy of the predictions.

The scheduling problem for master–slave computations has been extensively studied in many of the aforementioned works; however, most of them consider the problem of distributing work over a fixed set of processors. We claim that to solve the optimal scheduling problem, the set of processors must also be considered a parameter of the minimization problem. The predictive ability of the model that we proposed in [12] constitutes the basis for further studies of the master–slave paradigm. The main contribution of this work is the development of an optimal mapping strategy for master–slave computations based on the analytical model introduced in [12]. In particular, this analytical model allows us to formulate the problem as a resource optimization problem, a problem that has been studied in depth [13]. According to this formulation we derive strategies for the optimal assignment of tasks to processors. Our proposal improves on the classical approach of assigning tasks of size proportional to the processing power of each processor, and includes the elements needed to obtain the set of processors for an optimal computation.

The paper is organized as follows. Section 2 introduces the heterogeneous network model and some notation used throughout the paper. Section 3 specifies the problem and describes an inductive analytical model that predicts the running time of a master–slave program on a heterogeneous cluster of PCs for a given distribution of work. This analytical model is validated in Section 4. The problem of finding the optimal distribution of work among the processors (the mapping problem) is studied in Section 5. Based on the analytical model presented in Section 3, we propose a general approach that formulates the mapping problem as a particular instance of the Resource Allocation Problem (RAP). We solve this optimization problem using a dynamic programming algorithm, which also allows us to solve the problem of finding the optimal number of processors. The computational results presented in Section 6 prove that considering heterogeneity in the communication capabilities increases the quality of the predictions. The paper finishes in Section 7 with some concluding remarks and future lines of work.

2. Heterogeneous network model

A Heterogeneous Network (HN) can be abstracted as a connected graph HN(P, IN) [14] where:

• P = {P1, . . . , Pp} is the set of heterogeneous processors (p is the number of processors). The computation capacity of each workstation is determined by the power of its CPU, I/O and memory access speed.

• IN is a standard interconnection network for the processors, where the interconnection links between any pair of processors may have different bandwidth.

Let M be the problem size, expressed as the number of basic operations needed to solve the problem, let $T_{s_i}$ be the response time of the serial algorithm on processor i, and let $T_i$ be the time elapsed between the beginning and the end of the parallel execution on processor i. The parallel run time $T_{par} = \max_{i=1}^{p} T_i$ is the time that elapses from the instant a parallel computation starts until the last processor finishes the execution [15].

The average Computational Power ($CP_i$) of processor i in a heterogeneous system can be defined [16] as the amount of work finished during a unit time span on that processor. CP depends on the node's physical features but also on the particular algorithm implementation that is actually being processed. The average Computational Power of processor i under a specific workload M can be expressed as $CP_i = M / T_{s_i}$. From a practical point of view, the average Computational Power of a set of processors can be calculated by executing a sequential version of the algorithm under consideration on each of them, using a problem size large enough to prevent estimation errors due to cache memory effects.
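As a concrete illustration (ours, not from the paper), the following C sketch estimates the Computational Power of a node by timing one sequential run; solve_sequential() is a hypothetical stand-in for the real serial code, e.g., the O(m^3) matrix product used later.

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the real sequential algorithm under study;
 * here just a dummy O(m) loop so the sketch is self-contained. */
static volatile double sink;
static void solve_sequential(long m) {
    double acc = 0.0;
    for (long i = 0; i < m; i++) acc += (double)i * 0.5;
    sink = acc;
}

/* CP_i = M / Ts_i: work units finished per unit time on this node. */
static double computational_power(long m_ops) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    solve_sequential(m_ops);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ts = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (double)m_ops / ts;
}

int main(void) {
    printf("CP = %.3e ops/s\n", computational_power(100000000L));
    return 0;
}

In practice the dummy loop would be replaced by the actual serial kernel, run with a problem size large enough to defeat cache effects, as recommended above.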

Several authors provide definitions for the speedup of a heterogeneous network [17]; we follow the definition $S_p = \left(\min_{i=1}^{p} T_{s_i}\right) / T_p$, in which the speedup is calculated using the fastest processor of the network as the reference processor [18].

The cost of a communication on the target architecture is an important factor to be considered. For the sake of simplicity it is commonly assumed that the cost of a communication can be modeled by the classical linear model $\beta + \tau w$, where $\beta$ and $\tau$ stand for the latency and the per-word transfer time respectively, and w represents the number of bytes to be sent. This communication cost involves the time for a processor to send data to a neighbor and for it to be received. Although according to the LogP cost model some other parameters should be considered, it is also a commonly accepted fact that, due to current routing techniques and to the small diameter of the network, the inter-processor distance can be neglected [19]. In a heterogeneous network, where the communication capabilities may differ among processors, the parameters $\beta$ and $\tau$ must be computed independently for each pair of processors.
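To make the per-pair parameters concrete, here is a minimal C sketch of the linear cost model (ours, not the paper's); the beta and tau tables are placeholders that would be filled with values measured per processor pair, e.g., by the ping-pong experiment of Section 4.

#define NPROC 4   /* illustrative cluster size */

/* Per-pair parameters, to be filled with measured values. */
static double beta[NPROC][NPROC];  /* latency, in seconds */
static double tau[NPROC][NPROC];   /* per-byte transfer time, in s/byte */

/* Predicted time to send w bytes from processor src to processor dst. */
static double comm_cost(int src, int dst, long w) {
    return beta[src][dst] + tau[src][dst] * (double)w;
}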

3. The master–slave paradigm

Under the master–slave paradigm it is assumed that the work W, of size m, can be divided into a set of p independent tasks work1, . . . , workp, of arbitrary sizes $m_1, \dots, m_p$ with $\sum_{i=1}^{p} m_i = m$, that can be processed in parallel by the slave processors 1, . . . , p. We abstract the master–slave paradigm by the following code:

/* Master processor */
for (proc = 1; proc <= p; proc++)
    send(proc, work[proc]);
for (proc = 1; proc <= p; proc++)
    receive(proc, result[proc]);

/* Slave processor */
receive(master, work);
compute(work, result);
send(master, result);
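One possible MPI realization of this skeleton is sketched below; it is our illustration, not the paper's code. Work items are arrays of doubles, the task sizes m[i] are assumed to come from the mapping strategy of Section 5, and the doubling loop stands in for compute().

#include <mpi.h>
#include <stdlib.h>

/* Master: distribute work[1..p], then collect result[1..p], matching
 * the two exclusive-access phases of the skeleton. */
void master(int p, const int *m, double **work, double **result) {
    for (int proc = 1; proc <= p; proc++)
        MPI_Send(work[proc], m[proc], MPI_DOUBLE, proc, 0, MPI_COMM_WORLD);
    for (int proc = 1; proc <= p; proc++)
        MPI_Recv(result[proc], m[proc], MPI_DOUBLE, proc, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* Slave: receive a task of unknown size (up to max_m), compute, reply. */
void slave(int max_m) {
    double *work = malloc(max_m * sizeof *work);
    double *result = malloc(max_m * sizeof *result);
    MPI_Status st;
    int n;
    MPI_Recv(work, max_m, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &st);
    MPI_Get_count(&st, MPI_DOUBLE, &n);      /* actual task size */
    for (int i = 0; i < n; i++)
        result[i] = 2.0 * work[i];           /* compute() stand-in */
    MPI_Send(result, n, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    free(work);
    free(result);
}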

The total amount of work W is assumed to be initially located at the master processor, processor p0. The master, according to an assignment policy, divides W into p tasks and distributes worki to processor i. After the distribution phase, the master processor collects the results from the processors: processor i computes worki and returns the solution to the master. The master processor and the p slaves are connected through a network, typically a bus, that can be accessed only in exclusive mode. At a given time-step, at most one processor can communicate with the master, either to receive data from the master or to send solutions back to it. The results are returned from the slaves in a synchronous round robin.

We will denote by Ci the computing time of task i on processor i. In the context of heterogeneous systems, Ci depends on the size of the task and on the Computational Power of the processor to which it is assigned.

Since the heterogeneity is due to differences in speeds, both in computation and in communication, not only are the computing times different, but the transmission times from/to the master processor may also differ among the slaves. The transmission time from the master processor to slave processor i will be denoted by $R_i = \beta_i + \tau_i w_i$, where $w_i$ stands for the size in bytes associated with worki, and $\beta_i$ and $\tau_i$ represent the latency and per-word transfer time respectively. This communication cost involves the time for the master to send data to a slave and for it to be received. The latency and transfer time may be different for every master–slave combination, and they must be calculated separately. The transmission time of the solution from the slave to the master is given by $S_i = \beta_i + \tau_i s_i$, where $s_i$ denotes the size in bytes of the solution associated with task worki. Again, this communication cost involves the time for a slave to send the results to the master and for them to be received. The definitions of Ri and Si allow different sizes for the tasks and for the results generated. We will denote by Ti the time needed by processor i to finish its processing. Ti includes the time to compute the task, the time to receive it (Ri), the time to send the solution back (Si), and the times for processors j = 1, . . . , i - 1 to receive their tasks, R1, . . . , Ri-1.

The master–slave paradigm has been extensively studied and many scheduling techniques have been proposed, both for irregular computations and for heterogeneous architectures. A very good work on this topic can be found in [9]. The authors cover most of the situations on homogeneous and heterogeneous platforms, for regular and irregular computations, but they do not consider scheduling for contention avoidance on heterogeneous platforms. A scheduling strategy that solves this case is presented in [10]; it can be considered a natural extension of the work developed in [9]. They consider that the optimal scheduling is obtained when Ci + Si = Ri+1 + Ci+1 (Fig. 1—Left). The underlying ideas are the overlapping of computation and communication and a load balance among the processors. The static schedule is obtained through the resolution of a linear system of p - 1 equations like the former; the system is written out below.
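Written out explicitly (our restatement of the balance condition above, with the task sizes $m_1, \dots, m_p$, on which $C_i$, $R_i$ and $S_i$ depend, as the unknowns), the system is:

$$
\begin{aligned}
C_i + S_i &= R_{i+1} + C_{i+1}, \qquad i = 1, \dots, p - 1,\\
\sum_{i=1}^{p} m_i &= m.
\end{aligned}
$$

When $C_i$, $R_i$ and $S_i$ are linear in $m_i$, this is a linear system of p equations in the p unknowns $m_1, \dots, m_p$.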

However, when the number of processors is large enough, the former strategy may generate non-optimal mappings. In Fig. 1 (Right) we can see how a solution to the former system of equations yields a non-optimal mapping due to communication contention. The former solution assumes that the first processor will finish its computation after the last processor has started to work. Fig. 2 (Left) shows the corrected timing diagram and a corrected mapping where contention avoidance is considered. The wall clock time is reduced using five processors and a different distribution of work (Fig. 2—Right). However, Fig. 3 represents the optimal mapping for the considered instance: it shows that using the best distribution on five processors is slower than the optimal schedule with four processors.

Fig. 1. Timing diagram: Ci + Si = Ri+1 + Ci+1. Left—A perfect load balanced computation. Right—Wrong timing diagram; the computation is well balanced, but in the dark area Si cannot overlap Ri.

Fig. 2. Left—corrected timing diagram. Right—improved schedule.

Fig. 3. The optimal schedule uses 4 processors.

We claim that to obtain the optimal mapping, not only the size of the problem should be considered: the number of processors is also an important minimization parameter. In [12] we presented an analytical model for heterogeneous systems that considers both parameters to generate the optimal mapping. This analytical model can be considered a particular instantiation of the structural model described in [11]. According to our model, the running time for the whole computation can be obtained as Tpar = Tp, which can be inductively calculated as

$$
\begin{aligned}
T_p &= T_{p-1} + S_p + \max(0, d_p)\\
    &= T_{p-2} + S_{p-1} + S_p + \max(0, d_{p-1}) + \max(0, d_p)\\
    &= T_1 + \sum_{i=2}^{p} S_i + \sum_{i=2}^{p} \max(0, d_i)\\
    &= R_1 + C_1 + \sum_{i=1}^{p} S_i + \sum_{i=1}^{p} \max(0, d_i)
\end{aligned}
$$

where $d_1 = 0$ and, for $i > 1$,

$$
d_i = \begin{cases} g_i & \text{if } d_{i-1} \ge 0\\ g_i - |d_{i-1}| & \text{otherwise} \end{cases}
\qquad \text{with} \qquad
g_i = (R_i + C_i) - (C_{i-1} + S_{i-1}).
$$

$d_i$ holds the accumulated delay between the first $i - 1$ processors and processor $i$, and $g_i$ stands for the gap between processor $i - 1$ and processor $i$.

Fig. 4 illustrates the situation with three processors: $g_3 = (R_3 + C_3) - (C_2 + S_2)$ denotes the gap between slaves 2 and 3,

$$
d_3 = \begin{cases} g_3 & \text{if } d_2 \ge 0\\ g_3 - |d_2| & \text{otherwise} \end{cases}
$$

is the accumulated delay, and the total running time is $T_3 = T_2 + S_3 + \max(0, d_3)$.

The next constraint of the model ensures that no overlap between the task sending and the data collection is produced:

$$\sum_{i=2}^{p} R_i \le C_1$$

Fig. 4. Timing diagram: heterogeneous master–slave, p = 3. Up—the slowest processor is the second one. Down—the slowest processor is the last one.
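As a minimal sketch (ours, under the assumption that the per-slave times R[i], C[i] and S[i] have already been obtained from the measured parameters), the inductive model translates directly into C:

#include <math.h>

/* Predicted parallel time T_p for 1-based arrays R, C, S of length p+1,
 * following the recurrence T_i = T_{i-1} + S_i + max(0, d_i), d_1 = 0. */
double predicted_tpar(int p, const double *R, const double *C, const double *S)
{
    double T = R[1] + C[1] + S[1];                     /* T_1 */
    double d_prev = 0.0;                               /* d_1 */
    for (int i = 2; i <= p; i++) {
        double g = (R[i] + C[i]) - (C[i-1] + S[i-1]);      /* gap g_i */
        double d = (d_prev >= 0.0) ? g : g - fabs(d_prev); /* delay d_i */
        T += S[i] + fmax(0.0, d);
        d_prev = d;
    }
    return T;
}

/* The model applies only when the sends do not outlast the first
 * computation: sum_{i=2..p} R_i <= C_1. */
int model_constraint_holds(int p, const double *R, const double *C)
{
    double sum = 0.0;
    for (int i = 2; i <= p; i++)
        sum += R[i];
    return sum <= C[1];
}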

The following sections validate and prove the effectiveness of our proposal. The optimal distribution of work among the set of processors should be such that the analytical complexity function, Tpar = Tp, is minimized. An analytical approach to minimizing this function becomes extremely difficult; instead, we make an algorithmic proposal to obtain the minimum.

4. Validating the analytical model

In this section we develop a computational experiment to validate our analytical complexity function. As a test problem we consider a typical academic example used throughout the literature: the master–slave approach to the matrix product. The sequential algorithm implemented is the basic $O(m^3)$ one. For the parallel matrix product A × B we have considered square matrices of size m; matrix B has been previously broadcast to the whole set of slaves, and $m_i$ rows of matrix A are sent to processor i, with $\sum_{i=1}^{p} m_i = m$.

i¼1mi ¼ m.The computational experience has been devel-

oped under a cluster of PCs. The cluster has two

groups of processors, the first one is composed

by four AMD Duron (tm) 800 MHz PCs with

256 MB memory. These machines are referred as

fast nodes (F). The second group of processors,

the slow nodes (S), contemplates six AMD-K6

(tm) 500 MHz PCs with 256 MB memory. All

Table 1
Accuracy of the model: m = 950, m = 1500

p  Configuration        mFast  mSlow  RealTime  Model-Predicted  Error
m = 950
3  F F S                475    475    214.99    221.15           -0.029
3  F F S                634    316    143.02    147.28           -0.0298
5  F F F S S            238    237    109.94    111.18           -0.011
5  F F F S S            317    158    73.53     74.98            -0.020
9  S F F F F S S S S    113    112    45.39     47.45            -0.045
9  S F F F F S S S S    150    75     30.44     32.71            -0.075
m = 1500
3  F F S                750    750    852.29    884.90           -0.038
3  F F S                1000   500    568.41    592.73           -0.043
5  F F F S S            375    375    442.76    445.09           -0.005
5  F F F S S            500    250    294.88    299.47           -0.016
9  S F F F F S S S S    188    187    220.33    224.25           -0.018
9  S F F F F S S S S    250    125    147.64    153.37           -0.039

p is the number of processors and Configuration shows the configuration of processors selected. mFast represents the work assigned to the fast processors, mSlow the work assigned to the slow processors; RealTime shows the experimental time in seconds and Model-Predicted the values obtained with the analytical model for the selected distribution.

Table 2
Sequential running time ratio Slow/Fast (S/F)

Size  100   150   200   250   300   1800  1900
S/F   2.37  1.92  2.07  1.75  1.89  2.17  1.79

The sequential algorithm has been executed for square matrices of different sizes.

Table 3
Running times of the sequential algorithm on the fastest machine

m                900    1000   1500
Sequential time  71.65  98.78  616.10

All the machines run the Debian Linux operating system. They are connected through a 100 Mbit/s Fast Ethernet switch.

To characterize the Computational Power of the machines, we executed the sequential algorithm on the different nodes. Tables 2 and 3 show the ratios among these executions for several instances. A ratio of 2 is considered the average value. The heterogeneity of the cluster is clearly stated through the Computational Power. This value is employed in proportional data distribution layouts.

The analytical model presented in Section 3 uses architecture-dependent parameters such as the computing time (Ci for processor i) and the parameters related to the communication network (latency and bandwidth). Since Ci depends on the Computational Power of the processors, in order to validate our proposal, a wider execution of the serial matrix product algorithm on the different types of processors was performed. Fig. 5 plots these running times when varying the size of the matrices; Ci is estimated from the sequential execution for several sizes.

Fig. 5. The computing time Ci.

We have also characterized point-to-point communications between the various types of processors in the heterogeneous network. The blocking MPI_Send and MPI_Recv calls were used in a ping-pong experiment. The linear fit of the sending overhead is plotted as a function of the message size for combinations of both the slow and fast nodes of our testbed architecture. Since the parameters are measured using a ping-pong, the values obtained for the Fast–Slow and Slow–Fast combinations are considered equal. Fig. 6 shows the asymptotic effect of the parameters $\beta$ and $\tau$.

Fig. 6. Linear fit of the transfer time for point-to-point communication using different types of processors. S = slow processor, F = fast processor.

Table 1 collects the computational results of the experiment developed to validate our model. We measure the real time in seconds (RealTime) obtained by an execution of the parallel algorithm, the time predicted by the model (Model-Predicted) and the relative error made by the prediction, for different data distribution layouts using various configurations of fast/slow processors. We have considered configurations involving 2, 4 and 8 slaves. The column labeled Configuration shows the configuration of processors selected: F stands for a fast processor and S for a slow processor, and the first processor of the configuration is the master. In Table 1, mFast denotes the number of rows assigned to a fast processor and mSlow the number of rows assigned to a slow processor. Two series of experiments were carried out. The first one considers mFast = mSlow. The second one assumes that mi should be proportional to the processing speed and assigns mFast ≈ 2mSlow, since we have experimentally verified that, for the problem considered, this is the ratio between the processing speeds of the fast and slow processors. The accuracy of the model is confirmed by the very low error made in most of the cases: the relative error is never greater than 7.5%, reached using eight slaves on the proportional distribution of the problem with size m = 950.

5. An exact method for the optimal mapping

The predictive capabilities of the analytical model presented in Section 3 can be used to predict the parameters of an optimal execution, i.e., the optimal amount of work to be assigned to each slave processor so as to provide the minimum execution time. To be more precise, we have to find sizes $m_1, \dots, m_p$ for the tasks work1, . . . , workp such that the analytical complexity function Tp reaches its minimum. This section describes an exact method that solves the general optimization problem.

We now formulate the problem of finding the optimal distribution of work as a Resource Allocation Problem, and we solve this resource allocation problem through a dynamic programming algorithm with complexity $O(pm^2)$.

The classical Resource Allocation Problem [20,13] asks to allocate limited resources to a set of activities so as to maximize their effectiveness. The simplest form of the problem involves only one type of discrete resource. Assume that we are provided with M units of an indivisible resource and N activities, and that for each activity i the benefit (or cost) function $f_i(x)$ gives the income (or penalty) obtained when a quantity x of the resource is allocated to activity i. Mathematically it can be formulated as

$$\min \sum_{i=1}^{N} f_i(x_i) \qquad \text{s.t.} \quad \sum_{i=1}^{N} x_i = M \ \text{ and } \ a \le x_i$$

where a is the minimum number of resource units that can be allocated to any activity.

The schedule of a master–slave computation can be managed as a Single Resource Allocation Problem. We consider that the units of resource in the former problem are now the sizes of the subproblems, $m_i$, that can be assigned to processor i, and that the activities are represented by the processors, i.e., the number of activities N equals the number p of processors, N = p. The assignment of one unit of work to a processor means assigning a task of size one. The cost functions can be expressed in terms of the analytical complexity function, $f_1 = T_1$ and $f_i = S_i + \max(0, d_i)$, so that $\sum_{i=1}^{p} f_i = T_p$. The optimization problem then becomes

$$\min T_{par} = T_p \qquad \text{s.t.} \quad \sum_{i=1}^{p} m_i = m \ \text{ and } \ 0 \le m_i$$

That is, we propose to minimize the total running time of the parallel algorithm, Tpar, on the p-processor machine, while ensuring that the total amount of work, $\sum_{i=1}^{p} m_i = m$, has been distributed among the set of processors. A solution to this optimization problem provides an optimal schedule for the master–slave computation. The assignment of a value $m_i = 0$ to processor i implies that including processor i would not reduce the running time of the computation.

We now present the dynamic programming strategy to solve the optimization problem. Dynamic programming is an important technique used to solve combinatorial optimization problems. The underlying idea of dynamic programming [20] is to avoid calculating the same thing twice, usually by keeping a table of known results that fills up as the subproblems are solved. It is a bottom-up technique: it usually starts with the smallest, and hence the simplest, subproblems, and by combining their solutions the answers to subproblems of increasing size are obtained until finally the solution of the original instance is computed. Dynamic programming is an algorithm design method that can be used when the solution to a problem can be viewed as the result of a sequence of decisions; the method enumerates all decision sequences and picks out the best.

Let us denote by G[i][x] the minimum running time obtained when using the first i processors to solve a subproblem of size at most x, for i = 1, . . . , p and x = 1, . . . , m. The optimal value for the general master–slave problem is represented by G[p][m], i.e., solving the problem of size m using p processors. The principle of optimality holds for this problem [20,21], and the following recurrence supplies the values for G[i][x]:

$$
\begin{aligned}
G[i][x] &= \min\{\, G[i-1][x-j] + f_i(j) \;:\; 0 \le j \le x \,\}, \qquad i = 2, \dots, p,\\
G[1][x] &= f_1(x), \qquad 0 < x \le m, \quad \text{and}\\
G[i][x] &= 0, \qquad i = 1, \dots, p, \; x = 0.
\end{aligned}
$$

The problem can be solved by a very simple multistage dynamic programming algorithm in time $O(pm^2)$, which can easily be parallelized following the approach presented in [21] with complexity $O(p + m^2)$ using p processors. Note that no particular assumption has been made about the type of processor considered: the formulation obtained is general, and the complexity of the dynamic programming algorithm is the same if the heterogeneous system is composed of processors with completely different computational capabilities. To obtain the optimal mapping for a given problem and environment, the user just needs to run this algorithm using as input parameters the set of processors to be considered and the architectural and problem-dependent constants. A sketch of the algorithm is given below.
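The C sketch below (ours) implements the recurrence in $O(pm^2)$ time, under the assumption that the cost table f[i][j], the time contributed by processor i when it receives j units, with f[i][0] = 0, has been precomputed from the model of Section 3; a backtracking pass recovers the optimal sizes $m_i$.

#include <stdlib.h>
#include <float.h>

/* Returns G[p][m] and fills alloc[1..p] with the optimal m_i.
 * f is 1-based in the processor index: f[i][j] for 1 <= i <= p and
 * 0 <= j <= m, with f[i][0] == 0 (an idle processor costs nothing). */
double optimal_mapping(int p, int m, double **f, int *alloc)
{
    double **G = malloc((p + 1) * sizeof *G);
    int **choice = malloc((p + 1) * sizeof *choice);
    for (int i = 1; i <= p; i++) {
        G[i] = malloc((m + 1) * sizeof *G[i]);
        choice[i] = malloc((m + 1) * sizeof *choice[i]);
    }
    for (int x = 0; x <= m; x++) {        /* base case: one processor */
        G[1][x] = f[1][x];
        choice[1][x] = x;
    }
    for (int i = 2; i <= p; i++) {
        for (int x = 0; x <= m; x++) {
            G[i][x] = DBL_MAX;
            for (int j = 0; j <= x; j++) {     /* give j units to proc i */
                double t = G[i-1][x-j] + f[i][j];
                if (t < G[i][x]) { G[i][x] = t; choice[i][x] = j; }
            }
        }
    }
    double best = G[p][m];
    for (int i = p, x = m; i >= 1; x -= alloc[i], i--)
        alloc[i] = choice[i][x];          /* recover the optimal m_i */
    for (int i = 1; i <= p; i++) { free(G[i]); free(choice[i]); }
    free(G); free(choice);
    return best;
}

Calling optimal_mapping with the full processor set also yields the sensitivity information discussed below, since processors that should be excluded simply end up with alloc[i] == 0.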

An important advantage of the dynamic programming approach is that sensitivity analysis can be applied to the mapping problem, so the effect of allocating a unit of resource to a processor can be analyzed in more depth. The optimal number of processors to be used in an optimal execution can be obtained as a direct application of this sensitivity analysis.

The set of processors to be used in the computation is determined by the amount of work assigned to them: processors receiving an assignment of 0 rows are excluded from the computation. The optimal number of processors, poptimal, can be detected as the number of processors in this set. Assuming that the faster processors should be the ones receiving the largest amounts of work, if the set of processors is arranged according to their computational power (in non-increasing order), poptimal is obtained as the first processor such that processor poptimal + 1 does not receive any resource to compute. Note that in this heterogeneous context, to develop a realistic sensitivity analysis the total set of processors must be considered when computing the dynamic programming table; see Table 6 in Section 6.
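Given the allocation vector produced by the dynamic program (assuming, as above, that the processors were supplied in non-increasing order of computational power), extracting poptimal is a one-pass scan; a small illustrative helper:

/* Number of processors that actually receive work: with processors
 * sorted by non-increasing computational power, this is p_optimal. */
int optimal_processor_count(int p, const int *alloc)
{
    int used = 0;
    for (int i = 1; i <= p; i++)
        if (alloc[i] > 0)
            used++;               /* m_i = 0 means processor i is unused */
    return used;
}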

6. Computing the optimal mapping

In this section we develop a computational experiment to verify that the assignment strategies proposed by our technique introduce a significant improvement. We contrast the running time obtained when using the distribution of work provided by our dynamic programming procedure against distributions of work that consider only the computational power of the processors.

We again use as test problem the master–slave approach to the matrix product. Square matrices of sizes m = 900, 1000, 1500 have been considered. The speedups have been calculated using the basic $O(m^3)$ algorithm executed on the fastest machine, even when slow processors are introduced in the parallel computation: the speedup is the ratio of the serial run time on the fastest machine to the time taken by the parallel algorithm on the set of processors considered.

A schedule that does not consider heterogeneity due to differences in communication capabilities would use a distribution of work proportional to the processing speeds of the processors. Section 4 established that, for the matrix product problem on our cluster, the Computational Power of a fast machine is approximately a factor of two times the Computational Power of a slow machine.

Table 4 shows the running time of the parallel algorithm for the proportional and the optimal distribution (columns labeled PTime and OTime respectively) and the execution time of the program that computed the optimal distribution (column labeled Schedule Time). Table 5 shows the sizes of the subproblems assigned to the fast processors (mFast) and to the slow processors (mSlow) for the proportional and the optimal distribution (columns labeled PSize and OSize respectively).

Table 6
Sensitivity analysis for a small problem, m = 100

Processors         Distribution                        Predicted Time
10 Fast            mFast = 10                          0.03891
10 Fast, 1 Slow    mFast = 10, mSlow = 0               0.03891
10 Fast, 10 Slow   mFast = 9, mSlow1 = mSlow2 = 5,     0.03869
                   mSlowi = 0, i = 3, . . . , 10

Fig. 7. Relative improvements obtained for different sizes of problems. Data have been extracted from the Relative Improvement column of Table 4.

Table 4
Running times of the parallel algorithm

m     PTime   OTime   Relative Improvement (%)  Schedule Time
p = 3, Configuration: F F S
900   116.45  79.27   31.93                     0.87
1000  161.43  105.01  34.95                     1.08
1500  576.02  448.09  22.21                     2.49
p = 5, Configuration: F F F S S
900   60.21   46.81   22.26                     2.62
1000  85.54   63.43   25.85                     3.23
1500  287.89  217.19  24.56                     7.37
p = 7, Configuration: F F F F S S S
900   40.64   34.12   16.04                     4.37
1000  57.63   44.81   22.26                     5.40
1500  179.38  148.01  17.49                     12.32
p = 9, Configuration: S F F F F S S S S
900   29.85   26.69   10.59                     6.11
1000  43.43   35.73   17.73                     7.55
1500  146.32  109.92  24.88                     17.20

The percentage of relative improvement achieved is also presented (Relative Improvement in Table 4 and Fig. 7) and has been calculated as $\frac{PTime - OTime}{PTime} \times 100$. For all the sizes considered, the optimal distribution obviously presents the better behavior; a relative improvement of 34.95% is observed for the problem m = 1000 using p = 3 processors.

Table 5
Predicting the parameters for an optimal execution

m     PSize                       OSize
p = 3, Configuration: F F S
900   mFast = 600, mSlow = 300    mFast = 701, mSlow = 199
1000  mFast = 666, mSlow = 334    mFast = 794, mSlow = 206
1500  mFast = 1000, mSlow = 500   mFast = 1191, mSlow = 309
p = 5, Configuration: F F F S S
900   mFast = 300, mSlow = 150    mFast = 333, mSlow = 117
1000  mFast = 334, mSlow = 166    mFast = 370, mSlow = 130
1500  mFast = 500, mSlow = 250    mFast = 585, mSlow = 165
p = 7, Configuration: F F F F S S S
900   mFast = 200, mSlow = 100    mFast = 216, mSlow = 84
1000  mFast = 222, mSlow = 111    mFast = 216, mSlow = 88, 88, 86
1500  mFast = 334, mSlow = 166    mFast = 399, mSlow = 101
p = 9, Configuration: S F F F F S S S S
900   mFast = 150, mSlow = 75     mFast = 158, mSlow = 67
1000  mFast = 166, mSlow = 84     mFast = 181, mSlow = 69
1500  mFast = 250, mSlow = 125    mFast = 295, mSlow = 80

Fig. 8 shows the speedups obtained using the two distributions with m = 1000, 1500 for all the configurations of processors. Note that, since a heterogeneous environment is under consideration, every set of processors includes both fast and slow machines. This fact explains why the observed speedups are not as impressive as they usually are for matrix multiplication algorithms on homogeneous architectures. However, the figure reflects an important increase in speedup for the scheduling strategy provided by the dynamic programming algorithm.

Fig. 8. Speedups obtained with the parallel algorithm for different sizes of problems and different configurations of processors.

Table 6 allows us to develop a sensitivity analysis to determine the optimal number of processors for a small problem, m = 100. Let us assume that we have been provided with a set of 10 fast processors and 10 slow processors. The first row of the table shows the optimal distribution considering just the 10 fast processors, the second row considers 10 fast processors and 1 slow processor, and the third one considers the whole set of processors. The table also provides the execution time predicted by the model for each distribution. The first situation reflects the homogeneous case where 10 identical processors are considered; the optimal distribution obviously comes from a balanced distribution. The second situation shows that introducing just one slow processor does not improve the computation: the slow processor would not receive any work at all. When the total set of processors provided is considered (the third row), the optimal distribution assigns work to just two slow processors.

We have also applied our proposals to the bidimensional Fast Fourier Transform (FFT-2D), a problem that plays an important role in many scientific and engineering applications. The FFT-2D exhibits a computation-to-communication ratio quite different from that of the matrix multiplication problem: the complexity order of the data to be sent is, in this case, the same as the complexity order of the data to be computed. Using our approach we tuned a master–slave FFT-2D algorithm to execute efficiently on a heterogeneous cluster of PCs.

7. Concluding remarks and future work

We have proposed an algorithmic approach to solve the optimal mapping problem of master–slave algorithms on heterogeneous systems. The approach uses an exact dynamic programming algorithm based on an inductive analytical model. The dynamic programming algorithm solves the general resource allocation optimization problem involved in a master–slave computation. Through sensitivity analysis, the optimal set of processors can also be determined.

There are several challenges and open questions to address in the near future. Although the class of master–slave problems considered represents a wide range of problems, master–slave programs in which the results are accepted from the slaves as soon as they become available are not well covered by this model. We plan to develop the same methodology with the aim of covering this class of problems as well. The methodology can also be introduced into a tool that provides the optimal mapping of master–slave algorithms over heterogeneous platforms following the algorithmic strategies proposed.

References

[1] S. Browne, J. Dongarra, N. Garner, G. Ho, P. Mucci, A portable programming interface for performance evaluation on modern processors, The International Journal of High Performance Computing Applications 1 (2000) 189–204.

[2] J. González, C. Rodríguez, G. Rodríguez, F. de Sande, M. Printista, A tool for performance modeling of parallel programs, Scientific Computing (2003) 191–198.

[3] O. Beaumont, A. Legrand, Y. Robert, The master–slave paradigm with heterogeneous processors, Rapport de recherche de l'INRIA Rhône-Alpes RR-4156, 2001, 21 p.

[4] M. Banikazemi, J. Sampathkumar, S. Prabhu, D.K. Panda, P. Sadayappan, Communication modeling of heterogeneous networks of workstations for performance characterization of collective operations, in: International Workshop on Heterogeneous Computing (HCW'99), in conjunction with IPPS'99, 1999, pp. 159–165.

[5] G. Jones, M. Goldsmith, Programming in Occam 2, Prentice Hall, 1988.

[6] P. Hansen, Studies in Computational Science, Prentice Hall, 1995.

[7] J. Roda, C. Rodríguez, F. Almeida, D. Morales, Prediction of parallel algorithms performance on bus based networks using PVM, in: Sixth Euromicro Workshop on Parallel and Distributed Processing, 1998, pp. 57–63.

[8] D. Turgay, Y. Paker, Optimal scheduling algorithms for communication constrained parallel processing, in: Euro-Par 2002, Lecture Notes in Computer Science, vol. 2400, Springer-Verlag, 2002, pp. 197–211.

[9] M. Cierniak, M. Zaki, W. Li, Compile-time scheduling algorithms for heterogeneous network of workstations, The Computer Journal 40 (6) (1997) 236–239.

[10] M. Drozdowski, P. Wolniewicz, Experiments with scheduling divisible tasks in clusters of workstations, in: Euro-Par 2000, Lecture Notes in Computer Science, vol. 1900, Springer-Verlag, 2000, pp. 311–319.

[11] J. Schopf, Structural prediction models for high-performance distributed applications, in: Cluster Computing Conference, 1997.

[12] F. Almeida, D. González, L. Moreno, C. Rodríguez, J. Toledo, On the prediction of master–slave algorithms over heterogeneous clusters, in: Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2003, pp. 433–437.

[13] T. Ibaraki, N. Katoh, Resource Allocation Problems: Algorithmic Approaches, The MIT Press, 1988.

[14] X. Zhang, Y. Yan, Modeling and characterizing parallel computing performance on heterogeneous networks of workstations, in: 7th IEEE Symposium on Parallel and Distributed Processing (SPDP'95), 1995, pp. 25–34.

[15] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, The Benjamin/Cummings Publishing Company, 1994.

[16] L. Pastor, J. Bosque, An efficiency and scalability model for heterogeneous clusters, in: 2001 IEEE International Conference on Cluster Computing (CLUSTER 2001), 2001, pp. 427–434.

[17] Y. Yan, X. Zhang, Y. Song, An effective and practical performance prediction model for parallel computing on nondedicated heterogeneous NOW, Journal of Parallel and Distributed Computing 41 (2) (1997) 63–80.

[18] V. Donaldson, F. Berman, R. Paturi, Program speedup in a heterogeneous computing network, Journal of Parallel and Distributed Computing 21 (3) (1994) 316–322.

[19] D. Culler, K. Richard, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, T. von Eicken, LogP: towards a realistic model of parallel computation, in: 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993.

[20] T. Ibaraki, Enumerative approaches to combinatorial optimization, part II, Annals of Operations Research 11.

[21] D. González, F. Almeida, J. Roda, C. Rodríguez, From the theory to the tools: parallel dynamic programming, Concurrency: Practice and Experience (2000) 21–34.

Francisco Almeida is an Assistant Professor of Computer Science at the University of La Laguna in Tenerife (Spain). His research interests lie primarily in the areas of parallel computing, parallel algorithms for optimization problems, parallel systems performance analysis and prediction, and skeleton tools for parallel programming. Almeida received the BS degree in mathematics and the MS and PhD degrees in computer science from the University of La Laguna.

Luz Marina Moreno is currently a PhD student in Computer Science at the Statistics, Operational Research and Computer Science department of the University of La Laguna, where she is also an Assistant Professor. She is mainly interested in heterogeneous systems and in scheduling.