
Combining Dynamic and Static Scheduling on Distributed-Memory Multiprocessors

O. Plata and F.F. Rivera

July 1994

Technical Report No: UMA-DAC-94/09

Published in:

8th ACM Int'l. Conf. on Supercomputing, Manchester, UK, July 11-15, 1994, pp. 186-195

University of Malaga, Department of Computer Architecture, C. Tecnologico, PO Box 4114, E-29080 Malaga, Spain

Combining Static and Dynamic Scheduling on Distributed-Memory Multiprocessors

Oscar Plata and Francisco F. Rivera
Dept. Electrónica y Computación
University of Santiago de Compostela
E-15706 Santiago de Compostela, SPAIN

Abstract

Loops are a large source of parallelism for many numerical applications. An important issue in the parallel execution of loops is how to schedule them so that the workload is well balanced among the processors. Most existing loop scheduling algorithms were designed for shared-memory multiprocessors, with uniform memory access costs. These approaches are not suitable for distributed-memory multiprocessors, where data locality is a major concern and communication costs are high. This paper presents a new scheduling algorithm in which data locality is taken into account. Our approach combines both worlds, static and dynamic scheduling, in a two-level (overlapped) fashion. This way data locality is considered and communication costs are limited. The performance of the new algorithm is evaluated on a CM-5 message-passing distributed-memory multiprocessor.

Keywords: Distributed-memory multiprocessors, message-passing, loop scheduling, dynamic and static scheduling, load balancing.

1 Introduction

Loops exhibit most of the parallelism present in numerical programs. Therefore, distributing the workload of a loop evenly among the processors is a critical factor in the efficient execution of this kind of parallel program. A loop scheduling strategy assigns iterations to the processors of the machine in such a way that all of them finish their workload at more or less the same time. A simple strategy is static scheduling, which determines the assignment of iterations to processors at compile time. But in many situations the workload of the iterations is unpredictable at compile time. Dynamic scheduling strategies have been developed to handle such situations, solving the assignment problem at run time.

When considering cross-iteration data dependences, we can classify loops into three major types: doserial, doall and doacross [Pol88]. In this paper we are primarily concerned with doall or parallel loops, in which there is no data dependence between any pair of iterations. Otherwise, the loop is taken as serial.

From the point of view of scheduling, a loop is uniform if the execution times of the different iterations are the same (e.g., matrix multiplication), semi-uniform if they depend on the index of the loop (e.g., adjoint convolution), or non-uniform if they depend on the data (e.g., transitive closure).

We can always obtain an optimal (or near-optimal) static schedule for uniform loops. If we have a parallel machine with P processors, we simply partition the set of iterations of the loop into P chunks of ⌈N/P⌉ iterations each, where N is the length of the loop, and assign each chunk to one processor. We can still obtain a (near-)optimal static schedule for semi-uniform loops. For example, consider the parallelized adjoint convolution algorithm, which exhibits a triangular iteration space:

      DOALL 10 I = 1, N*N
         DO 10 K = I, N*N
            A(I) = A(I) + X*B(K)*C(I-K)
   10 CONTINUE

We can establish a (near-)optimal static schedule using a kind of cyclic loop distribution scheme: we assign iteration i + rP (i from 1 to P and r = 0, 1, 2, ...) to processor i (if r is even) or to processor P - i + 1 (if r is odd).

Clearly, it is not possible to define an optimal static schedule for a non-uniform loop, because the data is known only at run time. These loops are frequent in many applications, such as image processing, sparse matrix computation and partial differential equations, among others. Most of the dynamic scheduling strategies used for balancing the workload of a non-uniform parallel loop were designed for shared-memory machines with uniform memory access cost (UMA). Therefore, data locality is not taken into consideration. Nevertheless, in a distributed-memory machine data locality is a major concern and must be optimized.

This paper concentrates on non-uniform parallel loops and discusses an efficient solution for scheduling this kind of loop on distributed-memory parallel machines. Our strategy combines features from static and dynamic scheduling techniques and succeeds in keeping the cost associated with remote memory accesses low. Section 2 describes some scheduling strategies proposed in the literature, especially for distributed-memory machines. Our approach for performing loop scheduling is explained in Section 3. Section 4 shows the experimental results obtained on the CM-5 and, finally, the paper is concluded in Section 5.
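For illustration, the reflected cyclic assignment described above for the adjoint convolution example can be computed per iteration as in the following sketch (ours, not part of the original paper), assuming 1-based global iterations and processors numbered 1 to P:

      #include <stdio.h>

      /* Sketch: processor that owns global iteration I (1-based) under the
         reflected cyclic scheme.  Iteration i + r*P goes to processor i when
         r is even and to processor P - i + 1 when r is odd. */
      static int owner(int I, int P)
      {
          int r = (I - 1) / P;          /* round number, starting at 0     */
          int i = (I - 1) % P + 1;      /* position within the round, 1..P */
          return (r % 2 == 0) ? i : P - i + 1;
      }

      int main(void)
      {
          int P = 4;
          for (int I = 1; I <= 2 * P; I++)   /* two rounds: 1 2 3 4 4 3 2 1 */
              printf("iteration %2d -> processor %d\n", I, owner(I, P));
          return 0;
      }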

2 Related work

Several static scheduling solutions have been proposed in the literature. In [PKP89] an optimal compile-time scheduling strategy is proposed for perfectly nested parallel loops with constant bounds. Sarkar and Hennessy [SH86] proposed a scheduling method based on splitting a program into sequential tasks. In [BB91] the scheduling is solved as a linear optimization problem. Tawbi and Feautrier [TF92] tested some heuristics to solve the processor allocation and scheduling problem. The application of symbolic analysis to derive an optimal static schedule can be found in [HP93].

For non-uniform problems it is necessary to balance the load dynamically, at run time. In self-scheduling (SS) [TY86], the simplest strategy, each processor repeatedly selects and executes one iteration until all iterations are executed. An experimental comparison of self-scheduling and static pre-scheduling algorithms can be found in [Zha91]. (Fixed-size) chunk scheduling [KW85] reduces the high synchronization overhead found in SS by scheduling chunks of more than one iteration as units. GSS [PK87] schedules large chunks at the beginning of the loop (low synchronization overhead) and small ones toward the end, trying to balance the workload. In factoring scheduling [HSF92] the allocation of iterations to processors proceeds in phases; in each phase, a part of the remaining iterations is divided equally among the available processors. Trapezoid scheduling [TN93] is a variation of GSS in which the size of successive chunks decreases linearly instead of exponentially. In [Luc92] a tapering scheduling is proposed, similar to GSS and trapezoid scheduling but with a different function for the decreasing size of the chunks.

All of the above dynamic scheduling algorithms were designed for shared-memory parallel machines, where an efficient mechanism for synchronization and data sharing is available. In distributed-memory parallel machines there is a large difference between the access time to the local memory and to remote memories. Hence, the synchronization overhead and the data-sharing cost are high in comparison with the cost of accessing local data. A much better scheduling performance is obtained if data locality is exploited. Only very recently have some dynamic scheduling strategies that exploit data locality appeared in the literature.

In [ML94], affinity dynamic scheduling (ADS) is introduced for NUMA machines. This algorithm, like GSS or trapezoid scheduling, assigns large chunks of iterations at the start of loop execution and uses a deterministic assignment policy to ensure that a chunk is always assigned to the same processor. This way data locality is exploited, but only in the case that the same data is reused several times. That happens, for example, when the parallel loop is surrounded by a sequential one, so that the same chunk is executed repeatedly. A locality-based dynamic scheduling (LDS), designed for NUMA machines, is described in [LTSS93]. LDS assumes that the data space is partitioned across the processors. It computes the sizes of the chunks using a GSS-like scheme. Each chunk is assigned to a processor on demand, and it includes those iterations whose data is stored in the local memory of the assigned processor.

In the case of message-passing machines, Rudolph and Polychronopoulos [RP89] implemented and evaluated static, self-, chunk and guided self-scheduling on the iPSC/2 hypercube multicomputer. It was the first attempt to evaluate shared-memory dynamic scheduling algorithms on a distributed-memory multiprocessor.
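To make the chunk-size rules of GSS, factoring and trapezoid scheduling surveyed above concrete, the following illustrative program (ours; the cited papers define the schemes and their tuning parameters more precisely) prints the chunk-size sequences the three schemes would produce for N = 100 iterations on P = 4 processors, showing their exponential, phased and linear decrease, respectively.

      #include <stdio.h>

      int main(void)
      {
          int N = 100, P = 4;

          /* Guided self-scheduling (GSS): next chunk = ceil(remaining / P). */
          printf("GSS:       ");
          for (int r = N; r > 0; ) {
              int c = (r + P - 1) / P;
              printf("%d ", c);
              r -= c;
          }
          printf("\n");

          /* Factoring (common formulation with factor 2): in each phase,
             half of the remaining iterations is split into P equal chunks. */
          printf("Factoring: ");
          for (int r = N; r > 0; ) {
              int c = (r + 2 * P - 1) / (2 * P);      /* ceil(r / (2P)) */
              for (int k = 0; k < P && r > 0; k++) {
                  int give = (c < r) ? c : r;
                  printf("%d ", give);
                  r -= give;
              }
          }
          printf("\n");

          /* Trapezoid self-scheduling (simplified): chunk sizes decrease
             linearly from about N/(2P) down to 1. */
          printf("TSS:       ");
          int first = N / (2 * P), last = 1;
          int chunks = (2 * N) / (first + last);      /* approximate count */
          double delta = (chunks > 1) ? (double)(first - last) / (chunks - 1)
                                      : 0.0;
          double size = first;
          for (int r = N; r > 0; ) {
              int c = (int)(size + 0.5);
              if (c < 1) c = 1;
              if (c > r) c = r;
              printf("%d ", c);
              r -= c;
              size -= delta;
          }
          printf("\n");
          return 0;
      }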

In distributed-memory machines, purely dynamic scheduling strategies are poor performers due to their relatively high synchronization overhead. On the other hand, static scheduling schemes are not suitable for non-uniform problems. In both situations a hybrid scheme can perform better, as it can exploit data locality and exhibit relatively low synchronization overhead, while balancing the workload dynamically at the same time.

An example of such an approach is presented in [LSL92], where the two-phase Safe Self-Scheduling (SSS) is described. In the first phase, a subset of the iterations of the loop is distributed uniformly among the processors. The second phase of SSS is activated at run time when a processor becomes idle: a chunk of not-yet-executed iterations is chosen and assigned dynamically to each requesting processor. This strategy assumes a centralized queue of iterations and that all processors can access all data. Distributed Safe Self-Scheduling (DSSS), an adaptation of SSS to message-passing machines, is discussed in [SLL93]. The data is partitioned into small blocks of the same size and distributed with partial redundancy among all the processors. DSSS then proceeds as SSS, but the chunks of iterations are assigned to the processors that hold the corresponding data. DSSS is generalized in [LS93].

The dynamic phase in DSSS is fired when a processor becomes idle. This processor takes charge of the scheduling duties and distributes the rest of the workload on demand among all the other processors. Inherently, this dynamic redistribution of load may imply low performance.

This paper describes a new approach to hybrid scheduling. We propose a two-level scheme in which both phases (static and dynamic) overlap. ADS uses a similar two-level scheme, where the exploitation of data locality is a major concern and the load balancing is accomplished on demand (when a processor becomes idle). In our case data locality is considered, and we try to predict workload imbalances in advance in order to obtain a better load balance and to reduce the communication overhead. While the processors are executing their statically distributed workload, a dynamic redistribution of the rest of the workload is in action. Hence the communication overhead associated with the dynamic level of the scheduling is hidden by the local computations. Our experiments show that with such a two-level strategy we can obtain good performance.

3 Scheduling scheme

The parallel loop we consider in the rest of the paper has the following simple structure:

      DOALL 10 I = 1, N
         loop_body(I)                                    (1)
   10 CONTINUE

The hybrid scheduling (HS) we propose is accomplished in two levels, as shown in Figure 1.

      do forever {          /* executed by all processors */
         Static_scheduling_level();
         Dynamic_scheduling_level();
      }

      Figure 1: Hybrid scheduling (HS) algorithm

Static level: First, we statically partition and evenly distribute the iteration space of the parallel loop among the processors of the machine. If we denote by P the number of processors, each processor executes a total of n = ⌈N/P⌉ iterations of the parallel loop (1):

      DO 10 i = 1, n
         IF ( f(i,σ) .LE. N )
            local_loop_body(i)
   10 CONTINUE

(note that we use lowercase indices to represent local variables and uppercase for global ones). If we denote by σ the numeric address of each processor, σ = 0, 1, ..., P - 1, then the relation between the local and global indices, I = f(i,σ), depends on the static distribution scheme used. For example, in the case of a block-distributed loop we have I = i + n·σ.

The form of the local loop body depends on the data distribution scheme chosen. If we choose to partition the data throughout the local memories of all the processors, then local_loop_body(i) is identical to loop_body(i). If we prefer to replicate all the data on all local memories, then local_loop_body(i) is equal to loop_body(I), where I = f(i,σ). Other combinations arise if a partial replication of the data is considered.

Dynamic level: The second level requires splitting each local iteration space into a set of chunks. The sizes of the chunks may be constant or variable. Indeed, we may use one of the dynamic scheduling algorithms found in the literature to accomplish the partitioning of the local iteration spaces. For example, if the GSS algorithm is considered, the set of n local iterations is partitioned into a total of S iteration groups (chunks), where ⌈(1 - 1/P)^(i-1) · n/P⌉ is the size of chunk number i, i = 1, 2, ..., S. Therefore, the local loop is transformed into

      upper_bound = 0
      DO 10 chunk = 1, S
         lower_bound = upper_bound + 1
         upper_bound = chunk_size(chunk) + upper_bound
         DO 10 i = lower_bound, upper_bound
            local_loop_body(i)
   10 CONTINUE

where chunk_size(c) computes the size of chunk number c.

All the actions associated with the dynamic level of HS are carried out at the boundary between the execution of one chunk and the following one. This way, we can establish the hybrid degree between the two extremes, static and dynamic, by choosing the size (and the number) of the chunks. When a certain imbalance of the workload is detected, the dynamic level of HS becomes activated and tries to reduce the imbalance below a given threshold. During this redistribution the local computations are still going on.

Specifically, just after the execution of each chunk of iterations and before starting with the next one, all processors carry out some actions in order to arrange themselves from the fastest to the slowest one (that is, from the most lightly loaded to the most heavily loaded processor) at that point of the execution. Then a redistribution of load (chunks of iterations) is executed, if necessary. This redistribution process overlaps the execution of the local chunks.
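As a concrete illustration of the two levels just described, the following sketch (ours, not the paper's implementation) computes the block-distribution index map I = i + n·σ of the static level and a GSS-style partition of the n local iterations, in which each chunk takes ⌈remaining/P⌉ iterations and therefore follows the decreasing sizes quoted above.

      #include <stdio.h>

      int main(void)
      {
          int N = 1000, P = 8, sigma = 3;     /* sigma: example processor address */
          int n = (N + P - 1) / P;            /* n = ceil(N/P) local iterations   */

          /* Static level: local iteration i (1..n) works on global iteration
             I = i + n*sigma under the block distribution. */
          printf("processor %d executes global iterations %d..%d\n",
                 sigma, 1 + n * sigma, n + n * sigma);

          /* Dynamic level: GSS-style partition of the local iteration space;
             each chunk takes ceil(remaining/P) iterations. */
          int remaining = n, k = 0;
          while (remaining > 0) {
              int size = (remaining + P - 1) / P;
              printf("local chunk %d: %d iterations\n", ++k, size);
              remaining -= size;
          }
          return 0;
      }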

A critical point in HS is how to arrange such a classification of the processors efficiently, in order to determine which processors send load and which receive it. We solve this problem by assigning the classification work to one of the processors. All the processors are workers, and one of them is, in addition, in charge of the dynamic scheduling duties. Therefore, all of the workers must send messages periodically to the scheduler with the information needed to establish their classification at the right times (at chunk boundaries).

We have implemented HS on a message-passing distributed-memory machine using a simple protocol of seven different types of messages. The meaning of these messages is described in Table 1. The rest of this section is dedicated to a detailed description of both levels in HS.

      Type             Meaning
      Load             A worker sends it to the scheduler before the execution of each local chunk
      Load needed      A worker sends it to the scheduler when it becomes idle
      Load migrated    A worker sends it to a requesting processor with its next non-executed local chunk
      Request failed   A worker sends it to a requesting processor when its local queue is almost empty
      Data returned    It is used to return the remotely processed data block to its original place
      Load request     The scheduler requests a worker to send its next non-executed chunk to a processor
      Load finished    There are no more chunks to be executed remotely

      Table 1: Messages used in the HS implementation

3.1 Static scheduling level

Figure 2 shows a detailed description of the static level in HS from both points of view, workers and scheduler (which is also a worker). Under this level, each worker sends one (empty) load message to the scheduler just before executing one of its local chunks. Therefore, the time elapsed between two consecutive load messages is equal to the time elapsed between the beginning of the execution of one local chunk and the next one. The scheduler carries out the classification of all the workers just after the execution of each of its local chunks (which is coded in the WorkloadRedistribution() function in Figure 2).

      Workers:
         FirstTime = YES;
         if ( my local queue is not empty ) {
            Send a load msg to the scheduler;
            Remove a chunk from my local queue;
            Execute the chunk removed;
         }
         else if ( FirstTime == YES ) {
            Send a load needed msg to the scheduler;
            FirstTime = NO;
         }

      Scheduler:
         FirstTime = YES;
         if ( my local queue is not empty ) {
            Increment my workload counter;
            Remove a chunk from my local queue;
            Execute the chunk removed;
         }
         else if ( FirstTime == YES ) {
            Select mhlw, the most heavily loaded worker;
            if ( load[mhlw] > TransferLimit && mhlw != myself )
               Send a load request msg to mhlw;
            FirstTime = NO;
         }
         if ( all workers are dead && my local queue is empty
              && no messages traveling ) exit();
         if ( load msgs have been received )
            WorkloadRedistribution();

      Figure 2: Static scheduling level

When a worker becomes idle, it sends a load needed message to the scheduler, requesting a non-executed chunk from another worker. The search for this worker is accomplished by the scheduler during its dynamic level, unless the requesting worker is the scheduler itself. Note that, in this case, the load request message is sent only if the selected worker (mhlw) has a minimum number of non-executed chunks, parameterized by the constant TransferLimit. With this mechanism the migration of the remaining chunks is avoided, as the communication latency would probably eliminate the benefits of balancing this remaining load (this is especially true if we use algorithms like GSS to partition the local loops, as the last chunks are very small).

3.2 Workload redistribution

Periodically, the scheduler tries to redistribute the load evenly among all the workers. After the execution of each of its local chunks, the scheduler reads all the load messages received and, by counting them, it can determine the relative local "speed" of all the workers. Note that this is a rather "instantaneous" measure, because the scheduler counts only the load messages received during the execution of its last local chunk. Based on this information, the scheduler classifies the workers that have sent load messages in order of decreasing speed. Once the classification is completed, the scheduler orders the heavily loaded processors to send chunks to the lightly loaded ones (see Figure 3). This load redistribution is accomplished while the rest of the workers are executing their local chunks.

      Scheduler:
         Read all load msgs received and update workload counters;
         Classify the corresponding workers according to their workload;
         while ( there are workers in the classification list ) {
            Remove mllw, the most lightly loaded worker in the classification list;
            Choose mhlw, the most heavily loaded worker, from the classification list;
            if ( load[mhlw] > load[mllw] - 1 ) {
               if ( load[mhlw] > TransferLimit ) {
                  if ( mhlw == scheduler )
                     Send a load migrated msg to mllw with my next
                        non-executed local chunk;
                  else
                     Send a load request msg to mhlw with mllw as the
                        requesting processor;
               }
            }
            if ( no message has been sent ) {
               if ( mllw has sent a load needed msg ) {
                  Send a load finished msg to mllw;
                  Reduce number of workers alive;
                  if ( all workers are dead && my local queue is empty
                       && no messages traveling ) exit();
               }
            }
         }

      Figure 3: Workload redistribution
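The pseudocode of Figures 2 and 3 exchanges the seven kinds of messages of Table 1. As a minimal sketch of how they might be encoded in an implementation (the names and fields below are ours, not taken from the paper or from the CMMD library):

      /* Illustrative encoding of the HS message protocol (Table 1). */
      typedef enum {
          MSG_LOAD,            /* worker -> scheduler, before each local chunk  */
          MSG_LOAD_NEEDED,     /* worker -> scheduler, when the worker is idle  */
          MSG_LOAD_MIGRATED,   /* carries the sender's next non-executed chunk  */
          MSG_REQUEST_FAILED,  /* sender's local queue is (almost) empty        */
          MSG_DATA_RETURNED,   /* returns remotely processed data to its owner  */
          MSG_LOAD_REQUEST,    /* scheduler asks a worker to migrate a chunk    */
          MSG_LOAD_FINISHED    /* no more chunks to be executed remotely        */
      } msg_type;

      typedef struct {
          msg_type type;
          int      src;        /* sending processor                             */
          int      dst;        /* destination processor                         */
          int      chunk_lo;   /* first global iteration of a migrated chunk    */
          int      chunk_hi;   /* last global iteration of a migrated chunk     */
      } hs_msg;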

A worker does not send a load message before starting to execute a remote chunk (see the next subsection). This way, a fast worker will receive a number of successive remote chunks and the frequency of arrival of its load messages will be reduced as a consequence. Thus, the scheduler will eventually classify this worker as slow and will not send more remote load to it until it becomes fast again at a later stage (if that happens).

With this mechanism, the scheduler tries to periodically delay the fastest processors and to alleviate the load of the slowest ones. There is a tradeoff between the accuracy of the measurement of the relative speeds and the communication cost associated with that measurement. The more accuracy we achieve, the more sensitive the scheduler is to changes in the local loads and thus the better the load balancing, because the response to these changes is faster and more precise. But in order to obtain high accuracy the workers must send load messages with a high frequency (by reducing the size of the chunks), which can increase the communication overhead to intolerable levels. The tradeoff between classification accuracy and associated communication cost is therefore established by choosing the size of the chunks of iterations.

In order to prevent thrashing of remote chunks, we incorporate two simple mechanisms. First, a remote chunk is always executed in the destination processor (redistribution of remote chunks to several workers is prohibited). Second, the workload counters used to select the slow workers (the load[] vector in Figure 3) are modified during the load redistribution process. If a chosen mhlw sends one of its non-executed local chunks, then its workload counter is decremented accordingly. This way a single slow (heavily loaded) worker is prevented from sending too many chunks, which would risk having that processor lose all of its local chunks. Likewise, the workload counter of the corresponding destination worker is incremented for each remote chunk assigned, trying to prevent bulk arrivals.

3.3 Dynamic processor coordination

We hide (overlap) most of the communication latency of redistributing load by executing local computations at the same time. The processors continue to execute local chunks while the transferred chunks are traveling through the interconnection network. Between each local chunk and the following one, the workers check for the arrival of a control message, possibly carrying a remote chunk to be executed. If it has not arrived, the workers immediately begin the execution of the following local chunk. Otherwise, they execute the remote chunk and, finally, return the processed remote data to the owner. This last action is only necessary if we do not want to modify the original (static) data distribution. In some cases this modification is irrelevant, for example, when we want to execute the same parallel loop several times, one after another. In this case, we can delay the reconstruction of the original data distribution until the last time the parallel loop is executed.
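The overlap just described boils down to a non-blocking check between local chunks. The following self-contained sketch (ours; the helper functions are placeholder stubs, not CMMD calls) shows the shape of a worker's main loop: if no control message has arrived, the worker immediately proceeds with its next local chunk.

      #include <stdbool.h>
      #include <stdio.h>

      /* Hypothetical stand-ins so the sketch compiles and runs. */
      static int  chunks_left = 4;
      static bool local_queue_not_empty(void) { return chunks_left > 0; }
      static void send_load_msg(void)       { printf("load msg sent to scheduler\n"); }
      static void execute_local_chunk(void) { printf("executed local chunk %d\n", chunks_left--); }
      static bool msg_pending(void)         { return false; }  /* non-blocking probe */
      static void handle_pending_messages(void) { /* Figure 4 actions would go here */ }

      int main(void)
      {
          /* Between local chunks the worker polls for control messages; if
             nothing has arrived it immediately starts the next local chunk,
             so chunk migration is overlapped with local computation. */
          while (local_queue_not_empty()) {
              send_load_msg();                 /* static level (Figure 2)  */
              execute_local_chunk();
              if (msg_pending())               /* dynamic level (Figure 4) */
                  handle_pending_messages();
          }
          return 0;
      }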

Due to the communication latency, we lose part of the load balance because the decision to redistribute load and the redistribution itself are carried out at different times. For example, it is possible for a fast processor to enter a heavily loaded part of its local computation before it receives a previously transferred remote chunk. In this case we further increase the load of a (currently) slow processor. Nevertheless, this increase in the load-balancing error is outweighed by the benefit of reducing the communication overhead. The net result is beneficial, as our experiments show.

During the workload redistribution phase explained in the previous subsection, the scheduler determines the set of workers that must send chunks and the set that must receive them. We then need to establish a dynamic processor coordination in order to transfer the chunks efficiently among the selected workers. During this coordination all workers check for the different kinds of messages and execute some actions according to the control scheme shown in Figures 4 and 5.

      Workers:
         while ( there are pending messages ) {
            switch ( type of the message received ) {
            case load finished msg:
               if ( there are no pending messages ) exit();
               break;
            case load request msg:
               if ( load[myself] > TransferLimit )
                  Send a load migrated msg to the requesting processor
                     with my next non-executed chunk;
               else
                  Send a request failed msg to the requesting processor;
               break;
            case load migrated msg:
               Execute migrated chunk;
               Send a data returned msg with the migrated data
                  to its owner (if necessary);
               if ( no load finished msg received )
                  FirstTime = YES;
               else if ( no messages traveling ) exit();
               break;
            case request failed msg:
               if ( no load finished msg received )
                  FirstTime = YES;
               else if ( no messages traveling ) exit();
               break;
            case data returned msg:
               Store the remotely processed data block in its original location;
               if ( load finished msg received && no messages traveling ) exit();
               break;
            }
         }

      Figure 4: Dynamic scheduling level (workers)

      Scheduler:
         while ( there are pending messages ) {
            switch ( type of the message received ) {
            case load needed msg:
               Select mhlw, the most heavily loaded worker;
               if ( load[mhlw] > TransferLimit ) {
                  if ( mhlw == scheduler )
                     Send a load migrated msg to the requesting processor
                        with my next non-executed chunk;
                  else
                     Send a load request msg to mhlw;
               }
               else {
                  Send a load finished msg to the requesting processor;
                  Reduce number of workers alive;
                  if ( all workers are dead && my local queue is empty
                       && no messages traveling ) exit();
               }
               break;
            case load migrated msg:
               Execute migrated chunk;
               Send a data returned msg with the migrated data
                  to its owner (if necessary);
               FirstTime = YES;
               break;
            case request failed msg:
               FirstTime = YES;
               break;
            case data returned msg:
               Store the remotely processed data block in its original location;
               if ( load finished msg received && no messages traveling ) exit();
               break;
            }
         }

      Figure 5: Dynamic scheduling level (scheduler)

One can argue that if the number of workers is very large, the scheduler can become a bottleneck. With a large number of workers, the scheduler would probably execute just the first of its local chunks and then redistribute the rest of them among the workers. But that depends on the static load initially assigned to the scheduler relative to the loads of the workers. Indeed, our experiments show that, in general, the efficiency of the parallel loop is lower if the scheduler is one of the initially lightly loaded processors. Moreover, the scheduler is in charge of the periodic classification of the workers, the core of the workload redistribution process. This may introduce another potential performance bottleneck. In this paper we show results using just one scheduler, but the possibility of having several schedulers will be considered in the future.

4 Experimental evaluation

We have implemented HS on the CM-5 using the CMMD message-passing library [Thi91]. In addition to HS, and in order to make comparisons, fully static and fully dynamic schedulings were implemented as well. We then measured the performance of all the scheduling algorithms using two benchmark programs.

4.1 Benchmark programs

We have chosen matrix multiplication (MM) and (a simplified) transitive closure (TC) as our benchmark programs. MM is a typical example of a very regular (uniform) problem, and TC represents an irregular (non-uniform) one, where the workload of the parallel loop depends heavily on the data content.

• Matrix multiplication: This program has three perfectly nested loops, where the innermost one contains a reduction operation. The other two loops are fully parallel. In our experiments we have parallelized the program at the outermost loop.

      DOALL 10 I = 1, N
         DO 10 J = 1, N
            C(I,J) = 0
            DO 10 K = 1, N
               C(I,J) = C(I,J) + A(I,K)*B(K,J)
   10 CONTINUE

  The regularity of this program will be useful in evaluating the effectiveness of HS as compared to (optimal) static scheduling.

• Transitive closure: The TC program is a 3-deep nested loop as well. The outermost loop is serial and the next one is parallel. For this reason the first loop was dropped in our experiments (that is, just one iteration of the outermost loop was considered), as follows:

      J = 1
      DOALL 10 I = 1, N
         IF ( A(I,J) .EQ. 1 )
            DO 10 K = 1, N
               IF ( A(J,K) .EQ. 1 )
                  A(I,K) = 1
   10 CONTINUE

  The sequential innermost loop may or may not be executed, and hence the computation is non-uniform, depending on the input data. This program will allow us to evaluate the performance of HS in a fully dynamic environment.

4.2 Scheduling algorithms

We have implemented the following scheduling algorithms: static scheduling (Static), HS with data partitioned throughout the local memories (HSp), HS with data replicated in all the local memories (HSr), and dynamic scheduling with data partitioned throughout the local memories, considering the scheduler either as a worker too (DynamicP) or as in charge of the scheduling duties only (DynamicPS).

Static scheduling consists of a block-wise static distribution of the iterations of the parallelized loop. The matrices are partitioned and block distributed as well. In the case of MM, matrices A and C are block partitioned by rows (and the partitions stored in the respective local matrices a and c) and matrix B is replicated. The local code for each processor is as follows,

      DO 10 i = 1, n
         DO 10 J = 1, N
            c(i,J) = 0
            DO 10 K = 1, N
               c(i,J) = c(i,J) + a(i,K)*B(K,J)
   10 CONTINUE

where n = ⌈N/P⌉.

For TC, matrix A is block partitioned by rows as well (and stored in the local matrix a), and the first row of A is replicated (and renamed A1[]). The local TC code is as follows:

      J = 1
      DO 10 i = 1, n
         DO 10 t = 1, LoopBodySize
            IF ( a(i,J) .EQ. 1 )
               DO 10 K = 1, N
                  IF ( A1(K) .EQ. 1 )
                     a(i,K) = 1
   10 CONTINUE

Because the execution time of the loop body of the parallelized loop in TC is very small, we have introduced another loop surrounding it (a granularity control loop) with LoopBodySize iterations. This loop permits parameterizing the loop body size and studying its effect in our experiments.

In order to obtain an initially poor workload balance for static scheduling in the TC benchmark, we have taken an input matrix A with 1's in the first half of its rows and 0's in the rest of them. With this data, static scheduling performs at around 50% efficiency. This way we can clearly evaluate the benefits of including dynamic actions in the scheduling strategy for this irregular problem.

We have considered the same data distribution scheme for the static level in HSp. With HSp and HSr we can analyze the global impact of data migration on the performance of HS, as the partitioning scheme may imply a large amount of data migration. In any case, the scheme chosen to partition the local loops is GSS(0) for the scheduler and GSS(2) for the rest of the workers. With GSS(2) we obtain a reasonably small communication overhead due to the relatively reduced number of messages sent to the scheduler by the workers (as the chunks have bigger sizes). On the other hand, using GSS(0) for the scheduler we improve its "sensitiveness" to dynamic workload changes and hence obtain reasonable load balancing. We have selected the value 2 (1) for the TransferLimit parameter for the workers (scheduler). The same algorithm, GSS(0), was chosen for the fully dynamic schedulings.

In all the cases where data was migrated, we have always considered that it returns to its original owner just after the execution of the remote chunk.

4.3 Results and interpretation

Figure 6 (a) displays the efficiency versus the number of processors for MM in the case of two different matrix sizes, comparing the performance of the three scheduling approaches: static, hybrid and dynamic. In the computation of the efficiency we have executed the same scheduling algorithm on one processor. We can see that the best performer is static scheduling and the worst performer is the dynamic one. HS performs quite similarly to static scheduling, especially in the case of sufficiently large matrices. This behavior is explained by the fact that all the workers process (almost) the same workload (because the loop was evenly distributed across the local memories at compile time). All the workers send load messages more or less at the same time to the scheduler. If the execution time of the chunks is sufficiently large, then the messages arrive at the scheduler with a difference in time (due to the different communication paths taken) that is smaller than the execution time of a single chunk. This way, the workload counters contain the same number (or at most a difference of one unit) for all the processors. Thus the scheduler never decides to migrate any of the local chunks, and HS works similarly to the static strategy (with some overhead associated with the control messages). Moreover, a possible load imbalance may appear only during the last local chunk for some of the processors, and in this case the TransferLimit parameter prevents the migration of the remaining chunks. This also explains why both versions, HSp and HSr, perform roughly in a similar way, as shown in Figure 6 (b).
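A note on the metric (our reading of the statement above, not a formula given in the paper): if T(1) is the execution time of the same scheduling algorithm on one processor and T(P) its time on P processors, the plotted efficiency is presumably

      E(P) = T(1) / (P * T(P))

so that, for instance, an efficiency of 0.5 on 16 processors corresponds to a speedup of 8 over the single-processor run.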

[Figure 6: Efficiency for matrix multiplication. Efficiency versus number of processors (1 to 32) for MM (100*100) and MM (400*400): panels (a) compare Static, HSp, DynamicP and DynamicPS; panels (b) compare Static, HSp and HSr.]

For the TC program, HSp performs best for large loop bodies (see Figure 7). We have experimented with two matrix sizes and three values for the upper bound of the granularity control loop (LoopBodySize): 10, 100 and 1000. If the dimension of the input matrix and/or the size of the parallelized loop body is sufficiently large, the efficiency of HSp rises to around 75 to 80%, whereas the best purely dynamic scheduling is always below the static one. These results show the advantages of considering data locality (the static level in HS) and of overlapping the chunk migrations with the local computations (the dynamic level in HS). If the data or the size of the loop body is small, the chunk migration costs are only partially hidden by the local computation time. In this case the performance of HS is very poor and becomes worse as the number of processors increases. Note that for a large input matrix the efficiency of HSp remains around 75% for any number of processors considered. Therefore the use of just one scheduler is sufficient to balance the load among a few dozen processors.

[Figure 7: Efficiency for transitive closure. Efficiency versus number of processors (2 to 32) for TC (100*100) and TC (500*500) with LoopBodySize 10, 100 and 1000, comparing Static, HSp, DynamicP and DynamicPS.]

The input matrix chosen (half 1's and half 0's) splits the processors into two classes: heavily loaded and lightly loaded processors. With the initial static loop distribution, half of the processors are in charge of all of the computations and the other half are idle. We have studied the influence of the initial workload assigned to the scheduler on the performance of HS. The results are shown in Figure 8, where the data-partitioned and data-replicated versions of HS are both considered. On the one hand, it is important to point out that, for large loop bodies, the efficiency of HS is not greatly affected by the data distribution scheme chosen. On the other hand, the efficiency is noticeably worse if a lightly loaded (initially idle) scheduler is chosen. This is reasonable, because a scheduler of this kind enters the workload redistribution level very frequently. It thus sends many messages to the heavily loaded processors ordering them to migrate many of their chunks, and hence the total communication cost is high. This behavior is prevented if the scheduler has a heavy workload, but at the expense of a probably worse workload balance. Indeed, we have found many cases where the load is balanced worse when the scheduler is a heavily loaded worker than when it is not. Nevertheless, the efficiency of the former case is better due to its limited communication cost.

[Figure 8: Efficiency for transitive closure. Efficiency versus number of processors (2 to 32) for TC (100*100) and TC (500*500) with LoopBodySize 10 and 1000, comparing HSp-h, HSp-l, HSr-h and HSr-l.]


5 Conclusions

We have studied the problem of loop scheduling on distributed-memory multiprocessors. If the loop is regular, a compile-time (static) scheduling is a (near-)optimal solution, and there is no need to migrate load among the processors. If the loop is irregular (the execution time of the loop body depends on the data, for example), we need to consider run-time (dynamic) techniques in order to manage load imbalances well. Most dynamic scheduling algorithms do their work well, but in a shared-memory environment. If the memory is distributed, we have to take data locality into account if we want to keep the communication overhead under a certain limit.

We believe that the best solution for obtaining high performance on a distributed-memory multiprocessor is to combine static and dynamic strategies in one loop scheduling algorithm. We have described such an algorithm, Hybrid Scheduling (HS), in which the static and dynamic natures are present in two levels. This design permits overlapping the local computations with the migration of part of the workload, so that a large part of the high communication overhead associated with migrations does not contribute to the completion time. In order to orchestrate the dynamic workload balance, one of the workers is, at the same time, in charge of the dynamic scheduling duties.

We have evaluated the performance of HS on a CM-5. For a regular program (matrix multiplication), HS performs close to static scheduling, depending on the data size. In this case, HS does not carry out any load migration and the overhead associated with the dynamic level is very low. For an irregular program (transitive closure), HS performs best if the loop body is sufficiently large. For many data sizes the load is not very well balanced, but the communication overhead associated with migrations is low. Our results indicate that the combination of both effects is in general beneficial for a machine such as the CM-5.

An important result is that the performance of HS is not very "sensitive" to the data distribution scheme chosen for the benchmarks tested. This is important in many applications in which it is not easy to find the best data distribution. Further investigations include an in-depth study of the influence of the data distribution chosen at the static level on the performance. Moreover, we are working on a multi-scheduler extension of HS and on obtaining more experimental results on a different kind of distributed-memory machine.

Acknowledgements

This research has benefited significantly from discussions with and suggestions from Constantine D. Polychronopoulos. Most of this material and all the experimental results were obtained during a research stay at the Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign. We would like to thank David Padua and his compiler group for the support they gave us. This research was supported in part by the CICYT under grant TIC92-0942-C03-03 and the Xunta de Galicia under grant XUGA20604A90.

References

[BB91] Belkhale, K. P. and Banerjee, P., "A Scheduling Algorithm for Parallelizable Dependent Tasks", in 5th IEEE Int. Parallel Processing Symp., Apr. 1991, pp. 500-506.

[HP93] Haghighat, M. and Polychronopoulos, C. D., "Symbolic Analysis: A Basis for Parallelization, Optimization and Scheduling of Programs", in 6th Work. on Languages and Compilers for Parallel Computing, Aug. 1993.

[HSF92] Hummel, S. F., Schonberg, E. and Flynn, L. E., "Factoring, a Method for Scheduling Parallel Loops", Communications of the ACM, vol. 35, no. 8, Aug. 1992, pp. 90-101.

[KW85] Kruskal, C. P. and Weiss, A., "Allocating Independent Subtasks on Parallel Processors", IEEE Trans. on Software Engineering, vol. SE-11, no. 10, Oct. 1985, pp. 1001-1016.

[LS93] Liu, J. and Saletore, V. A., "Self-Scheduling on Distributed-Memory Machines", in IEEE Supercomputing Conf., Nov. 1993, pp. 814-823.

[LSL92] Liu, J., Saletore, V. A. and Lewis, T. G., "Scheduling Parallel Loops with Variable Length Iteration Execution Times on Parallel Computers", in ISMM Int. Conf. on Parallel and Distributed Computing Systems, Oct. 1992.

[LTSS93] Li, H., Tandri, S., Stumm, M. and Sevcik, K., "Locality and Loop Scheduling on NUMA Multiprocessors", in IEEE Int. Conf. on Parallel Processing, Aug. 1993, pp. 140-147.

[Luc92] Lucco, S., "A Dynamic Scheduling Method for Irregular Parallel Programs", in ACM Int. Conf. on Programming Languages Design and Implementation, Jun. 1992, pp. 200-211.

[ML94] Markatos, E. P. and LeBlanc, T. J., "Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors", IEEE Trans. on Parallel and Distributed Systems, vol. 5, no. 4, Apr. 1994, pp. 379-400.

[PK87] Polychronopoulos, C. D. and Kuck, D. J., "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers", IEEE Trans. on Computers, vol. C-36, no. 12, Dec. 1987, pp. 1425-1439.

[PKP89] Polychronopoulos, C. D., Kuck, D. J. and Padua, D., "Utilizing Multidimensional Loop Parallelism on Large-Scale Parallel Processor Systems", IEEE Trans. on Computers, vol. C-38, no. 9, Sep. 1989, pp. 1285-1307.

[Pol88] Polychronopoulos, C. D., Parallel Programming and Compilers. Kluwer Academic Pub., Norwell, MA, 1988.

[RP89] Rudolph, D. C. and Polychronopoulos, C. D., "An Efficient Message-Passing Scheduler Based on Guided Self-Scheduling", in ACM Int. Conf. on Supercomputing, Jun. 1989, pp. 50-61.


[SH86] Sarkar, V. and Hennessy, J., "Compile-Time Partitioning and Scheduling of Parallel Programs", ACM SIGPLAN Notices, vol. 21, no. 7, Jul. 1986.

[SLL93] Saletore, V. A., Liu, J. and Lam, B. Y., "Scheduling Non-Uniform Parallel Loops on Distributed Memory Machines", in IEEE Int. Conf. on System Sciences, Jan. 1993, pp. 516-525.

[TF92] Tawbi, N. and Feautrier, P., "Processor Allocation and Loop Scheduling on Multiprocessor Computers", in ACM Int. Conf. on Supercomputing, Jul. 1992, pp. 63-71.

[Thi91] Thinking Machines Corporation, Cambridge, MA, The Connection Machine CM-5 Technical Summary, 1991.

[TN93] Tzen, T. H. and Ni, L. M., "Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers", IEEE Trans. on Parallel and Distributed Systems, vol. 4, no. 1, Jan. 1993, pp. 87-98.

[TY86] Tang, P. and Yew, P. C., "Processor Self-Scheduling for Multiple Nested Parallel Loops", in IEEE Int. Conf. on Parallel Processing, Aug. 1986, pp. 528-535.

[Zha91] Zhang, X., "Dynamic and Static Load Scheduling Performance on NUMA Shared Memory Multiprocessors", in ACM Int. Conf. on Supercomputing, Jun. 1991, pp. 128-135.
