Efficient hybrid parallelisation of tiled algorithms on SMP clusters


Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, School of Electrical and Computer Engineering, Computing Systems Laboratory, Iroon Polytexneioy 9, 15780 Zografou Campus, Athens, Greece
E-mail: {ndros,nkoziris}@cslab.ece.ntua.gr

Abstract: This article emphasizes load balancing issues associated with hybrid programming models for the parallelization of tiled algorithms onto SMP clusters. Although tiled algorithms usually account for relatively regular computation and communication patterns, their hybrid parallelization often suffers from intrinsic load imbalance between threads. This observation mainly reflects the fact that most existing message passing libraries generally provide limited multi-threading support, thus allowing only the master thread to perform inter-node message passing communication. In order to mitigate this effect, we propose a generic method for the application of load balancing on the coarse-grain hybrid model for the appropriate distribution of the computational load to the working threads. We adopt both a static as well as a dynamic load balancing approach, and implement three alternative balancing variations. All implementations are experimentally evaluated against kernel benchmarks, in order to demonstrate the potential of such load balancing schemes for the extraction of maximum performance out of hybrid parallel programs.

Keywords: nested loops; tiling transformation; load balancing; hybrid parallelization; SMP clusters; MPI; OpenMP

Reference to this article should be made as follows: Drosinos, N. and Koziris, N. (2006) 'Efficient hybrid parallelization of tiled algorithms on SMP clusters', International Journal of Computational Science and Engineering, Vol. x, No. x, pp.xxx-xxx.

Biographical notes: Nikolaos Drosinos is currently a Ph.D. candidate in the School of Electrical and Computer Engineering at the National Technical University of Athens (NTUA). He received his Diploma in Electrical Engineering from the NTUA in 2001. His research interests include parallel processing (automatic parallelization, loop scheduling, hybrid parallel programming models, load balancing), parallel architectures and interconnection networks. He is a student member of the IEEE, and also a member of the IEEE Computer Society.
Nectarios Koziris received his Diploma in Electrical Engineering from the National Technical University of Athens (NTUA) and his Ph.D. in Computer Engineering from NTUA (1997). He is currently an Assistant Professor in the Computer Science Department, School of Electrical and Computer Engineering at NTUA. His research interests include parallel architectures, loop code optimizations, interaction between compilers, OS and architectures, communication architectures for clusters (OS and compiler support) and resource scheduling (CPU and storage) for large scale computer systems. He has published more than 60 research papers in international refereed journals and conferences/workshops, as well as two Greek textbooks, "Mapping Algorithms into Parallel Processing Architectures" and "Computer Architecture and Operating Systems". He is a recipient of the IEEE IPDPS 2001 best paper award. He serves as a reviewer in international journals and various HPC conferences (IPDPS, ICPP etc.). He served as a Program Committee member in the HiPC-2002 and ICPP-2005 conferences and the CAC03 and CAC04 workshops (organized with IPDPS), and as Program Committee co-Chair for the ACM SAC03-PDS, SAC04-PDSN and SAC05-DSGC Tracks. He is a member of the IEEE Computer Society, a member of IEEE-CS TCPP and TCCA and of the ACM, and chairs the Greek IEEE Computer Society Chapter. He also serves as a Board Member and Deputy Vice-Chair for the Greek Research and Education Network.

1 INTRODUCTION

SMP clusters have dominated the high performance computing domain by providing a reliable, cost-effective solution to both the research and the commercial communities. An immediate consequence stemming from the emergence of this new architecture is the consideration of new parallel programming models, which might exploit the underlying infrastructure more efficiently. Currently, message passing parallelization via the MPI library has become the de-facto programming approach for the development of portable code for a variety of high performance platforms. Although message passing parallel programs are generic enough to be directly adapted to multi-layered, hierarchical architectures in a straightforward manner, there is an active research interest in considering alternative parallel programming models that could be more appropriate for such platforms.

Hybrid programming models on SMP clusters resort to both message passing and shared memory access for inter- and intra-node communication, respectively, thus implementing a two-level hierarchical communication pattern. Usually, MPI is employed for the inter-node communication, while a multi-threading API, such as OpenMP, is used for the intra-node processing and synchronization. There are mainly two hybrid programming variations addressed in related work, namely the fine-grain incremental hybrid parallelization, as well as the coarse-grain SPMD-like alternative. Naturally, other cluster parallelization approaches also exist, such as using a parallel programming language, like HPF (Merlin and Hey (1995)) or UPC (El-Ghazawi et al. (2002)), and relying on the compiler for efficiency, or even providing a single shared memory system image across the entire SMP cluster, implemented with the aid of Distributed Shared Virtual Memory (DSVM) software (Morin and Puaut (1997), Hu et al. (2000)). Nevertheless, these techniques are not as popular as either the message passing approach, or even hybrid parallel programming, hence they will not be discussed here.

Tiled loop algorithms account for a large fraction of the computationally intensive part of many existing scientific codes. A typical representative of this application model stems from the discretization of Partial Differential Equations (PDEs). We consider generic N+1-dimensional tiled loops, which are parallelized across the outermost N dimensions, so as to perform sequential execution along the innermost dimension in a pipeline fashion, interleaving computation and communication phases. These algorithms impose significant communication demands, thus rendering communication-efficient parallelization schemes critical in order to obtain high performance. Moreover, the hybrid parallelization of such algorithms is a non-trivial issue, as there is a trade-off between programming complexity and parallel efficiency.

Hybrid parallelization is a popular topic in related literature, although it has admittedly delivered controversial results (Cappello and Etiemble (2000), Drosinos and Koziris (2004), Henty (2000), Dong and Karniadakis (2004), Loft et al. (2001), Ayguade et al. (2004), Majumdar (2000), Nakajima (2003), Su et al. (2004) etc). In practice, it is still a very open subject, as the efficient use of an SMP cluster calls for appropriate scheduling methods and load balancing techniques. Most message passing libraries provide a limited multi-threading support level (e.g. funneled), allowing only the master thread to perform message passing communication. Even under multiple thread support, the additional complexity associated with ensuring thread safety can potentially diminish the load balancing advantages of full multi-threading support. Therefore, additional load balancing must be applied, so as to equalize the per tile execution times of all threads. This effect has been theoretically spotted in related literature (Rabenseifner and Wellein (2003), Legrand et al. (2004), Chavarría-Miranda and Mellor-Crummey (2003), Darte et al. (2003)), but to our knowledge no generic load balancing technique has been proposed and, more importantly, evaluated.

In this article we propose both static and dynamic load balancing for the coarse-grain funneled hybrid parallelization of tiled loop algorithms. The computational load associated with a tile is appropriately distributed among threads, either statically, based upon the estimation and modeling of basic system parameters, or at run-time, by sampling relative computation and communication times and dynamically applying appropriate load balancing. We distinguish between two variations of static load balancing, namely a constant and a variable scheme, depending on whether the same task distribution is applied on a global or on a per process basis, respectively. We emphasize the elements of applicability and simplicity, and evaluate the efficiency of the proposed schemes against two popular kernel benchmarks, namely ADI integration and a backward discretization of the Diffusion Equation. The experimental evaluation eloquently demonstrates the merit of load balancing when following the hybrid parallelization approach, and yields a significant overall performance improvement over the dominant message passing model.

The rest of this article is organized as follows: Section 2 discusses our target algorithmic model and introduces the notation used throughout the article. Section 3 refers to a pure message passing parallelization of tiled loop algorithms, while Section 4 presents the hybrid programming alternatives. Section 5 focuses on the proposed load balancing schemes, whereas Section 6 experimentally evaluates the efficiency of these schemes against the ADI and DE kernels. Last, Section 7 concludes the paper and summarizes the performance results.

2 ALGORITHMIC MODEL - NOTATION

Our algorithmic model concerns tiled algorithms, that can be formally described as in Alg. 1. Such algorithms can be typically obtained by applying tiling transformation to fully permutable loops. Tiling is a popular loop transformation and can be applied in order to implement coarse granularity in parallel programs. Tiling partitions the original iteration space of an algorithm into atomic units of execution (tiles). This partitioning facilitates the parallelization of the algorithm, as a tile-to-process allocation scheme can be selected to implement domain decomposition in a straightforward manner. In our case, each process assumes the execution of a sequence of tiles, successive along the longest dimension of the original iteration space.

Algorithm 1: iterative algorithm model

  foracross tile1 <- 1 to H(X1) do
    . . .
    foracross tileN <- 1 to H(XN) do
      for tileN+1 <- 1 to H(Z) do
        Receive(~tile);
        Compute(~tile);
        Send(~tile);

Formally, we assume an N+1-dimensional algorithm with an iteration space of X1 × · · · × XN × Z, which has been partitioned using a tiling transformation defined by function H. Z is considered the longest dimension of the iteration space, and should be brought to the innermost loop through permutation, in order to simplify the generation of efficient parallel tiled code (Wolf and Lam (1991)). In the above code, tiles are identified by an N+1-dimensional vector ~tile = (tile1, . . . , tileN+1). foracross implies parallel execution, as opposed to sequential execution (for). In each tile, denoted by a specific instance of vector ~tile, a process first receives data required for the computations associated with the current tile (Receive), then performs these computations (Compute) and finally transmits computed data required by its neighboring processes (Send).

Generally, tiled code is associated with a particular tile-to-process distribution strategy, that enforces explicit data distribution and implicit computation distribution, according to the owner-computes rule. For homogeneous platforms and fully permutable iterative algorithms, related scientific literature (Calland et al. (1997)) has proven the optimality of the columnwise allocation of tiles to processes, as long as sequential pipelined execution along the longest dimension is assumed. Therefore, all parallel algorithms considered in this paper implement computation distribution across the N outermost dimensions, while each process computes a sequence of tiles along the innermost N+1-th dimension.

In most practical cases, the data dependencies of the algorithm are several orders of magnitude smaller than the iteration space dimensions. Consequently, only neighboring processes need to communicate, assuming reasonably coarse parallel granularities, which are common for the distributed memory architectures addressed here. According to the above, we only consider unitary process communication directions for our analysis, since all other non-unitary process dependencies can be satisfied according to indirect message passing techniques, such as the ones described in Tang and Zigman (1994). However, in order to preserve the communication pattern of the application, we consider a weight factor di for each process dependence direction i, implying that if iteration ~j = (j1, . . . , ji, . . . , jN+1) is assigned to a process ~p, and iteration ~j' = (j1, . . . , ji + di, . . . , jN+1) is assigned to a different process ~p', ~p != ~p', then data calculated at iteration ~j by ~p need to be sent to ~p', since they will be required for the computation of data at iteration ~j'.

In the following, P1 × · · · × PN and T1 × · · · × TN denote the process and thread topology, respectively. P = ∏_{i=1}^{N} Pi is the total number of processes, while T = ∏_{i=1}^{N} Ti is the total number of threads. Usually, if Pmpi is the total number of processes for the message passing programming model, whereas Phybrid is the total number of processes and Thybrid the respective number of threads for the hybrid model, both quantities Pmpi and Phybrid × Thybrid equal the total number of available processors, in order to fully exploit the computational infrastructure for the particular class of algorithms we address in this article. Nevertheless, for the sake of generality, we will only be considering processes and threads, as opposed to processors, where vector ~p = (p1, . . . , pN), 0 <= pi <= Pi - 1, identifies a specific process, while ~t = (t1, . . . , tN), 0 <= ti <= Ti - 1, refers to a particular thread. Throughout the text, we will use MPI and OpenMP notation in the proposed parallel algorithms.
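
As an illustration of how a process can obtain its coordinate vector ~p in practice, the following C/MPI sketch uses MPI's Cartesian topology support; the 4 × 2 topology (8 processes) is an arbitrary assumption for this example.

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: derive the process coordinate vector ~p = (p1, p2) of a 2D
     * process topology P1 x P2 via MPI's Cartesian topology support.
     * The 4 x 2 shape is an assumption; run with 8 processes. */
    int main(int argc, char **argv)
    {
        int dims[2] = {4, 2}, periods[2] = {0, 0}, coords[2], rank;
        MPI_Comm cart;

        MPI_Init(&argc, &argv);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

        if (cart != MPI_COMM_NULL) {
            MPI_Comm_rank(cart, &rank);
            MPI_Cart_coords(cart, rank, 2, coords);   /* coords = (p1, p2) */
            printf("rank %d -> p = (%d, %d)\n", rank, coords[0], coords[1]);
            MPI_Comm_free(&cart);
        }

        MPI_Finalize();
        return 0;
    }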

3 MESSAGE PASSING PARALLELIZATION

The proposed pure message passing parallelization for the algorithms described above is based on the tiling transformation and is schematically depicted in Alg. 2. Each process is identified by an N-dimensional vector ~p, while different tiles correspond to different instances of the N+1-dimensional vector ~tile. The N outermost coordinates of a tile specify its owner process ~p, while the innermost coordinate tileN+1 iterates over the set of tiles assigned to that process. z denotes the tile height along the sequential execution dimension, and determines the granularity of the achieved parallelism: higher values of z imply less frequent communication and coarser granularity, while lower values of z call for more frequent communication and lead to finer granularity. The investigation of the effect of granularity on the overall completion time of the algorithm and the selection of an appropriate tile height z are beyond the scope of this paper. Generally, we consider z to be a user-defined parameter, and perform measurements for various granularities, in order to experimentally determine the value of z that delivers minimal execution time. More on the effect of granularity on parallel performance can be found in Kumar et al. (2003), Hodzic and Shang (1998), Andonov et al. (2003).

Furthermore, an advanced scheduling that allows for computation-communication overlapping is adopted as follows: in each time step, a process ~p = (p1, . . . , pN) concurrently computes a tile (p1, . . . , pN, tileN+1), receives data required for the computation of the next tile (p1, . . . , pN, tileN+1 + 1) and sends data computed at the previous tile (p1, . . . , pN, tileN+1 - 1). S~p denotes the set of valid data transmission directions of process ~p, that is, if ~dir ∈ S~p for a non-boundary process ~p, then ~p needs to send data to process ~p + ~dir. Similarly, R~p corresponds to the valid data reception directions of process ~p, implying that process ~p should receive data from process ~p - ~dir if ~dir ∈ R~p. S~p and R~p are determined both by the data dependencies of the original algorithm, as well as by the selected process topology of the parallel implementation.

Algorithm 2: pure message passing model

  /* determine tile sequence from pid ~p */
  for i <- 1 to N do
    tilei = pi;
  /* main loop: traverse all tiles */
  for tileN+1 <- 1 to ⌈Z/z⌉ do
    /* for each active transmission direction ~dir: pack previously computed data
       and send to process ~p + ~dir */
    foreach ~dir ∈ S~p do
      Pack(~dir, tileN+1 - 1, ~p);
      MPI_Isend(~p + ~dir);
    /* for each active reception direction ~dir: receive from process ~p - ~dir
       for next tile */
    foreach ~dir ∈ R~p do
      MPI_Irecv(~p - ~dir);
    /* compute current tile */
    Compute(~tile);
    /* complete pending communication */
    MPI_Waitall;
    /* for each active reception direction ~dir: appropriately unpack received data */
    foreach ~dir ∈ R~p do
      Unpack(~dir, tileN+1 + 1, ~p);
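
A simplified C/MPI rendering of the per-tile loop of Alg. 2 is sketched below for a single transmission and a single reception direction; pack_tile(), unpack_tile() and compute_tile() are assumed helpers, and buffer sizing, tagging and the multi-dimensional direction sets are omitted for brevity.

    #include <mpi.h>

    /* Assumed helpers, not part of the paper's code: packing/unpacking of
     * boundary data and the per-tile computation. */
    void pack_tile(double *buf, int tile);
    void unpack_tile(double *buf, int tile);
    void compute_tile(int tile);

    /* One communication direction in each set S_p (next_rank) and R_p
     * (prev_rank); boundary processes pass MPI_PROC_NULL. */
    void tile_loop(int num_tiles, int prev_rank, int next_rank,
                   double *send_buf, double *recv_buf, int msg_len,
                   MPI_Comm comm)
    {
        for (int t = 1; t <= num_tiles; t++) {
            MPI_Request req[2];
            int nreq = 0;

            if (next_rank != MPI_PROC_NULL) {       /* send data of tile t-1 */
                pack_tile(send_buf, t - 1);
                MPI_Isend(send_buf, msg_len, MPI_DOUBLE, next_rank, 0,
                          comm, &req[nreq++]);
            }
            if (prev_rank != MPI_PROC_NULL)         /* receive data for tile t+1 */
                MPI_Irecv(recv_buf, msg_len, MPI_DOUBLE, prev_rank, 0,
                          comm, &req[nreq++]);

            compute_tile(t);                        /* compute current tile */

            MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
            if (prev_rank != MPI_PROC_NULL)
                unpack_tile(recv_buf, t + 1);
        }
    }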

For the true overlapping of computation and communication, as theoretically implied in the above scheme by combining non-blocking communication primitives with the overlapping scheduling, the usage of advanced CPU offloading features is required, such as zero-copy and DMA-driven communication. Unfortunately, experimental evaluation over a standard TCP/IP based interconnection network, such as Ethernet, combined with the ch_p4 ADI-2 device of the MPICH implementation, prohibits such advanced non-blocking communication; nevertheless, the same limitations hold for our hybrid model, and are thus not likely to affect the relative performance comparison. However, this fact does complicate our theoretical analysis, since we will assume in general distinct, non-overlapped computation and communication phases, an assumption that to some extent misrepresents the efficiency of the message passing communication primitives.

4 HYBRID PARALLELIZATION

The potential for hybrid parallelization is directly associated with the multi-threading support provided by the message passing library. From that perspective, there are mainly five levels of multi-threading support addressed in the relevant scientific literature:

1. single: No multi-threading support.

2. masteronly: Message passing routines may be called, but only outside of multi-threaded parallel regions.

3. funneled: Message passing routines may be called even within multi-threaded parallel regions, but only by the master thread. Other threads may run application code at this time.

4. serialized: All threads are allowed to call message passing routines, but only one at a time.

5. multiple: All threads are allowed to call message passing routines, without restrictions.

Each category is a superset of all previous ones. Currently, popular non-commercial message passing libraries provide support up to the funneled or serialized thread support level, thus effectively restraining the message passing communication capabilities within multi-threaded regions, while only a few proprietary libraries allow for full multi-threading support (multiple thread support level). Due to this fact, most attempts for hybrid parallelization of applications that have been proposed or implemented are mostly restricted to the first three thread support levels.
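
For reference, a hybrid code typically negotiates the required thread support level at start-up; the following minimal C sketch requests funneled support and aborts if the library provides less (the error handling policy is an assumption, not prescribed by the paper).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* Ask for MPI_THREAD_FUNNELED: only the thread that called
           MPI_Init_thread (the master) will issue MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        if (provided < MPI_THREAD_FUNNELED) {
            /* the library only guarantees a lower level (single);
               a funneled hybrid code should not proceed as-is */
            fprintf(stderr, "insufficient thread support: %d\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... hybrid computation ... */

        MPI_Finalize();
        return 0;
    }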

Two major hybrid parallel programming variations are discussed in related literature, namely fine-grain and coarse-grain. The fine-grain model applies incremental multi-threading parallelization solely to specific computational code parts, and therefore requires minimum multi-threading support on behalf of the message passing library, as even the masteronly thread support level is sufficient. On the other hand, the coarse-grain hybrid programming model enforces an SPMD-like programming style by spawning threads only once close to the beginning of the program, thus calling for at least a funneled thread support level, as message passing will have to be conducted within multi-threaded parallel regions. The particular thread support level provided by the message passing library leads to two further flavors of the coarse-grain hybrid model, depending on whether only the master thread is allowed to perform message passing communication, or if all threads are enabled to call message passing primitives. The application of the three hybrid models (fine-grain, coarse-grain funneled and coarse-grain multiple) in the parallelization process of tiled loop algorithms will be the subject of this Section.

4.1 Fine-grain Hybrid Parallelization

The fine-grain hybrid programming paradigm, also referred to as masteronly in related literature, is the most popular hybrid programming approach, although it raises a number of performance deficiencies. The popularity of the fine-grain model over the coarse-grain one is mainly attributed to its programming simplicity: in most cases, it is a straightforward incremental parallelization of pure message-passing code by applying block distribution work sharing constructs to computationally intensive code parts (usually loops). Because of this fact, it does not require significant restructuring of the existing message passing code, and is relatively simple to implement by submitting an application to performance profiling and further parallelizing performance critical parts with the aid of multi-threading processing. Also, fine-grain parallelization is the only feasible hybrid approach for those message passing libraries that support only masteronly multi-threading.

However, the efficiency of the fine-grain hybrid model is directly associated with the fraction of the code that is incrementally parallelized, according to Amdahl's law: since message passing communication can be applied only outside of parallel regions, other threads are essentially sleeping when such communication occurs, resulting in poor CPU utilization and inefficient overall load balancing. Also, this paradigm suffers from the overhead of re-initializing the thread structures every time a parallel region is encountered, since threads are continually spawned and terminated. The thread management overhead can be substantial, especially in case of a poor implementation of the multi-threading library, and generally increases with the number of threads. Moreover, incremental loop parallelization is a very restrictive multi-threading parallelization approach for many real algorithms, where such loops either do not exist or cannot be directly enclosed by parallel regions.

Algorithm 3: fine-grain hybrid model

     /* determine group sequence from pid ~p */
   1 for i <- 1 to N do
   2   groupi = pi;
     /* main loop: traverse all group instances */
   3 foreach groupN+1 ∈ G~p do
       /* for each active transmission direction ~dir: pack data computed at
          previous group and send to process ~p + ~dir */
   4   foreach ~dir ∈ S~p do
   5     Pack(~dir, groupN+1 - 1, ~p);
   6     MPI_Isend(~p + ~dir);
       /* for each active reception direction ~dir: receive from ~p - ~dir
          for next group */
   7   foreach ~dir ∈ R~p do
   8     MPI_Irecv(~p - ~dir);
       /* multi-threaded parallel construct */
   9   #pragma omp parallel
         /* calculate candidate tile to execute... */
  10     for i <- 1 to N do
  11       tilei = piTi + ti;
  12     tileN+1 = groupN+1 - Σ_{i=1}^{N} tilei;
         /* ...and compute if valid tile */
  13     if 1 <= tileN+1 <= ⌈Z/z⌉ then
  14       Compute(~tile);
       /* complete pending communication */
  15   MPI_Waitall;
       /* for each active reception direction ~dir: appropriately unpack
          received data */
  16   foreach ~dir ∈ R~p do
  17     Unpack(~dir, groupN+1 + 1, ~p);

The proposed fine-grain hybrid implementation for iterative algorithms is depicted in Alg. 3. Hyperplane scheduling categorizes the tiles assigned to all threads of a specific process into groups, which can be concurrently executed. Each group contains all tiles that can be safely executed in parallel by the specified number of threads T, without violating the data dependencies of the initial algorithm. Each group is identified by an N+1-dimensional vector ~group, where the N outermost coordinates denote the owner process ~p, and the innermost one iterates over the distinct time steps. G~p corresponds to the set of execution time steps of process ~p, and depends both on the process and thread topology. Formally, vector ~group and set G~p are defined as follows:

  ~group = (group1, . . . , groupN, groupN+1), where groupi = pi for 1 <= i <= N and groupN+1 = g ∈ G~p

  G~p = { g ∈ N | Σ_{i=1}^{N} piTi + 1 <= g <= Σ_{i=1}^{N} piTi + Σ_{i=1}^{N} (Ti - 1) + ⌈Z/z⌉ }

For each instance of vector ~group, each thread determines a candidate tile ~tile for execution, and further evaluates an if-clause to check whether that tile is valid and should be computed at the current time step.

Figure 1: Hybrid parallel program for a 3D algorithm and 6 processes × 6 threads (process topology P1 = 3, P2 = 2; thread topology T1 = 3, T2 = 2; the annotated tile belongs to process ~p = (1, 0), thread ~t = (0, 0), with tile3 = group3 = 4).

All message passing communication is performed outside of the parallel region (lines 4-8 and 15-17), while the multi-threaded parallel computation occurs in lines 9-14. Note that no explicit barrier is required for thread synchronization, as this effect is implicitly achieved by exiting the multi-threaded parallel region. Note also that only the code fraction in lines 9-14 fully exploits the underlying processing infrastructure, thus effectively limiting the parallel efficiency of the algorithm. Fig. 1 clarifies some of the notation used in the hybrid algorithms.
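
The structure of one group step of the fine-grain model can be sketched in C/OpenMP as follows; the helper routines are placeholders standing in for the Pack/MPI_Isend, MPI_Irecv, Unpack and candidate-tile logic of Alg. 3, not actual code from the paper.

    #include <mpi.h>
    #include <omp.h>

    /* placeholders for Alg. 3's communication and tile-selection steps */
    void pack_and_isend(int group, MPI_Request *reqs, int *nreq);
    void post_irecv(int group, MPI_Request *reqs, int *nreq);
    void unpack(int group);
    int  candidate_tile(int group, int thread_id);
    int  tile_valid(int tile);
    void compute_tile(int tile);

    void fine_grain_group_step(int group)
    {
        MPI_Request reqs[8];
        int nreq = 0;

        pack_and_isend(group - 1, reqs, &nreq);  /* send previous group's data  */
        post_irecv(group + 1, reqs, &nreq);      /* receive data for next group */

        #pragma omp parallel                     /* threads exist only for this step */
        {
            int tile = candidate_tile(group, omp_get_thread_num());
            if (tile_valid(tile))
                compute_tile(tile);
        }                                        /* implicit barrier, threads join */

        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        unpack(group + 1);
    }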

4.2 Coarse-grain Funneled Hybrid Parallelization

According to the coarse-grain model, threads are only spawned once and their ids are used to determine their flow of execution in the SPMD-like code. Obviously, message passing communication will be performed within the multi-threaded parallel region, hence the coarse-grain programming model requires at least a funneled thread support level. Assuming funneled multi-threading support, the master thread will have to undertake the entire message passing communication required for the inter-node data transfer, though other threads will be allowed to perform computation at the same time. We shall briefly refer to this case as the coarse-grain funneled model, or simply coarse-grain model in short.

The additional promising feature of the coarse-grain approach is the potential for overlapping multi-threaded computation with message passing communication. However, due to the restriction that only the master thread is allowed to perform message passing, a naive straightforward implementation of the coarse-grain model suffers from load imbalance between the threads, if equal portions of the computational load are assigned to all threads. Therefore, additional load balancing must be applied, so that the master thread will assume a relatively smaller computational load compared to the other threads, thus equalizing the per tile execution times of all threads. Moreover, the coarse-grain model avoids the overhead of re-initializing thread structures, since threads are spawned only once, and can potentially implement more generic parallelization schemes, as opposed to its limiting fine-grain counterpart.

The pseudo-code for the coarse-grain parallelization of the tiled algorithms is depicted in Alg. 4. Note that the inter-node communication (lines 8-13 and 17-20) is conducted by the master thread, per communication direction and per owner thread, incurring additional complexity compared both to the pure message passing and the fine-grain model. Also, note the bal(~p,~t) parameter in the computation, that corresponds to a balancing factor for thread ~t of process ~p and optionally implements load balancing between threads, as will be described in Section 5.
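
A minimal C/OpenMP sketch of this coarse-grain funneled structure (cf. Alg. 4 below) is given next; the helper routines and the way the balancing factor is passed to the computation are illustrative assumptions.

    #include <mpi.h>
    #include <omp.h>

    /* placeholders for the Pack/Isend/Irecv and Waitall/Unpack phases of Alg. 4 */
    void master_post_comm(int group, MPI_Request *reqs, int *nreq);
    void master_complete_comm(int group, MPI_Request *reqs, int nreq);
    void compute_balanced_tile(int group, int thread_id, double bal);

    void coarse_grain_funneled(int first_group, int last_group, double bal)
    {
        #pragma omp parallel                       /* threads spawned only once */
        {
            int tid = omp_get_thread_num();

            for (int group = first_group; group <= last_group; group++) {
                MPI_Request reqs[8];
                int nreq = 0;

                #pragma omp master                 /* funneled: master thread only */
                master_post_comm(group, reqs, &nreq);

                /* all threads compute; the master gets a smaller share
                   according to the balancing factor bal */
                compute_balanced_tile(group, tid, bal);

                #pragma omp master
                master_complete_comm(group, reqs, nreq);

                #pragma omp barrier                /* synchronize for next time step */
            }
        }
    }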

4.3 Coarse-grain Multiple Hybrid Parallelization

Should the message passing library provide full multi-threading support, another variation of the coarse-grain hybrid parallelization scheme is feasible. In fact, in the case of a thread-safe message passing library, relatively less programming effort is required compared to the funneled paradigm, as each thread can in theory satisfy its own communication needs. Moreover, the multiple thread support level allows for a more balanced distribution of the communication load to the available threads, as opposed to the funneled model, where the master thread undertakes the entire task of inter-process communication.

However, as message passing libraries consider only processes as peer communicating entities, additional care must be taken to implement point-to-point communication between threads residing on different processes with the aid of the message passing primitives. For instance, according to the MPI standard, there is no direct way for thread ~ti residing on process ~p to address thread ~tj of a different process ~p'. Message passing communication can be established only between the two processes ~p and ~p', as opposed to a finer level between threads. As a workaround to this problem, it is possible to implicitly integrate the thread id information into the MPI message tag, so that a message coming from a remote thread is indirectly matched only by the appropriate local thread.
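
A possible realization of this workaround is sketched below in C: the destination thread id is packed into the MPI tag together with an (assumed) direction identifier, so that only the receive posted by the intended thread matches the incoming message.

    #include <mpi.h>

    #define DIR_BITS 4   /* low tag bits reserved for a direction id (assumed) */

    /* encode destination thread id and communication direction into the tag;
     * the exact layout is an assumption for illustration */
    static int make_tag(int thread_id, int dir)
    {
        return (thread_id << DIR_BITS) | dir;
    }

    void thread_send(double *buf, int len, int dest_rank, int dest_thread,
                     int dir, MPI_Comm comm, MPI_Request *req)
    {
        MPI_Isend(buf, len, MPI_DOUBLE, dest_rank,
                  make_tag(dest_thread, dir), comm, req);
    }

    void thread_recv(double *buf, int len, int src_rank, int my_thread,
                     int dir, MPI_Comm comm, MPI_Request *req)
    {
        MPI_Irecv(buf, len, MPI_DOUBLE, src_rank,
                  make_tag(my_thread, dir), comm, req);
    }

Any such encoding is valid as long as the resulting tags stay within the upper bound advertised by the MPI implementation (attribute MPI_TAG_UB).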

Alg. 5 outlines the coarse-grain multiple implementation. The algorithm is similar to Alg. 4, except that the master directives have been removed, since all threads contribute to the inter-process communication.

Algorithm 4: coarse-grain funneled hybrid model

     /* multi-threaded parallel construct */
   1 #pragma omp parallel
       /* determine group and tile sequences */
   2   for i <- 1 to N do
   3     groupi = pi;
   4     tilei = piTi + ti;
       /* main loop: traverse all group instances */
   5   foreach groupN+1 ∈ G~p do
         /* calculate candidate tile to execute... */
   6     tileN+1 = groupN+1 - Σ_{i=1}^{N} tilei;
         /* only master thread communicates */
   7     #pragma omp master
           /* for each active process transmission direction: pack communication
              data for all threads and send to neighbor process */
   8       foreach ~dir ∈ S~p do
   9         for th <- 1 to T do
  10           Pack(~dir, groupN+1 - 1, ~p, th);
  11         MPI_Isend(~p + ~dir);
           /* for each active process reception direction: receive data from
              neighbor process */
  12       foreach ~dir ∈ R~p do
  13         MPI_Irecv(~p - ~dir);
         /* compute (with optional balancing) if valid tile */
  14     if 1 <= tileN+1 <= ⌈Z/z⌉ then
  15       Compute(~tile, bal(~p,~t));
         /* only master thread communicates */
  16     #pragma omp master
           /* complete pending communication */
  17       MPI_Waitall;
           /* for each active process reception direction: unpack communication
              data for all threads */
  18       foreach ~dir ∈ R~p do
  19         for th <- 1 to T do
  20           Unpack(~dir, groupN+1 + 1, ~p, th);
         /* synchronize threads for next time step */
  21     #pragma omp barrier

Algorithm 5: coarse-grain multiple hybrid model

  /* multi-threaded parallel construct */
  #pragma omp parallel
    /* determine group and tile sequences */
    for i <- 1 to N do
      groupi = pi;
      tilei = piTi + ti;
    /* main loop: traverse all group instances */
    foreach groupN+1 ∈ G~p do
      /* calculate candidate tile to execute... */
      tileN+1 = groupN+1 - Σ_{i=1}^{N} tilei;
      /* for each active thread transmission direction: pack communication data
         and send to neighbor process */
      foreach ~dir ∈ S~p,~t do
        Pack(~dir, groupN+1 - 1, ~p, ~t);
        MPI_Isend(~p + ~dir, tag(~t));
      /* for each active thread reception direction: receive data from
         neighbor process */
      foreach ~dir ∈ R~p,~t do
        MPI_Irecv(~p - ~dir, tag(~t));
      /* compute if valid tile */
      if 1 <= tileN+1 <= ⌈Z/z⌉ then
        Compute(~tile);
      /* complete pending communication */
      MPI_Waitall;
      /* for each active thread reception direction: unpack communication data */
      foreach ~dir ∈ R~p,~t do
        Unpack(~dir, groupN+1 + 1, ~p, ~t);
      /* synchronize threads for next time step */
      #pragma omp barrier

S~p,~t corresponds to the set of valid transmission directions for thread ~t of the owner process ~p. In particular, if ~dir ∈ S~p,~t, then data computed by thread ~t need to be sent to process ~p + ~dir, assuming ~p denotes a non-boundary process. Similarly, if ~dir ∈ R~p,~t, then process ~p needs to receive from its neighbor process ~p - ~dir, in order to satisfy dependencies related to data computed by thread ~t of ~p. As all threads call message passing primitives based on the sets S~p,~t, R~p,~t, the parameter tag() emphasizes the role of the message tag in matching communicating thread pairs. Note also that we will not be considering load balancing for the coarse-grain multiple case, as this hybrid implementation primarily aims to be as simple as possible, by benefiting from the full multi-threading support provided by the message passing library. For the sake of simplicity, we shall refer to this programming model as multiple hybrid for the rest of this article.

5 LOAD BALANCING FOR THE HYBRID MODEL

Since most existing message passing libraries allow only the master thread to perform inter-process communication, the coarse-grain hybrid models suffer from intrinsic load imbalance and require an appropriate computation distribution scheme in order to achieve good performance. Otherwise, if equal computational loads are assigned indiscriminately to all threads, the master thread will inevitably have to perform a larger fraction of work compared to the other threads.

The hyperplane scheduling scheme enables a more efficient load balancing between threads: since the computations of each time step are essentially independent of the communication data exchanged at that step, the former can be arbitrarily distributed among threads. Thus, it would be meaningful for the master thread to assume a smaller part of the computational load, so that the total computation and the total communication associated with the owner process is evenly distributed among all threads.

In order to achieve this, we propose both a static and a dynamic (adaptive) approach for the calculation of the balancing factor(s). According to the static approach, load balancing is applied at compile time based upon a theoretical estimation of the relative communication vs computation cost of a specific algorithm on the particular underlying infrastructure. Only fundamental system characteristics (average iteration execution time, network bandwidth and latency) are included in this analysis, so as to keep the static approach applicable in practice. On the other hand, the dynamic alternative makes for a more obtrusive approach, where the relative communication vs computation cost of the algorithm is sampled at run-time, and no prior assumptions are made regarding the theoretical behavior of the underlying system.

In this Section, we shall refer to the three proposed load balancing schemes, namely two variations of the static approach (constant and variable load balancing), as well as the dynamic, adaptive load balancing methodology.

5.1 Static Balancing

We have implemented two alternative static load balancing schemes. The first one (constant balancing) requires the calculation of a constant balancing factor, which is common for all processes, irrespective of the selected process topology. For this purpose, we consider a non-boundary process, that performs communication across all N process topology dimensions, and theoretically determine the computational fraction of the master thread that equalizes tile execution times on a per thread basis. We then apply this constant balancing factor to all processes and quantitatively specify how much less computational load should be assigned to the master thread, so as to achieve the equalization of the total tile execution times of all threads.

The second scheme (variable balancing) requires further knowledge of the process topology, and ignores communication directions cutting the iteration space boundaries, since these do not result in actual message passing. According to the variable balancing scheme, we compute a different balancing factor for boundary and non-boundary processes, since the former are from the communication perspective less heavily burdened than the latter. Depending on the position of its owner process in the topology, each master thread is assigned a different balancing factor, as opposed to the constant balancing scheme, where the same balancing factor was applied to all master threads.

For both cases (constant and variable balancing), the balancing factor(s) can be obtained by the following lemma:

Lemma 1. Let X1 × · · · × XN × Z be the iteration space of an N+1-dimensional iterative algorithm that imposes data dependencies [d1, 0, . . . , 0]^T, . . . , [0, . . . , 0, dN+1]^T. Let P = P1 × · · · × PN be the process topology and T the number of threads available for the parallel execution of the hybrid funneled implementation of the respective tiled algorithm. The overall completion time of the algorithm is minimal if the master thread assumes a portion bal/T of the process's computational load, where

    bal = 1 - [(T - 1) / tcomp(Xz/P)] · Σ_{i=1, i∈S~p}^{N} tcomm(diPiXz / (XiP))    (1)

where
    tcomp(x)   the computation time required for x iterations
    tcomm(x)   the transmission time of an x-sized message
    z          the tile height for each execution step
    S~p        the valid data transmission directions of process ~p
    X          equal to ∏_{i=1}^{N} Xi

The proof of the lemma is presented in the Appendix. We assume the computation time tcomp to be a linear function of the number of iterations, that is, we assume tcomp(ax) = a·tcomp(x). Note that if condition i ∈ S~p is evaluated independently for each process, variable balancing is enforced, as each communication term is only included in the sum of (1) as long as it contributes to message passing along the particular dimension for the specific process. If the above check is omitted, (1) delivers the constant balancing factor.
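
As a worked example, the following C sketch evaluates (1) for a 3D iteration space (N = 2 distributed dimensions), using the linear computation model and the latency/bandwidth communication model that are later introduced as equations (4) and (5); all numerical values (iteration space, topology, tile height, element size) are assumptions chosen for illustration only.

    #include <stdio.h>

    #define N 2

    int main(void)
    {
        const double X[N]  = {256, 256};        /* X1, X2                     */
        const double P[N]  = {2, 4};            /* process topology P1 x P2   */
        const double d[N]  = {1, 1};            /* dependence weights         */
        const int    S[N]  = {1, 1};            /* 1 if direction i is active */
        const double T     = 2;                 /* threads per process        */
        const double z     = 20;                /* tile height                */
        const double t_iter    = 288e-9;        /* sec per iteration          */
        const double t_startup = 107e-6;        /* sec per message            */
        const double bw        = 12.5e6;        /* bytes/sec (100 Mbps)       */
        const double elem      = 8.0;           /* bytes per iteration point  */

        double Xprod = 1.0, Pprod = 1.0;
        for (int i = 0; i < N; i++) { Xprod *= X[i]; Pprod *= P[i]; }

        /* t_comp(Xz/P): computation time of one tile (equation (4)) */
        double tile_comp = (Xprod * z / Pprod) * t_iter;

        /* sum of t_comm over the active transmission directions (equation (5)) */
        double comm = 0.0;
        for (int i = 0; i < N; i++)
            if (S[i])
                comm += t_startup +
                        (d[i] * P[i] * Xprod * z / (X[i] * Pprod)) * elem / bw;

        double bal = 1.0 - (T - 1.0) * comm / tile_comp;   /* equation (1) */
        printf("balancing factor bal = %.3f\n", bal);
        return 0;
    }

With the constant-balancing variant, the S[] check would simply be dropped; for variable balancing it is evaluated per process, as described above.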

The constant balancing scheme only requires knowledge of the underlying computational and network infrastructure, but also tends to overestimate the communication load for boundary processes. On the other hand, the variable balancing scheme can be applied only after selecting the process topology, as it uses that information to calculate a different balancing factor for each process. According to the variable scheme, the balancing factor is slightly smaller for non-boundary processes. However, as the active communication directions decrease for boundary processes, the balancing factor increases, so as to preserve the desirable thread load balancing.

Figure 2: Variable balancing for 8 processes × 2 threads and a 3D algorithm with iteration space X1 × X2 × Z, X1 = 4X2, dependencies [(d,0)^T, (0,d)^T], process topology 4 × 2 and 2 threads per process (2 × 1); the per-process balancing factors shown are 87%, 87%, 87%, 92% for the first process row and 95%, 95%, 95%, 100% for the second.

Fig. 2 demonstrates the variable load balancing scheme for a 3D tiled algorithm, that has been mapped onto a 2D virtual process topology on a dual SMP cluster. The different balancing factors have been computed by applying (1), where functions tcomp() and tcomm() have been approximated as in equations (4) and (5) of Section 6, whereas N = 2, P = 8, P1 = 4, P2 = 2, T = 2, X1 = 4X2, d1 = d2 = d. Generally, a factor bal, 0 <= bal <= 1, for load balancing T threads implies that the master thread assumes bal/T of the process's computational share, while each other thread is assigned a fraction (T - bal)/(T(T - 1)) of that share. Consequently, larger balancing factors are tantamount to the master thread assuming larger fractions of the process's computational load, with a balancing factor of 100% meaning equal distribution of that load among all existing threads (no balancing). For instance, since process (0, 1) does not have to perform communication along dimension X2, its master thread assumes a greater balancing factor (95% vs 87%) compared to the master thread of process (0, 0), which has to perform message passing communication along both the X1 and X2 directions. Moreover, process (3, 1) will not be transmitting any communication data, hence no balancing is applied to its master thread.

Clearly, the most important aspect for the effectiveness of both static load balancing schemes is the accurate architectural modeling of both the underlying infrastructure, as well as its basic performance parameters, such as the sustained network bandwidth and latency, and the requirements imposed by the specific algorithm in terms of processing power. In order to preserve the simplicity and applicability of the methodology, we avoid complicated in-depth software and hardware modeling, and adopt simple, yet partly crude, approximations for the system's structure and behavior. A dynamic scheme adopting an adaptive, run-time balancing approach could potentially overcome some of the drawbacks associated with static balancing and its approximations, and is hence considered next.

5.2 Dynamic Balancing

Instead of applying thread load balancing a priori, according to a particular static modeling of application-system performance, it is possible to adaptively calculate appropriate balancing factors, by slightly restructuring the hybrid coarse-grain paradigm. Under a more obtrusive approach, each thread can time its computation and communication performance at run-time, so that load balancing is applied mainly based upon that information, and by further relying on a minimum set of assumptions. Obviously, as with most dynamic, obtrusive programming approaches, adaptive balancing incurs an additional profiling and processing overhead, mainly due to the insertion of timers, though only for a relatively short part of the code.

Suppose we apply an initial factor bal for the balancing of the T threads of a specific process. bal can be computed by using either of the static load balancing techniques mentioned above. Assuming P processes, our intention is to sample program execution for more than PT time steps, so that the execution wavefront reaches the most distant process, as we would like to time communication performance in a full parallel pipeline state. Based on the information gathered during the sampling period, a more appropriate load balancing factor bal' can be applied for the rest of the algorithm, according to the following lemma:

Lemma 2. Assuming a balancing factor bal and T threads, let t^m_comp, t^m_comm be the average tile computation and communication times of the master thread, profiled at run-time. A more efficient balancing factor bal' can be computed using the following expression:

    bal' = 1 - bal · [(T - 1)/T] · (t^m_comm / t^m_comp)    (2)

The proof of Lemma 2 is presented in the Appendix. It should be noted that t^m_comm refers to the non-overlapped communication time of the master thread, that is, excluding any DMA-driven communication overlapped with computation, hence it can easily be profiled. In practice, the coarse-grain model depicted in Alg. 4 is unfolded into two parts, the sampling period and the main execution period. During the sampling period, the initial balancing factor bal is applied, and the master thread profiles its partial tile computation and communication times (t^m_comp and t^m_comm, respectively), that approximately constitute the total tile execution time. Based on these measurements, the new balancing factor bal' is computed, and further applied for the main execution period of the program.
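
The re-balancing step itself reduces to a few arithmetic operations once the sampled averages are available, for example by timing Compute and MPI_Waitall with omp_get_wtime() on the master thread during the sampling period; in the C sketch below, clamping the result to [0, 1] is an added safeguard, not part of the paper's formulation.

    /* adaptive re-balancing according to equation (2) */
    double rebalance(double bal, int T,
                     double total_comp_time, double total_comm_time,
                     int sampled_tiles)
    {
        double t_comp_m = total_comp_time / sampled_tiles;  /* average per tile */
        double t_comm_m = total_comm_time / sampled_tiles;

        /* equation (2): bal' = 1 - bal * (T - 1)/T * t_comm_m / t_comp_m */
        double bal_new = 1.0 - bal * (T - 1.0) / T * t_comm_m / t_comp_m;

        if (bal_new < 0.0) bal_new = 0.0;   /* keep the factor meaningful */
        if (bal_new > 1.0) bal_new = 1.0;
        return bal_new;
    }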

The adaptive balancing approach is expected to incur some additional performance penalty, mainly due to the timing requirements and the re-balancing process. More importantly, it adopts two basic assumptions: first, that the communication time is invariant to the computation distribution between threads, and second, that the computation time is proportional to the number of iterations. The first assumption misrepresents the potential of overlapping computation and communication, while the second fails to consider cache effects. However, these simplifications ensure the generality and applicability of the proposed dynamic balancing scheme. In fact, the dynamic scheme requires even less input to be supplied by the user concerning application and platform characteristics, as it basically extracts that information by monitoring and profiling program execution performance at run-time.

6 EXPERIMENTAL RESULTS

In order to test the efficiency of the proposed load balancing schemes, we compared the performance of all proposed programming models (message passing, fine-grain hybrid, coarse-grain hybrid) against two kernel benchmarks, namely Alternating Direction Implicit integration (ADI), as well as a second order backward discretization of the Diffusion Equation (DE). ADI is a stencil computation used for solving partial differential equations (Karniadakis and Kirby (2002)). Essentially, ADI is a simple three-dimensional perfectly nested loop algorithm, that imposes unitary data dependencies across all three space directions. On the other hand, DE is a similar three-dimensional algorithm, that can be derived from the following unsteady diffusion PDE with the aid of backward discretization:

    ∂Θ/∂t = ∇²Θ
    Θ(x, y, 0) = f(x, y)                           (3)
    Θ(x, y, t) = g(x, y, t) on ∂Ω

Both kernels have an iteration space of X1 × X2 × Z, where Z is considered to be the longest algorithm dimension. Furthermore, both applications are suitable for the performance evaluation of all parallel programming models and techniques discussed here, as they are typical representatives of nested loop algorithms and comply with our target algorithmic model. More importantly, they impose communication in all three iteration space dimensions, and thus facilitate the comparison of the various programming models in terms of communication performance. In fact, in the DE kernel, a process transmits three times the communication volume along each iteration space dimension compared to ADI; therefore, the relative performance of the two kernels enables a scalability evaluation of the different programming models with respect to the inherent communication needs of a specific application.

We use MPI as the message passing library and OpenMP as the multi-threading API. Our experimental platform is an 8-node Pentium III dual-SMP cluster interconnected with 100 Mbps FastEthernet. Each node has two Pentium III CPUs at 800 MHz, 256 MB of RAM, 16 KB of L1 I Cache, 16 KB of L1 D Cache, 256 KB of L2 cache, and runs Linux with the 2.4.26 kernel. For the support of OpenMP directives, we use the Intel C++ compiler v.8.1 with the following optimization flags: -O3 -mcpu=pentiumpro -openmp -static. We also used two MPI implementations, namely the popular open-source MPICH library (v.1.2.6), as well as the proprietary ChaMPIon/Pro MPI library (v.1.1.1). Both MPI libraries have been appropriately configured, in order to perform SMP intra-node communication efficiently (e.g. through SYS V shared memory). MPICH provides at most a funneled thread support level, while ChaMPIon/Pro is fully thread-safe, and will be mainly used for the performance comparison of the multiple hybrid implementation with the other hybrid alternatives. Some fine-tuning of the MPICH communication performance for our experimental platform indicated using a maximum socket buffer size of 104 KB, so the respective environment variable was appropriately set to that value for all cluster nodes.

Under the MPICH library, we are able to fully exploit our cluster architecture, as we will be using 16 processes for the pure message passing experiments, and 8 processes with 2 threads per process for all hybrid programs. With the ChaMPIon/Pro library, the number of available licenses compelled us to use a maximum of 8 processes for the message passing case, and only 4 processes for the hybrid implementations, with each process spawning 2 threads. Furthermore, all experimental results are averaged over three independent executions for each case.

Regarding static thread load balancing, a simplistic approach is adopted in order to model the behavior of the underlying infrastructure, so as to approximate the quantities tcomp and tcomm of (1). As far as tcomp is concerned, we assume the computational cost involved with the calculation of x iterations to be x times the average cost required for a single iteration. On the other hand, the communication cost is considered to consist of a constant start-up latency term, as well as a term proportional to the message size, that depends upon the sustained network bandwidth on application level. Formally, we define

    tcomp(x) = x · tcomp(1)                        (4)

    tcomm(x) = tstartup + x / Bsustained           (5)

Since our primary objective was preserving simplicity and applicability in the modeling of environmental parameters, we intentionally overlooked more complex phenomena, such as cache effects or precise communication modeling. For instance, we ignored cache effects in the estimation of the computation cost, as we intended to avoid performing a memory access pattern analysis of the tiled application with respect to the memory hierarchy configuration of the underlying architecture. Also, a major difficulty we encountered was modeling the TCP/IP communication performance and incorporating that analysis in our static load balancing scheme. Assuming distinct, non-overlapping computation and communication phases and relatively high sustained network bandwidth allowed us to bypass this restriction. However, this hypothesis underestimates the communication cost for short messages, since these are mostly latency-bound and sustain relatively low throughput. On the other hand, it overestimates the respective cost in the case of longer messages, where DMA transfers alleviate the CPU significantly. For our analysis, we considered tcomp(1) = 288 nsec, tstartup = 107 usec and Bsustained = 100 Mbps.

Finally, as regards adaptive balancing, we considered a duration of 2PT time steps for the sampling period. For instance, given P = 8 and T = 2, this corresponds to 32 time steps, while each process will be executing 82 to 8k tiles, depending on the tile height (z ranges from 2 to 200, while dimension Z is always equal to 16k, for all iteration spaces). Note that for some cases, the duration of the sampling period is not negligible compared to the main execution period.

6.1 ADI Integration

Initially, we performed an overall performance comparison of all proposed programming models against the ADI kernel, by using both the MPICH and the ChaMPIon/Pro MPI implementations. It should be noted that although this comparison may be useful towards the efficient usage of SMP clusters, it cannot be generalized beyond the chosen hardware-software combination, as it largely depends upon the comparative performance of the specific programming APIs (MPICH, ChaMPIon/Pro, OpenMP support on the Intel compiler), as well as on how efficiently the MPI library supports multi-threaded programming.

We tried five different iteration spaces for ADI, in order to investigate the performance variation of each model for different process topologies. For each iteration space and programming model, we selected among all feasible Cartesian process topologies the one that minimizes the total execution time. The selected topologies for each case are depicted in Table 1. The variety of iteration spaces and the potential for multiple alternative process topologies eloquently demonstrates that the comparison of the pure message passing model with the hybrid approach is a non-trivial issue, even from the communication perspective. Generally, the total communication volume is reduced when assuming the hybrid approach, however this reduction is not necessarily proportional to the number of threads T, but also depends on the particular process topology. For instance, considering the iteration space 32 × 256 × 16k, we expect the process topology assumed with the hybrid implementations to be quite beneficial, as it eliminates the need for communication across the long X2 = 256 iteration space dimension. On the other hand, for the iteration space 256 × 256 × 16k, the communication data is reduced by approximately 33%, while there are only half as many entities to perform message passing under funneled multi-threading support (16 processes in the pure message passing model vs 8 master threads in the hybrid alternatives). Thus, even theoretically, the advantages of the hybrid approach may be diminished without proper thread load balancing.

Figure 3: Comparison of hybrid models (ADI integration, 8 dual SMP nodes, MPICH, various iteration spaces).

  iteration space      message passing   hybrid
  16 × 256 × 16k       1 × 16            1 × 8
  32 × 256 × 16k       2 × 8             1 × 8
  64 × 256 × 16k       2 × 8             1 × 8
  128 × 256 × 16k      2 × 8             2 × 4
  256 × 256 × 16k      4 × 4             2 × 4

Table 1: Selected process topologies for the message passing and for the hybrid models

The overall experimental results for ADI integration and the above iteration spaces are depicted in Fig. 3 (MPICH) and Fig. 5 (ChaMPIon/Pro). These results are normalized with respect to the message passing execution times, so as to allow for a straightforward quantitative comparison. Furthermore, granularity measurements for various iteration spaces are depicted in Fig. 4 for the MPICH library.

The following conclusions can be drawn from a thorough investigation of the obtained performance measurements (MPICH library):

• Unoptimized hybrid parallelization (fine-grain, coarse-grain unbalanced) is often worse than the pure message passing approach. If no load balancing is applied to the hybrid model, its relative performance compared to the pure message passing model exclusively depends on the appropriateness of the selected process topology, given the specific algorithm and iteration space.


Figure 4: Granularity results for ADI integration (MPICH, 8 dual SMP nodes). Each panel plots total execution time (sec) against tile height for 16 processes versus 8 processes × 2 threads: (a) 16 × 256 × 16k, (b) 64 × 256 × 16k, (c) 256 × 256 × 16k iteration spaces. The series are message passing, fine-grain hybrid and coarse-grain hybrid (unbalanced, constant, variable, adaptive balance); panel (c) also marks the transition from the eager to the rendezvous protocol.

Figure 5: Comparison of hybrid models (ADI integration, 4 dual SMP nodes, ChaMPIon/Pro, various iteration spaces). Execution time ratio normalized to the message passing model; the series of Fig. 3 are shown together with the multiple hybrid implementation.

As anticipated, the hybrid implementation performs better for the 32 × 256 × 16k iteration space, even without load balancing, as the assumed process topology of the hybrid implementation eliminates the largest fraction of the message passing communication required. In almost all other cases, the pure message passing model performed better than the unbalanced hybrid implementations.

• The coarse-grain hybrid model generally performs better than the fine-grain alternative for most iteration spaces. However, the coarse-grain model is not always more efficient than its fine-grain counterpart (e.g. 128 × 256 × 16k, 256 × 256 × 16k). This observation reflects the fact that the poor load balancing of the simple coarse-grain model diminishes its advantages compared to the fine-grain alternative.

• When applying constant static balancing, in some cases the balanced coarse-grain implementation is less effective than the unbalanced alternatives (e.g., 16 × 256 × 16k, 32 × 256 × 16k). This can be attributed both to inaccurate theoretical modeling of the system parameters for the calculation of the balancing factors, and to the inappropriateness of the constant balancing scheme for boundary processes. Generally, the constant balancing approach performs well for relatively symmetric process topologies, with few of the processes lying at the topology boundary.

• When applying variable static balancing, the coarse-grain hybrid model was able to deliver superior performance to the message passing alternative in all cases. The performance improvement lies in the range of 2-12%, also depending upon the selected process topology in each case.

• The same performance improvement is also observed for the adaptive dynamic balancing approach,


thus confirming that the required information for the application of efficient load balancing can be obtained at run-time, with minimal overhead. Overall, the variable static and the adaptive dynamic balancing techniques have delivered the best parallel performance in almost all cases, and have proven to be the most reliable and efficient parallelization approaches.

The granularity results in Fig. 4 reveal that these two balancing techniques (variable, adaptive) perform better than any other implementation for almost all granularities. Some performance degradations observed at certain threshold values of the tile height z can be ascribed to the transition from the eager to the rendezvous MPICH message protocol, occurring at 128000 bytes (for instance, see Fig. 4(c) at z = 125). Depending on the message size, MPICH resorts to three different message passing communication protocols, namely short, eager and rendezvous, each assuming a different approach in terms of intermediate buffering and/or receiver notification. Generally, there are certain trade-offs when transitioning between these protocols, which mainly correlate sender-receiver synchronization with the overhead of intermediate buffering; therefore, performance implications should not be surprising.
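To see why the degradation appears near z = 125 in Fig. 4(c), assume (our reading of the 2 × 4 hybrid topology of Table 1) that each per-tile message carries a boundary face of 256/2 = 128 double-precision elements per unit of tile height; the message size is then

$$128 \cdot z \cdot 8\ \text{bytes} = 128000\ \text{bytes} \iff z = 125,$$

which coincides with the eager-to-rendezvous switch-over of MPICH.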

The performance evaluation of all models with the aid of the ChaMPIon/Pro implementation mainly aimed at providing additional insight into the multiple hybrid implementation, since we simply repeated the same experiments on a smaller scale. Fig. 5 indicates that an appropriately balanced hybrid funneled implementation is usually even slightly better than the multiple hybrid implementation, confirming our intuition that appropriate load balancing can mitigate the limited multi-threading support and also avoid the overhead associated with ensuring thread safety of the message passing library. As many popular message passing libraries currently provide only limited multi-threading support, appropriately load balancing the funneled hybrid model arises as a feasible and efficient parallel programming alternative. In addition, even if the thread safety of the message passing library allows all threads to perform inter-node communication, the underlying network infrastructure might be too restrictive for the efficient utilization of this feature (e.g., a single NIC shared by multiple threads).
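The funneled pattern referred to above can be summarized by the following minimal C sketch; the step count and the compute/exchange helpers are placeholders of ours, not the paper's actual kernels.

/* Minimal sketch of the coarse-grain funneled hybrid pattern: all threads
 * compute their share of a tile, but only the master thread issues MPI calls. */
#include <mpi.h>
#include <omp.h>

#define NSTEPS 4  /* placeholder number of execution steps */

static void compute_tile_part(int step, int tid) { (void)step; (void)tid; /* thread's share */ }
static void exchange_boundaries(int step)        { (void)step; /* MPI boundary exchange    */ }

int main(int argc, char **argv)
{
    int provided;
    /* Request funneled support: only the master thread will call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    #pragma omp parallel
    {
        for (int step = 0; step < NSTEPS; step++) {
            compute_tile_part(step, omp_get_thread_num()); /* balanced share    */
            #pragma omp barrier
            #pragma omp master
            exchange_boundaries(step);                     /* master only       */
            #pragma omp barrier
        }
    }

    MPI_Finalize();
    return 0;
}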

6.2 Diffusion Equation

We conducted a similar experimental evaluation for the DE kernel, using both the MPICH and the ChaMPIon/Pro libraries. The results are summarized in Fig. 6 for the MPICH library, and in Fig. 7 for the ChaMPIon/Pro implementation. In all cases, the execution times have been normalized to the message passing model.

We have observed behavior similar to that of the ADI benchmark. Variable and adaptive balancing deliver the best results, providing a performance improvement in the range of 3-22%.

Figure 6: Comparison of hybrid models (Diffusion equation, 8 dual SMP nodes, MPICH, various iteration spaces). Execution time ratio normalized to the message passing model.

Figure 7: Comparison of hybrid models (Diffusion equation, 4 dual SMP nodes, ChaMPIon/Pro, various iteration spaces). Execution time ratio normalized to the message passing model; the multiple hybrid implementation is also included.


The relative performance improvement in the DE kernel is significantly higher in all iteration spaces than for ADI, reflecting the fact that hybrid programming can be particularly beneficial in algorithms with high communication-to-computation demands. The reader should also note that although DE imposes more communication overhead than ADI, it is still heavily computation-bound. Since the advantage of hybrid programming lies in the efficient utilization of the underlying architecture for communication purposes, it could be even more suitable for the parallelization of communication-bound applications, as long as these do not saturate the available memory bandwidth.

Fig. 7 further confirms that multiple hybrid parallelization is not more efficient than either the variably or the adaptively balanced coarse-grain funneled hybrid implementation, and even leads to slight performance degradation in the 64 × 256 × 16k iteration space. Also, the relative performance improvement is significantly higher for the DE kernel (more than 5% in all cases) than for ADI (around 3% for the various iteration spaces); thus the same qualitative behavior observed in the MPICH experiments is also exhibited for the ChaMPIon/Pro library, though on a smaller scale.

7 CONCLUSIONS

This paper discusses load balancing issues regarding the hybrid parallelization of tiled loop algorithms. We propose three load balancing schemes, namely two static techniques (constant, variable) and a dynamic (adaptive) one. Static balancing is based on application-system modeling, so as to determine a suitable task distribution between threads. On the other hand, dynamic balancing extracts similar information at run-time, by profiling the actual application performance and re-evaluating the initial load balancing strategy. We have compared both message passing and hybrid implementations, and demonstrated the performance improvements that can be attained by resorting to an appropriately balanced coarse-grain hybrid implementation.

The experimental evaluation has confirmed that, for the algorithms considered here, unoptimized hybrid parallelization may perform poorly, even worse than the monolithic message passing paradigm. Moreover, coarse-grain SPMD-like hybrid parallelization does not always outperform the fine-grain incremental alternative, as the poor load balancing of the former diminishes its relative advantages with respect to the latter (e.g. higher parallelization efficiency, overlapping computation with communication, no thread re-initialization overhead). However, when applying variable or adaptive load balancing, the coarse-grain funneled hybrid model was able to deliver superior performance with respect to message passing parallelization, and even to exhibit performance similar to the coarse-grain multiple alternative, but without requiring additional thread-safety support from the message passing library.

Generally, the performance improvements are proportional to the intrinsic communication demands of the parallelized algorithm, and also depend on the efficiency of the process topology assumed in each case. In conclusion, efficient load balancing can effectively mitigate limited multi-threading support, and delivers superior performance compared to standard message passing implementations. All proposed models can easily be adapted to more generic parallelization techniques, as we have emphasized applicability and simplicity.

A Static Load Balancing

Proof. For the sake of simplicity, and without loss of generality, we assume that all divisions result in integers, in order to avoid over-complicating our equations with ceil and floor operators. Thus, each process will be assigned $\frac{Z}{z}$ tiles, each containing $\frac{Xz}{P}$ iterations. Furthermore, under a balancing factor $bal$, the master thread will assume the execution of a fraction $\frac{bal}{T}$ of the process's computational load, while each of the other $T-1$ threads will be assigned a fraction $\frac{T-bal}{T(T-1)}$ of the remaining computations. Note that under the funneled hybrid model, only the master thread is allowed to perform inter-node communication, and should therefore take care of both its own communication data, as well as the communication data of the other threads. The overall completion time of the algorithm can be approximated by multiplying the total number of execution steps with the execution time required for each tile (step), $t_{tile}$. It holds

$$t_{tile} = \max\left(t^m_{tile},\; t^o_{tile}\right) \quad (6)$$

where

$$t^m_{tile} = t_{comp}\left(\frac{bal}{T}\,\frac{Xz}{P}\right) + \sum_{\substack{i=1 \\ i \in S_{\vec{p}}}}^{N} t_{comm}\left(\frac{d_i P_i X z}{X_i P}\right) \quad (7)$$

is the tile execution time of the master thread, while

$$t^o_{tile} = t_{comp}\left(\frac{T-bal}{T(T-1)}\,\frac{Xz}{P}\right) \quad (8)$$

is the tile execution time of a non-master thread. In order to minimize the overall completion time, or equivalently the execution time for each tile (since the number of execution steps does not depend on the load distribution between threads), $t^m_{tile}$ must be equal to $t^o_{tile}$. If this is not the case, that is if $t^m_{tile} \neq t^o_{tile}$, there can always be a more efficient load balancing strategy, obtained by assigning more work to the more lightly burdened thread(s). Consequently, for minimal completion time under the assumed mapping, it holds

$$t^m_{tile} = t^o_{tile} \quad (9)$$

Assuming that the computation time $t_{comp}$ is a linear function of the number of iterations, that is, $t_{comp}(ax) = a\,t_{comp}(x)$, (7), (8) and (9) can easily be combined to deliver (1).
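Under the linearity assumption, equating (7) and (8) gives a closed form for the balancing factor. The fragment below is our sketch of that computation, with tile_comp standing for $t_{comp}(\frac{Xz}{P})$ and master_comm for the communication summation of (7); equation (1) in the main text is the authoritative statement.

/* Sketch: static balancing factor obtained by equating (7) and (8),
 * assuming t_comp(a*x) = a * t_comp(x).  Variable names are ours. */
static double static_balance(double tile_comp,   /* t_comp(X*z/P)             */
                             double master_comm, /* communication term of (7) */
                             int T)              /* threads per process       */
{
    double bal = 1.0 - (T - 1) * master_comm / tile_comp;
    if (bal < 0.0) bal = 0.0;  /* master saturated by communication */
    return bal;  /* master executes bal/T of the tile iterations,
                    each other thread (T - bal)/(T*(T-1)) of them */
}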


B Dynamic Load Balancing

Proof. We assume that the initial thread load balancing is in general non-optimal in practice; that is, if $t^m_{comp}$, $t^m_{comm}$ are the average tile computation and communication times of the master thread and $t^o_{comp}$ the average tile computation time of a non-master thread, it holds

$$t^m_{tile} \neq t^o_{tile} \;\Rightarrow\; t^m_{comp} + t^m_{comm} \neq t^o_{comp}$$

Such suboptimal load balancing between the threads can be ascribed to inaccurate theoretical system modeling, as well as to the various approximations in performance estimation. Our goal is to determine a new balancing factor $bal'$, such that the average tile execution times become equalized for all threads, that is

$$t'^{m}_{comp} + t'^{m}_{comm} = t'^{o}_{comp} \quad (10)$$

We assume that the time required for the computation of a tile is proportional to the number of iterations associated with that tile. Formally, we assume

$$t^m_{comp} \sim \frac{bal}{T}\,\frac{\prod_{i=1}^{N} X_i z}{P} \quad (11)$$

$$t^o_{comp} \sim \frac{T-bal}{T(T-1)}\,\frac{\prod_{i=1}^{N} X_i z}{P} \quad (12)$$

Because of (11), it holds

$$t'^{m}_{comp} = \frac{bal'}{bal}\,t^m_{comp} \quad (13)$$

and due to (12), it also holds

$$t'^{o}_{comp} = \frac{T-bal'}{T-bal}\,t^o_{comp} \quad (14)$$

Moreover, assuming communication time invariance, we approximate

$$t'^{m}_{comm} \simeq t^m_{comm} \quad (15)$$

By combining (13), (14) and (15), (10) delivers

$$\frac{bal'}{bal}\,t^m_{comp} + t^m_{comm} = \frac{T-bal'}{T-bal}\,t^o_{comp} \;\Rightarrow\; bal' = \frac{bal\,T\,t^o_{comp} - bal\,(T-bal)\,t^m_{comm}}{(T-bal)\,t^m_{comp} + bal\,t^o_{comp}} \quad (16)$$

By further approximating $t^o_{comp}$ by $\frac{T-bal}{bal(T-1)}\,t^m_{comp}$ because of (11) and (12), we can easily deduce (2).
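Equation (16) can be evaluated directly from the average per-tile times measured during the sampling period; a minimal sketch (variable names are ours):

/* Sketch: adaptive re-balancing factor of equation (16). */
static double adaptive_balance(double bal,      /* current balancing factor       */
                               double tm_comp,  /* master: avg tile comp time     */
                               double tm_comm,  /* master: avg tile comm time     */
                               double to_comp,  /* non-master: avg tile comp time */
                               int T)           /* threads per process            */
{
    double num = bal * T * to_comp - bal * (T - bal) * tm_comm;
    double den = (T - bal) * tm_comp + bal * to_comp;
    return num / den;  /* bal' of (16) */
}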

ACKNOWLEDGMENT

The Project is co-funded by the European Social Fund (75%) and National Resources (25%).

