
Efficient Loop Partitioning for Parallel Codes of Irregular Scientific Computations

Minyi Guo, Li Li
Dept. of Computer Software
The University of Aizu
Aizu-Wakamatsu City, Fukushima, 965-8580, Japan
{minyi, r1096501}@u-aizu.ac.jp

Weng-Long Chang
Dept. of Info. Management
Southern Taiwan University of Technology
Tainan, Taiwan, 710, R.O.C.
[email protected]

Abstract

In most cases of distributed memory computations, node programs are executed on processors according to the owner computes rule. However, the owner computes rule is not best suited for irregular application codes, for two reasons: the use of indirection in accessing the left-hand side array makes it difficult to partition the loop iterations, and because of the use of indirection in accessing right-hand side elements, total communication may be reduced by using heuristics other than the owner computes rule. In this paper, we propose a communication cost reduction computes rule for irregular loop partitioning, called the least communication computes rule. We partition a loop iteration to the processor on which minimal communication cost is ensured when executing that iteration. Then, after all iterations are partitioned among the processors, we give a global vs. local data transformation rule, indirection array remapping, and communication optimization methods. The experimental results show that, in most cases, our approaches achieve better performance than other loop partitioning rules.

Keywords: Parallelizing compilers, irregular scientific applications, communication optimization, owner computes rule, loop partitioning, distributed memory multicomputers.

1 Introduction

In recent years, there have been major efforts in developing approaches to parallelization of scientific applications. Distributed memory machines provide the computing power that scientists and engineers need for research

This research was supported in part by the Grant-in-Aid for Scientific Research (C)(2) 14580386 and The Japanese Okawa Foundation for Information and Telecommunications under Grant Program 01-12.

and development. Parallelizing compilers play an important role by automatically customizing programs for complex processor architectures, improving portability and providing high performance to non-expert programmers.

As scientists attempt to model and compute more complicated problems, they have to develop efficient parallel code for sparse and unstructured problems in which array accesses are made through a level of indirection. This means that the data arrays are indexed through the values in other arrays, which are called indirection arrays or index arrays. The use of indirect indexing causes the data access patterns, i.e., the indices of the data arrays being accessed, to be highly irregular. Such a problem is called an irregular problem, in which the dependency structure is determined by values known only at runtime. Irregular applications are found in unstructured computational fluid dynamics (CFD) solvers, molecular dynamics codes, diagonal or polynomial preconditioned iterative linear solvers, and n-body solvers.

Exploiting parallelism for irregular problems becomes very difficult due to their irregular data access patterns. Figure 1 illustrates a typical irregular code segment. Here, elements are moved across the rows of a 2-D array based on the information provided in the indirection array ielem. The elements of array cell are shuffled and stored in array new_cell.

From this example we can observe that accesses to the data array new_cell are dictated by the contents of the index array ielem in the irregular loop. Such irregular access patterns make compiler analyses for data distribution, locality and cache optimization, and communication optimization more difficult.

Researchers have demonstrated that the performance of irregular parallel code can be improved by applying a combination of computation and data layout transformations. Some research focuses on providing primitives and libraries


      DO 100 i = 1, rows
C       size(i) is the number of
C       elements in the i-th row
        DO 200 j = 1, size(i)
          new_cell(ielem(i,j),
     &             next(ielem(i,j))) = cell(i,j)
          next(ielem(i,j)) = next(ielem(i,j))+1
200     CONTINUE
100   CONTINUE

Figure 1. A typical irregular loop

for runtime support [2, 14, 4, 11], some provide language support, such as adding irregular facilities to HPF or Fortran 90 [18, 13, 20], and some attempt to utilize caches and locality efficiently [5].

Hwang et al. [14] presented a library called CHAOS, which helps users implement irregular programs on distributed memory machines. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors; it supports index translation mechanisms and provides users with high-level mechanisms for optimizing communication. In particular, it provides support for parallelization of adaptive irregular programs, where indirection arrays are modified during the course of computation. In the same group, Ponnusamy et al. extended the CHAOS runtime procedures, which are used by a prototype Fortran 90D compiler, to make it possible to emulate irregular distribution in HPF by reordering elements of data arrays and renumbering indirection arrays [18]. Also, Das et al. [4] discussed primitives to support communication optimization of irregular computations on distributed memory architectures. These primitives coordinate interprocessor data movement; manage the storage of, and access to, copies of off-processor data; minimize interprocessor communication requirements; and support a shared name space.

Other work has investigated efficient runtime and compiler support for parallel irregular reductions [9], and efficient implementation of sparse matrix computations (e.g., LU decomposition and multiplication) [3].

In this paper, we propose an approach that minimizes communication cost in the preprocessing performed when compiling irregular loops. In our approach, neither the owner computes rule nor the almost owner computes rule, which was proposed in [18], is used to decide where a loop iteration of an irregular computation executes. Instead, a communication cost reduction computes rule, called the least communication computes rule, is

proposed. For a given irregular loop (if a loop body includes assignments with indirection array references, we call it an "irregular loop"), two sets Import(Pj) and Export(Pj) are defined for every processor Pj, 0 <= j < p, where Import(Pj) is the set of processors which have to send data to processor Pj before an iteration is executed, and Export(Pj) is the set of processors to which Pj has to send data after the iteration is executed. According to this information, we partition each loop iteration to the processor on which minimal communication is ensured when executing that iteration. Then, after all iterations are partitioned among the processors, we give a global vs. local data transformation rule, data remapping, and communication optimization methods.

2 Should Owner Computes Rule Always Be Used?

In the following discussion, we assume that the irregular loop body has only loop-independent dependences and no loop-carried dependences (it is very difficult to test irregular loop-carried dependence, since dependence testing methods for linear subscripts are completely disabled by indirection), because most practical irregular scientific applications have this kind of loop. Consider the irregular loop below:

Example 1

      DO 100 i = 1, N
S1:     Y(iy(i)) = X(ix(i))+Z(iz(i))
S2:     X(ix(i)) = X(ix(i))+Y(iy(i))+Z(iz(i))
S3:     Z(iz(i)) = X(ix(i))+Y(iy(i))
100   CONTINUE

Generally, in distributed memory compilation, loop iterations are partitioned to processors according to the owner computes rule [1]. This rule specifies that, in a single-statement loop, each iteration will be executed by the processor which owns the left-hand side array reference of the assignment for that iteration.
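All distributions in the examples below are BLOCK, so ownership can be computed directly from an element index. As a small helper for the sketches that follow (our own illustration, with 0-based processor numbers, not part of the paper's runtime):

C     Owner of array element K under a BLOCK distribution of
C     N elements over NP processors (processors numbered 0..NP-1).
      INTEGER FUNCTION OWNER(K, N, NP)
      INTEGER K, N, NP
      OWNER = (K-1) / (N/NP)
      RETURN
      END

For example, with N = 12 and NP = 3, element 5 belongs to processor P1.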

For the loop in Example 1, if the owner computes rule is applied, the first step is to distribute the loop into three individual loops, each containing statement S1, S2, or S3, respectively. Without loss of generality, for iteration i0, assume that X(ix(i0)), Y(iy(i0)), and Z(iz(i0)) are distributed onto P0, P1, and P2, respectively. Then iteration i0 of S1, S2, and S3 would be partitioned to processors P1, P0, and P2, respectively. Thus, if any array element referenced on the right-hand side is not owned by the processor executing the statement (that is, it is an off-processor reference), the data would have to be communicated to the owner. Table 1 shows the owner of each assignment and the required communications for the example loop.

Table 1. The owner of executing assignments and required communications for the example loop.

Statement   Owner   Required communication
S1          P1      X(ix(i0)): P0 -> P1,  Z(iz(i0)): P2 -> P1
S2          P0      Y(iy(i0)): P1 -> P0,  Z(iz(i0)): P2 -> P0
S3          P2      X(ix(i0)): P0 -> P2,  Y(iy(i0)): P1 -> P2

However, the owner computes rule is often not best suited for irregular codes, for two reasons. First, the use of indirection in accessing the left-hand side array makes it difficult to partition the loop iterations according to the owner computes rule. Second, because of the use of indirection in accessing right-hand side elements, total communication may be reduced by using heuristics other than the owner computes rule. Therefore, in the CHAOS library, Ponnusamy et al. [18, 19] proposed a heuristic method for irregular loop partitioning called the almost owner computes rule, in which an iteration is executed on the processor that owns the largest number of distributed array references in the iteration.
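As a concrete sketch of this rule (ours, not CHAOS code), the function below counts, for one iteration I of Example 1, how many of the three referenced elements each processor owns under a BLOCK distribution, and returns the processor owning the most, breaking ties by the lowest processor number. The name AOWNER and the fixed bound of 32 processors are our own assumptions.

C     "Almost owner" of iteration I of Example 1 under a BLOCK
C     distribution of N elements over NP processors (NP <= 32).
      INTEGER FUNCTION AOWNER(I, IX, IY, IZ, N, NP)
      INTEGER I, N, NP, IX(N), IY(N), IZ(N)
      INTEGER CNT(0:31), BLK, P, BEST
      BLK = N / NP
      DO 10 P = 0, NP-1
         CNT(P) = 0
10    CONTINUE
C     count ownership of the three elements touched by iteration I
      CNT((IX(I)-1)/BLK) = CNT((IX(I)-1)/BLK) + 1
      CNT((IY(I)-1)/BLK) = CNT((IY(I)-1)/BLK) + 1
      CNT((IZ(I)-1)/BLK) = CNT((IZ(I)-1)/BLK) + 1
      BEST = 0
      DO 20 P = 1, NP-1
         IF (CNT(P) .GT. CNT(BEST)) BEST = P
20    CONTINUE
      AOWNER = BEST
      RETURN
      END

When all candidate processors own exactly one element, as in the iteration discussed next, the rule gives no preference and some arbitrary tie-break (here, the lowest number) must be used.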

For the above irregular loop, if the almost owner computes rule is used, then because each of P0, P1, and P2 owns one element participating in iteration i0, we can select any one of them, for example P0, as the owner. This means that all the statements of iteration i0 are computed on processor P0. The required communications are shown in Table 2, where tmp_Y and tmp_Z denote values computed on the executing owner that must be sent back to the owners of Y(iy(i0)) and Z(iz(i0)), respectively.

Table 2. The required communications for the example loop if the almost owner computes rule is used.

Statement   Owner   Required communication
S1          P0      computes tmp_Y;  Z(iz(i0)): P2 -> P0
S2          P0      none
S3          P0      computes tmp_Z;  Y(iy(i0)): P1 -> P0

After the iteration is executed, tmp_Y and tmp_Z are scattered to the owners of Y(iy(i0)) and Z(iz(i0)).

Obviously the communication cost is reduced compared with the owner computes rule. Some HPF compilers employ this scheme through the EXECUTE-ON-HOME clause [20]. However, when we parallelized the fluid dynamics solver ZEUS-2D using the almost owner computes rule, we found that it is not optimal in minimizing communication cost, in either communication steps or the number of elements communicated. Another drawback is that it is not straightforward to choose the optimal owner when several processors own the same number of array references.

For example, in the above loop, if P2 is selected as the executing processor, only three communications are required (X(ix(i0)): P0 -> P2, and scattering tmp_X and tmp_Y to the owners of X(ix(i0)) and Y(iy(i0))). Shown in Example 2 is a more complicated irregular loop, a simplified version extracted from the ZEUS-2D code [17]:

Example 2

      DO 10 t = 1, time_step
C       Outer loop takes the execution times
C       of irregular loop
        DO 100 i = 1, N
S1:       X(j2(i)) = X(j1(i))
S2:       X(j4(i)) = X(j3(i))
S3:       Y(j1(i)) = Y(j1(i))+Z(j3(i))-X(j1(i))
S4:       Y(j4(i)) = Y(j3(i)) - X(j3(i))
100     CONTINUE
        ......
10    CONTINUE

We assume this loop is executed on 4 processors in parallel. If array element Y(i) is aligned with X(i) in the initial distribution, Y(j1(i)) is clearly distributed onto the same processor as X(j1(i)). So we can assume that P0, P1, P2, and P3 own [X(j1(i)), Y(j1(i))], [X(j2(i))], [X(j3(i)), Y(j3(i)), Z(j3(i))], and [X(j4(i)), Y(j4(i))], respectively, for iteration i. According to the almost owner computes rule, this loop iteration would be partitioned to P2 because it owns the majority of the data elements. The communication would be:

Import communication before the loop iteration is executed:

1. X(j1(i)), Y(j1(i)): P0 -> P2

Export communication after the loop iteration is executed:

1. tmp_Yj1: P2 -> P0
2. tmp_Xj2: P2 -> P1
3. tmp_Xj4, tmp_Yj4: P2 -> P3

However, if we consider the communication overhead when the iteration is instead partitioned to P0, we obtain the following communication pattern:

Import communication before the loop iteration is executed:

1. X(j3(i)), Y(j3(i)), Z(j3(i)): P2 -> P0

Export communication after the loop iteration is executed:

1. tmp_Xj2: P0 -> P1
2. tmp_Xj4, tmp_Yj4: P0 -> P3

Although the number of elements to be communicated is 6, the same as before, the number of communication steps is reduced from four to three. This improvement matters when the outer sequential time-step loop is long. This illustrates that the almost owner computes rule is not always an optimal scheme for guiding the partition of loop iterations.

Based on the above observation, we propose a more efficient computes rule for irregular loop partitioning. This approach partitions an iteration onto a processor such that executing the iteration on that processor ensures that

- the number of communication steps is minimum, and

- the total amount of data to be communicated is minimum.

3 Efficient Loop Iteration Partitioning

In this section, we give the strategy of loop iteration partitioning for irregular codes according to the least communication computes rule. Unlike the owner computes rule, the whole loop body of a loop to be parallelized is processed. Suppose that all arrays, including data arrays and index arrays, are initially distributed as BLOCK. The communication pattern of a partitioned loop iteration on a processor can be represented as a directed graph G = (V, E), called the communication pattern graph (CPG), where the set of nodes V consists of all pre-execution nodes, an execution node, and all post-execution nodes for the processors. An edge from a pre-execution node Pi to the execution node Pj (i ≠ j) represents that if an iteration is executed on processor Pj, processor Pi has to send data to Pj. The weight of the edge is the number of data elements to be communicated. Similarly, an edge from the execution node to a post-execution node represents that results of executing the iteration are returned to that processor. Figure 2 shows the communication pattern graphs for the loop of Example 2.

Iteration partitioning according to the least communication computes rule assigns iterations to processors such that the total number of communication (import and export) steps and the total message length are minimized. Before presenting the loop partitioning algorithms, we give the following definitions.

Definition 1 A loop-independent dependence exists in a loop body (block) when two statements reference the same memory location within a single iteration of all their common loops. In this paper, we pay attention to the following three kinds of dependence:

1. true dependence: the first statement stores into a location that is later read by the second statement:

   S1   A(ia(i)) = ...
   S2   ...      = A(ia(i))

2. output dependence: both statements write into the same location:

   S1   A(ia(i)) = ...
   S2   A(ia(i)) = ...

3. input dependence: both statements use the same location:

   S1   ...      = A(ia(i))
   S2   ...      = A(ia(i))

Definition 2 Let an iteration i0 be partitioned onto processor Pq. The set D(i0, Pj) is defined as all the data array elements which must be sent from Pj to Pq before the iteration is executed; |D(i0, Pj)| is the number of data elements in the set. Similarly, D'(i0, Pj) is defined as all the data array elements which must be sent back to Pj from Pq after the iteration is executed.

Definition 3 Let an iteration i0 be partitioned onto processor Pq. The set Import(Pq) is defined as the set of processors Pj1, Pj2, ..., Pjn which have to send data to processor Pq before the iteration is executed. Each processor Pjk, 1 <= k <= n, has a degree |D(i0, Pjk)|. If there is no need to import data when executing the iteration on Pq, then Import(Pq) = ∅. Similarly, the set Export(Pq) is defined as the set of processors Pj1, Pj2, ..., Pjm which have to receive data from processor Pq after the iteration is executed. Each processor Pjk, 1 <= k <= m, has a degree |D'(i0, Pjk)|. If there is no need to export data after executing the iteration on Pq, then Export(Pq) = ∅.

[Figure 2 appears here. Each of its four panels shows pre-execution nodes P0-P3, one execution node, and post-execution nodes P0-P3; edge weights give the number of elements communicated. From the analysis in the text: (a) executing on P0: import P2 -> P0 (3); export P0 -> P1 (1), P0 -> P3 (2); (b) executing on P1: import P0 -> P1 (2), P2 -> P1 (3); export P1 -> P0 (1), P1 -> P3 (2); (c) executing on P2: import P0 -> P2 (2); export P2 -> P0 (1), P2 -> P1 (1), P2 -> P3 (2); (d) executing on P3: import P0 -> P3 (2), P2 -> P3 (3); export P3 -> P0 (1), P3 -> P1 (1).]

Figure 2. Communication pattern graphs of the loop in Example 2. (a), (b), (c), and (d) are the required communications when the loop iteration is executed on processors P0, P1, P2, and P3, respectively.

Example 3

For Example 1, suppose that the sizes of arrays X, Y, Z and ix, iy, iz are all 12, the number of processors p = 3, all data arrays and index arrays are distributed with BLOCK, and the values of the index arrays are as follows:

           P0            P1            P2
i      1  2  3  4  |  5  6  7  8  |  9 10 11 12
ix     5  7  9 11  |  1  3  2  4  |  6  8 10 12
iy     4  4  3  3  |  1  6  9 10  |  2  4  6  8
iz     4  5  6  7  |  8 10 12  2  |  1  3  9 11

We want to partition iteration 5. The data array elements referenced by this iteration in the loop body are X(1), Y(1), and Z(8). Then, when the iteration is executed on P0, D(5, P1) = {Z(8)}, D(5, P2) = ∅, D'(5, P1) = {Z(8)}, D'(5, P2) = ∅, and so Import(P0) = Export(P0) = {P1}; similarly, Import(P1) = Export(P1) = {P0} and Import(P2) = Export(P2) = {P0, P1}.

Definition 4 The degrees of the sets Import(Pq) and Export(Pq), written deg(Import(Pq)) and deg(Export(Pq)), are defined as

    deg(Import(Pq)) = sum of |D(i0, Pj)|  over all Pj ∈ Import(Pq),

    deg(Export(Pq)) = sum of |D'(i0, Pj)| over all Pj ∈ Export(Pq).

From the above definitions, we have the following proposition.

Proposition 1 The least communication computes rule partitions an iteration to a processor Pq such that

1. |Import(Pq)| + |Export(Pq)| = min over all Pj ∈ P of (|Import(Pj)| + |Export(Pj)|), where P is the processor set.

2. If more than one processor, say Pq1, Pq2, ..., Pqk, satisfies the above formula, then select the Pq among them such that deg(Import(Pq)) + deg(Export(Pq)) = min over Pj ∈ {Pq1, ..., Pqk} of (deg(Import(Pj)) + deg(Export(Pj))).

In the following algorithms, we assume that a loop body is composed of m statements S1, S2, ..., Sm, and each Sk has one left-hand array element lk and n right-hand array elements r1, r2, ..., rn. DEF(Sk) and USE(Sk) represent the define-variable set and the use-variable set of statement Sk, respectively. We also write x ∈ Pj if a datum x is distributed onto processor Pj. The algorithms for computing D(i0, Pj) and D'(i0, Pj) are as follows.

Algorithm 1 D(i0, Pj)
Input: iteration i0, processor Pj, and the iteration block (S1, S2, ..., Sm)
Output: D(i0, Pj)

D(i0, Pj) = ∅;
for k = 1, m
    for each r ∈ USE(Sk)
        if r ∉ DEF(S1) ∪ ... ∪ DEF(Sk-1) and
           r ∉ USE(S1) ∪ ... ∪ USE(Sk-1) and r ∈ Pj
        // there is no true dependence between r and the left-hand
        // variables of the statements before Sk, and no input
        // dependence between r and the right-hand variables of
        // the statements before Sk
        then
            if r ∉ D(i0, Pj) then
                D(i0, Pj) = D(i0, Pj) ∪ {r}
            end if
        end if
    end for
end for

Algorithm 2 D'(i0, Pj)
Input: iteration i0, processor Pj, and the iteration block (S1, S2, ..., Sm)
Output: D'(i0, Pj)

D'(i0, Pj) = ∅;
for k = m, 1, -1
    for each l ∈ DEF(Sk)
        if l ∉ DEF(Sk+1) ∪ ... ∪ DEF(Sm) and l ∈ Pj
        // there is no output dependence between l and the
        // left-hand variables (lk+1, ..., lm) of the statements
        // below Sk
        then
            if l ∉ D'(i0, Pj) then
                D'(i0, Pj) = D'(i0, Pj) ∪ {l}
            end if
        end if
    end for
end for

Algorithm 3 computes the set Import(Pq); the computation of Export(Pq) is similar. Finally, Algorithm 4 determines the processor onto which a loop iteration is partitioned; a concrete sketch for Example 1 follows the algorithms.

Algorithm 3 Import(Pq)
Input: processor Pq, and D(i0, Pj) for all Pj ∈ P
Output: Import(Pq)

Import(Pq) = ∅;
for j = 0, p-1, j ≠ q
    if D(i0, Pj) ≠ ∅ then
        if Pj ∉ Import(Pq) then
            Import(Pq) = Import(Pq) ∪ {Pj}
        end if
        deg(Pj) = |D(i0, Pj)|
    end if
end for

Algorithm 4 Partition(Import, Export, i0)
Input: Import(Pj) and Export(Pj) for all processors Pj, 0 <= j < p
Output: the processor Pq that executes iteration i0

get Pq1, ..., Pqk such that for each of them
    |Import(Pqi)| + |Export(Pqi)| =
        min over all Pj ∈ P of (|Import(Pj)| + |Export(Pj)|);
select Pq ∈ {Pq1, ..., Pqk} such that
    deg(Import(Pq)) + deg(Export(Pq)) =
        min over Pj ∈ {Pq1, ..., Pqk} of
            (deg(Import(Pj)) + deg(Export(Pj)))

4 Global-local Array Transformation and Index Array Remapping

There are two approaches to generating SPMD irregular code after loop partitioning. One is to receive the required data every time before an iteration is executed, and to send the changed values (originally resident on other processors) back after the iteration is executed. The other is to gather all remote data from other processors for all iterations executed on this processor, and to scatter all changed remote values to other processors after all iterations are executed. Because message aggregation is a main means of communication optimization, the latter is obviously better for communication performance. In order to perform communication aggregation, this section discusses redistribution of indirection arrays and the local SPMD program style.


[Figure 3 appears here. It shows the index array elements moved between processors, src_iter(Pq) ∩ iter(Pj): P0 -> P1 = {2, 3, 4}, P0 -> P2 = { }, P1 -> P0 = {5, 8}, P1 -> P2 = {6, 7}, P2 -> P0 = {9, 10}, P2 -> P1 = { }, and the remapped arrays

              P0             P1           P2
iter     1  5  8  9 10  |  2  3  4  |  6  7 11 12
ix       5  1  4  6  8  |  7  9 11  |  3  2 10 12
iy       4  1 10  2  4  |  4  3  3  |  6  9  6  8
iz       4  8  2  1  3  |  5  6  7  | 10 12  9 11 ]

Figure 3. Subscript computations of index array redistribution in Example 1.

4.1 Index Array Redistribution

After loop partitioning analysis, all iterations have been assigned to processors: iter(P0) = {i0,1, i0,2, ..., i0,k0}, ..., iter(Pp-1) = {ip-1,1, ip-1,2, ..., ip-1,kp-1}. If an iteration i0 is partitioned to processor Pq, the index array elements ix(i0), iy(i0), ... are not necessarily resident on Pq. Therefore, we need to redistribute all index arrays so that for iter(Pq) = {iq,1, iq,2, ..., iq,kq} and every index array ix, the elements ix(iq,1), ..., ix(iq,kq) are locally accessible.

As mentioned above, the source BLOCK partition scheme for processor Pq is src_iter(Pq) = {q*(N/p)+1, ..., (q+1)*(N/p)}, where N is the number of loop iterations. Then, in a redistribution, the index array elements that need to be communicated from Pq to Pj are indicated by src_iter(Pq) ∩ iter(Pj). Back to Example 1, with the data of Example 3 (data and index arrays of size 12), loop partitioning analysis yields iter(P0) = {1, 5, 8, 9, 10}, iter(P1) = {2, 3, 4}, and iter(P2) = {6, 7, 11, 12}. The index array elements to be redistributed are shown in Figure 3.
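This subscript computation is simple enough to sketch as code. The program below (our illustration; the array name ASGN is hypothetical) stores the Figure 3 iteration assignment and prints, for every loop position i, the index-array movement implied by src_iter ∩ iter; the source processor is the BLOCK owner (i-1)/(N/p) of position i.

      PROGRAM REMAP
C     Redistribution subscript computation for Example 3 /
C     Figure 3: ASGN(I) is the processor executing iteration I.
      INTEGER N, NP
      PARAMETER (N = 12, NP = 3)
      INTEGER ASGN(N), I, SRC, DST
      DATA ASGN /0, 1, 1, 1, 0, 2, 2, 0, 0, 0, 2, 2/
      DO 10 I = 1, N
         SRC = (I-1) / (N/NP)
         DST = ASGN(I)
         IF (SRC .NE. DST) THEN
            WRITE (*,*) 'ix(', I, '), iy(', I, '), iz(', I,
     &                  ') move from P', SRC, ' to P', DST
         END IF
10    CONTINUE
      END

Its output reproduces the sets shown in Figure 3, e.g., positions 2, 3, and 4 moving from P0 to P1, and positions 5 and 8 moving from P1 to P0.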

4.2 Communication Scheduling for Redistribution

A redistribution routine can be divided into two parts: subscript computation and interprocessor communication. If there is no communication scheduling in a redistribution routine, communication contention may occur, which increases the communication waiting time. Back to Example 1, assume there are 5 processors in the system executing the loop. In one scenario, the source processors and the corresponding destination processors to which they have to send index array elements are as shown in Figure 4(a). If no scheduling is considered, the communication steps can be organized as in Figure 4(b), where the weight above an arrow is the number of index elements. Clearly, in some communication steps, several processors send messages to the same destination processor. This leads to node contention, and node contention results in overall performance degradation [6, 7]. The scheduling shown in Figure 4(c) avoids this contention.

Another consideration is to align messages so that the message sizes within a step are as close as possible, since the cost of a step is likely to be dictated by the length of the longest message exchanged during the step. Figure 4(d) represents such an alignment. We use a sparse matrix M to represent the communication pattern between the source and target processor sets, where M(i, j) = m expresses that sending processor Pi should send a message of size m to receiving processor Pj. The following matrices describe the redistribution pattern of Figure 4(a) and the scheduling pattern of Figure 4(d).

        [  0   3   0  10   3 ]           [ 3  1  4  - ]
        [ 10   0   2   1   2 ]           [ 0  4  2  3 ]
    M = [  2   2   0   3   9 ]      CS = [ 4  3  0  1 ]
        [ 10   0   2   0   0 ]           [ 2  0  -  - ]
        [  0   4   3   3   0 ]           [ 1  2  3  - ]

(In M, row i and column j correspond to sending processor Pi-1 and receiving processor Pj-1; in CS, the columns are the communication steps 1 to 4.)

We have proposed an algorithm [6] that accepts M as input and generates a communication scheduling table CS expressing the scheduling result, where CS(i, s) = j means that sending processor Pi sends its message to receiving processor Pj at communication step s. The above matrix CS represents the scheduling pattern shown in Figure 4(d), where "-" indicates that there is no interprocessor communication at that step. The scheduling, illustrated by the sketch below, satisfies:

- there is no node contention in any communication step; and

- the sum over steps of the longest message length in each step is minimum.
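The scheduling algorithm itself appears in [6] and is not reproduced here. The sketch below (ours) illustrates only the contention-freedom requirement with a simple greedy strategy: repeatedly take the longest unscheduled message and place it in the earliest step where both its sender and its receiver are free. On the matrix M above, this greedy pass happens to produce the four-step schedule shown in CS, with step maxima 10, 10, 3, and 2.

      PROGRAM SCHED
C     Greedy sketch of contention-free scheduling. M(I,J) is the
C     message length from sender P(I-1) to receiver P(J-1);
C     CS(I,S) is the receiver (0-based) served by sender P(I-1)
C     at step S, or -1 if the sender is idle.
      INTEGER NP, NS
      PARAMETER (NP = 5, NS = 8)
      INTEGER M(NP,NP), CS(NP,NS), SB(NP,NS), RB(NP,NS)
      INTEGER I, J, S, BI, BJ, BIG
C     the redistribution pattern of Figure 4(a), column by column
      DATA M /0, 10, 2, 10, 0,
     &        3, 0, 2, 0, 4,
     &        0, 2, 0, 2, 3,
     &        10, 1, 3, 0, 3,
     &        3, 2, 9, 0, 0/
      DO 10 S = 1, NS
         DO 10 I = 1, NP
            CS(I,S) = -1
            SB(I,S) = 0
            RB(I,S) = 0
10    CONTINUE
20    CONTINUE
C     pick the longest remaining message
      BIG = 0
      DO 30 I = 1, NP
         DO 30 J = 1, NP
            IF (M(I,J) .GT. BIG) THEN
               BIG = M(I,J)
               BI = I
               BJ = J
            END IF
30    CONTINUE
      IF (BIG .EQ. 0) GO TO 50
C     place it in the earliest step where sender and receiver
C     are both still free, so no node contention can arise
      DO 40 S = 1, NS
         IF (SB(BI,S) .EQ. 0 .AND. RB(BJ,S) .EQ. 0) THEN
            CS(BI,S) = BJ - 1
            SB(BI,S) = 1
            RB(BJ,S) = 1
            M(BI,BJ) = 0
            GO TO 20
         END IF
40    CONTINUE
50    CONTINUE
      DO 60 S = 1, NS
         WRITE (*,*) 'step', S, ':', (CS(I,S), I = 1, NP)
60    CONTINUE
      END

The greedy pass guarantees contention freedom by construction; the length-alignment objective of [6] is what the greedy strategy does not guarantee in general.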

4.3 Node Program

After the index array redistribution is completed, we can develop the node program, which has three parts: pre-execution import communication (gathering phase), irregular loop execution (executing phase), and post-execution export communication (scattering phase). In the node program, D(Pj) (resp. D'(Pj)) denotes all the data that must be sent to Pj from Pq (the current executing processor) before (resp. after) loop

[Figure 4 appears here. Panel (a) lists each source processor and its destination set: P0: {P1, P3, P4}; P1: {P0, P2, P3, P4}; P2: {P0, P1, P3, P4}; P3: {P0, P2}; P4: {P1, P2, P3}. Panels (b)-(d) show the resulting communication steps: unscheduled with node contention in (b), contention-free in (c), and contention-free with message-length alignment in (d).]

Figure 4. Communication scheduling in the redistribution routine. (a) Source processors and their corresponding target processor sets; (b) communication steps if no scheduling is considered; (c) node contention-free scheduling; and (d) contention-free and message-length-aligned scheduling.

execution. local_i denotes the local loop index, and kq is the number of iterations partitioned onto Pq. Pq's node program is as follows (an MPI sketch of the gathering phase appears after the pseudocode):

// pre-communicating required elements with other processors
for each Pj ∈ Import(Pq)
    receive data from Pj;
end for
for j = 0, p-1, j ≠ q
    if Pq ∈ Import(Pj) then
        buffer the data D(Pj) = D(i1, Pq) ∪ ... ∪ D(ikj, Pq);
        send it to Pj;
    end if
end for
// executing the local iterations
for local_i = 1, kq
    S1(local_i); S2(local_i); ... ; Sm(local_i);
end for
// post-communicating changed remote elements with other processors
for each Pj ∈ Export(Pq)
    buffer the data D'(Pj) = D'(i1, Pq) ∪ ... ∪ D'(ikj, Pq);
    send the changed data to Pj;
end for
for j = 0, p-1, j ≠ q
    if Pq ∈ Export(Pj) then
        receive the changed data from Pj;
    end if
end for
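In MPI terms, the gathering phase can be sketched as follows (a hypothetical skeleton of ours, not the paper's generated code; the partner lists GS/GR and the buffer packing are placeholders that the preprocessing of Section 3 would fill in):

      PROGRAM NODEPG
C     Skeleton of Pq's three-phase node program. GS/GSL list the
C     processors that import data owned by this node (and message
C     lengths); GR/GRL list the processors in Import(Pme). Here
C     the lists are left empty so the skeleton runs as is; a real
C     inspector would also pack SBUF per partner with offsets.
      INCLUDE 'mpif.h'
      INTEGER MAXP, MAXB
      PARAMETER (MAXP = 32, MAXB = 1024)
      INTEGER ME, NP, IERR, K
      INTEGER STATUS(MPI_STATUS_SIZE), STATS(MPI_STATUS_SIZE,MAXP)
      INTEGER REQ(MAXP)
      INTEGER NGR, GR(MAXP), GRL(MAXP), NGS, GS(MAXP), GSL(MAXP)
      DOUBLE PRECISION SBUF(MAXB), RBUF(MAXB)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, ME, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NP, IERR)
      NGS = 0
      NGR = 0
C     gathering phase: non-blocking sends avoid deadlock when the
C     send/receive partner lists form cycles among processors
      DO 10 K = 1, NGS
         CALL MPI_ISEND(SBUF, GSL(K), MPI_DOUBLE_PRECISION, GS(K),
     &                  1, MPI_COMM_WORLD, REQ(K), IERR)
10    CONTINUE
      DO 20 K = 1, NGR
         CALL MPI_RECV(RBUF, GRL(K), MPI_DOUBLE_PRECISION, GR(K),
     &                 1, MPI_COMM_WORLD, STATUS, IERR)
20    CONTINUE
      IF (NGS .GT. 0) CALL MPI_WAITALL(NGS, REQ, STATS, IERR)
C     executing phase: run S1..Sm over the local iterations using
C     the remapped index arrays of Section 4.1 (omitted)
C     scattering phase: symmetric to the gathering phase, with
C     the Export sets in place of the Import sets (omitted)
      CALL MPI_FINALIZE(IERR)
      END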

5 Experiments and Performance Results

We now present experimental results to show the efficacy of the methods presented so far. We measure the difference made by using the owner computes rule, the almost owner computes rule, and our least communication computes rule in an experimental program. All experiments were run on a 24-node SGI Origin 2000 parallel computer.

We select an irregular kernel of a fluid dynamics code, ZEUS-2D, for our study. ZEUS-2D is a computational fluid dynamics code developed at the Laboratory for Computational Astrophysics (NCSA, University of Illinois at Urbana-Champaign) for astrophysical radiation magnetohydrodynamics problems [17]. ZEUS-2D solves problems in one or two spatial dimensions with a wide variety of boundary conditions. The C language preprocessor allows


Table 3. Load balance rate when element values of index arrays are a one-to-one mapping with the loop variable.

                                            Num. of processors
Computes Rule                          4     8     12    16    20    24
Owner Computes Rule                   1.00  1.00  1.04  1.12  1.00  1.60
Almost Owner Computes Rule            1.08  1.06  1.14  1.08  1.25  1.34
Least Communication Computes Rule     1.21  1.16  1.07  1.28  1.30  1.14

Table 4. Load balance rate when element values of index arrays are a many-to-one mapping with the loop variable.

                                            Num. of processors
Computes Rule                          4     8     12    16    20    24
Owner Computes Rule                   1.34  1.29  1.31  1.19  1.17  1.58
Almost Owner Computes Rule            1.22  1.31  1.09  1.24  1.08  1.32
Least Communication Computes Rule     1.32  1.25  1.47  1.16  1.17  1.35

the user to define various macros to customize the ZEUS-2D algorithm for the desired physics, geometry, and output. ZEUS-2D solves the equations of ideal (non-resistive), non-relativistic hydrodynamics, including radiation transport, (frozen-in) magnetic fields, rotation, and self-gravity. Boundary conditions may be specified as reflecting, periodic, inflow, or outflow. The kernel irregular subroutine X2INTZC includes some loops similar in appearance to Example 2. We specify the geometry as Cartesian XY and the grid as 800 by 2 uniformly spaced zones, and extend the irregular loop iterations to 1000. The node programs are written in Fortran, using the MPI communication library and the system call gettimeofday() for measuring execution time.

Since the execution time of a parallel program depends on the maximum of the execution times of all the node programs, we must partition the loop iterations as evenly as possible. One concern is whether iteration partitioning can keep the load balanced when the least communication computes rule is applied. We first examine the load balance rate for several inputs of index arrays. The load balance rate is calculated by

    load balance rate = max over 0 <= q < p of |iter(Pq)| / (N/p),

where N is the total number of loop iterations.
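For instance, with the Example 3 partitioning iter(P0) = {1, 5, 8, 9, 10}, iter(P1) = {2, 3, 4}, and iter(P2) = {6, 7, 11, 12}, the rate would be max(5, 3, 4) / (12/3) = 1.25, whereas a perfectly balanced partition has rate 1.00.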

The first experiment designates the values of the index arrays as a linear function of the loop variable, i.e., the element values of an indirection array are a one-to-one mapping of the loop index. Because the indirection arrays are initially distributed as BLOCK, the owner computes rule has good load balance if the number of iterations is a multiple of the number of processors. Table 3 summarizes our analysis of the load balance rate. The owner computes rule, almost owner computes rule, and least communication computes rule are compared on 4, 8, 12, 16, 20, and 24 processors.

Table 4 shows the load balance rate when the element values of an index array form a many-to-one mapping with the loop variable. Since the same data array element may be accessed in several iterations, the owner computes rule no longer balances the load well; for the almost owner computes rule and the least communication computes rule, the load balance rates do not change much compared with Table 3.

In Figure 5, we show the performance difference for X2INTZC obtained by using the three compute rules. Performance of the different versions of the code is measured from 2 to 24 processors. The curves marked owner computes, almost computes, and least comm. are the versions of the code in which the loop is partitioned using the owner computes rule, the almost owner computes rule, and our least communication computes rule, respectively. The figure shows that our method performs well in most cases. On small numbers of processors, our method is not better than the other computes rules. This is because the total communication time is so small that the communication optimization makes no significant difference; another reason is that the load balance rate of our method is worse in some cases. However, the overall performance difference is small. When the same data is distributed over a larger number of processors, the communication time becomes a significant part of the total execution time, and the communication optimization makes a significant difference in the overall performance of the program.

Figure 5. Effect of the least communication computes rule in executing ZEUS-2D irregular loops (execution time).

In Figure 6, we further study the impact of the different versions on communication statements. Only the communication time is shown for the various versions of the code. Communication optimizations in our method include message aggregation before and after loop execution, index array redistribution, and communication scheduling. When the number of processors is large, our method leads to substantial improvement in the performance of the code, because the communication time then significantly influences the total performance of the parallel program.

Figure 6. Communication time for the owner computes rule, almost owner computes rule, and least communication computes rule in executing ZEUS-2D irregular loops.

6 Conclusions

The efficiency of loop partitioning considerably influences the performance of a parallel program. For automatically parallelizing irregular scientific codes, the owner computes rule is not suitable for partitioning irregular loops. In this paper, we have presented an efficient loop partitioning approach to reduce communication cost. In our approach, runtime preprocessing is used to determine the communication required between the processors. We have developed algorithms for performing these communication optimizations. We have also presented how indirection arrays are redistributed once it is determined which loop iterations execute on which processors.

We have completed a preliminary implementation of the schemes presented in this paper, and the experimental results demonstrate their efficacy. However, we performed only an intra-procedural analysis to optimize communication, while many irregular applications contain inter-procedural irregular loops; as future work, we will extend our approach to inter-procedural irregular loop partitioning.

References

[1] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers, 2001.


[2] G. Agrawal and J. Saltz. Interprocedural compilation of irregular applications for distributed memory machines. Languages and Compilers for Parallel Computing, pp. 1-16, August 1994.

[3] R. Asenjo, E. Gutierrez, Y. Lin, D. Padua, B. Pottenger, and E. L. Zapata. On the automatic parallelization of sparse and irregular Fortran codes. Technical Report 1512, University of Illinois at Urbana-Champaign, CSRD, December 1996.

[4] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. Journal of Parallel and Distributed Computing, 22(3):462-479, September 1994.

[5] C. Ding and K. Kennedy. Improving cache performance of dynamic applications with computation and data layout transformations. In Proceedings of the SIGPLAN'99 Conference on Programming Language Design and Implementation, Atlanta, GA, May 1999.

[6] M. Guo, I. Nakata, and Y. Yamashita. Contention-free communication scheduling for array redistribution. Parallel Computing, 26(2000), pp. 1325-1343, 2000.

[7] M. Guo and I. Nakata. A framework for efficient array redistribution on distributed memory machines. The Journal of Supercomputing, Vol. 20, No. 3, pp. 253-265, 2001.

[8] M. Guo, Y. Pan, and C. Liu. Symbolic communication set generation for irregular parallel applications. To appear in The Journal of Supercomputing, 2002.

[9] E. Gutierrez, O. Plata, and E. L. Zapata. On automatic parallelization of irregular reductions on scalable shared memory systems. In Proceedings of the Fifth International Euro-Par Conference, pp. 422-429, Toulouse, France, August-September 1999.

[10] E. Gutierrez, R. Asenjo, O. Plata, and E. L. Zapata. Automatic parallelization of irregular applications. Parallel Computing, 26(2000), pp. 1709-1738, 2000.

[11] H. Han and C.-W. Tseng. Improving compiler and runtime support for adaptive irregular codes. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Paris, France, October 1998.

[12] Y. Hu, A. Cox, and W. Zwaenepoel. Improving fine-grained irregular shared-memory benchmarks by data reordering. In Proceedings of SC'00, Dallas, TX, November 2000.

[13] Y. Hu, S. L. Johnsson, and S.-H. Teng. High Performance Fortran for highly irregular problems. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Las Vegas, NV, June 1997.

[14] Y.-S. Hwang, B. Moon, S. D. Sharma, R. Ponnusamy, R. Das, and J. Saltz. Runtime and language support for compiling adaptive irregular programs on distributed memory machines. Software-Practice and Experience, Vol. 25(6), pp. 597-621, 1995.

[15] J. Mellor-Crummey, D. Whalley, and K. Kennedy. Improving memory hierarchy performance for irregular applications. In Proceedings of the 1999 ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.

[16] N. Mitchell, L. Carter, and J. Ferrante. Localizing non-affine array references. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Newport Beach, CA, October 1999.

[17] J. M. Stone and M. Norman. ZEUS-2D: A radiation magnetohydrodynamics code for astrophysical flows in two space dimensions: The hydrodynamic algorithms and tests. Astrophysical Journal Supplement Series, Vol. 80, pp. 753-790, 1992.

[18] R. Ponnusamy, Y.-S. Hwang, R. Das, J. Saltz, A. Choudhary, and G. Fox. Supporting irregular distributions in Fortran D/HPF compilers. Technical Report CS-TR-3268, University of Maryland, Department of Computer Science, 1994.

[19] R. Ponnusamy, J. Saltz, A. Choudhary, S. Hwang, and G. Fox. Runtime support and compilation methods for user-specified data distributions. IEEE Transactions on Parallel and Distributed Systems, 6(8), pp. 815-831, 1995.

[20] M. Ujaldon, E. L. Zapata, B. M. Chapman, and H. P. Zima. Vienna-Fortran/HPF extensions for sparse and irregular problems and their compilation. IEEE Transactions on Parallel and Distributed Systems, 8(10), pp. 1068-1083, October 1997.
