An HPF compiler for the IBM SP2



An HPF Compiler for the IBM SP2

Manish Gupta*, Sam Midkiff*, Edith Schonberg*, Ven Seshadri†, David Shields*, Ko-Yang Wang*, Wai-Mee Ching*, Ton Ngo*

* IBM T.J. Watson Research, P.O. Box 704, Yorktown Heights, NY 10598. The authors can be reached by e-mail at {mgupta,midkiff,schnbrg,shields,kyw,ching,[email protected]. The corresponding author can be reached at [email protected].
† IBM Software Solutions Division, 1150 Eglinton Ave. East, North York, Ontario, CANADA M3C 1V7. The author can be reached by e-mail at [email protected].

Abstract

We describe pHPF, a research prototype HPF compiler for the IBM SP series parallel machines. The compiler accepts as input Fortran 90 and Fortran 77 programs, augmented with HPF directives; sequential loops are automatically parallelized. The compiler supports symbolic analysis of expressions. This allows parameters such as the number of processors to be unknown at compile-time without significantly affecting performance. Communication schedules and computation guards are generated in a parameterized form at compile-time. Several novel optimizations and improved versions of well-known optimizations have been implemented in pHPF to exploit parallelism and reduce communication costs. These optimizations include elimination of redundant communication using data-availability analysis; using collective communication; new techniques for mapping scalar variables; coarse-grain wavefronting; and communication reduction in multi-dimensional shift communications. We present experimental results for some well-known benchmark routines. The results show the effectiveness of the compiler in generating efficient code for HPF programs.

1 Introduction

Fortran has always been synonymous with fast execution. High Performance Fortran (HPF) [13, 8] defines a set of directive extensions to Fortran to facilitate performance portability of Fortran programs when compiling for large-scale, multiprocessor architectures, while preserving a shared-address space programming model. Unfortunately, HPF compilers have not appeared as rapidly as originally had been hoped, and it is now accepted that high quality compilers for HPF will have to evolve over time and with experience. Some of the complexities of compiling HPF result from the following:

- The ability to perform communication optimizations is essential for high performance. A single inner-loop communication can result in a significant loss of performance. For HPF performance to approach hand-coded performance, more and more sophisticated optimizations will have to be implemented.

- The implementation of every feature of Fortran is affected by the distribution of data across different address spaces. Fortran data is primarily static, but HPF data must be dynamically allocated, since arrays can be redistributed. Even without redistribution, the number of processors is, in general, not known at compile-time, so that the sizes of local array partitions are not known statically. Thus, in addition to loop parallelization, HPF requires an enhanced run-time model, and new implementations of such features as common blocks, data statements, block data, and array addressing.

- HPF is an extremely high-level language, in that the amount of generated/executed code per source line is generally larger than for other common programming languages. This poses a challenge for code generation, run-time library design, and tool development.

- In addition to generating scalable code, an HPF compiler must generate code with good uniprocessor node performance. In particular, the HPF-specific transformations must not inhibit uniprocessor optimization, and the SPMD overhead must be small.

This paper describes pHPF, an HPF compiler for the IBM SP2 architecture that has been developed over the past two years as a prototype development project at IBM Research. pHPF consists of extensions for the HPF subset [13, 8] to an existing Fortran 90 [7] compiler. In this paper we focus on novel aspects, optimizations, and experience with a set of benchmarks. Although pHPF is still under development, all features described in this paper have been fully implemented.

Several HPF-related compiler efforts have been previously described [11, 23, 17, 19, 3, 15, 18]. Our compiler is unique in the following combination of features:

- pHPF exploits data parallelism from the Fortran 90 array language and also performs program analysis to effectively parallelize loops in Fortran 77 code.

- pHPF performs symbolic analysis to generate efficient code in the presence of statically unknown parameters like the number of processors and array sizes. Rather than leaving the task of determining the communication schedule to run-time, pHPF generates closed-form, parameterized expressions to represent communication data and sets of communicating processors. (pHPF currently supports block and cyclic, but not block-cyclic, distributions.)

- Along with well-known communication optimizations like message vectorization [11, 23] and exploiting collective communication [16, 9], pHPF also performs aggressive optimizations like eliminating redundant communication, reducing the number of messages for multi-dimensional shift communication, and coarse-grain wave-fronting. pHPF also uses novel techniques for mapping scalar variables to expose more parallelism and to reduce communication costs.

The rest of this paper is organized as follows. Section 2 presents an overview of the architecture of the entire compiler and describes the SPMDizer in more detail. Section 3 describes a set of optimizations performed by the SPMDizer, and Section 4 gives performance results for a set of benchmarks. Section 5 describes other compilers for HPF and similar languages that have been discussed in the literature. Finally, Section 6 presents conclusions and ideas for future work.

2 Compiler Framework

pHPF was implemented by incorporating a new SPMDizer component into an existing Fortran 90 compiler, as shown in Figure 1. Additionally, front-end extensions were required to process HPF directives. The impact of HPF on the other components of the Fortran 90 compiler was very small; only the scalarizer required significant modifications.

[Figure 1 (diagram): the compiler consists of a FORTRAN 90/HPF Frontend, an Array Language Scalarizer, the HPF SPMDizer, a Locality Optimizer, and an Optimizing Backend, supported by a Dataflow/Dependence Analyzer, a Loop Transformer, and the HPF Runtime.]

Figure 1: Architecture of the pHPF compiler

2.1 Architecture Overview

pHPF is a native compiler, not a preprocessor. While preprocessors are more portable and are easier to build, native compilers are generally preferable for obtaining highly-tuned performance. Also, it is easier to write application debuggers for native compilers than for preprocessors. The compilation process can be summarized by considering the output from each of the phases in Figure 1:

- The front end produces a high-level intermediate representation, which includes internal forms of the Fortran 90 array language and HPF directives.

- The scalarizer compiles the internal array language into internal Fortran 77 scalar form.

- The SPMDizer interprets the HPF directives, and produces an SPMD single-node program, in which data and computation have been partitioned and communication code resolves non-local data references.

- The locality optimizer performs loop reordering transformations on the uniprocessor program to better utilize cache and registers.

- The back end performs traditional optimizations on the uniprocessor program.

The scalarizer in-lines Fortran 90 intrinsic functions whenever possible. By in-lining intrinsic functions, it is possible to eliminate extra array temporaries and copying, and to achieve performance for Fortran 90 programs comparable to Fortran 77 programs. HPF extensions to the scalarizer ensure that information about parallelism implicit in a Fortran 90 construct (such as a reduction operator) is not lost during in-lining, by recording any additional information needed later. The scalarizer also assigns a suitable distribution to array temporaries created during scalarization.
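As a hand-written illustration (not pHPF output) of what in-lining an intrinsic means, the fragment below shows a Fortran 90 sum reduction and a scalarized loop form of the kind a scalarizer could produce; the paper does not show pHPF's actual internal representation, so this is only a sketch of the idea, analogous to the SPREAD example in Section 2.2.

      ! Hand-written sketch, not compiler output: the Fortran 90 intrinsic
      !     x = sum(A)
      ! can be in-lined as an explicit loop; the fact that the loop is a sum
      ! reduction is the kind of information the scalarizer records for the
      ! SPMDizer.
      program inline_sum
        implicit none
        real :: A(100), x
        integer :: i
        A = 1.0
        x = 0.0
        do i = 1, 100
          x = x + A(i)
        end do
        print *, 'sum =', x        ! prints 100.0
      end program inline_sum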

The placement of the SPMDizer after the scalarizer is significant in that the SPMDizer processes both Fortran 77 programs and scalarized Fortran 90 programs in a uniform manner. Similarly, the locality optimizer processes both SPMDized and uniprocessor input programs in a uniform way. This modularization of functionality simplifies the implementation without sacrificing performance. The SPMDizer places communication outside of loops when it is legal; the locality optimizer is subsequently able to reorder the inner, communication-free loops for better cache utilization. Furthermore, the SPMDized form of new loop bounds for parallel loops facilitates easy identification of conformable loops, to enable loop fusion transformations for improving data locality. For distributed arrays, the SPMDizer shrinks distributed dimensions, so that there are no address-space holes in the data that would otherwise reduce spatial locality.

The compiler includes data-flow analysis, data dependence analysis, and loop transformation modules, which enable a variety of optimizations performed by different compilation stages. For example, the scalarizer uses data dependence information to eliminate array temporaries, which are otherwise needed to preserve Fortran 90 array language semantics [7]. Data dependence analysis and loop transformations are also critical to locality and communication optimizations.

The structure of the SPMDizer is shown in Figure 2. We illustrate the compilation process, and in particular, the different phases of the SPMDizer, with the help of an example.

[Figure 2 (diagram): the HPF SPMDizer comprises Data Partitioning Preprocessing, Communication Analysis, Computation Partitioning, Communication Code Generation, and Data Partitioning Postprocessing, and uses the Dataflow/Dependence Analyzer and the Loop Transformer.]

Figure 2: Architecture of the pHPF SPMDizer

2.2 HPF Compilation Example

Figure 3 shows a simple Fortran 90 SPREAD program with HPF directives. Figure 4 shows the result of scalarizing this program. Naive scalarization of a SPREAD operation would result in the generation of a 2-dimensional temporary array and a run-time call to a library spread routine. The scalarizer generates more efficient in-line code for the SPREAD operation, without creating any new temporary variable.

      integer A(100)
      integer B(100,100)
!HPF$ DISTRIBUTE B(BLOCK,BLOCK)
!HPF$ ALIGN A(I) WITH B(1,I)
      B = spread(A,1,100)
      end

Figure 3: Fortran 90 Example with SPREAD

      do i_5=1,100,1
        do i_6=1,100,1
          B(i_6,i_5) = A(i_5)
        end do
      end do
      end

Figure 4: Scalarized SPREAD program

Figure 5 shows the pseudo-code output produced by the SPMDizer for the scalarized program shown in Figure 4. We use this example to illustrate each of the SPMDizer phases.

Data Partitioner. Each distributed array (which may be static or a common-block array) is transformed into a Fortran 90 pointer, which is allocated dynamically in the library routine hpf_allocate. Arrays are shrunk according to each processor's local block size, by adjusting the bounds of the local array within the global address space. For example, if there are 4 processors, the bounds of A on the second processor will be (26:50). Because we use Fortran 90 dynamic pointer allocation to allocate shrunk arrays, our SPMDizer does not need to perform global-to-local address remapping, unlike various HPF preprocessors [15, 18]. (Cyclic distributions, which require subscript modification, are an exception.)

The form of the output code is general enough to represent missing compile-time information. In this example, even though the global array bounds of A and B are given, the number of processors is not given, so that the processor-local block sizes and hence the local array bounds are not known at compile-time. Because we believe that number-of-processors, loop-bounds, and array-bounds information will commonly be unknown at compile-time, pHPF is designed so that important compilation parameters are treated as general expressions rather than constants, and storage for mapped arrays is always dynamically allocated, with minimal observed performance loss.

      integer, pointer :: A, B

! DATA PARTITIONING
      call hpf_get_numprocs(2,numprocs,pid)
      global_bounds(1) = 1
      global_bounds(2) = 100
      global_bounds(3) = 1
      global_bounds(4) = 100
      blocksize(1) = ((100 + numprocs(1)) - 1) / numprocs(1)
      blocksize(2) = ((100 + numprocs(2)) - 1) / numprocs(2)
      iown_lbound(1) = 1 + blocksize(1) * pid(1)
      iown_ubound(1) = blocksize(1) + iown_lbound(1) - 1
      iown_lbound(2) = 1 + blocksize(2) * pid(2)
      iown_ubound(2) = blocksize(2) + iown_lbound(2) - 1
      call hpf_allocate(B, global_bounds, blocksize ...)
      ...
      call hpf_allocate(A,...)

! COMMUNICATION
      cb_section(1) = iown_lbound(2)
      cb_section(2) = min0(iown_ubound(2),100)
      cb_section(3) = 1
      call hpf_allocate_computation_buffer(buffer,cb_section,...)
      if (pid(2) .le. 99 / blocksize(2) .and. pid(1) .le. 99 / blocksize(1)
     &      .or. pid(1) .eq. 0) then
        send_section(1) = iown_lbound(2)
        send_section(2) = min0(iown_ubound(2),100)
        send_section(3) = 1
        ....
        call hpf_bcast_section(A,send_section, buffer...)
      end if

! LOOPS SHRUNK BY COMPUTATION PARTITIONING
      do i_8=iown_lbound(2),min0(iown_ubound(2),100),1
        do i_9=iown_lbound(1),min0(iown_ubound(1),100),1
          B(i_9,i_8) = buffer(i_8)
        end do
      end do
      call deallocate(buffer)
      end

Figure 5: SPMDized SPREAD program
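To make the block-partitioning arithmetic of Figure 5 concrete, the hand-written sketch below evaluates the same blocksize/iown_lbound/iown_ubound formulas for one grid dimension of extent 100 on 4 processors; it reproduces the A(26:50) bounds quoted above for the second processor (pid = 1). It is an illustration of the arithmetic only, not generated code and not the pHPF run-time interface.

      ! Hand-written sketch: the block-distribution bound formulas from
      ! Figure 5, evaluated for a dimension of extent 100 on 4 processors.
      program block_bounds
        implicit none
        integer :: numprocs, pid, blocksize, lb, ub
        numprocs = 4
        blocksize = (100 + numprocs - 1) / numprocs    ! ceiling(100/4) = 25
        do pid = 0, numprocs - 1
          lb = 1 + blocksize * pid
          ub = min(blocksize + lb - 1, 100)            ! clip at the array bound
          print *, 'pid', pid, 'owns elements', lb, 'to', ub
        end do
      end program block_bounds

Running this, pid 1 reports elements 26 to 50, matching the example in the text.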

Communication Analysis and Generation. For each read reference in the program, the communication analysis phase determines whether it requires interprocessor communication. If communication is needed, the analysis determines (i) the placement of communication, and (ii) the communication primitive that will carry out the data movement in each processor grid dimension. pHPF extends the analysis described in [9] to handle scalars, and to incorporate additional possibilities regarding the distribution of array dimensions, namely, replication and mapping to a constant processor position. The code generator then identifies the processors and data elements that need to participate in communication, and generates calls to run-time communication routines, according to the communication requirements on different grid dimensions.

In Figure 4, A is assigned to each row of B. Since A is aligned with the first row of B, communication analysis selects a broadcast along the first dimension of the processor grid. Data and processors that participate in the communication are specified as array sections using global-address-space coordinates. The array section cb_section specifies the bounds of the computation buffer, which stores non-local data received from another processor. The array send_section specifies the processor-local section of A that is broadcast from each sending processor. Computation buffer bounds are also specified in the global address space.

Computation Partitioning. Computation partitioning shrinks loop bounds and introduces statement guards, where necessary, to distribute the computation among different processors. The owner-computes rule [11] is generally used, except for reduction computations and for assignments to scalars under conditions described in Section 3.

HPF Run-Time Library. The run-time library allocates partitioned arrays and performs local data movement and communication. It provides a high-level interface for the compiler to specify communication based on global indices of data (indices corresponding to arrays before data distribution) and processors. Both data and processors are represented as multi-dimensional sections, possibly with non-unit strides. Any buffer allocation, packing, and unpacking needed to linearize the data for communication is performed in the run-time routines. The run-time system also performs optimizations like overlapping unpack operations with non-blocking receives, when waiting for the completion of multiple receives. The run-time library is portable across different basic communication systems; currently it supports both the IBM MPL and MPI libraries. The run-time system also provides detailed performance statistics, trace information for debugging by hand, and trace generation for program visualization tools [12].

3 Optimizations

The SPMDizer of pHPF performs several optimizations to reduce both communication costs and the overhead of guards introduced by computation partitioning. Some of these optimizations are well-known and have been discussed in the literature [11, 23, 17, 19, 3], while many others are unique to pHPF.

Message Vectorization. Moving communication outside of loops to amortize the startup costs of sending messages has been recognized as an extremely important optimization. pHPF uses dependence information to determine the outermost loop level where communication can be placed. The applicability of this optimization is further improved by:

- Loop distribution: A preliminary analysis of communication and computation partitioning guards is used to guide selective loop distribution. pHPF uses data and control dependence information to first identify the program structure under maximal loop distribution, in the form of strongly connected components (SCCs) in the program dependence graph. Since unnecessary loop distribution can hurt cache locality, and can also hurt redundant-message elimination (which currently does not recognize redundant messages in different loop nests), pHPF identifies those SCCs where loop distribution is expected to improve performance. The SCCs identified are those with inner-loop communication that can be moved out with loop distribution, and those that have mutually different local iteration sets obtained by computation partitioning. In those cases, loop distribution reduces communication costs and the overhead of computation partitioning. For example, in Figure 6, the communication for the reference to D(i) is moved outside the i-loop as a result of loop distribution.

- Exploiting the INDEPENDENT directive: The compiler assumes there is no loop-carried dependence in loops marked independent by the programmer. This often allows communication to be moved outside the loops when static analysis information is imprecise.

Before loop distribution:

!HPF$ Align D(i) with A(i,1)
      do i = 1, n
        D(i) = D(i) + s*B(i)
        Communication for D(i)
        do j = 1, n
          A(i,j) = A(i,j) + D(i)
        enddo
      enddo

After loop distribution:

      do i = 1, n
        D(i) = D(i) + s*B(i)
      enddo
      Communication for D(i), i=1:n
      do i = 1, n
        do j = 1, n
          A(i,j) = A(i,j) + D(i)
        enddo
      enddo

Figure 6: Enabling of message vectorization by loop distribution

Collective Communication. pHPF uses techniques from [9] to identify the high-level pattern of collective data movement for a given reference. This information is used to recognize opportunities for using collective communication primitives like broadcast and reduction, and for generating efficient send-receive code for special communication patterns like shift. The effectiveness of this analysis is improved by:

- Idiom-recognition for reductions: Currently, pHPF recognizes sum, product, min and max reductions in Fortran 77 code. Since all Fortran 90 reduction operations are handled through inlining, information about reduction operations gathered by idiom-recognition and inlining is represented uniformly. Communication analysis exploits this information to generate parallel code for the local reduction, followed by a global reduction with efficient communication (a hand-written sketch of this local-plus-global pattern appears after this list).

- Symbolic analysis: Since data distribution parameters such as block size are often unknown (e.g. when the number of processors is not specified statically), the compiler uses symbolic analysis for operations like checking the equality of expressions and checking if one expression is an integral multiple of another expression. This enables pHPF to generate more efficient code, for example, by detecting the absence of communication through a symbolic test for the strictly synchronous property [9] between array references.
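The idiom-recognition bullet above describes the code shape produced for reductions: a local reduction over the data each processor owns, followed by a global combine. The hand-written sketch below expresses that same local-plus-global pattern directly in MPI (one of the message-passing layers the run-time library supports); it is our illustration of the pattern, not the pHPF run-time interface or compiler output, and the local block size is assumed.

      ! Hand-written sketch (not pHPF output): local reduction followed by a
      ! global combine, for a sum over a block-distributed array.
      program local_global_sum
        use mpi
        implicit none
        integer, parameter :: nlocal = 250       ! assumed size of this rank's block
        double precision :: a(nlocal), local_sum, global_sum
        integer :: rank, ierr, i
        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        a = 1.0d0                                ! stand-in for locally owned data
        local_sum = 0.0d0
        do i = 1, nlocal                         ! local reduction on owned elements
          local_sum = local_sum + a(i)
        end do
        call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                           MPI_SUM, MPI_COMM_WORLD, ierr)    ! global combine
        if (rank == 0) print *, 'global sum =', global_sum
        call MPI_Finalize(ierr)
      end program local_global_sum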

Elimination of Redundant Communication. A unique feature of pHPF is the analysis of the availability of data at processors other than the owners [10]. This enables detection of redundant communication, when the compiler can infer that data to be communicated is already available at the intended receivers due to prior communication(s). For example, in Figure 7, communication for the reference to A(i-1,n) in statement S2 is made redundant by communication for the reference to A(i-1,n) in S1. We have implemented a simplified version of the analysis presented in [10] to eliminate redundant communication, which finds redundant communication within single loop nests. An advantage of our analysis is that it is performed at a high level, and hence is largely independent of communication code generation. Table 1 summarizes results obtained by pHPF on some of the benchmark programs described in Section 4. The table shows static counts of the number of references requiring communication before and after the optimization to eliminate redundant communication, and shows that the compiler is quite successful in identifying redundant communication.

!HPF$ Align A(i,j) with B(i,j)
!HPF$ Align D(i) with B(i,1)
      do i = 2, n
        D(i) = F(A(i-1,n))        ! S1
        B(i,1) = F(A(i-1,n))      ! S2
      enddo

Figure 7: Example of redundant communication

    Program               Refs with comm.,           Refs with comm.,        % refs with
                          without redundancy elim.   with redundancy elim.   redundant comm.
    grid (block, block)             15                        11                  26.7
    tomcatv (*, block)              47                        35                  25.5
    ncar (block, block)             45                        30                  33.3
    ncar (*, block)                 25                        17                  32.0
    x42 (block, block)              33                        17                  48.5
    baro (*, block)                  3                         3                   0
    comp                            47                        34                  27.7
    cmslow                          44                        21                  52.3
    intba1                           3                         1                  66.7
    graph1                           0                         0                  N/A

Table 1: Results of optimization to eliminate redundant communication

Optimizing Nearest-Neighbor Shift Communication. The pHPF compiler employs several techniques to optimize shift communication, which occurs frequently in many scientific applications. The scalarizer optimizes the number of temporary arrays introduced to handle Fortran 90 shift operations.

!HPF$ Distribute (block,block) :: A, B
      do j = 2, n
        do i = 2, n
          A(i,j) = F(B(i-1,j), B(i-1,j-1))
        enddo
      enddo

Figure 8: Example of nearest-neighbor shift communication

!HPF$ Distribute A(block,block)
      do j = 2, n
        do i = 2, n
          A(i,j) = F(A(i-1,j), A(i,j-1))
        enddo
      enddo

Figure 9: Example of wavefront computation

The following optimizations are performed to further reduce communication costs (a plain-MPI sketch of the basic shift exchange itself appears after this list):

- Message coalescing: Consider the program segment in Figure 8. The communication for the two rhs references has a significant overlap, but neither completely covers the other. By augmenting the data being communicated for B(i-1,j-1) with the extra data needed for B(i-1,j), a separate communication for B(i-1,j) can be avoided. The pHPF communication analysis identifies the data movement for B(i-1,j) as a shift in the first dimension and internalized data movement (IDM) in the second dimension, and the data movement for B(i-1,j-1) as a shift in both dimensions. Recognizing the communication for B(i-1,j-1) as dominant (due to interprocessor communication instead of IDM in the second dimension), pHPF drops the communication for B(i-1,j) after augmenting the dominant communication with the dropped communication data set. In this case, it extends the upper bound of the data communicated for the second dimension of B(i-1,j-1). We have generalized our implementation of redundant communication elimination to do message coalescing as well.

- Multi-dimensional shift communication: Given an array reference with shift communication in d processor grid dimensions, each processor (ignoring the boundary cases) sends data to and receives data from 2^d - 1 processors. For example, communication for B(i-1,j-1) in Figure 8 requires each processor to send data to and receive data from 3 other processors. The pHPF compiler uses an optimization to reduce the number of messages exchanged in either direction from 2^d - 1 to d. This is accomplished by composing the communication in d steps and augmenting the data being communicated suitably in each step. Our experiments show noticeable performance improvements with this optimization even for shift communications in two dimensions.
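For reference, the sketch below spells out the basic one-dimensional shift exchange underlying Figure 8 in plain MPI for a (*,BLOCK) column distribution: each processor receives its left neighbour's last locally owned column so that references such as B(i,j-1) can be evaluated on its first local column. This is a hand-written illustration of the communication pattern only; the pHPF run-time routines and their interfaces are not reproduced here, and the local block shape is assumed.

      ! Hand-written sketch (not pHPF run-time code): one-dimensional shift
      ! exchange for a column-blocked array; each rank receives the left
      ! neighbour's last column into a halo buffer.
      program shift_exchange
        use mpi
        implicit none
        integer, parameter :: n = 64, ncols = 16      ! assumed local block: n x ncols
        double precision :: b(n, ncols), halo(n)
        integer :: rank, nprocs, left, right, ierr
        integer :: status(MPI_STATUS_SIZE)
        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
        left = rank - 1
        if (left < 0) left = MPI_PROC_NULL            ! no neighbour at the left boundary
        right = rank + 1
        if (right >= nprocs) right = MPI_PROC_NULL    ! no neighbour at the right boundary
        b = dble(rank)
        halo = -1.0d0
        ! Send my last locally owned column to the right neighbour and
        ! receive the left neighbour's last column into the halo buffer.
        call MPI_Sendrecv(b(:, ncols), n, MPI_DOUBLE_PRECISION, right, 0, &
                          halo,        n, MPI_DOUBLE_PRECISION, left,  0, &
                          MPI_COMM_WORLD, status, ierr)
        if (rank > 0) print *, 'rank', rank, 'received value', halo(1)   ! expect rank-1
        call MPI_Finalize(ierr)
      end program shift_exchange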

Coarse-Grain Wavefronting. Consider the program segment shown in Figure 9. The basic dependence-based algorithm would place the communication for A(i,j-1) inside the j-loop, and the communication for A(i-1,j) inside the i-loop, due to true dependences to these references from A(i,j). This leads to pipeline parallelism across both grid dimensions, but regardless of the loop ordering, one of the dimensions has extremely fine-grained parallelism, with high communication overhead. pHPF performs special analysis for fine-grained pipelined communication taking place inside a loop nest. It identifies the maximal fully-permutable inner loop nest [22], if all statements inside the loop nest have the same computation partitioning. Each pipelined communication corresponding to an array dimension with block distribution can be moved outside the fully-permutable loop nest. This follows from two observations. First, a fully-permutable loop nest can be tiled arbitrarily [22]. Second, the loop in which the block-distributed array dimension with pipelined communication is traversed can be tiled onto an outer loop traversing the processors (which does not appear as a separate loop in the SPMD code) and an inner loop traversing the local space on each processor; the communication can be moved between those loops. Thus, for the example of Figure 9, pHPF is able to move communication for both A(i-1,j) and A(i,j-1) outside the j-loop, leading to wavefront parallelism with the low communication overhead associated with a coarse-grain pipeline. In the future, we will experiment with loop strip-mining to further control the grain size of pipelining.

Mapping of Scalars. It is well known that replicating each scalar variable on all processors often leads to inefficient code. For example, replication of the variable x in Figure 10 would unnecessarily lead to each processor executing the first statement in the loop in every iteration, and to the values of B(1:n) and C(1:n) being broadcast to all the processors. For scalar variables that can be privatized, pHPF chooses among (i) alignment with a "producer" reference, (ii) alignment with a "consumer" reference, and (iii) no alignment. We first explain the mappings chosen by pHPF for the scalar variables in Figure 10, and then describe the general algorithm used to determine the mapping of scalars.

!HPF$ Align (i) with A(i) :: B,C,D
!HPF$ Align (i) with A(*) :: E,F
!HPF$ Distribute (block) :: A
      do i = 1, n
        x = B(i) + C(i)      ! align x with consumer, owner(D(i-2))
        y = A(i) + B(i)      ! align y with producer, owner(A(i))
        z = E(i) + F(i)      ! privatize z without alignment
        A(i-1) = y/z
        D(i-2) = x/z
      enddo

Figure 10: Different alignments of privatized scalars

The privatizable variable y is aligned with a producer reference, i.e., the rhs reference A(i) on the statement that computes the scalar value. This avoids any communication needed for the producer reference on that statement. The variable x is aligned with a consumer reference, i.e., an lhs reference (D(i-2)) which uses the scalar value in its computation. If x were aligned instead with a producer reference (B(i) or C(i)), the communication of x to the owner of D(i-2) would have required communication inside the i-loop, because of a dependence from the definition of x to the use of x inside the loop. By aligning x with D(i-2), communication is now needed for the two references B(i) and C(i), but these communications can be moved outside the i-loop. Finally, the variable z uses the values of the replicated array elements E(i) and F(i) in its computation, and is not explicitly aligned with any reference. Each processor that executes an iteration of the i-loop under the computation partitioning (as determined by the partitioning of other statements in that loop) "owns" and computes a temporary value of z in that loop iteration.

pHPF uses the static single assignment (SSA) representation [5] to associate a separate mapping decision with each assignment to a scalar. It chooses replication as the default mapping. For each definition of the scalar in a loop that can be privatized without a copy-out (based on the def/use analysis), and for which no reaching use identifies any other reaching definition, pHPF aligns the scalar with a reference to a partitioned array, if any, on the rhs of the statement. If each rhs reference on the statement is available on all processors, the scalar variable is explicitly marked as having no alignment. If any scalar value needs to be communicated to the owner of a consumer reference, then pHPF determines, in a separate pass, the desirability of changing the alignment of the communicated scalar (and of other scalars referenced in its computation) to a consumer reference instead. This change is only done if the new alignment shows a lower estimated cost resulting from communication being moved outside the loop.

Any scalar computed in a reduction operation (such as sum) carried out across a processor grid dimension is handled in a special manner. An additional privatized copy of the scalar is created to hold the results of the local reduction computation initially performed by each processor. A global reduction operation combines the values of the local operations.

Optimizing Statement Guards. Statement guards are needed to enforce the owner-computes rule. However, inner-loop guards can inhibit parallelism and significantly degrade performance. Several guard optimizations are performed:

- If all statements within a loop have the same local iteration set for that loop, loop bound shrinking is used to perform computation partitioning. Otherwise, guards are introduced for individual statements.

- The guards introduced for computation partitioning are hoisted out of the loop nest as far as possible. For a given statement, the local iteration sets for different loops and the guards for different processor grid dimensions are handled independently. This increases the ability to float those guards out of inner loops, since each type can be moved as far out as possible.

Source:

      real a(100,100)
!HPF$ distribute a(block,block)
      do i = 1, n
        a(i, c1) = 0
      end do

After computation partitioning:

      if (c1 >= iown_lower(a, 2) && c1 <= iown_upper(a, 2)) then
        do i = iown_lbound(a,1,i), iown_ubound(a,1,i), 1
          a(i, c1) = 0
        end do
      end if

Figure 11: Guard optimization example

In the example of Figure 11, there is no guard needed for the first processor grid dimension after the i-loop bounds are reduced to the local iteration set. The guard for the second dimension is based on a condition that is invariant inside the i-loop, and hence is moved outside the i-loop.
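To spell out the code shape of Figure 11 as ordinary, self-contained Fortran, the sketch below simulates a small processor grid serially: the guard on the distributed column dimension is evaluated once, outside the i-loop, while the i-loop runs only over the locally owned rows. The bound computations here stand in for the run-time queries of Figure 11, whose exact interfaces we do not reproduce, so this is an illustration of guard hoisting rather than generated code.

      ! Hand-written sketch of the hoisted guard in Figure 11 (not pHPF output).
      program guard_hoist
        implicit none
        integer, parameter :: n = 100, np = 2        ! assumed np x np processor grid
        integer :: pid1, pid2, bs, lb1, ub1, lb2, ub2, i, c1
        real :: a(n, n)
        a = 1.0
        c1 = 37
        bs = (n + np - 1) / np
        do pid2 = 0, np - 1                          ! simulate the grid serially
          do pid1 = 0, np - 1
            lb1 = 1 + bs*pid1;  ub1 = min(lb1 + bs - 1, n)    ! locally owned rows
            lb2 = 1 + bs*pid2;  ub2 = min(lb2 + bs - 1, n)    ! locally owned columns
            if (c1 >= lb2 .and. c1 <= ub2) then      ! hoisted, loop-invariant guard
              do i = lb1, ub1                        ! shrunk loop bounds
                a(i, c1) = 0.0
              end do
            end if
          end do
        end do
        print *, 'column', c1, 'fully zeroed:', all(a(:, c1) == 0.0)
      end program guard_hoist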

4 Benchmark Results

In this section we discuss some preliminary experimental results on a set of benchmark programs, and the utilization of the optimizations discussed in the previous section.

4.1 Experimental Setup

We chose four programs from the HPF benchmark suite developed by Applied Parallel Research, Inc. These are publicly available programs and vary in the degree of challenge that they present to the compiler. The first benchmark program, grid, is an iterative 2-D 9-point stencil program that features regular, nearest-neighbor communication followed by a global reduction operation. The grid program has little computation, so the benchmark version of the program takes the logarithms of the 9-point stencil and computes the exponential value of the average to artificially boost the computation/communication ratio. The second program, tomcatv (originally from the SPEC benchmark), is a mesh generator with Thompson's solver. The third program, the NCAR shallow water atmospheric model, is a non-trivial 2-D stencil program that contains both nearest-neighbor communication and some general communication. The last program, x42, is an explicit modeling system using fourth-order differencing (using values of 2 grid points on each side of the center point).

For each benchmark routine, we used the serial performance of the program, compiled using the IBM XLF compiler, as the base-line performance. When compiling the HPF programs, the number of physical processors was not specified at compile time. The performance results were obtained by running the same object code with different numbers of processors (1, 2, 4, 8, 16, and 32 processors). The speedup of the programs was then calculated by dividing the base-line time by the parallel time (Sp = T1/Tp).

All experiments (both the sequential and parallel runs) were done on a 36-processor IBM SP2 with 32 thin nodes and 4 wide nodes. All programs were compiled with the same optimization flag (-O3). For grid, NCAR shallow-water, and x42, timing for a hand-optimized version of the program using the message passing library (MPL) is also shown.

4.2 Results

Grid. The three-dimensional arrays are aligned to a template that is distributed onto a two-dimensional processor grid. The benchmark results were obtained with N = 500 and run for 20 cycles. Our compiler achieves linear speedup in this case.

As noted in the previous section, pHPF was successful in eliminating redundant communication with this program. The results marked grid* in Table 2 were obtained by specifying the number of processors at compile-time. This shows that pHPF is capable of generating high quality code when the number of processors is unknown at compile time.

    Speedups (serial time = 42.62 secs.)

    Program                  serial     1      2      4      8      16     32
    grid (block,block,*)      1.00    1.01   2.01   4.02   7.92   15.47  30.49
    grid* (block,block,*)     1.00    1.02   2.02   4.03   7.98   14.09  30.28
    MPL version               1.00    1.00   2.00   3.98   7.89   15.78  30.43

Table 2: Speedups of the Grid program

Tomcatv. All arrays in tomcatv are distributed column-wise onto a 1-D processor grid. The arrays are of size 514x514. The program first computes the residuals, which requires nearest-neighbor communication. Next the maximum values of the residuals are determined, and finally the tridiagonal system is solved in parallel. This computation iterates until the maximum value of the residuals converges. Idiom recognition identified the reduction operation for computing the maximum of the residuals. Communication for one-third of the references initially identified as needing communication was recognized as redundant and eliminated. Other optimizations that were useful included alignment of scalars with consumers and the replacement of induction variables. These optimizations enabled the bounds of the main computation loop nest to be shrunk, with no guards needed for statements inside the loop nest. Although the compiler achieves only 55% of the ideal speedup for 32 processors, this result is quite good compared to the results we have seen from other HPF compilers.

    Speedups (serial time = 40.06 secs.)

    Program                  serial     1      2      4      8      16     32
    tomcatv (*, block)        1.00    0.89   1.76   3.29   6.36   11.34  17.65

Table 3: Speedups of the Tomcatv program

NCAR Shallow Water Benchmark. The results of Table 4 were computed based on 512x512 arrays distributed (*, BLOCK). The compiler had to rely on the INDEPENDENT directive for some of the loops, which could not otherwise be recognized as parallel due to a statically unknown constant appearing in some subscripts. For a small number of processors the speedup is good, but it does not scale well when the number of processors is increased. If the arrays are distributed (BLOCK, BLOCK), the number of references that need communication increases from 25 to 45. The performance of the 2-D distribution is nevertheless better than that of the 1-D distribution, because the cost of the extra messages was less than the savings from sending shorter messages. Redundant communication elimination was also more effective for the 2-D distribution.

    Speedups (serial time = 14.34 secs.)

    Program                  serial     1      2      4      8      16     32
    ncar (*, block)           1.00    0.89   2.13   3.88   6.91   12.12  18.33
    ncar (block, block)       1.00    1.01   1.71   3.74   6.75   12.20  19.32
    MPL version               1.00    1.14   2.28   4.53   8.82   16.62  31.10

Table 4: Speedups of the NCAR shallow water program

X42. The benchmark version from APR only times the part of the program that computes the wave-fields. The arrays are distributed using a 1-D (*,BLOCK) distribution. Redundant communication elimination removes 8 of the 19 static communications.

    Speedups (serial time = 0.84 secs.)

    Program                  serial     1      2      4      8      16     32
    x42 (*, block)            1.00    1.03   1.95   3.65   6.61   10.98  16.45
    x42 (block, block)        1.00    1.18   2.03   3.81   6.99   13.36  21.96
    MPL version               1.00    1.00   1.98   3.85   7.50   13.77  24.70

Table 5: Speedups of the X42 program

Summary. Compiled HPF programs have many inherent performance overheads, which result in performance less than that of highly-tuned, hand-coded programs. The benchmark performance results reported above represent the combined effects of all the optimizations built into the compiler. The symbolic analysis ability of the compiler maintains the same level of performance when the number of processors is not known at compile time.

5 Related Work

Several groups have looked at the problem of compiling HPF-like languages on distributed memory machines [11, 23, 19, 3, 4, 15, 18]. Our work has also benefited from other early projects like Kali [14], Id Nouveau [17], and Crystal [16].

The Fortran D compiler [11] performs several optimizations like message vectorization, using collective communication, and exploiting pipeline parallelism. It also performs analysis to eliminate partially redundant communication for irregular computations [21]. The current version of the Fortran D compiler requires the number of processors to be known at compile time, and supports partitioning of only a single dimension of any array.

The SUPERB compiler [23] being developed at the University of Vienna represents the second generation of their compiler, which pioneered techniques like message vectorization and the use of overlap regions for shift communication. Their compiler only supports block distribution of array dimensions. It puts special emphasis on performance prediction to guide optimizations and data-partitioning decisions [6].

The Fortran 90D compiler [3] exploits parallelism from Fortran 90 constructs in generating the SPMD message-passing program; it does not attempt to parallelize sequential Fortran 77 programs. Their work has focused considerably on supporting parallel I/O and handling out-of-core programs [20].

The PARADIGM compiler [19, 2] is targeted to Fortran 77 programs and provides an option for automatic data partitioning for regular computations. It also supports exploitation of functional parallelism in addition to data parallelism.

The ADAPTOR system [4] supports the HPF subset and performs optimizations both for handling Fortran 90 array constructs and for improving cache locality, in addition to those for reducing communication costs. ADAPTOR determines communication schedules at run-time.

The SUIF compiler [1] performs loop transformations for increasing parallelism and for enhancing uniprocessor performance. The compiler also supports automatic data partitioning. They do not report results on message-passing machines.

6 Conclusions

We have described an HPF compiler for the IBM SP series parallel machines. Our compiler, pHPF, is unique in its ability to efficiently support both Fortran 90 array operations and sequential Fortran 77 loops in HPF programs. It handles statically unknown parameters like the number of processors usually with no performance degradation, as it uses symbolic analysis and does not resort to run-time determination of communication schedules. pHPF makes several contributions to optimizing communication: it eliminates redundant communication using data-availability analysis, deals with the problem of mapping scalar variables in a comprehensive manner, and performs special-purpose optimizations like coarse-grain wave-fronting and reducing the number of messages in multi-dimensional shift communications. We have presented experimental results which indicate that these optimizations lead to efficient code generation.

In the future, we plan to apply optimizations for communication across procedure boundaries through inter-procedural analysis. We plan to support block-cyclic distribution of array dimensions, and also to provide more efficient support for irregular computations. We are also investigating compilation strategies using remote memory copy operations like get and put as the basic primitives for transferring data across processors.

7 Acknowledgements

We thank Rick Lawrence and Joefon Jann for providing performance results for the MPL version of the benchmark programs. We would also like to thank Alan Adamson and Lee Nackman for their support.

References

[1] S. Amarasinghe, J. Anderson, M. Lam, and A. Lim. An overview of a compiler for scalable parallel machines. In Proc. Sixth Annual Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon, August 1993.

[2] P. Banerjee, J. Chandy, M. Gupta, E. Hodges, J. Holm, A. Lain, D. Palermo, S. Ramaswamy, and E. Su. An overview of the PARADIGM compiler for distributed-memory multicomputers. IEEE Computer, 1995. (To appear.)

[3] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka. A compilation approach for Fortran 90D/HPF compilers on distributed memory MIMD computers. In Proc. Sixth Annual Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon, August 1993.

[4] T. Brandes. ADAPTOR: A compilation system for data-parallel Fortran programs. In C. W. Kessler, editor, Automatic Parallelization: New Approaches to Code Generation, Data Distribution, and Performance Prediction. Vieweg Advanced Studies in Computer Science, Vieweg, Wiesbaden, January 1994.

[5] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451-490, October 1991.

[6] T. Fahringer and H. P. Zima. A static parameter based performance prediction tool for parallel programs. In Proc. 7th ACM International Conference on Supercomputing, Tokyo, Japan, July 1993.

[7] ANSI Fortran 90 Standard Committee. Fortran 90, 1990. ANSI standard X3.198-199x, which is identical to ISO standard ISO/IEC 1539:1991.

[8] High Performance Fortran Forum. High Performance Fortran language specification, version 1.0. Technical Report CRPC-TR92225, Rice University, May 1993.

[9] M. Gupta and P. Banerjee. A methodology for high-level synthesis of communication on multicomputers. In Proc. 6th ACM International Conference on Supercomputing, Washington D.C., July 1992.

[10] M. Gupta, E. Schonberg, and H. Srinivasan. A unified data-flow framework for optimizing communication. In Proc. 7th Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY, August 1994. Springer-Verlag.

[11] S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August 1992.

[12] D. Kimelman, P. Mittal, E. Schonberg, P. Sweeney, K. Wang, and D. Zernik. Visualizing the execution of High Performance Fortran (HPF) programs. In Proc. of the 1995 International Parallel Processing Symposium, April 1995.

[13] C. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele Jr., and M. E. Zosel. The High Performance FORTRAN Handbook. The MIT Press, Cambridge, MA, 1994.

[14] C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440-451, October 1991.

[15] J. M. Levesque. Applied Parallel Research's xHPF system. IEEE Parallel & Distributed Technology, page 71, Fall 1994.

[16] J. Li and M. Chen. Compiling communication-efficient programs for massively parallel machines. IEEE Transactions on Parallel and Distributed Systems, 2(3):361-376, July 1991.

[17] A. Rogers and K. Pingali. Process decomposition through locality of reference. In Proc. SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 69-80, June 1989.

[18] V. J. Schuster. PGHPF from The Portland Group. IEEE Parallel & Distributed Technology, page 72, Fall 1994.

[19] E. Su, D. J. Palermo, and P. Banerjee. Automating parallelization of regular computations for distributed memory multicomputers in the PARADIGM compiler. In Proc. 1993 International Conference on Parallel Processing, St. Charles, IL, August 1993.

[20] R. Thakur, R. Bordawekar, and A. Choudhary. Compiler and runtime support for out-of-core HPF programs. Technical Report SCCS-597, NPAC, Syracuse University, 1994.

[21] R. v. Hanxleden and K. Kennedy. Give-n-take: a balanced code placement framework. In Proc. ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, Orlando, Florida, June 1994.

[22] M. E. Wolf and M. S. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452-471, October 1991.

[23] H. Zima and B. Chapman. Compiling for distributed-memory systems. Proceedings of the IEEE, 81(2):264-287, February 1993.
