On minimizing materializations of array-valued temporaries



On Materializations of Array-Valued Temporaries

Daniel J. Rosenkrantz, Lenore R. Mullin, and Harry B. Hunt III

Computer Science Department, University at Albany – SUNY
Albany, NY 12222
{djr, lenore, hunt}@cs.albany.edu

Abstract. We present results demonstrating the usefulness of monolithic program analysis and optimization prior to scalarization. In particular, models are developed for studying nonmaterialization in basic blocks consisting of a sequence of assignment statements involving array–valued variables. We use these models to analyze the problem of minimizing the number of materializations in a basic block, and to develop an efficient algorithm for minimizing the number of materializations in certain cases.

1 Introduction

Here, we consider the analysis and optimization of code utilizing operations and functions operating on entire arrays or array sections (rather than on the underlying scalar domains of the array elements). We call such operations monolithic array operations.

In [16], Veldhuizen and Gannon refer to traditional approaches to optimization as transformational, and say that a difficulty with “transformational optimizations is that the optimizer lacks an understanding of the intent of the code. ... More generally, the optimizer must infer the intent of the code to apply higher–level optimizations.” The dominant approach to compiler optimization of array languages (Fortran 90, HPF, etc.) is transformational optimization done after scalarization. In contrast, our results suggest that monolithic style programming, and subsequent monolithic analysis prior to scalarization, can be used to perform radical transformations based upon intensional analysis. Determining intent subsequent to scalarization seems difficult, if not impossible, because much global information is obfuscated. Finally, many people doing scientific programming find it easier and more natural to program using high level monolithic array operations. This suggests that analysis and optimization at the monolithic level will be of growing importance in the future.

In this paper, we study materializations of array–valued temporaries, with a focus on elimination of array–valued temporaries in basic blocks.

There has been extensive research on nonmaterialization of array–valued intermediate results in evaluating array–valued expressions. The optimization of array expressions in APL entails nonmaterialization of array subexpressions involving composition of indexing operations [1, 2, 3, 4]. Nonmaterialization of array–valued intermediate results in evaluating array–valued expressions is done in Fortran 90, HPF, ZPL [7], POOMA [5], C++ templates [14, 15], the Matrix Template Library (MTL) [8], active libraries [16], etc. Mullin’s Psi Calculus model [10, 9] provides a uniform framework for eliminating materializations involving array addressing, decomposition, and reshaping. Nonmaterialization in basic blocks is more complicated than in expressions because a basic block may contain assignment statements that modify arrays that are involved in potential nonmaterializations. Some initial results on nonmaterialization applied to basic blocks appear in Roth et al. [6, 11, 12].

In Section 2, we develop a framework, techniques, algorithms, etc., for the study of nonmaterialization in basic blocks. We give examples of how code can be optimized via nonmaterialization in basic blocks, and then formalize this optimization problem. We then formalize graph–theoretic models to capture fundamental concepts of the relationship between array values in a basic block. These models are used to study nonmaterializations, and to develop techniques for minimizing the number of materializations. A lower bound is given on the number of materializations required. An optimization algorithm for certain cases is given. This algorithm is optimal; the number of materializations produced exactly matches the lower bound.

In Section 3, we give brief conclusions.

2 Nonmaterialization in Basic Blocks

2.1 Examples of Nonmaterialization

Many array operations, represented in HPF and Fortran 90 as sectioning and as intrinsics such as transpose, cshift, eoshift, reshape, and spread, involve the rearrangement and replication of array elements. We refer to these operations as address shuffling operations. These operations essentially utilize array indexing, and are independent of the domain of values of the scalars involved. Often it is unnecessary to generate code for these operations. Instead of materializing the result of such an operation (i.e., constructing the resulting array at run–time), the compiler can keep track of how elements of the resulting array can be obtained by appropriately addressing the operands of the address shuffling operation. Subsequent references to the result of the operation can be replaced by suitably modified references to the operands of the operation.

As an example of nonmaterialization, consider the following statement.

X = B + transpose(C)

Straightforward code for this example would first materialize the result of the transpose operation in a temporary array, say named Y, and then add this temporary array to B, assigning the result to X. Indeed, instead of using a single assignment statement, the programmer might well have expressed the above statement as such a sequence of two statements, forming part of a basic block, as follows.

Y = transpose(C)
X = B + Y

A straightforward compiler might produce a separate loop for each of the above statements, but an optimizer could fuse the two loops, producing the following single loop.

forall (i = 1:N, j = 1:N)
  Y(i,j) = C(j,i)
  X(i,j) = B(i,j) + Y(i,j)

end forall

However, Y need not be materialized at all, yielding the following code.

forall (i = 1:N, j = 1:N)
  X(i,j) = B(i,j) + C(j,i)

end forall
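To illustrate the kind of bookkeeping a compiler or runtime could use for such a nonmaterialized shuffle, here is a minimal Python sketch. The ShuffleView class and its index_map are hypothetical illustrations introduced here, not part of the paper or of any compiler it cites; the point is only that reads of the would-be temporary Y are redirected to the operand C through an index mapping.

class ShuffleView:
    """A non-materialized result of an address-shuffling operation: element
    reads are redirected to the source array through an index mapping."""
    def __init__(self, source, index_map):
        self.source = source        # the materialized operand (nested lists here)
        self.index_map = index_map  # maps result indices to operand indices
    def __getitem__(self, idx):
        i, j = self.index_map(idx)
        return self.source[i][j]

# Y stands for transpose(C), but no transposed array is ever built.
C = [[1, 2], [3, 4]]
Y = ShuffleView(C, lambda ij: (ij[1], ij[0]))   # Y(i,j) reads C(j,i)

# X = B + transpose(C), evaluated element by element against the view.
B = [[10, 20], [30, 40]]
N = 2
X = [[B[i][j] + Y[i, j] for j in range(N)] for i in range(N)]
print(X)   # [[11, 23], [32, 44]]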

Nonmaterialization is also an issue in optimizing distributed computation. Kennedy et al. [6], Roth [11], and Roth et al. [12] developed methods for optimizing stencil computations by nonmaterialization of selected cshift operations involving distributed arrays. This nonmaterialization was aimed at minimizing communications, including both the amount of data transmitted and the number of messages used. The amounts of the shifts in the stencils involved were constants, so the compiler could determine which cshift operations in the basic block were potential beneficiaries of nonmaterialization. The compiler could analyze the basic block and choose not to materialize some of these cshift operations. The nonmaterialization technique in [6, 11, 12] used a sophisticated form of replication, where a subarray on a given processor was enlarged by adding extra rows and columns that were replicas of data on other processors. The actual computation of the stencil on each processor referred to the enlarged array on that processor.

For instance, consider the following example, taken from Roth et al. [12].

(1) RIP = cshift(U,shift=+1,dim=1)
(2) RIN = cshift(U,shift=-1,dim=1)
(3) T = U + RIP + RIN

The optimized version of this code is the following. The cshift operations in statements (1) and (2) are replaced by overlap_cshift operations, which transmit enough data from U between processors so as to fill in the overlap areas on each processor. In statement (3), the references to RIP and RIN are replaced by references to U, annotated with appropriate shift values, expressed using superscripts.

(1) call overlap_cshift(U,shift=+1,dim=1)
(2) call overlap_cshift(U,shift=-1,dim=1)
(3) T = U + U<+1,0> + U<−1,0>


2.2 Formulation of Nonmaterialization Optimization Problem for Basic Blocks

We now begin developing a framework for considering the problem of minimizing the number of materializations in a basic block (equivalently, maximizing the number of shuffle operations that are not materialized).

Definition 1. A def–value in a basic block is either the initial value (at the beginning of the basic block) of an array variable or an occurrence of an array variable that is the destination of an assignment. (An assignment can be to the complete array, to a section of the array, or to a single element of the array.)

Definition 2. A complete–def is a definition to an array variable in which an assignment is made to the complete array. A partial–def is a definition to an array variable in which an assignment is made to only some of the array elements, and the other array elements retain their prior values. An initial–def is an initial value of an array variable.

Note that each def–value can be classified as either a complete–def, a partial–def, or an initial–def. For instance, consider the following example.

Example 1.
(1) B = cshift(A,shift=+5,dim=1)
(2) C = cshift(A,shift=+3,dim=1)
(3) D = cshift(A,shift=-2,dim=1)
(4) B(i) = 50
(5) R(2:199) = B(2:199) + C(2:199) + D(2:199)
A, B, C, and D all dead at end of basic block, R live.

For convenience, we identify each def–value by the name of its variable, superscripted with either 0 for an initial–def, or the statement assigning the def–value for a noninitial–def. Example 1 involves initial–defs A0 and R0, complete–defs B1, C2, and D3, and partial–defs B4 and R5.

An important issue for nonmaterialization is determining whether a given shuffle operation is a candidate for nonmaterialization. The criterion for a given shuffle operation being a candidate is that it be both safe and profitable. The criterion for being safe is that the source array of the shuffle is not modified while the definition of the destination array is live, and that the destination array is not partially modified while the source array is live [13, 11]. The criteria for profitability can depend on the optimization goal, the shuffle operation involved, the shape of the arrays involved, architectural features of the computing environment on which the computation will be performed, and the distribution/alignment of the arrays involved. For instance, Roth [11] gives criteria for stencil computations on distributed machines. For purposes of this paper, we assume the availability of appropriate criteria for determining whether a given shuffle operation is a candidate for nonmaterialization.
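The safety condition can be made concrete with a rough Python sketch, written under the Coarse Analysis Assumption of Section 2.2 (no index analysis, each assignment is simply full or partial). The Stmt record, its field names, and is_safe_candidate are hypothetical names chosen for this illustration, not taken from the paper.

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Stmt:
    defs: Set[str] = field(default_factory=set)   # arrays assigned (fully or partially)
    partial: bool = False                         # True if the assignment is a partial-def
    uses: Set[str] = field(default_factory=set)   # arrays read

def is_safe_candidate(block: List[Stmt], k: int, src: str, dst: str,
                      live_at_exit: Set[str]) -> bool:
    """Statement k has the form dst = shuffle(src); check the two safety conditions."""
    n = len(block)
    last_use_of_dst = n if dst in live_at_exit else max(
        (i for i, s in enumerate(block) if dst in s.uses), default=k)
    src_live_until = n if src in live_at_exit else max(
        (i for i, s in enumerate(block) if src in s.uses), default=k)
    for i in range(k + 1, n):
        s = block[i]
        # (1) src must not be modified while dst's definition is live.
        if src in s.defs and i <= last_use_of_dst:
            return False
        # (2) dst must not be partially modified while src is live.
        if dst in s.defs and s.partial and i <= src_live_until:
            return False
    return True

On Example 1 below, such a test accepts statements (1), (2), and (3); once B has been merged with A, so that statement (4) becomes a partial–def of A, the same test rejects statements (2) and (3), matching the discussion later in this section.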


Definition 3. An eligible statement is an assignment statement whose right side consists of a shuffle operation meeting the criteria for nonmaterialization. We say that the def–value occurring on the right side of an eligible statement is mergeable with the def–value occurring on the left side of the statement. An ineligible statement is a statement that is not an eligible statement.

In a given basic block, the decisions as to which eligible statements should be nonmaterialized are interdependent. Roth [11] uses a “greedy” algorithm to choose which eligible statements to nonmaterialize. Namely, at each step, his algorithm chooses the first eligible statement, and modifies the basic block so that this statement is not materialized. Each such choice can cause subsequent previously eligible statements to become ineligible. Example 1 illustrates this phenomenon.

Consider the basic block shown in Example 1. Assume that statements (1), (2), and (3) are eligible statements. Roth’s method would first consider the shuffle operation in statement (1), and choose to merge B with A. This merger would be carried out as shown below, changing the partial–def of B in statement (4) into a partial–def of A. In statements (4) and (5), the reference to B is replaced by a reference to A, annotated with the appropriate shift value.

(1) call overlap_cshift(A,shift=+5,dim=1)
(2) C = cshift(A,shift=+3,dim=1)
(3) D = cshift(A,shift=-2,dim=1)
(4) A<+5>(i) = 50
(5) R(2:199) = A<+5>(2:199) + C(2:199) + D(2:199)

Note that the new partial–def of A in statement (4) makes statements (2) and (3) unsafe, and therefore ineligible, thereby preventing C and D from being merged with A. Thus there would be three copies of the array. But it is better to make a separate copy of A in B, and let A, C, and D all share the same copy, as follows:

(1) B = cshift(A,shift=+5,dim=1)
(2) call overlap_cshift(A,shift=+3,dim=1)
(3) call overlap_cshift(A,shift=-2,dim=1)
(4) B(i) = 50
(5) R(2:199) = B(2:199) + A<+3>(2:199) + A<−2>(2:199)

The above example generalizes. Suppose that instead of just C and D, there were n additional variables which could be merged with A. Roth’s method would only merge B with A, resulting in a total of n+1 materialized arrays. In contrast, by making a separate copy for B, there would be a total of only two materialized arrays. Consequently, the “greedy” algorithm can be arbitrarily worse than optimal.
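The gap can be tallied directly. The two helper functions below are just arithmetic restating the counts in the preceding paragraph for a hypothetical generalized example with n additional shuffles of A besides the one into B that is later partially modified; they are an illustration, not part of either algorithm.

def greedy_count(n):
    # Merging B with A first turns B's later partial-def into a partial-def of A,
    # which makes the remaining n shuffles unsafe; each of them is materialized,
    # in addition to the copy holding A itself.
    return n + 1

def grouped_count(n):
    # Give B its own copy, and let A and the n other shuffles share one copy.
    return 2

for n in (2, 10, 100):
    print(n, greedy_count(n), grouped_count(n))   # the gap grows with n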

The optimization problem we consider is to minimize the number of materializations in a basic block, under the following seven assumptions:


Assumptions:

1. No dead code. (No Dead Assumption)
2. Arrays with different names are not aliased.¹ (No Aliasing Assumption)
3. No rearrangement of the ineligible statements in a basic block. (Fixed Sequence Assumption)
4. The mergeability relation between def–values in a basic block is symmetric and transitive. (Full Mergeability Assumption)
5. There is no fine–grained analysis of array indices. Recall that each def–value is classified as being either an initial–def, a complete–def, or a partial–def. No analysis is made as to whether two partial–defs overlap or whether a given use overlaps a given partial–def. (Coarse Analysis Assumption)
6. The initial value of each variable that is live on entry to a basic block is stored in the variable’s official location. (Live Entry Assumption)
7. A variable that is live at exit from a basic block must have its final value at the end of the basic block stored in the variable’s official location. (Live Exit Assumption)

In considering Assumptions 6 and 7, note that any def–value that is not live at entry or exit from a basic block need not be stored in the official location of its variable.

2.3 The Clone Forest Model

We now consider sets of mutually mergeable def–values, represented as trees in the clone forest, defined as follows.

Definition 4. The clone forest for a basic block is a graph with a node for each def–value occurring in the basic block, and a directed edge from the source def–value to the destination def–value of each eligible shuffle operation. A clone set is the set of def–values occurring in a tree of the clone forest. The root of a given clone set is the def–value that is the root of the clone tree corresponding to the clone set.

As an example, the clone forest for Example 1 is shown in Figure 1.
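As a sketch of how the clone forest of Example 1 could be assembled from its eligible statements, the following Python fragment uses a simple dictionary-of-parents representation; the names eligible, parent, root, and clone_sets are ours, introduced only for this illustration.

# Eligible shuffle statements of Example 1, as (source def-value, destination def-value).
eligible = [("A0", "B1"), ("A0", "C2"), ("A0", "D3")]
def_values = ["A0", "R0", "B1", "C2", "D3", "B4", "R5"]

parent = {d: None for d in def_values}
for src, dst in eligible:
    parent[dst] = src                 # directed edge from source to destination

def root(d):
    while parent[d] is not None:      # walk up to the root of d's clone tree
        d = parent[d]
    return d

clone_sets = {}
for d in def_values:
    clone_sets.setdefault(root(d), set()).add(d)

print(clone_sets)
# e.g. {'A0': {'A0', 'B1', 'C2', 'D3'}, 'R0': {'R0'}, 'B4': {'B4'}, 'R5': {'R5'}}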

Definition 5. A materialization is either an initial–def or a complete–def.

An important observation is that at least one materialization is needed for each clone set whose root is either an initial–def or a complete–def. The overall optimization problem can be viewed as minimizing the total number of materializations for all the clone sets for a given basic block. In Example 1, the only clone set with more than one member is {A0, B1, C2, D3}. The basic block can be optimized by materializing A0 and B1 from this clone set, and materializing R0 from clone set {R0}. Clone sets {B4} and {R5} do not require new materializations; they can be obtained by partial–defs to already materialized arrays.

Fig. 1. Clone forest for Example 1: node A0 has children B1, C2, and D3; nodes B4, R0, and R5 are isolated.

¹ However, the techniques in this paper can be suitably modified to use aliasing information.

Note that even if there is a materialization for a given clone set, the materialized member of the clone set need not necessarily be the root of the clone set. This freedom may be necessary for optimization, as illustrated in the following example.

Example 2.
(1) A = C + D
(2) B = cshift(A,shift=+3,dim=1)
(3) z = A(i) + B(i)
A dead, B live at end of basic block.

Consider the clone set consisting of A1 and B2. The root, A1, is a complete–def, and so at least one materialization is needed for this clone set. Since A is dead at the end of the basic block, and B is live, it is advantageous to materialize the value of the clone set in variable B instead of variable A. Thus, an optimized version of the basic block might be as follows. Here, the partial result C + D is annotated by the compiler to indicate that the value stored in variable B should be appropriately shifted.

(1′) B = (C + D)<+3>
(3) z = B<−3>(i) + B(i)

We now establish a lower bound on the number of separate copies required for a given clone set.

Definition 6. A given def–value in a basic block is transient if it is not the last def to its variable in the basic block, and the subsequent def to its variable is a partial–def.

For instance, in Example 1, def–value B1 is transient.

Definition 7. A given def–value in a basic block is persistent if it is the last def to its variable in the basic block, and its variable is live at the end of the basic block.

For instance, in Example 1, def–value R5 is persistent.


Theorem 1. Consider the clone tree for a given clone set. There needs to be a separate copy for each def–value in the clone tree that is either transient or persistent.

Proof: Two transient def–values cannot share the same copy because of Assumptions 1 and 5. Two persistent def–values cannot share the same copy because of Assumption 7. A transient def–value and a persistent def–value cannot share the same copy because of Assumptions 1 and 7. □

2.4 Main Techniques for Nonmaterialization Under Constrained Statement Rearrangement

In this section, we outline our main techniques and results on the nonmaterialization problem for basic blocks under constrained statement rearrangement.

First, note that for each transient def–value of a given clone set, there is a subsequent partial–def to its variable, and this partial–def is the root of another clone set. This relationship can be represented in a graph with a node for each clone set, as formalized by the following definition.

Definition 8. The version forest for a basic block is a graph with a node for each clone set in the basic block, and a directed edge from clone set α to clone set β if the root of clone set β is a partial–def that modifies a member of α. The root def–value of a given tree in the version forest is the root def–value of the clone set of the root node of the version tree.

As an example, the version forest for Example 1 is shown in Figure 2. There are two trees in the version forest, with root def–values A0 and R0, respectively.

Fig. 2. Version forest for Example 1: edges {A0, B1, C2, D3} → {B4} and {R0} → {R5}.
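Continuing the earlier sketch, the version forest of Example 1 can be derived from the clone sets together with the partial–defs. The dictionaries clone_sets and modifies and the helper owning_root below are hypothetical names; the def–use facts are copied from Example 1.

# Clone sets of Example 1, keyed by their root def-value.
clone_sets = {
    "A0": {"A0", "B1", "C2", "D3"},
    "R0": {"R0"},
    "B4": {"B4"},
    "R5": {"R5"},
}
# Each partial-def is the root of its own clone set and modifies some def-value:
# B4 modifies B1, and R5 modifies R0.
modifies = {"B4": "B1", "R5": "R0"}

def owning_root(def_value):
    for r, members in clone_sets.items():
        if def_value in members:
            return r
    raise KeyError(def_value)

# A version-forest edge runs from the clone set containing the modified
# def-value to the clone set rooted at the partial-def.
version_edges = [(owning_root(target), partial) for partial, target in modifies.items()]
print(version_edges)   # [('A0', 'B4'), ('R0', 'R5')]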

Definition 9. A node of a version forest is persistent if any of its def–values are persistent.

For instance, in Figure 2, the only persistent node is the node for clone set {R5}.

Definition 10. The origin point of an initial–def is just before the basic block. The origin point of a complete–def or a partial–def is the statement in which it receives its value. The origin point of a clone set is the origin point of the root def–value of the clone set.


For instance, in Figure 2, the origin point of {A0, B1, C2, D3} is (0), of {B4} is (4), of {R0} is (0), and of {R5} is (5).

Definition 11. The demand point of a persistent def–value in a basic block is just after the basic block. The demand point of a non–persistent def–value is the last ineligible statement that contains a use of the def–value, and is (0) if there is no ineligible statement that uses it. The demand point of a clone set is the maximum demand point of the def–values in the clone set.

For instance, in Figure 2, the demand point of {A0, B1, C2, D3} is (5), of {B4} is (5), of {R0} is (0), and of {R5} is (6).

Next, we formalize the concept of an essential node of a version tree. Informally, an essential node is a node whose value is needed after the values of all its child nodes (if any) have been produced. Consequently, the materialization used to hold the value of an essential node in order to satisfy this need for the value cannot be the same as any of the materializations that are modified to produce the values of the child nodes.

Definition 12. A node of a version tree is an essential node if it is either a leaf node, or a non–leaf node whose demand point exceeds the maximum origin point of its children.

For instance, in Figure 2, node {A0, B1, C2, D3} is essential because its demand point (5) exceeds the origin point (4) of its child {B4}. Node {R0} is not essential because its demand point (0) does not exceed the origin point (5) of its child {R5}. Nodes {B4} and {R5} are leaf nodes, and so are essential.
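The essential-node test of Definition 12 is easy to state in code. The sketch below hard-codes the origin and demand points quoted above for Example 1; the abbreviated node labels and the dictionary layout are ours, introduced only for the illustration.

# "A" stands for node {A0,B1,C2,D3} and "R" for node {R0}.
origin = {"A": 0, "B4": 4, "R": 0, "R5": 5}
demand = {"A": 5, "B4": 5, "R": 0, "R5": 6}
children = {"A": ["B4"], "R": ["R5"], "B4": [], "R5": []}

def is_essential(node):
    if not children[node]:            # every leaf node is essential
        return True
    return demand[node] > max(origin[c] for c in children[node])

print({n: is_essential(n) for n in children})
# {'A': True, 'B4': True, 'R': False, 'R5': True}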

Proposition 1. Every persistent node of a version tree is essential.

Proof: A persistent leaf node is essential by definition. A persistent non–leaf node is essential because its demand point is just after the basic block, while the origin point of each of its children is within the basic block. □

Recall that there are three materializations in the optimized code for Example 1. There is the materialization of A0, which is also used for C2 and D3. There is a materialization of transient def–value B1, which is subsequently modified by the partial–def that creates B4. Finally, there is a materialization of transient def–value R0, which is subsequently modified by the partial–def that creates R5. In this example, the number of materializations in the optimized code equals the number of essential nodes in the version forest. Each of the three essential nodes of the version forest (shown in Figure 2) for this basic block can be associated with one of these materializations. Node {A0, B1, C2, D3} is associated with materialization A0. Node {B4} is associated with materialization B1 (via the partial–def to B4). Node {R5} is associated with materialization R0 (via the partial–def to R5). Nonessential node {R0} is associated with the same materialization (R0) as is associated with its child node, so that the partial–def R5 that creates the child node can modify the variable R associated with the parent node.

Now consider the following example, which illustrates changing the destination variable of a complete–def, so that a nonessential parent node utilizes the same variable as its child node.


Example 3.
(1) A = G + H
(2) B = cshift(A,shift=+5,dim=1)
(3) C = cshift(A,shift=+3,dim=1)
(4) x = C(i) + 3
(5) B(i) = 50
(6) A = G - H
A and B live, C dead at end of basic block.

The version forest for this basic block is shown in Figure 3.² The basic block can be optimized by merging A1, B2, and C3 into a single copy stored in B’s official location, as shown below. (Thus, the references to A, B, and C in statements (1) through (5) are all replaced by references to B.)

(1′) B = (G + H)<+5>
(4) x = B<−2>(i) + 3
(5) B(i) = 50
(6) A = G - H

The complete–def A1 is transformed into a complete–def to variable B, since this permits the version tree with root def–value A1 to be evaluated using only one materialization. The materialization is in variable B, rather than variable A, because def–value B5 is persistent. Def–value A6 is a persistent def, and so uses the official location of A.

Note that the version forest contains two essential nodes, {B5} and {A6}, and the optimized code uses two materializations. In the optimized code, essential node {B5} utilizes variable B, and is associated with new materialization B1′ (via the partial–def to B5). Essential node {A6} is associated with materialization A6. Nonessential node {A1, B2, C3} is associated with the same materialization (B1′) as is associated with its child node, so that the partial–def B5 that creates the child node can modify the variable B utilized by the parent node. The reference in statement (4) to def–value C3 from node {A1, B2, C3} is replaced by a reference to variable B, which holds def–value B1′.

Fig. 3. Version forest for Example 3: edge {A1, B2, C3} → {B5}; {A6} is an isolated node.

² Since the only defs to G and H in the basic block are initial–defs, there is no need to include nodes for G0 and H0 in the version forest.


The following example illustrates how the freedom to rearrange eligible statements, as permitted by Assumption 3 (Fixed Sequence Assumption), can be exploited to reduce the number of materializations.

Example 4.
(1) B = cshift(A,shift=+5,dim=1)
(2) A(i) = 120
(3) x = B(j) + 3
(4) C = cshift(A,shift=+2,dim=1)
A and B dead, C live at end of basic block.

The version forest for this basic block is shown in Figure 4. Note that both nodes of the version forest are essential. The basic block can be optimized by moving the complete–def C4 forward so that it occurs before statement (2), and letting the partial–def in statement (2) modify C. The optimized code is shown below, where old statement (4) is relabeled (4′) and is now the first statement in the basic block.

(4′) C = cshift(A,shift=+2,dim=1)
(2′) C<−2>(i) = 120
(3′) x = A<+5>(j) + 3

The optimized code contains two materializations, and is made possible by changing the reference to B1 in the demand point statement (3) to be a reference to its clone A0. The optimized code utilizes materialization A0 for node {A0, B1}, and utilizes, via partial–def C2′, new materialization C4′ for node {A2, C4}.

Fig. 4. Version forest for Example 4: edge {A0, B1} → {A2, C4}.

Theorem 2. The number of materializations needed to evaluate a given version tree is at least the number of essential nodes.

Proof Sketch: Let us say that a given def–value utilizes a given materialization if the def–value is obtained from the materialization via a sequence of zero or more partial–defs. Note that all the def–values utilizing a given materialization are def–values for the same variable, and the version forest nodes containing these def–values lie along a directed path in the version forest.


Let us say that a given reference to a def–value utilizes the materialization that is utilized by the referenced def–value. Each non–persistent essential node of the given version tree has a nonzero demand point. We say that a final use of a non–persistent essential node is a use of a def–value from this node in the demand point statement for the node. We envision arbitrarily selecting a final use of each non–persistent essential node, and we refer to the materialization utilized by this final use as the key utilization of the node. We say that the final use of a persistent node is one of the persistent def–values in the node, and the key utilization of the node is the materialization utilized by this def–value.

Now consider the version forest corresponding to an optimized evaluation of the basic block, where the optimization is constrained by the assumptions given above. Each node of the version forest for the optimized basic block corresponds to a node of the given version forest. In the optimized code, the key utilizations of two distinct essential nodes cannot be the same materialization. □

Next, we note that it is not always possible to independently optimize each version tree, because of possible interactions between the initial and persistent def–values of the same variable. This interaction is captured by the concept of “persistence conflict”, as defined below.

Definition 13. A persistence conflict in a basic block is a variable that is live on both entry to and exit from the basic block.

Persistence conflicts may prevent each version tree for a given basic block from being evaluated with a minimum number of materializations, as illustrated by the following example.

Example 5.
(1) B = cshift(A,shift=+5,dim=1)
(2) A = G + H
(3) B(i) = 50
(4) x = B(j) + 3
A live and B dead at end of basic block.

The version forest for this basic block is shown in Figure 5. The only essential nodes are {B3} and {A2}. Each of the two version trees can by itself be evaluated using only a single materialization, corresponding to def–values A0 and A2, respectively. However, there is a persistence conflict involving variable A. Assumption 3 prevents statements (2), (3), and (4) from being reordered, so in this example, three materializations are necessary.

We next show that for a version tree with no persistent nodes, the lower bound of Theorem 2 has a matching upper bound. As part of the proof, we provide an algorithm for producing the optimized code.

Theorem 3. Assuming no persistence conflicts, a version tree with no persistent nodes can be computed using one materialization for each essential node, and no other materializations.


Fig. 5. Version forest for Example 5: edge {A0, B1} → {B3}; {A2} is an isolated node.

Proof Sketch: The following algorithm is given the version tree and basic block as input, and produces the optimized code. If there are multiple version trees, Steps 1 and 2 of the algorithm can be done for each version tree, and then the basic block can be transformed. The algorithm associates a variable name with each version tree node. We call this variable name the utilization–variable of the node. In the code produced by the algorithm, all the uses of def–values from a given node of the version tree reference the utilization–variable of that node. In the portion of the algorithm that determines the utilization–variable of each node, we envision the children of each node being ordered by their origin point, so that the child with the last origin point is the rightmost child.

Step 1. A unique utilization–variable is associated with each essential node, as follows. With the following exception, the utilization–variable for each essential node is a unique temporary variable.³ The exception occurs if the root def–value is an initial–def. In this case, the name of the variable of the initial–def is made the utilization–variable for the highest essential node on the path from the root of the version tree to the rightmost leaf.

Step 2. The utilization–variables of nonessential nodes are determined in a bottom–up manner, as follows. The utilization–variable of a nonessential node is made the same as the utilization–variable of its rightmost child (i.e., the child with the maximum origin point).

Step 3. In the evaluation of the basic block, the ineligible statements occur in the same order as given. A materialized shuffle statement is included for each child node whose utilization–variable is different from the utilization–variable of its parent node. This shuffle statement is placed just after the origin point of the parent node. All other eligible statements for the version tree are deleted.⁴

Step 4. For each shuffle statement included in Step 3, the utilization–variable of the parent node is made the source of the shuffle statement, and the utilization–variable of the child node is made the destination of the shuffle statement.

Step 5. If the root def–value is a complete–def, then in the evaluation of the basic block, the utilization–variable of the root node is used in the left side of the statement for the complete–def. For each nonroot node, the utilization–variable of the node is used in the left side of the partial–def statement occurring at the origin point of the node.

Step 6. In the ineligible statements, each use of a def–value from the version tree is replaced by a reference to the utilization–variable of the version tree node containing the def–value, with appropriate shift annotation if needed.

Consider the code produced by the above algorithm. In the evaluation of the version tree, the number of variable names used equals the number of essential nodes, and the number of materializations equals the number of variable names used. □

³ Alternately, one of the variable names occurring in the node's clone set can be used, but with each variable name associated with at most one essential node.

⁴ However, a cshift operation involving distributed arrays is replaced by an overlap_cshift operation placed just after the origin point of the parent node.
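To make Steps 1 and 2 concrete, here is a condensed Python sketch of the utilization–variable assignment, run on the version tree of Example 1 rooted at A0. The Node class, its field names, and the fresh temporaries T1, T2, ... are hypothetical illustrations of ours; as footnote 3 notes, a name from the node's clone set (such as B) could be used in place of a fresh temporary.

from itertools import count

class Node:
    def __init__(self, label, origin, essential, children=()):
        self.label = label                   # e.g. "{A0,B1,C2,D3}"
        self.origin = origin                 # origin point of the node's clone set
        self.essential = essential
        self.children = sorted(children, key=lambda c: c.origin)   # rightmost = last
        self.util_var = None

temps = (f"T{i}" for i in count(1))          # fresh temporary names

def assign_util_vars(root, root_is_initial_def, initial_var=None):
    # Step 1: each essential node gets a unique temporary ...
    def step1(n):
        if n.essential:
            n.util_var = next(temps)
        for c in n.children:
            step1(c)
    step1(root)
    # ... except that, when the root def-value is an initial-def, the highest
    # essential node on the path to the rightmost leaf keeps the original name.
    if root_is_initial_def:
        n = root
        while n is not None:
            if n.essential:
                n.util_var = initial_var
                break
            n = n.children[-1] if n.children else None
    # Step 2 (bottom-up): a nonessential node takes its rightmost child's variable.
    def step2(n):
        for c in n.children:
            step2(c)
        if not n.essential:
            n.util_var = n.children[-1].util_var
    step2(root)

# Version tree of Example 1 rooted at A0: child {B4}; both nodes essential.
child = Node("{B4}", origin=4, essential=True)
root = Node("{A0,B1,C2,D3}", origin=0, essential=True, children=[child])
assign_util_vars(root, root_is_initial_def=True, initial_var="A")
print(root.util_var, child.util_var)   # A T2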

In the optimized code for Example 1, node {A0, B1, C2, D3} utilizes materialization A0 and variable A, node {B4} utilizes materialization B1 and variable B, and nodes {R0} and {R5} utilize materialization R0 and variable R. In Example 2, the version tree has only one node ({A1, B2}); the optimized code for this node utilizes materialization B1′ and variable B. In the optimized code for Example 3, nodes {A1, B2, C3} and {B5} utilize materialization B1′ and variable B, and node {A6} utilizes materialization A6 and variable A. In the optimized code for Example 4, node {A0, B1} utilizes materialization A0 and variable A, and node {A2, C4} utilizes materialization C4′ and variable C.

Theorem 4. Assuming no persistence conflicts, the number of materializations needed to compute a version tree with no persistent nodes is the number of essential nodes of the version tree.

Proof: Immediate from Theorems 2 and 3. □

When there are persistence conflicts, the number of materializations needed to compute a version tree is at most the number of essential nodes of the version tree, plus the number of extra persistent def–values in persistent nodes, plus the number of persistence conflicts. This follows, since each persistence conflict can be eliminated by introducing an extra materialization for the variable involved. However, when a persistence conflict involves a variable for which the initial def–value and persistent def–value both occur in the same version tree, we can determine efficiently if this extra materialization is indeed necessary.
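Restated as a formula, with notation that is ours rather than the paper's (M for the number of materializations needed for a version tree T, E for its number of essential nodes, P for the number of extra persistent def–values in persistent nodes, and K for the number of persistence conflicts):

\[ M(T) \;\le\; E(T) + P(T) + K(T) \]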

3 Conclusions

Minimizing materializations is a deep problem that can be approached from an intensional perspective, utilizing an analysis of the role of entire arrays. In a single expression, minimizing materializations is mainly an issue of proper array accessing. However, in a basic block, the decisions as to which materializations to eliminate interact, so minimizing materializations is a combinatorial optimization problem. The concepts of clone sets, version forests, and essential nodes seem to model fundamental aspects of the problem. Under the assumptions listed in Section 2.2, each essential node of a version forest requires a distinct materialization, thereby establishing a lower bound on the number of materializations required. Theorem 3 provides an algorithm when persistent variables are not an issue. The algorithm produces code with one materialization per essential node of the version tree, and no additional materializations. The algorithm is optimal, in the sense of producing the minimum number of materializations.

References

[1] P. S. Abrams. An APL Machine. PhD thesis, Stanford University, 1970.
[2] T. A. Budd. An APL compiler for a vector processor. ACM Transactions on Programming Languages and Systems, 6(3):297–313, July 1984.
[3] L. J. Guibas and D. K. Wyatt. Compilation and delayed evaluation in APL. In Conference Record of the Fifth Annual ACM Symposium on Principles of Programming Languages, pages 1–8, Jan. 1978.
[4] A. Hassitt and L. E. Lyon. Efficient evaluation of array subscripts of arrays. IBM Journal of Research and Development, 16(1):45–57, Jan. 1972.
[5] W. Humphrey, S. Karmesin, F. Bassetti, and J. Reynders. Optimization of data–parallel field expressions in the POOMA framework. In Y. Ishikawa, R. R. Oldehoeft, J. Reynders, and M. Tholburn, editors, Proc. First International Conference on Scientific Computing in Object–Oriented Parallel Environments (ISCOPE ’97), volume 1343 of Lecture Notes in Computer Science, pages 185–194, Marina del Rey, CA, Dec. 1997. Springer–Verlag.
[6] K. Kennedy, J. Mellor-Crummey, and G. Roth. Optimizing Fortran 90 shift operations on distributed–memory multicomputers. In Proceedings Eighth International Workshop on Languages and Compilers for Parallel Computing, volume 1033 of Lecture Notes in Computer Science, Columbus, OH, Aug. 1995. Springer–Verlag.
[7] C. Lin and L. Snyder. ZPL: An array sublanguage. In U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, volume 768 of Lecture Notes in Computer Science, pages 96–114, Portland, OR, Aug. 1993. Springer-Verlag.
[8] A. Lumsdaine. The matrix template library: A generic programming approach to high performance numerical linear algebra. In Proceedings of International Symposium on Computing in Object-Oriented Parallel Environments, 1998.
[9] L. Mullin. The Psi compiler project. In Workshop on Compilers for Parallel Computers, TU Delft, Holland, 1993.
[10] L. M. R. Mullin. A Mathematics of Arrays. PhD thesis, Syracuse University, Dec. 1988.
[11] G. Roth. Optimizing Fortran90D/HPF for Distributed–Memory Computers. PhD thesis, Dept. of Computer Science, Rice University, Apr. 1997.
[12] G. Roth, J. Mellor-Crummey, K. Kennedy, and R. G. Brickner. Compiling stencils in High Performance Fortran. In Proceedings of SC’97: High Performance Networking and Computing, San Jose, CA, Nov. 1997.
[13] J. T. Schwartz. Optimization of very high level languages – I. Value transmission and its corollaries. Computer Languages, 1(2):161–194, June 1975.
[14] T. Veldhuizen. Using C++ template metaprograms. C++ Report, 7(4):36–43, May 1995. Reprinted in C++ Gems, ed. Stanley Lippman.
[15] T. L. Veldhuizen. Expression templates. C++ Report, 7(5):26–31, June 1995. Reprinted in C++ Gems, ed. Stanley Lippman.
[16] T. L. Veldhuizen and D. Gannon. Active libraries: Rethinking the roles of compilers and libraries. In Proceedings of the SIAM Workshop on Object Oriented Methods for Inter-operable Scientific and Engineering Computing (OO’98), Yorktown Heights, NY, 1998. SIAM.