Compiler and runtime analysis for efficient communication in data intensive applications

Renato Ferreira*    Gagan Agrawal†    Joel Saltz*

*Department of Computer Science, University of Maryland, College Park MD 20742
{renato,saltz}@cs.umd.edu

†Department of Computer and Information Sciences, University of Delaware, Newark DE 19716
[email protected]

Abstract

Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. We are developing a compiler that processes data intensive applications written in a dialect of Java and compiles them for efficient execution on clusters of workstations or distributed memory parallel machines.

In this paper, we focus on the problem of generating correct and efficient communication for data intensive applications. We present static analysis techniques for 1) extracting a global reduction function from a data parallel loop, and 2) determining if a subscript function is monotonic. We also present a runtime technique for reducing the volume of communication during the global reduction phase.

We have experimented with two data intensive applications to evaluate the efficacy of our techniques. Our results show that 1) our techniques for extracting global reduction functions and establishing monotonicity of subscript functions can successfully handle these applications, 2) significant reductions in communication volume and execution time are achieved through our runtime analysis technique, and 3) runtime communication analysis is critical for achieving speedups on parallel configurations.

1 Introduction

Analysis and processing of very large multidimensional scientific datasets (i.e., datasets in which data items are associated with points in a multidimensional attribute space) is an important component of science and engineering. Examples of these datasets include raw and processed sensor data from satellites, output from hydrodynamics and chemical transport simulations, and archives of medical images.
These datasets are also very large; for example, in medical imaging, the size of a single digitized composite slide image at high power from a light microscope is over 7 GB (uncompressed), and a single large hospital can process thousands of slides per day.

(This work was supported by NSF grant ACR-9982087, NSF CAREER award ACI-9733520, and NSF grant CCR-9808522.)

The processing typically carried out over multidimensional datasets in these and related scientific domains shares important common characteristics. Access to data items is described by a range query, namely a multidimensional bounding box in the underlying multidimensional space of the dataset. The basic computation consists of (1) mapping the coordinates of the retrieved input items to the corresponding output items, and (2) aggregating, in some way, all the retrieved input items mapped to the same output data items. The computation of a particular output element is a reduction operation.

We have been developing compiler support for allowing high-level, yet efficient, programming of data intensive computations on multidimensional datasets [2, 14, 13]. We use a dialect of Java for expressing this class of computations, which includes data parallel extensions for specifying collections of objects, a parallel for loop, and a reduction interface. Our compiler extensively uses the existing run-time system Active Data Repository (ADR) [6, 7] for optimizing resource usage during the execution of data intensive applications. ADR integrates storage, retrieval, and processing of multidimensional datasets on a distributed memory parallel machine. The runtime system, our language design, and our compilation techniques particularly exploit the commonalities between the data processing applications that we stated earlier. We target a distributed memory parallel configuration, such as a cluster of workstations, for the execution of data intensive computations.

In compiling any class of applications for a distributed memory parallel configuration, communication generation and optimization is an important challenge. In this paper, we focus on the compiler and runtime analysis required for correct, as well as efficient, interprocessor communication for the class of data intensive applications we are targeting.
As compared to the general communication analysis problem handled by data parallel compilers [5, 10, 17, 16, 21, 28, 30, 41, 46], the communication problem we handle is less general in certain ways, and harder and more challenging in certain other ways. In the applications we are targeting, the communication that the compiler needs to generate and optimize is restricted to a global reduction between the processors at the end of each data intensive loop. However, the communication problem we handle is different from previous distributed memory compilation work in two important ways. First, because each processor processes large disk-resident datasets, the volume of communication during the global reduction phase can be very large. Second, the use of an object oriented data parallel language makes communication analysis harder.

We present three compiler and runtime analysis techniques in this paper. They are:

- A static analysis technique for extracting the global reduction function from the original data parallel loop.
- A static analysis technique for determining if the subscript functions used for accessing input and output collections are monotonic.
- A runtime analysis technique for reducing the volume of communication during the global reduction phase in sparse data intensive applications.

To evaluate our techniques, we have developed a prototype compiler based upon the Titanium infrastructure from Berkeley [44]. In this paper, we present experimental results from two applications. The first application is a multi-grid virtual microscope [12] and the second is a satellite image processing code [8]. We have conducted our experiments on a cluster of 400 MHz Pentium II based nodes connected by a gigabit Ethernet.

Our experiences in compiling these applications and our experimental results show the following:

- Though the techniques we have developed for extracting global reduction functions and establishing monotonicity of subscript functions have a number of limitations, they could successfully handle these applications.
- Significant reductions in communication volume, as well as in execution times, are achieved through our runtime communication analysis technique. The reduction in communication volume is up to 50% on the 8 node version of the multi-grid virtual microscope, and up to 15% on the 8 node version of the satellite data processing application. The reduction in execution time is up to 33% on the 8 node version of the multi-grid virtual microscope, and up to 21% on the 16 node version of satellite.

The rest of the paper is organized as follows.
In Section 2, we give an overview of the applications we target, present the language features and high-level abstractions our compiler supports using satellite data processing as an example, and then give an overview of the compilation techniques we use. Static analysis techniques for extracting global reduction functions and determining monotonicity of subscript functions are presented in Section 3. The runtime analysis technique for reducing the volume of communication is presented in Section 4. Experimental results are presented in Section 5. We compare our work with related research efforts in Section 6 and conclude in Section 7.

2 Overview

In this section, we give an overview of the class of data intensive applications we are targeting. We describe the data parallel language dialect we use for this class of applications. Finally, we give an overview of the execution strategy used to handle disk-resident datasets.

2.1 Data Intensive Applications

We now describe some of the scientific domains which involve applications that process large datasets. Then, we describe some of the common characteristics of these applications.

Satellite data processing: Earth scientists study the earth by processing remotely-sensed data continuously acquired from satellite-based sensors, since a significant amount of earth science research is devoted to developing correlations between sensor radiometry and various properties of the surface of the earth [7]. A typical analysis processes satellite data for ten days to a year and generates one or more composite images of the area under study. Generating a composite image requires projection of the globe onto a two dimensional grid; each pixel in the composite image is computed by selecting the "best" sensor value that maps to the associated grid point.

Water contamination studies: Environmental scientists study the water quality of bays and estuaries using long running hydrodynamics and chemical transport simulations [29]. The chemical transport simulation models reactions and transport of contaminants, using the fluid velocity data generated by the hydrodynamics simulation.

Analysis of microscopy data: The Virtual Microscope [12] is an application to support the need to interactively view and process digitized data arising from tissue specimens. The raw data for such a system is captured by digitally scanning collections of full microscope slides under high power. The virtual microscope application emulates the usual behavior of a physical microscope, including continuously moving the stage and changing magnification and focus.

Data intensive applications in these and related scientific areas share many common characteristics. The basic computation consists of (1) mapping the coordinates of the retrieved input items to the corresponding output items, and (2) aggregating, in some way, all the retrieved input items mapped to the same output data items. The computation of a particular output element is a reduction operation, i.e., the correctness of the output usually does not depend on the order in which the input data items are aggregated.

2.2 Language Features and Example Application

In this subsection, we present our high-level abstractions for facilitating rapid development of applications that process disk resident datasets. We use the satellite data processing application as an example [8]. We first describe the nature of the datasets captured by the satellites orbiting the earth, then describe the typical processing on these datasets, and finally explain the high-level abstractions and data parallel language constructs we support for performing such processing.

A satellite orbiting the earth collects data as a sequence of blocks. The satellites contain sensors for five different bands. The measurements produced by the satellite are short values (16 bits) for each band. As the satellite orbits the earth, the sensors sweep the surface, building scan lines of 408 measurements each. Each block consists of 204 half scan lines, i.e., it is a 204 × 204 array with 5 short integers per element. Latitude, longitude, and time are also stored within the block for each measurement.
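This common map-and-aggregate pattern can be made concrete with a small sketch. It is written here in Python for brevity (the paper's own examples use a dialect of Java), and all names below are illustrative rather than taken from the compiler:

```python
# Illustrative sketch of the map-and-aggregate pattern: each retrieved
# input item is mapped to an output cell, and all items that map to the
# same cell are combined with an associative, commutative operator
# (here max, as in "select the best sensor value").

def aggregate(items, map_to_output, combine):
    """items: iterable of (coordinate, value) pairs from a range query."""
    output = {}
    for coord, value in items:
        cell = map_to_output(coord)
        if cell in output:
            output[cell] = combine(output[cell], value)
        else:
            output[cell] = value
    return output

# Example: three input items; the first two map to the same output cell.
items = [((0, 0), 3), ((0, 1), 7), ((1, 0), 5)]
result = aggregate(items, lambda c: c[0], max)
# result == {0: 7, 1: 5}
```

Because combine is associative and commutative, the iteration order over the input items does not affect the result, which is exactly the property the reduction operation relies on.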

    interface Reducinterface {
        /* Any object of a class implementing */
        /* this interface is a reduction variable */
    }

    public class pixel {
        short bands[5];
        short geo[2];
    }

    class block {
        short time;
        pixel bands[204*204];
        pixel getData(Point[2] p) {
            /* Search for the (lat, long) on geo data */
            /* Return the pixel if it exists, */
            /* else return null */
        }
    }

    /* Low-level Data Layout */
    class SatOrigData {
        block[1d] data;
        void SatOrigData(RectDomain[1] InputDomain) {
            data = new block[InputDomain];
        }
        pixel getData(Point[3] q) {
            Point[1] time = (q.get(0));
            Point[2] p = (q.get(1), q.get(2));
            return data[time].getData(p);
        }
    }

    public class OutputData implements Reducinterface {
        int value;
        void Accumulate(pixel input) {
            /* Aggregate value of input */
            /* pixel into output */
        }
    }

    /* High-level Data Layout */
    public class SatData {
        SatOrigData data;
        void SatData(RectDomain[1] InputDomain) {
            data = new SatOrigData(InputDomain);
        }
        pixel getData(Point[3] q) {
            return data.getData(q);
        }
    }

    public class SatelliteApp {
        Point[1] minpt = ...
        Point[1] maxpt = ...
        RectDomain[1] InputDomain = [minpt : maxpt];
        SatData InputData = new SatData(InputDomain);

        public static void main(int[] args) {
            Point[2] lowend = (args[2], args[4]);
            Point[2] highend = (args[3], args[5]);
            RectDomain[2] OutputDomain = [lowend : highend];
            Point[3] low = (args[0], args[2], args[4]);
            Point[3] high = (args[1], args[3], args[5]);
            RectDomain[3] AbsDomain = [low : high];
            Image[2d] OutImage = new OutputData[OutputDomain];
            foreach (Point[3] q in AbsDomain) {
                Point[2] p = (q.get(1), q.get(2));
                if (pixel val = InputData.getData(q))
                    OutImage[p].Accumulate(val);
            }
        }
    }

Figure 1: A Satellite Data-Processing Code

The typical computation on this satellite data is as follows. A portion of the earth is specified through the latitudes and longitudes of its end points. A time range (typically 10 days to one year) is also specified. For any point on the earth within the specified area, all available pixels within that time period are scanned and the best value is determined. An example criterion for finding the best value is cloudiness on that day, with the least cloudy image being the best. The best pixel over each point within the area is used to produce a composite image. This composite image is used by researchers to study a number of properties, like deforestation over time, pollution over different areas, etc. [34].

There are two sources of sparsity and irregularity in the dataset and computation. First, the pixels captured by the satellite can be viewed as comprising a sparse three dimensional array, where time, latitude, and longitude are the three dimensions. Pixels for several, but not all, time values are available for any given latitude and longitude.
The second source of irregularity in the dataset arises because the earth is spherical, whereas the satellite sees the area of earth it is above as a rectangular grid. Thus, the translation from the rectangular area that the satellite has captured in a given band to latitudes and longitudes is not straightforward.

In Figure 1, we show the essential structure associated with the satellite data processing application [8, 9].

The class block represents the data captured in each time-unit by the satellite. This class has one function (getData) that takes a (latitude, longitude) pair and sees if there is any pixel in the given block for that location. If so, it returns that pixel. The class SatOrigData stores the data as a one dimensional array of blocks.

Classes block and SatOrigData are not visible to the programmer writing the processing code. The goal is to provide a simplified view of the dataset to the application programmers, thereby easing the development of a correct, but not necessarily efficient, data processing application. The compiler translating the code obviously has access to the source code of these classes, which enables it to generate efficient low-level code.

The class SatData is the interface to the input dataset visible to the programmer writing the main execution code. Through its access function getData, this class gives the view that a 3-dimensional grid of pixels is available.

The main processing function takes 6 command line arguments as input. The first two specify a time range over which the processing is performed. The next four are the latitudes and longitudes of the two end-points of the rectangular output desired. We need to iterate over all the blocks within the time range, examine all pixels which fall

into the output region, and then perform the reduction operation (i.e., choosing the best pixel).

We specify this computation in a data parallel language as follows. We consider an abstract 3-dimensional rectangular grid, with time, latitude, and longitude as the three axes. This grid is abstract because pixels actually exist for only a small fraction of all the points in this grid. However, the high-level code just iterates over this grid in the foreach loop. For each point q in the grid, which is a (time, lat, long) tuple, we examine if the block SatData[time] has any pixel. If such a pixel exists, it is used for performing a reduction operation on the object Output[(lat,long)].

We have used a number of data parallel constructs in our language. These have also been used in object-oriented parallel systems like Titanium [44], HPC++ [4], and Concurrent Aggregates [11, 35], and are not unique to our approach. These constructs are:

- A rectdomain is a collection of objects of the same type such that each object in the collection has a coordinate associated with it, and this coordinate belongs to a pre-specified rectilinear section.
- The foreach loop, which iterates over objects in a rectdomain, and has the property that the order of iterations does not influence the result of the associated computations.
- A Java interface called Reducinterface. Any object of any class implementing this interface acts as a reduction variable [20]. A reduction variable has the property that it can only be updated inside a foreach loop by a series of operations that are associative and commutative. Furthermore, the intermediate value of the reduction variable may not be used within the loop, except for self-updates.

2.3 Execution Strategy Overview

This subsection gives an overview of the compilation and execution strategy we use.

The main challenge in executing a data intensive loop comes from the fact that the amount of data accessed in the loop exceeds the main memory. While virtual memory support can be used for correct execution, it leads to very poor performance. Therefore, it is the compiler's responsibility to perform memory management, i.e., to determine which portions of the output and input collections are in main memory during a particular stage of the computation.

Based upon our experience with data intensive applications and with developing runtime support for them [8, 7], the basic code execution scheme we use is as follows. The output data structure is divided into tiles, such that each tile fits into main memory. The input dataset is read a disk block at a time, because the disks provide the highest bandwidth and incur the lowest overhead when all data is accessed from a single disk block. Once an input disk block is brought into main memory, all iterations of the loop which read from this disk block and update an element from the current tile are performed. A tile from the output data structure is never allocated more than once, but a particular disk block may be read to contribute to multiple output tiles.

2.3.1 Loop Preprocessing

To facilitate the execution of loops in this fashion, our compiler first performs an initial preprocessing of the loop.

In the process, it may replace the initial loop with a sequence of loops, each of which conforms to a canonical form.

    foreach (r ∈ R) {
        O1[SL(r)] = F1(O1[SL(r)], I1[SR1(r)], ..., In[SRn(r)])
        ...
        Om[SL(r)] = Fm(Om[SL(r)], I1[SR1(r)], ..., In[SRn(r)])
    }

Figure 2: Canonical Form of Loop

Consider any data intensive parallel loop in the dialect of Java described earlier in this paper. For the purpose of our discussion, collections of objects whose elements are modified in the loop are referred to as left hand side or lhs collections, and collections whose elements are only read in the loop are considered right hand side or rhs collections.

The canonical form we support is shown in Figure 2. The domain over which the iterator iterates is denoted by R. Let there be n rhs collections of objects read in this loop, denoted by I1, ..., In. Similarly, let the lhs collections written in the loop be denoted by O1, ..., Om. All lhs collections are accessed using a single subscript function, SL. In the iteration r of this loop, the value of the output element Oi[SL(r)] is updated using the function Fi. This is a commutative and associative function and uses one or more of the values Oi[SL(r)], I1[SR1(r)], ..., In[SRn(r)], and other scalar values in the program. More specifically, the function has the form:

    Oi[SL(r)] = Oi[SL(r)] op g1(I1[SR1(r)]) op g2(I2[SR2(r)]) op ... op gn(In[SRn(r)])

where op is an associative and commutative operator, and the function gi uses only Ii[SRi(r)] and scalar values in the program.

The two main restrictions we impose in this definition of a canonical loop are:

- All the lhs collections are accessed using the same subscript function. This restriction allows us to tile all lhs collections in an identical fashion.
- The updates to the lhs elements are performed using associative and commutative functions only.
With this restriction, the elements from different input collections that contribute to the value of an output element need not be brought into memory at the same time.

Starting from a general data parallel loop, loop fission can be performed to obtain a series of loops, each of which conforms to this canonical form [14]. Applying such loop fission may involve introducing temporary collections. However, for the applications we experimented with, loop fission was never required.

2.3.2 Loop Planning

In executing the data intensive loops we are targeting, a number of decisions need to be made before executing the iterations of the loop. These decisions are made during the loop planning phase and are described here. For many of these decisions, we have chosen a simple strategy; more sophisticated treatment is a topic for future research.

One of the issues in executing any loop in parallel is work or iteration partitioning, i.e., deciding which iterations

are performed on each processor. The work distribution strategy we use is that each iteration is performed on the owner of the element read in that iteration. As a result, no communication is required for the rhs elements.

    For each lhs strip Sl:
        Execute on each Processor Pj:
            Allocate and initialize strip Sl for O1, ..., Om
            For each rhs collection Ii
                For each disk block in Lijl
                    For each element e in the block
                        I = Iters(e)
                        For each i ∈ I
                            If (i ∈ R) ∧ (SL(i) ∈ Sl)
                                Update values of O1[SL(i)], ..., Om[SL(i)]
            Perform global reduction to finalize the values for Sl

Figure 3: Basic Loop Execution Strategy

The next important issue is the choice of tiling strategy for lhs collections. We have so far used a very simple strategy. We query the run-time system to determine the available memory that can be allocated on a given processor. Then, we divide the lhs space into blocks of that size. Formally, we divide the lhs range into a set of smaller ranges (called strips) {S1, S2, ..., Sr}. Since each of the lhs collections of objects in the loop is accessed through the same subscript function, the same strip mining is used for each of them.

After the tiling decision, the next important compilation problem is determining the set of disk blocks that need to be read for performing the updates on a given tile. A lhs tile is allocated only once. If elements of a particular disk block are required for updating multiple tiles, this disk block is read more than once. The compiler uses the static declarations in the program to extract an expression that is applied to the meta-data associated with each disk block.
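The meta-data test the compiler extracts can be pictured as a bounding-box intersection: a disk block needs to be read for a tile only if the output region its elements can update overlaps that tile. The sketch below, in Python, is only an illustration of this idea; the predicate and the meta-data layout are illustrative stand-ins, not the compiler's actual extracted expression:

```python
def overlaps(box_a, box_b):
    """Axis-aligned boxes given as ((lo0, hi0), (lo1, hi1), ...), inclusive."""
    return all(lo_a <= hi_b and lo_b <= hi_a
               for (lo_a, hi_a), (lo_b, hi_b) in zip(box_a, box_b))

def blocks_for_tile(block_metadata, tile_box):
    """Select disk blocks whose output bounding box overlaps the given tile.

    block_metadata: list of (block_id, output_bounding_box) pairs.
    """
    return [blk for blk, out_box in block_metadata
            if overlaps(out_box, tile_box)]

# Example: only blk0 can update elements of the tile [0,10] x [0,10].
meta = [("blk0", ((0, 5), (0, 5))), ("blk1", ((20, 30), (20, 30)))]
# blocks_for_tile(meta, ((0, 10), (0, 10))) -> ["blk0"]
```

Applying such a cheap per-block test to the meta-data avoids reading blocks whose elements cannot touch the current tile.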
For the purpose of describing our execution strategy, we assume that for the rhs collection Ii, on the given processor j, and for the lhs strip l, the set of disk blocks that need to be read is denoted by Lijl.

2.3.3 Loop Execution Strategy

Our basic loop execution strategy is shown in Figure 3. In a separate paper [15], we also describe variations of this strategy to match the characteristics of certain applications.

The lhs tiles are allocated one at a time. For each lhs tile Sl, and for each rhs collection Ii, the rhs disk blocks from the set Lijl are read successively. For each disk block read into memory, we need to determine the set of iterations in the loop that use each element in this block. For an element e, this set of iterations is denoted by Iters(e). For each iteration i belonging to the set Iters(e), we further check if it belongs to the loop range R, and if the element of the lhs collection updated in that loop iteration, SL(i), belongs to the tile currently being processed (Sl). After checking these conditions, we update the elements O1[SL(i)], ..., Om[SL(i)] using the element e.

All processors participate in a global reduction and communication phase after processing each tile. Details of the compiler and runtime analysis required for correct and efficient communication during this phase are presented in the next two sections.
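The strategy of Figure 3 can be sketched as sequential Python, for a single lhs collection. The tile partition, the iteration mapping Iters(e), and the reduction operator below are illustrative stand-ins, and the global reduction phase is shown as a simple local merge rather than actual interprocessor communication:

```python
def execute_loop(tiles, disk_blocks_for, iters, in_range, subscript,
                 combine, init):
    """Tiled loop execution in the spirit of Figure 3 (one lhs collection).

    tiles:           list of sets of output indices (the strips S_l)
    disk_blocks_for: tile -> list of disk blocks (each a list of elements)
    iters:           element -> iterations that read it (Iters(e))
    in_range:        iteration -> bool (membership in the loop range R)
    subscript:       iteration -> output index (the function S_L)
    combine:         associative, commutative update operator
    init:            identity value for combine
    """
    output = {}
    for tile in tiles:
        local = {idx: init for idx in tile}          # allocate the strip once
        for block in disk_blocks_for(tile):          # blocks in L_ijl
            for e in block:                          # elements double as values
                for i in iters(e):
                    idx = subscript(i)
                    if in_range(i) and idx in tile:  # iteration hits this tile
                        local[idx] = combine(local[idx], e)
        # A global reduction phase would merge per-processor copies of `local`;
        # here we simply fold the finished strip into the overall output.
        for idx, v in local.items():
            output[idx] = combine(output.get(idx, init), v)
    return output
```

For example, with elements 1..4, Iters(e) = {e}, S_L(i) = i mod 2, and max as the operator, the two strips {0} and {1} end up holding 4 and 3 respectively.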

3 Compiler Analysis Techniques

In this section, we present two static analysis techniques that enable correct and efficient interprocessor communication for the class of data intensive applications we are targeting. These two techniques are for:

- Extracting a global reduction function from the original data parallel loop. Such a global reduction function is required for correct processing of the loop on a distributed memory machine.
- Determining if the subscript functions used for accessing input and output collections are monotonic. This information is exploited by a runtime technique described in Section 4 to reduce the volume of communication.

3.1 Extracting Global Reduction Function

We now describe a static analysis technique we have developed for extracting the global reduction function from the local reduction function or the data intensive loop itself.

Problem Statement: As we showed in Figure 2, a given output collection Oi is updated in the loop as follows:

    Oi[SL(r)] = Fi(Oi[SL(r)], I1[SR1(r)], ..., In[SRn(r)])

We want to synthesize a function fi, which can be used in the following form

    Oi[SL(r)] = fi(Oi[SL(r)], O'i[SL(r)])

to perform global reduction, using the collections Oi and O'i computed on individual processors after the local reduction phase.
The function fi is also such that the original computation can be stated as

    Oi[SL(r)] = fi(Oi[SL(r)], F'i(I1[SR1(r)], ..., In[SRn(r)]))

This problem is significantly different from previous work on analysis of reduction statements [3, 18, 19, 31, 33, 45], because we are considering an object-oriented language where several members of an output object may be updated within the same function.

Solution Approach: Our approach is based upon classifying the data dependencies and control dependencies of updates to the data members of the lhs objects.

Consider any statement in the local reduction function that updates a data member of the lhs object Oi[SL(r)]. If this statement includes any temporary variables that are defined in the local reduction function itself, we perform forward substitution and replace the temporary variables. After such forward substitution(s), any update to a data member can be classified as being one of the following types:

1. Assignment to a loop constant expression, i.e., an expression whose value is constant within each invocation of the data intensive loop from which the local reduction function is extracted.

2. Assignment to the value of another data member of the lhs object, or to an expression involving one or more other data members and loop constants.

3. Update using a commutative and associative function op, such that the data member Oi[SL(r)].x is updated as

       Oi[SL(r)].x = Oi[SL(r)].x op g(...)

   where the function g does not involve any members of the lhs object Oi[SL(r)].

4. Update which cannot be classified in any of the previous three groups.

Our compiler can only compile data intensive loops in which every update to a data member of the lhs object in the local reduction function can be classified in the first, second, or third group above. This restriction did not create any problems for the applications we have looked at so far.

The set of statements in the local reduction function that update the data members of the lhs object is denoted by S. In general, the statements in the set S can be control dependent upon predicates in the function. We can currently only handle local reduction functions in which the statements in the set S are control dependent upon loop constant expressions only. Again, this restriction did not create any problems for the set of applications we examined.

Code Generation: In synthesizing the function fi, we start with the statements in the set S. The statements that fall in groups 1 or 2 above are left unchanged. The statements that fall in group 3 are replaced by statements of the form

    Oi[SL(r)].x = Oi[SL(r)].x op O'i[SL(r)].x

Our code generation is based upon the notion of program slicing [43]. The basic definition of a program slice is as follows.
Given a slicing criterion (s, x), where s is a program point in the program and x is a variable, the program slice is a subset of the statements in the program or function such that these statements, when executed on any input, will produce the same value of the variable x at the program point s as the original program.

Within the original local reduction function, we use the statements in the set S as the slicing criteria, and use program slicing to construct an executable function that will produce the same results (except as modified for the statements in group 3) for these statements. The use of slicing for code generation naturally handles the possibility that a statement in the set S may be control dependent upon a loop constant expression.

A simple example of the application of our technique is shown in Figure 4. This example is based upon the satellite application we described in Section 2.2. The last statement in the local reduction function is the only statement that updates the value of a data member of the reduction object. This statement is of type 3 as per our analysis. After replacing the statement to use the same data member of the object old computed on another processor, we construct a program slice. This slice does not include any other statement in the function, so our resulting global reduction function has a single statement.

3.2 Monotonicity Analysis of Subscript Functions

In this subsection, we present a static analysis technique for determining if a function is monotonic. Specifically, we are interested in determining if the subscript functions used for accessing lhs and rhs collections are monotonic. Our

approach is based upon combining control flow analysis with integer programming techniques.

The monotonicity property of subscript functions is exploited by the runtime communication analysis we present in Section 4. This runtime analysis finds the set of disjoint rectangles in the output space that are updated by the input disk blocks owned by a particular processor. The monotonicity property of the subscript function SL and the inverse subscript function SR^(-1) can be used to map the bounding box corresponding to an input block to an output rectangle, by simply applying the subscript functions to the two end corners of the bounding box. The technique we use for inverting a subscript function is described in a separate paper [15]. If these functions are not monotonic, then they need to be applied to all elements in an input block to find the set of output elements updated by them. This becomes extremely expensive, and can easily defeat the purpose of the runtime analysis we perform.

We initially present our analysis under two assumptions. First, we assume that there are no loops in the subscript (or inverted subscript) function. Second, for simplicity of presentation of our basic ideas, we consider foreach loops and input and output collections that have a single dimension.

Let us denote the function under consideration by S. Because we have assumed that the foreach loop as well as the collections have a single dimension, this function takes one integer as input, and returns another integer as output.

Consider a control flow graph (CFG) representing the function S. If the function S contains calls to other functions, we inline such functions, so that the code in the function can be represented by a single CFG. We enumerate the acyclic paths in the CFG and denote them by p1, ..., pn. We focus on the code along each acyclic path.
By performing forward substitution for each temporary value, we can create an expression relating the output of the function with the input to the function and other values in the program. For an input i, the output from the function S when the path pj is taken is denoted by Sj(i).

For the function S to be monotonic along the path pj, one of the following must hold:

    ∀i ( Sj(i+1) ≥ Sj(i) )   or   ∀i ( Sj(i+1) ≤ Sj(i) )

Suppose the function S is invoked with a particular parameter. The particular acyclic path taken can depend upon the value of the parameter. Therefore, for establishing monotonicity of the function, we need one of the following to hold:

    ∀i ( ∀j ∀l  Sj(i+1) ≥ Sl(i) )   or   ∀i ( ∀j ∀l  Sj(i+1) ≤ Sl(i) )

Using the expressions for the functions Sj computed by forward substitution, we can check the above conditions using the integer set manipulation ability of the Omega calculator [27, 37]. While this calculator obviously cannot answer all queries of the above type, it turned out to be sufficient for our set of applications.

We can usually improve the accuracy of the technique in the presence of control flow by establishing that certain conditionals are independent of the input parameter of the function. Consider a conditional predicate c. If the predicate c can be shown independent of the input parameter, then invocation of the function with any parameter i will take the same successor of c in the CFG. This allows us to partition the n paths into k disjoint groups, P1, . . . , Pk. These groups of paths have the property that if the invocation of the function with a parameter results in execution along a path belonging to the group Pj, then invocation of the function with any parameter will result in execution along one of the paths belonging to the group Pj.

    Accumulate(pixel val) {
        int b0 = val.bands[0];
        int b1 = val.bands[1];
        int ndvi = ((b1 - b0) / (b1 + b0) + 1) * 512;
        value = max(value, ndvi);
    }

    Accumulate(OutputData old) {
        value = max(value, old.value);
    }

    Figure 4: Local Reduction Function (left) and Global Reduction Function (right) for the satellite Application

With this analysis of the possible paths of execution, the condition for monotonicity can be restated as:

    ∀k ∀i ( ∀j, pj ∈ Pk  ∀l, pl ∈ Pk   Sj(i+1) ≥ Sl(i) )
    or
    ∀k ∀i ( ∀j, pj ∈ Pk  ∀l, pl ∈ Pk   Sj(i+1) ≤ Sl(i) )

Dealing with Loops: We next discuss how we can perform monotonicity analysis on subscript functions that contain loops. In the set of applications we used, the subscript functions did not have any loops; however, our technique can deal with loops in a limited form. We can only process loops that have the following properties: 1) the loop must have a single entry-point and a single exit-point, 2) the loop does not contain any conditionals, 3) the loop is countable, i.e., the number of times the loop iterates must not depend upon any value computed in the loop, 4) any array accessed in the loop must be accessed using affine subscripts, and must be assigned affine values, and 5) any scalar updated in the loop is also updated using affine values.

In such cases, the updates performed in the loop to any array elements or scalars can be summarized using the loop count and other constant values.
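As a small illustration of such loop summarization, consider a countable loop whose only update is an affine increment to a scalar. Since the trip count is independent of any value computed inside the loop, the final value can be expressed directly in terms of the loop count. This is a hypothetical example constructed for this discussion, not code from the compiler itself:

```java
// Hypothetical illustration of summarizing a countable affine loop.
public class LoopSummary {
    // Original countable loop: single entry and exit, no conditionals,
    // trip count n independent of values computed inside the loop.
    static int loopVersion(int s, int c, int n) {
        for (int k = 0; k < n; k++) {
            s = s + c;            // affine update of a scalar
        }
        return s;
    }

    // Summarized form: the entire loop collapses into one assignment
    // expressed using only the loop count n and the constant c.
    static int summarized(int s, int c, int n) {
        return s + c * n;
    }
}
```

Because the two forms agree for every trip count, the analysis can safely reason about the summarized assignment in place of the loop.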
The loop can then be replaced by a single basic block. After such treatment of loops, the analysis presented previously can be applied.

Multidimensional Spaces: Our data parallel dialect of Java allows foreach loops and collections over multidimensional spaces. Therefore, a subscript function takes a multidimensional point as the input, and outputs a multidimensional point. In such cases, the runtime analysis we present in Section 4 requires the subscript functions to have the following properties:

• The value along each dimension of the output point is either dependent upon exactly one dimension of the input point, or is a loop constant.

• The value along each dimension of the input point influences the value along at most one dimension of the output point.

• For each dimension of the output point whose value is dependent upon a dimension of the input point, the function relating the input value to the output value is monotonic.

The first two properties can be ascertained using simple dependence analysis in the subscript function. For establishing the third property, the analysis presented for the single dimensional case is applied along each dimension.

4 Runtime Communication Analysis

In this section, we present the runtime techniques used to reduce the communication volume during the global reduction phase. As mentioned earlier in Section 2.3, after processing each tile locally, each processor participates in a global reduction phase to aggregate the values computed on each node.

Let the number of nodes in the system be N. A naive approach would be to divide an output tile into N sections. Each node is responsible for collecting and aggregating all the elements for each such section of the tile. Each node also needs to communicate, to each of the other N−1 nodes, one section of the tile. If the total output size is M, the total communication volume becomes (N−1) × M. The volume of communication increases linearly with the number of nodes, and can rapidly become a bottleneck.

We propose a different approach, which exploits the fact that each node in the system may not have data for the entire output tile being processed. Conceptually, each tile is still partitioned into N sections, and each node is responsible for collecting and aggregating all the elements for each such section of the tile. However, instead of communicating the entire sections of the tile to their owner nodes, a node only sends the set of elements in each portion that it has actually updated. In the best case, all processors may update disjoint sets of output elements.
In that case, the total communication volume will be (N−1) × M/N, a quantity that asymptotically approaches only M. How much better our optimized approach does, compared to the naive approach, depends upon the application and the dataset. We have developed the runtime analysis techniques supporting our optimized communication scheme in such a way that their overhead is very low. Thus, we ensure that performance is not sacrificed even when our optimized approach achieves relatively little reduction in communication volume.

There are two main challenges in supporting the optimized communication strategy:

• Determining, in an efficient fashion, the set of elements in the section of the tile that have been updated by each node, and

• Communicating to other nodes, in a compact fashion, what set of elements from the section of the tile are being sent in each message.

The first challenge is addressed using the meta-data associated with each disk block, and the monotonicity analysis described in Section 3.2. As we mentioned in Section 2.3.2, the system determines the list of rhs disk blocks that need to be brought into memory for processing each tile. As part of the meta-data, a bounding box of the coordinates associated with elements of the disk block is stored. If the monotonicity property of the subscript function S_L and the inverse subscript function S_R^{-1} is established, this input bounding box can be mapped into an output rectangle by simply applying the subscript functions to the two end corners.

By constructing such rectangles corresponding to each disk block read while processing the given tile, we obtain a description of the set of elements in the tile that have been updated by each node. However, the complication is that such rectangles can overlap with each other. In particular, when communicating these elements to other nodes, we need to describe the set of elements in terms of a set of rectangles that do not overlap. An algorithm for this purpose is presented in the next subsection.

4.1 Eliminating Intersections

We present an algorithm that receives as input a collection of intersecting rectangles. It produces as output another set of rectangles that comprise the same set of elements, but do not intersect each other.

The algorithm is based on the notion of a sweeping line. We have implemented, and present here, a two-dimensional version of the algorithm, as it turned out to be sufficient for our set of applications. The algorithm can be extended to three-dimensional space by sweeping a plane instead of a line.

The algorithm is presented in Figure 5. A vertical sweeping line (parallel to the y axis) is used. There are two main data structures in this algorithm. The first is the event list.
Initially, it stores the beginning and end x coordinates of each rectangle. When a rectangle is split, new values may be inserted in the event list. Each event is explicitly marked as being a beginning event, an end event, or a split event. Because we are using a vertical sweeping line, when we refer to the range of a rectangle, we mean the range of its x coordinates.

The second data structure is the work list. At any given point during the execution of the algorithm, it stores the rectangles that intersect with the current location of the sweeping line. Initially, the work list is empty.

The actions taken by the algorithm at each event are described in Figure 5. Depending on the event and the current state of the work list, the system takes an action that may include inserting another event in the event list, inserting or removing a rectangle in the work list, or outputting a rectangle.

In order to illustrate the execution of this algorithm, we present an example in Figure 6. There are 4 intersecting rectangles in the picture. At first, after initialization, the sweeping line is at the leftmost position, as shown in (i). The first event to be handled is the beginning of rectangle A. The work list is currently empty, so the processing just inserts the rectangle A in the work list. The next event on the list, shown in (ii), is the beginning of rectangle B. When comparing it against the other ranges, we notice that it intersects the range of rectangle A. The actions taken are:

    while ((E = Dequeue(Event_List)) != "No more events") {
        switch (E.type) {
            case "Beginning" or "Split":
                for each R in the work list {
                    switch (intersection of R.range and E.range) {
                        case "Entirely outside":
                            do nothing;
                        case "Entirely inside":
                            if E.box ends beyond R.box
                                insert a split event for E.box after R.box
                        default:
                            output a rectangle for R.range from R.started to E.coord
                            remove R from the work list
                            grow E.range to incorporate R.range
                            insert a split event for the longer rectangle
                                after the shorter one ends
                    }
                }
                insert E.range into the work list if it is not inside the
                    range of any rectangle in the work list
            case "End":
                if E.range is in the work list
                    output a rectangle for E.range from R.started to E.coord
        }
    }

    Figure 5: Sweeping Line Algorithm to Remove Intersections of Rectangular Regions

1) outputting a rectangle for the fraction of A that has already been traversed, 2) removing the rectangle A from the work list, 3) updating the range of B to include the range of A, 4) inserting a split event for A after the end of rectangle B, and 5) inserting the rectangle B into the work list.

Next, as shown in (iii), rectangle C is located by the sweeping line. It lies entirely within the active range and it ends before rectangle B. So, no action is taken.

The next event, shown in (iv), is the beginning of D. This is also within the active range, but it ends after B is over, so a split event is inserted for D, after the end coordinate of B. The next event is the end of C. Since the range of C is not in the work list, no action is taken. Then, the line reaches the end of rectangle B, as shown in (v). Since the range of this rectangle is in the work list, the algorithm outputs the rectangle. Recall that the range of B was updated earlier to include the range of A.

The next two events are the splits of A and D, which were inserted earlier in the execution. For both these events, the system just inserts the ranges in the work list, without any changes, because they do not intersect any of the active ranges.
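The guarantee this algorithm provides — covering the union of the input rectangles with non-overlapping output rectangles — can also be obtained by a much simpler grid decomposition, which splits the plane along every rectangle edge and keeps each covered cell. The sketch below is an illustrative alternative for comparison, not the sweeping-line implementation used in the system; it typically produces far more output rectangles:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative alternative to the sweeping-line algorithm: cut the plane
// along every rectangle edge and emit each covered grid cell exactly once.
// Rectangles are half-open boxes {x1, y1, x2, y2} with x1 < x2 and y1 < y2.
public class GridDecompose {
    public static List<int[]> disjoint(List<int[]> rects) {
        TreeSet<Integer> xsSet = new TreeSet<>(), ysSet = new TreeSet<>();
        for (int[] r : rects) {
            xsSet.add(r[0]); xsSet.add(r[2]);
            ysSet.add(r[1]); ysSet.add(r[3]);
        }
        Integer[] xs = xsSet.toArray(new Integer[0]);
        Integer[] ys = ysSet.toArray(new Integer[0]);
        List<int[]> out = new ArrayList<>();
        // Every cell lies entirely inside or outside each input rectangle,
        // and each covered cell is emitted once, so the output is disjoint.
        for (int i = 0; i + 1 < xs.length; i++)
            for (int j = 0; j + 1 < ys.length; j++)
                for (int[] r : rects)
                    if (r[0] <= xs[i] && xs[i + 1] <= r[2]
                            && r[1] <= ys[j] && ys[j + 1] <= r[3]) {
                        out.add(new int[]{xs[i], ys[j], xs[i + 1], ys[j + 1]});
                        break; // cell is covered; do not emit it twice
                    }
        return out;
    }
}
```

The output covers exactly the union of the inputs, but unlike the sweeping-line algorithm it never merges adjacent fragments, which matters here because each extra rectangle adds meta-data to the messages exchanged in the global reduction phase.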
The end of each of these rectangles is then reached, and the algorithm outputs the remaining portion of each as a separate rectangle. The end event of rectangle A is shown in (vi).

This algorithm does not guarantee optimality, in terms of returning a minimal number of rectangles. However, as we will show in Section 5, the runtime overhead of this algorithm is extremely low.

    Figure 6: An Example Execution of the Intersection Removal Algorithm

4.2 Loop Execution

As mentioned in Section 2.3.3, all communication takes place during the global reduction phase. Each node is responsible for sending the set of non-overlapping rectangles that intersect with each section of the tile to the owner of that section of the tile. Each node sends only one message to each other node. This message includes: 1) one integer containing the total number of rectangles in the message, 2) 4 integers per rectangle, describing the coordinates of the rectangle, and 3) all data elements in these rectangles. The first two components of the message are referred to as the meta-data associated with the message.

The total overhead for a message containing R D-dimensional rectangles is 4 × (2 × D × R + 1) bytes, assuming 4-byte integers. Further reducing the number of rectangles returned by the intersection elimination algorithm can help in reducing this overhead. However, as we discuss in Section 5, this overhead was extremely low for all our test cases.

After such messages are exchanged, each node processes each message it receives by traversing the rectangles contained in it, using the meta-data attached at the beginning of the message to update the associated elements. Finally, the nodes update the section of the tile they own using the global reduction function constructed as described in Section 3.1.

5 Experimental Results

We now present experimental results to demonstrate the effectiveness of our techniques in generating correct and efficient communication for our class of applications. Specifically, we present experimental data to evaluate the following:

• The time taken by the intersection elimination algorithm, and the number of rectangles it takes as input and the number of rectangles it outputs.

• The size of the meta-data that needs to be sent with the optimized messages that are comprised of disjoint rectangular sections.

• The reduction in communication volume achieved by our runtime communication optimization.

• The reduction in execution time achieved by our runtime communication optimization.

• The overall performance on a parallel configuration, and the speedups achieved after performing communication optimizations.

5.1 Experiment Design

We used our prototype compiler to generate code for two applications. We ran our test programs on a cluster of 400 MHz Pentium II based nodes connected by gigabit Ethernet. Each node has 256 MB of main memory and 18 GB of local disk. We ran our experiments on 1, 2, 4, and 8 nodes of the cluster. For some of our runs, we also used a 16 node configuration.

The first application we use is based on the Virtual Microscope [12]. We implemented a multi-grid version of this application, which processes images captured at different magnifications and generates one high resolution output image. We refer to this application as mg-vscope. The second application we experimented with is a satellite image processing application similar to the one described in Section 2.2. We refer to it as satellite in this section.

These two applications are substantially different, both in terms of the nature of the datasets they handle and the computations they perform. mg-vscope accesses a dense dataset to retrieve data corresponding to a portion of the slide in order to generate the output image. The regularity in the dataset can be exploited to partition the disk blocks between the nodes in such a fashion that each node has data for only a portion of the entire output region. So, this application can benefit more from the runtime communication optimization we have developed. The second application, satellite, accesses a sparse dataset to retrieve data corresponding to a region of the planet and aggregates it over a period of time.
If the aggregation is performed over a large time period, we expect that all nodes will have data points for all elements of the output. This makes the runtime communication optimization less profitable for this application.

We implemented three separate versions of each of these applications. The first one implements the naive approach described in Section 4, which has a high communication overhead. We refer to this version as full. The second version incorporates the runtime communication optimization we have presented and is referred to as opt. Finally, as a best-case baseline, we implemented a version that performs no communication. This version is referred to as none.

For both these applications, we experimented with two different query sizes, referred to as medium and large queries. By query size, we mean the size of the input dataset the application processes during execution. For the mg-vscope application, the dataset contains an image of 29,238 × 28,800 pixels collected at 5 different magnification levels, which corresponds to 3.3 GB of data. The medium query processes a region of 10,000 × 10,000 pixels, which corresponds to reading 627 MB and generating an output of 400 MB. The large query for this application generates an image of 20,000 × 20,000 pixels. This requires reading nearly 3 GB, and generating an output of nearly 1.6 GB. The entire dataset for the satellite application contains data for the entire earth at a resolution of 1/128th of a degree in latitude and longitude, over a period of time that covers nearly 15,000 time steps. The size of the dataset is 2.7 GB. The medium query traverses a region of 15,000 × 10,000 × 10,000, which involves reading 446 MB to generate an output of 50 MB. The large query corresponds to a region of 6,000 × 20,000 × 40,000 points. This requires reading nearly 1.7 GB and generating a 400 MB output.

                 mg-vscope                  satellite
             orig  1-node  8-node      orig  1-node  8-node
    medium    803      27     622      2380       9     905
    large    3894      66    2814      9433     550    6090

    Table 1: Number of Input Rectangles Before and After Eliminating Intersections

5.2 Results

We first consider our intersection elimination algorithm. The number of rectangles before and after applying the algorithm for each test case is shown in Table 1. The 3 numbers reported for each application and query pair are 1) the total number of rectangles before applying the algorithm (orig), 2) the number of rectangles after applying the algorithm for the 1 node case (1-node), and 3) the aggregate number of rectangles on 8 nodes after applying the algorithm (8-node). Two important observations can be made from this table. First, the total number of rectangles that need to be processed by the algorithm is quite large in all cases, so the efficiency of the algorithm is important in keeping the overhead of the analysis low. Second, a substantial reduction in the number of rectangles is achieved, which means that there is significant overlap between the rectangles corresponding to different disk blocks.

Next, we focus on the runtime cost of executing the intersection elimination algorithm. For all our test cases, and for execution on 1, 2, 4, or 8 nodes, the time taken by this algorithm is less than 0.9% of the total execution time. The highest relative cost for the optimization is for the satellite application with the medium query.
On one node, it takes 0.9% of the total execution time, going down to 0.35%, 0.14%, and 0.05% on 2, 4, and 8 nodes, respectively. For the mg-vscope application, the highest relative cost comes with the large query on 1 node, and is 0.029% of the total execution time. For the medium query of mg-vscope, the ratio on one node is 0.013%. For the large query of satellite, the ratio on one node is 0.62%. In all cases, the ratio decreases as the number of nodes increases.

We next focus on the overhead of the meta-data that needs to be sent with the optimized messages that are comprised of disjoint rectangular sections. Table 2 presents the total communication volume (in megabytes), i.e., the sum of the sizes of all messages sent by the application during execution, for the opt versions. Next to that, in parentheses, the table shows the sum of the sizes of the meta-data sent with all messages. The size of the meta-data is less than 0.03% of the total communication volume in all cases. The data presented in Table 2 clearly establishes that the overhead of sending meta-data with optimized messages is negligible.

Table 3 shows the total communication volume for the full versions. These numbers can be compared against the total communication volume shown for the opt versions in Table 2 to see the reduction in communication volume achieved by our runtime technique.

         mgv/med                     mgv/large
    2     356.70 MB (2.05 KB)        1451.69 MB (7.39 KB)
    4     853.21 MB (7.13 KB)        3422.83 MB (27.44 KB)
    8    1290.92 MB (16.13 KB)       5212.48 MB (56.85 KB)

         sat/med                     sat/large
    2      47.70 MB (0.41 KB)         374.24 MB (25.82 KB)
    4     142.99 MB (15.21 KB)       1074.47 MB (116.46 KB)
    8     331.31 MB (86.54 KB)       2281.47 MB (237.30 KB)

    Table 2: Total Communication Volume and Total Meta-data Size (shown in parentheses) for opt Versions

         mgv/med       mgv/large       sat/med      sat/large
    2     381.47 MB     1525.88 MB      47.70 MB     381.53 MB
    4    1144.41 MB     4577.64 MB     143.11 MB    1144.58 MB
    8    2670.29 MB    10681.20 MB     333.92 MB    2670.69 MB

    Table 3: Total Communication Volume for full Versions

For mg-vscope, with the medium query size, the reduction in communication volume is 9%, 25%, and 52% on 2, 4, and 8 processors, respectively. For mg-vscope, with the large query size, the reduction in communication volume is 5%, 25%, and 51% on 2, 4, and 8 processors, respectively. For satellite, with the medium query size, the reduction in communication volume is less than 1% in all configurations. With the large query size for the same application, the reduction is 2%, 6%, and 15% on 2, 4, and 8 nodes, respectively.

Figure 7 shows the execution times for the mg-vscope application running the medium query. The improvements produced by the optimization are 0.4%, 12%, and 33.75% for 2, 4, and 8 nodes, respectively. The speedups for the opt version on 2, 4, and 8 processors are 1.9, 2.47, and 3.53, respectively. Figure 8 presents the execution times for mg-vscope executing the large query. The opt version performs 1%, 13%, and 34% better than the full version on 2, 4, and 8 nodes, respectively. The speedups for this query are 1.96, 2.79, and 4.05 on 2, 4, and 8 nodes, respectively.
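The 8-node reduction figures quoted above follow directly from Tables 2 and 3 as (full − opt) / full. A small check of that arithmetic, with the table values hard-coded for illustration:

```java
// Recomputing 8-node communication-volume reductions from the totals
// in Table 2 (opt versions) and Table 3 (full versions); values in MB.
public class ReductionCheck {
    // Percentage reduction achieved by opt relative to full,
    // rounded to the nearest whole percent.
    static long reductionPercent(double full, double opt) {
        return Math.round(100.0 * (full - opt) / full);
    }
}
```

For example, the mg-vscope medium query on 8 nodes gives (2670.29 − 1290.92) / 2670.29 ≈ 52%, matching the reduction reported in the text.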

Figure 7: Comparing full, opt, and none Versions of the mg-vscope Application with the medium Query

Figure 8: Comparing full, opt, and none Versions of the mg-vscope Application with the large Query

Figure 9: Comparing full, opt, and none Versions of the satellite Application with the medium Query

The performance gains on 2, 4, and 8 nodes, on both the medium and large queries, are proportional to the reduction in communication volume that we reported earlier. As the number of nodes increases, the amount of overlap between the portions written by different nodes decreases. Thus, higher performance gains are obtained by communicating only the portions that have been written by each node.

We next present the execution times for the satellite application. Because we expected the runtime communication analysis to be less effective for this application, we also included a 16 node execution. The results for the medium query are shown in Figure 9. The improvements are -1.7%, 2.5%, -9.3%, and 10.1% on 2, 4, 8, and 16 nodes, respectively. The speedups for the opt versions are 1.52, 1.96, 2.08, and 2.45 on 2, 4, 8, and 16 nodes, respectively. The results for this application on the large query are shown in Figure 10. The performance improvements observed are -2%, 5.9%, 8%, and 21.5% for 2, 4, 8, and 16 nodes, respectively. The speedups this time are 1.36, 1.93, 2.26, and 2.69 on 2, 4, 8, and 16 nodes, respectively. Again, the performance gains across nodes and query sizes are proportional to the reduction in communication volume that we reported earlier.

6 Related Work

The communication analysis and optimization problem we have addressed is harder and more challenging in many ways,

Figure 10: Comparing full, opt, and none Versions of the satellite Application with the large Query

and less general in other ways, than the general communication analysis problem handled by distributed memory compilers [5, 10, 17, 16, 21, 28, 30, 41, 46]. In the applications we are targeting, the communication that the compiler needs to generate and optimize is restricted to a global reduction between the processors at the end of each data intensive loop. However, the communication problem we handle differs from the previous distributed memory compilation work in two important ways. First, because each processor processes large disk-resident datasets, the volume of the communication during the global reduction phase can be very large. Second, the use of an object-oriented data parallel language makes communication analysis harder.

Kandemir, Choudhary, and their co-authors have focused on analysis and optimizations for out-of-core programs, including inter-processor communication [42, 23, 24, 25, 26]. Our work is different because we are considering a different set of applications, and a different language and runtime support infrastructure. The communication analysis in their project is focused on near-neighbor communication, whereas the class of applications we target requires global reduction.

Runtime analysis for communication optimization has been used extensively as part of the inspector/executor framework for parallelizing sparse computations on distributed memory machines [39, 36]. The details of the runtime analysis we have presented in Section 4 are very different because we are considering a different application class.

Analysis and optimization of reduction operations is a well studied topic in the compiler literature [3, 18, 19, 31, 33, 45].
Because of the use of an object-oriented language, where reduction operations are performed on complex objects, the compiler analysis we have used is significantly different from the previous work.

Our work on monotonicity analysis is significantly different from the existing work on similar problems [32, 40], because it handles more complex control flow. Integer set manipulation has also been previously used for many specific optimization and code generation problems in parallel compilation [1, 22, 38].

7 Conclusions

Analysis and processing of very large multidimensional scientific datasets is becoming an increasingly important component of science and engineering research. Compiler technology can help accelerate advances in these areas by enabling rapid development of applications that process large datasets. We have been developing a compiler for a data parallel dialect of Java targeting such applications.

In this paper, we have focused on compiler and runtime techniques for enabling correct, as well as efficient, communication for this class of applications. We have presented static analysis techniques for 1) extracting a global reduction function from a data parallel loop, and 2) determining if a subscript function is monotonic. We have also presented a runtime technique for reducing the volume of communication during the global reduction phase.

We have experimented with two data intensive applications to evaluate the efficacy of our techniques. Our results show that 1) our techniques for extracting global reduction functions and establishing monotonicity of subscript functions can successfully handle these applications, 2) significant reduction in communication volume and execution times is achieved through our runtime analysis technique, and 3) runtime communication analysis is critical for achieving speedups on parallel configurations.

References

[1] Vikram Adve and John Mellor-Crummey. Using integer sets for data-parallel program analysis and optimization. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 186-198. ACM Press, June 1998.

[2] Gagan Agrawal, Renato Ferreira, Joel Saltz, and Ruoming Jin. High-level programming methodologies for data intensive computing. In Proceedings of the Fifth Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, May 2000.

[3] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu. Parallel programming with Polaris. IEEE Computer, (12):78-82, December 1996.

[4] Francois Bodin, Peter Beckman, Dennis Gannon, Srinivas Narayana, and Shelby X. Yang. Distributed pC++: Basic ideas for an object parallel language. Scientific Programming, 2(3), Fall 1993.

[5] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, S. Ranka, and M.-Y. Wu. Compiling Fortran 90D/HPF for distributed memory MIMD computers. Journal of Parallel and Distributed Computing, 21(1):15-26, April 1994.

[6] C. Chang, A. Acharya, A. Sussman, and J. Saltz. T2: A customizable parallel database for multi-dimensional data. ACM SIGMOD Record, 27(1):58-66, March 1998.

[7] Chialin Chang, Renato Ferreira, Alan Sussman, and Joel Saltz. Infrastructure for building parallel database systems for multi-dimensional data. In Proceedings of the Second Merged IPPS/SPDP (13th International Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing). IEEE Computer Society Press, April 1999.

[8] Chialin Chang, Bongki Moon, Anurag Acharya, Carter Shock, Alan Sussman, and Joel Saltz. Titan: A high performance remote-sensing database. In Proceedings of the 1997 International Conference on Data Engineering, pages 375-384. IEEE Computer Society Press, April 1997.

[9] Chialin Chang, Alan Sussman, and Joel Saltz. Scheduling in a high performance remote-sensing data server. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing. SIAM, March 1997.

[10] Siddhartha Chatterjee, John R. Gilbert, Fred J. E. Long, Robert Schreiber, and Shang-Hua Teng. Generating local addresses and communication sets for data-parallel programs. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 149-158, May 1993.

[11] A. A. Chien and W. J. Dally. Concurrent aggregates (CA). In Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 187-196. ACM Press, March 1990.

[12] R. Ferreira, B. Moon, J. Humphries, A. Sussman, J. Saltz, R. Miller, and A. Demarzo. The Virtual Microscope. In Proceedings of the 1997 AMIA Annual Fall Symposium, pages 449-453. American Medical Informatics Association, Hanley and Belfus, Inc., October 1997. Also available as University of Maryland Technical Report CS-TR-3777 and UMIACS-TR-97-35.

[13] Renato Ferreira, Gagan Agrawal, and Joel Saltz. Advanced compiler and runtime support for data intensive applications. In Proceedings of SBAC-PAD 2000, October 2000.

[14] Renato Ferreira, Gagan Agrawal, and Joel Saltz. Compiling object-oriented data intensive computations. In Proceedings of the 2000 International Conference on Supercomputing, May 2000.

[15] Renato Ferreira, Gagan Agrawal, and Joel Saltz. Compiler support for high-level abstractions for sparse disk-resident datasets. Submitted for publication, available at http://www/eecis.udel.edu/~agrawal/p/ics01 renato.ps, 2001.

[16] Manish Gupta and Edith Schonberg. Static analysis to reduce synchronization costs in data-parallel programs. In Conference Record of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 322–332. ACM Press, January 1996.
[17] S. K. S. Gupta, S. D. Kaushik, S. Mufti, S. Sharma, C.-H. Huang, and P. Sadayappan. On compiling array expressions for efficient execution on distributed memory machines. In Proceedings of the 1993 International Conference on Parallel Processing, August 1993.
[18] M. Hall, S. Amarasinghe, B. Murphy, S. Liao, and M. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, (12), December 1996.
[19] H. Han and Chau-Wen Tseng. Improving compiler and runtime support for irregular reductions. In Proceedings of the 11th Workshop on Languages and Compilers for Parallel Computing, August 1998.
[20] High Performance Fortran Forum. HPF language specification, version 2.0. Available from http://www.crpc.rice.edu/HPFF/versions/hpf2/files/hpf-v20.ps.gz, January 1997.
[21] Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66–80, August 1992.
[22] M. Kandemir, P. Banerjee, A. Choudhary, J. Ramanujam, and N. Shenoy. A global communication optimization technique based on data-flow analysis and linear algebra. ACM Transactions on Programming Languages and Systems (TOPLAS), 21(6):1251–1297, November 1999.
[23] M. Kandemir and A. Choudhary. Compiler optimizations for I/O intensive computations. In Proceedings of the International Conference on Parallel Processing, September 1999.
[24] M. Kandemir, A. Choudhary, J. Ramanujam, and R. Bordawekar. Compilation techniques for out-of-core parallel computations. Parallel Computing, (3-4):597–628, June 1998.
[25] M. Kandemir, A. Choudhary, J. Ramanujam, and M. A. Kandaswamy. A unified framework for optimizing locality, parallelism, and communication in out-of-core computations. IEEE Transactions on Parallel and Distributed Systems, 11(9):648–662, 2000.
[26] M. Kandemir, J. Ramanujam, and A. Choudhary. Improving the performance of out-of-core computations. In Proceedings of the International Conference on Parallel Processing, August 1997.
[27] Wayne Kelly, Vadim Maslov, William Pugh, Evan Rosser, Tatiana Shpeisman, and Dave Wonnacott. The Omega calculator and library, version 1.1.0. November 1996.
[28] C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440–451, October 1991.
[29] Tahsin M. Kurc, Alan Sussman, and Joel Saltz. Coupling multiple simulations via a high performance customizable database system. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing. SIAM, March 1999.
[30] Antonio Lain and Prithviraj Banerjee. Exploiting spatial regularity in irregular iterative applications. In Proceedings of the Ninth International Parallel Processing Symposium, pages 820–826. IEEE Computer Society Press, April 1995.
[31] Yuan Lin and David Padua. On the automatic parallelization of sparse and irregular Fortran programs. In Proceedings of the Workshop on Languages, Compilers, and Runtime Systems for Scalable Computers (LCR '98), May 1998.
[32] Yuan Lin and David Padua. Analysis of irregular single-indexed array accesses and its applications in compiler optimizations. In Proceedings of the Conference on Compiler Construction (CC), pages 202–218, March 2000.
[33] Bo Lu and John Mellor-Crummey. Compiler optimization of implicit reductions for distributed memory multiprocessors. In Proceedings of the 12th International Parallel Processing Symposium (IPPS), April 1998.
[34] NASA Goddard Distributed Active Archive Center (DAAC). Advanced Very High Resolution Radiometer Global Area Coverage (AVHRR GAC) data. http://daac.gsfc.nasa.gov/CAMPAIGN_DOCS/LAND_BIO/origins.html.
[35] John Plevyak and Andrew A. Chien. Precise concrete type inference for object-oriented languages. In Ninth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '94), pages 324–340, October 1994.
[36] R. Ponnusamy, J. Saltz, A. Choudhary, Y.-S. Hwang, and G. Fox. Runtime support and compilation methods for user-specified irregular data distributions. IEEE Transactions on Parallel and Distributed Systems, 6(8):815–831, August 1995.
[37] William Pugh. A practical algorithm for exact array dependence analysis. Communications of the ACM, 35(8):102–114, August 1992.
[38] William Pugh and David Wonnacott. Static analysis of upper and lower bounds on dependences and parallelism. ACM Transactions on Programming Languages and Systems, 16(4):1248–1278, July 1994.
[39] Joel Saltz, Kathleen Crowley, Ravi Mirchandaney, and Harry Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8(4):303–312, April 1990.

[40] Madelene Spezialetti and Rajiv Gupta. Loop monotonic statements. IEEE Transactions on Software Engineering, 21(6):497–505, June 1995.
[41] Jaspal Subhlok, James M. Stichnoth, David R. O'Hallaron, and Thomas Gross. Exploiting task and data parallelism on a multicomputer. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 13–22, May 1993. ACM SIGPLAN Notices, Vol. 28, No. 7.
[42] R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kutipudi. PASSION: Optimized I/O for parallel applications. IEEE Computer, 29(6):70–78, June 1996.
[43] F. Tip. A survey of program slicing techniques. Journal of Programming Languages, 3(3):121–189, September 1995.
[44] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A high-performance Java dialect. Concurrency: Practice and Experience, 9(11), November 1998.
[45] Hao Yu and Lawrence Rauchwerger. Adaptive reduction parallelization techniques. In Proceedings of the 2000 International Conference on Supercomputing, pages 66–75. ACM Press, May 2000.
[46] Hans P. Zima and Barbara Mary Chapman. Compiling for distributed-memory systems. Proceedings of the IEEE, 81(2):264–287, February 1993. In Special Section on Languages and Compilers for Parallel Machines.