
Metrics and Techniques for Automatic Partitioning and Assignment of Object-based Concurrent Programs

Lonnie R. Welch*†, Binoy Ravindran†, Jorge Henriques†, Dieter K. Hammer‡

March 23, 1995

Abstract

The software crisis is defined as the inability to meet the demands for new software systems, due to the slow rate at which systems can be developed. To address the crisis, object-based design techniques and domain models have been developed. Furthermore, languages such as Modula-2, Ada, Smalltalk and C++ have been developed to enable the software realization of object-based designs.

However, object-based design and implementation techniques do not address an additional problem that plagues systems engineers: the effective utilization of distributed and parallel hardware platforms. This problem is partly addressed by program partitioning languages that allow an engineer to specify how software components should be partitioned and assigned to the nodes of concurrent computers. However, very little has been done to automate the task of configuration, that is, the tasks of partitioning and assignment. Thus, this paper describes automated techniques for distributed/parallel configuration of object-based applications, and demonstrates the techniques on programs written in Ada. The granularity of partitioning is the program unit, including software components such as objects, classes, tasks, packages (including generics) and subprograms. The partitioning is performed by constructing a call-rendezvous graph (CRG) for the application program. The nodes of the graph represent the program units and the edges denote call and task interaction/rendezvous relationships. The CRG is augmented with edge weights depicting inter-program-unit communication relationships and concurrency relationships, resulting in a weighted CRG (WCRG). The partitioning algorithm repeatedly cuts edges of the WCRG with the goal of producing a set of partitions among which (1) there is a small amount of communication and (2) there is a large degree of potential for concurrent execution. Following the partitioning of the WCRG into tightly coupled clusters, a random neural network is employed to assign clusters to physical processors. Additionally, a graphical interface is provided to allow viewing and modification of the software-hardware configuration.

This two-pass approach to configuration is useful in large systems, where it is necessary to first reduce the complexity of the problem before applying an accurate assignment optimization technique (such as neural networks, genetic algorithms or simulated annealing). Although the tools described in this paper partition and assign Ada programs, they are easily extended to work with other languages by simply changing the front-end parser. Thus, a general solution to the configuration problem for object-based concurrent programs is provided.

*This work is supported in part by the U.S. NSWC (N60921-93-M-1912 and N60921-94-M-G096) and by the U.S. ONR (N00014-92-J-1367).

†Welch, Ravindran and Henriques are with the Department of Computer and Information Science, New Jersey Institute of Technology, Newark, NJ 07102; e-mail: [email protected]; phone: 201-596-5683.

‡Hammer is with the Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands; e-mail: [email protected]; phone: +31-40-472734.


1 Introduction

The software engineering techniques of reuse, encapsulation and information hiding are increasingly being employed in the construction of large, mission-critical software. Thus, systems composed of abstract data type (ADT) hierarchies and tasks are being produced. Simultaneously, the hardware platforms continue to increase in sophistication, particularly in the area of concurrency. Thus, we are confronted with the problem of bridging the gap between (1) large computer-based software systems composed of tasks and ADTs/objects and (2) parallel and distributed computer platforms.

As an example of the type of software system under consideration, consider the track processing system, common in air traffic control and defense systems, which can be implemented with ADTs as represented in Figure 1. The figure shows the ADT instances used in the implementation, as well as call relationships among the instances. The track file is implemented as a list of tracks. A track is an ADT implemented as a queue of the last n snapshots of the track's state.

The research described in this paper addresses the mapping of layered software systems onto parallel and distributed hardware platforms. We are currently using a 64-node Intel Paragon as a parallel computing testbed, and a collection of DEC workstations as a distributed platform. Given a software system, software components are partitioned/clustered according to some binding relationships (such as communication, concurrency or shared data access), and the clusters are assigned to processors in a way that causes efficient utilization of hardware resources and simultaneously obeys system constraints [20, 22, 19, 17]. For example, the ADTs implementing the software for the AEGIS cruiser with doctrine regions could be assigned to processors as shown in Figure 2. The magnitude of the software systems under consideration, as well as the complexity of the objectives and constraints, make partitioning and assignment too difficult for the average human. Furthermore, the software systems are too large for the effective use of a single-pass, accurate assignment algorithm. A partial solution to this problem is to define languages for describing how software components should be partitioned and assigned. Previous work [18, 7] shows the need for an algorithm to assign ADT modules to processors, but development of such an algorithm was not the focus of that work. In [18], the items identified as distributable (assignable) are Ada packages and tasks, which are a subset of the items considered here. APPL [7] is a language for specifying mappings of Ada programs to concurrent architectures. This permits programs to be implemented without taking hardware configurations into account, the same philosophy used here. A successor to APPL is the Distributed Application Development System (DADS) [10], which not only provides a distribution specification language, but also provides a linker, code generator and run-time system. These efforts addressed important problems in exploiting concurrent computer platforms, but none addressed actual partitioning and assignment.

Although solutions have been published for partitioning and assignment, the approach outlined here differs from previous approaches in the following ways. The majority of the previously reported assignment algorithms, of which [13, 8, 1, 26, 2] are representative, consider the unit of distribution to be a procedure, a task, or a process, in contrast to the assignment algorithm presented here, which also considers generic abstract data type (ADT) modules to be units of distribution. Furthermore, the metrics which drive our configuration are rigorously defined and address concurrency, two features not seen in conjunction in previous work.

The configuration technique described in this paper is depicted in Figure 3. The first step of our approach is to extract concurrency and communication metrics. Following this, the software components are partitioned, or assigned to logical processors. Partitioning is done using coarse, fast techniques for (1) computing metrics and (2) optimization. Partitioning is followed by the assignment of partitions to processors (i.e., logical processors are mapped to physical processors). Assignment uses more accurate and more costly metrics and optimization heuristics. Finally, the assignment and partitioning specifications are combined with the application and executed in a concurrent manner.

The rest of the paper describes our approach to configuration. First, our analysis process is explained, showing the abstract representation of systems that our parsing tools produce, and defining techniques to compute communication and concurrency metrics from the abstract representation. The description


[Figure 1 shows the ADT instances of the design (TRACK FILE, LIST, TRACK, QUEUE, RECORD, ARRAY, INTEGER) and the call relationships among them.]

Figure 1: A design of doctrine processing software (ADT paradigm).


[Figure 2 shows ADT modules (TRACK FILE, TRACK, LIST, QUEUE, C&D, WCS) and task objects (T1, T2) assigned across processors PE1 through PE6.]

Figure 2: A possible assignment of ADT modules to processors.


[Figure 3 shows the configuration pipeline: the program is analyzed to produce metrics and the IR; partitioning generates a partitions/DADS specification; a second metrics pass (Metrics II) feeds assignment, which generates an assignment/DADS specification; finally the system is linked, loaded and executed.]

Figure 3: The configuration approach.


of the analysis process is followed by a discussion of the partitioning tool: the algorithm employed, an example, and the graphical interface for manually tuning the partitions. A second analysis phase occurs after partitioning, to compute concurrency and communication metrics for partitions. The computation of these metrics is necessary before the assignment of partitions to processors, since the assignment algorithm uses the metrics to assess the suitability of the assignment with respect to the objectives. Finally, a random neural network solution to the partition assignment problem is presented.

2 Analysis: Extraction of IR and Computation of Metrics

In this section the analysis techniques and products that enable partitioning and assignment are described. The first part of the section discusses the intermediate representation of programs that is extracted by compiler tools. The remainder of the section describes how communication and concurrency metrics are produced from the intermediate representation.

2.1 The Intermediate Representation

This section defines a language-independent intermediate representation (IR) for capturing the software features of computer-based systems that are essential for the partitioning and assignment processes.

Systems such as AEGIS and the HiPer-D vision of the 21st Century shipboard computing system [12] contain software having the structure depicted in Figure 4. We term this structure the mission critical software architecture (MCSA) [23], in order to contrast it with the NSF's grand challenge software architecture. The MCSA depicts software systems that are composed of several layers (or tiers). Elements at tier i are implemented in terms of elements at tier i + 1.

Tier 1 consists of a set $P$ of $M$ executable programs or partitions: $\{P_1, P_2, \ldots, P_M\}$. It is sometimes the case that the programs are implemented in different languages. Partitions are collections of program units like tasks, packages, classes, methods and objects. The potential concurrency among partitions $P_i$ and $P_j$ is given by the term $C_{ij}$.

At tier 2 are tasks (independent threads of control), which may share resources and are permitted to run concurrently. The task tier is represented by the task rendezvous graph, a directed graph $TRG = (V, E)$, wherein a vertex $v \in V$ denotes a task object $f(v)$, and an edge $(x, y) \in E$ indicates that the code of task object $f(x)$ initiates a rendezvous with an entry provided by task object $f(y)$. Each task object $T$ may be periodically executed, in which case $PR(T)$ denotes the period of the task. Alternatively, a task may be asynchronous; that is, it may be activated by one or more events $E(T) = \{e_{T,1}, e_{T,2}, \ldots, e_{T,\nu}\}$.

Tier 3 is composed of modules with multiple entry points (as in CMS-2), ADT packages (as in Ada, Modula, and Clu) and object classes (as in C++, Smalltalk and Eiffel). The employment of tier 3 components provides conceptual clarity, enables tier 2 tasks to be implemented "on top of" abstractions exported by modules, and promotes reuse. At the package instance level, a directed graph is used to show call relationships among instances. A program is modeled by a directed graph $CGRAPH_P = (V, E)$, where a vertex $v \in V$ denotes a package instance $f(v)$, and an edge $(x, y) \in E$ indicates that the code of instance $f(x)$ calls some subprogram(s) provided by instance $f(y)$.

The elements of tier 3 are implemented in terms of subprograms (or methods), the tier 4 elements. At the granularity of the subprogram, a directed graph $CGRAPH_S = (V, E)$ is used to represent the call relationships by letting each vertex $m \in V$ denote a subprogram $f(m)$, and each edge $(m, n) \in E$ indicate that the code of subprogram $f(m)$ calls subprogram $f(n)$.

The research project described in this paper has produced software tools to extract the IR from Ada applications. With the Ada paradigm, it is possible for a subprogram to initiate a rendezvous with tasks or to call subprograms exported by packages, in addition to calling other subprograms. Likewise, in addition to rendezvousing with other tasks, Ada tasks may call subprograms. Similarly, packages may contain calls to subprograms and rendezvouses with task entries. Such interactions must be considered


[Figure 4 shows the MCSA tiers: programs (C++, Ada, etc.) built from tasks, tasks built from packages, packages built from methods, and methods built from program instructions.]

Figure 4: Mission critical software architecture.


[Figure 5 shows the call-rendezvous graph of a test application (Test Case 4), with task objects numbered 1 through 7 and their packages and subprograms; dashed edges denote rendezvous and solid edges denote calls.]

Figure 5: A call-rendezvous graph (CRG).


during the system configuration process. Thus, we define the call-rendezvous graph (CRG) by combining items from tiers 2, 3 and 4. The CRG combines the vertices and edges of $TRG$, $CGRAPH_P$ and $CGRAPH_S$, and inserts directed edges representing calls from tasks to subprograms and packages, and indicating rendezvous initiations from subprograms and packages to tasks. A sample CRG is given in Figure 5; in the figure, dashed lines denote rendezvous edges, solid lines denote subprogram calls, boxes represent task objects, and circles represent packages and subprograms.
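To make the combined representation concrete, the following minimal sketch (plain Python; the data layout and all names are illustrative, not the authors' tool) shows one way a CRG with typed nodes and call/rendezvous edges could be represented:

    from dataclasses import dataclass, field

    # Illustrative CRG container: nodes are program units (TASK, PACKAGE,
    # SUBPROGRAM); directed edges are CALLs or RENDEZVOUS initiations.
    @dataclass
    class CRG:
        kind: dict = field(default_factory=dict)    # unit name -> "TASK" | "PACKAGE" | "SUBPROGRAM"
        edges: dict = field(default_factory=dict)   # (src, dst) -> "CALL" | "RENDEZVOUS"

        def add_unit(self, name, kind):
            self.kind[name] = kind

        def add_call(self, src, dst):
            self.edges[(src, dst)] = "CALL"

        def add_rendezvous(self, src, dst):
            assert self.kind[dst] == "TASK"         # only tasks accept rendezvous
            self.edges[(src, dst)] = "RENDEZVOUS"

    # Hypothetical fragment: task T1 calls package TRACK_FILE, whose
    # subprogram Update initiates a rendezvous with task T2.
    crg = CRG()
    for name, kind in [("T1", "TASK"), ("T2", "TASK"),
                       ("TRACK_FILE", "PACKAGE"), ("Update", "SUBPROGRAM")]:
        crg.add_unit(name, kind)
    crg.add_call("T1", "TRACK_FILE")
    crg.add_call("TRACK_FILE", "Update")
    crg.add_rendezvous("Update", "T2")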

Methods are implemented as a collection of statements or instructions (tier 5 elements). At this level of granularity, several important features are captured during parsing. The symbol table (SymTab) and the statement table (StmtTab) [11] are extracted and used for dependence and flow analyses. Dependence analysis involves processing of the StmtTab to extract graphs that represent statement-level precedence relations due to control dependences, data dependences, and code dependences. Dependence graphs represent program statements as nodes and use directed edges to denote statement ordering implied by the dependences in a source program. Different kinds of ordering requirements are represented in different dependence graphs. In the data dependence graph (DDG),¹ a directed edge denotes a data dependence (destination and source nodes need the same value). The instance dependence graph (IDG)² uses undirected edges to denote instance dependences (which occur when two nodes use operations exported by the same instance [21]). The subprogram dependence graph (SDG) uses an undirected edge to denote when two statements use the same subprogram. A directed edge in the control dependence graph (CDG) denotes that execution of the destination statement depends on a decision made by the source statement. In addition to the dependence graphs, the control flow graph (CFG) is extracted at the statement level, indicating the sequential flow of control dictated by the order of the statements in the source code.
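As a small illustration of the flow-dependence part of this analysis, the sketch below (an invented statement-table layout, not the StmtTab format of [11]) links each statement that reads a variable to the most recent statement that wrote it:

    # Illustrative DDG construction: statements are given in control-flow
    # order as (id, read set, write set); an edge (a, b) means b needs a
    # value produced by a.
    def build_ddg(statements):
        last_writer = {}                       # variable -> most recent writing stmt
        ddg_edges = set()
        for stmt_id, reads, writes in statements:
            for var in reads:
                if var in last_writer:
                    ddg_edges.add((last_writer[var], stmt_id))
            for var in writes:
                last_writer[var] = stmt_id
        return ddg_edges

    # s1: x := 1;   s2: y := x + 1;   s3: x := y;
    stmts = [("s1", set(), {"x"}), ("s2", {"x"}, {"y"}), ("s3", {"y"}, {"x"})]
    print(build_ddg(stmts))                    # {('s1', 's2'), ('s2', 's3')}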

2.2 Metrics for Communication and Concurrency

To enable optimization of partitioning and assignment, this section defines metrics for communication and concurrency among program units.

2.2.1 Metrics for Communication

The communication that takes place between program units is measured in terms of the amount of data that is passed in calls (or rendezvouses) and in terms of the frequencies of calls (or rendezvouses). We define $CF_{AB}$ to be the frequency of calls from A to B. For the calculation of the concurrency matrix in Section 4 we also need the call frequency $CF_{kl}$ of a particular operation $O_{kl}$ of package $PA_k$. These parameters are approximated with the compile-time techniques shown in Figure 6. The call rendezvous graph (CRG) is constructed and weights are placed on the edges to indicate the amount of data exchanged among program units, yielding a weighted CRG (WCRG) (see Figure 7). Initially, the edge weights are set to zero. The statement table is examined to determine when a call (or rendezvous) occurs. For each call (or rendezvous), the receiver (callee or rendezvous acceptor) is identified, as well as the sizes and modes (IN, OUT, or IN and OUT) of the actual parameters. This information is used to update the appropriate edge in the WCRG. Following the identification of all calls and the determination of the amount of information passed in each call, the call frequencies are propagated top-down in the WCRG and the communication weights are scaled accordingly. Further discussion of communication metrics is contained in Section 4.
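The following sketch shows the gist of this weighting step under assumed inputs (the call list, the frequency estimates and the size units are invented; the tool works from the StmtTab and SymTab instead):

    # Each detected call contributes its total parameter size to the edge
    # weight; weights are then scaled by the propagated call frequencies.
    def weigh_crg(calls, call_freq):
        """calls: list of (sender, receiver, [(param_size, mode), ...]);
        call_freq: (sender, receiver) -> estimated call frequency."""
        weights = {}
        for sender, receiver, params in calls:
            data = sum(size * (2 if mode == "IN OUT" else 1)   # IN OUT crosses twice
                       for size, mode in params)
            edge = (sender, receiver)
            weights[edge] = weights.get(edge, 0) + data * call_freq[edge]
        return weights

    calls = [("T1", "TRACK_FILE", [(4, "IN"), (8, "IN OUT")])]
    print(weigh_crg(calls, {("T1", "TRACK_FILE"): 10}))   # {('T1', 'TRACK_FILE'): 200}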

2.2.2 Metrics for Concurrency

For the partitioning approach described in this paper it is necessary to compute metrics indicating the amount of potential concurrency among subprograms, ADT instances, and tasks. For the metrics

¹The DDG and the CDG are used for identifying concurrency inherent in software.
²The IDG and the SDG are used for analysis of slowdown due to contention for the code of instances and subprograms.


[Figure 6 shows the communication-metric extraction process (steps 1.5.4.1 through 1.5.4.8): initialize the weighted CRG; detect each call or rendezvous in the StmtTab; identify the sender, receiver and parameters; determine parameter sizes and modes from the SymTab; update the corresponding WCRG edge weight; and propagate the call frequencies, yielding the communication metrics.]

Figure 6: The process for extracting communication metrics.


[Figure 7 shows the CRG of Figure 5 with a pair of weights on each call and rendezvous edge (values such as 1,2 and 3,2), recording the communication metrics of that edge (amount of data exchanged and call frequency).]

Figure 7: A weighted call-rendezvous graph (WCRG).


computation, we assume that tasks are the basic unit of concurrency, and define metrics that build on previous work wherein we have developed techniques to compute the following concurrency metrics:

• Inherently parallel percentage [25] of methods, ADT instances, and activities.

• Concurrency dependences [20] (e.g., A and B can run concurrently iff B and C can run concurrently).

• Maximum number of replicas of methods and class/package instances that can be used concurrently [27, 21].

• The set of potentially concurrent entities, at the levels of statements, methods, ADT instances, activities, and beads [20, 27, 25, 16].

Our approach to obtaining concurrency metrics is to exploit semantic knowledge that is implicit in the weighted call rendezvous graph (WCRG). To define the metrics, it is necessary to define the following terms:

• $M = \{M_1, M_2, \ldots, M_\mu\}$ is the set of modules (subprograms and packages/ADT instances) in an application.

• $T = \{T_1, T_2, \ldots, T_\tau\}$ is the set of tasks (activities) in an application.

• $M(T_i) = \{M_{i,1}, M_{i,2}, \ldots, M_{i,\mu(T_i)}\}$ is the set of modules used exclusively by task $T_i$. This also includes modules used indirectly or transitively. Note that $M(T_i) \subseteq M$.

• $Q(T_i) = \{Q_{i,1}, Q_{i,2}, \ldots, Q_{i,\kappa(T_i)}\}$ is the set of all modules used by task $T_i$. This also includes modules used indirectly or transitively. Note that $M(T_i) \subseteq Q(T_i) \subseteq M$.

• $U(M_i)$ is the set of tasks that (directly or indirectly) use (call) module $M_i$.

• $\gamma(x, y)$ is the percentage of concurrency among program units x and y. Note that either x or y may be a module (package or subprogram) or a task. $\gamma(x, y)$ is a number between zero and one, where 0 denotes no concurrency and 1 full concurrency.

With these definitions, the concurrency metric $\gamma(x, y)$ is formally defined as:

1. $\forall i\ \forall x, y:\ \gamma(M_{i,x}, M_{i,y}) = 0$

2. $\forall i\ \forall x:\ \gamma(T_i, M_{i,x}) = 0$

3. $\forall i, j: i \ne j,\ \forall x, y:\ \gamma(M_{i,x}, M_{j,y}) = 1$

4. $\forall i, j: i \ne j,\ \forall x:\ \gamma(T_i, M_{j,x}) = 1$

5. $\forall j\ \forall i: T_i \notin U(M_j):\ \gamma(T_i, M_j) = 1$

6. $\forall j\ \forall i: T_i \in U(M_j):\ 0 < \gamma(T_i, M_j) < 1$

In order to approximate $\gamma(x, y)$ for the last case, we use the call frequencies of the WCRG. If several tasks of the set $U(M_j)$ call module $M_j$ concurrently, the amount of concurrency for a particular task is proportional to its call frequency $CF(T_i, M_j)$; i.e., a task has a higher probability of proceeding concurrently with a common module the less frequently it can be blocked in calling this module. The amount of concurrency can thus be evaluated as follows:

$$\gamma(T_i, M_j) = \frac{CF(T_i, M_j)}{\sum_{T_k \in U(M_j)} CF(T_k, M_j)}, \qquad \text{with} \qquad CF(T_k, M_j) = \prod_{M_l \in \text{call chain } T_k \rightarrow M_j} CF(T_k, M_l)$$
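As a small numerical check of this approximation (the frequencies are invented), consider a module M used by two tasks:

    # Hypothetical: tasks T1 and T2 both call module M with WCRG call
    # frequencies 2 and 6; gamma splits proportionally and stays in (0, 1).
    call_freq = {("T1", "M"): 2.0, ("T2", "M"): 6.0}
    users_of_M = ["T1", "T2"]
    total = sum(call_freq[(t, "M")] for t in users_of_M)
    gamma = {t: call_freq[(t, "M")] / total for t in users_of_M}
    print(gamma)   # {'T1': 0.25, 'T2': 0.75}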


3 Partitioning: Mapping onto Logical Processors

The general partitioning problem addressed herein is to divide the tasks, packages and subprograms into groups, with the objective of maximizing concurrency and minimizing communication among groups. The reason for partitioning at this level of granularity is the magnitude of the applications for which the tool is used. A typical example is the AEGIS Weapon System [12], which consists of about 8 million lines of code. In such systems, the exploitation of concurrency at the statement level or at the loop-iteration level is not practical in general, although it may occasionally be useful in isolated situations.

Given the intermediate representation and the concurrency metrics, the program units of an Ada program are distributed automatically, according to the technique shown in Figure 8. The generation of the distribution specification is accomplished by considering concurrency and communication relationships among components, and clustering the components in a way that yields low communication cost while also achieving a high amount of potential concurrency. The distribution specification is produced in an abstract, language-independent form. The tool also contains a filter that maps from the language-independent form into the Rational/Verdix DADS distribution specification language [10].

The partitioning algorithm distributes the program units among an infinite number of logical processors. This is performed by repeatedly cutting edges in the WCRG until a collection of disconnected components exists. Each component represents a logical processor, or a partition. The edges of the WCRG are cut by considering concurrency and communication costs. The following three rules are applied during partitioning:

1. when $\gamma(x, y) = 1$, x and y are placed into different partitions; that is, $partit(x) \ne partit(y)$;

2. when $\gamma(x, y) = 0$, x and y are placed into the same partition; that is, $partit(x) = partit(y)$;

3. when $0 < \gamma(x, y) < 1$, the communication metrics are used to determine whether x and y are placed into different partitions.

The partitioning rules are observed by the partitioning algorithm shown in Figure 9. Initially, the algorithm removes the rendezvous edges from the WCRG, resulting in groups that, except for rendezvous, do not communicate with each other. For example, the graph shown in Figure 10(a) was produced after the first step of partitioning. Each of the groups resulting from the initial partitioning step contains at least one task, and thus the groups can all run concurrently.³ The first step of partitioning accounts for all cases where partitioning rules 1 and 2 apply. Partitioning rule 1 applies to two tasks, or to a package/subprogram used exclusively by task $T_i$ and a task $T_j$ (or to a package/subprogram used exclusively by $T_j$); in both of these cases the concurrency is 1. Partitioning rule 2 applies when two program units have no concurrency; that is, the rule applies to a task and a package/subprogram that it uses exclusively, or to two packages/subprograms used exclusively by the same task.

Although the initial partitions can all potentially run concurrently, it may be the case that some potential concurrency is masked due to cases where multiple tasks reside in the same partition. This occurs when tasks share one or more packages or subprograms, so that call edges link the tasks; such cases call for the application of partitioning rule 3. In fact, this is the case for tasks 1, 2, 3 and 4 in Figure 10(a). To increase the potential concurrency, the partitioning algorithm continues to selectively cut the edges of the WCRG until each partition contains a single task. To accomplish this, each group is searched for the presence of multiple tasks. Groups having more than a single task are further partitioned by arbitrarily selecting two tasks in the group, finding each path of call edges between the two tasks, and breaking each of the paths at the link that has the smallest communication weight (as determined by the communication metrics defined in Section 2.2.1). In applying the partitioning algorithm, it was observed that occasionally partitions consisting of a single task were produced. This is sometimes undesirable for load balancing. To get a balanced distribution of program units among partitions, we also use a parameter which defines a window in the center of each path within which cuts may be made;

³Note that since tasks are non-callable units, they appear as nodes without parents in the WCRG after removal of the rendezvous edges.


[Figure 8 shows the partitioning tool's data flow: the Ada application sources (*.a) are parsed to extract the IR; concurrency and communication metrics are computed; partitioning, followed by assignment, produces a distribution specification (DSL) stored in the Distribution Spec. Library; a display allows viewing.]

Figure 8: Design of the partitioning tool.


GenerateClusters(WCRG, GROUPS, num_groups)
begin
    num_groups := remove_rendezvous_edges(WCRG, GROUPS);
    for i = 1 to num_groups do
        if no_Tasks(GROUPS(i)) > 1 then
            for each j in GROUPS(i) do
                if (j.type = TASK) then
                    for each k in GROUPS(i), k ≠ j do
                        if (k.type = TASK) then
                            E := Extract_Call_Chain(WCRG, j, k);   /* E - set of edges in call chain */
                            Break_Chain_Groups(E, x, y);           /* new groups x & y */
                            GROUPS(i) := GROUPS(i) - {x};
                            num_groups := num_groups + 1;
                            GROUPS(num_groups) := {x};
                        end if
                    end for
                end if
            end for
        end if
    end for
end GenerateClusters

Figure 9: Partitioning algorithm.

outside of that window (i.e., close to the ends of the path), no cuts are permitted. Thus, the partitioning algorithm always chooses the cheapest edge within the limits specified by the parameter. Figure 10(b) shows how task 3 and some associated packages are placed into a partition separate from tasks 1, 2, and 4, by cutting a call edge in the WCRG. The process is repeated until there exists no group with more than a single task. One more repetition produces Figure 10(c). The final partitioned WCRG is shown in Figure 10(d).
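The windowed cheapest-cut rule can be sketched as follows (the path representation and the window parameterization are assumptions; the paper does not give the tool's actual interface):

    # A call-edge path between two tasks is a list of (edge, weight) pairs;
    # `window` is the centered fraction of the path within which cuts are allowed.
    def choose_cut(path, window=0.5):
        n = len(path)
        margin = int(n * (1 - window) / 2)     # edges excluded at each end
        candidates = path[margin:n - margin] or path
        return min(candidates, key=lambda e: e[1])[0]   # cheapest edge in the window

    path = [("T1-P1", 9), ("P1-P2", 2), ("P2-P3", 1), ("P3-T2", 1)]
    print(choose_cut(path))   # 'P2-P3': the equally cheap edge 'P3-T2' is too close to the end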

To allow graphical viewing of partitions and to permit manual intervention in the partitioning process, a graphical user interface (GUI) has been developed. As shown in Figure 11, partitions are represented in the GUI as large circles, while program units are depicted as small circles drawn inside the large circles. Pointing to a small circle and clicking displays the name of the program unit and indicates whether it is a task, package or procedure. The partitioning may be manually modified by pointing to a program unit, holding down the mouse button and dragging the unit into another partition. Clicking on the save button will save modifications in a file.

The partitioning tool produces a platform-independent and distribution-language-independent description of the partitions. Essentially, it is a list of partitions, each containing a list of program units. However, for practicality, a filter has also been implemented to map the list of lists into a distribution specification in the DADS language. The DADS specification, along with an Ada program, the DADS linker and run-time system, and a concurrent computer, are sufficient for execution. The partitioning tool thus generates the group specification required by DADS. Each group consists of program units and corresponds to a logical processor. Such a specification is sufficient for execution, and would result in each group being assigned to one physical processor, if possible. The specification grammar for a group is:

group_specification ::=
    GROUP identifier IS
        identifier_or_asterisk;
        {identifier_or_asterisk;}
    END GROUP;

The group is given an identifier name and each element of the group is referenced by its name defined in the application program. An asterisk may appear in at most one group; it symbolizes a wildcard


[Figure 10 shows four snapshots, (a) through (d), of the WCRG of Figure 5 during partitioning; the task nodes 1 through 7 end up in separate partitions.]

Figure 10: Steps of partitioning the WCRG.


Figure 11: The partitioning tool's graphical user interface.


[Figure 12 lists seven DADS GROUP declarations (GROUP 1 through GROUP 7), one per partition. Each group names the program units of its partition: a task (e.g. task .At5.T5), packages (e.g. packages Afive1 through Afive4) and procedures (e.g. procedure At5); a wildcard (*) appears in one of the groups.]

Figure 12: The DADS partitioning specification.

which means that unspecified program components are to be placed in that group. The (partial) DADS distribution specification corresponding to Figure 10(d) is given in Figure 12.

Each DADS group corresponds to a partition, or to a logical processor. In addition to groups, DADS also provides the station construct. Each station may contain program units and DADS groups, and each station corresponds to a physical processor. The following two sections consider the assignment of logical processors to physical processors, i.e., the assignment of groups to stations.

4 Assessing Concurrency Among Partitions

The result of the partitioning step is a set of partitions $P = \{P_i \mid 1 \le i \le N\}$, each consisting of one task and a number of associated modules. In order to calculate the optimal assignment of partitions to PEs, the neural network uses a so-called concurrency matrix C as input. Each element $C_{ij}$ of this matrix gives


the potential concurrency between partitions $P_i$ and $P_j$ if they are assigned to different PEs, i.e. the concurrency without taking into account communication and synchronization with other partitions. In order to be implementation independent, we assume identical PEs with no runtime overhead for context switches etc.,⁴ and an ideal congestion-free network. This subsection describes the construction of this concurrency matrix.

In order to solve the problem described above, we start from the observation that on each PE, at any time, only one thread of control can be active. In order to determine which one this is, we consider the dynamic execution of the system and not its static structure as given by the CRG.⁵ Since we do not know beforehand which execution path such a thread of control follows at runtime, we have to consider all possibilities. We call the graph of all possible execution paths an activity [6]. The nodes of an activity are calls to operations provided by modules, and the edges are the precedence relations between these calls.⁶

We call the set of all concurrent activities an execution. Usually the different activities are not independent but communicate via rendezvous and may synchronize at common resources.⁷ The difference between an execution and a CRG is that the latter essentially describes a structural (i.e. static) relationship, while an execution describes a logical and temporal (i.e. dynamic) relationship. An execution can thus be considered as a CRG that is unrolled in time. Similar to the CRG, it can be easily derived from the information collected by the compiler. Since an object-oriented design establishes a hierarchy of abstractions, executions and activities can also be hierarchically decomposed. For this analysis, however, we make no use of the latter possibility.

In an object-oriented environment, common resources are always encapsulated by modules, and thus we only have to consider the rendezvous and the common modules of the CRG. Since the Ada runtime environment will prevent concurrent access to shared modules, the only difference between a rendezvous and a call to a shared module is whether the synchronization of activities is mandatory or situational, i.e. dependent on the precise timing. Since our analysis only considers possible execution paths anyway, we can abstract from this difference.

We thus arrive at an execution graph whose nodes are either tasks or packages and whose edges are precedence relations denoting either sequential execution (within an activity) or synchronization (between different activities). At the lowest level of abstraction, the execution graph of any reasonable application is far too complex. One possibility to deal with this problem would be hierarchical (de)composition of executions, i.e. (de)composition along the dynamic axis. For our purpose, however, we make use of the information already gathered in the WCRG, which describes a particular aggregation,⁸ i.e. a (de)composition of the structure.

Figure 10 shows the partitioned WCRG of our sample application. In order to construct the execution graph at the highest level, we proceed as follows:

1. We aggregate tiers of nested module calls in the following way:

• We approximate the computational weight $W_{kl}$ of an operation call $O_{kl}$ of module $M_k$ by the number of non-commented source lines $NCSL_k$ of the module divided by the number of operations $NO_k$: $W_{kl} = NCSL_k / NO_k$.

  At the source code level of a high-level language like Ada, this is of course a very rough approximation, because the executable statements have a large variety of execution times. At the target code (assembler) level on a RISC machine, this estimate is very reasonable, since all statements take approximately the same time.

⁴Alternatively, one could average the runtime overhead of the operating system over the execution time of the various statements.

⁵We consider the rendezvous and iterations shown in the CRG as static information, since they can be derived from a static analysis of the code.

⁶Such an activity is thus a partial logical and a total temporal order.
⁷An execution is thus a partial logical and temporal order.
⁸Since we consider the CRG and not the class/type hierarchy, we prefer to talk about different levels of aggregation instead of abstraction.


• We define the call frequency $CF_{kl}$ as the number of times a particular operation $O_{kl}$ of module $M_k$ is invoked by all tasks per superperiod SP, as defined in step 2.

• We assume the execution time $E_{kl}$ of operation $O_{kl}$ to be directly proportional to $W_{kl}$, i.e. each primitive statement has execution time 1: $E_{kl} = W_{kl}$.

• Within one PE, we assume that the communication overhead is included in the general overhead of the runtime system. If an operation $O_{kl}$ at another PE is invoked, we define the communication costs $CC_{kl}$ as the extra time needed to transmit the parameters and the results of size $S_{kl}$ over an ideal network without congestion: $CC_{kl} = c_1 (S_{kl} + 2 c_2)$, where the constant $c_1$ represents the ratio between the bandwidth of the network and the bandwidth of the PEs. The constant $c_2$ accounts for the fixed overhead of a single message, e.g. if an empty message is sent for synchronization purposes. Note that $S_{kl}$ depends on the size and type (IN, OUT, or IN and OUT) of the various parameters.

• We account for guarded commands by their average contribution to the total execution time. The average execution time of an n-fold branch with branching probabilities $p_h$, $\sum_{h=1}^{n} p_h = 1$, is thus $\sum_{h=1}^{n} E_h p_h$. In a similar way, we unfold all iterations and recursions and take the average number of repetitions into account.

• We account for nested calls of $O_{kl}$ by their weight and call frequency $CF_{kl}$ and their communication costs $CC_{kl}$, i.e. we perform a transitive summation: $E_{kl}(\text{level } n) = 1 + \sum_{\text{calls to level } n+1} (E_{kl} + CC_{kl}) \, CF_{kl}$. In this formula the constant 1 represents the call statement itself. (A small sketch of this summation appears after this list.)

2. We construct an approximate schedule at the highest level of aggregation, based on the results from step 1:

• We assume that all tasks start at the same point in time and that each task gives rise to one activity.

• Activity $A_m$ is either (1) repetitive with period $PA_m$⁹ or (2) asynchronously triggered by an (internal or external) event with frequency $FA_m$. A real application may then be an arbitrary mix of these two possibilities. We thus have to consider a superperiod $SP = \mathrm{LCM}(PA_m \mid \text{all periodic activities})$ and account for the sporadic activities with their maximum frequency.

• The schedule can now be constructed by any suitable static scheduling algorithm like the ones described in [14] and [15]. At a high aggregation level, such a schedule becomes so simple that it can even be drawn manually. The units for constructing the schedule, i.e. the operation calls, are considered as non-preemptable pieces of code, called beads [6]. Note that the probably large size of these beads is no problem, because the schedule comprises only two activities on two PEs. Normal scheduling problems usually involve several activities per PE, which easily results in an infeasible schedule if preemption is excluded.

3. From this approximate schedule we deduce the intervals $B_{ij}$ in which the two activities are blocked (because of a rendezvous or a synchronization at a call of an operation of a shared module) and cannot execute in parallel on their respective PEs. The termination of an activity before the end of SP is counted as equivalent to blocking. From this we can easily calculate the concurrency $C_{ij} = (SP - \sum B_{ij}) / SP$, with $0 \le C_{ij} \le 1$. Note that $C_{ij} = 0$ represents a (dead)lock and $C_{ij} = 1$ represents the independent execution of the two activities or partitions, respectively.

4. A refinement of steps 1 to 3 is possible by (1) lowering the level of aggregation or (2) refining the execution times. The first possibility is obvious. The second one means calculating the actual lengths of the various operations $O_{kl}$ rather than their average weight $W_{kl}$, e.g. by means of a profiler. If the evaluation is done at source code level, the Halstead metrics ([5] and [9]) can be used to account for the complexity of the code.

⁹Note that the Ada programming model implies that $PA_m = PR(T_m)$, as defined in section 2.1.
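To make steps 1 and 3 concrete, here is a small hypothetical sketch (all weights, frequencies and intervals are invented) of the transitive cost summation and the concurrency calculation:

    # Step 1: E(op) starts from W(op); each nested call adds 1 (for the
    # call statement) plus (E(callee) + CC) * CF, summed transitively.
    def exec_time(op, W, nested, CF, CC):
        e = W[op]
        for callee in nested.get(op, []):
            e += 1 + (exec_time(callee, W, nested, CF, CC) + CC[(op, callee)]) * CF[(op, callee)]
        return e

    # Step 3: C_ij = (SP - sum of blocked intervals) / SP.
    def concurrency(SP, blocked_intervals):
        B = sum(end - start for start, end in blocked_intervals)
        return (SP - B) / SP

    W = {"T2.main": 50.0, "M.op": 10.0}
    nested = {"T2.main": ["M.op"]}
    CF = {("T2.main", "M.op"): 4.0}
    CC = {("T2.main", "M.op"): 1.0 * (2 + 2 * 1.0)}   # c1*(S + 2*c2), with c1 = c2 = 1, S = 2
    print(exec_time("T2.main", W, nested, CF, CC))    # 50 + 1 + (10 + 4) * 4 = 107
    print(concurrency(100, [(0, 30)]))                # 0.7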


To demonstrate the method, we consider a sample Ada application consisting of five tasks, each of them giving rise to one activity. These five tasks and the modules they use are distributed over five partitions by the method described above. All activities of this sample application are non-periodic and all parameters are IN and OUT. The weighted CRG of this sample application is shown in Figure 7.

For the sake of simplicity, we calculate the concurrency matrix at the highest level of aggregation of the source code, without accounting for the complexity of different programming constructs. For the same reason, we choose $c_1 = c_2 = 1$ and take only the number of parameters into account, not their size. Since we are only interested in the elements $C_{ij}$ of the concurrency matrix, we consider the activities pairwise. The resulting partial schedule for activities 2 and 3 is shown in Figure 13.

From this schedule we determine the potential concurrency between partitions 2 and 3 as $C_{23} = (66/1957) \cdot 100\% = 3.4\%$. In a similar way we can determine all other elements of the concurrency matrix, which is the input for the allocation algorithm described in the next section.

5 Assignment: Mapping Logical Processors onto Physical Processors

This section describes a solution to the problem of mapping partitions (logical processors) onto physical processors, with the objective of maximizing concurrency (i.e., the goal is to assign partitions to different physical processing elements (PEs) if the partitions can execute concurrently). Note that this objective can be achieved by placing each partition onto a processor by itself. However, such a solution typically yields low processor utilizations. To avoid this, we include the objective of minimizing the number of processors used to maximize concurrency. It is also desirable to conserve processors on parallel computers which allow the set of processors to be partitioned among different users. This paper presents an assignment heuristic that is implemented using Gelenbe's random neural network model (RNN) [3, 4]. The neural network serves as the second phase of a two-phase assignment approach that forms communication-intensive clusters with a high amount of potential concurrency among themselves during phase 1, and that considers concurrency and processor conservation when assigning clusters to physical processors.

5.1 The Assignment Problem

To effectively utilize a parallel or distributed computer, the partitions of a program must be distributed intelligently among the processors. We denote the set of partitions to be assigned to processors as $FAC = \{f_1, f_2, \ldots, f_F\}$, and we denote the set of PEs as $PES = \{p_1, p_2, \ldots, p_P\}$. An assignment is represented as a function $PE: FAC \to PES$. An acceptable assignment is one in which no two partitions that can execute concurrently are assigned to the same PE (i.e., all potential inter-module concurrency can be exploited). The relation $f \parallel g$ indicates that partition f and partition g have some potential concurrency, the relation $f \perp g$ indicates that f and g have no potential concurrency, and the function $C(f, g)$ indicates the amount of potential concurrency among f and g. The property of an acceptable assignment is formally stated as: $\forall f_i, f_j \in FAC: ((PE(f_i) = PE(f_j)) \wedge (i \ne j)) \Leftrightarrow f_i \perp f_j$. An optimal assignment is an acceptable assignment that minimizes the number of processors required. Acceptable assignments cannot be obtained if a sufficient number of PEs is not provided in the parallel computer. In such cases we redefine what it means to have an optimal assignment. We state that a conflict occurs in an assignment whenever two partitions that can execute in parallel are assigned to the same PE. The cost of a conflict is equal to the amount of potential concurrency among the conflicting partitions. An optimal assignment is one in which the number of conflicts is minimized. This criterion can often be satisfied with several different assignments and with varying numbers of PEs. We further restrict the definition of an optimal assignment to be the one that minimizes conflicts while requiring the fewest PEs possible to achieve that number of conflicts. In this section we show how a random neural network model for any instance of this assignment problem can be constructed and solved.


[Figure 13 shows the partial schedule of activity 2 (partition 2 on PE2) and activity 3 (partition 3 on PE3) over the superperiod, marking method calls at the highest level of aggregation, rendezvous initiation and acceptance, and termination at t = 0, t = 66, t = 581 and t = 1957.]

Figure 13: The partial schedule of two activities.


5.2 Gelenbe's Random Neural Network Model

The random neural network (RNN) model was invented by E. Gelenbe [3, 4]. A primary advantage of the RNN model over the traditional Hopfield neural network model is that its stable state can be obtained analytically, rather than through simulation. Thus, the stable state of a network can be obtained by iteratively solving for the stable probabilities that neurons are excited.

Random neural networks are composed of interconnected neurons, with each neuron i having a potential greater than or equal to zero. The neuron emits signals, or "fires," at rate $r(i)$, with exponentially distributed inter-firing intervals. It emits a positive signal to node j with probability $p^+(i, j)$, emits a negative signal to node j with probability $p^-(i, j)$, and emits a signal that departs from the network with probability $d(i)$. Furthermore, the following equality must hold:

$$d(i) + \sum_j \left( p^+(i, j) + p^-(i, j) \right) = 1 \qquad (1)$$

Only neurons with positive potential are permitted to fire. The arrival of a positive signal to a neuron increases the potential of the neuron by 1, and the arrival of a negative signal to a neuron decreases the potential by 1. Firing decreases the potential by 1. Potentials are never allowed to become negative, so negative signals are ignored by neurons having potential equal to zero. Signals also arrive at each neuron from outside of the network. Specifically, positive signals arrive at neuron i with rate $\Lambda(i)$, and negative signals arrive at neuron i with rate $\lambda(i)$.

The long-term (or stable) probability that any neuron i is excited is the rate at which positive signals arrive at the neuron, divided by the rate at which the potential of the neuron is decreased. This is stated as:

$$q_i = \frac{\lambda^+(i)}{\lambda^-(i) + r(i)} \qquad (2)$$

The arrival rate of positive signals to neuron i is computed as:

$$\lambda^+(i) = \Lambda(i) + \sum_j q_j \, r(j) \, p^+(j, i) \qquad (3)$$

Similarly, the rate at which negative signals arrive at neuron i is given by the formula:

$$\lambda^-(i) = \lambda(i) + \sum_j q_j \, r(j) \, p^-(j, i) \qquad (4)$$
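A minimal sketch of solving equations (2) through (4) by fixed-point iteration (plain Python; the array names are illustrative, and clipping q to [0, 1] is a common practical safeguard rather than part of the model):

    # n neurons; Lambda/lam are the external positive/negative arrival
    # rates, r the firing rates, Pp/Pm the positive/negative transition
    # probability matrices.
    def solve_rnn(Lambda, lam, r, Pp, Pm, iters=200):
        n = len(r)
        q = [0.0] * n
        for _ in range(iters):
            for i in range(n):
                lp = Lambda[i] + sum(q[j] * r[j] * Pp[j][i] for j in range(n))   # eq. (3)
                lm = lam[i] + sum(q[j] * r[j] * Pm[j][i] for j in range(n))      # eq. (4)
                q[i] = min(1.0, lp / (lm + r[i]))                                # eq. (2)
        return q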

5.3 The Random Neural Network Solution

The network for approximating the solution to the problem of maximizing inter-partition concurrency (i.e., minimizing conflicts) with minimal PEs is constructed as follows. We use $G(f)$ to represent the number of partitions that can execute concurrently with partition f, $PE$ to indicate the number of PEs available, and $F$ to denote the number of partitions to be assigned. As shown in Figure 14, there are two (2) neurons for each possible (partition, PE) pair. A neuron $R(f, p)$ represents the assignment of partition f to PE p, and is excited whenever f is tentatively assigned to PE p. Conversely, neuron $r(f, p)$ is excited whenever partition f is not tentatively assigned to PE p.

The objectives of the assignment problem are achieved by placing connections between neurons in ways that cause the neurons to enter a state representing an (approximately) optimal solution. There is an inhibitory (negative) connection from each $R(f, p)$ to the corresponding $r(f, p)$, to ensure that the contradictory state where both $R(f, p)$ and $r(f, p)$ are excited will not occur when the network stabilizes. There are excitatory (positive) connections from each $r(f, p)$ to all $R(f, q)$, where $p \ne q$. These connections encourage the eventual assignment of f to some PE. When $g \parallel f$, there is an inhibitory connection from $R(f, p)$ to $R(g, p)$, to discourage f and g from being assigned to the same PE; the weight of such an


[Figure 14 shows a local view of the network around the pair R(f, p) and r(f, p), each with external arrival rates Λ and λ: a negative connection from R(f, p) to r(f, p); positive connections from r(f, p) to R(f, q), q ≠ p; negative connections from R(f, p) to R(g, p) for g ∥ f; and positive connections from R(f, p) to R(h, p) for f ⊥ h.]

Figure 14: A local view of the random neural network.

edge is proportional to the amount of concurrency among g and f; that is, it is proportional to $C(g, f)$. Similarly, there is a positive connection from $R(f, p)$ to $R(h, p)$ when $f \perp h$, to encourage assignment of f and h to the same PE. This tends to reduce the number of PEs required while not limiting concurrency.
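A sketch of how these connections might be encoded as transition probabilities for the solver above (the indexing scheme, the weight constants and the row normalization are assumptions; the paper specifies only the sign and the relative weight of each connection):

    # Neuron indexing: R(f,p) -> 2*(f*P + p), r(f,p) -> 2*(f*P + p) + 1.
    # C[f][g] is the concurrency matrix; C[f][g] == 0 means f ⊥ g.
    def build_network(C, P):
        F = len(C)
        n = 2 * F * P
        R = lambda f, p: 2 * (f * P + p)
        rr = lambda f, p: 2 * (f * P + p) + 1
        Pp = [[0.0] * n for _ in range(n)]
        Pm = [[0.0] * n for _ in range(n)]
        for f in range(F):
            for p in range(P):
                Pm[R(f, p)][rr(f, p)] = 1.0                       # forbid contradictory states
                for q in range(P):
                    if q != p:
                        Pp[rr(f, p)][R(f, q)] = 1.0 / max(P - 1, 1)   # push f toward some PE
                for g in range(F):
                    if g == f:
                        continue
                    if C[f][g] > 0:
                        Pm[R(f, p)][R(g, p)] = C[f][g]            # g // f: keep apart
                    else:
                        Pp[R(f, p)][R(g, p)] = 1.0                # f ⊥ g: co-locate
        # Scale rows so that d(i) + sum(p+ + p-) = 1 can hold (equation (1)).
        for i in range(n):
            s = sum(Pp[i]) + sum(Pm[i])
            if s > 1.0:
                Pp[i] = [x / s for x in Pp[i]]
                Pm[i] = [x / s for x in Pm[i]]
        return Pp, Pm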

To obtain a solution to the assignment problem, the analytical solution for the network's stable state is obtained and the node with the highest probability of being excited is selected. That node indicates the assignment of one partition to a PE. Such an assignment is made and the process is repeated until all partitions are assigned. The solution procedure is outlined below:

Procedure RNN:

1. Number the partitions and PEs.

2. Initialize all $q_{R(f,p)} = 0$.

3. Assign partition f to PE p, where $\Lambda(R(f, p))$ is the largest over all f and p. This is accomplished by setting $q_{R(f,p)} = 1$ and $q_{R(f,q)} = 0$ for all $q \ne p$.

4. Let the rest of the network "relax". Use the equations above to calculate $q_{R(f',p)}$ for every partition $f'$ that has not been fixed to a PE. Repeat this calculation until stable $q_R$ values result.

5. Find the maximum $q_R$, say $q_{R(f_1,p_1)}$, and make the corresponding assignment by setting $q_{R(f_1,p_1)} = 1$ and $q_{R(f_1,q)} = 0$ for all $q \ne p_1$.

6. Repeat steps 4 and 5 until all partitions are assigned.


6 Conclusions

This paper contains a description of an automated approach for partitioning and assignment of object-based concurrent programs. The approach bridges the gap that exists between application software and parallel or distributed architectures. Additionally, the paper defines techniques for computing the communication and concurrency metrics that drive partitioning and assignment, and defines a language-independent intermediate representation from which the metrics are computed. Software tools incorporating these techniques have been implemented and are described as well.

The output of the processes described in this paper allows a program to be executed on a concurrent architecture. For example, the partitioning and assignment produced can be expressed in the DADS distribution specification language to produce a complete software system that can be automatically executed on a parallel or distributed computer on which the DADS run-time system has been installed.

Novel aspects of this work include the consideration of tasks and objects as units of partitioning and assignment. Additionally, the concurrency metrics constitute a significant result in the area of static assessment of concurrency. Unlike other work in partitioning, we provide a partitioning tool, not just a partitioning language or a graphical tool for constructing partitions. Furthermore, this work has been motivated by a real need in the applications domain: the ability to distribute mission critical software over concurrent hardware platforms.

Future work includes additional metrics for concurrency and communication, as well as for timing, fault tolerance and reliability. Although the output of partitioning and assignment is language-independent, future work includes the exploitation of advanced features of the DADS partitioning language, as well as exploitation of the partitioning construct of the Ada-95 language. Additionally, we will compare various partitioning and assignment techniques on several mission critical systems. Most importantly, the tools described herein are being used in the ARPA/AEGIS-sponsored HiPer-D project, which is establishing essential capabilities for shipboard computing in the 21st Century.

References

[1] S. H. Bokhari, "Partitioning Problems in Parallel, Pipelined, and Distributed Computing," IEEE Transactions on Computers, 37(1), pages 48-57, January 1988.

[2] F. Ercal, J. Ramanujam and P. Sadayappan, "Task Allocation onto a Hypercube by Recursive Mincut Bipartitioning," Journal of Parallel and Distributed Computing, pages 35-44, October 1990.

[3] E. Gelenbe, "Random neural networks with negative and positive signals and product form solution," Neural Computation, 1(4), 1989.

[4] E. Gelenbe, "Theory of the random neural network model," in Neural Networks: Advances and Applications, E. Gelenbe, editor, Elsevier Science Publishers, 1991.

[5] M. Halstead, Elements of Software Science, North Holland, 1977.

[6] D. K. Hammer, P. Lemmens, E. Luit, O. van Roosmalen, P. van der Stok and J. Verhoosel, "DEDOS: A Distributed Environment for Object-Oriented Real-Time Systems," Journal of Parallel and Distributed Technology, Winter 1994.

[7] R. Jha, J. M. Kamrad II and D. T. Cornhill, "Ada Program Partitioning Language: A Notation for Distributing Ada Programs," IEEE Transactions on Software Engineering, 15(3), pages 271-280, March 1989.

[8] V. M. Lo, "Heuristic Algorithms for Task Assignment in Distributed Systems," IEEE Transactions on Computers, 37(11), November 1988.

[9] R. S. Pressman, Software Engineering: A Practitioner's Approach, McGraw-Hill, 1992.

[10] The Rational Corporation, "Distributed Application Development System Guide," version 6.2.3, December 16, 1994.

[11] B. Ravindran, "Extracting parallelism at compile-time through dependence analysis and cloning techniques in an object-based paradigm," M.S. Thesis, New Jersey Institute of Technology, May 1994.

[12] A. L. Samuel, E. Sam, J. A. Haney, L. R. Welch, J. Lynch, T. Moffit and W. Wright, "Application of a Reengineering Methodology to Two AEGIS Weapon System Modules: A Case Study in Progress," Proceedings of The Fifth Systems Reengineering Technology Workshop, Naval Surface Warfare Center, February 1995.

[13] H. S. Stone, "Multiprocessor scheduling with the aid of network flow algorithms," IEEE Transactions on Software Engineering, SE-3(1), pages 85-93, January 1977.

[14] J. P. C. Verhoosel, E. J. Luit, D. K. Hammer and E. Jansen, "A Static Scheduling Algorithm for Distributed Hard Real-Time Systems," The Journal of Real-Time Systems, 3(3), September 1991.

[15] J. Verhoosel, "Pre-Run-Time Scheduling of Distributed Real-Time Systems: Models and Algorithms," PhD Thesis, Department of Computing Science, Eindhoven University of Technology, January 1995.

[16] J. P. C. Verhoosel, L. R. Welch, D. Hammer and A. D. Stoyenko, "Assignment and Pre-Run-time Scheduling of Object-Based, Parallel Real-Time Processes," IEEE Symposium on Parallel and Distributed Processing, October 1994.

[17] J. P. C. Verhoosel, L. R. Welch, D. K. Hammer, A. D. Stoyenko and E. J. Luit, "A Formal Deterministic Scheduling Model for Object-Based, Hard Real-Time Executions," Journal of Real-Time Systems, 8(1), January 1995.

[18] R. A. Volz, T. N. Mudge, G. D. Buzzard and P. Krishnan, "Translation and execution of distributed Ada programs: Is it still Ada?," IEEE Transactions on Software Engineering, 15(3), pages 281-292, March 1989.

[19] L. R. Welch, A. D. Stoyenko and S. Chen, "Assignment of ADT Modules with Random Neural Networks," The Hawaii International Conference on System Sciences, pages II-546-555, January 1993.

[20] L. R. Welch, "Assignment of ADT Modules to Processors," Proceedings of the International Parallel Processing Symposium, pages 72-75, March 1992.

[21] L. R. Welch, "Cloning ADT Modules to Increase Parallelism: Rationale and Techniques," Fifth IEEE Symposium on Parallel and Distributed Processing, pages 430-437, December 1993.

[22] L. R. Welch, A. D. Stoyenko and T. J. Marlowe, "Response Time Prediction for Distributed Periodic Processes Specified in CaRT-Spec," Control Engineering Practice, (in press).

[23] L. R. Welch, A. Samuel, M. Masters, R. Harrison, M. Wilson and J. Caruso, "Reengineering Complex Computer Systems for Enhanced Concurrency and Layering," Journal of Systems and Software, July 1995, (to appear).

[24] L. R. Welch, "A Parallel Virtual Machine for Programs Constructed from Abstract Data Types," IEEE Transactions on Computers, 37(11), pages 1249-1261, November 1994.

[25] L. R. Welch, G. Yu, J. Verhoosel, J. A. Haney, A. Samuel and P. Ng, "Metrics for Evaluating Concurrency in Reengineered Complex Systems," Annals of Software Engineering, 1(1), Spring 1995.

[26] J. Xu and D. L. Parnas, "Scheduling Processes with Release Times, Deadlines, Precedence, and Exclusion Relations," IEEE Transactions on Software Engineering, 16(3), pages 360-369, March 1990.

[27] G. Yu and L. R. Welch, "Program Dependence Analysis for Concurrency Exploitation in Programs Composed of Abstract Data Type Modules," IEEE Symposium on Parallel and Distributed Processing, October 1994.

[28] G. Yu and L. R. Welch, "A Novel Approach to Off-line Scheduling in Real-Time Systems," Informatica, (to appear in 1995).
