Micro time cost analysis of parallel computations


Transcript of Micro time cost analysis of parallel computations

IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 5, MAY 1991

Micro Time Cost Analysis of Parallel Computations
Bin Qin, Member, IEEE, Howard A. Sholl, Member, IEEE, and Reda A. Ammar, Senior Member, IEEE

Abstract- In this paper, we investigate the modeling and analysis of time cost behavior of parallel computations. It is assumed parallel computations under investigation reside in a computer system in which there is a limited number of processors, all the processors have the same speed, and they communicate with each other through a shared memory. It has been found the time costs of parallel computations depend on the input, the algorithm, the data structure, the processor speed, the number of processors, the processing power allocation, the communication, the execution overhead, and the execution environment. In this paper, we define time costs of parallel computations as a function of the first seven factors as listed above. The computation structure model is modified to describe the impact of these seven factors on time cost. Techniques based on the modified computation structure model are developed to analyze time cost. A software tool, TCAS (time cost analysis system) that uses both the analytic and the simulation approaches, is designed and implemented to aid users in determining the time cost behavior of their parallel computations.

Index Terms- Parallel computations, performance modeling and analysis, shared memory environment, software evaluation, software performance engineering, time cost behavior.

I. INTRODUCTION

Time cost is one of the important metrics. To design time-efficient computations, it is necessary to develop models

and techniques to analyze the time cost behavior. There are two types of time cost analyses, the macro analysis

and the micro analysis [1]. The macro analysis consists of choosing a dominant operation and expressing the time cost of a computation as a function of the number of times this operation is performed. During such analysis, the impact of all other operations on the time cost of the computation is ignored. The complexity analysis is one typical example of macro analysis. In complexity analysis, a dominant operation is chosen as the complexity measure and the asymptotic time cost behavior is described as a function of the number of times the operation is used in the computation. On the contrary, micro analysis consists of expressing the time cost of a computation as a function of the time needed to perform each of the operations in the computation. In micro analysis, all factors that affect the time cost behavior are considered.

Compared to the great success that has been achieved in the areas of macro analysis of sequential computations, micro analysis of sequential computations, and macro analysis of parallel computations [1]-[19], little has been done in the area

Manuscript received June 14, 1988; revised October 25, 1990. This work was supported in part by the National Science Foundation under Grants DCR8602299 and CCR8701839.

B. Qin is with IBM, Toronto, Ont., Canada.
H. A. Sholl and R. A. Ammar are with the Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269.
IEEE Log Number 9143280.

of micro analysis of parallel computations. In this paper, we discuss the micro time cost analysis of parallel computations. It is assumed the parallel computation to be analyzed resides in a computer system in which there is a limited number of processors, all the processors have the same speed, and they communicate with each other through a shared memory. In our research work, the time cost of a parallel computation is defined as a function of the input, the algorithm, the data structure, the processor speed, the number of processors, the processing power allocation, and the communication. An existing model, the computation structure model [15], is modified to describe the impact of the number of processors, the processing power allocation, and the communication on time cost. In the modified model, the time cost of any operation in a parallel computation is related to the amount of processing power and the allocation policy that are available during the execution of that operation. The locking technique is used in the model to manipulate the access to the shared memory that is used for communication. Based on the modified model, analytic techniques are developed to derive the time costs of parallel computations. A software tool, TCAS (time cost analysis system) that uses the modified computation structure model as the underlying model and carries out the micro time cost analysis on parallel computations using either the analytic approach or the simulation approach, is designed and implemented.

This paper is organized as follows. Factors that affect the time cost are first identified in Section II. The computation structure model is briefly reviewed in Section III. How to model the effect of processors, processing power allocation, and communication on time cost behavior is described in Sections IV and V. Time cost analysis techniques based on the model are discussed in Section VI. A time cost analysis tool (TCAS) is described in Section VII. Conclusions are finally given in Section VIII.

II. WHAT INFLUENCES TIME COST

To study the time cost behavior of parallel computations, we must first identify the factors that influence the time cost. At the computation level, we believe the time cost of a parallel computation depends on at least the following nine factors.

1) The size of the input to the computation. If the computation is used to sort a table of n elements, the time cost of the computation increases as the size of the table increases.

2) The algorithm used in the computation. To sort a table, we may use a bubble sort algorithm or a quick sort algorithm. Using different algorithms will yield different time costs.



3) The data structure used in the computation. A string can be represented as an array or a linked list. Different representations need different processing and therefore, they impact the time cost behavior of the computation.

4) The processor speed. In general, as the speed of the processor increases, the time cost of the computation decreases.

5) The number of processors available. In general, as the number of processors increases, we may achieve lower time cost.

6) The processing power allocation policy. The process- ing power allocation policy determines how processing power is allocated. If a computation is allocated more processing power, it may have lower time cost.

7) The communications between computations. If two com- putations need to communicate with each other in order to fulfill a task, the way the communication is carried out and the time spent on the communication will influence the time costs of both computations.

8) The execution overhead of the computation. To support the execution of the computation, the computer system needs to perform some auxiliary operations such as interrupt handling, accounting, page swapping, processor scheduling, resource allocating, etc. The more auxiliary operations are performed, the more the time cost will increase.

9) The execution environment of the computation. During the execution of the computation, the system may also execute some other computations. Since all these com- putations share the system resource, the time cost of one computation will influence the time costs of others.

Often, the time cost of a computation is only defined as a function of the first four factors. To better analyze the time cost behavior of parallel computations, we need to model the impact of other factors on the time cost.

III. COMPUTATION STRUCTURE MODEL

Our work is based on the computation structure model [15] that describes the detailed execution and time cost behavior of computations. This model has been successfully used as the underlying model in several time cost analysis systems [3], [4], [20]. In this section, we briefly describe this model. More details about the model can be found in [15].

A computation in the computation structure model is repre- sented as two directed graphs, a control graph and a data graph. The control graph shows the order in which operations are performed, while the data graph shows the relations between operations and data.

The control graph contains a start node, an end node, operation nodes, decision nodes, or nodes, fork nodes, and join nodes. The start and end nodes indicate the beginning and end of the computation; an operation node specifies an operation to be performed; a decision node checks conditions at a branch point; an or node serves as a junction point; a fork node creates parallel execution paths; and a join node merges parallel execution paths into a single path. The execution of a computation is triggered when an activation signal enters

the control graph of the computation. When a signal enters an operation node, the operation specified by the node is performed and the signal then leaves the node. When a signal arrives at a decision node, the decision node checks some conditions and the signal leaves the node from one of its outgoing edges depending on the result of checking. When a signal arrives at an or node from one of its incoming edges, it immediately leaves the node from its outgoing edge. When a signal reaches a fork node, the fork node creates parallel execution paths and an activation signal is put on each parallel path. When a join node receives one signal from each of its incoming edges, an activation signal is created and it leaves the join node from its outgoing edge. When an activation signal finally arrives at the end node, the execution terminates.

The data graph contains operation nodes and data nodes. The former is represented as a circle and the latter a box. A data item is the input to an operation if there is an edge from the data node to the operation node. Similarly, a data item is the output of an operation if there is an edge from the operation node to the data node.

Fig. 1 illustrates the computation structure of a sequential computation. The control graph is given in Fig. 1(a) and the data graph is given in Fig. 1(b). The weight of each edge in the control graph indicates the number of times the edge is traversed by activation signals. This computation copies a string t with a length of n to string s. Operations in the computation are defined as follows:

op1 : i ← 1
op2 : s[i] ← t[i]
op3 : i ← i + 1
D : i ≤ n.

The time cost of a computation is defined as follows. Each node in the control graph is associated with a time cost which is equal to the time needed to perform the operations specified by the node. When a node with time cost C receives all the required activation signals, the execution of the specified operations is started and after C amount of time, activation signals leave the node. The time cost of a computation is then defined as the time for an activation signal to travel from the start node to the end node.

Techniques to derive time costs of sequential computations have been developed [4], [15], [21]. One approach [4], [15] works as follows. Given a control graph G(V, E) of a sequential computation, suppose node v_i has a time cost C_i and it is visited E_i times by activation signals; then the time cost of the computation is

Cost = Σ_{i=1}^{|V|} E_i * C_i.

According to the above, the time cost of the computation given in Fig. 1 is (assume the time costs of start and end nodes are 0):

Cost = Cost_op1 + (n + 1) * Cost_or + (n + 1) * Cost_D + n * Cost_op2 + n * Cost_op3.
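To make this micro cost expression concrete, the following short Python sketch (our own illustration, not part of the paper or of TCAS; the per-operation costs are made-up values) evaluates Cost = Σ E_i * C_i for the string-copy computation of Fig. 1.

    # Minimal sketch: micro time cost of the Fig. 1 string copy,
    # Cost = sum over nodes of (visit count E_i) * (node time cost C_i).
    def micro_cost(n, costs):
        # visit counts taken from the flow analysis of Fig. 1
        visits = {
            "op1": 1,        # i <- 1
            "or":  n + 1,    # loop junction
            "D":   n + 1,    # i <= n test
            "op2": n,        # s[i] <- t[i]
            "op3": n,        # i <- i + 1
        }
        return sum(visits[node] * costs[node] for node in visits)

    # Example with hypothetical per-operation costs (arbitrary time units).
    example_costs = {"op1": 1.0, "or": 0.0, "D": 0.5, "op2": 2.0, "op3": 1.0}
    print(micro_cost(10, example_costs))   # cost of copying a 10-character string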


Fig. 1. Example of a sequential computation.

One can see from the definition that the time cost of a computation in the computation structure model is defined as a function of the input, the algorithm, the data structure, and the processor speed. Since the time cost of a parallel computation also depends on the number of processors, the processing power allocation, and the communication, the impact of these three factors must be considered in order to better model the time cost behavior. In the next two sections, we modify the computation structure model to include these three factors.

IV. MODELING OF PROCESSORS AND ALLOCATION

To model the impact of processors and processing power allocation on the time cost behavior, we modify the computation structure model. Without loss of generality, we assume each processor has unity processing power (since it is assumed all the processors have the same speed). Therefore, one unit of processing power is equivalent to one processor. In the modified computation structure model, it is assumed activation signals have the abilities to carry processing power and allocation policy as they travel in the control graph of a computation. It is assumed the processing power allocation and release are carried out at the beginning and end of a computation's execution, respectively. That is, a computation will hold the same amount of processing power during its execution. The amount of processing power carried to the start node of a computation by the activation signal is what is allocated to the computation, and the allocation policy carried to the start node is the one passed to the computation. When an activation signal reaches a node, the node receives all the processing power and the allocation policy carried by the signal, and it uses them to perform the specified operations. After the specified operations are performed, the signal carries the same amount of processing power and the same allocation policy, and leaves the node.

Suppose the amount of processing power and the allocation policy carried by an activation signal are P and A, respectively. If the activation signal enters a fork node which has n outgoing edges as illustrated in Fig. 2 (the weights of an edge in Fig. 2 represent the amount of processing power and the allocation policy carried by the activation signal which traverses the edge), the fork node uses the received allocation policy to allocate the received processing power among its n successor nodes. Then n activation signals are created by the fork


Fig. 2. First example of parallel structure.

node. All the activation signals generated by the fork node carry the same allocation policy A. Node op_i is allocated P_i processing power if the activation signal which leaves for node op_i carries P_i processing power. How the allocation is carried out is totally determined by the allocation policy A. The only restrictions put on the allocation are as follows:

Σ_{i=1}^{n} P_i = P,   0 < P_i ≤ P.

For the join node in Fig. 2, when it receives an activation signal from each of its incoming edges, it creates an activation signal. The signal created by the join node carries allocation policy A, and if the activation signal coming from node op_i carries P_i processing power, the activation signal generated by the join node then carries P = Σ_{i=1}^{n} P_i processing power.

The time cost of a computation is still defined as the time for an activation signal to travel from the start node to the end node. However, the time cost of a computation now depends on the amount of processing power and the allocation policy it receives. If the amount of processing power and the allocation policy a computation C receives are P and A, respectively, the time cost of C is defined as Cost_C(P, A). Note, it is not necessary for P to be an integer number. The only restriction put on P is P > 0. P > 1 means computation C uses more than unity processing power. P < 1 means computation C shares one processor with other computations and C can only get P * 100% of the processor time.

For the parallel structure as illustrated in Fig. 2, if the amount of processing power and the allocation policy received by the fork node are P and A, respectively, we define its time cost as

Cost_PS(P, A) = Cost_fork(P, A) + max_{1≤i≤n} Cost_op_i(P_i, A) + Cost_join(P, A)

where P_i is the amount of processing power allocated to op_i under policy A.

For a node N which contains no parallel paths, we define its base time cost as its time cost given that unity processing power is provided. Since such a node does not contain any parallel paths, its time cost is independent of the processing power allocation. Therefore, its base time cost is equal to Cost_N(1, A') where A' is any allocation policy. Suppose P processing power is provided to node N; we define the time cost of N as

Cost_N(P, A) = Cost_N(1, A') / min(1, P) = N's base time cost / min(1, P)

where A and A' are any allocation policies.

The above indicates that the base time cost of a node that contains no parallel paths is the best time cost one can get; providing more than unity processing power to such a node will not reduce its time cost, and providing less than unity processing power to such a node will force the node to wait during its execution, therefore resulting in an increase of its time cost.

According to the definitions and modifications made in this section, one can see that the original computation structure model is a special case of the modified computation structure model defined in this section. The time cost of a computation C, Cost_C, in the original computation structure model is simply the time cost Cost_C(P, A) in the modified computation structure model given that P is large enough and A allocates processing power in such a way that every node in the computation can receive at least unity processing power.

Since the time cost depends on the amount of processing power and the processing power allocation, one may ask how to minimize the time cost if the amount of processing power is fixed. This obviously depends on how processing power is allocated. Given a parallel structure as illustrated in Fig. 2, Theorem 1 shows how to minimize the time cost if each op_i in Fig. 2 is internally sequential.

Theorem 1: Given a parallel structure as illustrated in Fig. 2, suppose P processing power is carried to the fork node, each op_i is sequential, and the base time cost of each op_i is C_i; then the minimum time cost can be achieved if each op_i is allocated the following amount of processing power:

P_i = P * C_i / Σ_{j=1}^{n} C_j

and the time cost of op_i is

C_i / min(1, P * C_i / Σ_{j=1}^{n} C_j).

The proof of Theorem 1 is given in Appendix 1.

Corollary 1: Given a parallel structure as illustrated in Fig. 2, suppose each op_i is sequential and its base time cost is C_i; then the minimum time cost can be achieved if the following amount of processing power is passed to the parallel structure:

Σ_{i=1}^{n} C_i / max{C_1, ..., C_n}.

The proof of Corollary 1 is given as part of Theorem 1's proof.

Corollary 1 indicates that if a parallel structure has n parallel paths, it is possible to achieve the best time cost behavior with less than n processing power. To achieve the best time cost behavior with the minimum amount of processing power, one can force all the operations performed in parallel to complete at the same time. To illustrate this, consider the parallel structure in Fig. 2. If each op_i is allocated C_i / max{C_1, ..., C_n} processing power, the total amount of processing power needed for the parallel structure is

Σ_{i=1}^{n} C_i / max{C_1, ..., C_n}.

Therefore, the time cost of the parallel structure is

fork's base time cost + max{C_1, ..., C_n} + join's base time cost.

It can be seen the above is the minimum time cost one can expect from the parallel structure. It is also easy to see that Σ_{i=1}^{n} C_i / max{C_1, ..., C_n} is the minimum amount of processing power needed to achieve this minimum time cost.

To illustrate the application of Theorem 1, consider the parallel structure in Fig. 2 and suppose unity processing power is passed to the fork node. According to Theorem 1, each op_i gets C_i / Σ_{j=1}^{n} C_j processing power. The minimum time cost of the parallel structure is then

Cost_fork(1, A) + max_{1≤i≤n} Cost_op_i(P_i, A) + Cost_join(1, A) = fork's base time cost + C_1 + ... + C_n + join's base time cost.

This time cost is what one expects since there is only one processor in the computer system and all the n operations performed in parallel must share the single processor. Such time cost behavior cannot be described by the original computation structure.
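As an illustration of Theorem 1 and Corollary 1, the following small Python sketch (our own; the base costs are arbitrary values, not from the paper) allocates processing power in proportion to the base costs and checks that all paths finish together at max{C_1, ..., C_n}.

    # Illustrative sketch of Theorem 1 / Corollary 1 (names and values are hypothetical).
    def allocate(P, base_costs):
        """Split processing power P over parallel paths in proportion to their base costs C_i."""
        total = sum(base_costs)
        return [P * c / total for c in base_costs]

    def path_cost(c, p):
        """Time cost of a sequential path with base cost c given power p (Section IV rule)."""
        return c / min(1.0, p)

    base_costs = [3.0, 5.0, 2.0]            # hypothetical C_1..C_n
    P = sum(base_costs) / max(base_costs)   # Corollary 1: smallest P achieving the best time cost
    powers = allocate(P, base_costs)
    print(powers)                            # proportional shares; all paths finish together
    print(max(path_cost(c, p) for c, p in zip(base_costs, powers)))   # equals max(C_i) = 5.0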

V. MODELING OF COMMUNICATION

In a parallel computation, operations performed in parallel often communicate with each other in order to fulfill a given task. The communication time will affect the time cost behavior of the parallel computation, and to better analyze the time cost behavior, the communication needs to be considered in the time cost analysis. Since it is assumed the communication is through a shared memory, we only need to model the accesses to the shared data in the shared memory and determine the time required.

In our research work, we use a locking technique [22] to manipulate the accesses to shared data. It is assumed that each shared data item is associated with two locks, a read lock and a write lock. To read a shared data item, one needs to obtain a read lock on that shared data item. Similarly, a write lock on


a shared data item is required if one wants to perform a write operation. Several operations performed in parallel can read a shared data item at the same time. However, no other read or write lock on a shared data item is granted if a write lock on that shared data item has already been held.

To include the data access control into the computation structure model, two new types of nodes, a lock node and an unlock node, are added to the computation structure model. The lock node is used to obtain locks on shared data and the unlock node is used to release locks. Each lock or unlock node has an incoming edge and an outgoing edge associated with it in the control graph. A lock node needs to obtain a read lock on data item X if X appears as an input to the lock node in the data graph. Similarly, a lock node needs to obtain a write lock on X if X appears as an output from the lock node in the data graph. If X is connected with an unlock node in the data graph, the unlock node releases a read or write lock on X depending on whether X appears as an input or output of the unlock node. Fig. 3 illustrates an example of lock and unlock nodes. In Fig. 3, the lock node needs to obtain read locks on X and Z, and a write lock on Y; the unlock node releases write locks on A and C, and a read lock on B.

It is assumed that a lock node obtains all its required locks at the same time. If the lock node cannot get all the required locks, it is put into a lock waiting queue. In this way, a lock node will never hold some locks and wait for some other locks. When some locks are released by an unlock node, lock nodes in the lock waiting queue are checked to see if their requests can be satisfied. The checking is done in FIFO order. Once all the requests of a lock node are satisfied, the lock node is removed from the queue.

The time cost of a computation here is again defined as the time for an activation signal to travel from the start node to the end node. When an activation signal enters an unlock node, the unlock node starts to release locks. After required locks are released, the activation signal leaves the unlock node. The time cost of an unlock node is then equal to the time to release the required locks. When an activation signal enters a lock node, the lock node starts to wait for the required locks. When all the locks on required data are available, the lock node manipulates these locks. After all the required locks are obtained, the activation signal leaves the lock node. The time cost of a lock node is then equal to the sum of the time to wait for the required locks and the time to manipulate these locks.

Given the model defined in this section, some uncertainties may occur. This happens when more than one lock node in parallel requires conflicting locks on a shared data item (read and write locks, or write and write locks). Consider the parallel structure in Fig. 4. A conflict occurs in the parallel structure since lock1 and lock2 both require write locks on X at the same time. When this occurs, it is assumed that the computer system will arbitrarily choose one lock node to obtain its required locks first. The probability of selecting a lock node to get its required locks first is assumed to be uniformly distributed. If

Fig. 3. An example of lock and unlock nodes.

Fig. 4. An example of conflict.

any uncertainty occurs, the average time cost is used as the time cost.
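The following Python sketch (our own illustration, not TCAS code) captures the locking rules just described: all-or-nothing lock acquisition, several readers or a single writer per data item, and a FIFO waiting queue that is rechecked whenever locks are released. The class and method names are ours.

    # Minimal sketch of the Section V locking rules.
    class LockTable:
        def __init__(self):
            self.readers = {}      # data item -> number of read locks held
            self.writers = set()   # data items with a write lock held
            self.waiting = []      # FIFO queue of (read_set, write_set, name)

        def _grantable(self, reads, writes):
            # a read lock is blocked by a write lock; a write lock is blocked by anything
            if any(d in self.writers for d in reads | writes):
                return False
            return not any(self.readers.get(d, 0) > 0 for d in writes)

        def lock(self, name, reads=frozenset(), writes=frozenset()):
            """Lock node: obtain all required locks at once, or join the waiting queue."""
            if self._grantable(reads, writes):
                for d in reads:
                    self.readers[d] = self.readers.get(d, 0) + 1
                self.writers |= writes
                return True
            self.waiting.append((reads, writes, name))
            return False

        def unlock(self, reads=frozenset(), writes=frozenset()):
            """Unlock node: release locks, then recheck waiters in FIFO order."""
            for d in reads:
                self.readers[d] -= 1
            self.writers -= writes
            pending, self.waiting = self.waiting, []
            for reads_w, writes_w, name in pending:
                self.lock(name, reads_w, writes_w)   # re-queues itself if still blocked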

VI. TIME COST ANALYSIS TECHNIQUES

To derive the time cost of a parallel computation, it is necessary to first determine the number of times each operation in the computation is performed. This can be done by using the flow analysis technique. In a flow analysis technique, the number of times activation signals traverse an edge is referred to as the flow of the edge. The flow analysis technique we developed [23] is based on the following four observations:

1) If an extra edge is added to the control graph such that it starts at the end node and ends at the start node, then for any node in the control graph except fork and join nodes, the sum of its incoming flows is equal to the sum of its outgoing flows.
2) For a fork node, its incoming flow is equal to any of its outgoing flows.
3) For a join node, its outgoing flow is equal to any of its incoming flows.
4) If any parallel structure in the control graph is replaced with a single node, and an extra edge is added to the control graph such that it starts at the end node and ends at the start node, then for any node in the control graph, the sum of its incoming flows is equal to the sum of its outgoing flows.


The fourth observation tells us if we replace each parallel structure with a single node, the parallel computation then becomes sequential and as a result, flow analysis techniques for sequential computations can be applied to the computation. Within a parallel structure, each parallel path can be considered as a control graph and the above procedure can be applied on it recursively.
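The balance condition in the first observation is easy to check mechanically. The sketch below uses a small hypothetical control graph of our own (a simple loop, with the extra end-to-start edge added) and verifies that the sum of incoming flows equals the sum of outgoing flows at every node.

    # Small sketch checking the flow-balance observation on a control graph.
    # Edges are (src, dst, flow); an extra edge end->start carries the entry flow.
    edges = [
        ("start", "op1", 1), ("op1", "or", 1), ("body", "or", 10),
        ("or", "D", 11), ("D", "body", 10), ("D", "end", 1),
        ("end", "start", 1),                  # the added edge e0
    ]

    def balanced(node):
        inflow  = sum(f for s, d, f in edges if d == node)
        outflow = sum(f for s, d, f in edges if s == node)
        return inflow == outflow

    nodes = {n for s, d, _ in edges for n in (s, d)}
    print(all(balanced(n) for n in nodes))    # True: every node balances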

The above idea is used in our flow analysis technique. The following is the flow analysis technique for parallel computations. It is designed as a recursive procedure FLOW-ANALYSIS(G) where G is the control graph of the parallel computation to be analyzed. More details about this technique can be found in [23].

FLOW-ANALYSIS(G):
1. add flow e0 to G such that e0 starts at the end node and ends at the start node, and mark e0 as a dependent flow;
2. for each parallel structure PS which is not contained in another parallel structure, consider PS as a single node;
3. initialize all dependent flows in G to zero;
4. for each independent flow e_i in G, do the following:
   4.1. find a cycle which contains e_i, but all the other flows in the cycle are dependent flows;
   4.2. for each dependent flow e_j in the cycle, do the following:
        4.2.1. if e_i and e_j have the same direction along the cycle, then e_j ← e_j + e_i; else e_j ← e_j − e_i;
5. for each parallel structure PS which is not contained in another parallel structure, do the following:
   5.1. disconnect PS from G;
   5.2. mark all the outgoing flows of the fork node of PS as independent flows, and set them equal to the incoming flow of the fork node;
   5.3. mark the fork node of PS as a start node;
   5.4. mark the join node of PS as an end node;
   5.5. for any parallel path of PS, do the following:
        5.5.1. disconnect all the other parallel paths of PS;
        5.5.2. FLOW-ANALYSIS(PS);
        5.5.3. connect all the other parallel paths back to PS;
   5.6. mark the start node of PS as a fork node;
   5.7. mark the end node of PS as a join node;
   5.8. mark all the outgoing flows of the fork node of PS as dependent flows;
   5.9. connect PS back to G;
6. delete flow e0 from G.

Once the number of times each operation is performed in a parallel computation is determined, the next task is to derive the time cost of the computation. One time cost analysis technique for parallel computations we developed [24] assumes that operations performed in parallel do not communicate with each other. Here, we briefly describe how the technique works. Interested readers can refer to [24] for more details.

The main idea of the technique is to replace each parallel structure with a single operation node and assign the time cost of the parallel structure to that node. After all the replacements, the computation becomes sequential and the time cost techniques for sequential computations can then be used to derive the time cost. Before the technique derives the time cost from a parallel computation, it first determines the amount of processing power each node in the computation receives and then it uses Theorem 1 to calculate the time cost of each node. Given a control graph G of a parallel computation C, suppose the processing power and the allocation policy passed to it are P and A, respectively. The technique uses the following recursive procedure ALLOCATE(G, P, A) to determine the processing power each node receives.

ALLOCATE(G, P, A):
1. for each node in G, if it is not in a parallel structure, the processing power it receives is P;
2. for each parallel structure PS which is not contained in another parallel structure, do the following:
   2.1. the processing power the fork node receives is P;
   2.2. the processing power the join node receives is P;
   2.3. for each parallel path PATH of the parallel structure PS, ALLOCATE(PATH, ALLOCATE1(PS, PATH, P, A), A).

Function ALLOCATE1 returns the processing power the activation signal which traverses the parallel path PATH should carry. This depends on the properties of the parallel structure PS and the parallel path PATH, the processing power available at the fork node, and the allocation policy A.

Given a control graph G of a parallel computation, suppose the processing power and the allocation policy passed to it are P and A, respectively. The time cost analysis technique derives the time cost in the following way:

A. determine the flow count of each dependent flow in G;
B. ALLOCATE(G, P, A);
C. determine the time cost of each node in G;
D. for each parallel structure PS in G, if the incoming flow of its fork node is 0, then replace PS with an operation node and set the time cost of the node to 0;
E. Reanalysis ← 0;
F. Cost ← DERIVE-COST(G);

In the above, Reanalysis is an internal variable that indicates whether a reanalysis operation on flows needs to be performed, and DERIVE-COST is a recursive function that works as follows:

DERIVE-COST(G):
1. Cost' ← 0;
2. if a parallel structure PS does not contain any loop and it contains a decision node V whose incoming flow E is not equal to 0, then
   2.1. set V to an operation node;
   2.2. Reanalysis ← 1;
   2.3. for each of V's outgoing flows E_i, do the following:
        2.3.1. set E_i to a dependent flow;
        2.3.2. G' ← G − {all V's other outgoing flows};
        2.3.3. remove all unreachable flows and nodes from G';
        2.3.4. Cost' ← Cost' + DERIVE-COST(G') * E_i / E;
   else
   2.4. if Reanalysis = 1, do the following:
        determine the flow count of each dependent flow in G;
   2.5. for each parallel structure PS which is not contained in another parallel structure, do the following:
        2.5.1. divide each flow in PS by the incoming flow of PS's fork node;
        2.5.2. replace PS with an operation node and assign the time cost of PS to the node;
   2.6. determine the time cost of G and assign it to Cost';
3. return Cost'.

The time cost analysis of a computation that contains operations communicating with each other is very complicated. This is because the time cost of a lock node depends on the waiting time of the lock node which, in turn, depends on the locks currently held in the computation. A general analytical approach to derive the time cost of a computation with lock nodes is very difficult to develop because it requires keeping track of the detailed history of the locking and the unlocking processes. For this reason, only some special cases are considered in this section.

Given a lock node lock_i, define its write lock set, W(lock_i), as the set of data on which lock_i requires write locks; and define its read lock set, R(lock_i), as the set of data on which lock_i requires read locks. For example, if lock_i requires write locks on X and Y, and a read lock on Z, then W(lock_i) = {X, Y} and R(lock_i) = {Z}.

Fig. 5. Second example of parallel structure.

Definition 1: Two lock nodes, lock_i and lock_j, are in conflict if the following condition is satisfied:

(W(lock_i) ∩ W(lock_j)) ∪ (W(lock_i) ∩ R(lock_j)) ∪ (R(lock_i) ∩ W(lock_j)) ≠ {}.

Definition 2: Two lock nodes are in parallel if they are in different paths of a parallel structure.

Definition 3: Two lock nodes are independent of each other if they are not in conflict; or if they are in conflict, but are not in parallel.

Theorem 2: Given the parallel structure in Fig. 5. Assume that all the op_i1 nodes have the same time costs, any two lock nodes are in conflict, and any lock node is independent of any lock node outside the parallel structure. For brevity, let

time cost of op_i1 = C1_i
time cost of op_i2 = C2_i
time cost of op_i3 = C3_i
time cost of lock_i = CL_i
time cost of unlock_i = CU_i
time cost of fork = CF
time cost of join = CJ.

The time cost of this parallel structure is then equal to

CF + (1/n!) Σ_{i=1}^{n!} max_{1≤k≤n} { C1_{i_k} + Σ_{h=1}^{k} (CL_{i_h} + C2_{i_h} + CU_{i_h}) + C3_{i_k} } + CJ

where (i_1, i_2, ..., i_n) is the ith permutation of (1, 2, ..., n). The proof of Theorem 2 is given in Appendix 2.

Theorem 3: Given the parallel structure in Fig. 5. Such a parallel structure has all the properties of the computation as given in Theorem 2 except that the time costs of all op_i1 nodes are different. Without loss of generality, assume C1_1 < ... < C1_i < ... < C1_n. The time cost of the parallel structure is then equal to

CF + max { C1_i + Σ_{h=i}^{j} (CL_h + C2_h + CU_h) + C3_j } + CJ

where 1 ≤ i ≤ j and 1 ≤ j ≤ n. The proof of Theorem 3 is given in Appendix 3.
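The average in Theorem 2 can be cross-checked by brute force. The Python sketch below (our own; the cost values are arbitrary) serializes the conflicting critical sections in every one of the n! equally likely grant orders, as the locking model prescribes, and averages the resulting time costs of the structure.

    # Brute-force sketch: average time cost of a Fig. 5 style structure when all the
    # op_i1 nodes have the same cost, so every lock-grant order is equally likely.
    from itertools import permutations
    from math import factorial

    def average_cost(CF, C1, CL, C2, CU, C3, CJ):
        # C1: common cost of the op_i1 nodes; CL, C2, CU, C3: per-path lists
        n = len(CL)
        total = 0.0
        for order in permutations(range(n)):
            finish = C1                       # all lock requests arrive together
            path_end = [0.0] * n
            for k in order:                   # conflicting critical sections serialize
                finish += CL[k] + C2[k] + CU[k]
                path_end[k] = finish + C3[k]
            total += max(path_end)
        return CF + total / factorial(n) + CJ

    print(average_cost(CF=1, C1=2, CL=[1, 1], C2=[3, 5], CU=[1, 1], C3=[2, 2], CJ=1))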

The parallel structure in Fig. 5 contains only one group of lock nodes in the sense that all these lock nodes are mutually in conflict. Theorems 2 and 3 can be generalized so that they can be used to derive the time cost of a computation which has n groups of lock nodes where each group's lock nodes are mutually in conflict. Corollaries 2 and 3 show how to derive the time cost of a parallel structure that contains two groups of lock nodes.

Corollary 2: Given a parallel structure as in Fig. 6. Assume all lock nodes in it can be divided into two groups such that any two lock nodes in the same group are in conflict; any two lock nodes in different groups are not in conflict; all the op_i1 nodes have the same time costs, and all the op_i4 nodes have the same time costs. Assume any lock node in the parallel structure is independent of any lock node outside the parallel


structure. For brevity, let


time cost of op_i1 = C1_i
time cost of op_i2 = C2_i
time cost of op_i3 = C3_i
time cost of op_i4 = C4_i
time cost of op_i5 = C5_i
time cost of op_i6 = C6_i
time cost of lock_i = CL1_i
time cost of lock'_i = CL2_i
time cost of unlock_i = CU1_i
time cost of unlock'_i = CU2_i
time cost of fork = CF
time cost of join = CJ.

The time cost of the parallel structure is then equal to

CF + (1/(n! m!)) Σ_{i=1}^{n!} Σ_{j=1}^{m!} max { max_{1≤k≤n} [ C1_{i_k} + Σ_{h=1}^{k} (CL1_{i_h} + C2_{i_h} + CU1_{i_h}) + C3_{i_k} ], max_{1≤l≤m} [ C4_{j_l} + Σ_{h=1}^{l} (CL2_{j_h} + C5_{j_h} + CU2_{j_h}) + C6_{j_l} ] } + CJ

where (i_1, i_2, ..., i_n) is the ith permutation of (1, 2, ..., n), (j_1, j_2, ..., j_m) is the jth permutation of (1, 2, ..., m), 1 ≤ k ≤ n and 1 ≤ l ≤ m. The proof of Corollary 2 is similar to the proof of Theorem 2.

Corollary 3: Given a parallel structure as in Fig. 6. Such a parallel structure has all the properties as the one given in Corollary 2 except that all the op_i1 nodes have different time costs, and all the op_i4 nodes have different time costs. Without loss of generality, assume C1_1 < ... < C1_i < ... < C1_n and C4_1 < ... < C4_j < ... < C4_m. The time cost of the parallel structure is then equal to

CF + max { max_{1≤i≤j≤n} [ C1_i + Σ_{h=i}^{j} (CL1_h + C2_h + CU1_h) + C3_j ], max_{1≤k≤l≤m} [ C4_k + Σ_{h=k}^{l} (CL2_h + C5_h + CU2_h) + C6_l ] } + CJ

where 1 ≤ i ≤ j, 1 ≤ j ≤ n, 1 ≤ k ≤ l, and 1 ≤ l ≤ m. The proof of Corollary 3 is similar to the proof of Theorem 3.

Theorems 2 and 3 can also be combined. Theorem 4 shows one possible combination.

Fig. 6. Third example of parallel structure.

Theorem 4: Given a parallel structure as in Fig. 6. Such a parallel structure has all the properties as the one given in Corollary 2 except that any two lock nodes in the parallel structure are in conflict. Besides, the time cost of op_i1 is different from the time cost of op_j4, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. Without loss of generality, assume C1_i < C4_j, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. The time cost of the parallel structure is then equal to

where (i_1, i_2, ..., i_n) is the ith permutation of (1, 2, ..., n), 1 ≤ k ≤ n, (j_1, j_2, ..., j_m) is the jth permutation of (1, 2, ..., m), and 1 ≤ l ≤ m. The proof of Theorem 4 is similar to the proofs of Theorems 2 and 3.

To illustrate how the techniques discussed in this section can be used to derive time costs, consider the problem of adding n numbers. These n numbers are stored in an array data and the sum will be stored in the variable total. One way to add these n numbers is to have m adders acting in parallel (assume n ≥ m). Each adder adds n/m numbers and puts the subtotal into a local variable. All the subtotals are then added to total to get the result. Fig. 7 illustrates the general structure of the computation and Fig. 8 illustrates the computation for the ith adder (1 ≤ i ≤ m).



The number of times each node is executed can be found using the flow analysis technique as discussed at the beginning of this section:

init0: 1
fork: 1
join: 1
init_i: 1
D_i: n/m + 1
add_i: n/m
inc_i: n/m
lock_i: 1
store_i: 1
unlock_i: 1.

Suppose P processing power is provided and processing power is equally allocated. Thus, each parallel path gets P/m processing power and the time cost of each node is then equal to

Fig. 7. The computation to add n numbers.

init0: C0 / min(1, P)
fork: Cf / min(1, P)
join: Cj / min(1, P)
init_i: m * Cinit / min(m, P)
D_i: m * Cd / min(m, P)
add_i: m * Ca / min(m, P)
inc_i: m * Cinc / min(m, P)
lock_i: m * Cl / min(m, P)
store_i: m * Cs / min(m, P)
unlock_i: m * Cu / min(m, P).

Fig. 8. The computation of the ith adder.

All the operations in the computation are specified as follows:

init0 : total ← 0
init_i : k[i] ← i; subtotal[i] ← 0
D_i : k[i] ≤ n
add_i : subtotal[i] ← subtotal[i] + data[k[i]]
inc_i : k[i] ← k[i] + m
lock_i : write lock on total
store_i : total ← total + subtotal[i]
unlock_i : write lock on total.

Assume the base time costs of start, end, and or nodes are zero, all other nodes are sequential and their base time costs are as follows:

init0: C0
fork: Cf
join: Cj
init_i: Cinit
D_i: Cd
add_i: Ca
inc_i: Cinc
lock_i: Cl
store_i: Cs
unlock_i: Cu.

Let ⌊x⌋ be the largest integer that is less than or equal to x. Carefully examining the computation, one can find for the first n − m*⌊n/m⌋ adders, each adder adds ⌊n/m⌋ + 1 numbers, and for the last m − n + m*⌊n/m⌋ adders, each one adds ⌊n/m⌋ numbers. Therefore, for 1 ≤ i ≤ (n − m*⌊n/m⌋), the body of the loop in the ith adder is executed ⌊n/m⌋ + 1 times, and for (n − m*⌊n/m⌋ + 1) ≤ i ≤ m, the body of the loop in the ith adder is executed ⌊n/m⌋ times. Replace the init node and the loop in the ith adder with a single node op_i, and assign the sum of their time costs to op_i. For 1 ≤ i ≤ (n − m*⌊n/m⌋), the time cost of op_i is

m*Cinit / min(m, P) + (⌊n/m⌋ + 2) * m*Cd / min(m, P) + (⌊n/m⌋ + 1) * m*(Ca + Cinc) / min(m, P)
= m*{Cinit + (⌊n/m⌋ + 2)*Cd + (⌊n/m⌋ + 1)*(Ca + Cinc)} / min(m, P).

Similarly, for (n − m*⌊n/m⌋ + 1) ≤ i ≤ m, the time cost of op_i is

m*{Cinit + (⌊n/m⌋ + 1)*Cd + ⌊n/m⌋*(Ca + Cinc)} / min(m, P).

The new computation structure for the ith adder after replacement is illustrated in Fig. 9. One can find all these m adders can be divided into two groups. The time costs of any two op_i nodes in the same group are the same, the time costs of any two op_i nodes in different groups are different, and the lock node in each adder requires the same locks. Thus, the



Fig. 9. New structure of the ith adder.

whole computation has the same structure as the computation dealt with in Theorem 4, and Theorem 4 can then be used to derive the time cost.

Let Cop1 be the time cost of op_i where 1 ≤ i ≤ (n − m*⌊n/m⌋), Cop2 be the time cost of op_i where (n − m*⌊n/m⌋ + 1) ≤ i ≤ m, the time cost of lock_i be Clock, the time cost of store_i be Cstore, and the time cost of unlock_i be Cunlock. It can be seen that Cop2 < Cop1. The time cost of the whole computation then follows from Theorem 4, where x = (m − n + m*⌊n/m⌋), y = (n − m*⌊n/m⌋), 1 ≤ k ≤ x, and 1 ≤ l ≤ y.

As a special case, assume n is divisible by m. Then ⌊n/m⌋ = n/m, x = m, y = 0, and the above time cost expression becomes

C0 / min(1, P) + Cf / min(1, P) + m*{Cinit + (n/m + 1)*Cd + (n/m)*(Ca + Cinc)} / min(m, P) + m*m*(Cl + Cs + Cu) / min(m, P) + Cj / min(1, P).

Let P ≥ m (note, m ≥ 1). That is, each adder gets at least unity processing power. The time cost is then equal to

C0 + Cf + Cj + Cinit + (n/m + 1)*Cd + (n/m)*(Ca + Cinc) + m*(Cl + Cs + Cu).

To find the number of adders that minimizes the time cost, take the derivative with respect to m:

-(n/m²)*(Cd + Ca + Cinc) + (Cl + Cs + Cu).

Set the above expression to zero and solve it for m. We get

m = √(n*(Cd + Ca + Cinc) / (Cl + Cs + Cu)).

That is, given that n is divisible by m and each adder gets at least unity processing power, the time cost can be minimized if there are √(n*(Cd + Ca + Cinc)/(Cl + Cs + Cu)) adders. This result is consistent with the result of the macro analysis (the best complexity can be obtained if m = √n) [14]. However, the result obtained here is more accurate and it provides more insight to help improve the time cost behavior. For example, by using the result of the micro time cost analysis, one can find that as Cd increases, more adders are needed to achieve the best performance; and as Cl increases, fewer adders are needed. Such useful information is not available if one uses the macro time cost analysis.
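A quick numerical check of this result is easy to set up. The Python sketch below (our own, with made-up cost constants) evaluates the simplified expression for the case of n divisible by m with P ≥ m, and compares the integer minimizer with the closed form √(n*(Cd + Ca + Cinc)/(Cl + Cs + Cu)).

    # Numerical check of the optimal adder count (all constants are hypothetical).
    from math import sqrt

    def cost(m, n, C0, Cf, Cj, Cinit, Cd, Ca, Cinc, Cl, Cs, Cu):
        # simplified expression for n divisible by m and P >= m (each adder has a processor)
        return (C0 + Cf + Cj + Cinit + (n / m + 1) * Cd
                + (n / m) * (Ca + Cinc) + m * (Cl + Cs + Cu))

    n, Cd, Ca, Cinc, Cl, Cs, Cu = 10_000, 1.0, 2.0, 1.0, 5.0, 3.0, 2.0
    m_star = sqrt(n * (Cd + Ca + Cinc) / (Cl + Cs + Cu))
    best = min(range(1, 1001), key=lambda m: cost(m, n, 1, 1, 1, 1, Cd, Ca, Cinc, Cl, Cs, Cu))
    print(m_star, best)   # the integer minimizer lies next to the closed-form value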

VII. TIME COST ANALYSIS SYSTEM (TCAS)

To automate the micro time cost analysis of parallel computations, a software tool, TCAS (time cost analysis system), was fully designed and implemented. TCAS is an interactive system. It is implemented in C and runs under the Unix system. A set of commands is provided in TCAS for users to carry out time cost analyses and evaluations. Generally speaking, what the TCAS system does is to take a user's computation as input and determine its time cost behavior. The modified computation structure model discussed in Sections IV and V is used as the underlying model in TCAS. Techniques discussed in Section VI are used in TCAS to derive time costs. Three parameters, the number of processors, the processor speed, and the allocation policy, are used in TCAS to characterize the property of the underlying computer system that supports the execution of the computation to be analyzed. The user can use some TCAS commands (chg command) to change the characteristics of the underlying computer system. By changing the characteristics of the underlying computer system, the time cost behavior of the user's computation changes.

Fig. 10 illustrates the general structure of the TCAS system. The user interface provides interaction between the TCAS system and users. It interprets user commands and calls corresponding system modules to fulfill the user's task. Four types of commands are provided in TCAS: editing commands, library manipulation commands, time cost analysis commands, and some miscellaneous commands (for example, exit from TCAS, change the properties of the underlying computer system, etc.).

Fig. 10. General structure of the TCAS system (user interface, syntax checker, simulation subsystem, cost expression deriver, user library, system library).

The following shows all the TCAS commands.

chg: change analysis mode or property of underlying computer system
copy: copy computation, distribution, or file
cost: calculate cost
del: delete computation, distribution, or file
derc: derive computation
disp: display cost expression, distribution, or machine properties
edit: edit file, computation, or distribution
exit: exit from TCAS
graph: plot cost curve
histo: generate histogram output
list: list files, library, or system status
max: determine maximum cost
mean: determine average cost
min: determine minimum cost
move: rename computation, distribution, or file
var: determine cost variance.

The user's computation to be analyzed is written in one of the languages supported by TCAS. Currently, TCAS supports one language that is a subset of Pascal but has extra features to support parallel computing. Since the only language dependent components in TCAS are the syntax checker and the computation structure deriver, TCAS can be easily expanded or redesigned to support other languages.

The user can use the editing facility in TCAS (edit command) to prepare the computation to be analyzed. Once the user's computation is created, it is first checked by the syntax checker. The syntactically correct computation is then stored in the library for future analyses and evaluations.

To do time cost analysis and evaluation on a computation, the computation is retrieved from the user library and the computation structure deriver derives from it the corresponding computation structure. The computation structure derived is simply a graph; it is language independent and it contains all the information necessary to carry out time cost analysis. Typically, each node in the computation structure has a base time cost, each edge has a flow count that indicates the number of times activation signals will traverse the edge, and if a node is a lock or an unlock node, what and how many locks are to be obtained or released in that node are also specified.

Once the computation structure is derived, the flow analysis technique discussed in Section VI is used in the flow balance analyzer to determine the number of times each operation in the computation is performed.

TCAS can work either in the simulation mode or in the analytic mode. Users can switch from one mode to another using the chg command. Two subsystems, the simulation subsystem and the analytic subsystem, are used to support these two modes. Both subsystems take the computation structure with flows analyzed as input and return the corresponding time cost.

The analytic subsystem uses the analytic approach to determine the time cost of the computation to be analyzed. Given the computer system in which the user's computation resides, the cost expression deriver applies the time cost analysis technique discussed in Section VI on the computation structure to get an expression called a cost expression. This expression describes the time cost behavior of the computation and it is a function of some parameters that represent the properties of the input to the computation. The cost expression so derived is quite complex. To reduce the amount of future evaluation work, the cost expression is symbolically simplified by the cost expression simplifier, and the simplified cost expression is then processed by the cost expression evaluator to obtain the time cost of the computation.

The simulation subsystem uses the simulation technique to determine the time cost of the computation to be investigated. What the simulation subsystem does is to simulate the traveling of the activation signal in the computation to be investigated using the modified computation structure model discussed in Sections IV and V. During the simulation, an activation signal carries three attributes, a time cost attribute, a processing power attribute, and an allocation policy attribute. Locks of all the variables in the computation are initialized before simulation. The simulation process starts when an activation signal enters the start node of the computation. The time cost attribute of this activation signal is initialized to zero; the processing power and allocation policy attributes of the signal are initialized according to the characteristics of the execution environment in which the computation resides. Every time an activation signal passes a node, the time cost of the node is added to the time cost attribute of the signal according to the definition in the model. When an activation signal arrives at the end node of the computation, the simulation process terminates and the time cost attribute of the signal is then the time cost of the computation. During the simulation, a random number generator is used at decision nodes to


determine which outgoing edge an activation signal should take; activation signals are created at fork nodes to simulate parallel execution, locks are obtained and released at lock nodes and unlock nodes, respectively, and a lock waiting queue is used to hold those signals that wait for locks. When more than one signal requires conflict locks at the same time, a random number generator is used to determine which signal gets the required locks first and all the other signals are put into the lock waiting queue.
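The following toy Python sketch (our own, not TCAS code) shows the core of this simulation idea for a purely sequential control graph: a single activation signal walks the graph, its time cost attribute accumulates each node's cost, and a random number selects the outgoing edge at the decision node. The graph and branch probabilities are hypothetical.

    # Toy sketch of the simulation idea for a sequential control graph.
    import random

    # node -> (time cost, list of (successor, probability))
    graph = {
        "start": (0.0, [("op1", 1.0)]),
        "op1":   (1.0, [("D", 1.0)]),
        "D":     (0.5, [("body", 0.9), ("end", 0.1)]),   # loop back with probability 0.9
        "body":  (2.0, [("D", 1.0)]),
        "end":   (0.0, []),
    }

    def simulate(graph, seed=0):
        random.seed(seed)
        node, time_cost = "start", 0.0        # the signal's time cost attribute
        while True:
            cost, succs = graph[node]
            time_cost += cost
            if not succs:
                return time_cost
            r, acc = random.random(), 0.0
            for nxt, p in succs:
                acc += p
                if r < acc:
                    node = nxt
                    break

    print(simulate(graph))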

The time cost evaluator contains a number of functions to evaluate the user's computation. It can determine the minimum time cost, the maximum time cost, the average time cost, and the time cost variance. It can also plot the time cost curve and the time cost distribution.

To illustrate the use of the TCAS system, consider the classical producer-consumer problem. Assume there are one producer and one consumer. The producer produces n products and puts the products into a buffer, while the consumer gets these products from the buffer and consumes them. However, the producer and the consumer cannot access the buffer at the same time, the buffer can only hold ten products, the producer cannot put a product into the buffer when the buffer is full, and the consumer cannot get a product from the buffer when the buffer is empty. The following is the corresponding computation which is written in a Pascal-like language.

computation (n: integer)
{@ perf(n, full, empty) }
var count, i, next, j, first, produce, consume: integer;
    buffer: array [1..10] of integer;
begin
  count := 0;
  fork
    begin { ............... producer }
      next := 1;
      for {@ n } i := 1 to n do
      begin
        produce := i; { ............... produces product }
        lock(buffer, count: write);
        while {@ full } count = 10 do
        begin
          unlock(buffer, count: write);
          lock(buffer, count: write)
        end;
        count := count + 1;
        buffer[next] := produce; { ............... puts product into buffer }
        unlock(buffer, count: write);
        next := next - (next/10) * 10 + 1
      end
    end;
    begin { ............... consumer }
      first := 1;
      for {@ n } j := 1 to n do
      begin
        lock(buffer: read, count: write);
        while {@ empty } count = 0 do
        begin
          unlock(buffer: read, count: write);
          lock(buffer: read, count: write)
        end;
        count := count - 1;
        consume := buffer[first]; { ............... gets product from buffer }
        unlock(buffer: read, count: write);
        consume := consume + j; { ............... consumes product }
        first := first - (first/10) * 10 + 1
      end
    end
  join
end.

In the above computation, the fork-join statement is used to create two parallel paths that correspond to the producer and the consumer. The lock and unlock statements are used to obtain and release locks, respectively. Special comment segments that start with "@" signs are used to provide information needed for time cost analysis. The first special comment segment indicates the computation has three performance parameters, n, full, and empty, with n indicating the number of products to be produced or consumed, full indicating the number of times the producer finds the buffer full when it needs to put products into the buffer, and empty indicating the number of times the consumer finds the buffer empty when it needs to consume products. For each for loop statement or while loop statement, a special comment segment is used to indicate the number of times the body of the loop is executed. Statement produce := i in the producer is used to simulate the product producing process, while statement consume := consume + j in the consumer simulates the product consuming process.

Assume the underlying computer system has one Vax processor and processing power is equally allocated. Assume the number of products to be produced is 8 and the consumer is always slower than the producer. Thus, n = 8, full = 0, and empty = 0. In the following, the chg command is used to set the execution environment and the cost command determines the time cost (units are microseconds):

⇒ chg p 1
⇒ chg ma Vax
⇒ chg a equal
⇒ cost produce_consume(8, 0, 0)
produce_consume : computation structure derived
time cost of produce_consume : 1059.00.

Assume the underlying computer system has two PDP-11 processors and processing power is equally allocated. Assume the number of products to be produced is 50 and the consumer is always slower than the producer. Thus, n = 50 and empty = 0. Further assume the number of times the producer finds the buffer full varies from 0 to 100 and it is binomially distributed with p = 0.6. The following TCAS commands set the assumptions and determine the average time cost


(command mean determines the average time cost):

+ chg p 2
+ chg ma PDP11
+ mean binomial produce_consume(50, 0:100, 0)
produce_consume : computation structure derived
value for parameter p : 0.6
average time cost of produce_consume : 16950.69.
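The reported average can be read as an expectation over the distributed parameter. A sketch of that calculation, assuming a hypothetical cost(n, full, empty) function that returns the time cost for fixed parameter values:

    from math import comb

    def average_cost(cost, n=50, empty=0, trials=100, p=0.6):
        """Expected time cost when the 'full' parameter is binomially
        distributed over 0..trials with success probability p."""
        return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
                   * cost(n, k, empty)
                   for k in range(trials + 1))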

Assume the underlying computer system is the same as before. Assume the number of products to be produced is 20, the consumer is always faster than the producer, and the number of times the consumer finds the buffer empty is 45. Thus, n = 20, full = 0, and empty = 45. The corresponding time cost is

+ cost produce_consume(20, 0, 45)
produce_consume : computation structure derived
time cost of produce_consume : 9137.13.

Assume the underlying computer system has the same properties as before. Assume the number of products to be produced is 30, the number of times the producer finds the buffer full is 40, and the number of times the consumer finds the buffer empty is 50. The corresponding time cost is

+ cost produce_consume(30, 40, 50)
produce_consume : computation structure derived
time cost of produce_consume : 13037.31.

As another example, consider the matrix multiplication computation, matrixmulti, that performs z := x * y, where x, y, and z are n x n matrices.

computation (n: integer; x, y, z: array [1..100, 1..100] of real)
var i1, i2, i3, i4, j1, j2, j3, j4, k1, k2, k3, k4, n2, n21: integer;
{@ perf(n) }
begin
  n2 := n/2; n21 := n2 + 1;
  fork
    for {@ n/2 } i1 := 1 to n2 do            { ............ first parallel path }
      for {@ n*n/4 } j1 := 1 to n2 do
        begin
          z[i1, j1] := 0.0;
          for {@ n*n*n/4 } k1 := 1 to n do
            z[i1, j1] := z[i1, j1] + x[i1, k1] * y[k1, j1]
        end;
    for {@ n/2 } i2 := 1 to n2 do            { ............ second parallel path }
      for {@ n*n/4 } j2 := n21 to n do
        begin
          z[i2, j2] := 0.0;
          for {@ n*n*n/4 } k2 := 1 to n do
            z[i2, j2] := z[i2, j2] + x[i2, k2] * y[k2, j2]
        end;
    for {@ n/2 } i3 := n21 to n do           { ............ third parallel path }
      for {@ n*n/4 } j3 := 1 to n2 do
        begin
          z[i3, j3] := 0.0;
          for {@ n*n*n/4 } k3 := 1 to n do
            z[i3, j3] := z[i3, j3] + x[i3, k3] * y[k3, j3]
        end;
    for {@ n/2 } i4 := n21 to n do           { ............ fourth parallel path }
      for {@ n*n/4 } j4 := n21 to n do
        begin
          z[i4, j4] := 0.0;
          for {@ n*n*n/4 } k4 := 1 to n do
            z[i4, j4] := z[i4, j4] + x[i4, k4] * y[k4, j4]
        end
  join
end.

Four parallel execution paths are created in the computation, with each parallel path calculating an n/2 x n/2 block of the final result. The first special comment segment in the computation indicates the matrixmulti computation has one performance parameter, n, that specifies the number of rows or columns in each matrix. The number of times the body of the innermost loop in each parallel path is executed is n*n*n/4 (since each parallel path performs n*n*n/4 calculations).

Suppose there is one Vax processor and processing power is equally allocated. The disp command in the following is used to obtain the corresponding time cost expression:

+ chg p 1
+ chg ma Vax
+ chg a equal
+ disp c matrixmulti
matrixmulti(n) = 41.05 + 28.28 * n * n + 67.09 * n * n * n.

Changing the number of processors from 1 to 4, we get

+ chg p 4
+ disp c matrixmulti
matrixmulti(n) = 22.39 + 7.07 * n + 7.02 * n * n + 16.77 * n * n * n.

Assume there are seven Vax processors; the corresponding time cost expression becomes

+ chg p 7
+ disp c matrixmulti
matrixmulti(n) = 22.39 + 7.07 * n + 7.02 * n * n + 16.77 * n * n * n.

One can see from the above that the time cost of the matrixmulti computation decreases as the number of processors increases. However, the time cost remains the same when there are more than four processors. This is because there are four parallel paths in the computation and processing power is equally allocated.
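As a rough consistency check on these expressions (our arithmetic, not an additional TCAS result), the dominant cubic term scales with the number of parallel paths that can be served concurrently:

\[
\frac{67.09\,n^{3}}{4} \approx 16.77\,n^{3},
\]

i.e., with four or more equally allocated processors each of the four parallel paths effectively holds one processor's worth of power, so the dominant term drops by a factor of four and cannot drop further.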


VIII. CONCLUSION

We discussed in this paper the micro time cost analysis of parallel computations in a shared memory environment. Models and techniques were developed to describe and analyze the time cost behavior. Based on the models and techniques developed, a time cost analysis tool, TCAS, was designed and implemented. User experience with TCAS has been favorable, since TCAS provides useful information for users to determine the characteristics of parallel computations, which, in turn, allows users to decide what changes should be made in a design to improve its time cost behavior. With this tool, users can quickly compare the designs of parallel computations under different circumstances, choose the one with the best time cost behavior as the best design, and document its performance behavior as the system parameters vary.

APPENDIX 1 PROOF OF THEOREM 1

Suppose each $op_i$ node in Fig. 2 gets $P_i$ processing power. If processing power is allocated in the way given in Theorem 1, the following must be true:

\[ \sum_{i=1}^{n} P_i = P, \qquad 0 < P_i \le P. \]

To minimize the time cost of the parallel structure, it is equivalent to minimize

\[ \max\{\mathit{Cost}_{op_1}(P_1, A), \ldots, \mathit{Cost}_{op_n}(P_n, A)\}. \qquad (1) \]

Let $C = \max\{C_1, \ldots, C_i, \ldots, C_n\}$. First assume

\[ \frac{1}{C} \sum_{j=1}^{n} C_j \le P. \]

Then $C$ is the minimum value of expression (1), since the minimum time cost of each $op_i$ is simply its base time cost $C_i$, and some $op_i$ has $C_i = C$. If each $op_i$ is allocated $P_i = C_i / C$ processing power, then the time cost of $op_i$ is

\[ \mathit{Cost}_{op_i}(P_i, A) = \frac{C_i}{\min\{1, P_i\}} = \frac{C_i}{C_i / C} = C. \]

Since $C \ge C_i$, then $\mathit{Cost}_{op_i}(P_i, A) = C$; thus $C_i / C$ is the minimum processing power that allows $op_i$ to have time cost $C$. Under the allocation given in Theorem 1, each $op_i$ gets

\[ P_i = \frac{C_i}{\sum_{j=1}^{n} C_j} \, P \ge \frac{C_i}{C}. \]

Thus, each $op_i$ gets just sufficient processing power to obtain the minimum value of expression (1). Therefore, the time cost of the parallel structure in Fig. 2 is then minimized.

Next, assume

\[ P < \frac{1}{C} \sum_{j=1}^{n} C_j. \qquad (2) \]

If processing power is allocated in the way given in Theorem 1, each $op_i$ gets the following amount of processing power:

\[ P_i = \frac{C_i}{\sum_{j=1}^{n} C_j} \, P. \]

Since $op_i$ then gets less than unity processing power, the time cost of $op_i$ becomes

\[ \mathit{Cost}_{op_i}(P_i, A) = \frac{C_i}{P_i} = \frac{\sum_{j=1}^{n} C_j}{P}. \]

Since each $op_i$ has the same time cost, the value of expression (1) is $\sum_{j=1}^{n} C_j / P$. If processing power is not allocated in the way shown in Theorem 1, then there must exist an $op_k$ which gets $P_k$ processing power where

\[ P_k < \frac{C_k}{\sum_{j=1}^{n} C_j} \, P < 1. \]

For such an $op_k$, its time cost is

\[ \mathit{Cost}_{op_k}(P_k, A) = \frac{C_k}{P_k} > \frac{\sum_{j=1}^{n} C_j}{P}, \]

so the value of expression (1) is greater than $\sum_{j=1}^{n} C_j / P$. This indicates that if $P$ processing power is provided and $P$ satisfies (2), $\sum_{j=1}^{n} C_j / P$ is the minimum value of expression (1), and such a minimum value can be obtained only if processing power is allocated in the way illustrated in Theorem 1.
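A small Python sketch of the proportional allocation analyzed above (the function name and the sample values are ours; the per-operation cost $C_i / \min\{1, P_i\}$ is the form used in the proof):

    def proportional_allocation(base_costs, P):
        """Allocate total processing power P in proportion to each op's base
        time cost C_i, and return (allocation, resulting structure time cost),
        using the per-operation cost C_i / min(1, P_i) assumed in the proof."""
        total = sum(base_costs)
        alloc = [P * c / total for c in base_costs]
        cost = max(c / min(1.0, p_i) for c, p_i in zip(base_costs, alloc))
        return alloc, cost

    # For base costs (3, 5, 2):
    #   P = 1 (scarce power) -> structure cost 10.0 = (3 + 5 + 2) / 1
    #   P = 4 (ample power)  -> structure cost  5.0 = max(3, 5, 2)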

APPENDIX 2 PROOF OF THEOREM 2

Assume $(i_1, i_2, \ldots, i_n)$ is a permutation of $(1, 2, \ldots, n)$. Suppose the $n$ lock nodes obtain locks in the order $(i_1, i_2, \ldots, i_n)$, and consider the lock node $lock_{i_j}$. Since all the $op_{i_k}$ nodes have the same time costs and $lock_{i_1}, lock_{i_2}, \ldots, lock_{i_{j-1}}$ obtain locks before $lock_{i_j}$, the time cost of the parallel path which contains $lock_{i_j}$ is

\[ C1_{i_j} + \sum_{k=1}^{j} \bigl( CL_{i_k} + C2_{i_k} + CU_{i_k} \bigr) + C3_{i_j}. \]

Therefore, the time cost of the parallel structure, given that the lock nodes obtain locks in the order $(i_1, i_2, \ldots, i_n)$, is

\[ CF + \max_{1 \le j \le n} \Bigl\{ C1_{i_j} + \sum_{k=1}^{j} \bigl( CL_{i_k} + C2_{i_k} + CU_{i_k} \bigr) + C3_{i_j} \Bigr\} + CJ. \]

Since it is equally likely that a lock node will get the required locks first, the probability that the $n$ lock nodes obtain locks in the order $(i_1, i_2, \ldots, i_n)$ is equal to $1/n!$. Since the average time cost is used as the time cost of the parallel structure, the time cost of the parallel structure is then equal to the average of the above expression over all $n!$ orders, i.e.,

\[ CF + \frac{1}{n!} \sum_{(i_1, \ldots, i_n)} \max_{1 \le j \le n} \Bigl\{ C1_{i_j} + \sum_{k=1}^{j} \bigl( CL_{i_k} + C2_{i_k} + CU_{i_k} \bigr) + C3_{i_j} \Bigr\} + CJ. \]
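A brute-force Python sketch of this averaging step (the function and parameter names are ours; C1, CL, C2, CU, C3 hold the per-path node costs):

    from itertools import permutations
    from math import factorial

    def average_structure_cost(C1, CL, C2, CU, C3, CF=0.0, CJ=0.0):
        """Average structure time cost over all equally likely lock orders,
        following the expression derived above."""
        n = len(C1)
        total = 0.0
        for order in permutations(range(n)):
            # Cost of the path holding the (j+1)-th lock in this order.
            path_costs = [
                C1[order[j]]
                + sum(CL[order[k]] + C2[order[k]] + CU[order[k]]
                      for k in range(j + 1))
                + C3[order[j]]
                for j in range(n)
            ]
            total += CF + max(path_costs) + CJ
        return total / factorial(n)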

APPENDIX 3 PROOF OF THEOREM 3

Since $C1_1 < \cdots < C1_i < \cdots < C1_n$, $lock_x$ obtains locks before $lock_y$ where $1 \le x < y \le n$. Let $j = 1$. Since $lock_1$ obtains locks first, the time cost of the parallel path containing $lock_1$ is

\[ C1_1 + CL_1 + C2_1 + CU_1 + C3_1. \]

When $j = m - 1$, the time cost of the parallel path containing $lock_{m-1}$ will be

\[ \max \Bigl\{ C1_{m-1},\ \max_{1 \le i \le m-2} \bigl\{ C1_i + \sum_{k=i}^{m-2} ( CL_k + C2_k + CU_k ) \bigr\} \Bigr\} + CL_{m-1} + C2_{m-1} + CU_{m-1} + C3_{m-1}. \]

Rewrite the above time cost as follows:

\[ \max_{1 \le i \le m-1} \Bigl\{ C1_i + \sum_{k=i}^{m-1} \bigl( CL_k + C2_k + CU_k \bigr) \Bigr\} + C3_{m-1}. \]

The above time cost means that once the activation signal enters the parallel path containing $lock_{m-1}$, the locks held by $lock_{m-1}$ are released by $unlock_{m-1}$ after

\[ \max_{1 \le i \le m-1} \Bigl\{ C1_i + \sum_{k=i}^{m-1} \bigl( CL_k + C2_k + CU_k \bigr) \Bigr\} \]

time units. Let $j = m$ and consider the time cost of the parallel path containing $lock_m$. Since $lock_{m-1}$ obtains locks before $lock_m$, $lock_m$ cannot start to work until the locks held by $lock_{m-1}$ are released by $unlock_{m-1}$. If, before $op_{m1}$ finishes working, $unlock_{m-1}$ has finished releasing the locks held by $lock_{m-1}$, then $lock_m$ can start to work immediately after $op_{m1}$ finishes working, and the time cost of the parallel path containing $lock_m$ is

\[ C1_m + CL_m + C2_m + CU_m + C3_m. \]

If, when $op_{m1}$ finishes working, $unlock_{m-1}$ has not finished releasing the locks held by $lock_{m-1}$ (or $unlock_{m-1}$ has not started working at all), then $lock_m$ can only start to work after $unlock_{m-1}$ finishes releasing locks, and the time cost of the parallel path containing $lock_m$ is

\[ \max_{1 \le i \le m-1} \Bigl\{ C1_i + \sum_{k=i}^{m-1} \bigl( CL_k + C2_k + CU_k \bigr) \Bigr\} + CL_m + C2_m + CU_m + C3_m. \]

Thus, in general, the time cost of the parallel path containing $lock_m$ is

\[ \max \Bigl\{ C1_m,\ \max_{1 \le i \le m-1} \bigl\{ C1_i + \sum_{k=i}^{m-1} ( CL_k + C2_k + CU_k ) \bigr\} \Bigr\} + CL_m + C2_m + CU_m + C3_m. \]

Therefore, the time cost of the parallel structure is

\[ CF + \max_{1 \le j \le n} \{ \text{time cost of the parallel path containing } lock_j \} + CJ
   = CF + \max_{1 \le j \le n} \Bigl\{ \max_{1 \le i \le j} \bigl\{ C1_i + \sum_{k=i}^{j} ( CL_k + C2_k + CU_k ) \bigr\} + C3_j \Bigr\} + CJ. \]
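The closed form at the end of the proof can be evaluated directly; a Python sketch under our naming of the per-path node costs, assuming locks are acquired in index order as stated:

    def ordered_structure_cost(C1, CL, C2, CU, C3, CF=0.0, CJ=0.0):
        """Structure time cost when lock_1, ..., lock_n obtain locks in index
        order (C1_1 < ... < C1_n), per the final expression above."""
        n = len(C1)
        path_costs = []
        for j in range(n):
            # max over i <= j of  C1_i + sum_{k=i..j} (CL_k + C2_k + CU_k)
            release = max(
                C1[i] + sum(CL[k] + C2[k] + CU[k] for k in range(i, j + 1))
                for i in range(j + 1)
            )
            path_costs.append(release + C3[j])
        return CF + max(path_costs) + CJ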

REFERENCES

[1] J. Cohen, “Computer-assisted microanalysis of programs,” Commun. ACM, vol. 25, no. 10, Oct. 1982.

[2] A. Aho, J. Hopcroft, and J. Ullman, Data Structures and Algorithms. Reading, MA: Addison-Wesley, 1983.

[3] T. Booth, R. Ammar, and R. Lenk, “An instrumentation system to measure user performance in interactive systems,” J. Syst. Software, vol. 1, no. 2, Dec. 1981.

[4] T. Booth, D. Zhu, M. Kim, B. Qin, and C. Albertoli, “PASS: A performance analysis software system to aid the design of high performance software,” in Proc. First Beijing Int. Conf. Comput. Appl., Beijing, June 1984.

[5] T. Cheatham, Jr., G. Holloway, and J. Townley, “Symbolic evaluation and the analysis of programs,” IEEE Trans. Software Eng., vol. SE-5, no. 4, July 1979.

[6] J. Cohen and N. Carpenter, “A language for inquiring about the run time behavior of programs,” Software Practice and Exper., vol. 7, no. 4, July-Aug. 1977.

[7] D. Ferrari and M. Liu, “A general purpose software measurement tool,” Software Practice and Exper., vol. 5, no. 2, Mar.-Apr. 1975.

[8] A. Gabrielian, L. McNamee, and D. Trawick, “The qualified function approach to analysis of program behavior and performance,” IEEE Trans. Software Eng., vol. SE-11, no. 8, Aug. 1985.

[9] M. Kim, “Automatic performance analysis of assembly level software,” M.S. thesis, Dep. Comput. Sci. Eng., Univ. of Connecticut, 1983.

[10] D. Knuth, The Art of Computer Programming, Vol. 1: Fundamental Algorithms, 2nd ed. Reading, MA: Addison-Wesley, 1973.

[11] L. Kronsjo, Computational Complexity of Sequential and Parallel Algorithms. New York: Wiley, 1985.

[12] H. Law, “Dynamic analysis and evaluation of software designs: An experimentation analysis approach,” M.S. thesis, Dep. Comput. Sci. Eng., Univ. of Connecticut, 1987.

[13] B. Qin, “PPAS: A performance analysis tool for software designs,” M.S. thesis, Dep. Comput. Sci. Eng., Univ. of Connecticut, 1984.

[14] M. Quinn, Designing Efficient Algorithms for Parallel Computers. New York: McGraw-Hill, 1987.

[15] H. Sholl and T. Booth, “Software performance modeling using computation structures,” IEEE Trans. Software Eng., vol. SE-1, no. 4, Dec. 1975.

[16] S. Simmons, “A software tool for the measurement of algorithmic performance,” M.S. thesis, Dep. Comput. Sci. Eng., Univ. of Connecticut, 1980.

[17] R. A. Ammar, J. Wang, and H. Sholl, “A graphical modeling technique for software performance analysis,” J. Inform. Software Technol., Dec. 1990.

[18] B. Wegbreit, “Verifying program performance,” J. ACM, vol. 23, no. 4, Oct. 1976.


[19] T. Wetmore IV, “Performance analysis and prediction from computer program source code,” Ph.D. dissertation, Dep. Comput. Sci. Eng., Univ. of Connecticut, 1983.

[20] T. Wetmore IV, “The performance compiler-A tool for software designs,” M.S. thesis, Dep. Comput. Sci. Eng., Univ. of Connecticut, 1980.

[21] R. Ammar and B. Qin, “An approach to derive the time costs of sequential computations,” J. Syst. Software, vol. 10, no. 3, Mar. 1990.

[22] A. Tanenbaum, Operating Systems: Design and Implementation. Englewood Cliffs, NJ: Prentice-Hall, 1987.

[23] R. Ammar and B. Qin, “A flow analysis technique for parallel computations,” in Proc. 7th Annu. IEEE Conf. Comput. Commun., Scottsdale, AZ, Mar. 1988.

[24] R. Ammar and B. Qin, “A technique to derive time costs of parallel computations,” in Proc. COMPSAC 88, IEEE Comput. Society’s Twelfth Int. Comput. Software Appl. Conf., Chicago, IL, Oct. 3-7, 1988.

Howard A. Sholl (M’68) is Director of the Taylor L. Booth Center for Computer Applications and Research, and Professor of Computer Science and Engineering at the University of Connecticut. His primary research interests are in the areas of real-time systems, performance analysis, and software engineering, where he has frequently published. He was a Leverhulme Visiting Fellow at the University of Edinburgh in 1973-1974, and a Fulbright Senior Research Fellow at the Technical University of Munich in 1981-1982.

Dr. Sholl is a member of the Association for Computing Machinery and the IEEE Computer Society.

Bin Qin (S’85-M’88) received the B.Sc. degree in computer science from Nanjing University in 1982 and the M.Sc. and Ph.D. degrees in computer science and engineering from the University of Connecticut, Storrs, in 1984 and 1987, respectively.

He is currently a Senior Associate Development Analyst at IBM Toronto. Before he joined IBM in December 1989, he did postdoctoral work and lectured at the University of Connecticut, was a research associate at Brandeis University, and worked as a senior research analyst at Microstat Development Corporation, Vancouver, B.C., Canada. His research interests include performance modeling and analysis of sequential and parallel computations, parallel processing, compiler design, interprocess communication, and software engineering.

Dr. Qin is a member of the Association for Computing Machinery and the IEEE Computer Society.

Reda A. Ammar (S’79-M’82-SM’90) received the B.S. degree in 1973 in electrical engineering from Cairo University, Cairo, Egypt, the B.S. degree in mathematics in 1975 from Ein-Shams University, Egypt, and the M.S. and Ph.D. degrees in 1981 and 1983, respectively, from the University of Connecticut, Storrs.

He is an Assistant Professor of Computer Science and Engineering at the University of Connecticut. From 1973 to 1978, he was a Lecturer for the Faculty of Engineering, Cairo University. In 1978, he joined the University of Connecticut, holding the ranks of Lecturer, Research Assistant, and Visiting Assistant Professor. He was with the Faculty of the Department of Engineering Mathematics and Physics, Cairo University, from 1984 to 1985. His research interests include software performance engineering, computer-aided performance engineering, modeling, and optimization of sequential and parallel computations.

Dr. Ammar is a member of the Association for Computing Machinery and the IEEE Computer Society.