A Global Optimization Approach for Architectural Synthesis

13
1266 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 9, SEPTEMBER 1993 Global Optimization Approach for Architectural Synthesis Catherine H. Gebotys, Member, IEEE, and Mohamed I. Elmasry, Fellow, IEEE Abstract-A relaxed LP model, which simultaneously sched- ules and allocates functional units and registers, is presented for synthesizing cost-constrained globally optimal architec- tures. This research is important for industry by providing ex- ploration of optimal synthesized architectures since it is well known that early architectural decisions have the greatest im- pact on the final design. A mathematical integer programming formulation of the architectural synthesis problem was trans- formed into the node packing problem. Polyhedral theory was used to formulate constraints that decreased the size of the search space, thus improving integer programming solution ef- ficiency. Execution times are an order-of-magnitude faster than previous research that uses heuristic techniques. This research breaks new ground by 1) simultaneously scheduling and allo- cating in practical execution times, 2) guaranteeing globally op- timal solutions for a specific objective function, and 3) provid- ing a polynomial run-time algorithm for solving some instances of this NP-complete problem. I. INTRODUCTION RCHITECTURAL synthesis is an important part of A the VLSI design cycle. The objective of synthesizers is to transform an input algorithm (or behavior) into a hardware architecture that minimizes a cost function and satisfies a set of constraints. Synthesizers must produce globally optimal architectures and execute quickly to pro- vide early exploration of design tradeoffs. In addition, synthesizers should be able to optimize linear or piece- wise linear cost functions, and incorporate complex tim- ing constraints. The architectural synthesis problem in- volves several interdependent subtasks including scheduling, and the allocation of functional units, regis- ters, and interconnects. The preceding problem is be- lieved to be NP-hard, since many of its subtasks have been defined as NP-complete [ 11. Since the subtasks of architectural synthesis are highly interdependent, we must solve simultaneously for more than one subtask. The classic problem given subse- quently, known as precedence constrained scheduling, is taken from 12, p. 811 and is an important subtask of the architectural synthesis problem. Manuscript received February 21, 1991; revised December 14, 1992. This paper was recommended by Associate Editor A. Parker. The work of the first author was supported by NSERC and the University of Waterloo Start Up Grant. The work of the second author was supported in part by grants from BNR, GE, and NSERC. The authors are with the Department of Electrical and Computer Engi- neering, University of Waterloo, Ontario, Canada. IEEE Log Number 9207800. “A set T ‘tasks’ (each assumed to have ‘length’ one), a partial order < * on T, a number m of ‘processors’ and an overall ‘deadline’ D E Z. Is there a ‘schedule’ a: T .+ (0, 1, - - , D} such that, for each i E (0, 1, - * , D}, I{t E T : a(t) = i}I + m, and such that, whenever t <* t’, then a(t) < 0 (t)’ )?” This problem, known as precedence constrained sched- uling, is NP-complete [2] (except for two processors 131, or if the partially ordered set has an intree structure 121). In architectural synthesis, this problem (with many exten- sions that will be defined later in this paper) is called si- multaneous scheduling and jimctional unit allocation. The processors are functional units and the tasks are the code operations of the input algorithm. Any input algorithm or directed acyclic graph (DAG) can be represented by a par- tially ordered set of code operations. The following ar- chitectural synthesis problems, 1 and 2, will be examined and solved herein. Produce a schedule, by mapping each code opera- tion of an input algorithm to a time and map each code operation to a functional unit. The objective is to minimize a cost function of the functional units, given an upper bound on the time to execute the input algorithm (on the synthesized architecture). Produce a schedule, as in Problem 1 and map each code operation to a functional unit and a register (which holds the variable output by the code oper- ation, until the last code operation that uses the vari- able has been executed). The objective is to mini- mize a cost function of the functional units and reg- isters, given an upper bound on the time to execute the input algorithm (on the synthesized architec- ture). both Problems 1 and 2, the following constructs are to be supported: a) selection of single-cycle, multicycle, or pipelined functional units, b) functional pipelining (or pipelining of the input algorithm), c) conditional code, d) loops, and e) fixed, minimum, or maximum timing con- straints. Many heuristic approaches to Problem 1 have been re- searched in [4] and [5]. More recently, [4] has extended the stepwise refinement algorithm, HAL, to include changes in scheduling due to register and interconnect es- 0278-0070/93$03.00 0 1993 lEEE

Transcript of A Global Optimization Approach for Architectural Synthesis

1266 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 9, SEPTEMBER 1993

Global Optimization Approach for Architectural Synthesis

Catherine H . Gebotys, Member, IEEE, and Mohamed I . Elmasry, Fellow, IEEE

Abstract-A relaxed LP model, which simultaneously sched- ules and allocates functional units and registers, is presented for synthesizing cost-constrained globally optimal architec- tures. This research is important for industry by providing ex- ploration of optimal synthesized architectures since it is well known that early architectural decisions have the greatest im- pact on the final design. A mathematical integer programming formulation of the architectural synthesis problem was trans- formed into the node packing problem. Polyhedral theory was used to formulate constraints that decreased the size of the search space, thus improving integer programming solution ef- ficiency. Execution times are an order-of-magnitude faster than previous research that uses heuristic techniques. This research breaks new ground by 1) simultaneously scheduling and allo- cating in practical execution times, 2) guaranteeing globally op- timal solutions for a specific objective function, and 3) provid- ing a polynomial run-time algorithm for solving some instances of this NP-complete problem.

I. INTRODUCTION RCHITECTURAL synthesis is an important part of A the VLSI design cycle. The objective of synthesizers

is to transform an input algorithm (or behavior) into a hardware architecture that minimizes a cost function and satisfies a set of constraints. Synthesizers must produce globally optimal architectures and execute quickly to pro- vide early exploration of design tradeoffs. In addition, synthesizers should be able to optimize linear or piece- wise linear cost functions, and incorporate complex tim- ing constraints. The architectural synthesis problem in- volves several interdependent subtasks including scheduling, and the allocation of functional units, regis- ters, and interconnects. The preceding problem is be- lieved to be NP-hard, since many of its subtasks have been defined as NP-complete [ 11.

Since the subtasks of architectural synthesis are highly interdependent, we must solve simultaneously for more than one subtask. The classic problem given subse- quently, known as precedence constrained scheduling, is taken from 12, p. 811 and is an important subtask of the architectural synthesis problem.

Manuscript received February 21, 1991; revised December 14, 1992. This paper was recommended by Associate Editor A. Parker. The work of the first author was supported by NSERC and the University of Waterloo Start Up Grant. The work of the second author was supported in part by grants from BNR, GE, and NSERC.

The authors are with the Department of Electrical and Computer Engi- neering, University of Waterloo, Ontario, Canada.

IEEE Log Number 9207800.

“A set T ‘tasks’ (each assumed to have ‘length’ one), a partial order < * on T, a number m of ‘processors’ and an overall ‘deadline’ D E Z.

Is there a ‘schedule’ a: T .+ (0 , 1, - - , D} such that, for each i E (0, 1, - * , D}, I{t E T : a(t) = i}I +

m, and such that, whenever t < * t’, then a(t) < 0 (t)’ )?”

This problem, known as precedence constrained sched- uling, is NP-complete [2] (except for two processors 131, or if the partially ordered set has an intree structure 121). In architectural synthesis, this problem (with many exten- sions that will be defined later in this paper) is called si- multaneous scheduling and jimctional unit allocation. The processors are functional units and the tasks are the code operations of the input algorithm. Any input algorithm or directed acyclic graph (DAG) can be represented by a par- tially ordered set of code operations. The following ar- chitectural synthesis problems, 1 and 2, will be examined and solved herein.

Produce a schedule, by mapping each code opera- tion of an input algorithm to a time and map each code operation to a functional unit. The objective is to minimize a cost function of the functional units, given an upper bound on the time to execute the input algorithm (on the synthesized architecture). Produce a schedule, as in Problem 1 and map each code operation to a functional unit and a register (which holds the variable output by the code oper- ation, until the last code operation that uses the vari- able has been executed). The objective is to mini- mize a cost function of the functional units and reg- isters, given an upper bound on the time to execute the input algorithm (on the synthesized architec- ture).

both Problems 1 and 2, the following constructs are to be supported: a) selection of single-cycle, multicycle, or pipelined functional units, b) functional pipelining (or pipelining of the input algorithm), c) conditional code, d) loops, and e) fixed, minimum, or maximum timing con- straints.

Many heuristic approaches to Problem 1 have been re- searched in [4] and [5]. More recently, [4] has extended the stepwise refinement algorithm, HAL, to include changes in scheduling due to register and interconnect es-

0278-0070/93$03.00 0 1993 lEEE

GEBOTYS AND ELMASRY 1 ARCHITECTURAL SYNTHESIS 1267

timated effects. Problem 2 represents simultaneous sched- uling, and allocation of functional units and registers. Several research projects that simultaneously optimize Problem 2 in a formal or mathematical manner are dis- cussed subsequently.

A Mixed Integer Linear Programming (MILP) model in [6] solves Problem 2 and, in addition, simultaneously schedules in real time and selects registers and functional units from a library. Unfortunately, only very small ex- amples could be solved due to the size of the model and the inefficiencies of the branch-and-bound technique.

A simulated annealing technique presented in [ 13 solves Problem 2 . The cost functions include a calculated num- ber of registers and an estimated number of interconnects. Running times were achieved comparable to heuristic techniques, although the rate of convergence to a global optimum is exponential [7].

An integer programming (IP) formulation for solving Problem 1 is used to produce a schedule [8] that can min- imize the number of functional units in one step and the sum of the variable lifetimes of the algorithm (or execu- tion time) in the second step. By using a two-step meth- odology, they have shown that individual IP’s can be solved very quickly by incrementally moving across the design space and using previous allocations as an upper bound on solving for the increased number of control steps. A heuristic partitioning procedure was presented to overcome the IP limitation to small problems ( 5 200 vari- ables [9]).

A graph-theoretical approach to the simultaneous scheduling and resource (modules and registers) minimi- zation problem was presented in [ 101. A two-dimensional placement of the data flow graph, where makespan (or execution time), graph height (number of modules), and modified cutwidth measurement (estimated number of registers) were defined, was used to represent the sched- uling problem. A heuristic algorithm was used to solve the problem since the complexity was NP-complete.

General IP solving techniques, such as branch and bound [7], are very inefficient and only practical for 100- 200 variables [9]. Even general 0- 1 IP problems with lin- ear constraints and objective functions are among the NP- complete problems for which no technically good algo- rithms are known to date [ l l ] . However, under certain conditions, polyhedral theory can be used to solve the IP more efficiency. Research in this area is motivated by the fact that integer programming is NP-hard, and very erratic performance is solving IP’s has been experienced. It has been proven that, for every bounded linear system of in- equalities (or polytope), there exists another linear system of inequalities (integral polytope) that can be solved as a linear program (e.g., using the simplex algorithm) and always produce integer solutions [7]. Integral facets’ of the polytope are the necessary constraints required to de- fine the integer vertices of the polytope. If these facets are

‘ A facet is a valid inequality of dimension one less than the dimension [7] of the polyhedron.

used over the region of a minimum (or maximum) objec- tive function, then the solution of the IP can be obtained from the relaxed Linear Programming (LP) solution. Al- though for many problems it is not known what the facets look like, there are some problems, such as matching and total unimodularity [7], for which the facets have been completely characterized. In another class of problems, such as the traveling salesman problem (TSP) and the node packing problem, the facets have been partially charac- terized. The recent success in the TSP problem has an important impact on solving large IP’s. For example [7], the TSP with 7000 integer variables was solved in 2 min 30 s, and a 50 000-variable TSP was solved to 0.25% optimality in 30 min. Unfortunately, the large amount of research in the traveling salesman problem is not propor- tional to its applicability. Crowder et al. [ l l ] have re- cently shown that, even if a problem does not fall into one of these classes, yet a subset of the problem can be mapped into one of these partially characterized classes of problems, then by using facets of this subpolytope the size of the search space is reduced and large 0-1 IP prob- lems can be solved efficiently. For example, they solved for over 2000 binary variables in under 1 h of CPU time.

The focus of this paper is on the application of node packing to architectural synthesis, which, similar to the TSP, has been partially characterized by its facets. In ad- dition, a property unique to node packing (and not valid for any other IP) is that, if not all variables in a solution are integers, the variables that are integers remain inte- gers in a globally optimal solution [7]. This important property will be called the decomposition property, since it can be used to decompose a large problem into a num- ber of smaller problems.

The weighted node packing problem [7] is stated more formally as maximizing E, cuxu, where c, is a cost vector and xu is a vertex of the graph G = ( V , E ) , where

for every edge (U, U ) E E

for all nodes U E V.

xu + x,, 5 1,

xu E (0, I ) ,

The problem can also be represented as a system of linear inequalities, Ax 5 6 , where A is a (0, 1) matrix with each row having two 1’s and b is a unit vector.

Finding all integral facets for a particular node packing problem is NP-complete. This problem is known as the stable set polytope problem, using graph-theoretical ter- minology. Nevertheless, only integral facets over the re- gion of the minimum (or maximum) objective function are required to obtain integer solutions. It is also impor- tant to extract some facets since they make the model very tight [7] (or reduce the size of the solution space), thus providing good bounds on the problem. The use of facets is also known to reduce the number of live nodes gener- ated in a branch-and-bound algorithm.

We will show how our research has mapped the archi- tectural synthesis problem into the node packing problem and by extraction and generalization of some integral fac- ets has optimally solved Problems 1 and 2 . Synthesized

1268 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 9, SEPTEMBER 1993

solutions of several high-level synthesis benchmarks are obtained in (CPU) execution times faster than comparable heuristic technique. The simplex algorithm [9] is cur- rently used to schedule and allocate architectures simul- taneously. The cost function can be piecewise linear, and it can handle fixed resources, cost-constrained allocation, or a combination of both. Functional units can be multi- functional (e.g., an ALU), single-cycled, multicycled or pipelined. Branches, loops, conditional code, and func- tional pipelining are supported. In addition, the selection of the type of functional unit (e.g., pipelined or multicy- cled version of a multiplier) can be optimized simulta- neously.

11. PRELIMINARIES The input to the synthesizer is a list of code operations,

a table defining the partial order between code operations, the as-soon-as-possible (asap) and as-late-as-possible (alap) times for each code operation, and the cost func- tion. The IP model, its transformation to the stable set polytope problem, and techniques used for its solution will be presented next. We will use Problem 1 to illustrate how we transformed the IP into the node packing problem and extracted integral facets, and then we will discuss the additional register allocation constraints necessary for solving Problem 2.

The following terminology is used:

k = code operation. A partial order (or precedence constraint) between k l and k2 is represented by k l < k Z , or, in other words, kl must execute before k2 [2] . Let K be the maximum number of code op- erations in the input algorithm. j = time or control step (cstep), J L j 2 1, j E Z (set of integers), and J = upper bound on the num- ber of control steps. i = functional unit, i E Z. The mapping of k to i (one to many) is also defined based on the functions that each i can perform. For example, an addition code operation k, cannot be mapped to a multiplier i l , however, it may be mapped to an ALU i2 or an adder i3. Thus xl2,,,k, xL3,,,kZ are the variables gener- ated for code operation k,. R = number of registers, R E Z . When x , , , , ~ = 1 in the IP solution, this means that code operation k is assigned to functional unit i, and the code operation starts executing at control stepj. j E R ( k ) , where R (k) = {asap (k), asap (k) + 1 ,

, alap ( k ) } , means that j is lower bounded by the asap control step and upper bounded by the alap control step for code operation k or asap (k) 5 j 5 alap (k) . k, E Op (C,, L,) means that the code operation k, has an execution time of C, (the number of control steps required by the operation to produce an output) and a latency of L, (or the minimum time between suc- cessive data inputs to the functional unit). For ex- ample, a single-cycle operation is k E Op (1, l ) , and

a two-cycle pipelined operation is k E Op(2, 1 ) . This notation is used to simplify the inequalities pre- sented in Section 111. However, it should be noted that, in fact, each code operation can have different C and L values, depending on which functional unit it is mapped to. In Section V we will illustrate this with the notation ( k , i) E Op(C, L).

111. PROBLEM 1 : IP MODEL Problem 1, which performs simultaneous scheduling

and functional unit allocation, can be modeled as an as- signment problem, where the variables are represented by x i , j ,+ . The IP model consists of three types of general con- straints, as discussed below.

Equation (l), called the operation assignment con- straint, ensures that each code operation will be assigned to one control step and one functional unit.

* i , j , k - - 1 , v k . i j eR(k)

Inequality (2 ) , called the functional unit constraint, prevents more than one code operation from being as- signed to the same functional unit at the same control step.

ji = j - L + 1

Inequality (3), called the precedence constraint, pre- vents an operation k l from being scheduled after operation k2 whenever there is a partial order between these opera- tions such that kl < * k2 .

(xi , j i ,ki + x i , j ~ , k ~ ) 5 1 , vkl < * k27 j2 I j l

x ; , ~ , ~ = O or 1. (4)

IV. TRANSFORMATION TO NODE PACKING AND FACET ANALYSIS

When we relax (4) to 0 I x I 1, the system of linear inequalities2 becomes Ax I b, where A is a (0, 1) matrix (or all entries in A are 0 or 1) and b is a unit vector. For example, consider scheduling the two operations k = a and k = b (a E Op(1, 1) b E Op(1, l)), where a must be executed before b (or a < b) , defined over csteps j = 1, 2 , 3 , 4, and 5, thus R(a) = (1, 2 , 3 , 4}, and R ( b ) = ( 2 , 3 , 4, 5 ) . For now, we assume there are two functional units i = 1, 2. The constraints for this problem generated from inequalities (1) through (3) below form the follow- ing A matfix, where x = [XI, l , a , x2,1,a, ~ 1 , 2 , a , ~ 2 , 2 , a , * * *

'Equation (1) could be transformed into an (5) inequality by using a maximization objective function, such as max Ex. However, we use (1) directly. This had no effect on the efficiency of solving the IP.

I269 GEBOTYS AND ELMASRY: ARCHITECTURAL SYNTHESIS

X 1 , 5 , b , x 2 , 5 , bl:

(3)

Row No. Inequality No.

(1)

(2)

This problem, known as set packing [any problem with A a (0, 1) matrix and a unit b vector, and binary x ] [7], can be transformed into a graph, G = ( V , E ) (called the node packing graph), from which we can extract some integral facets of the node packing problem. First, we map each variable, x ~ , ~ , ~ , into a node, u E I/. Edges of the graph, E, are defined by the following procedure. For each row of the matrix A (representing a constraint), the variables

in G [7]. The graph G represents the node packing prob- lem, which can be transformed from any system of linear inequalities with a (0, 1) constraint matrix (A) and a unit vector (6) on the right-hand side. Known integral facets

5 1 , for all K cliques (or maximal complete subgraphs), and ii) C l r s C x U I (\Cl - 1)/2, for all C chordless odd

procedure for odd cycles. For simplicity, first we will examine the one-dimen-

sional (1-D) problem, where the variables are x , , k for all Jegk,) ( j 1x1 ki - c ( j ) X J , k2 5 - C I , V ~ I < * k2. j , k. Again consider the scheduling problem for code op- erations a and b, where a < e b and J = 5. Constraints We will call this constraint ( 5 * ) . Although the set of in- (1) and (3) [with i = 1 , and (2) is no longer valid for the teger feasible solutions is the same, formulation (5) is 1-D model] are used to generate the following F matrix, tighter (or forms a smaller search space) than ( 5 ” ) and the where Fx 5 1 and xT = [xl,., x2,., x3, . , x4, . , x 2 , b , x 3 , b , proof is given in Appendix I . Thus improvements in IP

x J , . and x J , b . In Fig. l(b) we note that the four nodes, x3, . , x4,., x Z , b , x 3 , b , shown in bold, form a clique, or integral facet ( i ) . The edges of this clique were formed by rows (1. l ) , (1.2), and (3.2)-(3.5) of matrix F. The clique can be represented by the inequality x 3 , . + ~ 4 , . + X 2 , b -k X 3 . b

5 1. Therefore, (3.2)-(3.5) can be replaced by the fol- lowing row:

(or columns) with a “1”entry define a complete subgraph3 [ O O 1 1 1 1 0 0 1 .

We can generalize this clique facet (precedence a m - straint) to

c X ~ 2 9 k 2 + x J l , k l v k , < * k 2 , for the node packing problem are as follows: i) C u c K xu ~ 2 5 . 1 J I 21 -cI + 1 J I ER(ki ) J 2 E R ( k 2 )

j E R(k2) rl (R(ki) + ( C I - l)), ki E 0 p ( C 1 , L I )

straint formulation was presented in [8] and [12] as

(5 ) cycles, (cl 2 5 and others in [71 that involve a lifting A different formulation of the 1-D precedence con-

J E R (A21

X 4 , b , X5,bl:

Row No. Inequality No.

( 1 )

(3)

This model, Fx I I , can be visualized as the placement Of code Operations into control steps. The node packing graph for this problem is shown in Fig. 1 . Each row gen- crated using (3) forms an edge between a node

solution efficiency and better bounds on variables are ex- pected with (5) since it is a tighter formulation [7]. o n further analysis of Fig. 1, one can see that the graph is

’A complete subgraph 1s a subgraph K c G, K = (VI, E l ) , such that v characterized by completely generated by U , v E V ’ , ( U , U ) E El (1) and ( 5 ) , and G has no chordless odd cycles. Thus G

1270 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 9 , SEPTEMBER 1993

(a) (b) (a) (b) Fig. 1. Node packing graph for two code operations (a < . b), showing in bold a clique facet f o r j = 2 in (a) a n d j = 3 in (b) of inequality (5). R(a)

Fig. 2. Node packing graph for three code operations { a , b, cia < . c , 6 < . c } showing in bold inequality (2) in (a) and inequality ( 6 ) in (b). R ( a )

= { 1, 2 , 3 , 4) and R(b) = (2 , 3 , 4, 5). = R(b) = (1, 2} , R ( c ) = (2, 3 ) .

is a perfect graph (by definition), and its integral polytope is characterized completely by inequalities (1 ) and (5).

We will now extend the 1-D model to the two-dimen- sional (2-D, orxi,j,k) model that we have developed. First, we will show that the clique facet of the 1-D model (5) can be extended to a valid inequality of the 2-D model. Constraint (5) for the 2-D model can be modified by re- placing xj,k by Ci x ; , ~ , ~ , as shown in inequality (6). In- equality (3) is now redundant and can be removed from the model and replaced by inequality (6) , which was proven to be a facet of the 1-D model. For example, we can show that the valid inequality is Ci + x ; , ~ , ~ + xi, 2 , b + xi, 3 , b ) I 1 , generated from rows ( 1 . l ) , (1.2), and (3.2)-(3.5) of matrix A [and generalized in ( 6 ) ] . We can add this inequality to the system of linear inequalities as a row of the A matrix:

[O 0 1 1 1 1 0 0 1 1 1 1 0 0 0 01

and we can remove rows (3.2)-(3.5) since they are now redundant inequalities. This new inequality (6) reduces the size of (or tightens) the search space. For example,

0.5, 0.5, 0.5) is no longer valid and, therefore, is re- moved from the search space. However, is the new valid inequality a facet? The answer to this question is depen- dent on the particular example. We can demonstrate this most easily next by analyzing a node packing graph ex- ample of the 2-D model.

c , a, b, c E Op ( 1 , l)], with an upper bound of three con- trol steps and an upper bound of two functional units. The node packing graphs for this example is given in Fig. 2. Clearly, constraint (1) generates cliques and, therefore, corresponds to integral facets. Constraint (2) shown in bold in Fig. 2(a) is a clique V i , j = 1 (by definition); however, it is not clique for all i, j = 2. For example, constraint (2) generated for i = 1 , j = 2 is x ~ , ~ , ~ + x , , ~ , ~ + x1,2,c I 1; however, it is not a facet because it forms a complete subgraph (that is not maximal) in G composed of the three vertices x ~ , ~ , xl, 2,b , and x ~ , ~ , =. By including vertex x 2 , 2 , c in the inequality, for example, x ~ , ~ , ~ + x ~ , ~ , ~ + x1,2 ,c + x2 ,2 ,c I 1 , the inequality forms a clique in G and, therefore, is a facet. Thus, some generalized in- equalities, such as (2), may or may not generate facets in a specific algorithmic implementation. Inequality (6) also

the fractional point ( x 2 , 3 , a , X2,4,a, x 1 , 2 , b , x 1 , 3 , b ) = (0.5,

Consider three code operations {a , b, c ( a < c, b <

generates clique facets for the 2-D example shown in Fig. 2(b). Thus some facets of lower dimensional models, such as ( 5 ) , remain facets when higher dimensions are added, such as

v k ~ < - k2, j E ( R ( ~ I ) + CI - 1) n R W , (6)

It is also interesting to note that there are only two addi- tional (lifted odd cycle) facets4 not given or used in our model that are needed to characterize completely the in- tegral polytope for the three code operation example in Fig. 2.

V. PROBLEM 1: LP MODEL Thus, the general model for Problem 1-simultaneous

scheduling and functional unit allocation-consists of vari- a b l e s ~ ; , ~ , ~ with constraints ( l ) , (2), and (6 ) . Finally, con- straint (4) can be relaxed to the following constraint (7) since we now have some integral facets and can try to solve them by using linear programming. The complete LP model for solving Problem 1 is given below. It was applied to some benchmark architectural synthesis ex- amples and optimized for different cost functions.

X i , j , k - - 1, vk. i j E R ( k )

i i = i - L + 1 c c i 1= i

Xi, j i ,k 1 , V i , j .

vk, < k2, j E ( R ( k l ) + C1 - 1) fl R(k2) ,

0 5 x;,j,k 5 1. (7)

4The additional (lifted odd cycle) facets are x2. I , u + x2. I . h + E,,,, x , ,~ . , , 5 2 a n d x , , , , , + X i . i . h + E, k X t , 2 . k 5 2.

GEBOTYS AND ELMASRY: ARCHITECTURAL SYNTHESIS 1271

VI. PROBLEM 2: CONSTRAINT FOR

REGISTER ALLOCATION The LP model for Problem 2-simultaneous scheduling

and functional unit and register allocation-is exactly the same as the LP model presented in Section V (for solving Problem 1) except additional constraints are generated for register allocation. The register allocation constraint re- quired for this model requires some preprocessing. We have researched many different formulations [ 131 for reg- ister allocation. The formulation that gave the best per- formance (in terms of CPU time to solve the optimization problem) will be presented subsequently.

Register allocation ensures that there are no more than R registers required (or R variables whose lifetimes over- lap) at any cstep. A variable lifetime can be represented by a (lifetime-defining) edge (k < k,) between the code operation, k, which outputs the variable, and the code op- eration that last used the variable, kl. However, in many algorithms, each variable may be input to more than one code operation or k < k,, where rn = 1, * , e , thus it is different to determine which k, should be the life- time-defining edge. Two properties-transitivity5 and alap analysis-can be used to decrease the number of edges we must consider for representing a variable lifetime. Con- sider the example in Fig. 3(a) (rn = 1, 2). Transitivity can be used to determine the longest edge from k since (k, < k2) =$ (k2 = k,). In Fig. 3(b) alap analysis can also be used to determine the longest edge since (asap(k2) 2 alap (k,)) =$ (k2 = k, ) . This preprocessing can be done very fast in polynomial time, and the algorithm is given in Appendix 11.

However, even after preprocessing there will still be some code operations k, such that k, < * k,, where m = 1, * - . , e . We will now explain how the register allo-

and allocates a correct number of registers. The following terminology is used: a) an arc, k, < k, [whose head is k, and tail is k,, k, E Op(C,, L,)], is said to cross cstep j if and only if R(k,) n (0, 1, * , j - (C, - I )} # 0 andR(k,) n { j + 1 , j + 2, * * , J } # 0; b) e j (n ) = the number of arcs (k, < k,, rn = 1, - - , e), with head k, that cross at j (ej (n) I e ) . For the general case where ej(n) 1 1 Vn, constraint (8) is generated, 11, ej(n) times for cstep = j , or for all maximal sets of arcs that cross j such that no two arcs in a set have the same head. For example, if e arcs cross c s t e p j and have the same head ki[ki < * , e or e j ( i ) = e ] , and the rest of the arcs that cross cstep j have unique heads [e, (n ) = 1Vn # i], then (8) is generated e times (once for each k,) for cstepj. In practice the number of constraints will not be a significant problem, because 1) the computation time for IP problems is not highly sensitive to the number of constraints [7], and 2) most algorithms will have small values of 11, e j (n ) which intersect at the same cstep. In the register allocation constraint (8) shown below, the

cation constraint (8) deals with cases of rn = 1, * , e

k,, m = 1,

'The transitivity property states that if k < . k , and k , < . k2 then it is true that k < . k, .

(a) (b) Fig. 3. Lifetime defining edges for the variable output from k are shown in bold using (a) transitivity analysis and (b) alap analysis (a lap(k , ) 5 asap ( k d .

Fig. 4. Register allocation constraint illustrated with one cut edge, a < . b, at j (the other two edges cancel out because each edge has nodes in a "+" and "-" quadrant).

summation c k n < . k m means that you sum over all (or a maximum number of) arcs (k, < k,) that cross cstep j , such that no two arcs have the same head.

- j 3 5 j X i , j 3 , k , - j 4 > j - ( C n - l ) c X i , j 4 , k , ) < 2 R , j 3 E R (km) j 4 E R ( k d

for all j and for all maximal sets of arcs (k, < - k,) that cross cs tepj such that no two arcs in the same set have the same head.

(8) The register allocation constraint calculates two times

the number of cut edges at each cstep, by dividing time and operations into four quadrants, as shown in Fig. 4. For example, in Fig. 4, the arc a < b [previously de- fined in Fig. l with R(a) = (1, 2 , 3 , 4) and R(b) = (2 , 3, 4, 5 } ] would contribute

j = 2 j = 5 j = 4

X i , j , a - Xi,2,b + ] = 3 .x x i , j , b - j s 3 X i . 3 . a j = 1

to the left-hand side of constraint (8) generated f o r j = 2 (since it crossesj = 2).

Inequality (8) assumes that the lifetime of a variable ends after it is input into the last functional unit that uses it. However, for code operations k, where L, > 1 and there are no latches at the input of multicycle functional units we have to extend the lifetime of the variable (L, - 1) csteps. This can be done easily by first defining an arc to cross if R(k,) r l (0, 1 , , ( j - (C, - 1))) # 0 -

1272 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 9 , SEPTEMBER 1993

and R(k, ) fl { j + 1 - (L, - l ) , j + 2 - (L, - 11, . . . , J } # 0; second, by changing CJ2>J12sR(k,n, of in-

(8) to > J - (Lm - 1 ) j 2 t R ( I m ) and ‘ J 3 s J j 3 ~ R ( k , , , ) Of in- equality (8) to ‘ J 3 5 J - ( L m - I ) j 7 s R ( k m ) ’

VII. CONSTRUCTS SUPPORTED I N LP MODELS FOR PROBLEMS 1 AND 2

By substituting ( k , i ) E Op(C, L) for k E Op(C, L) in the previous inequalities, we can extend our model to se- lect the types of functional units simultaneously with scheduling and functional unit allocation. This is impor- tant since some input algorithms may be more optimally synthesized, for example, using pipelined multipliers rather than multicycled multipliers or a combination of both. Inequality (3) can be generated with C, = 2 for k E i , , where i l is a two-cycle multiplier and C2 = 1 for k E i 2 , where i2 is one-cycle multiplier. Similarly, we can use this type of functional unit selection to choose whether to implement an ALU or separate adders and subtractors.

The next constraint is formulated for pipelined func- tional units, i , which can implement more than one func- tion (such as a * / + unit that can perform an addition and a multiplication), where operations that can be assigned to i have different values of C, k E Op(C, L). Constraint (9) ensures that only one pipelined function can be com- pleted at one time. For example, consider the * / + func- tional unit, i , where * E Op(2, 1) and + E Op(1, 1). Constraint (9) prevents the following illegal schedule from occurring, x , , ~ - ,,* = 1 and x,,~,.+ = 1, since at cstep j both the addition and the multiplication operations would be completed. Note that inequality (9) is not required if all operations assigned to i have the same value of C.

C ~ , , ~ - c + i , k ~ 1, v i , j . (9) k

k E Op(C, L)

Our IP model can easily support conditional code In- equality (2) is generated for each set of mutually exclu- sive code operations or code operations from each possi- ble path generated by conditional branches. Similarly, register allocation constraint (8) is generated from arcs that cross (whose head is executed before the branch and tail are executed after the join) or have a head and/or tail in each path generated by conditional branches. In addi- tion, extra data precedence constraints may need to be added to prevent conditional code operations from being scheduled before the branch or after the join of the branch.

An additional end-of-loop operation node is required for cases where one or more code operations output a data value used as input to the next loop iteration and not as input to the code operation after the loop. The end-of-loop operation is used only in the data precedence inequality (6). This ensures a register will hold the data until the end of the loop.

Fixed, minimum or maximum timing constraints be- tween pairs (or groups) of operations can be incorporated into our IP model. In all examples below, T is the time constraint value measured in the number of control steps.

We use the notation time ( k , , k2) I I = T to represent the minimum 2 , maximum I, or fixed = time constraint between the two operations, where

A fixed timing constraint between two operations, k l and k2, forces the scheduled time for operation k2 to be T control steps after operation kl or, in other words, Ci xk,,, = Ci x ~ , ~ + T ) , v j . An equivalent representation of this fixed timing constraint using node packing inequali- ties is shown in (10). The fixed timing constraint is very important for interfacing to analog processes that may in- put or output synchronous sequential data at a fixed rate, or at a fixed number of control step intervals ( T ) . In ad- dition, the minimum and maximum timing constraints of T, which will be presented next, also form facets of the fixed timing constrained scheduling problem (see Appen- dix 111 for explanation).

V j , , time ( k , , k2) = T. (10)

The minimum and maximum timing constraints are rep- resented by inequalities (1 1) and (12). Both constraints, (11) and (12), are similar to the data precedence con- straints in inequalities (6 ) , except that they are extended in a direction by T (where C, - 1 = T ) .

V j , , time ( k , , k2) I T.

V j , , time ( k , , k2) I T, (12)

The objective function or cost function for the IP prob- lem can be formulated as any linear or piecewise linear function of the variables. For example, an area cost func- tion similar to [ l ] can be formulated. This requires i ad- ditional binary variables, z i , shown in inequality (13). When zi = 1 then one functional unit ( i ) has been allo- cated.

C X ~ , ~ , ~ I z i , v i , j o I zi I 1 . (13)

Consider the formulation of a linear area cost as a func- tion of the number of each type of functional unit only.

k

GEBOTYS AND ELMASRY: ARCHITECTURAL SYNTnEsIs 1273

Assume a+ is the area of an adder, a* is the area of a multiplier, i = 1 , 2, 3 represents the multipliers, and i = 4, 5 , 6 represents the adders. Then the linear area cost can be formulated as follows:

TABLE I SYNTHESIS USING LP MODEL FOR EWF (ON RC6280)

Tgen, Texe, csteps *PI * + s Var Eqn Itns S

I = 3 1 = 6

Area = a* c z, + a+ C z I . r = l 1 = 4

Note that C: E z , represents the number of multipliers al- located. The number of registers can also be included in the area cost function. This can also be formulated as a piecewise linear cost function. For example, if we have a cost of 10 for each register up to the fifth register and a cost of 15 for all registers after the sixth, we can represent this by letting R = R I + R2, where the upper bound of R I = 5 and the lower bound of R2 = 0. Then the cost func- tion becomes cost-reg R = (10 15) (R , R2)T. The general linear or piecewise linear area cost formulation is shown as

(14)

Functional pipelining for a fixed latency (defined as the time between successive starts of an input algorithm), I , can be incorporated into our model without additional variables. We define r J / l 1 = p pipestages, and replace

where addition ( j + n l ) is modulo 1. Thus only variables representing code operations of one pipestage are used. The functional unit allocation and register allocation con- straints are generated for l consecutive control steps. If the number of pipestages is less than p (defined earlier), then the model must generate these constraints for all j such that ( J - p l ) 1 j 1 p l . Functional pipelining main- tains the node packing structure of the inequalities.

C z, cost-@(i) + cost-reg R. I

C k , E ~ ( ~ ) XI,,,^ of (2) and (8) with C k ( , + n i ) E ~ ( ~ ) X r , , + n l , k

VIII. ARCHITECTURAL SYNTHESIZER RESULTS Two input algorithms from high-level synthesis bench-

marks [14] were synthesized using the LP model. (We will refer to our model as the LP model; however, it is important to note again that we are using the relaxed LP model to solve the IP problem.) The first example is the elliptical wave filter (EWF). It was synthesized and com- pared with the simulated annealing approach [ 11, the step- wise refinement approach of HAL [4], and another integer programming approach [15]. This example has 34 code operations with over 56 partial order (or precedence) con- straints. The final input algorithm is part of the Kalman filter [14], which is a very complex algorithm with over 1000 code operations (fully unrolled). Unfortunately, we could only do a limited comparison due to the lack of published synthesizer results for this more complex al- gorithm. Synthesizer results and comparisons are given below in Subsection VIII-A for solving Problem 1 and Subsection VIII-B for solving Problem 2.

A . Results for Problem 1 Tables I and I1 give a more detailed examination of the

performance of the LP model for synthesizing the EWF

17 2 3 0.32 0.08 193 226 50 18 1 3 0 .42 0.11 292 273 124 19 1 2 1.67 1.58 391 327 221

17 3 3 0.36 0 .05 193 232 52 18 2 2 0.39 0.1 292 279 190 19 2 2 1.22 1.0 391 333 185 21 1 2 0.91 5.37 589 428 889

TABLE I1 SYNTHESIS USING LP MODEL FOR UNROLLED EWF (ON 386PC)

Tgen, Texe, Unrolled No. of Code Operations csteps *pl + s s Var Eqn Times

68 34 1 3 32 25 566 333 2 X

102 50 1 3 51 39 859 506 3 X

as the upper bound on the number of control steps (csteps) is increased and as the EWF loop is unrolled. The model generation time (Tgen), execution time (Texec) (both in CPU seconds), number of variables (Var), number of constraints (Eqn), and number of simplex iterations (Ztns) to solve the LP are given. The number of adders (+), two-cycle multipliers (*), and pipelined two-cycle multi- pliers (*pl) are given in the tables. The GAMS/MINOS (LP solver) and GAMS/ZOOM (IP solver) optimization software [9] was used to solve the LP’s and IP’s on an IBM PS/2 model 80 (386PC) and on an RC6280 MIPS workstation. All solutions were all-integer after solving the LP once, except for row 3 (19 csteps). In this case, the times reported in Table I include enumerating until a globally optimal all-integer solution was obtained. We unrolled the EWF two and three times to illustrate how well the LP model performs with a large number of code operations; see Table 11. In the later case simultaneous scheduling and functional unit allocation of over 100 code operations was executed in 90 s of CPU time. To the au- thors’ knowledge, no other published research has solved simultaneous scheduling and allocation to global opti- mums for this (large) number of code operations. In Ta- bles I and I1 the following cost function was minimized a, zi + a+ CiI; where a, = 0.49 and a+ = 0.012, using the area of the multipliers and adders in [ 161. Other cost functions similar to (14) were also used and gener- ated all-integer solutions in similar CPU times, as de- scribed next.

The problem of simultaneously scheduling and select- ing and allocating functional units was also solved. The following cost function was minimized, Ci cost-@ (i) zi, where i = 1 or 2 refers to selecting a two-cycle multiplier (C = 2, L = 2), i = 3 or 4 refers to selecting a two-cycle pipelined multiplier (C = 2, L = l ) , and i = 5 , 6, or 7 refers to allocating a one-cycle adder (C = L = 1) . An

1274 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 9 , SEPTEMBER 1993

upper bound of 18 csteps was used, and the IP problem was solved by GAMSIMINOS [9] on an MIPS RC6280 workstation. For example, cost-f i( i) = 14, 16, 31, 34, 2, 2, 2, for i = 1, 2, 3 , 4 , 5 , 6, 7 , respectively, represents a cost function where the first two-cycle multiplier has an area (or cost) of 14 units and the second two-cycle mul- tiplier has an area of 16 units (slightly higher to account for additional interconnect area). This cost function was minimized, and after solving the LP once (requiring 0.47 s of CPU time), an all-integer solution was obtained. Two two-cycle multipliers and two adders were selected and allocated. When cost-fu(i) = 15, 16, 24, 34, 2, 2 , 2 for i = 1, 2, 3 , 4 , 5 , 6, 7 respectively, after solving the initial LP once (requiring 0.4 s of CPU time), an all-integer so- lution was obtained. One pipelined multiplier and three adders were selected and allocated. In both cases, the LP model had 3 12 variables and 214 inequalities.

Finally to illustrate the use of inequality (9) of Section VII, we simultaneously scheduled, allocated, and se- lected types of functional units, including functional units that were pipelined and could perform more than one function. The area of the architecture for the EWF ex- ample was minimized with a fixed execution time of 19 csteps. The functional units corresponding to i values were adders i = 1, 2, (Op(1, l ) ) , two-cycle pipelined multi- pliers i = 3 (Op(2, l)), and i = 4, 5 , 6 functional unit which can perform a one-cycle addition (Op(1, 1)) or a two-cycle pipelined multiplier (Op(2, 1)). The area of each adder was 0.012, the area of the pipelined multiplier was 0.49, and the area of each multipliedadder was 0.4. The IP problem required 748 variables, 527 equations, and 60.6 s of CPU time on RC6280 to find an architecture with an optimum area of 0.424. The solution allocated one multipliedadder functional unit and two adders.

Table I11 illustrates the importance of IP model for- mulation and the use of facets to generate a smaller search space. The problem is to schedule and allocate functional units simultaneously for the elliptic wave filter example with J = 19. Our solution is compared with the model presented in [12] and [15]. Both models were run on the same 386PC using the GAMS/ZOOM solver. Our model was solved for all-integer values after the initial LP. The other model [ 151 had to resort to a branch-and-bound rou- tine before an integer solution was found. Our LP model produced a globally optimal architecture 8.7 times faster (on the same 386PC with the same solver) than Hwang et al.’s model although our LP model has more constraints and more variables than their model. Again this further illustrates why solving IP problems is not highly depen- dent on the number of constraints. In both cases, the up- per bounds on execution time and number of functional units were the same.

The Kalman filter example [14] consists of a large nested loop followed by a matrix multiplication. It is an important example for synthesis since it is very different from the EWF in that it contains a great deal of regularity and has large loops. Herein will present only results on optimally synthesizing an architecture for the matrix mul-

TABLE 111 LP VERSUS HWANG’S [15] MODEL (BOTH SOLVED WITH GAMS ON 386PC)

Tcpu, S Var Eqn Model csteps *pl +

Hwang 19 1 2 6001. 130 120 LP 19 1 2 69 355 201

+Could not be solved with initial LP, resorted to branch and bound.

tiplication. More details on the full Kalman filter example can be found in [13]. The matrix multiplication AV = U [where A is a 4 X 16 (row by column) matrix and v is a 16 x 1 vector] has to compute

e = 16

U, = C a,,,v,, for r = 1, 2, 3, 4 , c = 1

or in algorithmic form

for r = l to 4 { temp0 = 0; for c = l to 16 {

temp = a(r,c)*v(c) + tempo; temp0 = temp; ]

u(r) = tempo;}.

The input to our LP model was obtained by first un- rolling the inner c loop 16 times, which created four mul- tiply-accumulate (mac) streams ( r = 1, 9 . , 4). One mac stream takes a row ( r ) from A and multiplies it by v to produce U,. The mac stream was input to our LP model with variables x ~ , ~ , ~ , where k,,,, < k + , , form = 1, , 16, and k+,,, < * k + , + , , form = 1, - * * , 15. Using con- structs for functional pipelining we could model the four mac streams without any additional variables and obtain a schedule with a large amount of regularity (We could also have added variables to model all of the four mac streams; however, the controller would have been un- necessarily complex since the schedule would not neces- sarily have been regular.) The functional pipelining con- structs used a latency of one and four pipestages, as dis- cussed in Section V. The four pipestages (or streams) of 32 code operations were simultaneously scheduled and al- located (using the functional pipelining constraints) in 40 s of CPU time on the 386PC. Three pipelined two-cycle multipliers and three adders were allocated. Fig. 5 illus- trates the optimal schedule in bold that the LP model pro- duced for this architecture. The boxed operations repre- sent the schedule of each pipestage required to execute three multiply-accumulate operations. This schedule forms a pattern that is repeated four times. A total of 24 csteps is therefore required to execute the 4 x 16 matrix multiplication of the Kalman filter. Unfortunately, we cannot compare with [17] since the allocation of func- tional units was not reported and in [ 181 the multiplier and adder are chained (or the output of the multiplier is con- nected directly to the input of the adder). To the authors’ knowledge other published research has tackled this benchmark-most likely because of its size, complexity,

1275 GEBOTYS AND ELMASRY: ARCHITECTURAL SYNTHESIS

m

4

Fig. 5 . Part of the LP optimized schedule (shown with four pipestages, latency = 1) for the 4 X 16 matrix multiplication of the Kalman filter benchmark example. The final schedule is shown in bold. The boxed schedule is repeated four times to complete the execution of the 4 X 16 matrix multiplication.

and the possible difficulties with applying heuristic syn- thesis techniques to an example with a great deal of reg- ularity. Nevertheless, the matrix multiplication of the Kalman filter example illustrates the flexibility of the LP model to support functional pipelining and synthesize dif- ferent types of input algorithms.

Results for Problem 2 In Table IV the synthesizers were compared by the

types of subtasks they perform and the execution time to solve the problem in CPU minutes or seconds for the EWF examples. Table IV has columns for scheduling (Sched), functional unit allocation (FU), and register allocation (Reg). A y (yes) means that the subtask is completely per- formed. A c (calculated) means that the number of re- sources is calculated exactly. An e (estimated) means that the resource is estimated using some heuristic or it is somehow considered during the algorithm. The execution time in CPU minutes/seconds in Table IV are only for the scheduling phase for HAL [4] (using a Xerox 1108 Lisp machine) and the simulated annealing (S.A.) run times (using a Vax 1118650) for [ 11. The LP model CPU seconds (Tgen + Texec) are for the GAMSIMINOS solver [9] runs on a 386PC. The asap1alap or other pre- processing times were not included but are negligible (less than 1 min of CPU time). All integer solutions were ob- tained using the LP model in over 80% of these cases (using different cost functions) for Problem 2.

Fig. 6 shows one solution for the elliptical wave filter example optimized for registers, functional units, and ex- ecution time. This optimized solution was obtained by minimizing the previous cost function with upper bounds of nineteen control steps, three adders, three two-cycle multipliers, and nine registers. The optimum solution with two two-cycle multipliers, two adders, and nine registers (not including the IN and OUT registers) required 200 s of CPU time for model generation and 18 s of CPU time (on 386PC) for LP execution (424 variables, 279 con- straints, 536 iterations). Lifetime defining edges for all but two variables were found using the transitivity and

I I I I I I I I I I I I I I I I I I I I I I I I I

I ( I I I I I I I I I I I I I

\ I

Fig. 6 . EWF schedule optimized for J = 19 control steps, with variable lifetime defining edges. The rest of the data transfers are shown with dashed lines. Two adders, two two-cycle multipliers, and nine registers were al- located.

TABLE IV SYNTHESIZERS COMPARISON FOR EWF

Subtasks Performed CPU

Example Synthesizer Runtime Sched FU Reg csteps

y c 17 EWF S.A.[l] 4 min Y HAL[4] 4-6 min y y e 17-21 LP Model 19-69 s y Y - 17-21 LP Model 3.6 min y Y Y 19

alap analysis. The multiple edges for the two variables required only 24 extra constraints. To the authors’ knowl- edge, no previous publications have quoted as low as nine registers for the EWF, which demonstrates that global op- timums have not been obtained by heuristic synthesizers.

IX. DISCUSSION

The IP models for architectural synthesis, presented herein, could not be solved using general branch-and- bound and other 0-1 IP techniques due to their size. How- ever, using polyhedral theory, we have shown that opti- mal solutions are synthesized in CPU execution times faster than previous research, which use heuristic tech- niques. Many other examples have been synthesized with the IP model [ 131.

1276 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 9, SEPTEMBER 1993

If not all the variables in the solution are integers we can 1) automatically extract violated odd cycles facets us- ing a shortest path algorithm [7], or 2) by solving the sys- tem of edge inequalities decompose the problem into a number of smaller ones to solve, or 3) branch and bound. These three techniques maintain optimality. Over 80% of the synthesized solutions (of the relaxed LP problem) op- timized with different cost functions were all-integer. The cases where not all variables were integer mostly were a result of using objective functions which did not make practical sense for architectural synthesis applications.

Although the register allocation constraint (8) does not have the node packing structure, we have shown that we can still solve for integer variables in most cases and in very fast execution times. A different register allocation constraint, not presented herein, has the node packing structure (although it requires approximately double the variables) and, therefore, proves that simultaneous sched- uling and allocation of functional units and registers is NP-complete by transformation to the node packing prob- lem.

The worst case complexity for the IP problem in theory is exponential. This gives a poor representation of the ex- pected complexity, since it means that there will be some problems that will take a long time to solve. However, similar to previous research [7], we have found that most problems are solved very fast, and it is expected that the majority will exhibit this behavior. As algorithms become larger, we intend to use the Karmarkar algorithm for LP solving [ 191, which executes ten times faster than the sim- plex and has shown larger improvements in speed as the problem size increases. Although we cannot guarantee 0-1 solutions (the problem is NP-complete), most exam- ples we have optimized provide 0-1 solutions. For the first time, our IP model provides tight bounds on the ar- chitectural synthesis problem, which is extremely impor- tant for any synthesis solution technique, including sim- ulated annealing or the use of techniques described herein.

A second approach for dealing with complexity relies on the designer to select subsets of the model to explore the design possibilities. For example, one can first mini- mize the number of functional units and then, second, minimize the number of registers to explore the architec- tural design space. There also exist various input algo- rithm partitioning strategies that have been researched to deal with complexity, such as vertical partitioning in [20] or partitioning and pipelining of the algorithm as dem- onstrated with the matrix multiplication example herein. However, more importantly, we have demonstrated that over 100 code operations can be simultaneously sched- uled and allocated in very fast CPU times. This ability to optimally synthesize large complex algorithms is a sig- nificant contribution to the synthesis field.

The LP model presented herein is a set of general con- straints and cost functions that can be used for any appli- cation. In other words, there is no need to extract new facets or formulate any new constraints to synthesize an architecture for each new application. Although we have

not specifically placed execution time in the objective function, our IP model can be extended easily to do this by adding an end operation (analogous to the end-of-loop operation) and placing it in the objective function similar to [21]. For example, the execution time can be formu- lated using an additional operation, end, where k < * end, V k , and Texec = E,,, ( ( j - l)~,,~~~,~). Then the cost func- tion could model both area and delay. In addition, our IP model can be extended to support selection of chained functional units [13]. In this case, the designer can iden- tify what type of operations will be chained (e.g., k* < k + ) , and the iP model would use a fixed timing constraint in place of a precedence constraint for the operations that are assigned to the chained functional unit [time (k* , k , ) = 0 for all xLc,] ,k , where i, is a chained functional Unit, and k , < k , for all x,,,,~ where i # 4.1. The model can then simultaneously schedule, select, and allocate func- tional units.

X . CONCLUSIONS

Currently, we have formulated new models to simulta- neously allocate interconnect [22] support asynchronous interfaces [23], select clock period [24], and partition into multichip architectures [25]. In the future, we intend to extend the LP model for simultaneous scheduling, allo- cation, and binding (to minimize the number of multi- plexors and the number of inputs to multiplexors). We also plan to conduct additional research on the extraction of more facets and the use of this model for interfacing to analog signal processing domains.

The LP model presented herein is a significant contri- bution to the field of high-level architectural synthesis. For the first time, we have proof that the synthesized ar- chitectures are globally optimal. This is important for in- dustry since the early decisions made during architectural exploration have the greatest impact on the final design. Previous synthesizers could at best guarantee a locally op- timal synthesized solution, which may not meet design constraints. Finally, we have demonstrated that the IP ar- chitectural synthesizer can handle input algorithms with different types of structure, with over 100 code operations and with complex constraints. In summary, this research guarantees globally optimal synthesized solutions, syn- thesizes large input algorithms in practical execution times, and supports complex constraints and cost func- tions.

APPENDIX I: TIGHTER PRECEDENCE CONSTRAINT

Proof that constraint (5) is tighter than (5") is presented in this appendix. Let P5 represent the scheduling polytope whose constraint set is generated only by (5) and similarly for Ps*. This is equivalent to showing that P4 C P5*.

1) P5 # P5* Proof: Consider the following fractional solution to

GEBOTYS AND ELMASRY: ARCHITECTURAL SYNTHESIS 1277

the scheduling problem a , b ( a < * b, where the upper bound on the number of control steps is 4.

k ( j = l j = 2 j = 3 j = 4

0 .1

This solution is violated by ( 5 ) = ( ~ 3 , ~ + x 3 , b + x 2 . b ) = (.1) + ( . 5 ) + ( . 5 ) = 1 . 1 > 1 and feasible for (5*) =

= -1.3 I - 1. Therefore P5 # P5* Q.E.D.

a / b .9 .5 .5 0

C ; = l j x , , , - E j ” = 2 j ~ , , b = 1(.9) + 3(.1) - 2(.5) - 3( .5)

2 ) P5 P5*

n n

(a) (b)

Fig. 7. Example of facets forming the fixed timing constraint.

Proof:

&4, b

now we have to derive ( 5 * ) : x , , ~ + 2 x ~ , ~ + 3 x ~ , ~ - 2 X2,b - 3 X3,b - 4 X4,b I - 1 left-hand side Of ( 5 ” ) : 5 4 - 3 X2.b - 2 X 3 , b - X4,b - 2 ~ 2 , b - 3 X3,b - 4 X4,b (by substitution) = 4 - 5 (X2 ,b + x 3 , b + x 4 , J = -1.

Q.E.D. Since we have shown that P5 * P5* and P5 # P5*, we have proved that P5 C P5*. Thus ( 5 ) is a tighter formu- lation of the precedence constraint than ( 5 ” ) .

APPENDIX 11: EDGE REDUCTION ALGORITHM

The transitivity and alap analysis discussed in Section IV can be performed using the polynomial time algorithm described below in pseudocode.

Given the node-node adjacency matrix A i , j , where a i , j

= 1 e i < * j , where i, j are code operations. We will use this matrix to represent the DAG (or input algorithm). First a path matrix, T, is calculated. Second, the asap/ alap tables for each code operation and the path matrix are used to delete edges that cannot represent lifetimes.

1) Compute matrix T,j, where ti,, = 1 e 3 a path from code operation i to code operationj in the DAG.

2 ) Compute matrix Li,j, where l i , j = 1 edge i, j of the DAG cannot be eliminated by transitivity or alap analysis. Initially set L = A , and then eliminate en- tries V i,

v j i Y j 2 l a i , j , = ai,jz = 1. if((cjl,jz = 1) v ( a s a ~ ( j 2 )

I alap(j1))) * l i , j , = 0.

The algorithm is finished and the L matrix is used to gen- erate the register constraints since each entry represents a possible lifetime defining edge.

APPENDIX 111: FACET PROOF FOR FIXED TIMING CONSTRAINTS

We will show that the fixed timing constraints inequal- ities (10) of Section VII, and the minimum/maximum timing constraints, inequalities (1 1) and (12) of Section

VII, both define facets for the scheduling problem for two operations, a and b, where time (a, b) = 0. Fig. 7 shows the node packing graph for this sample problem defined over j = 1 , * - - , 5 . An example of the fixed timing con- straint is shown in bold in Fig. 7(a) and the minimum timing constraint in bold in 7(b) both generated f o r j , = 3 for inequalities (10)-(12). Since in this example these are cliques of the graph G they form facets [7] of the scheduling polytope for this example.

ACKNOWLEDGMENT We would like to thank the three reviewers for their

helpful comments.

REFERENCES

[ l ] S . Devadas and A. R. Newton, “Algorithms for hardware allocation in data path synthesis,” IEEE Trans. Computer-Aided Design, vol. 8, pp. 768-781, July 1989.

[2] M. R. Garey and D. S. Johnson, Computers and Intractability. Freeman and Co. , 1979.

[3] E. L. Lawler, Combinatorial Optimization Networks and Mutroids. New York: Holt-Rinehart-Winston 1976.

[4] P. G . Paulin, “Force directed scheduling,” IEEE Trans. Computer- Aided Design, vol. 8, pp. 661-679, 1989.

[5] M. McFarland, A. Parker, and R. Camposano, “Tutorial on high- level synthesis,” in Proc. Design Automation Conf., 1988, pp. 330- 336.

[6] L. Hafer and A. Parker, “A formal method for the specification, anal- ysis and design of register-transfer-level digital logic,” IEEE Trans. Compurer-Aided Design vol. CAD-2, no. 1, pp. 4-17, Jan. 1983.

[7] G. L. Nemhauser and L. A. Wolsey, Integer and Combinatorial Op- timization. New York: Wiley Interscience, 1988.

[8] J . Lee, Y. Hsu, and Y. Lin, “A new integer linear programming formulation for the scheduling problem in data path synthesis,” in Intl. Con$. Proc. Computed Aided Design, 1989.

[9] A. Brooke, D. Kendricke, and A. Meeraus, GAMSIMINOS Users Manual.

[lo] P. Pfahler, “Automated datapath synthesis: A compilation ap- proach,” Processing and Microprogramming, vol. 21, pp. 577-584, 1987.

[ l l ] H. Crowder, E. L. Johnson, and M. Padberg, “Solving large scale zero-one linear programming problems,” Op. Res. , vol. 31, no. 5 , pp. 803-834, Sept. 1983.

New York: Scientific Press, 1988.

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 9, SEPTEMBER 1993

K. R. Baker, Introduction to Sequencing and Scheduling. New York: Wiley, 1974. C. H . Gebotys, “A global optimization approach to architectural syn- thesis of VLSI Digital synchronous systems with analog and asyn- chronous interfaces,” Ph.D. thesis, Dept. ECE, University of Water- loo, July 1991. hlsw and B. Mayo (Coordinator), High-Level Synthesis Workshop Clearinghouse, email hlsw-request@decwrl. dec. com. 1988. C.-T. Hwang, J-H. Lee, and Y-C. Hsu, “A formal approach to the scheduling problem in high-level synthesis,” IEEE Trans. Computer- Aided Design, vol. CAD-IO, no. 4 , pp. 464-475, 1991. A. Parker, “Tutorial on high level synthesis,” in Can. Conf. on VLSI, 1991. R. Camposano, “Path based scheduling for synthesis,” IEEE Trans. Computer-Aided Design, vol. CAD-10, pp. 85-93, 1991. E. D. Lagnese, “Architectural partitioning for systems level design of integrated circuits,” Ph.D. thesis, CMUCAD-89-27, Carnegie- Mellon University, Pittsburgh, PA, 1989. N. Karmarkar, “A new polynomial time algorithm for linear pro- gramming,” Combinatorica, vol. 4 , pp. 373-395, 1984. F. Depuydt, G. Goossens, J . Van Meerbergen, F. Catthoor, and H. DeMan, “Scheduling of large scale signal flow graphs based on met- ric graph clustering, ” in IFIP Con$ High Level Architectural Synthe- sis and Logic Synthesis, 1990. C. H. Gebotys and M. I. Elmasry, “Simultaneous scheduling and allocation for cost constrained optimal architectural synthesis, ” in ACMilEEE Design Automation Conf. , 1991. C. H . Gebotys and M. I . Elmasry, “Optimal synthesis of high-per- formance architectures,” IEEE J . Solid-State Circuit, vol. SC-2, no.

C. H. Gebotys, “Synthesizing embedded speed-optimized architec- tures,” in IEEE Custom Integrated Circuits Conf., pp. 8.7.1-8.7.4, 1992. C. H. Gebotys, “Optimal scheduling and allocation of embedded VLSI chips,” in ACMilEEE Design Automation Conf., 1992. C. H . Gebotys, “Optimal synthesis of multichip architectures,’’ in IEEEIACM Int. Conf. CAD, 1992.

3 , p p . 389-397, 1992.

Catherine H. Gebotys (S’82-M’92) received the B.A.Sc. degree in engineering science in 1982 and the M.A.Sc. degree in electrical engineering in 1984, both from the University of Toronto, On- tario, Canada. She received the Ph.D. degree in electrical engineering in 1991 from the University of Waterloo, Ontario, Canada.

She worked at Litton Systems Canada Ltd., Display Systems Engineering, from January 1985 to December 1986, in the area of CAD for VLSI and chip design. From January 1987 to August

1989, she was a research associate in the VLSI Group, Department of Elec- trical Engineering, University of Waterloo. She has been Assistant Profes- sor in the Department of Electrical and Computer Engineering, University of Waterloo since September 1991. She is coauthor of the book Optimal VLSI Architectural Synthesis: Area, Performance and Testability, Kluwer Academic Publishers, 1992. Her research interests include global optimi- zation approaches to CAD for VLSI, high-level architectural synthesis, and design for test.

Mohamed I. Elmasry received the B.Sc. degree from Cairo University, Cairo, Egypt, and the M.A.Sc. and Ph.D. degrees from the University of Ottawa, Ontario, Canada, all in electrical en- gineering in 1965, 1970, and 1974, respectively.

He has worked in the area of digital integrated circuits and system design for the last 25 years. He worked for Cairo University from 1965 to 1968 and for Bell-Northern Research, Ottawa, Canada, from 1972 to 1974. He has been with the Depart- ment of Electrical and Computer Engineering,

University of Waterloo, since 1974, where he is a Professor and founding director of the VLSI research group. He has a cross appointment with the Department of Computer Science, where he is a Professor. He holds the NSERCiBNR Research Chair in VLSI design at the same university since 1986. He has served as a consultant to research laboratories in Canada and the United States, including AT&T Bell Labs, GE, CDC, Ford Microelec- tronics, Linear Technology, Xerox, and BNR, in the area of LSIiVLSI digital circuitisubsystem design. During sabbatical leaves from Waterloo he was at the Micro Component Organization, Burroughs Corporation (Un- isys), San Diego, CA, Kuwait University, Kuwait, and Swiss Federal In- stitute of Technology, Lausanne, Switzerland. He has authored and coau- thored over 150 papers on integrated circuit design and design automation. He has several patents to his credit. He is the editor of the IEEE Press books Digital MOS Integrated Circuits, 1981, Digital VLSI Sysrems, 1985, and Digital MOS Integrated Circuits 11, 1991. He is also author of the book Digital Bipolar Integrated Circuits (Wiley, 1983) and coauthor of the book Digital BiCMUS Inregrured Circuits (Kluwer, 1991). Dr. Elmasry has served in many professional organizations in different positions including the Chairmanship of the Technical Advisory Committee of the Canadian Microelectronics Corporation. He is a founding member of the Canadian VLSI Conference, the International Conference on Microelectronics and the founding president of Pic0 Electronics Inc.

Dr. Elmasry is a member of the Association of Professional Engineers of Ontario and is a Fellow of the IEEE for his contributions to digital in- tegrated circuits.