Effects of resource sharing on circuit delay: an assignment algorithm for clock period optimization

Copyright (C) 1997, 1998 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works, requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept, ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or [email protected]


Effects of Resource Sharing on Circuit Delay: An Assignment Algorithm for Clock Period Optimization

Subhrajit Bhattacharya*   C&C Research Labs, NEC USA, Princeton, NJ 08540
Sujit Dey                 C&C Research Labs, NEC USA, Princeton, NJ 08540
Franc Brglez†             CBL, Dept. of CS, North Carolina State Univ, Raleigh, NC 27695

Abstract

This paper analyzes the effect of resource sharing and assignment on the clock period of the synthesized circuit. The assignment phase assigns or binds operations of the scheduled behavioral description to a set of allocated resources. We focus on control-flow intensive descriptions, characterized by mutually exclusive paths that arise from nested conditional branches and loops.

We show that clustering of multiple operations in the same state of the schedule, possibly leading to chaining of functional units (FUs) in the RTL circuit, is an effective way to minimize the total number of clock cycles and hence the total execution time. We present an assignment algorithm which is particularly effective for such design styles, by minimizing the data chaining and hence the clock period of the circuit, thereby leading to further reduction in the total execution time.

Existing resource sharing and assignment approaches for reducing the clock period of the resulting circuit either increase the resource allocation or use faster modules, both leading to larger area requirements. In this paper we show that even when the type of available resource units and the number of resource units of each type is fixed, different assignments may lead to circuits with significant differences in clock period.

We provide a comprehensive analysis of how resource sharing and assignment introduce long paths in the circuit.
Based on the analysis, we develop an assignment algorithm which uses a high-level delay estimator to assign operations to a fixed set of available resources so as to minimize the clock period of the resultant circuit, with no or minimal effect on the area of the circuit. Experimental results on several conditional-intensive designs demonstrate the effectiveness of the assignment algorithm.

Subject: High-Level Synthesis
Keywords: High-level synthesis, Resource sharing, Delay minimization, Assignment

* Part of this work was done when S. Bhattacharya was a student at Duke University, and was supported by a grant from NEC USA.
† F. Brglez was supported in part by contracts from the Semiconductor Research Corporation (94-DJ-553), SEMATECH (94-DJ-800), and ARPA/ARO (P-33616-EL/DAAH04-94-G-2080).

1 Introduction

Resource allocation, scheduling and resource sharing are interdependent tasks which affect many design parameters in the final synthesized circuit. Many approaches have been developed to perform these tasks while satisfying performance, area, and clock period requirements.

Scheduling affects not only the number of clock cycles required to execute a behavioral specification [2], but also the amount of resource sharing, since operations in different states can share a resource. In [3, 4, 5], scheduling is performed to optimize the performance in terms of clock cycles. However, aggressive scheduling to reduce the number of clock cycles can lead to packing many operations in the same state. The packing of many operations in one state may adversely affect the area and the clock period of the design.

The introduction of multiple choices to implement an operator, and using a combined scheduling and allocation algorithm, has produced designs with reduced area [6]. Efficient allocation of resources to minimize the area as a post-scheduling problem is studied in [8, 9]. The approaches consider minimization of functional units, registers as well as interconnection. However, sharing of resources typically increases the circuit delay.

In [7], the delay of the design is reduced by re-scheduling. In [12], faster modules are used to offset the increase in delay due to sharing of resources. High level modelling of area and delay for studying the area and delay tradeoffs due to different resource allocations has been addressed in [10, 11]. In [13], an area-delay cost function has been used to synthesize circuits initially described at the functional level to satisfy combined area and delay constraints.

The above approaches reduce the clock period of the implementation at the expense of increased area, by either increasing the resources allocated, or using faster modules. Alternatives to such trade-offs are analyzed in [1]. In [14], a particular case of sharing a functional unit amongst mutually exclusive operations and its effect on circuit delay is considered.
In this paper, we show that even when the type of available resource units and the number of resource units of each type is fixed, the clock period of the circuit can vary significantly depending upon the particular assignment used to map the operations to the resources.

We present a comprehensive analysis of various other sources of long (critical) paths due to resource sharing and assignment, and discuss assignment techniques to generate circuits whose critical paths have small delays. As motivation, Section 2 illustrates two different assignments of operations to a fixed set of resources, using the example of the Blackjack dealer process [15]. We show that the clock periods of the resulting circuits for the two different assignments are 44.95 ns and 34.81 ns, a difference of 23%. We also analyze the relationship between assignment and clock period in conditional-intensive designs which allow sharing of not only arithmetic units but also comparators, providing motivation for an assignment algorithm which minimizes the clock period, given a fixed allocation of resources and a schedule.

One may argue that clock period minimization can also be achieved by using existing gate-level delay optimization techniques, but only after an RT-level architecture has been fixed, and a gate-level implementation has been derived. As this paper shows, high-level architectural trade-offs can lead to a variety of RTL implementations for the same design, having a wide range of clock periods. Application of the gate-level techniques to an RTL architecture with a long clock period may not be as effective as deriving an RTL architecture with an optimal clock period that uses the proposed resource sharing technique, which allows for architectural design space exploration.

Terminology and definitions used in the paper are presented in Section 3. Section 4 analyzes sources of long (critical) paths due to resource sharing and assignment, and discusses assignment techniques to generate circuits whose critical paths have small delays. In Section 5, we outline an assignment algorithm to perform resource sharing such that the clock period of the synthesized circuit is minimized, while satisfying the resource constraints. The algorithm is applied to several VHDL descriptions which have numerous conditional branches and loops. The experimental results in Section 6 demonstrate the effectiveness of the algorithm in minimizing the clock period of the synthesized circuits.

2 Motivation

Sharing resources is a common technique used for minimizing the area of a design. Before analyzing the effect of sharing resources on the area and delay of the synthesized designs, it is instructive to look at Table 1, which reports the area and delay of the 8-bit and 16-bit implementations of several commonly used resources, using the SIS technology mapper with fanout optimization [16], OASIS layout tools [17], and the lib2.genlib standard cell SCMOS 2.0 library [18]. Assigning multiple operations to the same resource (sharing the resource) may require additional multiplexors to select the inputs of the operations being shared.
Hence, sharing a low-area unit like a (=)Comp unit, with an area similar to a mux, does not result in area savings unless the (=) operations to be shared have the same inputs. In contrast, sharing a (<)Comp, a (<,=,>)Comp, an adder or an ALU can result in significant savings in area.

Table 1: Area/delay for library modules in SCMOS 2.0 technology.

                          8 bit                                16 bit
Unit                      Area [mm²] × 100  Delay [ns]         Area [mm²] × 100  Delay [ns]
Comp(=)                   8.5               4.82               17.5              5.47
Comp(<)                   17.2              10.70              41.9              21.26
Comp(<,=,>)               19.5              12.29/11.59/11.22  45.2              22.61/21.79/21.29
adder (cla)               19.6              10.52              46.1              12.28
alu(+,-)                  31.4              16.03              71.8              24.13
alu(+,-) -> alu(+,-)      61.1              20.41              140.2             32.11
mux                       7.5               3.42               17.4              4.04

We next motivate the need for assignment techniques to minimize the increase in delay resulting from sharing of resources. Consider the control-flow graph (CFG) shown in Figure 1. The CFG of Figure 1 describes the dealer process, one of the processes in the behavioral description of the Blackjack circuit [15]. While the complete schedule for the dealer process is omitted for lack of space, Figure 2 shows three different schedules for a subset of operations of the dealer process, ranging from a conventional schedule which does not allow data chaining (Figure 2(a)) to the third schedule (Figure 2(c)) which aggressively packs several operations within the same state, with the potential of creating long chains in the RTL implementation.

[Figure 1: Control-flow graph for the Blackjack dealer process. Nodes are numbered 9 to 24 and conditional branches are labeled T and F. Node labels include ASuit < Hearts, AValue := Ace, ASuit := ASuit + 1, AValue <= King, PresentSuit != NoSuit, Card < DeckSize, Card := Card + 1, Card := 1, TheSuits[Card] := ASuit, TheValues[Card] := AValue, Card > Limit, Card := Card - Limit, Card := Card + Seed, AValue := AValue + 1, and PresentSuit := TheSuits[Card].]

Table 2 gives the average number of clock cycles (CC) the three schedules of Figure 2 would require to execute the same subset of operations. To calculate the average CC for the subset of operations scheduled, we assume that the probability of executing each path in the control-flow graph is the same. The clock period (CP) in Table 2 is calculated by implementing each schedule using the same set of functional units. For the third schedule, shown in Figure 2(c), there are two different assignments of the operations to the functional units which produce implementations with substantially different clock periods of 50.15 ns and 36.62 ns. The product of CC and CP gives the total execution time and is shown in the last column of Table 2. It is clear that the implementation corresponding to the third schedule is the fastest for the subset of operations under consideration.

It can be seen from the table that while the conventional schedule (Figure 2(a)) results in the least CP, the third schedule (Figure 2(c)) with data chaining leads to a faster execution time. Secondly, even when the schedule and the set of resources to be used are fixed, assignment of the operations to the resources can have a significant impact on the clock period of the implementation. This is illustrated by the results of the two assignments corresponding to the same schedule (Figure 2(c)) and same resources as shown in Table 2, with clock period ranging from 50.15 ns to 36.62 ns. We next explain the two alternative assignments of the operations of the schedule of Figure 2(c) to a fixed set of resources to illustrate the effect of assignment on circuit delay.

Let us assume that the allocated resources are three comparators and two ALUs, and resource sharing is performed. Two possible assignments of the operations of Figure 2(c) to the available resources, and the resultant RT-level circuits, are shown in Figures 3(a) and 3(b) respectively. We say that two units X and Y are chained if there exists a path from unit X to unit Y.
The assignment of Figure 3(a) creates many chained units, explained in Section 4.1, like cmp1 and cmp2 in the circuit of Figure 3(a). Similarly, cmp2 is chained to cmp3 and cmp3 to ALU1. The longest path thus consists of three comparators, one ALU and two muxes. This path is critical in determining the clock period. As shown in Table 2, the clock period of the circuit implementing the complete dealer process of Figure 1 under assignment 1 is 50.15 ns. In contrast, the longest path in the circuit corresponding to assignment 2 shown in Figure 3(b) has one comparator unit, one ALU and two muxes. The clock period of the complete implementation under assignment 2, as shown in Table 2, is 36.62 ns, a reduction of 27%.

Table 2: Variation of execution time (CC * CP) with schedule length (CC) and assignment.

Avg. Clock Cycles (CC)   Assignment     Clock Period (CP) [ns]   (CC * CP) [ns]
3.8                      Assignment 1   30.79                    117
2                        Assignment 1   26.89                    53.78
1                        Assignment 1   50.15                    50.15
1                        Assignment 2   36.62                    36.62
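The arithmetic behind Table 2 can be reproduced with a few lines of Python. This is only an illustrative sketch: the schedule labels attached to each row are an assumption (the rows are taken to follow the order Figure 2(a), 2(b), 2(c), with two assignments for 2(c)), and the names `TABLE_2`, `execution_time` and `reduction_percent` are hypothetical helpers, not part of the paper.

```python
# Rows of Table 2: (schedule, avg. clock cycles CC, assignment, clock period CP in ns).
# The schedule labels are an assumption: rows are taken to follow the order
# of Figure 2(a), 2(b), 2(c), with two assignments for 2(c).
TABLE_2 = [
    ("2(a)", 3.8, "Assignment 1", 30.79),
    ("2(b)", 2.0, "Assignment 1", 26.89),
    ("2(c)", 1.0, "Assignment 1", 50.15),
    ("2(c)", 1.0, "Assignment 2", 36.62),
]


def execution_time(cc, cp):
    """Total execution time in ns: average clock cycles times clock period."""
    return cc * cp


def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Execution time for every (schedule, assignment) pair of the table.
times = {(s, a): execution_time(cc, cp) for s, cc, a, cp in TABLE_2}
fastest = min(times, key=times.get)

for (schedule, assignment), t in times.items():
    print(f"Figure {schedule}, {assignment}: {t:.2f} ns")
print("fastest:", fastest)
print(f"CP reduction, assignment 1 -> 2: {reduction_percent(50.15, 36.62):.0f}%")
```

The comparison makes the two observations of this section explicit: the chained schedule has the smallest CC * CP product, and within that schedule the choice of assignment alone changes the clock period.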

[Figure 2: Three different schedules for a subset of operations of the dealer process to illustrate the clock cycle/clock period tradeoff. States are indicated by dashed rectangles. Panels (a), (b) and (c) schedule the same operations: 9: ASuit < Hearts; 10: ASuit = ASuit + 1; 12: AValue <= King; 13: PresentSuit != NoSuit; 14: Card < DeckSize; 15: Card = Card + 1; 16: Card = 1; 20: Card > Limit; 21: Card = Card - Limit; 22: Card = Card + Seed.]

This example illustrates that, even when the number of resources is fixed, different assignments can lead to circuits with significant difference in clock period. Consequently, an effective assignment algorithm is needed which, given a fixed set of resources, assigns the operations to minimize the clock period of the final circuit.

[Figure 3: Two implementations of the schedule of Figure 2(c) for two different assignments. (a) Assignment #1: (12) -> cmp1, (13, 9) -> cmp2, (14, 20) -> cmp3, (21, 22) -> alu1, (15, 10) -> alu2. (b) Assignment #2: (12) -> cmp1, (13) -> cmp2, (14, 20, 9) -> cmp3, (15, 21) -> alu1, (22, 10) -> alu2. Both datapaths use three (<, =, >) comparators Cmp1 to Cmp3 and two (+, -) ALUs, with input multiplexers controlled by the state decode logic and the comparator outputs.]

3 Terminology and Definitions

In this section, we define the terms that we will be using in the remainder of this paper. The definitions for the common ancestor and level number are with respect to an acyclic, directed and rooted graph.

The Common Ancestor (CA) of a pair of nodes $v_i$ and $v_j$ is a comparison node $v_k$ such that there are disjoint paths from $v_k$ to $v_i$ and $v_j$. The common ancestor of a set of nodes $S$ is the set CA of all comparison nodes $v_k$ such that there exist disjoint paths from $v_k$ to at least two nodes in $S$. In Figure 2(c), CA({16, 22}) = {13}, while CA({16, 22, 9}) = {13, 12}.

The Level Number (level) of a node $v_i$ in the graph is defined as:

$$\mathrm{level}(v_i) = \begin{cases} 1 & \text{if } v_i \text{ does not have a successor} \\ \max\{\mathrm{level}(v_j) + 1 \mid v_j \text{ is a successor of } v_i\} & \text{otherwise} \end{cases}$$

$$\mathrm{level}(S) = \max\{\mathrm{level}(v_i) \mid v_i \in S\}, \text{ where } S \text{ is a set of nodes.}$$

In the graph of Figure 7(a), the level numbers are shown in parentheses at the left of the vertices, while the operation numbers are shown on the right. For example, the level number of operation 15 is 1, and the level number of operation 14 is 2.

A pair of operations can share the same resource, that is, they are compatible, if (a) they are never executed in the same clock cycle, and (b) it is possible to assign them to the same resource. Two operations are never executed in the same clock cycle if (a) they are scheduled in separate states, or (b) they are in the same state but on mutually exclusive paths. Compatible operations can be shared in three fundamental ways: across mutually exclusive paths (Section 4.1), shared implicitly (Section 4.2), and across states (Section 4.3).

The compatibility graph contains a node for every operation that needs to be assigned to a functional unit. There is an edge between two nodes of the compatibility graph if the pair of nodes is compatible. A conflict graph is the complement of a compatibility graph. Hence, in the conflict graph, there is an edge between a pair of nodes iff there is no edge between the pair in the corresponding compatibility graph.

A set of nodes in a graph is said to form a clique if, for every pair of nodes in the set, there is an edge between the pair of nodes in the graph. A clique partitioning is a set of cliques such that every node in the graph belongs to one and only one clique in the clique partitioning.

4 Analysis and Minimization of the Effect of Resource Sharing on Clock Period

In this section, we analyze the effect of different types of resource sharing on the clock period of the synthesized circuit.
We show that in spite of having a fixed number of resources to share, assignment plays a critical role in determining the clock period of the final circuit. Three fundamental ways of sharing and their role in chaining resource units to create long paths are detailed. The transitivity of chaining is illustrated. Assignment techniques to derive circuits with short critical paths are suggested.

4.1 Sharing Mutually Exclusive Operations

Let two operations on mutually exclusive paths be assigned to the same unit u2, and their common ancestor (CA) to unit u1. There is a combinational path from unit u1 to unit u2 (chaining from u1 to u2) if and only if the two operations being shared have at least one operand different. The chaining arises because the output of u1 has to decide which of the different operands should be an input to u2. In Figure 3(a), considering assignment 1, op13 and op9 are assigned to cmp2 and their CA, op12, to cmp1. Since op13 and op9 have different inputs, circuit 1 (Figure 3(a)) has a combinational path from cmp1 to cmp2. Similarly, assignment 1 creates a path from cmp2 to cmp3 and yet another from cmp3 to ALU1. By transitivity of combinational paths, we get a long combinational path (cmp1 -> mux -> cmp2 -> mux -> cmp3 -> mux -> ALU1).

We propose the assignment rules R1 to R4, to perform resource sharing while avoiding the formation of the resource chains that create long paths in the resultant circuit.

(R1) Share operations whose corresponding inputs are the same. Since input selection is not necessary, there will be no chaining with the ancestor's output.

(R2) Share those operations such that the sum of the delay of the resource unit on which the operations can be implemented and the delay of the unit on which their CA is implemented is minimum. Figure 4 shows the CFG of the operations scheduled in one of the states of a schedule for the benchmark Fancy [19] and the available resources. Note that all the add operations, op2, op4 and op5, can be shared. Sharing op4 and op5 chains a (<) unit with an adder through a mux (delay 37.58 ns, from the 16 bit column of Table 1). However, sharing op2 and op4 creates a chain from the (=) unit to an adder through a mux with delay 21.79 ns. This example illustrates the advantage of sharing operations whose CA has the least delay. The next two rules effectively break up long chains of resource units.

(R3) Share operations whose common ancestor (CA) has the highest level number. Referring to the datapath corresponding to assignment 1 in Figure 3(a), there is a long chain (cmp1 -> cmp2 -> cmp3 -> ALU1).
If, instead of sharing op21 and op22 (CA(21,22) has level number 2) as in assignment 1, op15 was shared with op21 (CA(15,21) has level number 3) as in assignment 2 (Figure 3(b)), cmp2 would be directly chained with ALU1, thus bypassing the intermediate cmp3. This illustrates that sharing operations whose CA is closer to the root of the CFG creates shorter chains.

(R4) Share operations which have smaller level numbers. Consider sharing operations (20,9,14), where all three have level number 2, and assigning op13, which has level number 3, to a separate comparator as done in assignment 2 (Figure 3(b)), as opposed to sharing operations (9,13) and (14,20) as in assignment 1. Since op13 assigned to cmp2 is not shared at all, the output of cmp1 is no longer chained with cmp2. This further breaks up the chain (cmp1 -> cmp2 -> cmp3) of the circuit produced by assignment 1 (Figure 3(a)), to generate the circuit shown in Figure 3(b) where the longest chain is (cmp1 -> ALU1).
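The quantities that rules (R2) to (R4) rely on can be checked with a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the successor map `SUCCESSORS` is the tree of Figure 2(c) as inferred from the level numbers and common ancestors quoted in the text, the chain delay is taken to be the plain sum CA unit + mux + shared unit (which reproduces the 37.58 ns and 21.79 ns quoted for rule (R2)), and all function names are hypothetical.

```python
# Tree of Figure 2(c), inferred from the text (assumption): each comparison
# node maps to its successors; operations without an entry are leaves.
SUCCESSORS = {12: [13, 9], 13: [14, 20], 14: [15, 16], 20: [21, 22], 9: [10]}

# 16-bit delays in ns, copied from Table 1.
DELAY_16 = {"cmp_eq": 5.47, "cmp_lt": 21.26, "adder": 12.28, "mux": 4.04}


def level(node):
    """Level number: 1 for a node without successors, else 1 + max over successors."""
    children = SUCCESSORS.get(node, [])
    return 1 if not children else 1 + max(level(c) for c in children)


def descendants(node):
    """All nodes reachable from `node`, including `node` itself."""
    reach = {node}
    for child in SUCCESSORS.get(node, []):
        reach |= descendants(child)
    return reach


def common_ancestors(ops):
    """Comparison nodes with disjoint paths to at least two nodes of `ops`.

    In a tree, two paths from a node are disjoint exactly when they leave
    the node through different successors.
    """
    targets = set(ops)
    result = set()
    for node, children in SUCCESSORS.items():
        branches_hit = sum(1 for c in children if descendants(c) & targets)
        if branches_hit >= 2:
            result.add(node)
    return result


def chain_delay(ca_unit, shared_unit):
    """Delay in ns of the chain CA unit -> mux -> shared unit (rule R2)."""
    return DELAY_16[ca_unit] + DELAY_16["mux"] + DELAY_16[shared_unit]


# Rule R2 on the Fancy example: sharing (op4, op5) versus sharing (op2, op4).
print(round(chain_delay("cmp_lt", "adder"), 2), round(chain_delay("cmp_eq", "adder"), 2))
# Rule R3: level of the CA for the pairs (21, 22) and (15, 21).
print(level(max(common_ancestors({21, 22}))), level(max(common_ancestors({15, 21}))))
```

Under these assumptions the sketch agrees with the values stated in the text: CA({16, 22}) = {13}, CA({16, 22, 9}) = {13, 12}, and operations 9, 14 and 20 all have level number 2.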

[Figure 4: The Fancy example: Illustrating sharing of operations whose CA has smaller delay. Resources available: 1 (=) unit, 1 (<) unit, 2 (+) units. Operations: op1 (=), op2 (+), op3 (<), op4 (+), op5 (+). Assignment #1: (1) -> cmpeq, (3) -> cmplt, (2) -> add1, (4, 5) -> add2. Assignment #2: (1) -> cmpeq, (3) -> cmplt, (2, 4) -> add1, (5) -> add2.]

4.2 Implicit Sharing Inside a State

We define implicit sharing inside a state with respect to the CFG of operations scheduled in that state. A functional unit F is said to be implicitly shared inside a state if and only if there exist two or more mutually exclusive paths in the CFG of the state from a conditional operation $O_i$ to an operation $O_k$ that is assigned to F, and the inputs to $O_k$ differ depending upon the path executed. Let the conditional operation $O_i$ deciding which path to execute be assigned to a comparator unit C. Implicit sharing creates a combinational path from the comparator unit C to the implicitly shared unit F.

[Figure 5: (a) Flowgraph to illustrate implicit sharing inside a state. (b) Duplicating operations to avoid implicit sharing. In (a), op1 is t1 > MaxPkt, op2 is Count := MaxPkt, op3 is Count := t1, and op4 is t2 := ByteCount + Count; the datapath contains one comparator Cmp(>) and one adder. In (b), op4 is split into op4.1 and op4.2, which compute t2 := ByteCount + MaxPkt and t2 := ByteCount + t1; the datapath contains Cmp(>) and two adders.]

An example CFG is shown in Figure 5(a). The CFG contains the operations to be executed in one of the states of a schedule of the send process of a network protocol X.25 [20]. If the comparison operation op1 is assigned to a comparator, and op4 is assigned to an adder, from the above definition we have implicit sharing of the adder and there will be a path from the comparator to the adder, as is shown in the datapath of Figure 5(a). However, consider Figure 5(b), where op4 has been split into two operations, one for each mutually exclusive path to op4. If each copy of op4, op4.1 and op4.2, were to be assigned to a separate adder, there would be no implicit sharing. Also, there would be no chaining from the output of the comparator to which op1 has been assigned to either of the two adders, as is shown in the datapath of Figure 5(b). Splitting implicitly shared operations converts a CFG into a tree, which is hence referred to as a control flow tree (CFT). Given a CFT, the assignment algorithm makes the choice whether the split operations should be assigned to the same unit or not, depending upon the availability of resource units.

4.3 Implicit Sharing across States

All existing resource sharing techniques, including the technique presented in this paper, consider sharing a resource amongst operations that belong to different states. However, sharing across states can also lead to implicit sharing. Since none of the existing resource sharing techniques consider the effect of implicit sharing across states, we discuss the subject in detail next. In a behavioral description with mutually exclusive paths, it is possible for an operation to be scheduled in more than one state. When the occurrences of this operation in two or more states are assigned to the same unit, implicit sharing across states takes place. Implicit sharing across states can create resource chaining resulting in long combinational paths.

In Figure 6(a), we show a fragment of a CFG extracted from the control software code for the AutoPilot of an Unmanned Aerial Vehicle (UAV), written for the Motorola MC68HC11 microcontroller [21]. The available resources are two comparators and two (+/-) ALUs. Note from Figure 6(b) that op6 could not be scheduled in state s0 along path 1,3,5,6, since it would violate the resource constraint of two ALUs. We show that for the given schedule, implicit sharing across states creates a long path.
An alternative assignment is given which creates shorter paths by avoiding implicit sharing across states.

In Figure 6(b), operation 6 is scheduled in two states, s0 and s2. Two possible assignments and the resultant datapaths are given in Figures 6(c) and (d). Let $s_j^i$ be an instance of operation $i$ scheduled in state $s_j$. Since assignment 1 in Figure 6(c) assigns $s_0^6$ and $s_2^6$ to ALU2, by definition we have implicit sharing across states, and the path cmp1 -> ALU1 -> ALU2 -> cmp2 is created. However, if $s_2^6$ is assigned to ALU1, two smaller chains cmp1 -> ALU1 -> ALU2 and cmp1 -> ALU1 -> cmp2 are created, as shown in Figure 6(d). Instead of assuming that the occurrences of the same operation in different states will be assigned to the same unit, we treat each occurrence as a separate operation. The assignment algorithm chooses the best assignment to minimize the clock period.
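The splitting of implicitly shared operations described in Section 4.2 can be sketched in Python as follows. This is a minimal illustration under assumptions: the CFG of Figure 5(a) is represented as a successor map, the function names `count_paths` and `split_to_tree` are hypothetical, the numbering of the copies (4.1, 4.2) is arbitrary, and the sketch only duplicates nodes; it does not rewrite the operands of the copies as Figure 5(b) does.

```python
def count_paths(successors, root):
    """Number of distinct root-to-node paths for every node of an acyclic CFG."""
    paths = {}

    def walk(node):
        paths[node] = paths.get(node, 0) + 1
        for child in successors.get(node, []):
            walk(child)

    walk(root)
    return paths


def split_to_tree(successors, root):
    """Duplicate every node reached along more than one path, giving a tree (CFT).

    Copies of a node n are named "n.1", "n.2", ... in visiting order;
    nodes reached along a single path keep their name.
    """
    paths = count_paths(successors, root)
    copies_made = {}
    tree = {}

    def name_for(node):
        if paths[node] == 1:
            return str(node)
        copies_made[node] = copies_made.get(node, 0) + 1
        return f"{node}.{copies_made[node]}"

    def build(node):
        name = name_for(node)
        tree[name] = [build(child) for child in successors.get(node, [])]
        return name

    build(root)
    return tree


# CFG of Figure 5(a): op1 branches to op2 and op3, which both lead to op4.
cfg = {1: [2, 3], 2: [4], 3: [4]}
cft = split_to_tree(cfg, 1)
for node, children in sorted(cft.items()):
    print(node, "->", children)
```

Each copy in the resulting tree can then be treated as a separate operation, which mirrors how occurrences of the same operation in different states are handled in Section 4.3.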

[Figure 6: UAV AutoPilot: Illustrating implicit sharing across states. (a) CFG fragment with operations 1: EndVal >= BegVal; 2: t1 = EndVal - BegVal; 3: t1 = MaxVal + EndVal; 4: t2 = 0; 5: t2 = BegVal + Offset; 6: PulseWidth = t1 - t2; 7: PulseWidth < OneMsec; 8: RTInterrupt = False. (b) Schedule of the operations over states S0, S1 and S2. (c) Assignment #1 and Circuit #1. State S0: op1 -> cmp1, op2 -> alu1, op3 -> alu1, op5 -> alu2, op6 -> alu2. State S1: op7 -> cmp2. State S2: op6 -> alu2, op7 -> cmp2. (d) Assignment #2 and Circuit #2: same as assignment #1, except for State S2: op6 -> alu1.]

5 An Assignment Algorithm for Minimizing the Clock Period

A traditional approach for finding an assignment is to find a clique partitioning of the compatibility graph, such that the operations in each clique can be assigned to a unique available resource unit. Since clique partitioning is an NP-complete problem [22], existing approaches to the assignment problem often use fast approximate algorithms [23]. Various features are added to the compatibility graph depending upon the goal of the assignment algorithm. The assignment algorithm proposed in [23] uses edge weights assigned to the compatibility graph edges to derive an assignment which minimizes the area of the final implementation. In [24], constraints are imposed on the conflict graph to generate self-testable data paths.

We present an assignment algorithm which, given a scheduled CFG and a set of allocated resources, synthesizes a circuit with minimum clock period. The assignment algorithm uses a special compatibility graph called the assignment graph (AG), which has the following components.

(1) For each state of the scheduled CFG there is a corresponding CFT in the AG. All assignment nodes of the form x := y are removed from the CFTs. Each node in a CFT is assigned a level number.

(2) There exists a compatibility edge between each pair of compatible operations, with an associated edge weight. The operations can be on the same CFT (mutually exclusive) or on different CFTs (across states). The weight is an estimate of the delay penalty of sharing the two operations connected by the compatibility edge. The delay penalty can be estimated by the delay estimation algorithm discussed in [25].

Figure 7(a) shows the assignment graph corresponding to the CFG in Figure 2(c). The level numbers are shown on the left of each node inside parentheses in Figure 7(a); the node numbers are shown on the right. The dashed edges are the compatibility edges.
Since op15 and yop21 areon mutually exclusive branches of op13, and they can be assigned to the same adder, there is acompatibility edge between them. The weight on the compatibility edge, also called the delay penalty,is the delay of the longest combinational path that results from sharing the two operations connectedby the compatibility edge. The delay penalty of sharing op15 and op21 includes the delay of the adder,a mux at the input of the adder, and another comparator unit to which op13 is assigned to controlthe mux and is 28.65ns (using the delay values in the 8 bit column of Table 1). Details of the delayestimation algorithm are given in [25].5.1 The Assignment AlgorithmWe present an algorithm ClkMin which, given the special compatibility graph called the assignmentgraph (AG) and a set of allocated resources, assigns the operations in the graph to the resources, so asto minimize the clock period of the resultant circuit. The assignment graph consists of the CFT of each13

AValue <= King

PresentSuit != NoSuit Asuit < Hearts

Card < DeskSize

suit = Asuit + 1

Card = Card + 1 Card > Limit

Card = Card - Limit Card = Card + Seed

(a)

28.65

28.65

28.65

28.65

28.65

28.00

28.

0028.0028.00

28.

65

4)

(3)

(2)

(1)

(1)

(2)

(2)

(1)

(1)

12

13 9

10

20

2221

15

14

AValue <= King

PresentSuit != NoSuit Asuit < Hearts

Card < DeskSize

Card = Card + 1 Card > Limit

(b)

32.07

12

139

14

(22,10)

15 20

Asuit = Asuit + 1Card = Card + SeedCard = Card - Limit

21

28.65

32.07Figure 7: (a) Assignment graph for the CFG of Figure 2. (b) Assignment graph after assigning op21 and op22to an alu unit (only relevant portions shown).state of a given schedule. Initially, each node in the AG contains one operation. Procedure ClkMiniteratively merges compatible nodes in AG such that in the �nal AG, each node (containing one or moreoperations) can be assigned to a separate available resource unit. The procedure Select Nodes To Shareuses assignment rules (R2) through (R4) (Section 4) to select a pair of nodes to be shared such thatthe clock period of the �nal implementation is minimized. Whenever a pair of AG nodes are chosento be shared, the nodes are merged and the AG is updated (routine Update). If possible, a subset ofthe nodes of the updated AG are assigned to available resource units. The above process of selecting,updating and partial assignment is repeated until all AG nodes are assigned.Assignment Pitfalls. There are two situations that have to be avoided by an assignment al-gorithm. We �rst illustrate the situation where an iterative assignment algorithm fails to �nd anassignment even though an assignment exists. The operations in Figure 8(a) have to be assigned to 3(<) units, 1 (+/-) unit, and 1 (-/*) unit. In Figure 8(b) we show the assignments made by an iterativealgorithm. The algorithm assigns operations op3 and op7 to the (+=�) unit. Since op5 and op6 arenot compatible with op3, and op3 has been assigned to the (+=�) unit, hence op5 and op6 can notbe assigned to the (+=�) unit. Since the (+=�) unit is the only unit on which op5 and op6 can beexecuted, sharing operations op3 and op7 prevents an assignment of all the operations to the available14

resources. However, as shown in Figure 8(c), an assignment does exist.

Figure 8: Example of failure of iterative assignment to produce an assignment.
(a) Operations: op1 (<), op2 (<), op3 (-), op4 (<), op5 (+), op6 (+), op7 (+), op8 (*). Resources available: 3 (<) units, 1 (+/-) unit, 1 (-/*) unit.
(b) Iterative assignment (infeasible): (1) -> (<)1, (2) -> (<)2, (3) -> (+/-), (4) -> (<)3, (7) -> (+/-), (8) -> (-/*), (5) -> ?, (6) -> ?
(c) Feasible assignment: (1) -> (<)1, (2) -> (<)2, (3) -> (-/*), (4) -> (<)3, (7) -> (+/-), (8) -> (-/*), (5) -> (+/-), (6) -> (+/-)

The second situation arises when there are two assignment choices such that the first produces a circuit with a combinational loop while the second does not. Since most logic synthesis systems and timing analyzers are not able to handle combinational loops, it is desirable to generate circuits without combinational loops. The formation of combinational loops due to assignment, and an algorithm to produce loop-free assignments, were presented in [26] and hence are not discussed further in this paper.

Functions Select_Nodes_To_Share and Safe_Share ensure that ClkMin always produces an assignment which does not create loops when such an assignment exists. If there is more than one such assignment, then the assignment selected minimizes the clock period.

Safe_Share. Function Safe_Share ensures that the two kinds of unsafe assignment decisions discussed above are never made. Let us assume that sharing a resource between nodes (vi) and (vj) is favorable for minimizing the clock period. However, the sharing might prevent assignment of all the remaining operations to available resources. To ascertain whether it is safe to do the sharing, function Safe_Share is invoked. The pseudo-code for function Safe_Share is given at the end of this section.

In the first step, a compatibility graph is created from the assignment graph by removing all edges from the AG except for the compatibility edges. Next, an updated compatibility graph is produced from the compatibility graph. In the updated compatibility graph, the two nodes (vi) and (vj) are merged to produce one node (vi, vj). If any node in the original compatibility graph had an edge to only one of (vi) or (vj), then that edge is deleted in the updated compatibility graph.
If there was an edge from a node (vk) to both (vi) and (vj), then in the updated graph an edge is created from (vk) to the node (vi, vj). All other edges in the original compatibility graph are maintained in the updated compatibility graph. Finally, for every FU allocated, a node is added to the compatibility graph. An edge is added between an operation node (that is, a node derived from the AG) and an FU node if all the operations of the operation node can be executed on the FU node.

An assignment of the operations to the allocated FUs exists iff every clique in a min-clique-partitioning of the above graph contains an FU node. The complete proof is straightforward and is omitted. However, it can be seen that a min-clique-partition which satisfies the above property implies the existence of an assignment, since the operations corresponding to each clique can be assigned to the FU belonging to the same clique. The above algorithm, combined with the algorithm in [26] which produces loop-free assignments, is used to determine whether merging nodes (vi) and (vj) is safe.

Select_Nodes_To_Share. This function, which forms the core of our assignment algorithm, chooses a compatible pair to be shared such that the final clock period is minimized. While rule (R1) of Section 4 is implemented by line 1 in ClkMin, rules (R2), (R3) and (R4) are implemented by lines 3 to 5 of Select_Nodes_To_Share. Line 3 of the code implements rule (R4), since the average of the level numbers of nodes with smaller level numbers will be smaller than the average of nodes with higher level numbers. Referring to Figure 7, applying line 3 results in operations (14, 20, 9) sharing the same resource (as in assignment 2), as opposed to operations (9, 13) and (14, 20) sharing resources (as in assignment 1). Line 4 directly implements (R3). Line 5 implements rule (R2).
Since the compatibility edge weight is an estimate of the cost of sharing the two (sets of) operations, it includes the delay of the CA of the two operations, the cost of the unit to which the shared operations can be assigned, and the cost of the mux needed to switch inputs into the resource unit. Hence, picking the pair whose compatibility edge has minimum weight implements rule (R2).

Function Select_Nodes_To_Share ensures that ClkMin always finds an assignment when one exists. In line 7 of function Select_Nodes_To_Share, the function Safe_Share is invoked to ascertain whether it is safe to assign the two nodes under consideration to the same resource. If a pair is unsafe, then the compatibility edge between them is deleted, thus preventing them from being assigned to the same resource. This ensures that ClkMin always finds an assignment which does not create loops if at least one such assignment exists. If there is more than one such assignment, then ClkMin selects the assignment which minimizes the clock period.

Update. After a pair of nodes (vi), (vj) has been selected to be shared, the assignment graph is updated as follows:

• The nodes are merged into a new node (v) = (vi, vj), such that (v) contains all the operations previously contained in (vi) and (vj).

• A compatibility edge is introduced between the new node (v) and a node (vk) if and only if there were compatibility edges between (vk) and (vi) and also between (vk) and (vj) before merging. Consider the assignment graph shown in Figure 7(a). After merging nodes 14 and 9, an edge is introduced between the new node (14,9) and 20, but not between (14,9) and 13.

• The compatibility edge weights need to be updated. In the AG shown in Figure 7(a), assume node 10 and node 22 have been merged. This implies they will be assigned to the same unit (as in ALU2 of assignment 2 shown in Figure 2). Note that the edge between node 15 and the merged operations (22,10) shown in the updated AG in Figure 7(b) now has weight 32.10. The updated weight reflects the extra delay of sharing three operations on the same resource unit (due to more muxes at the input) compared with sharing two operations on one unit (as reflected by the weight between node 15 and node 21, which is 30.28). The updating of the edge weights is done by the RT-level delay estimation algorithm described in [25].

• The new node (v) = (vi, vj) is assigned a level number given by MAX(level((vi)), level((vj))).

Assign. Let V be a set of nodes in AG and R a subset of the allocated resource types. There can be more than one unit of a particular resource type available. The procedure Assign assigns the nodes in V to the units of R if the following two conditions hold:
(1) each node in V can be assigned to a separate resource unit of a type which is in R;
(2) operation nodes which are in AG but not in V are not implementable on a resource type belonging to R.

Condition (1) assures that the assignment is feasible. Condition (2) implies that nodes in (AG - V) cannot be implemented on any resource of type R. Hence, if an assignment of nodes in (AG - V) existed before assigning the nodes in V to resources of type R, the assignment would exist even after the nodes in V were assigned.
Thus, when conditions (1) and (2) are satisfied, the nodes in AG corresponding to the nodes in V are marked assigned and are not considered for further merging, and the resource units in R are removed from the list of resources, resource_list. The number of node pairs which have to be considered for merging is reduced, thus speeding up the assignment process.

The pseudo-code of the assignment algorithm is given below.

ClkMin(resource_list, AG) {
1.  merge all nodes in AG which have common inputs;
2.  Assign(AG, resource_list);
3.  while (∃ nodes in AG that have not yet been assigned) {
4.      (vi, vj) = Select_Nodes_To_Share(AG, resource_list);
5.      if (no sharable pair (vi, vj) exists)
6.          return ASSIGNMENT_FAILED;
7.      Update(vi, vj, AG);
8.      Assign(AG, resource_list);
    }
9.  return assignment;
}

Select_Nodes_To_Share(AG, rsrc_alloc) {
1.  while (∃ compatible pairs in AG) {
2.      C0 ← all compatible pairs (vi, vj) in AG s.t. neither vi nor vj is assigned;
3.      C1 ← all (vi, vj) ∈ C0 s.t. |level(vi) + level(vj)| / 2 is minimum;
4.      C2 ← all (vi, vj) ∈ C1 s.t. level(common_ancestor(vi, vj)) is maximum;
5.      C3 ← all (vi, vj) ∈ C2 s.t. compat_edge_weight(vi, vj) is minimum;
6.      for every (vi, vj) ∈ C3 {
7.          if (Safe_Share(vi, vj, AG, rsrc_alloc) = SAFE_TO_SHARE)
8.              return (vi, vj);
            else
9.              in AG, delete compatibility edge between vi and vj;
        }
    }
10. return (∅, ∅);
}

Safe_Share(vi, vj, AG, rsrc_alloc) {
1.  compat_graph = create_compat_graph(AG);
2.  compat_graph = update_compat_graph(compat_graph, vi, vj);
3.  compat_graph = add_FU_nodes(compat_graph, rsrc_alloc);
4.  for every min clique partitioning P of compat_graph {
5.      if (∃ assignment_P with available resources) & (assignment_P does not create loops)
6.          return SAFE_TO_SHARE;
    }
7.  return UNSAFE_TO_SHARE;
}
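As an illustration, the feasibility question that Safe_Share answers can be sketched in Python for the example of Figure 8. This is a minimal sketch under stated assumptions, not the implementation used in ClkMin: it replaces min-clique-partitioning with an exhaustive search over unit choices, it omits the loop check of [26], the unit names are hypothetical labels, and the compatibility edges other than those mentioned in the text are inferred from the feasible assignment of Figure 8(c).

```python
# Operation types read from Figure 8(a).
OPS = {1: "<", 2: "<", 3: "-", 4: "<", 5: "+", 6: "+", 7: "+", 8: "*"}

# Allocated units and the operation types each can execute (Figure 8).
# The unit names are hypothetical labels.
UNITS = {
    "lt1": {"<"}, "lt2": {"<"}, "lt3": {"<"},
    "addsub": {"+", "-"},
    "submul": {"-", "*"},
}

# ASSUMPTION: pairs of operations that may share a unit. The text only states
# that op3 is compatible with op7 but not with op5 or op6; the remaining
# edges are inferred from the feasible assignment of Figure 8(c).
COMPAT = {frozenset(p) for p in [(3, 7), (3, 8), (5, 6), (5, 7), (6, 7)]}


def find_assignment(groups):
    """Exhaustively search for a unit for every group of merged operations.

    Returns {unit: [ops]} if all groups fit on the allocated units, else None.
    """
    groups = sorted(groups)
    load = {unit: [] for unit in UNITS}

    def fits(group, unit):
        executable = all(OPS[op] in UNITS[unit] for op in group)
        shareable = all(frozenset((op, other)) in COMPAT
                        for op in group for other in load[unit])
        return executable and shareable

    def place(i):
        if i == len(groups):
            return True
        for unit in UNITS:
            if fits(groups[i], unit):
                load[unit].extend(groups[i])
                if place(i + 1):
                    return True
                del load[unit][-len(groups[i]):]  # undo and try the next unit
        return False

    return {u: sorted(ops) for u, ops in load.items()} if place(0) else None


def safe_to_share(groups, gi, gj):
    """True if merging groups gi and gj still leaves a complete assignment."""
    merged = [g for g in groups if g not in (gi, gj)] + [gi + gj]
    return find_assignment(merged) is not None


initial = [(op,) for op in OPS]
print(find_assignment(initial))
print(safe_to_share(initial, (3,), (7,)))  # the merge made in Figure 8(b)
print(safe_to_share(initial, (3,), (8,)))  # the merge used in Figure 8(c)
```

Under these assumptions the search recovers the assignment of Figure 8(c) for the unmerged graph, rejects the merge of op3 with op7, and accepts the merge of op3 with op8. The exhaustive search is exponential in the worst case and is only meant to make the safety condition concrete.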

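The selection filters (lines 2 to 5 of Select_Nodes_To_Share) and the Update rules for edges and level numbers can likewise be sketched in Python. This is a hedged illustration rather than the authors' code: the function names, the dictionary-based graph representation, and all level numbers and edge weights in the example are hypothetical, the Safe_Share test is omitted, and the new edge weight is taken as the maximum of the two old weights only as a stand-in for the RT-level delay estimator of [25].

```python
def select_pair(edges, level, ca_level):
    """Apply lines 2 to 5 of Select_Nodes_To_Share as successive filters.

    edges:    {frozenset({a, b}): weight} for every compatible pair
    level:    {node: level number}
    ca_level: {frozenset({a, b}): level of the pair's common ancestor}
    """
    c0 = list(edges)
    if not c0:
        return None

    def avg_level(pair):
        a, b = pair
        return (level[a] + level[b]) / 2

    lowest = min(avg_level(p) for p in c0)
    c1 = [p for p in c0 if avg_level(p) == lowest]      # line 3, rule R4
    deepest = max(ca_level[p] for p in c1)
    c2 = [p for p in c1 if ca_level[p] == deepest]      # line 4, rule R3
    cheapest = min(edges[p] for p in c2)
    c3 = [p for p in c2 if edges[p] == cheapest]        # line 5, rule R2
    return c3[0]


def merge_nodes(edges, level, vi, vj, reweigh=max):
    """Update step: merge vi and vj; nodes are tuples of operation ids."""
    v = tuple(sorted(vi + vj))
    others = {n for pair in edges for n in pair} - {vi, vj}
    # Edges that touch neither merged node are kept unchanged.
    new_edges = {p: w for p, w in edges.items() if vi not in p and vj not in p}
    for vk in others:
        ei, ej = frozenset((vk, vi)), frozenset((vk, vj))
        if ei in edges and ej in edges:  # vk was compatible with both nodes
            # ASSUMPTION: placeholder for the delay re-estimation of [25].
            new_edges[frozenset((vk, v))] = reweigh(edges[ei], edges[ej])
    new_level = {n: l for n, l in level.items() if n not in (vi, vj)}
    new_level[v] = max(level[vi], level[vj])
    return new_edges, new_level, v


# Compatibility edges among nodes 9, 13, 14 and 20 as implied by the text on
# Figure 7(a); the weights and level numbers are hypothetical.
EDGES = {
    frozenset({(9,), (13,)}): 30.0,
    frozenset({(9,), (14,)}): 28.0,
    frozenset({(9,), (20,)}): 29.0,
    frozenset({(14,), (20,)}): 29.5,
}
LEVEL = {(9,): 1, (13,): 3, (14,): 1, (20,): 2}
CA_LEVEL = {pair: 0 for pair in EDGES}

vi, vj = sorted(select_pair(EDGES, LEVEL, CA_LEVEL))
new_edges, new_level, v = merge_nodes(EDGES, LEVEL, vi, vj)
print(v, new_edges, new_level)
```

With these hypothetical numbers, nodes 9 and 14 are selected and merged; the merged node keeps an edge to node 20, which was compatible with both, and loses the edge to node 13, which was compatible with node 9 only.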
6 Experimental Results

The proposed assignment algorithm, ClkMin, has been integrated into the high level synthesis system SECONDS [27]. To evaluate the effectiveness of the proposed assignment technique, we have synthesized the following conditional-intensive VHDL descriptions: the dealer process of Blackjack [15], Fancy [19], the controller for the AutoPilot of an Unmanned Aerial Vehicle (UAV) [21], a subcircuit of a Vending machine controller [28], and the graphics controller [29]. Table 3 shows the synthesis results.

Each description is scheduled to satisfy the resource constraints specified in column Resources. Assignment 1 is a feasible assignment which satisfies the resource constraints, but does not focus on minimizing clock period. Assignment 2 has been obtained by the proposed algorithm ClkMin to minimize the clock period of the resultant circuit. The relevant portions of the CFGs for the Blackjack, Fancy and UAV AutoPilot processes, and the mapping of the CFG operations under the two assignments, are illustrated in Figures 3, 4, and 6.

                                                                  Clock Prd [ns]  Area [mm^2]
Benchmark                     Resources             Assignment    FEST    SIS     × 100
Blackjack (dealer), 8 bit     3 (<,=,>), 2 (+,-)    Assignment 1  50.15   44.95   3.56
                                                    Assignment 2  36.62   34.81   3.83
Fancy, 16 bit                 1 (=), 1 (<), 2 (+)   Assignment 1  52.21   44.78   3.68
                                                    Assignment 2  37.18   30.06   3.69
UAV AutoPilot, 8 bit          2 (<), 2 (+/-)        Assignment 1  57.14   47.34   3.21
                                                    Assignment 2  51.67   42.64   3.23
Vender, 8 bit                 2 (-), 1 (+), 3 (<)   Assignment 1  47.46   44.33   3.99
                                                    Assignment 2  31.69   30.53   4.05
Graphics Controller, 8 bit    2 (<,=,>), 2 (+/-)    Assignment 1  47.92   40.63   7.89
                                                    Assignment 2  31.11   26.20   7.92
Table 3: Results of applying ClkMin on several benchmarks.

Each RT-level circuit generated by the assignments is subjected to technology-dependent delay optimization, including fanout optimization, using the SIS technology mapper [16] and the lib2.genlib standard cell SCMOS 2.0 library [18]. Subsequently, OASIS [17] place and route tools are used to obtain the standard cell layout. Columns Clock Prd and Area report the delay and area of the final implementations. The column under FEST reports an estimate of the delay computed by the RT-level delay estimation tool FEST [25]. The delay of the mapped gate-level circuit is computed using SIS and is reported under column SIS.

The results demonstrate that the RT-level delay estimates are within 15% of the gate-level delay. The results also demonstrate the effectiveness of the assignment algorithm in minimizing the clock period of the resultant circuit. For instance, in the case of Blackjack, assignment 1 produces a circuit with a clock period of 44.95 ns and an area of 3.56 mm^2. In contrast, the circuit produced by the proposed assignment algorithm ClkMin has a clock period of 34.81 ns and an area of 3.83 mm^2. A clock period reduction of 23% could be achieved by ClkMin for a nominal area penalty of 7.5%. On average, a clock period reduction of 26% could be achieved. The average increase in area is 3.5%, because the clock period reduction was not effected by increasing the resources used, but by different assignments to the same resources.

In Table 4, we report delay results using FEST [25] which demonstrate the effect of using rules R1 through R4 of Section 4 in the assignment algorithm ClkMin.

                      No Rules   R2, R3,   R1, R3,   R1, R2,   R1, R2,   R1, R2,
Benchmark             Used       R4        R4        R4        R3        R3, R4
Blackjack (dealer)    50.15      36.62     36.62     36.62     50.15     36.62
Fancy                 52.21      37.18     37.18     37.18     37.18     37.18
UAV AutoPilot         57.14      51.67     53.92     53.92     51.67     51.67
Vender                47.46      31.69     31.69     31.69     47.46     31.69
Graphics Controller   47.92      31.11     47.92     44.70     31.11     31.11
Table 4: The effect of rules R1 through R4 on minimizing the clock period.

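The rule-by-rule comparison in Table 4 can be checked mechanically. The short Python sketch below uses only the delays printed in Table 4; the dictionary layout and key names are hypothetical labels chosen for illustration. For each rule, it lists the benchmarks whose delay is larger in the column that leaves out that rule than in the column that uses all four rules.

```python
# Delays in ns copied from Table 4. "none" is the column with no rules used,
# "all" the column with all four rules, and "R1".."R4" the columns in which
# that single rule is left out. The key names are hypothetical labels.
TABLE4 = {
    "Blackjack": {"none": 50.15, "R1": 36.62, "R2": 36.62,
                  "R3": 36.62, "R4": 50.15, "all": 36.62},
    "Fancy": {"none": 52.21, "R1": 37.18, "R2": 37.18,
              "R3": 37.18, "R4": 37.18, "all": 37.18},
    "UAV AutoPilot": {"none": 57.14, "R1": 51.67, "R2": 53.92,
                      "R3": 53.92, "R4": 51.67, "all": 51.67},
    "Vender": {"none": 47.46, "R1": 31.69, "R2": 31.69,
               "R3": 31.69, "R4": 47.46, "all": 31.69},
    "Graphics Controller": {"none": 47.92, "R1": 31.11, "R2": 47.92,
                            "R3": 44.70, "R4": 31.11, "all": 31.11},
}


def rules_that_matter(table):
    """Map each rule to the benchmarks whose delay grows when it is left out."""
    hurt = {}
    for bench, row in table.items():
        for rule in ("R1", "R2", "R3", "R4"):
            if row[rule] > row["all"]:
                hurt.setdefault(rule, []).append(bench)
    return hurt


for rule, benches in sorted(rules_that_matter(TABLE4).items()):
    print(rule, benches)
```

On the Table 4 data, leaving out R2 or R3 degrades UAV AutoPilot and Graphics Controller, leaving out R4 degrades Blackjack and Vender, and leaving out R1 changes no benchmark.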
The results in the first column report the implementation delay as a result of operation assignment without using any of the rules (Assignment 1 of Table 3). The last column of Table 4 reports the implementation delays when operation assignment is done using all four rules (Assignment 2 of Table 3). Columns two to five show the effect of ignoring rule R1, R2, R3, and R4, respectively, while doing assignment. For example, when rule R2 is ignored, the implementation delays of the resulting assignments are reported under the third column in Table 4. For the graphics controller, ignoring rule R2 gives an implementation with a delay of 47.92 ns, as opposed to the delay of 31.11 ns when rule R2 is used with the other three rules. The experimental results demonstrate that if rules R2, R3 and R4 are not used during operation assignment, sub-optimal assignments result. Rule R1 does not seem to affect the quality of the assignments for the benchmarks considered. A closer investigation of the implementations reveals that, for the benchmarks considered, rule R1 reduces the delay along non-critical paths only.

7 Conclusions

This paper has investigated the effect of resource sharing and assignment on the clock period of circuits synthesized from conditional-intensive behavioral descriptions. Under a fixed set of available resources, different assignments can lead to circuits with significant differences in clock period. An assignment algorithm has been proposed to identify an assignment which leads to a minimal clock period implementation. Experimental results demonstrate the feasibility of synthesizing high-speed circuits, not by using more resources or faster modules, but by effective assignment using a fixed set of resources.

Acknowledgments: We thank T. Misawa, K. Wakabayashi, and W. Wolf for helpful discussions.

References

[1] S. Bhattacharya, S. Dey, and F. Brglez. Clock Period Optimization During Resource Sharing and Assignment. In Proceedings of the 31st ACM/IEEE Design Automation Conference, pages 195-200, June 1994.
[2] S. Bhattacharya, S. Dey, and F. Brglez. Performance Analysis and Optimization of Schedules for Conditional and Loop-Intensive Specifications. In Proceedings of the 31st ACM/IEEE Design Automation Conference, pages 491-496, June 1994.
[3] K. Wakabayashi and T. Yoshimura. A Resource Sharing and Control Synthesis Method for Conditional Branches. In Proceedings of the IEEE International Conference on Computer-Aided Design, pages 62-65, Nov. 1989.
[4] T. Kim, J. W. S. Liu, and C. L. Liu.
A Scheduling Algorithm For Conditional Resource Sharing. In Proceedings of the IEEE International Conference on Computer-Aided Design, pages 84-87, Nov. 1991.
[5] P. G. Paulin and J. P. Knight. Force-Directed Scheduling for the Behavioral Synthesis of ASICs. IEEE Transactions on Computer-Aided Design, Vol. 8(No. 6):661-679, June 1989.
[6] L. Ramachandran and D. D. Gajski. An Algorithm for Component Selection in Performance Optimized Scheduling. In Proceedings of the IEEE International Conference on Computer-Aided Design, pages 92-95, Nov. 1991.
[7] S. C-Y. Huang and W. H. Wolf. Performance-Driven Synthesis in Controller-Datapath Systems. IEEE Transactions on VLSI Systems, Vol. 2(No. 1):68-80, March 1994.
[8] N-S. Woo. A Global, Dynamic Register Allocation and Binding for a Data Path Synthesis System. In Proceedings of the 27th ACM/IEEE Design Automation Conference, pages 505-510, June 1990.
[9] R. A. Bergamaschi, R. Camposano, and M. Payer. Data-Path Synthesis Using Path Analysis. In Proceedings of the 28th ACM/IEEE Design Automation Conference, pages 591-596, June 1991.
[10] R. Jain, M. J. Mlinar, and A. Parker. Area-Time Model for Synthesis of Non-Pipelined Designs. In Proceedings of the IEEE International Conference on Computer-Aided Design, pages 48-51, Nov. 1988.
[11] C. Ramachandran, F. J. Kurdahi, D. D. Gajski, A. C.-H. Wu, and V. Chaiyakul. Accurate Layout Area and Delay Modeling for System Level Design. In Proceedings of the IEEE International Conference on Computer-Aided Design, pages 355-361, Nov. 1992.
[12] B. Gregory, D. MacMillen, and D. Fogg. ISIS: A System for Performance Driven Resource Sharing. In Proceedings of the 29th Design Automation Conference, pages 285-290, June 1992.
[13] E. A. Rundensteiner and D. D. Gajski. Functional Synthesis Using Area and Delay Optimization. In Proceedings of the 29th ACM/IEEE Design Automation Conference, pages 291-296, June 1992.
[14] S. C-Y. Huang and W. H. Wolf. How Datapath Allocation Affects Controller Delay. In International Symposium on System Synthesis, pages 158-163, 1994.
[15] CLSI Users Guide.
[16] E. M. Sentovich, K. J. Singh, C. Moon, H. Savoj, R. K. Brayton, and A. Sangiovanni-Vincentelli. Sequential Circuit Design using Synthesis and Optimization. In Proceedings of the International Conference on Computer Design, pages 328-333, October 1992.
[17] K. Kozminski (ed.). OASIS Users Guide. MCNC, Research Triangle Park, N.C. 27709, 1992.
[18] S. Yang. Logic Synthesis and Optimization Benchmarks, User Guide Version 3.0. In International Workshop on Logic Synthesis, MCNC, Research Triangle Park, NC, May 1991.
[19] S. Bhattacharya, F. Brglez, and S. Dey. Transformations and Resynthesis for Testability of RT-Level Control-Data Path Specifications. IEEE Trans. on VLSI Systems, Vol. 1(No. 3):304-318, Sept. 1993.
[20] A. S. Tanenbaum. Computer Networks. Prentice Hall, Englewood Cliffs, N.J., 1989.
[21] Kenneth Hintz and Daniel Tabak. Microcontrollers: Architecture, Implementation, and Programming. McGraw-Hill, New York, NY 10020, 1992.
[22] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.
[23] C. J. Tseng and D. P. Siewiorek. Automated Synthesis of Data Paths on Digital Systems. IEEE Transactions on Computer-Aided Design, Vol. 5(No. 3):379-395, July 1986.
[24] L. Avra. Allocation and Assignment in High-Level Synthesis for Self-Testable Data Paths. In Proceedings of the International Test Conference, pages 463-472, Oct. 1991.
[25] S. Bhattacharya, F. Brglez, and S. Dey. Fast True Delay Estimation During High Level Synthesis. IEEE Trans. on Computer-Aided Design, Vol. 15(No. 9):1088-1105, Sept. 1996.
[26] L. Stok. False Loops Through Resource Sharing. In Proceedings of the IEEE International Conference on Computer-Aided Design, pages 345-348, Nov. 1992.
[27] S. Bhattacharya. Hardware Synthesis and Analysis of Control-Intensive Designs from High Level Specifications. PhD thesis, Computer Science Dept., Duke University, April 1995.
[28] D. Perry. VHDL. McGraw-Hill, New York, NY 10020, 1989.
[29] S. Dey, A. Raghunathan, N. K. Jha, and K. Wakabayashi. Controller re-specification to minimize switching activity in controller/data path circuits. Technical Report 96-C007-4-5016-1, NEC C&C Research Labs, Princeton, NJ, March 1996.