High-performance asynchronous pipeline circuits


Kenneth Y. Yun*
Department of ECE
UC San Diego

Peter A. Beerel†
EE-Systems Department
USC

Julio Arceo
Department of ECE
UC San Diego

*This research was supported in part by a gift from Intel Corporation.
†This research was funded in part by a Zumberge Research Fund for Assistant Professors at USC and an NSF Career Award.

Abstract

This paper presents design and simulation results of two high-performance asynchronous pipeline circuits. The first circuit is a two-phase micropipeline but uses pseudo-static Svensson-style double edge-triggered D-flip-flops (DETDFF) for data storage in place of traditional transmission gate latches or Sutherland's capture-pass latches. The second circuit is a four-phase micropipeline with burst-mode control circuits. We compare our DETDFF and four-phase implementations of a FIFO buffer with the current state-of-the-art micropipeline implementation using four-phase controllers designed by Day and Woods for the AMULET-2 processor. We implemented Day and Woods's design and both of our designs in the MOSIS 1.2µm CMOS process and simulated them with a 4.6V power supply and at 100°C. Our SPICE simulations show that our DETDFF and four-phase designs have 70% and 30% higher throughput respectively than Day and Woods's design. This higher throughput for the DETDFF design is due to latching the data on both edges of the latch control, removing the need for a reset phase and simplifying the control structures. Our four-phase design, on the other hand, has higher throughput because of the simplified control structures and the removal of the latch enable buffers from the critical path. The four-phase design, though not quite as fast as the DETDFF design, requires much smaller area for data storage.

1 Introduction

Asynchronous designs have been successfully applied to control oriented applications (such as chip interfaces [21, 11], bus controllers [24], cache controllers [16], and network communication controllers [5, 10]), datapath components (such as adders [13, 9], multipliers [7], and dividers [20]), as well as general purpose microprocessors (such as the CALTECH Asynchronous Microprocessor [14], the NSR processor [2], the AMULET project [8], and the SUN Counter-Flow Processor [18]). Asynchronous designs do not require a global clock for synchronization. Rather, they are synchronized using event-driven communication in which transitions on wires act to request the start of a computation and acknowledge its completion. By removing the global clock, asynchronous design has the advantages of absence of clock-skew problems, freedom from worst-case design restrictions, and automatic power-down of unused circuitry.

Micropipelines is a popular design style for building asynchronous circuits, invented by Ivan Sutherland and introduced in his Turing Award Lecture in 1988 [19]. Micropipelines is a building block approach which makes designing deeply pipelined circuits relatively easy and facilitates the reuse of combinational functional units previously designed for use in synchronous circuits. Micropipelines is based on a two-phase, or transition-signaling, event-driven communication protocol in which an event is either a low-to-high or high-to-low transition on a control wire, with no distinction being made between the two. Since traditional latches are level-sensitive, however, traditional designs required two-to-four and four-to-two phase converters, which hinder performance. The AMULET group recently designed a new set of control blocks using four-phase event-driven communication, which eliminates the need for phase converters, thereby significantly improving performance.

This paper discusses the issues involving two high performance micropipeline circuits. In order to make asynchronous circuits faster than synchronous circuits, statistically common cases must be made faster, possibly at the expense of sacrificing rare case performance somewhat, so that the average case performance will be higher than synchronous circuits' worst case performance. In fact, synchronous circuits optimized for common cases (which thus may fail to operate correctly for rare cases) should always be faster than asynchronous circuits that implement the same functions. Thus a well-designed asynchronous circuit should perform nearly as fast as the synchronous circuit optimized for common cases and slow down gracefully for rare cases or a slow environment. Our asynchronous pipeline circuits mimic synchronous circuits when operating at high speeds and yet slow down gracefully and dynamically for a slow environment.

Our first circuit is a two-phase micropipeline with a simple C-element to control each pipe stage. We introduce a simple, but effective, idea of replacing traditional latches (e.g., capture-pass and transparent) in two-phase micropipelines with pseudo-static double edge-triggered D-flip-flops (DETDFF). DETDFFs are fast and compact when designed with two Svensson-style latches placed in parallel [1] and can be easily made pseudo-static. They seamlessly integrate into the two-phase micropipeline methodology without the need for costly two-to-four phase converters. Our second circuit is a four-phase micropipeline controlled by a compact burst-mode circuit.

We designed both a DETDFF FIFO buffer and a four-phase FIFO buffer controlled by burst-mode circuits, as well as the leading (four-phase) FIFO buffer designed by Day and Woods [6], in the MOSIS 1.2µm CMOS process using the same transistor sizing. For our four-phase FIFO, we adjusted the capacitive loading on the control signals that enable storage elements to match Day and Woods's experiments (1pF). For the DETDFF FIFO, however, we optimized the design for 4pF loading because our DETDFFs have four times the input capacitance of the transparent latches used in Day and Woods's design. We simulated all three designs with a 4.6V power supply and at 100°C. Our DETDFF and four-phase designs achieve cycle times of 3.9ns and 4.9ns respectively, while Day and Woods's design achieves a cycle time of 6.7ns. Thus, our designs are 70% and 30% faster than their design.

We assess that the higher throughput for the DETDFF design is due to latching the data on both edges of the latch control, removing the need for a reset phase and simplifying the control structures. Our four-phase design, on the other hand, has higher throughput because of the simplified control structures and the removal of the latch enable buffers from the critical path.

When processing logic is added between the pipeline stages of a DETDFF pipeline circuit, the minimum cycle time increases by the processing logic delay only; there is no extra overhead. In our four-phase circuits, on the other hand, the minimum cycle time increases by twice the processing logic delay; however, they require smaller area for data storage (transparent latches instead of DETDFFs). Therefore, our four-phase circuits are useful for compact FIFO applications, whereas our DETDFF pipeline circuits are attractive for general purpose pipelined datapaths.

Neither of our pipeline circuit designs suffers from the restriction that only half of the pipeline stages can be occupied when full. In contrast to Day and Woods's design, both of our designs employ data storage schemes (blocking latches) that hold data without releasing it except during actual transfers, an attractive feature for low-power designs. For example, in our four-phase design, data never propagate beyond the next pipe stage when the latches are open, because latch enable signals in successive stages are designed to be non-overlapping.

2 Background: Micropipelines

Pipelining is a standard way of decomposing an operation into concurrently operating stages to increase throughput at a moderate increase in area. A wide variety of applications, such as digital filters, video compression, and general purpose microprocessors, can all be decomposed into pipeline structures. The simplest example is a FIFO buffer implemented as a 1-dimensional pipeline which simply passes data from its input to its output, ensuring that the order is first-in first-out.

Pipelines can be implemented both synchronously and asynchronously. In a synchronous pipeline the communication of data between stages is regulated by the global clock. It is assumed that each stage takes no longer than the period of the clock, and data is transferred between consecutive stages simultaneously. In an asynchronous pipeline the communication of data between the stages is regulated by local communication between stages. When one stage has data which it would like to send to a neighboring stage, it sends a request to that stage. If that stage can accept new data, it accepts the new data and returns an acknowledgment.

This section reviews micropipelines, a building block approach for designing asynchronous pipelines [19]. The key aspects of the micropipeline strategy are the bundling constraint and the request/acknowledge communication protocol that governs the transfer of data between stages.

2.1 Bundled Data Constraint

The transfer of data between stages in a micropipeline is based on a data-bundling scheme. The interface between stage i and stage i+1 includes a bundle of data wires, which carries information, and a request wire. When the request wire rises and is sensed at stage i+1, stage i+1 assumes that new valid data is available at its inputs.

An advantage of this approach is that it facilitates use of combinational datapath blocks that may have been previously designed for use in synchronous circuits. Namely, the datapath block can easily be placed between two communicating micropipeline stages. To do this, a simple delay line must be built which models the worst-case delay of the combinational logic. This delay line is then incorporated into the request line between the two communicating stages. The request is sent after the data at the input of the combinational block is valid. Because of the delay line, it is sensed at the receiver only after the data is guaranteed to be stable at the receiver's input.

The principal disadvantage of this approach is that it requires worst-case design of the delay line, preventing the circuit from achieving superior average-case delay. It is important to note, however, that micropipelines can also use datapath blocks that provide completion detection (designed specially for use in asynchronous circuits). For example, an asynchronous ALU [9] and an asynchronous divider [20] have demonstrated significantly smaller average-case delays than the worst-case delays of comparable synchronous datapath blocks.

2.2 Signaling Conventions

Numerous implementations of this request/acknowledge communication have been suggested and analyzed. As depicted in figure 1, Sutherland's original paper [19] suggests an implementation in which neighboring stages communicate by two wires, one wire for request and one wire for acknowledgment. The request wire is labeled Rout at the output of stage i (the source) and Rin at the input to stage i+1 (the sink). Similarly, the acknowledge wire is labeled Ain at the input side of stage i+1 (the source of the acknowledgment) and Aout at the output side of stage i (the sink). The Sutherland paper suggests a two-phase handshaking protocol in which the low-to-high and the high-to-low transitions of any control wire, both referred to as events, have the same meaning. For example, both Rin events indicate that new data is valid at the input to the stage.

Figure 1: An asynchronous pipeline (stages i−1, i, and i+1, each with Rin, Ain, Rout, Aout, and Data connections).

Figure 2: The two-phase signaling protocol (waveforms for Data in, Rin, Ain, Rout, Aout, and Data out).

As illustrated in figure 2, the protocol of a single stage of a 1-dimensional micropipeline FIFO buffer using transparent latches is as follows. First, input Rin rises, which indicates that new data is available at the stage's input. Assuming the stage is currently empty (i.e., the latch is open), the input data can be latched. After latching the data, the stage raises Ain, telling the previous stage that it no longer needs the data. If the current stage has no associated processing logic, the input data propagates through the latch very quickly and the stage can raise Rout, indicating to the subsequent stage that it has new input data available. Otherwise the current stage should wait until after it properly sets up its output data before raising Rout. Some time after Rout rises, the subsequent stage will raise Aout to indicate that it has consumed (i.e., latched) the output data. Concurrently, the previous stage can lower Rin, indicating that subsequent data is available, and the second half of the protocol can begin. Notice that new input data is latched only after it is available AND any previously stored data has been consumed by a subsequent stage, thereby ensuring that no data is lost. This event-ANDing can be implemented efficiently using a Muller C-element.

The principal difficulty in building a fast implementation of this protocol occurs when using standard latch designs with level-sensitive enables. The integration of such latches requires building a two-to-four-phase interface around the level-sensitive enables. A simple implementation using an xor and a toggle element is illustrated in figure 3 [8, 17].
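The event-ANDing performed by a Muller C-element can be sketched behaviorally in a few lines of Python. This is only an illustration: the class name `CElement`, the helper `stage_control`, and the choice of feeding the inverted Aout to the C-element (the arrangement used in Sutherland's micropipeline control) are assumptions made for the example, not details taken from the circuits evaluated in this paper.

```python
class CElement:
    """Behavioral Muller C-element: the output copies the inputs
    when they agree and holds its previous value when they differ."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:          # both inputs agree: output follows them
            self.out = a
        return self.out     # otherwise the old value is held


def stage_control(c_element, rin, aout):
    """Two-phase stage control (assumed wiring): a new event is passed on
    only when a request event has arrived AND the previous datum has been
    acknowledged, which is modeled by inverting Aout."""
    return c_element.update(rin, 1 - aout)


c = CElement()
print(stage_control(c, rin=1, aout=0))  # request event, stage empty -> output toggles
print(stage_control(c, rin=0, aout=0))  # next request, but no acknowledge yet -> holds
print(stage_control(c, rin=0, aout=1))  # acknowledge event arrives -> output toggles
```

The second call shows the property that prevents data loss: a new request event alone does not change the output until the earlier datum has been acknowledged.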

Figure 3: Simplified micropipeline stage based on transparent latches and a two-to-four-phase interface.

Initially, all wires except for the latch control are low and the latches are transparent. An operation begins when Rin rises, causing the C-element output to rise. This event propagates through the xor gate and causes En to rise, closing the latches. Meanwhile, the toggle element detects the transition of the xor gate and changes the dot output, which is assumed to occur after the latches are safely closed. This event drives Ain high, indicating to the previous stage that the input data is no longer needed, and also drives Rout high, indicating that the output data is now ready for further processing. When the next stage is done with the data, it changes Aout, which propagates through the xor gate and opens the latches. The toggle element detects this event and changes the blank output, which primes the C-element for the next Rin event (Rin falling). Notice that the xor gate acts to merge the Rin and Aout events, forming a two-to-four-phase converter, and the toggle acts to separate the events that open and close the latches, forming the four-to-two-phase converter. The cycle time of this two-phase design includes the delay overhead of both the xor and toggle elements, limiting its performance.

To improve the performance, the AMULET group suggested a number of optimizations [17, 6]. The fastest circuits they have developed abandon the two-phase paradigm and adopt the four-phase signaling depicted in figure 4. In four-phase (or level-sensitive) signaling the rising and falling transitions of the interface signals are distinguished. The rising transitions are the active transitions that indicate data valid and data consumed, and the falling transitions reset the control wires for the next communication.

Figure 4: The four-phase signaling protocol (waveforms for Data in, Rin, Ain, Rout, Aout, and Data out).

To implement four-phase micropipeline control circuits, careful consideration must be given to the placement of the reset phase of the request and acknowledge control lines. The fastest four-phase micropipeline control circuit for FIFO applications that the AMULET group developed is the semi-decoupled circuit depicted in figure 5.

Figure 5: Generalized C-element notation (top) and a semi-decoupled four-phase micropipeline stage (bottom).

3 Two-Phase Pipeline Circuit

We can build a simple 2-phase asynchronous pipeline circuit using double edge-triggered D-flip-flops as storage elements and a C-element to control each stage of the pipeline, as shown in figure 6.

This pipeline circuit is similar to Sutherland's micropipeline, except that it uses double edge-triggered flip-flops in place of capture-pass latches and relies on a simple set of timing constraints for correct operation.

To make this circuit simpler to understand, assume negligible delays on latch control ("clock") buffers for DETDFFs for now. Consider stage i whose inputs are Rin and Aout and whose outputs are Ain and Rout. When its left neighbor toggles Rin, signaling that the data Din is valid, stage i's pipeline control toggles its output Ain (assuming that the previous request to the right neighbor has been acknowledged). Toggling of the control C-element acknowledges the receipt of data to its left neighbor, latches the data in stage i's storage (DETDFFs), and enables Rout to toggle after a bundling delay. The bundling delay is necessary to provide sufficient data setup time for stage i+1, i.e., Dout of stage i must be valid before stage i+1's latch control toggles.

In practice, the latch control buffers incur significant delays (1.9ns delay to drive 32 DETDFFs in our simulation). Thus, the data can be delayed accordingly. In that case, toggling of the request line means that the data will be available at the next stage after some delay (roughly the same as the latch control buffer delay).

In order for this pipeline circuit to function correctly, two timing constraints must be met:

• Data setup time
  The bundling delay ($t_d$) must be long enough to meet the data setup time ($t_{su}$):

  $$t_{d,i} + t_{R \to A',i+1} + t_{buf,i+1} > t_{buf,i} + t_{ck \to Q,i} + t_{logic,i} + t_{su,i+1}$$

  Thus

  $$t_{d,i} > t_{ck \to Q,i} + t_{logic,i} + t_{su,i+1} + (t_{buf,i} - t_{buf,i+1}) - t_{R \to A',i+1}.$$

  For carefully laid out data paths, latch control buffers would have nearly identical delays, so $t_{buf,i} \cong t_{buf,i+1}$. Then the minimum bundling constraint would be

  $$t_{ck \to Q,i} + t_{logic,i} + t_{su,i+1} - t_{R \to A',i+1}.$$

• Data hold time
  The C-element delay ($t_{A' \to A'}$) must be long enough to meet the data hold time ($t_h$):

  $$t_{A' \to A',i} + t_{buf,i} + t_{ck \to Q,i} + t_{logic,i} > t_{buf,i+1} + t_{h,i+1}.$$

  This inequality is trivially satisfied because $t_{buf,i} \cong t_{buf,i+1}$ and $t_{ck \to Q,i} > t_{h,i+1}$ in general.

The minimum cycle time of this pipeline is the minimum time interval between successive togglings of a request signal:

$$t_{d,\min} + t_{R \to A'} + t_{A' \to A'}$$

which simplifies to

$$t_{ck \to Q} + t_{logic} + t_{su} + t_{A' \to A'}.$$

Note that $t_{ck \to Q} + t_{logic} + t_{su}$ is the minimum clock period of the equivalent synchronous pipeline circuit; the asynchrony costs just one C-element delay ($t_{A' \to A'}$). Another important point to note is that there is no additional delay overhead for pipelines with processing logic. The minimum cycle time of Day and Woods's semi-decoupled design includes twice the processing logic delay (because resetting of the acknowledge signal requires the next stage to complete its processing). In order to circumvent this problem, the AMULET group developed another design called the fully-decoupled latch control circuit, which avoids the pitfall of the semi-decoupled control but slows down the circuit with no processing logic (i.e., FIFO). Our DETDFF circuit, however, has the same low overhead regardless of the amount of processing logic.

3.1 Double Edge-Triggered D-Flip-Flop Implementation

Our implementation of the pseudo-static double edge-triggered D-FF is shown in figure 7. It is a staticized version of a true single-phase DETDFF described in [1]. The top half of the DETDFF is a rising edge-triggered DFF (RETDFF), and the bottom half a falling edge-triggered DFF (FETDFF). The outputs of these sections are dotted to form a DETDFF. When φ is high, the FETDFF section is cut off and Q' is driven by the RETDFF. Conversely, when φ is low, the RETDFF is cut off and Q' is driven by the FETDFF.

The RETDFF functions as follows. Input D must stabilize to 0 or 1 before φ rises, and the rising transition of φ latches the D value.
If D stabilizes to 1 before φ rises, N1 turns on and node x is pulled down, so the N-stack of the middle inverting stage is turned off. Because φ is low, node y is pulled up. When φ rises, the N-stack of the right inverting stage is turned on, which pulls down Q' and thus raises Q. If D stabilizes to 0 before φ rises, the P-stack of the left inverting stage is turned on, pulling up node x. When φ rises, P3 turns off and the N-stack of the middle inverting stage turns on. This causes node y to be pulled down, which in turn turns on P4 and turns off the N-stack of the right inverting stage. This causes Q' to rise and Q to fall. Once φ rises, switching D has no effect on the output.

Figure 6: 2-phase pipeline circuits with double edge-triggered D-FFs (stages i−1 through i+2, annotated with the delays t_d, t_ck→Q, t_R→A', t_A'→A', t_buf, t_logic, and t_su).

Figure 7: A true single-phase pseudo-static double edge-triggered D-FF (transistors N1 to N5 and P1 to P6, internal nodes x, y, and z, output Q'; L = 2λ for all except weak transistors).

In Afghahi and Yuan's original design, once φ rises, nodes x and y float, so it must rely on the capacitance on those nodes to hold the charge. However, DETDFFs used in asynchronous pipelines cannot depend on the capacitance to hold the logic level, because the next transition of φ may arrive at an arbitrary time. Our design uses two weak PMOS transistors to maintain the voltage level when φ is high. When Q' becomes low, y is pulled up by P6, which keeps the N-stack of the right inverting stage turned on, keeping Q' low. On the other hand, to keep Q' high after it becomes high, y must remain low. Our circuit accomplishes this by turning on P5 when y becomes low, which keeps the N-stack of the middle inverting stage turned on. The FETDFF functions similarly as φ switches from high to low.

The DETDFF is formed by dotting the outputs of the RETDFF and FETDFF sections. Because the dotted output is fed back to the middle inverting stages of both sections, we must check that the active section causes no side effect on the inactive section. It turns out that our circuit topology prevents the inactive section from turning on inadvertently by the fed-back Q'. For example, when φ is high, the RETDFF is the active section driving Q'. Clearly, Q' = 0 has no effect on the FETDFF section. Neither does Q' = 1, because node z is already pulled down due to φ being high.
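At the logic level, the behavior described above reduces to "capture D on every transition of φ, hold otherwise". The following Python sketch models only that behavior; it ignores setup and hold times, the weak keeper transistors, and the internal nodes x, y, and z, and the class and method names are hypothetical.

```python
class DoubleEdgeTriggeredDFF:
    """Behavioral DETDFF: Q takes the value of D on every clock
    transition (rising or falling) and is held between transitions."""

    def __init__(self, q=0, clk=0):
        self.q = q
        self.clk = clk  # last observed level of the latch control (phi)

    def tick(self, clk, d):
        if clk != self.clk:   # any edge of phi, in either direction
            self.q = d        # capture the input
            self.clk = clk
        return self.q         # no edge: D is ignored, Q is held


ff = DoubleEdgeTriggeredDFF()
print(ff.tick(clk=1, d=1))  # rising edge captures 1
print(ff.tick(clk=1, d=0))  # no edge, so Q stays 1
print(ff.tick(clk=0, d=0))  # falling edge captures 0
```

Because both edges store data, a two-phase control event maps directly onto one latching operation, which is why no reset phase or phase converter is needed.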

3.2 Simulation Results

We implemented 5 stages of our DET pipeline circuit and Day and Woods's semi-decoupled pipeline circuit in the MOSIS 1.2µm CMOS process. We used consistent transistor sizing for both designs: W/L of the minimum size transistors for control circuits is 45λ/2λ for PMOS and 20λ/2λ for NMOS. In addition, for both designs, we adjusted the capacitive loading on the latch control signals to drive 32-bit latches/flip-flops. In Day and Woods's design, the capacitive loading of 32-bit latches corresponds to 1pF. Because the capacitive loading of the latch control signal in our DETDFF is 4 times as much as that of the n-type true single-phase latches used in Day and Woods's design (8 vs 2 transistors), we designed our latch control buffer to drive 4pF in each stage.

We simulated both designs using a 4.6V power supply at 100°C. The Mentor Graphics Accusim analog simulator was used for simulation. The test setup shown in figure 8 was used for both circuits.

Figure 8: Test configuration (stages 1 through 5 in series, with Rin, Din, and Ain at the input and Rout, Dout, and Aout at the output).

The simulation traces are shown in figure 9. The shortest cycle time simulated (i.e., the maximum throughput) was 3.9ns for our design and 6.7ns for Day and Woods's (see table 1). Thus our design is 70% faster than Day and Woods's. The worst-case t_ck→Q was 1.7ns for the double edge-triggered D-FF; t_R→A' and t_A'→A' were 1.0ns and 1.3ns respectively. The bundling delay needed was t_d = 1.6ns.

There are two key factors that contributed to the difference in cycle times. The first obvious factor is that the DETDFF is capable of latching data on both edges of its enable signal. The second factor is that the acknowledgment of the receipt of data is made very early: as soon as the request from the left neighbor is detected (and, of course, the last request to the right neighbor has been acknowledged). We can afford to do that because the previous pipe stage can generate the next data no faster than the delay it takes for the acknowledgment signal to propagate through the latch enable buffer.

To make the comparison as fair as possible, we can remove the latch enable buffer delays from the critical path of Day and Woods's design. As illustrated in figure 10, the minimum cycle time of their design would then be shorter by somewhat less than 1.1ns (see footnote 1). Even so, our design would still be 50% faster than Day and Woods's design.

Figure 9: 2-phase DET (top) and Day and Woods's semi-decoupled 4-phase (bottom) micropipeline circuit simulations. Top panel, "DET Micropipeline Circuit": traces Dout, ck4, nAout, Rout, Din, ck3, nAin, and Rin versus time (3.8e-08 to 5.8e-08 sec). Bottom panel, "Day and Woods's 4-Phase Micropipeline": traces X, Dout, nAout, Rout, Din, nAin, and Rin versus time (8.2e-08 to 1.02e-07 sec).

Transition            MOSIS 1.2µm    AMULET-2
Rin↑  → Rout↑         2.4ns          4.0ns
Rout↑ → Aout↓         1.5ns          3.2ns
Aout↓ → Ain↑          2.3ns          3.8ns
Ain↑  → Rin↑          0.5ns          1.7ns
Total                 6.7ns          12.7ns

Table 1: Cycle time delay breakdown of Day and Woods's 4-phase micropipeline with the semi-decoupled latch control: the ratio of the delay in our implementation of Day and Woods's circuit to the delay published by Day and Woods is nearly the same for every component of the cycle time, which indicates that our experiments were conducted accurately.
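The reported cycle times can be cross-checked with a little arithmetic. The sketch below plugs the measured DETDFF delays into the cycle-time expression t_d + t_R→A' + t_A'→A' from Section 3 and sums the MOSIS 1.2µm components of table 1; the function and variable names are illustrative assumptions, and only numbers quoted in the text are used.

```python
def min_cycle_time(t_d, t_r_to_a, t_a_to_a):
    """Minimum two-phase cycle time in ns: bundling delay plus the
    request-to-acknowledge and C-element delays."""
    return t_d + t_r_to_a + t_a_to_a


def percent_faster(t_reference, t_new):
    """Throughput advantage of cycle time t_new over t_reference, in percent."""
    return (t_reference / t_new - 1.0) * 100.0


# Measured DETDFF pipeline delays quoted in Section 3.2 (ns).
det_cycle = min_cycle_time(t_d=1.6, t_r_to_a=1.0, t_a_to_a=1.3)

# Table 1, MOSIS 1.2um column: the four components of Day and Woods's cycle (ns).
day_woods_components = [2.4, 1.5, 2.3, 0.5]
day_woods_cycle = sum(day_woods_components)

print(f"DETDFF cycle time:        {det_cycle:.1f} ns")
print(f"Day and Woods cycle time: {day_woods_cycle:.1f} ns")
print(f"Throughput advantage:     {percent_faster(day_woods_cycle, det_cycle):.0f}%")
```

The component delays add up to the quoted 3.9ns and 6.7ns cycle times, and their ratio gives an advantage of roughly 72%, consistent with the 70% figure stated in the text.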

Figure 10: Semi-decoupled pipeline circuit: latch enable buffer in the critical path (see figure 5 for the circuit diagram). The critical path through stages i−1, i, and i+1 is annotated with delays of 0.5ns and 0.6ns.

Footnote 1: A minimum size inverter still needs to be added to have the correct signal polarity.

4 Burst-Mode 4-phase Pipeline Circuit

The specification and an alternative implementation for the 4-phase pipeline circuit described in this section were originally presented in [24]. The block diagram of the pipeline circuit is shown in figure 11.

The steady-state operation of this pipeline circuit is as follows. Again, to make this circuit simpler to understand, assume negligible delays on latch enable buffers for now. Consider stage i (see figure 12) whose inputs are Rin and Aout and whose outputs are Ain and Rout. When stage i receives a request (Rin being asserted) from its left neighbor (stage i−1), it opens its transparent latches and acknowledges stage i−1 by asserting Ain. When stage i−1 negates the request, stage i latches the new data and asserts Rout (if the previous data transfer to stage i+1 has been completed), signaling stage i+1 that the new data is ready. When stage i+1 acknowledges, stage i negates Rout (enabling stage i+1 to latch the data) and negates Ain (completing the current data transfer cycle with stage i−1).
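The steady-state sequence for one stage can be written as a small three-state machine. The Python sketch below is a simplified behavioral reading of the prose description only: it ignores latch enable buffers and all gate delays, and the class name, the state numbering, and the `step` interface are assumptions made for illustration rather than the synthesized controller.

```python
class FourPhaseStage:
    """Simplified four-phase stage controller.
    State 0: idle, waiting for Rin to be asserted.
    State 1: Ain asserted, latches open; waiting for Rin to be negated
             and for the previous transfer to the right to finish (Aout low).
    State 2: Rout asserted; waiting for Aout to be asserted."""

    def __init__(self):
        self.state = 0
        self.ain = 0
        self.rout = 0

    def step(self, rin, aout):
        if self.state == 0 and rin == 1:
            self.ain = 1                  # acknowledge the left neighbor
            self.state = 1
        elif self.state == 1 and rin == 0 and aout == 0:
            self.rout = 1                 # data latched, request to the right
            self.state = 2
        elif self.state == 2 and aout == 1:
            self.rout = 0                 # lets the right neighbor latch
            self.ain = 0                  # completes the cycle with the left
            self.state = 0
        return self.ain, self.rout


stage = FourPhaseStage()
for rin, aout in [(1, 0), (0, 0), (0, 1)]:
    print((rin, aout), "->", stage.step(rin, aout))
```

In this model Rout is asserted only after Rin has been negated, which mirrors the sequencing that keeps the request signals of successive stages from overlapping.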

Figure 11: 4-phase pipeline circuit (5 stages); the block diagram is annotated with the transitions of request signals R1 to R6 and acknowledge signals A1 to A6.

In practice, latch enable buffers incur non-negligible delays (t_buf), so we must ensure that the data is held for at least t_buf longer after Rin is negated. This requirement is easily met in our circuit because the active phases of request signals are designed to be non-overlapping and t_buf is the same for every stage (see figure 12).

Many asynchronous designs attempt to minimize the cycle time by making the control signal transitions as concurrent as possible. In fact, those designs do minimize the number of external signal transitions per cycle. However, making circuits highly concurrent sometimes means that internal state variables are required. The semi-decoupled latch control circuit used in AMULET-2, designed by Day and Woods, only requires 4 external signal transitions per critical cycle, as shown in figure 10: Rin+ → Rout+ → Aout− → Ain+ → Rin+. However, one of the external signal transitions requires an extra internal state variable transition before the external signal can change. In addition, the inverter delays required to generate actual latch enable signals are in the critical paths. Clearly, this was done to make circuit operation as speed-independent as possible; nevertheless, these extra delays increase the cycle time.

Our circuit, although it requires 6 external signal transitions per cycle as shown in figure 11, has no hidden internal signal transitions. In fact, our circuit reduces the cycle time by reducing concurrency: i.e., Rin is reset before Rout is asserted, which simplifies both Rout and Ain logic. In addition, sequentializing the Rin− and Rout+ transitions leads to non-overlapping request signals (and thus non-overlapping latch enable signals), which makes the blocking latch scheme possible. Furthermore, the buffer delay from the Rin signal to the corresponding latch enable signal is not in the critical path, as illustrated in figure 12.

Figure 12: 4-phase pipeline circuit timing (waveforms for Rin, Rout, Ain, Aout, the latch enables LEi and LEi+1, Din, and Dout of stage i, with t_buf delays and the points where data is latched marked).

Although our four-phase circuit is compact and fast for FIFO applications, it has the same pitfall as Day and Woods's semi-decoupled circuit, i.e., when processing logic is added, the minimum cycle time increases by twice the processing logic delay.

4.1 Control Circuit Design

The specification of the control circuit for each stage is shown in figure 13, both as a signal transition graph [4] and as an extended burst-mode state diagram [23, 22].

Figure 13: 4-phase pipeline controller specification (signal transition graph and extended burst-mode state diagram with states 0, 1, and 2).

Background: XBM specification

Figure 13 describes an extended burst-mode (XBM) state machine having 2 inputs (Rin, Aout) and 2 outputs (Ain, Rout). Signals ending with + or − are terminating signals; the ones ending with ∗ are directed don't cares. If a state transition is labeled with a directed don't care a∗, then the following state transition must be labeled with a∗ or a+ or a−. A terminating signal a+ denotes a 0 → 1 transition of a if a was initially 0, and no transition at all if a was initially 1. A sequence of state transitions labeled with a∗ and terminated with a+ represents a single 0 → 1 transition of a at any point in the sequence. A terminating signal not immediately preceded by a directed don't care represents a compulsory transition.

An input burst is a non-empty set of input edges (terminating or directed don't care), at least one of which must be a compulsory transition. An output burst consists of a possibly empty set of output edges. In a given state, when all the specified terminating edges in the input burst have arrived, the machine generates the corresponding output burst and moves to a new state. Specified edges in the input burst may arrive in arbitrary temporal order. Outputs may be generated in any order, but the next set of compulsory edges from the next input burst may not arrive until the machine has stabilized.

Synthesis

We synthesized the above 4-phase micropipeline control circuit using the GC-based synthesis method [25]. The synthesis algorithm assumes that the target implementation is a pseudo-static asymmetric CMOS complex gate, known as a generalized C-element [12, 3, 15]. The synthesis procedure consists of two steps: state assignment and logic implementation. The state assignment step ensures that every transition specified in the specification is function-hazard-free. Each specification state is assigned to a layer; compatible layers are merged to minimize the number of layers; finally, layers are encoded so that every specified transition between the layers is free of critical races. The logic implementation step finds non-overlapping covers for the set and reset regions for each output and maps the minimized set and reset logic to N and P stacks (or vice versa) of a CMOS gate followed by a sustainer. For example, the set and reset functions for output Ain are:

set = Rin Rout
reset = Aout Rout

This specification does not have the complete state

coding property [4]. AoutRinAinRout = 1010 is reachable in both states 1 and 2 of the burst-mode specification, but the next value of Ain when AoutRinAinRout = 1010 is different in states 1 and 2. The synthesis tool inserts a state variable, as shown in figure 14, to make the specification race-free during the sequential synthesis step. However, the logic synthesis step for the GC-based implementation eliminates Ain's dependency on the state variable because AoutRinAinRout = 1010 is unreachable for Ain's cone of logic in state 2 of the burst-mode specification, i.e., Ain falls before Rout falls in state 2 because of the feedback delay on Rout. Furthermore, Rout does not depend on the state variable. Thus we selected an implementation which does not require a state variable at all. The resulting circuit diagram is shown in figure 15.

Figure 14: 4-phase pipeline controller specification with CSC.
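The complete state coding conflict described above can be detected mechanically: group the reachable signal codes by value and flag any code for which the next value of an output differs between specification states. The following is a minimal sketch under stated assumptions. The function name and the table of reachable codes are hypothetical; only the code 1010 in states 1 and 2 comes from the text, and which of the two states lowers Ain is assumed for illustration.

```python
from collections import defaultdict


def csc_conflicts(reachable, output):
    """Return {code: {state: next_value}} for codes that violate CSC on `output`.

    `reachable` is a list of (state, code, next_values) entries, where
    next_values maps an output name to its required next value.
    """
    by_code = defaultdict(dict)
    for state, code, next_values in reachable:
        by_code[code][state] = next_values[output]
    # A code conflicts when two states demand different next values for it.
    return {
        code: states
        for code, states in by_code.items()
        if len(set(states.values())) > 1
    }


# Signal order in each code: Aout, Rin, Ain, Rout.
REACHABLE = [
    (1, "1010", {"Ain": 1}),  # assumed: Ain holds its value in state 1
    (2, "1010", {"Ain": 0}),  # assumed: Ain must fall in state 2
    (0, "0000", {"Ain": 0}),  # hypothetical entry with no conflict
]

conflicts = csc_conflicts(REACHABLE, "Ain")
```

A state variable, or a timing argument such as the one made above for Ain's cone of logic, is what removes the flagged code from the set of conflicts.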

Figure 15: 4-phase burst-mode control circuit implementation.

The signal transition graph including all inverted signals used in the implementation is shown in figure 16. The dotted arcs represent timing constraints for inverters, which are satisfied trivially. Arcs 2 and 3 represent the fundamental mode timing constraint: e.g., the environment must wait for Rout to fall before asserting Rin after it detects that Ain has risen. Both of these constraints are satisfied trivially as well. Arc 1 is a timing constraint necessary to ensure that output Ain rises before the feedback signal Rout falls² when enabled by Aout falling in state 2 of the burst-mode state diagram. This constraint is also satisfied without much difficulty.
Figure 16: 4-phase control STG with inverted signals inserted.

² It is the feedback delay insertion requirement to avoid essential delay hazards in the 3D synthesis.

4.2 Simulation Results

We implemented and simulated 5 stages of this pipeline circuit and of Day and Woods's pipeline circuit using the same parameters as before. The transistor sizing used for generalized C-elements was consistent for both designs. For both designs, we used the n-type true single-phase transparent latches used in Day and Woods's design and adjusted the capacitive loading on the control signals that enable the transparent latches to match Day and Woods's experiments (1pF). Again, the Mentor Graphics Accusim analog simulator was used for simulation. The simulation traces for our design and Day and Woods's design are shown in figure 17. The minimum cycle time is 4.9ns, which is 1.8ns shorter than that of Day and Woods's semi-decoupled circuit (6.7ns).

Two factors contributed to this difference. The first factor is that Day and Woods's circuit requires a state variable insertion and two inverter delays (for latch enable signals) in the critical paths. As shown in figure 10 and in table 2, the delay Rin+ → Rout+ requires an intervening state variable transition. The delays Rout+ → Aout− and Aout− → Ain− include an extra 0.5ns and 0.6ns respectively for generating latch enable signals. The second factor is the compactness of our control circuit. Even if the latch enable buffer delays are removed from the critical path of Day and Woods's semi-decoupled design as discussed previously, our four-phase design would still be 15% faster than their design.

Burst-mode                  Day and Woods's
Rin↓ → Ain↓     1.0ns       Rin↑ → Rout↑    2.4ns
Ain↓ → Rin↑     0.7ns
Rin↑ → Rout↓    0.6ns
Rout↓ → Aout↓   1.0ns       Rout↑ → Aout↓   1.5ns
Aout↓ → Ain↑    1.1ns       Aout↓ → Ain↑    2.3ns
Ain↑ → Rin↓     0.5ns       Ain↑ → Rin↑     0.5ns
Total           4.9ns       Total           6.7ns

Table 2: Cycle time delay breakdown comparison.

5 Conclusion

This paper presents two high-performance asynchronous pipeline circuits which are significantly faster than the current state-of-the-art four-phase micropipelines.

The DETDFF design has very low delay overhead with or without processing logic. The principal source of the speed-up is the application of pseudo-static double edge-triggered D-flip-flops, which update data on both transitions of the latch control signal. As a result, no reset phase is needed and efficient two-phase control structures can be used. The main disadvantages of the pseudo-static DETDFFs are larger area and heavily loaded latch control lines. Each DETDFF is approximately twice the size of a standard edge-triggered flip-flop, and the capacitance on the DETDFF control lines is approximately four times greater than on standard level-sensitive latch control lines, yielding four times more energy consumption in the control lines per transition. The overall effect on power consumption, however, is not clear because the DETDFFs are switched half as often as level-sensitive designs and they are naturally blocking [17], preventing glitches from propagating through multiple datapath stages.

The four-phase design, when it is used as a FIFO, is also considerably faster than the existing designs due to the compactness of the control circuit. However, it has the drawback of a longer cycle time when processing logic is added. More investigation is needed to determine whether more efficient fully-decoupled four-phase pipeline circuits can be designed.
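As a rough check of the control-line energy argument in the conclusion, assume (these are illustrative assumptions, not figures from the simulations) that a control line of capacitance C at supply voltage V dissipates about ½CV² per transition, that a level-sensitive latch enable makes two transitions per data item, and that a DETDFF control line, with four times the capacitance, makes one:

```latex
\begin{align*}
E_{\text{latch}}  &\approx 2 \cdot \tfrac{1}{2} C V^2 = C V^2, \\
E_{\text{DETDFF}} &\approx 1 \cdot \tfrac{1}{2} (4C) V^2 = 2 C V^2, \\
\frac{E_{\text{DETDFF}}}{E_{\text{latch}}} &\approx 2 .
\end{align*}
```

Under these assumptions the control lines alone would cost about twice the energy per data item. The estimate ignores glitch suppression in the datapath, which could offset part of that cost, so it does not settle the overall power comparison.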

[Figure 17 panels: simulated waveforms versus TIME (sec), each trace on a vertical scale of about 0 to 5. Top panel, "Burst-Mode 4-Phase Micropipeline Circuit": LE4, Dout, nAout, nRout, LE3, Din, nAin, nRin, from 2.20e-08 to 3.40e-08. Bottom panel, "Day and Woods's 4-Phase Micropipeline": X, Dout, nAout, Rout, Din, nAin, Rin, from 8.200e-08 to 1.020e-07.]

Figure 17: Burst-mode 4-phase (top) and Day and Woods's semi-decoupled 4-phase (bottom) micropipeline circuit simulations.
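The totals in table 2 and the comparisons drawn from them in section 4.2 can be re-derived with a few lines of arithmetic. The sketch below uses only the delays listed in table 2 and the 0.5ns and 0.6ns latch enable delays quoted in the text. The variable names are illustrative, and reading the 15% figure as a ratio of adjusted cycle times is an assumption.

```python
# Delays from table 2 in picoseconds (integers avoid floating-point rounding).
BURST_MODE_PS = [1000, 700, 600, 1000, 1100, 500]
DAY_WOODS_PS = [2400, 1500, 2300, 500]

# Extra latch enable delays quoted in section 4.2 for Day and Woods's design.
LATCH_ENABLE_EXTRA_PS = [500, 600]

burst_total_ps = sum(BURST_MODE_PS)
day_woods_total_ps = sum(DAY_WOODS_PS)
gap_ps = day_woods_total_ps - burst_total_ps

# Assumed reading of the 15% claim: remove the latch enable delays from
# Day and Woods's cycle, then compare the remaining cycle times.
day_woods_adjusted_ps = day_woods_total_ps - sum(LATCH_ENABLE_EXTRA_PS)
remaining_advantage_pct = (
    100 * (day_woods_adjusted_ps - burst_total_ps) / burst_total_ps
)

print(f"burst-mode cycle:       {burst_total_ps / 1000:.1f} ns")
print(f"Day and Woods's cycle:  {day_woods_total_ps / 1000:.1f} ns")
print(f"difference:             {gap_ps / 1000:.1f} ns")
print(f"advantage after adjust: {remaining_advantage_pct:.1f} %")
```

Under this reading the remaining advantage comes out a little above 14%, consistent with the rounded 15% stated in the text.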

References

[1] M. Afghahi and J. Yuan. Double edge-triggered D-flip-flops for high-speed CMOS circuits. IEEE Journal of Solid-State Circuits, 26(8), August 1991.
[2] Erik Brunvand. The NSR processor. In Proc. Hawaii International Conf. System Sciences, volume I. IEEE Computer Society Press, January 1993.
[3] Steven M. Burns. Performance Analysis and Optimization of Asynchronous Circuits. PhD thesis, California Institute of Technology, 1991.
[4] Tam-Anh Chu. Synthesis of Self-Timed VLSI Circuits from Graph-Theoretic Specifications. PhD thesis, MIT Laboratory for Computer Science, June 1987.
[5] W. S. Coates, A. L. Davis, and K. S. Stevens. The Post Office experience: Designing a large asynchronous chip. Integration, the VLSI Journal, 15(4):341-366, 1993.
[6] Paul Day and J. Viv Woods. Investigation into micropipeline latch design styles. IEEE Transactions on VLSI Systems, 3(2), June 1995.
[7] A. de Angel and E. Swartzlander Jr. A new asynchronous multiplier using enable/disable CMOS differential logic. In Proc. International Conf. Computer Design (ICCD). IEEE Computer Society Press, October 1994.
[8] S. Furber. Computing without clocks: Micropipelining the ARM processor. In Graham Birtwistle and Al Davis, editors, Asynchronous Digital Circuit Design, Workshops in Computing, pages 211-262. Springer-Verlag, 1995.
[9] Jim D. Garside. A CMOS VLSI implementation of an asynchronous ALU. In S. Furber and M. Edwards, editors, Asynchronous Design Methodologies, volume A-28 of IFIP Transactions, pages 181-207. Elsevier Science Publishers, 1993.
[10] Mark B. Josephs, Rudolf H. Mak, Jan Tijmen Udding, Tom Verhoeff, and Jelio T. Yantchev. High-level design of an asynchronous packet-routing chip. In Jørgen Staunstrup and Robin Sharp, editors, Designing Correct Circuits, volume A-5 of IFIP Transactions, pages 261-274. Elsevier Science Publishers, 1992.
[11] L. Lavagno and A. Sangiovanni-Vincentelli. Automated synthesis of asynchronous interface circuits. In S. Furber and M. Edwards, editors, Asynchronous Design Methodologies, volume A-28 of IFIP Transactions, pages 107-121. Elsevier Science Publishers, 1993.
[12] A. J. Martin. Programming in VLSI: From communicating processes to delay-insensitive VLSI circuits. In C. A. R. Hoare, editor, UT Year of Programming Institute on Concurrent Programming. Addison-Wesley, 1990.
[13] Alain J. Martin. Asynchronous datapaths and the design of an asynchronous adder. Formal Methods in System Design, 1(1):119-137, July 1992.
[14] Alain J. Martin, Steven M. Burns, T. K. Lee, Drazen Borkovic, and Pieter J. Hazewindus. The design of an asynchronous microprocessor. In Charles L. Seitz, editor, Advanced Research in VLSI: Proceedings of the Decennial Caltech Conference on VLSI, pages 351-373. MIT Press, 1989.
[15] C. Myers and T. H.-Y. Meng. Synthesis of timed asynchronous circuits. IEEE Transactions on VLSI Systems, 1(2):106-119, June 1993.
[16] Steven M. Nowick, Mark E. Dean, David L. Dill, and Mark Horowitz. The design of a high-performance cache controller: a case study in asynchronous synthesis. Integration, the VLSI Journal, 15(3):241-262, October 1993.
[17] N. C. Paver. The Design and Implementation of an Asynchronous Microprocessor. PhD thesis, Department of Computer Science, University of Manchester, June 1994.
[18] Robert F. Sproull, Ivan E. Sutherland, and Charles E. Molnar. The counterflow pipeline processor architecture. IEEE Design & Test of Computers, 11(3):48-59, Fall 1994.
[19] Ivan E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720-738, June 1989.
[20] Ted E. Williams and Mark A. Horowitz. A zero-overhead self-timed 160ns 54b CMOS divider. IEEE Journal of Solid-State Circuits, 26(11):1651-1661, November 1991.
[21] A. Yakovlev, V. Varshavsky, V. Marakhovsky, and A. Semenov. Designing an asynchronous pipeline token ring interface. In Asynchronous Design Methodologies, pages 32-41. IEEE Computer Society Press, May 1995.
[22] K. Y. Yun. Synthesis of Asynchronous Controllers for Heterogeneous Systems. PhD thesis, Stanford University, August 1994. Technical Report CSL-TR-94-644.
[23] K. Y. Yun and D. L. Dill. Unifying synchronous/asynchronous state machine synthesis. In Proc. International Conf. Computer-Aided Design (ICCAD), pages 255-260. IEEE Computer Society Press, November 1993.
[24] K. Y. Yun and D. L. Dill. A high-performance asynchronous SCSI controller. In Proc. International Conf. Computer Design (ICCD), pages 44-49, October 1995.
[25] Kenneth Y. Yun. Automatic synthesis of extended burst-mode circuits using generalized C-elements, 1996. Submitted to a conference.