An Evaluation of Parallel Simulated Annealing Strategies With Application to Standard Cell Placement



398 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 16, NO. 4, APRIL 1997

Short Papers

An Evaluation of Parallel Simulated Annealing Strategies with Application to Standard Cell Placement

John A. Chandy, Sungho Kim, Balkrishna Ramkumar, Steven Parkes, and Prithviraj Banerjee

Abstract—Simulated annealing, a methodology for solving combinatorial optimization problems, is a very computationally expensive algorithm and, as such, numerous researchers have undertaken efforts to parallelize it. In this paper, we investigate three of these parallel simulated annealing strategies when applied to standard cell placement, specifically the TimberWolfSC placement tool. We have examined a parallel moves strategy, as well as two new approaches to parallel cell placement—multiple Markov chains and speculative computation. These algorithms have been implemented in ProperPLACE, our parallel cell placement application, as part of the ProperCAD II project. We have constructed ProperPLACE so that it is portable across a wide range of parallel architectures. Our parallel moves algorithm uses novel approaches to dynamic message sizing, message prioritization, and error control. We show that parallel moves and multiple Markov chains are effective approaches to parallel simulated annealing when applied to TimberWolfSC, yet speculative computation is wholly inadequate.

Index Terms—Parallel simulated annealing, standard cell placement.

I. INTRODUCTION

Simulated annealing [1] has long been acknowledged as a powerful combinatorial optimization tool, but its drawback has always been its appetite for computational resources. In light of this, several researchers have investigated parallel implementations of simulated annealing. We are interested in the application of these parallel simulated annealing algorithms to cell placement, particularly TimberWolfSC [2], one of the more popular standard cell placement tools to use simulated annealing. Of the several generalized algorithms proposed for parallelizing simulated annealing, only a few have been applied to cell placement. Parallel moves has been the most popular strategy [3]–[6], and in this paper we will present a new implementation of this approach. We will also investigate two new parallel simulated annealing algorithms that have not been used for cell placement, namely, multiple Markov chains [7], [8], and speculative computation [9].

The three parallel simulated annealing algorithms have been implemented in ProperPLACE, our standard cell placement tool, as part of the ProperCAD II project. By building around an existing sequential placement algorithm, TimberWolfSC, we ensure that the overheads of parallelization are kept low and that the parallel algorithm automatically benefits from future improvements in the uniprocessor algorithm. As an indication of this effort, we have been able to reuse nearly 95% of the sequential source code from TimberWolfSC 6.0. Another important feature of the ProperPLACE application is its portability—the code can run on a wide variety of MIMD architectures with no change to its source code. In the first of the parallel algorithms we investigate—parallel moves—we address a significant problem with current approaches to parallel placement. In all the asynchronous schemes proposed to date, maintaining the accuracy of local databases has been the major hurdle. This inaccuracy arises from the fact that moves accepted by one processor are not relayed to the other processors in time. To reduce this inaccuracy, we propose a prioritized message passing technique along with a dynamic message sizing control mechanism.

Manuscript received June 25, 1993; revised September 20, 1994 and March 27, 1997. This paper includes portions that were presented at the 1994 International Parallel Processing Symposium and the 1996 International Conference on VLSI Design. This work was supported in part by the Semiconductor Research Corporation under Grant SRC 94-DP-109, and the Advanced Research Projects Agency under Contract DAA-H04-94-G-0273 administered by the U.S. Army Research Office. This paper was recommended by Associate Editor M. Sarrafzadeh.

J. A. Chandy and S. Parkes are with Sierra Vista Research, Los Gatos, CA 95030 USA.

S. Kim is with Synopsys, Inc., Mountain View, CA 94043 USA.

B. Ramkumar is with the Department of Electrical and Computer Engineering, University of Iowa, Iowa City, IA 52242 USA.

P. Banerjee is with the Center for Parallel and Distributed Computing, Northwestern University, Evanston, IL 60208 USA.

Publisher Item Identifier S 0278-0070(97)05485-7.

For all three parallel strategies, we present results on several MCNC and ISCAS benchmark circuits for both shared memory machines (Sun 4/690MP and Sun SparcServer 1000) and distributed memory environments (Intel Paragon and a cluster of Sun 4 workstations). The various strategies have differing impacts on both solution quality and runtime, and these are examined in further detail.

The rest of this paper is organized as follows. In the following section, we present a brief overview of the ProperCAD II project. The next section introduces the placement problem and offers a synopsis of the TimberWolfSC algorithm. Section IV reviews some of the algorithms proposed for parallel simulated annealing as well as some of the previous work done in parallel placement. The next three sections describe the three parallel simulated annealing algorithms that we are evaluating in this paper. In each of these sections, we present a detailed analysis of the algorithm as well as experimental results.

II. AN OVERVIEW OF THE PROPERCAD II PROJECT

The rapid improvement in very large scale integration (VLSI) technology over recent years has made circuit design an extremely complex process and in turn has been placing increasing demands on CAD tools. Parallel processing is fast becoming an attractive solution to reduce the inordinate amount of time spent in VLSI circuit design. This has been recognized by several researchers in VLSI CAD, as is evident in recent literature for cell placement [3], [6], [10], floorplanning [11], circuit extraction [12], [13], test generation, and fault simulation [14]–[16].

However, much of the work in parallel CAD reported to date suffers from a major limitation. The parallel algorithms proposed for these applications are designed with a specific underlying architecture in mind. As a result, these programs perform poorly on architectures other than the one for which they were designed. Even more importantly, incompatibilities in programming environments also make it difficult to port these programs across different parallel architectures. This limitation has serious consequences, since a parallel algorithm needs to be developed afresh for every target MIMD architecture. This is compounded by the length of the software development cycle, which, for parallel applications, is considerably longer than for sequential programs. Consequently, parallel programs are significantly costlier to develop than sequential programs.

0278–0070/97$10.00 © 1997 IEEE


Fig. 1. An overview of the ProperCAD project.

One of the primary concerns of the ProperCAD project [17] is to address this portability problem by designing algorithms to run on a range of parallel machines including shared memory multiprocessors, distributed memory multicomputers, and networks of workstations. The ProperCAD approach to the design of parallel CAD algorithms is illustrated in Fig. 1. A parallel algorithm is designed around an existing uniprocessor algorithm by identifying modules in the uniprocessor code and designing a well-defined interface between the parallel and sequential code.

The project has undergone two distinct phases. The first phase, ProperCAD I, involved the use of the C-based Charm language and runtime system [18], [19]. As part of ProperCAD I, a suite of parallel applications was developed that addresses the most significant tasks in VLSI design automation, including circuit extraction [20], test generation [21], and logic synthesis [22]. An earlier version of our placement tool, ProperPLACE, was also developed using ProperCAD I [23].

The second phase, ProperCAD II [24], [25], entailed the creation of a C++ library which provided an object-oriented parallel interface based on the actor model of concurrent object-oriented computing. This library-based approach, as opposed to Charm's language-based system, offered us several advantages that are described in greater detail in [24]. The library contains two distinct interfaces—the actor interface (AIF) and the abstract parallel architecture (APA).

The AIF provides the mechanisms necessary for parallel execution in the ProperCAD II environment. Concurrency is achieved through the use of a fundamental object called an actor [26]. An actor object consists of a thread of control that communicates with other actors by sending messages, and all actor actions are in response to these messages. Specific actor methods are invoked to process each type of message, and actors are not allowed to block or explicitly make receive requests from other processors. The runtime system on each processor picks the next available actor thread with some priority, and that thread is then allowed to run to completion without interruption. Also, concurrent abstractions known as aggregates [27] are available in ProperCAD II to support a multiaccess interface to groups of actors.
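The run-to-completion, priority-driven message processing described above can be sketched as follows (a toy illustration with hypothetical types, not the AIF's actual interface):

```cpp
#include <functional>
#include <queue>
#include <vector>

// Each pending message carries a priority; the runtime repeatedly picks
// the highest-priority message and runs its handler to completion,
// without blocking or preemption.
struct Message {
    int priority;
    std::function<void()> handler;
};

struct ByPriority {
    bool operator()(const Message& a, const Message& b) const {
        return a.priority < b.priority;  // max-heap: highest priority first
    }
};

using WorkPool = std::priority_queue<Message, std::vector<Message>, ByPriority>;

// Drain the pool, running each handler uninterrupted.
void run_all(WorkPool& pool) {
    while (!pool.empty()) {
        Message m = pool.top();
        pool.pop();
        m.handler();
    }
}
```

Draining the pool always runs the highest-priority pending handler next, and a handler is never preempted, mirroring the scheduling discipline described above.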

The implementation of the actor interface is defined in terms of the APA, a low-level layer that provides an interface and implementation for describing and utilizing the resources needed by any parallel application across a variety of architectures. The APA is used by the AIF but may also be used by applications directly. The AIF has been carefully integrated with the APA so that common types of architectural tuning and incremental parallelization can be expressed in a systematic way, reducing the need for ad hoc combinations of AIF and APA references.

Applications created using ProperCAD II include test generation [28], fault simulation [29], logic synthesis [30], VHDL simulation [31], and placement. The latter is the focus of this paper.

III. THE PLACEMENT PROBLEM

The VLSI cell placement problem involves placing a set of cells on a VLSI layout, given a netlist which provides the connectivity between each cell and a library containing layout information for each type of cell. This layout information includes the width and height of the cell, the location of each pin, the presence of equivalent pins, and the possible presence of feedthrough paths within the cell. The primary goal of cell placement is to determine the best location of each cell so as to minimize the total area of the layout and the length of the nets connecting the cells together. With standard cell design, the layout is organized into equal height rows, and the desired placement should have equal length rows.

A. TimberWolfSC

One of the more popular sequential applications for placement has been the TimberWolfSC set of cell placement tools [2], [32]. TimberWolfSC's core algorithm, simulated annealing, is a suitable approach to problems like VLSI cell placement since they lack good heuristic algorithms. Briefly, simulated annealing is an iterative improvement strategy that starts with a system in a disordered state and, through perturbations of the state, brings the system gradually to a low energy, and thus optimal, state. A significant characteristic of simulated annealing is that, unlike greedy algorithms, perturbations that increase the energy of the system are sometimes accepted, with a probability related to the temperature of the system [1], [33]. In the context of cell placement, and TimberWolfSC in particular, perturbations are simply moves of the cells to different locations on the layout, and the energy is an approximated layout cost function, consisting of three parts:

• estimate of the wirelength of all nets as the half perimeter of the bounding box;

• penalty for area overlap between cells in the same row;

• penalty for difference between actual and desired row length.
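As an illustration of the first component, the half-perimeter wirelength of a net can be computed from the bounding box of its pins (a minimal sketch; the Pin type and function name are ours, not TimberWolfSC's):

```cpp
#include <algorithm>
#include <vector>

// A pin position on the layout grid (illustrative type).
struct Pin { int x, y; };

// The wirelength of one net is estimated as half the perimeter of the
// bounding box of its pins, i.e. the box's width plus its height.
int half_perimeter_wirelength(const std::vector<Pin>& pins) {
    if (pins.empty()) return 0;
    int xmin = pins[0].x, xmax = pins[0].x;
    int ymin = pins[0].y, ymax = pins[0].y;
    for (const Pin& p : pins) {
        xmin = std::min(xmin, p.x); xmax = std::max(xmax, p.x);
        ymin = std::min(ymin, p.y); ymax = std::max(ymax, p.y);
    }
    return (xmax - xmin) + (ymax - ymin);
}
```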

Moves are generated by choosing a random cell and then displacing it to a random location on the layout. If a cell is already present at the new location, the two cells are exchanged. A temperature dependent range limiter is used to limit the distance over which a cell can move. Initially, the span of the range limiter is set such that a cell can move anywhere on the layout. Subsequently, the span is decreased logarithmically with temperature. These range limiter updates are made at the end of each of the 160 iterations into which TimberWolfSC segments the simulated annealing procedure. As the algorithm progresses, the temperature is gradually decreased by forcing the acceptance rate to follow a theoretically derived schedule that attempts to keep the acceptance rate close to 44% during the middle region of annealing [34]. TimberWolfSC 6.0 also uses row bins to aid in the computation of overlap and row penalties, and early rejection methods are used to speed up the decision process [35].
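The temperature-dependent acceptance behavior described above can be sketched as a Metropolis-style rule (a simplified illustration; the function is ours, not TimberWolfSC's code):

```cpp
#include <cmath>
#include <random>

// A move that lowers the cost is always accepted; an uphill move with
// cost increase delta > 0 is accepted with probability exp(-delta / T),
// so uphill moves become rarer as the temperature T falls.
bool accept_move(double delta_cost, double temperature, std::mt19937& rng) {
    if (delta_cost <= 0.0) return true;  // downhill: always accept
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < std::exp(-delta_cost / temperature);
}
```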

IV. PARALLEL ANNEALING STRATEGIES

Because of the inherent computational costs associated with simulated annealing, several methods have been proposed for the parallelization of the procedure. We briefly describe four major approaches that researchers have proposed to apply parallelism to simulated annealing.


Fig. 2. The actors in ProperPLACE-PM.

A. Generalized Strategies for Parallel Simulated Annealing

1) Move Acceleration: In this approach, each individual move is evaluated faster by breaking up the task of evaluating a move into subtasks such as selecting a feasible move, evaluating the cost changes, deciding to accept or reject, and perhaps updating a global database. Concurrency is obtained by delegating individual subtasks to different processors. Such approaches to parallelization are restricted to shared memory architectures and have limited scope for large scale parallelism.

2) Parallel Moves: In this method, each processor generates and evaluates moves independently, as if the other processors were not making any moves. One problem with this approach is that the cost function calculations may be incorrect due to the moves made by the other processors. This can be handled either by evaluating only moves that do not interact or by handling interacting moves with some error tolerance procedure.

3) Multiple Markov Chains: The multiple Markov chain approach calls for the concurrent execution of separate simulated annealing chains with periodic exchange of solutions [7], [36]. This algorithm is particularly promising since it has the potential to use parallelism to increase the quality of the solution.
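The periodic exchange step can be sketched as follows (our illustration with hypothetical types; the exchange criteria of [7], [36] are more involved): each chain anneals independently, and at each exchange point every chain restarts from the best solution found so far.

```cpp
#include <algorithm>
#include <vector>

// One independent annealing chain's current solution and its cost.
struct ChainState {
    double cost;
    std::vector<int> placement;  // e.g., a cell-to-location assignment
};

// Replace every chain's state with the lowest-cost state among them;
// called periodically between stretches of independent annealing.
void exchange_best(std::vector<ChainState>& chains) {
    auto best = std::min_element(
        chains.begin(), chains.end(),
        [](const ChainState& a, const ChainState& b) { return a.cost < b.cost; });
    ChainState winner = *best;  // copy before overwriting the others
    for (auto& c : chains) c = winner;
}
```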

4) Speculative Computation: Speculative computation attempts to predict the execution behavior of the simulated annealing schedule by speculatively executing future moves on parallel nodes. The speedup is limited to the inverse of the acceptance rate (at the 44% acceptance rate targeted during mid-annealing, for example, this bound is only about 2.3), but the approach does have the advantage of retaining the exact execution profile of the sequential algorithm, and thus the convergence characteristics are maintained.

B. Parallel Algorithms for Placement

Many researchers have investigated the parallelization of placement algorithms, but only methods 1 and 2 have been used. Kravitz and Rutenbar [3] tried approaches 1 and 2 on a shared memory multiprocessor and obtained a speedup of 2 on four processors for the first approach and 3.5 on four processors for the second approach. Banerjee et al. [4] implemented a parallel placement algorithm using the parallel moves approach on an iPSC/2 hypercube multiprocessor and proposed several geographical partitioning strategies for the problem specific to the hypercube topology. Speedups of 12 on 16 processors were reported. Using approach 2, Casotto et al. [37] worked on speeding up simulated annealing for the placement of macrocells and achieved speedups of six using eight processors on a shared memory multiprocessor. Rose et al. [5] proposed a parallel algorithm on an experimental distributed memory multiprocessor. In that algorithm, they replaced the high temperature portion of the parallel simulated annealing placer with a min-cut based algorithm and used a parallel moves strategy for the lower temperatures. Speedups of four on five processors were reported. Jayaraman and Rutenbar [11] proposed for the Intel iPSC hypercube multiprocessor a parallel floorplanning algorithm that uses parallel moves along with periodic synchronization to control the error. Both Casotto and Sangiovanni-Vincentelli [38] and Wong and Fiebrich [39] have presented results on parallel moves implementations of simulated annealing placement on the Connection Machine. Sun and Sechen have shown results achieving near linear speedup on a network of workstations using a parallel moves approach [40].

Fig. 3. Geographic partitioning of the placement area.

These and all the other previous works on parallel placement algorithms share several drawbacks. First, each is proposed for a specific parallel architecture. Second, most of these parallel algorithms were developed from scratch by rewriting the sequential algorithm; hence, their performance on a single processor was not as efficient as the best sequential algorithms. Finally, the error in the cost function evaluation due to parallel move evaluation was not controlled, leading to significant quality degradation as more processors are added.

In this paper, we will investigate three of the simulated annealing algorithms presented above—parallel moves, multiple Markov chains, and speculative computation. We decided not to evaluate move acceleration since the potential speedup reported by [3] was not significant, and it is only practical on shared memory architectures. Our implementation of parallel moves is particularly notable as it is the first portable implementation of such a strategy. This and the other two methods have been implemented as part of our parallel cell placement tool—ProperPLACE—and they are described further in the following sections.

Fig. 4. An outline of the ProperPLACE-PM algorithm for each Anneal actor.

V. ProperPLACE-PM: PARALLEL MOVES

ProperPLACE-PM exploits parallelism by using parallel moves and allowing errors in the cost function. The application begins with a random input placement that is replicated on each available physical processor. Using the aggregate feature of the ProperCAD II library, an aggregate named Circuit is constructed to manage access to the circuit structure and maintain a coherent state of the current placement. Each processor will have one representative of the aggregate responsible for its local copy of the circuit. In addition, an Anneal actor is created per physical processor to perform the annealing steps—i.e., move, evaluate, and decide. Fig. 2 shows the relationship between the aggregate and its dependent actors.

After the creation of the actors, the placement is divided up by rows, with each row and its cells assigned to a separate Anneal actor. For example, the placement in Fig. 3 has four rows and is replicated on each of four processors. Each actor is responsible for one row as shown in Fig. 3, and thus is only allowed to attempt moves on cells in that row. If a cell is moved to a region owned by another actor, the ownership of the cell is transferred to the new actor, and the original actor is no longer responsible for moving that cell. Because an entire row, not a subpart, is owned by an actor, there will be no error in the calculation of cell overlaps and row lengths (the second and third cost function components in Section III-A) during the simultaneous evaluation of multiple moves. Note that this approach assumes that the number of rows is greater than or equal to the number of processors. If not, the rows must be split into a number of subrows, in which case some overlap penalties may be calculated erroneously. Although this row-based partitioning is rather naive, it is sufficient, since the early high temperature regions of the simulated annealing algorithm will cause the cells to be randomly spread across the circuit.

Fig. 5. Moves in ProperPLACE-PM.

After partitioning, each Anneal actor can proceed with the annealing algorithm outlined in Fig. 4. A valid cell is selected for perturbation, and then a displacement or exchange is performed on that cell. As detailed below, there are two subclasses of moves for both displacement and exchange, or four move types in total. The move type is determined by the intended location of the selected cell A.

M1) Intra-Actor Cell Displacement: Cell A moves to a new location owned by the same actor.

M2) Intra-Actor Cell Exchange: Two cells A and B owned by the same actor exchange their locations.

M3) Inter-Actor Cell Displacement: Cell A moves to a new location owned by a different actor.

M4) Inter-Actor Cell Exchange: Two cells A and B owned by different actors are exchanged.
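Given the owner of cell A, the owner of the target location, and whether that location is occupied, the move type can be determined as follows (a hypothetical helper for illustration, not ProperPLACE's code):

```cpp
// The four move types M1-M4 described above.
enum MoveType { M1_IntraDisplace, M2_IntraExchange, M3_InterDisplace, M4_InterExchange };

// An unoccupied target gives a displacement (M1/M3); an occupied target
// gives an exchange (M2/M4). Matching owners make the move intra-actor.
MoveType classify_move(int owner_a, int owner_target, bool target_occupied) {
    bool same_owner = (owner_a == owner_target);
    if (!target_occupied) return same_owner ? M1_IntraDisplace : M3_InterDisplace;
    return same_owner ? M2_IntraExchange : M4_InterExchange;
}
```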

An example of each type of move is shown in Fig. 5. In the figure, assume that each row is owned by a different Anneal actor. Notice that three of the moves (M1, M2, M3) can be done alone by actor 0, the owner of cell A. For the move M4, however, actor 0 needs permission from actor 1, which owns cell B, as it is possible that cell B may have already been moved to another location or is frozen due to some pending move. Because the information about cell B may be out of date in the database of actor 0, it locks (or freezes) cells A and B and sends an AskPermission message to actor 1. After receiving the AskPermission message, actor 1 examines the state of cell B and determines whether to allow the exchange. The decision is sent back to actor 0 in a ReturnAnswer message. Upon receipt of the ReturnAnswer message, actor 0 unlocks cells A and B, and the move is attempted if the returned answer is yes. Actor 0 does not wait idly until the ReturnAnswer message is received—instead, it continues annealing by making other moves with unfrozen cells that it owns. Since the inter-actor exchange move (M4) takes the most time due to extensive message passing, we introduce in Section V-A a prioritized message scheme to reduce the time taken by M4.

Fig. 6. Asynchronous versus synchronous parallel simulated annealing: (a) asynchronous and (b) synchronous.
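The owner-side decision for an AskPermission message can be sketched as follows (hypothetical types; ProperPLACE's actual checks may differ): the owner grants the exchange only if cell B is not frozen and still sits where the requesting actor believes it does.

```cpp
// The owner's local view of a cell (illustrative type).
struct CellInfo {
    int x, y;     // current position in the owner's database
    bool frozen;  // locked by a pending move elsewhere
};

// Grant the M4 exchange only when the requester's view of cell B is
// still current and B is not frozen by another pending move.
bool grant_exchange(const CellInfo& b, int expected_x, int expected_y) {
    if (b.frozen) return false;                     // pending move elsewhere
    return b.x == expected_x && b.y == expected_y;  // requester's view is current
}
```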

If a move is accepted, then the accepting actor must send the move to the Circuit aggregate so that a consistent cell position database can be maintained. In order to amortize the startup cost of sending a message, position update messages are held until a number of moves have been accepted. Although this reduces the total number of Update messages sent among processors, there is a drawback to this approach. As the frequency of Update messages is reduced, the local cell position database on each Circuit representative becomes increasingly inaccurate, thereby causing the cost function calculation error to increase as well. This error, if too large, may prevent the algorithm from converging to an optimal solution. Previous researchers [4], [11], [38] have shown that simulated annealing is tolerant of some error in cost function calculations. In ProperPLACE-PM, the frequency of sending position Update messages is determined adaptively such that the error in the cost function is kept small at all times. This will be discussed in detail in Section V-B.
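The batching of position updates can be sketched as follows (our illustration; the adaptive sizing policy of Section V-B is more involved):

```cpp
#include <cstddef>
#include <vector>

// One accepted move: a cell and its new position (illustrative type).
struct Move { int cell; int x, y; };

// Accepted moves are buffered and only broadcast once batch_size of
// them have accumulated, amortizing the per-message startup cost.
class UpdateBatcher {
public:
    explicit UpdateBatcher(std::size_t batch_size) : limit_(batch_size) {}

    // Record an accepted move; returns true when the buffer is full
    // and an Update message should be broadcast.
    bool record(const Move& m) {
        pending_.push_back(m);
        return pending_.size() >= limit_;
    }

    // Take the accumulated batch for sending, emptying the buffer.
    std::vector<Move> flush() {
        std::vector<Move> out;
        out.swap(pending_);
        return out;
    }

private:
    std::size_t limit_;
    std::vector<Move> pending_;
};
```

Shrinking batch_size sends updates more often, trading message overhead for a more accurate local database; this is the knob the adaptive scheme tunes.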

Since actor methods are nonblocking, the actor's annealing process must give up control every so often to allow the aggregate to gain computation time to perform the updates. Therefore, a limit M is placed on the number of moves that may be performed in succession without interruption. The Circuit aggregate can then process any waiting Update messages to keep the local database up to date. The Anneal actor will have rescheduled itself by sending itself a Continue message, which returns control to the Anneal actor so that the next set of moves can be proposed and evaluated.

After broadcasting its set of moves through the aggregate, an Anneal actor does not wait idly until all the Update messages sent by other actors have been processed; instead, it goes ahead with the next sequence of simulated annealing moves. The advantage of this asynchronous approach is illustrated in Fig. 6. Fig. 6(b) shows a synchronous approach to parallelization in which actors finishing a block of moves must wait for slower actors to finish. The time to evaluate different moves is not the same, leading to some actors remaining idle (shown by the dark rectangles) and thus reducing the overall speedup. In the asynchronous approach shown in Fig. 6(a), actors become idle only at the end of the entire simulated annealing procedure. The overall idle time is reduced, leading to greater speedup than the synchronous method. However, synchronization does offer an advantage in that the error in the cost function calculation becomes zero at each synchronization barrier, thereby making error control much easier. In the asynchronous approach, an effective error control scheme (discussed in Section V-B) is necessary to bound the accumulated error in the system.

A. Prioritized Messages

The ProperCAD II library provides prioritized messages whereby the programmer can influence the order in which messages in the work pool are picked for processing by assigning priorities to them. Prioritized execution is instrumental in delivering the performance presented in Section V-D for our placement algorithm. In this section, we describe the use of priorities both to reduce the runtime and to improve the quality of the solution. The reduction in runtime is achieved by reducing the time taken by the inter-actor exchange move (M4), which is considerably longer than that of all the other move types, and the improved solution quality is achieved by reducing database inconsistencies.

In our parallel simulated annealing algorithm, the majority of messages passed among actors are one of the following:

• Update;
• Continue;
• AskPermission;
• ReturnAnswer.

We give the highest priority to the AskPermission and ReturnAnswer messages. This reduces, first, the probability that cell B, involved in the inter-actor exchange move, is moved to another location before the AskPermission message is received by the owner of cell B and, second, the time that cells remain frozen, i.e., prevented from moving. Messages of type Update are given the next highest priority. Finally, messages of type Continue are given the lowest priority. By giving a higher priority to Update


Fig. 7. Moves between update messages with M = 10.

than Continue, all Update messages received by a processor are picked up from the message queue before the Continue message. Therefore, the most up-to-date cell location information is always used at the beginning of each block of M moves.
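The priority ordering above can be sketched with a simple priority queue. A hypothetical Python sketch (the real ProperCAD II work pool is part of the C++ actor runtime); the numeric levels are illustrative, smaller meaning served first.

```python
import heapq

# Illustrative priority levels mirroring Section V-A (smaller = served first).
PRIORITY = {"AskPermission": 0, "ReturnAnswer": 0, "Update": 1, "Continue": 2}

class WorkPool:
    """Sketch of a prioritized message pool with FIFO order within a level."""
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserving arrival order

    def post(self, msg):
        heapq.heappush(self._heap, (PRIORITY[msg], self._seq, msg))
        self._seq += 1

    def next_message(self):
        return heapq.heappop(self._heap)[2]

pool = WorkPool()
for m in ["Continue", "Update", "AskPermission", "Update"]:
    pool.post(m)
order = [pool.next_message() for _ in range(4)]
```

Because Continue carries the lowest priority, every queued Update drains before the next block of M moves begins, exactly as described above.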

B. Dynamic Message Sizing and Error Control

To reduce communication overhead, Update messages are broadcast periodically after accumulating a number of accepted moves, not after each move. The problem is then to determine the frequency of this update message or, equivalently, the number of attempted moves between updates (M). We are concerned with the number of accepted moves between updates or, in other words, |U|, the size of the Update message. If |U| is too large, moves accepted by one actor do not appear in another processor's database on time. Consequently, the error in the cost function calculation increases, further degrading the solution quality. Such cumulative error in the cost function has also been recognized as a severe problem by several researchers [4], [6], [10]. On the other hand, if the message size is too small, the number of messages sent increases, resulting in large communication overheads.

A naive approach to determining |U| is to use a statically preset value. The problem with this static approach is that there is no good way to determine an optimal |U| a priori. In a dynamic approach, |U| is determined during the annealing process by monitoring the size of the error in the cost function in the system. If the error during the annealing process becomes too large, |U| is reduced to decrease the error at the cost of an increased number of messages. Likewise, if the error becomes very small, |U| is increased to reduce the number of messages. As long as the size of the error is bounded, this dynamic approach will produce a solution of quality equivalent to that of a sequential simulated annealing algorithm.

The error in the cost function is defined to be the difference between the real change in cost from the initial to final configurations and the estimated change in cost, equal to the sum of the locally perceived changes in cost at each processor. If C_i is the exact initial cost, ΔC_j is the change in cost computed locally at each of the n processors, and C_f is the exact cost of the new configuration after a series of moves, then

C_i + \sum_{j=1}^{n} \Delta C_j = C_f + E_{all}

where E_all is the total accumulated error.

In light of the fact that no synchronization takes place to exchange cell position information, C_f is available only at the end of the annealing process, at which time each processor has an identical copy of the entire circuit. Consequently, E_all cannot be obtained during the placement process. Therefore, in an asynchronous approach, one can only approximate what E_all will be during the placement process.

In our algorithm, instead of estimating E_all, we estimate E, the error that each processor contributes by moving its own cells. The advantage of obtaining the error this way is that it can be calculated by each processor independently, without any synchronization.

In Fig. 7, while a sequence of M (= 10) moves (m_1, m_2, ..., m_10) is proposed, evaluated, and accepted, no update message is sent to the other processors. This may cause error in the cost function calculations on the other processors if the accepted moves among the M have some nets connected to cells on the other processors (remote cells). In addition, earlier moves have a higher chance of causing such an error than later ones. For example, m_1, if accepted, is more likely to cause an error than m_10 because more remote cells will have moved during the time between m_1 and t_2 than between m_10 and t_2.

In the ProperPLACE-PM algorithm, each processor calculates the error that it contributes to the other processors by

E_{estimate} = \frac{\sum_{i \in A} e(i)\, p(i)}{N}.

In the above equation, A is the set of N accepted moves within a time interval [t_1, t_2], and e(i) is an upper bound on the amount of error for move i, which is the cell displacement multiplied by the number of nets attached to cell i. p(i) is the probability that some nets connected to cell i will be moved by other processors during the time between accepting move i and sending the next Update message. The p(i) term takes into account all nets attached to the moved cell as follows:

p(i) = 1 - \prod_{j \in R_i} [1 - P(j)]

where R_i is the set of all remote cells sharing a net with cell i and P(j) is the probability that cell j is moved during the time between accepting move i and sending the next Update message:

P(j) = \text{acceptance rate}(\text{owner of } j) \cdot \frac{M - M_i}{cells_j}

where M_i is the number of moves attempted between sending the last update message and accepting move i, and cells_j is the number of cells owned by the owner of cell j.

Fig. 8 shows an example of the error estimation for a single displacement move. The figure shows three standard cell rows, owned by Anneal 0, Anneal 1, and Anneal 2. Suppose that gate i moves to a new location by a distance of 10, as shown by an arrow in the figure. Since cells A, B, C, and E, which are connected to cell i through a net, may be moved before Anneal 1 sends the next update message, there may be an error in the cost function calculated by Anneal 0 and Anneal 2. Let us assume that each actor owns 100 cells and that the current acceptance rate of simulated annealing is 0.5 for Anneal 0 and 0.2 for Anneal 2. Also, assume that Anneal 1 has attempted 50 moves since broadcasting the update message. Then P(A) = P(B) = P(C) = 0.5 × 50/100 = 0.25 and P(E) = 0.2 × 50/100 = 0.1. Therefore, p(i) = 1 - (1 - 0.25)(1 - 0.25)(1 - 0.25)(1 - 0.1) = 0.62 and E_estimate(i) = 0.62 · e(i), where e(i) is the amount of displacement of cell i, an upper bound on the error. With e(i) = 10, as cell i has only one net in this example, E_estimate(i) = 6.2.
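The estimate can be reproduced numerically. A sketch in Python using the Fig. 8 numbers; the function names are invented, and the 50-move window is taken directly from the example.

```python
def P(accept_rate, moves_in_window, cells_owned):
    """Probability that a given remote cell moves before the next Update."""
    return accept_rate * moves_in_window / cells_owned

def p_move(remote_probs):
    """p(i) = 1 - prod_j (1 - P(j)) over remote cells sharing a net with i."""
    prod = 1.0
    for Pj in remote_probs:
        prod *= 1.0 - Pj
    return 1.0 - prod

# Fig. 8: each actor owns 100 cells; acceptance rates are 0.5 (Anneal 0)
# and 0.2 (Anneal 2); 50 moves fall in the update window.
PA = PB = PC = P(0.5, 50, 100)        # cells A, B, C -> 0.25 each
PE_cell = P(0.2, 50, 100)             # cell E -> 0.1
p_i = p_move([PA, PB, PC, PE_cell])   # ~0.62
e_i = 10                              # displacement of 10, one attached net
E_estimate_i = p_i * e_i              # ~6.2
```

The unrounded value of p(i) is 0.6203125, which the text rounds to 0.62.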

This error, accumulated over M moves, is used to control |U|, the update message size. In ProperPLACE-PM, we put a bound on the error in the cost function as originally reported in [4]. The probability of accepting a move is

P = e^{-\Delta C / T} \cdot \text{Prob}(\Delta C > 0) + \text{Prob}(\Delta C < 0).

In the presence of error, the composite acceptance rate changes slightly; however, the probability of generating good or bad moves is invariant with respect to error:

P_E = e^{-(\Delta C - E)/T} \cdot \text{Prob}(\Delta C > 0) + \text{Prob}(\Delta C < 0).


Fig. 8. An example of error estimation.

To bound the acceptance rate with error, P_E, to within 5% of normal, i.e.,

\frac{|P - P_E|}{P} \le 0.05

we find a bound on the magnitude of the error:

E \le T \ln(1.05) \approx \frac{T}{21}.

Whenever the computed error E is higher than T/21, we decrease the message size by (current size / 2) · (1 - e^{-E/k}). The variable k is fixed at 0.0687 · T in order to ensure that at the boundary (E = T/21) the message size is decreased by current size / 4. If the error is very low, the message size is increased similarly.
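The shrink rule follows directly from these constants. A Python sketch; the "very low" threshold for growing the message is an assumption, since the text only says the size is increased similarly.

```python
import math

def resize_update_message(current_size, E, T):
    """Sketch of the |U| control rule: shrink when the estimated error E
    exceeds the bound T/21 (from E <= T*ln(1.05)); grow when E is very low."""
    bound = T / 21
    k = 0.0687 * T  # chosen so the shrink equals current_size/4 at E = bound
    delta = (current_size / 2) * (1 - math.exp(-E / k))
    if E > bound:
        return current_size - delta
    if E < bound / 4:  # assumed threshold for "very low" error (illustrative)
        return current_size + delta
    return current_size

# Just above the boundary E = T/21, the decrease is ~current_size/4:
T = 100.0
new_size = resize_update_message(64.0, T / 21 * 1.001, T)
```

The choice k = 0.0687·T works because E/k = (1/21)/0.0687 ≈ ln 2 at the boundary, making the shrink factor exactly one half of current size / 2.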

C. Load Balancing by Inter-Actor Move Suppression

In order to maintain the same total number of moves, the number of moves attempted in the inner iteration by each actor is reduced as follows:

M_{par} = \frac{M_{uni}}{\text{number of processors}}

where M_uni is the number of moves attempted in TimberWolfSC 6.0.

Since the time to evaluate each move differs, some Anneal actors may perform annealing much faster than others. Also, the number of cells owned by each actor may vary considerably because cells are allowed to move to other actors. To maintain approximately the same number of cells owned by each actor, we need to rebalance the cells among actors.

In our algorithm, we achieve this balance in the number of cells by varying the types of moves proposed and accepted. For example, when the number of cells falls below two-thirds of the original assignment, we cut down the number of inter-actor moves. Because this reduces the probability that a cell moves out of this actor, the number of cells moving into this actor's region becomes greater than the number moving away. Therefore, balance is maintained by this simple technique. Similarly, if an actor owns many more cells than the average, more inter-actor moves are proposed to increase the probability of cells moving out. This change in move types does not affect the placement solution.
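The rebalancing heuristic can be sketched as a bias on the move-type mix. The paper states only the two-thirds cutoff; the symmetric upper cutoff and the bias factors below are illustrative assumptions.

```python
def interactor_move_bias(cells_owned, cells_original):
    """Multiplier on the probability of proposing inter-actor moves (sketch).
    The 2/3 cutoff is from the text; 4/3 and the factors 0.5/2.0 are assumed."""
    ratio = cells_owned / cells_original
    if ratio < 2 / 3:
        return 0.5  # suppress inter-actor moves: net inflow of cells
    if ratio > 4 / 3:  # assumed symmetric cutoff (illustrative)
        return 2.0  # encourage inter-actor moves: net outflow of cells
    return 1.0      # balanced: keep the normal move mix

bias_low = interactor_move_bias(60, 100)
bias_high = interactor_move_bias(150, 100)
```

Because only the proposal mix changes, not the acceptance rule, the placement solution itself is unaffected, as noted above.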

D. Experimental Evaluation

In this subsection, we describe results obtained by using ProperPLACE-PM on various circuits from the ISCAS and MCNC benchmark suites as well as one industry circuit (see Table I). Whereas

TABLE I CIRCUIT INFORMATION (PHYSICAL DESIGN WORKSHOP 91 AND OTHER INDUSTRY BENCHMARKS)

TABLE II TimberWolfSC AND ProperPLACE-PM COMPARISON (SUN 4/690MP SINGLE PROCESSOR)

much of the previous work in parallel placement has used relatively small circuits for its evaluations, we are using reasonably complex circuits that are accepted in the placement community. As noted earlier, the need for parallel processing is only apparent for larger circuits.

1) Results with Prioritized Messages and Dynamic Message Sizing: In Table II, we compare the quality of placement obtained by TimberWolfSC 6.0 with that of ProperPLACE-PM in a uniprocessor (Sun 4/690MP) environment. By ensuring that the decision process for ProperPLACE-PM is identical to that of TimberWolfSC, there is no loss in placement quality (the differences are due to the stochastic behavior of simulated annealing). The last columns compare the runtimes for TimberWolfSC and for ProperPLACE-PM, and again the times are comparable.¹ Most previous work on parallel placement has had to reimplement the sequential algorithm in a simplified way, and as a result the performance of those algorithms on a single processor was far inferior to the best available sequential algorithm.

We ran ProperPLACE-PM on a Sun SparcServer 1000, an Intel Paragon mesh, and a cluster of SparcStation 5 workstations. We would like to emphasize that the ProperPLACE-PM placement program ran unchanged on all these machines. It should be noted that all previous parallel algorithms for placement were targeted to specific parallel architectures.

Table III presents the results obtained on a Sun SparcServer 1000, a shared memory MIMD architecture with eight 50-MHz Sparc processors. The quality of the results in terms of normalized wirelength (W) and the normalized speedups (S) are shown for the benchmark circuits. Tables IV and V show data for the Intel Paragon and a network of SparcStation 5 workstations, respectively.

From the tables it is clear that ProperPLACE-PM produces acceptable speedups while maintaining quality comparable to that of TimberWolfSC 6.0. Please note that the times presented are all wall clock times, not CPU times, and thus are subject to interference

¹From measurements we were able to make using tools like Pure Software's Quantify, the differences in runtime are due almost entirely to memory placement policies and not the overhead of the library. We are in the process of investigating these effects on cache and virtual memory paging behavior.


TABLE III ProperPLACE-PM ON A SUN SPARCSERVER 1000E

TABLE IV ProperPLACE-PM ON AN INTEL PARAGON

TABLE V ProperPLACE-PM ON A CLUSTER OF SPARCSTATION 5 WORKSTATIONS

due to other loads on the system, particularly for the Sun machines, since they are available for general use within the group. Also, we were unable to run the largest circuits on the Paragon because of memory limitations. We are aware of limitations in the ProperCAD II library with respect to workstation cluster environments that prevent us from running large circuits, and we are in the process of investigating these issues.

2) Effect of Synchronizing Barriers on Quality and Speedup: The results we have just presented are for an asynchronous parallel moves algorithm with no synchronization barriers. In this subsection, we demonstrate the effect of synchronization barriers on the speedup and the quality of the placement produced.

In this experiment, we force the annealing actors to synchronize after every 40 moves [see Fig. 6(b)]. As discussed earlier, such synchronization barriers cause actors to wait for the slowest actor to perform its task, but they provide perhaps better control of the error. Table VI compares the results (wirelength and runtime) obtained with the synchronization barriers against the results obtained with the asynchronous algorithm of ProperPLACE-PM on eight processors. The results are presented for the Intel iPSC/860. One can observe that the scheme with synchronizing barriers occasionally produces marginally better quality than ProperPLACE-PM, but the runtimes are much worse for the scheme with synchronization barriers.

VI. ProperPLACE-MMC: MULTIPLE MARKOV CHAINS

The ProperPLACE-PM algorithm investigated above has been shown to be effective at producing cell placement solutions with good speedups at moderate losses of quality. However, there are situations where no loss in quality can be afforded, and the following two sections present parallel algorithms intended to address that problem.

ProperPLACE-MMC uses the concept of multiple Markov chains, first presented as parallel clustered statistical cooling by Aarts et al. [36], [41]. It was further refined by Lee and Lee, who introduced an asynchronous approach to this methodology and, in particular,

TABLE VI COMPARISON OF QUALITY AND RUNTIME WITH AND WITHOUT SYNCHRONIZING BARRIERS (iPSC/860—8 PROCESSORS)

Fig. 9. Message flow in actor-based synchronous MMC.

applied the algorithm to the graph partitioning problem on shared memory multiprocessors [7], [42]. The algorithm can be understood if the sequential simulated annealing procedure is considered as a search path where moves are proposed and either accepted or rejected depending on particular cost evaluations and a starting random seed. The search path is essentially a Markov chain, and parallelization is accomplished by initiating different chains (using different seeds) on each processor. Each chain then explores the entire search space by independently performing the annealing perturbation, evaluation, and decision steps. After each processor has completed the annealing schedule, the solutions are compared and the best is selected. Rose et al. used a similar approach with the min-cut algorithm in the high-temperature region of the simulated annealing schedule [5]. Note that this differs from parallel moves in that each chain is allowed to perform moves on the entire set of cells and not just a subset.

Of course, there is no speedup in this approach, since each processor individually performs the same amount of work as the sequential algorithm. To achieve speedup, we must reduce the number of moves evaluated in each chain by a factor of 1/N, where N is the number of processors. Since the number of moves determines the run time of the program, a reduction by a factor of 1/N will cause a speedup of N. Obviously, such a reduction alone is not appropriate, since the quality will likely decrease accordingly. To take advantage of the fact that multiple processors are being used, some means of interaction between the various chains is necessary.
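The basic scheme, before any interaction is added, can be sketched as follows. This is a toy Python sketch: `anneal_chain` stands in for a full annealing schedule, and its cost model is a placeholder rather than TimberWolfSC's.

```python
import random

def anneal_chain(seed, num_moves, start_cost=1000.0):
    """Toy stand-in for one Markov chain: own RNG, full search space."""
    rng = random.Random(seed)
    cost = start_cost
    for _ in range(num_moves):
        delta = rng.uniform(-1.0, 1.0)
        # Placeholder accept rule: take improvements, occasionally go uphill.
        if delta < 0 or rng.random() < 0.1:
            cost += delta
    return cost

def multiple_markov_chains(total_moves, n_chains):
    # Each chain runs total_moves // n_chains moves with its own seed;
    # the best final cost wins.
    results = [anneal_chain(seed, total_moves // n_chains)
               for seed in range(n_chains)]
    return min(results)

best = multiple_markov_chains(total_moves=4000, n_chains=4)
```

Dividing the move budget by the number of chains is what recovers the speedup of N, at the quality cost the text describes.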

A. Synchronous Multiple Markov Chains

One possible interaction scheme, called synchronous MMC with periodic exchange by Lee and Lee, is to periodically compare solutions at fixed intervals. This method allows each Markov chain to update its local database with the best solution and then continue on. This exchange point serves as the end of a segment of computation and behaves as a barrier synchronization point. According to the


Fig. 10. An outline of the asynchronous method interface for actors in ProperPLACE-MMC.

algorithm proposed by Aarts et al., the exchange point occurs after every move. At the barrier, various application-specific metrics can then be used to determine the best solution. When applied to the TimberWolfSC placement tool, a natural point for solution exchanges is the end of each TimberWolfSC iteration.

In an actor framework such as ProperCAD II, each chain or search path is represented by a separate actor or thread of control. Since the actor model cannot assume a shared memory architecture, solution updates must be done with message sends. The barrier at the end of each segment is implicitly achieved through the use of these messages, as shown in Fig. 9. When an actor has reached the end of its segment, it propagates a solution metric up to a master actor through a reduction tree. This metric is only a cost measurement of the solution and is not the entire global state. The master thread determines the best solution and then directs the actor with the best solution to broadcast its state to all other actors. In the example in Fig. 9, actor 3 is determined to have the best solution.

The barrier could be implemented in a single phase if each actor propagated its entire state to the master rather than just the cost metric. The master could then broadcast the state itself rather than request the winning actor to broadcast it. However, because of the size of the state in cell placement, the two-phase method is more efficient. The implementation proposed by Lee and Lee can use a single phase by transferring the entire state through shared memory. Table VII shows the message size needed to transfer the relevant circuit state for a sampling of circuits.
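The traffic argument for the two-phase barrier can be made concrete. A Python sketch; the byte counts are illustrative placeholders (Table VII gives the real state sizes), not measurements.

```python
def exchange_traffic(n_actors, metric_bytes=8, state_bytes=400_000):
    """Bytes moved per barrier under the two schemes (sketch).
    Two-phase: n small metrics go up the reduction tree, then only the
    winner broadcasts its (large) state to the n-1 other actors.
    One-phase: every actor ships its whole state up front, then one
    broadcast of the winner's state."""
    two_phase = n_actors * metric_bytes + (n_actors - 1) * state_bytes
    one_phase = n_actors * state_bytes + (n_actors - 1) * state_bytes
    return two_phase, one_phase

two, one = exchange_traffic(8)
```

With eight actors and these placeholder sizes, the two-phase scheme avoids shipping seven of the eight full states, which is why it wins whenever states dwarf metrics.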

B. Asynchronous Multiple Markov Chains

From examining Fig. 9, it is obvious that barriers can be costly operations; an asynchronous approach is preferred. Fig. 10 describes the actor interface for such an implementation. Notice that very few modifications have to be made to support the asynchronous method. In this approach, the master actor does not perform any computation. Instead, it serves as a location for the best available solution at any


TABLE VII CIRCUIT SIZE INFORMATION

Fig. 11. Message flow in actor-based asynchronous MMC.

particular time. When an actor has completed an iteration, it sends its solution metric to the master actor and requests the best solution available. The master thread, on receipt of this request, determines whether the received solution is better than the local "best" solution. If it is better, the master asks the requester to send its state back. Remember that the requester had initially sent only the metric. The requester then sends its local state back to the master and continues with the next iteration using its own local state. If the master has determined that the received solution is worse than the best solution, the master simply sends the current best state to the requester. At the cost of dedicating an extra processor for master duties, this asynchronous approach can eliminate much of the idle time that was present earlier.

This algorithm is illustrated graphically in Fig. 11. For example, Actor 2 has finished its segment and sends its solution metric to the master, which determines that this solution is the best and sends a message back requesting Actor 2's local state. While waiting for Actor 2's state, the master has in the meantime received a solution metric from Actor 1. Since it has not yet received Actor 2's state, the master must compare Actor 1's metric with the previous state. It determines that Actor 1 has an inferior solution and thus sends the previous state back to Actor 1. Note that, because of the asynchronous behavior, Actor 1 was not able to receive the best solution at that point. This type of erroneous update is acceptable, since the actors can correct themselves in future iterations, and it also provides an opportunity to escape from local minima.
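The master's decision logic can be sketched in a few lines. A hypothetical Python sketch with invented message names; the reply carrying a stale state mirrors the erroneous-update scenario described above.

```python
class MMCMaster:
    """Sketch of the asynchronous MMC master: it only ever holds one
    best (cost, state) pair and never anneals itself."""
    def __init__(self):
        self.best_cost = float("inf")
        self.best_state = None

    def on_metric(self, cost):
        """Reply to an actor that finished a segment and sent only its metric."""
        if cost < self.best_cost:
            self.best_cost = cost
            return "SendMeYourState"  # requester keeps annealing with its own state
        return ("HereIsBestState", self.best_state)

    def on_state(self, cost, state):
        """A requested state arrives, possibly after later metrics."""
        if cost <= self.best_cost:
            self.best_state = state

master = MMCMaster()
reply2 = master.on_metric(50.0)        # Actor 2 reports the best cost so far
reply1 = master.on_metric(70.0)        # Actor 1 reports before Actor 2's state lands
master.on_state(50.0, "actor2-state")  # Actor 2's state arrives afterwards
```

Note that `reply1` carries whatever state the master held at that moment, which may lag the best known cost; this is exactly the tolerable erroneous update of the text.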

The execution time t_ser of a serial run of TimberWolfSC can be stated as I·t_m, where t_m is the time to perform a move and I is the number of moves attempted. In an asynchronous MMC implementation, the run time becomes

t_{par} = \frac{I}{N} t_m + t_c

where t_c is the communication time. Since the time to send the solution metric is small, t_c is dominated by the time to send the state. If t_c is relatively small, it is clear that the speedup for asynchronous

TABLE VIII ProperPLACE-MMC (SUN SPARCSERVER 1000E)

TABLE IX ProperPLACE-MMC (INTEL PARAGON)

MMC is perfectly linear, i.e., the speedup S = N. However, t_c is not necessarily small, and we must determine its effect. Comparing Tables I and VII yields 80 as an approximate ratio of state size to number of cells. Since the state is communicated at every one of TimberWolfSC's 160 iterations, t_c ≈ 12 800·C·t_b, where C is the number of cells and t_b is the per-byte communication cost. With I, the total number of moves, set to 4000·C·(C/500)^{1/3} for TimberWolfSC, the speedup can then be simplified to

S = \frac{t_{ser}}{t_{par}} = \frac{5 (C/500)^{1/3} t_m}{\frac{5}{N} (C/500)^{1/3} t_m + 16 t_b}.

On an Intel Paragon, t_m is on the order of 50 to 100 µs, while t_b is only 0.018 µs. Thus it can be seen that t_c is negligible, and the expected speedup should approach N.
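Plugging the text's constants into the simplified model confirms the claim numerically. A quick Python check; N, C, and t_m below are example values chosen within the stated ranges, not measurements.

```python
def mmc_speedup(N, C, t_m, t_b):
    """Simplified asynchronous-MMC speedup model from this section
    (times in seconds, C = number of cells)."""
    work = 5 * (C / 500) ** (1 / 3) * t_m
    return work / (work / N + 16 * t_b)

# Paragon-like numbers: t_m = 75 us (within 50-100 us), t_b = 0.018 us.
S = mmc_speedup(N=8, C=5000, t_m=75e-6, t_b=0.018e-6)
```

With these values S comes out just under 8, so the communication term costs only a fraction of a percent of the ideal speedup.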

C. Experimental Observations

We examined the effectiveness of the asynchronous MMC algorithm on the Sun SparcServer 1000E as well as on the Intel Paragon, as shown in Tables VIII and IX. We start at two processors because of the need for a master processor. The quality of the solutions shows no degradation as the number of processors increases; in fact, it sometimes shows improvement because of the periodic exchange of solutions. The algorithm scales much better than the parallel moves approach. As expected, communication time is not significant, since few messages are sent. Note that the speedup and quality are much better than can be achieved with a pure parallel moves strategy as presented in the previous section.

If speed improvement is not the goal, ProperPLACE-MMC can be used in another mode to provide better solutions for the same run time. Table X compares TimberWolfSC on one processor to ProperPLACE-MMC on four processors. The solution quality is improved by an average of 9% at a cost of only 3% in runtime. Achieving that type of quality improvement in TimberWolfSC would require significantly more computation time.

Sun and Sechen have used a modified parallel moves approach to achieve similar results [40]. Although they use parallel moves and partitioned circuits, solutions are exchanged among processors at each iteration, as with multiple Markov chains. They do not need to process frequent update messages.


TABLE X ProperPLACE-MMC (SUN SPARCSERVER 1000E)

Fig. 12. Speculative computation trees: (a) speculative computation tree, (b) biased reject tree, and (c) biased accept tree.

VII. ProperPLACE-SC: SPECULATIVE COMPUTATION

Another approach recently suggested for generalized parallel simulated annealing is speculative computation [9]. Witte et al. applied this approach to the task assignment problem and found speedups approaching log_2 P, where P is the number of processors. In this section, we apply the concept of speculative computation to cell placement and determine the applicability of such an approach. We first give a brief description of the algorithm.

A. Generalized Speculative Computation

A sequential simulated annealing schedule is simply a series of move proposals intended to reduce some cost function related to the particular problem. Each move consists of three parts: the proposal or perturbation, the evaluation, and the decision. Only after these three parts are completed is the next move started. Since the decision made by the next move is dependent on the current state as determined by prior moves, simulated annealing is almost inherently serial in nature. Consider the decision tree of moves in Fig. 12(a). The top node represents a move attempted in a simulated annealing process. There are two possible outcomes of this move: acceptance or rejection. Speculative computation assigns two different processors to speculatively work on the two possibilities before the parent move has completed. The rejection child can start at the same time as the parent, since it assumes that the state has not changed. After the parent has completed the move proposal, it can then relay the new state to the accept processor.

TABLE XI ProperPLACE-SC (SUN 4/690MP)

As the acceptance characteristics of the procedure vary, the shape of the tree can also change. For example, if the acceptance rate is high, it would make sense to generate a linear tree of only acceptance nodes; on the other hand, a very low acceptance rate would imply the creation of only rejection nodes [Fig. 12(b) and (c)]. This latter mode is essentially the mode in which the algorithms proposed in [3] and [43] operate.

B. Speculative Computation for Placement

Since speculative computation seems to be a promising avenue to achieve at least some speedup in the high-temperature region, we decided to investigate the applicability of such an approach to the cell placement problem. The problem fits naturally into an actor-based framework, in that each speculated move can be represented by an actor. One of the major changes we made to the algorithm was to add some asynchronous behavior by removing the need for a centralized root processor to start off each tree. Eliminating this synchronization point allows multiple speculative trees to be active at once. Indexing was used to properly order the execution of the trees.

Another major modification made to Witte's algorithm was to transfer just the move proposal to the accept child rather than the entire state after the move. That is, if the root node were to propose moving a cell to a new location, it would convey the cell number and new location to the accept child, and the child would be responsible for duplicating the move. As with multiple Markov chains, this decision was made because of the potentially large size of the cell state. Once a speculative move has been determined to be false, the actor responsible for the move must abort its move as well as all of its parents' moves. It must then update its cell database with the moves from the correct path.

After the modifications were made, the algorithm was run on a variety of circuits, as shown in Table XI. As can be seen from the table, the wirelengths are identical, as expected. The speedups, however, are disappointingly poor, as indicated by the drastic slowdowns. The primary reason for this behavior is the faultiness of Witte's model when applied to cell placement, particularly TimberWolfSC. Witte makes two main assumptions: first, that the time to perform and propose a move is small compared to the evaluation of the move, and second, that the outcome of that move, or the resultant state, is easily communicable. Consider the following optimistic analysis: for parallel speculative computation, the execution time is

t_{par} = \frac{I}{L} [(L - 1)(t_c + \alpha t_{sm}) + t_m]

where L is the expected length of the tree, t_c is the time required for communicating a proposal, and t_sm is the time required to perform a parent's move. The speedup is

S = \frac{L t_m}{(L - 1)(t_c + \alpha t_{sm}) + t_m}.

The term (L - 1)(t_c + α·t_sm) is the cost due to performing parents' moves. If this term is small, then the speedup is simply L, which in the extreme temperature regions is N and at worst is log N. However, our measurements have shown that this cost is not negligible. Our measurements show that the speculative move time


(t_sm), the time that a child node needs to perform its parent's move, is comparable to that of a nonspeculative move. The other component, α, is also very high due to the relatively high acceptance rate of TimberWolfSC. The combination of these factors leads to a very high overhead for parallel speculative computation. In light of these problems, it is clear that speculative computation, although presented as a generalized parallel simulated annealing algorithm, is not a feasible approach to parallelizing cell placement.
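The model's sensitivity to the parent-move cost can be checked numerically. A Python sketch of the speedup expression; the parameter values are illustrative, not the paper's measurements, and α denotes the factor scaling the cost of replaying a parent's move.

```python
def speculative_speedup(L, t_m, t_c, alpha, t_sm):
    """Speedup model for generalized speculative computation (sketch).
    L = expected tree length, t_m = move time, t_c = proposal transfer time,
    t_sm = time to replay a parent's move, alpha = its weighting factor."""
    return (L * t_m) / ((L - 1) * (t_c + alpha * t_sm) + t_m)

# Ideal case assumed by the model: free proposal transfer and replay.
ideal = speculative_speedup(L=4, t_m=1.0, t_c=0.0, alpha=0.0, t_sm=1.0)

# Placement reality: t_sm comparable to t_m and alpha high.
realistic = speculative_speedup(L=4, t_m=1.0, t_c=0.2, alpha=0.9, t_sm=1.0)
```

When the replay term vanishes the speedup is exactly L, but with t_sm ≈ t_m and a high α the same expression drops below one, i.e., a slowdown, matching the behavior observed in Table XI.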

VIII. CONCLUSIONS AND FUTURE WORK

We have investigated the applicability of three different parallel simulated annealing strategies to the problem of standard cell placement. The first strategy, parallel moves, based on the use of priorities and dynamic message sizing, delivered consistently good speedups with little degradation in wirelength. Multiple Markov chains appears to be a promising means of achieving moderate speedup without losing quality, and in some cases it even improves quality. Speculative computation, however, is shown to be inadequate as a means of parallelizing cell placement. A combination of the parallel moves approach with intermediate exchanges, as in multiple Markov chains, may offer benefits in terms of reducing the error present in the parallel moves approach alone.

With all of the parallel strategies, we have demonstrated that it is possible to design and implement a portable parallel placement tool that is cleanly and efficiently interfaced with a simulated annealing-based uniprocessor algorithm using the ProperCAD II environment. The parallel placement tool runs unchanged on a range of MIMD machines, including shared memory machines, distributed memory machines, and networks of workstations. We believe that this is the first parallel placement application to effectively exploit shared memory machines, nonshared memory machines, and networks of workstations in this manner. Another important feature of ProperPLACE is that future improvements to the uniprocessor algorithm will translate into improvements to the parallel performance of ProperPLACE with minimal effort; it is only necessary to keep the interface between the uniprocessor and parallel placement algorithms unchanged.

ACKNOWLEDGMENT

The authors thank Dr. C. Sechen for providing them with the source code to TimberWolfSC 6.0. They would also like to acknowledge the support of the San Diego Supercomputing Center for granting them access to the Intel Paragon.

REFERENCES

[1] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, May 1983.

[2] C. Sechen and A. Sangiovanni-Vincentelli, "The TimberWolf placement and routing package," IEEE J. Solid-State Circuits, vol. SSC-20, no. 2, pp. 510–522, Apr. 1985.

[3] S. A. Kravitz and R. A. Rutenbar, "Placement by simulated annealing on a multiprocessor," IEEE Trans. Computer-Aided Design, vol. CAD-6, pp. 534–549, July 1987.

[4] P. Banerjee, M. H. Jones, and J. S. Sargent, "Parallel simulated annealing algorithms for standard cell placement on hypercube multiprocessors," IEEE Trans. Parallel Distrib. Syst., vol. 1, pp. 91–106, Jan. 1990.

[5] J. S. Rose, W. M. Snelgrove, and Z. G. Vranesic, "Parallel cell placement algorithms with quality equivalent to simulated annealing," IEEE Trans. Computer-Aided Design, vol. 7, pp. 387–396, Mar. 1988.

[6] M. D. Durand, "Accuracy vs. speed in placement," IEEE Design Test Computers, pp. 8–34, June 1989.

[7] S.-Y. Lee and K.-G. Lee, "Asynchronous communication of multiple Markov chains in parallel simulated annealing," in Proc. Int. Conf. Parallel Processing, St. Charles, IL, Aug. 1992, pp. III:169–III:176.

[8] E. H. L. Aarts and J. H. M. Korst, "Boltzmann machines as a model for parallel annealing," Algorithmica, vol. 6, pp. 437–465, 1991.

[9] E. E. Witte, R. D. Chamberlain, and M. A. Franklin, "Parallel simulated annealing using speculative computation," IEEE Trans. Parallel Distrib. Syst., vol. 2, pp. 483–494, Oct. 1991.

[10] F. Darema, S. Kirkpatrick, and V. A. Norton, "Parallel algorithms for chip placement by simulated annealing," IBM J. Res. Develop., vol. 31, no. 3, pp. 391–402, May 1987.

[11] R. Jayaraman and R. A. Rutenbar, "Floorplanning by annealing on a hypercube multiprocessor," in Dig. Papers, Int. Conf. Computer-Aided Design, Santa Clara, CA, Nov. 1987, pp. 346–349.

[12] K. P. Belkhale and P. Banerjee, "Parallel algorithms for VLSI circuit extraction," IEEE Trans. Computer-Aided Design, vol. 10, pp. 604–618, May 1991.

[13] B. A. Tonkin, "Circuit extraction on a message passing multiprocessor," in Proc. Design Automation Conf., Orlando, FL, June 1990, pp. 260–265.

[14] S. J. Chandra and J. H. Patel, "Test generation in a parallel processing environment," in Proc. Int. Conf. Computer Design, Rye Brook, NY, Oct. 1988.

[15] H. Fujiwara and T. Inoue, "Optimal granularity of test generation in a distributed system," IEEE Trans. Computer-Aided Design, vol. 9, pp. 885–892, Aug. 1990.

[16] S. Patil and P. Banerjee, "A parallel branch and bound algorithm for test generation," IEEE Trans. Computer-Aided Design, vol. 9, pp. 313–322, Mar. 1990.

[17] B. Ramkumar and P. Banerjee, "ProperCAD: A portable object-oriented parallel environment for VLSI CAD," IEEE Trans. Computer-Aided Design, vol. 13, pp. 829–842, July 1994.

[18] W. Fenton, B. Ramkumar, V. A. Saletore, A. B. Sinha, and L. V. Kalé, "Supporting machine independent programming on diverse parallel architectures," in Proc. Int. Conf. Parallel Processing, St. Charles, IL, Aug. 1991.

[19] L. V. Kale, "The Chare Kernel parallel programming system," in Proc. Int. Conf. Parallel Processing, St. Charles, IL, Aug. 1990, pp. 17–25.

[20] B. Ramkumar and P. Banerjee, "ProperEXT: A portable parallel algorithm for VLSI circuit extraction," in Proc. Int. Parallel Processing Symp., 1993, pp. 434–438.

[21] ——, "Portable parallel test generation for sequential circuits," in Dig. Papers, Int. Conf. Computer-Aided Design, Santa Clara, CA, Nov. 1992, pp. 220–223.

[22] K. De, B. Ramkumar, and P. Banerjee, "A portable parallel algorithm for logic synthesis using transduction," IEEE Trans. Computer-Aided Design, vol. 13, pp. 566–580, May 1994.

[23] S. Kim, "Improved algorithms for cell placement and their parallel implementations," Ph.D. dissertation, Univ. Illinois at Urbana-Champaign, Tech. Rep. CRHC-93-18/UILU-ENG-93-2231, July 1993.

[24] S. Parkes, J. A. Chandy, and P. Banerjee, "A library-based approach to portable, parallel, object-oriented programming: Interface, implementation, and application," in Proc. Supercomputing '94, Washington, DC, Nov. 1994, pp. 69–78.

[25] S. M. Parkes, "A class library approach to concurrent object-oriented programming with applications to VLSI CAD," Ph.D. dissertation, Univ. Illinois at Urbana-Champaign, Tech. Rep. CRHC-94-20/UILU-ENG-94-2235, Sept. 1994.

[26] G. A. Agha, Actors: A Model of Concurrent Computation in Distributed Systems. Cambridge, MA: MIT Press, 1986.

[27] A. A. Chien, Concurrent Aggregates: Supporting Modularity in Massively Parallel Programs. Cambridge, MA: MIT Press, 1993.

[28] S. Parkes, P. Banerjee, and J. H. Patel, "ProperHITEC: A portable, parallel, object-oriented approach to sequential test generation," in Proc. Design Automation Conf., San Diego, CA, June 1994, pp. 717–721.

[29] ——, "A parallel algorithm for fault simulation based on PROOFS," in Proc. Int. Conf. Computer Design, Austin, TX, Oct. 1995.

[30] K. De, J. A. Chandy, S. Roy, S. Parkes, and P. Banerjee, "Portable parallel algorithms for logic synthesis using the MIS approach," in Proc. Int. Parallel Processing Symp., Santa Barbara, CA, Apr. 1995, pp. 579–585.

[31] V. Krishnaswamy and P. Banerjee, "Actor based parallel VHDL simulation using time warp," in Proc. 1996 Workshop Parallel Distributed Simulation, Philadelphia, PA, May 1996.

[32] C. Sechen, "VLSI placement and global routing using simulated annealing," in VLSI, Computer Architecture and Digital Signal Processing. Boston, MA: Kluwer, 1988.

[33] R. A. Rutenbar, "Simulated annealing algorithms: An overview," IEEE Circuits and Devices Mag., vol. 5, pp. 19–26, Jan. 1989.

[34] J. Lam and J.-M. Delosme, "Performance of a new annealing schedule," in Proc. Design Automation Conf., 1988, pp. 306–311.


[35] C. Sechen and K.-W. Lee, "An improved simulated annealing algorithm for row-based placement," in Dig. Papers, Int. Conf. Computer-Aided Design, Santa Clara, CA, Nov. 1987, pp. 478–481.

[36] E. H. L. Aarts, F. M. J. de Bont, E. H. A. Habers, and P. J. M. van Laarhoven, "Parallel implementations of the statistical cooling algorithm," Integr. VLSI J., vol. 4, pp. 209–238, Sept. 1986.

[37] A. Casotto, F. Romeo, and A. Sangiovanni-Vincentelli, "A parallel simulated annealing algorithm for the placement of macro-cells," IEEE Trans. Computer-Aided Design, vol. CAD-6, pp. 838–847, Sept. 1987.

[38] A. Casotto and A. Sangiovanni-Vincentelli, "Placement of standard cells using simulated annealing on the connection machine," in Dig. Papers, Int. Conf. Computer-Aided Design, Santa Clara, CA, Nov. 1987, pp. 350–353.

[39] C.-P. Wong and R.-D. Fiebrich, "Simulated annealing-based circuit placement algorithm on the connection machine system," in Proc. Int. Conf. Computer Design, Rye Brook, NY, Oct. 1987, pp. 78–82.

[40] W.-J. Sun and C. Sechen, "A loosely coupled parallel algorithm for standard cell placement," in Dig. Papers, Int. Conf. Computer-Aided Design, San Jose, CA, Nov. 1994, pp. 137–144.

[41] E. H. L. Aarts and P. J. M. van Laarhoven, "Simulated annealing: Theory and applications," in Mathematics and Its Applications. Boston, MA: Kluwer, 1987.

[42] K.-G. Lee and S.-Y. Lee, "Efficient parallelization of simulated annealing using multiple Markov chains: An application to graph partitioning," in Proc. Int. Conf. Parallel Processing, St. Charles, IL, Aug. 1992, pp. III:177–III:180.

[43] A. Sohn, "Parallel speculative computation of simulated annealing," in Proc. Int. Conf. Parallel Processing, St. Charles, IL, Aug. 1994, pp. III:8–III:11.

Performance-Driven Routing with Multiple Sources

Jason Cong and Patrick H. Madden

Abstract—Existing routing problems for delay minimization consider the connection of a single source node to a number of sink nodes, with the objective of minimizing the delay from the source to all sinks, or to a set of critical sinks. In this paper, we study the problem of routing nets with multiple sources, such as those found in signal busses. This new model assumes that each node in a net may be a source, a sink, or both. The objective is to optimize the routing topology to minimize the total weighted delay between all node pairs (or a subset of critical node pairs). We present a heuristic algorithm for the multiple-source performance-driven routing tree problem based on efficient construction of minimum-diameter minimum-cost Steiner trees. Experimental results on random nets with submicrometer CMOS IC and MCM technologies show an average of 12.6% and 21% reduction in the maximum interconnect delay, when compared with conventional minimum Steiner tree based topologies. Experimental results on multisource nets extracted from an Intel processor show as much as a 16.1% reduction in the maximum interconnect delay, when compared with conventional minimum Steiner tree based topologies.

Index Terms—Interconnections, interconnect topology optimization, layout, minimum diameter routing tree, multisource routing tree, rectilinear arborescence, Steiner tree.

Manuscript received May 8, 1995; revised April 29, 1996 and December 17, 1996. This work was supported in part by DARPA/ITO under Contract J-FBI-93-112, an NSF Young Investigator Award MIP9357582, and a grant from Intel Corporation. This paper was recommended by Associate Editor C.-K. Cheng.

The authors are with the Computer Science Department, University of California, Los Angeles, CA 90024-1596 USA.

Publisher Item Identifier S 0278-0070(97)05151-8.

I. INTRODUCTION

The competitive nature of the very large scale integration (VLSI) industry has created a strong demand for techniques to improve the performance of integrated circuits. Methods to increase speed, and to reduce area or power consumption, are of great interest.

Scaling of device dimensions has resulted in changes to many fundamental design goals: where previously the bulk of system delay had been generated by the switching times of devices, it is now common that the interconnecting wires between devices account for the dominant portion of the delay. These changes have created new areas in need of optimization, and new measures by which we gauge solution quality.

With smaller minimum feature size comes a reduction in transistor channel width and length, resulting in relatively constant transistor on-resistance; the reduction in wire width, on the other hand, results in higher unit wire resistance [2]. As a result, the resistance ratio [8], defined to be the driver resistance divided by the unit-length wire resistance, is reduced significantly. This shift produces a situation where the path between a driver and a sink can have resistance comparable to that of the transistor channel. Thus, changes to the interconnect length and topology can have a significant impact on delay. The result in [11] showed convincingly that interconnect topology optimization has a considerable effect on interconnect delay reduction when the resistance ratio is small.
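The effect of the shrinking resistance ratio can be seen in a first-order Elmore delay model (a standard textbook model, used here as an illustration rather than this paper's exact analysis; the numeric values below are made up). When driver resistance dominates, delay grows roughly linearly with wire length; when wire resistance dominates, the distributed r·c·L²/2 term takes over and delay grows nearly quadratically, so routing topology and path length matter far more:

```python
def elmore_delay(Rd, CL, r, c, L):
    """First-order Elmore delay of a driver with output resistance Rd
    driving a uniform wire of length L (per-unit-length resistance r
    and capacitance c) terminated by a sink load CL."""
    Cw = c * L                                   # total wire capacitance
    return Rd * (Cw + CL) + r * L * (Cw / 2.0 + CL)

# Large resistance ratio (older technology): doubling L roughly doubles delay.
d1 = elmore_delay(Rd=1000.0, CL=0.0, r=1.0, c=1.0, L=1.0)
d2 = elmore_delay(Rd=1000.0, CL=0.0, r=1.0, c=1.0, L=2.0)

# Small resistance ratio (scaled technology): delay is nearly quadratic in L.
s1 = elmore_delay(Rd=0.01, CL=0.0, r=1.0, c=1.0, L=1.0)
s2 = elmore_delay(Rd=0.01, CL=0.0, r=1.0, c=1.0, L=2.0)
```

In the first regime d2/d1 is close to 2, while in the second s2/s1 approaches 4, which is why topology optimization pays off precisely when the resistance ratio is small.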

A number of optimized interconnect topologies have been proposed, including bounded-radius bounded-cost trees [9], AHHK trees [1], LAST trees [21], maximum performance trees [7], A-trees [11], low-delay trees [5], and IDW/CFD trees [18]. These methods consider both the traditional concern of low total wire length and the path length or Elmore delay between the source node and the timing-critical sink nodes.

Although many of these methods effectively reduce the interconnect delay, all of them assume that there is a single source node driving one or more sink nodes, and they minimize the delay from the unique source to all sinks, or to a set of critical sinks.

In practice, many timing-critical nets may have multiple sources, each of them controlled by a tri-state gate and driving the net at a different time. Signal busses are instances of such nets. In these cases, the existing performance-driven routing algorithms for single-source nets may perform poorly, as a topology optimized for one source may result in high interconnect delay when some other source becomes active.

Fig. 1 presents a pair of four-node routing trees with the same wire length. The first routing tree, optimized for node p1, has relatively high delay when node p2 drives the net. The second routing tree provides a lower overall maximum delay when all four nodes might be sources or sinks. Delay times with respect to the driving nodes are shown in Table I.

Note that the second routing tree, which minimizes the maximum linear delay, does not fall entirely on the Hanan grid [16]. For the single-source model under Elmore delay, [4] showed that an optimal tree which minimizes the maximum delay to any sink may not be contained by the Hanan grid, but also observed that these cases were rare. For problems with multiple sources, a solution restricted to the Hanan grid may be far from the optimal solution, as shown in Fig. 1. Therefore, we cannot restrict our search for solutions to this grid.
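Under a linear delay model, the multiple-source objective amounts to controlling the path length between every pair of nodes, i.e., the tree diameter. A small sketch of evaluating that worst case (the node names and edge lengths below are made up for illustration and are not the trees of Fig. 1):

```python
def path_lengths(tree, src):
    """Distances from src to every node of an edge-weighted tree,
    given as {node: [(neighbor, length), ...]}."""
    dist = {src: 0.0}
    stack = [src]
    while stack:
        u = stack.pop()
        for v, w in tree[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

def max_linear_delay(tree):
    """Worst-case source-to-sink path length over all node pairs:
    since any node may drive the net, this is the tree diameter."""
    return max(max(path_lengths(tree, s).values()) for s in tree)

# Two topologies with equal total wire length (3 units each):
chain = {'p1': [('p2', 1)], 'p2': [('p1', 1), ('p3', 1)],
         'p3': [('p2', 1), ('p4', 1)], 'p4': [('p3', 1)]}
star  = {'p2': [('p1', 1), ('p3', 1), ('p4', 1)],
         'p1': [('p2', 1)], 'p3': [('p2', 1)], 'p4': [('p2', 1)]}
```

Here the chain has diameter 3 while the star has diameter 2, so for the same wire cost the star is the better multisource topology, mirroring the trade-off illustrated in Fig. 1.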

In this paper, we study the problem of routing nets with multiple sources. This new model assumes that each node in a net may be a source, a sink, or both. The objective is to optimize the routing
