Journal of VLSI Signal Processing, 8, 131-150 (1994). © 1994 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

A CAD Environment for Real-time DSP Implementations on Multiprocessors

PHU HOANG AND JAN RABAEY Department of Electrical Engineering, University of California, Berkeley

Abstract. A software environment designed to support the real-time implementation of Digital Signal Processing (DSP) applications onto multiple programmable processors is described. The system, called McDAS, allows a designer to program his application as he would on a single processor, using a high level signal-flow DSP language. The program is then automatically scheduled and compiled onto a target multiprocessor. The environment allows the scheduler to be invoked with different numbers of processors and multiprocessor topology to explore various implementations. McDAS maximizes the computational throughput by exploiting pipelining, retiming, and parallel execution under the architectural constraints. The code generator is retargetable to different multiprocessor architectures as well as core processors. Data buffers and synchronizations are automatically inserted to ensure correct execution. The final implementation can be used for simulation speedup or real-time processing.

1 Introduction

In recent years, a significant advance in the computing power of programmable digital signal processors has been observed. New advances in architecture and technology have enabled DSP processors to achieve throughputs up to 50 MFLOPS [1]. Their high-speed performance, programmability, and low cost have already made them the ideal medium in a number of real-time applications such as speech, music, and image processing. Unfortunately, we have concurrently experienced an even greater increase in the computational requirements of DSP applications. For instance, a computation rate of 1 GOPS or more is typical of High Definition Television (HDTV) [2]. In addition, the applications themselves are becoming increasingly complex, utilizing nested loop structures or multi-dimensional vector computations. Examples of this can be found in Code-Excited Linear Prediction coders [3], CCITT standard visual telephony [4], and JPEG and MPEG image compression [5]. Currently, the only means to meet the high throughput demands of these applications is with special-purpose hardware, which can be quite expensive and time consuming to build at the prototyping stage.

Given the success of the DSP processor, a natural progression to obtain greater computational power while maintaining programmability is to employ multiple DSP processors working in parallel. The major obstacle to the prevalent use of these multiprocessor systems, however, has been the absence of adequate CAD tools to automatically partition the program onto the multiple processors, coordinate their communications and synchronizations, and synthesize the required code. For a DSP multiprocessor design environment to be useful, it must first address the following issues. The input language, whether graphical or textual, should be powerful enough to support complex operations such as infinite iterations, matrix manipulations, sample-rate conversions, and wordlength overflow. The scheduling tool should exploit all types of concurrency and all levels of granularity to achieve good speedups across a wide range of applications. The environment should be able to cope with a variety of multiprocessor architectures with constraints on processor count and memory capacity. This is extremely useful if the target multiprocessor system is reconfigurable and scalable, or if there are many different target architectures available. Finally, once a schedule is found, the generated code should support

132 Hoang and Rabaey

functional simulation and real-time implementation. For real-time applications, the code must be fast and efficient, fully utilizing the capability of the processor. For simulation, the code should support quasi-infinite precision as well as bit-true data types to allow designers to monitor quantization and truncation effects. McDAS was designed with these features in mind to provide the user the maximum design flexibility with the minimum implementation effort.

The remainder of the paper is composed of 8 sections. Section 2 discusses the previous work on multiprocessor DSP design environments and multiprocessor scheduling. Section 3 describes the overall composition of the McDAS system. Sections 4, 5, 6, and 7 describe respectively the Silage-to-flowgraph translation, the strategy for estimating computation costs and communication delays, the scheduling algorithm, and the code generation strategy. Benchmark results are presented in Section 8. We conclude the paper with a discussion of future research.

2 Previous Work

Most multiprocessor DSP design environments currently available are block-diagram based, where tasks are represented as blocks connected by edges representing data flow. Examples of these systems include the Gabriel system, a design environment based on synchronous data flow principles [6], the Signal Processing Worksystem from Comdisco, and the block-diagram compiler from Lincoln Labs [7]. The blocks can have arbitrary granularity and may be defined in a standard library or by the user. Associated with each block are its simulation code and real-time code. The simulation code is usually implemented in a high-level language with emphasis on portability and user interface, whereas the real-time code is more concerned with exploiting hardware features for good performance. Since the functionality of these blocks is known, hand-optimized code can be written for each block. Users specify their DSP applications by connecting these library modules together using a graphical schematic editor. After the block diagram is created, the application

is scheduled onto the target multiprocessor using various scheduling techniques. The final code is generated by concatenating the code associated with each block into programs.

Other multiprocessor systems, including McDAS, allow users to specify their descriptions textually [8]. Textual descriptions are not as illustrative as graphical descriptions, but allow for more flexibility and cleaner specification of hierarchical structures such as iterations and recursion. The textual program is usually translated into a flowgraph description, where parallelism is exposed explicitly. After scheduling, code is generated using standard compiler code generation techniques.

The major difference among these systems lies in the multiprocessor scheduling strategy. Over the years, there has been a massive effort to study multiprocessor scheduling in the computer science and operations research communities [9]. These scheduling algorithms exploit spatial concurrency (parallelism) between tasks to minimize the completion time of a job on a given multiprocessor. The prevailing strategies used are based on list scheduling [10], clustering [8], branch and bound [11], and simulated annealing [12]. When the application is real-time signal processing, multiprocessor scheduling is usually performed at compile time to reduce run-time scheduling overheads. In addition, signal processing dictates that the same computation be applied to every data sample in the stream, yielding a periodic schedule. One technique for scheduling DSP applications onto multiprocessors involves scheduling a single execution of the program, and repeating the schedule for each sample. This was the approach taken by a number of systems including the Gabriel system, which uses a modified list scheduling algorithm that considers communication delays [13], the Lincoln Labs compiler [7], which uses a simulated annealing scheduling algorithm, and the ZC compiler [14] for the Warp machine [15], which uses a priority-based algorithm that concentrates on tasks with massive data parallelism.

These schedulers have had only limited success in DSP applications because they exploit only spatial concurrency to obtain their speedup. The infinite time loop in DSP applications


Fig. 1. McDAS system overview.

also introduces temporal concurrency (pipelining), however, something classical scheduling algorithms did not assume. A number of DSP scheduling algorithms which exploit both types of concurrency include cyclo-static scheduling [16], Skewed Single Instruction Multiple Data (SSIMD) [17], Fully Static Rate Optimal [18], and Generalized Parallel SSIMD [19]. All derive their schedules using optimal solutions as bounds. Unfortunately, the exponential running time of these techniques precludes a practical implementation for large applications. Their formulations, however, provide characterizations of optimal bounds which are quite useful for limiting the search space and evaluating heuristic implementations. Bokhari presented an algorithm for pipelining serial tasks on a linear array of processors to maximize throughput [20]. This approach considers processor and memory constraints and requires little interprocessor communication support, but exploits only temporal concurrency. Although this technique is quite limited, it comes closest to our stated objectives, and is a basis for the scheduling approach adopted in McDAS.

Finally, only a few scheduling algorithms address the need to consider the granularity of operations [8], [14], [21]. None of the techniques, however, considers decomposing

a node into smaller nodes when it is a scheduling bottleneck. This is a serious drawback, as the granularity of the nodes (and not the performance of the scheduling algorithm) is often the constraining factor towards reaching a high-quality solution. The scheduler in McDAS is able to exploit all types of concurrency at all levels of granularity in its search, enabling near-optimal speedups across a wide range of different applications.

3 System Overview

An overview of the McDAS environment is shown in Figure 1.

The system is composed of six modules operating on a centralized flowgraph database. The input is described using Silage, a signal-flow language developed especially for DSP specification [22]. The modules interact with each other by reading and/or writing to the flowgraph database. Accessing the database as well as performing common operations on it are facilitated by the Flowgraph Library, which is linked into all tools. This modularity of design and extensive library support allows new modules to be incorporated easily.

The compiler environment is retargetable to different multiprocessor implementations. This


Fig. 2. McDAS graphical user interface.

is accomplished with the aid of an architecture database, which is linked to both the Flowgraph Scheduler and Code Generator. A set of graphic tools allows for the displaying of the essential statistics of an implementation on a given architecture. In particular, processor utilizations, bus congestions, and memory requirements can be plotted to facilitate user interaction and feedback.

To implement a design, the user begins with a Silage description of his application. The description is translated into a flowgraph description, and a number of architecture-independent flowgraph transformations are automatically applied to remove any dead or redundant operations. To schedule the flowgraph, the user enters the number of processors, chooses a desired architecture topology from the database, and invokes the scheduler. Once finished, the

scheduler decorates the flowgraph with the processor assignments and scheduling order. The decorated flowgraph is now called the "scheduled flowgraph". The scheduling results showing the speedup, load balancing, processor assignments, and communication costs can be displayed at this time. If not satisfied with the result, the user can select a new architecture and re-schedule. A history tool maintains a log of all past design versions, allowing easy backtracking of the design process. Once an acceptable solution is found, the code generator is invoked.

The Graphical Command module provides a graphical interface between the user and McDAS. Here, all commands to select architectures, invoke tools, and display results are available to the user via menus and buttons. A sample screen is shown in Figure 2. In the remainder of this section, we will discuss in greater


detail the input language Silage, the flowgraph structure, and the architectural description.

3.1 Silage

A good specification language must provide to the user the primitives of the application domain. In DSP, system designers are very comfortable with a graphical representation of DSP algorithms. The semantics implied is known as data-flow semantics, in which the emphasis is on the paths followed by results of a computation, rather than the sequence of imperative operations performed on memory locations (the control-flow semantics). The data-flow representation is not only intuitive to the designer, but also explicitly exposes potential parallelism. This explains the popularity of block-diagram interfaces in a number of DSP design environments discussed previously. While this format is useful for describing traditional signal processing algorithms such as filters and modulators, signal processing algorithms have grown considerably more complex, requiring more powerful constructs such as iterations and conditionals. The incorporation of these control-flow operations into pure data-flow graphical representations has been very awkward. Textual languages such as Silage allow these constructs to be incorporated into a data-flow environment in a natural and concise manner.

Silage is an applicative language which captures the data-flow concept in a linear, textual form. An example of a Silage program is shown in Figure 3. All variables or signals in Silage denote infinite streams of values. These signals can have integer, floating-point, or fixed-point data types. fix<32,10> means a word length of 32 bits, with 10 bits of binary fraction. This allows the user to express precise requirements on signal accuracy, and monitor the effects of quantization and truncation. A past sample of a signal can be accessed using the Silage primitive operation '@'. For example, X@2 represents the value of signal X, 2 samples earlier. The signals can have different sampling rates and can be synchronous or asynchronous.

A Silage program consists of an unordered sequence of definitions of signals and of functions

#define word fix<32,10>
#define Coef11 0.015437

func IIR (In: word) Out: word =
begin
    Tmp1 = Biquad(In, Coef11, Coef12, Coef13, Coef14);
    Tmp2 = Biquad(Tmp1, Coef21, Coef22, Coef23, Coef24);
    Out = FirstOrder(Tmp2, Coef31, Coef32);
end;

func Biquad (in, a1, a2, b1, b2 : word) out: word =
begin
    state@@1 = 0.0; state@@2 = 0.0;
    state = in - (a1 * state@1) + (a2 * state@2);
    out = state + (b1 * state@1) + (b2 * state@2);
end;

func FirstOrder (in, a1, b1) out: word =
begin
    state@@1 = 0.0;
    state = in - (a1 * state@1);
    out = state + (b1 * state@1);
end;

Fig. 3. Silage description of a 5th order IIR.

which are applied to these signals. A signal can only be defined once, making statements like X = X + 1 illegal. This single-assignment semantic allows for a simple translation to a flowgraph format. To handle repetitive and conditional operations, Silage supports iterations and conditional expressions, as shown in Figure 4. Other important features of Silage include vector operations and operations to switch sample rates and multiplex data streams. Although relatively new, Silage has proven its effectiveness. A number of DSP design environments have already been built around this language, from compiler systems [23] to high-level synthesis systems for custom ICs [24] and field-programmable datapaths [25].

3.2 Flowgraph Definition

The central database with which all tools interact is represented as a hierarchical data/control flow graph (CDFG). The nodes in the CDFG represent data operations, while the data edges represent the flow of data from the source node to the destination node. In addition, control edges can be used to enforce dependency constraints between independent nodes. Aside from the


(i: 1 .. 10) ::
begin
    C[i] = A[i] + B[i];
    D[i] = C[i] * A[i];
end;

Finite Iteration

do
    i@@1 = 0.0; Acc@@1 = 0;
    i = i@1 + 1;
    Acc = Acc@1 + A[i] * B[i];
    out = exit (i > length) -> Acc;
od;

Infinite Iteration

out = if (n == 0) -> 0.0
      || (n == 1) -> A[n]
      || (n > 1)  -> A[n] * B[n]
      fi;

Conditional

Fig. 4. Silage description of iterations and conditionals.

Fig. 5. Flowgraph library organization.

standard arithmetic operations, the CDFG allows a number of macro control-flow operations such as iterations and conditionals. With these constructs, we obtain a hierarchical graph whose subgraphs represent bodies of these iterations and conditionals. The subgraph contracts to a single node at the next level. This hierarchical representation has the advantages of compactness and descriptiveness, as it dramatically reduces the number of effective nodes as compared to a flattened flow graph. Furthermore, the macro control-flow of the algorithm is retained, preserving the structural hints from the designer which may lead to more efficient code generation.

A flowgraph library was developed to facilitate the integration of new tools into the McDAS environment. The library has 6 sub-libraries, as shown in Figure 5.

The Input & Output library contains routines to read in and dump out CDFG text files. The Structure & Interconnect library contains routines to create, delete, copy, modify, connect, and disconnect flowgraph objects such as graphs, nodes, and edges. The Semantic Analysis library

performs semantic checking, such as deducing and enforcing the edge types. The Memory Management library efficiently manages the allocation and de-allocation of memory used in creating flowgraph objects. The Sorting & Leveling library contains standard graph traversal routines such as depth- and breadth-first search, topological ordering, leveling, and critical path analysis. Finally, the Transformation library contains standard compiler optimization routines such as manifest expression removal, common subexpression elimination, and dead-code elimination. These techniques can be found in the standard compiler literature [26].

3.2.1 Flowgraph Model. This section describes the semantics of the flowgraph representation as used in McDAS. The semantics are based on the Silage language, inheriting its DSP features. Besides describing the primitive operators, we will also discuss the representation of hierarchical constructs such as conditionals and iterations.

Primitive Operators and Function Calls. Each primitive operation in Silage has a corresponding



Fig. 6. CDFG description of 5th order IIR.

a) Finite Iteration    b) Infinite Iteration    c) Conditional

Fig. 7. CDFG representation of iterations and conditionals.

primitive node in the CDFG. A function call is represented as a hierarchical node, whose subgraph points to the function body. The interpretation of these constructs is straightforward. Figure 6 shows the flowgraph representation of the 5th order IIR Silage description presented earlier.

Iterators and Conditionals. Control operators such as iterators and conditionals are hard to represent in a flowgraph. We propose an elegant scheme, based on hierarchy: an iteration or conditional is represented as a single node at the invocation level. This node is an instance of a subgraph representing the body of the iteration or conditional. Information about the iteration, such as the index and the bounds, is passed as parameters to the hierarchical node. The flowgraph representations of both the finite and infinite iteration are shown in Figure 7a

and Figure 7b. There are two ways to represent the conditional operator. The first way uses the standard data-flow approach, using a multiplexer to select between two results. This representation requires both cases to be evaluated. The CDFG also allows a control conditional statement representation, where only one of the cases will be executed at run-time. The flowgraph for this construct is shown in Figure 7c. Notice that the signal c has two definitions! This still satisfies the single-assignment rule since the two definitions are mutually exclusive.

3.3 Architecture Database

The architecture database contains a description of the different types of possible architectures that McDAS will map to. It interacts with the scheduler and code generator to customize the


compilation process to a particular target machine. Each description contains the following information:

1. Computation time and memory requirement of each primitive instruction of the processor, such as multiply, add, shift, etc.

2. Computation time overhead and memory requirement overhead of control constructs such as function calls, loop jumps and tests, etc.

3. Instruction set of the processor.

4. Program and data memory size for each processor.

5. Distance between processors in terms of bus hops, as well as the data routing paths.

6. Time required to send one unit of data across 1 bus hop.

7. The memory layout of the architecture.

8. The interprocessor communication and synchronization protocols.

The first four items characterize the processor, while the last four items describe the multiprocessor configuration. From this data, it is possible for the scheduler to estimate computation times and memory requirements of nodes as well as communication delays of data transfers. The code generator also uses the memory layout and the communication protocols to generate code for interprocessor communication and synchronization. Currently, an architecture description has components which can be modified on the fly from the McDAS command window (such as the number of available processors) and components which are compiled along with the scheduling and code generation front-ends (such as header files defining instruction costs, routines to calculate communication delays for each topology, and routines to emit code for a specific processor). Users who want to build their own architecture description can modify the header files and routines to reflect their architecture. The customized header files and routines are then compiled with the generic scheduling or code generation front-end to yield customized schedulers and code generators. These are then available as options to the user from the McDAS command window. While it is possible

to standardize these interface routines so that a textual description of the instruction set and interconnect topology can be read in on the fly, it is doubtful whether a generic code generation strategy is adequate to generate code efficient enough for real-time implementation.

4 Silage to Flowgraph

The Silage To Flowgraph module translates a Silage program into a CDFG. The front-end parser was derived from a Silage compiler developed in [27]. After the flowgraph is generated, a number of compiler optimizing transformations are automatically applied to the flowgraph to eliminate redundant operations. A detailed description of the module can be found in [28].

If the application contains multirate operations such as upsampling or downsampling, an important step in the translation process is the conversion of these multirate operations into a single-rate realization. This representation simplifies the scheduling and code generation algorithms, as all operations to be executed are explicitly represented. The rate conversion is done by clustering operations running at the same rate together into a process. The slowest running process is taken as the sample rate, and the higher-rate processes are invoked the required number of times using iterations. Since these rates may not be integral multiples of the slowest rate, a greatest common divisor algorithm is used to obtain the minimum invocations necessary. This can greatly reduce the size of the data buffers which are needed between these communicating processes. An example of converting an upsampling operation is shown in Figure 8. Upsampling involves increasing the sampling rate of the process by padding 0's to the incoming stream. The upsampling operator has two parameters: scale and phase. The scale parameter gives the upsampling factor, while the phase determines the timing offset. In Figure 8, three samples of signal c are produced for each sample of signal b. As a result, the computation after the rate change is repeated 3 times, and is realized with the aid of an iteration construct. The resultant description can now be


Fig. 8. Multirate transformation.

Fig. 9. An example program structure (pipeline stages 0 through 4).

considered as a single-rate program. When optimizing transformations are applied, redundant memory storage can be removed. As an example, the result of a * K1 in Figure 8 can be stored directly into c[2], eliminating the variable b.

5 Model of Computation & Estimation

Real-time DSP implementations on multiprocessors require static scheduling to reduce run-time overhead. Every static scheduling strategy is based on a computation model, which organizes how tasks are executed and how communications are conducted. The computation model in turn makes a number of assumptions on the target architecture. In addition, a static schedule is only good if accurate estimations of the computation, memory, and communication costs are available. In this section, the computation model and the estimation strategy for the McDAS system are presented.

5.1 Execution Model

The execution of a scheduled CDFG involves a number of autonomous sub-programs running simultaneously, with some executing in parallel, and some in a pipelined fashion. An example program structure is shown in Figure 9. Nodes represent processes and edges represent communication.

Each sub-program is a set of operations to be executed serially in some predefined order


Sub-Program1 (Processor 1):

main () {
    while (1) {
        Global_sync();
        Sub-Program1();
    }
}

Sub-Program2 (Processor 2):

main () {
    while (1) {
        Global_sync();
        Sub-Program2();
    }
}

Sub-Program3 (Processor 3):

main () {
    while (1) {
        Global_sync();
        Sub-Program3();
    }
}

Fig. 10. Pseudo-code for sub-programs.

on one processor. Thus there is a one-to-one mapping of sub-programs to processors. The number of sub-programs at any pipeline stage, and the number of pipeline stages, depend on the types of concurrency available in the application. Each processor executes its assigned code once each sample period, consuming one frame of data from each of its inputs and producing one frame of data for each of its outputs. Since data comes in an infinite stream, each processor executes repetitively in an infinite loop, synchronizing at the beginning of each sample. The pseudo-code for each processor is shown in Figure 10. The global synchronization serves to ensure correct data transfer from one pipeline stage to the next.

Edges between the sub-programs are called buffer edges, and represent interprocessor communications. These communications are not restricted to adjacent pipeline stages, but may occur between sub-programs within the same pipeline stage, or between non-adjacent stages (Figure 9). In addition, communication is not confined to the beginning and end of a sub-program, but may occur at any point within a sub-program. Communications between processors in the same pipeline stage require additional synchronizations, as will be explained later.

To support the computation model above, the target multiprocessor architecture must possess the following features. Firstly, it must be MIMD (multiple instruction, multiple data); that is, each processor must have its own program memory and sequencer. Secondly, each processor must be able to synchronize and communicate

with any other processor in the system, although point-to-point communication is not necessary. The communication mechanism can be message passing or through shared memory.

5.2 Computation and Memory Estimation

Let us represent the hierarchical graph as G = (N, E), where N is the set of nodes and E is the set of edges in the top-level hierarchy of the graph. N can be divided into three sets: the set of primitive nodes N_P, the set of function call nodes N_F, and the set of iteration nodes N_I. Function call nodes and iteration nodes are hierarchical nodes. Let SG(.): N_F ∪ N_I → G denote the function which returns the underlying subgraph of a hierarchical node, and let N(.): G → N be the function which returns the set of nodes of a graph.

Estimating Computation Time. To model the computation costs, sample programs were profiled. From this, a cost, or weight, w was assigned to each primitive node such as addition and multiplication. Costs were also derived for the overhead w_o of performing loop increments, loop tests, and function calls. With these values, the cost of every node in the hierarchy of any flow graph can be estimated by traversing the flow graph bottom up, accumulating computation times of primitive nodes into subgraphs, and so on up to the root graph.

Definition 1. For every node n ∈ N, the computation time w(n) is given by the following three rules:


Table 1. Computation estimation on the SMART multiprocessor.

Example     # Operations   Estimated (cyc)   Measured (cyc)   % Error
7th-IIR     55             6,497             6,414            +1.20%
8pt-DCT     87             2,388             2,511            -4.89%
Cordic      494            135,098           137,924          -2.04%
2 Norm      1,926          109,812           114,099          -3.75%
Histogram   30,687         4,186,747         4,227,738        -0.97%

Table 2. Computation estimation on the Sequent multiprocessor.

Example     # Operations   Estimated (ms)   Measured (ms)   % Error
8pt-DCT     87             2,882            2,790           +3.20%
Cordic      494            11,551           11,668          -1.00%
2 Norm      1,926          59,259           60,997          -2.85%
Histogram   30,687         248,291          237,667         +4.27%
256pt-DFT   760,330        7,958,590        7,944,570       +0.17%

1. If n ∈ N_P, w(n) = predefined cost, based on benchmark results.

2. If n ∈ N_F, w(n) = w_o(n) + Σ_{v ∈ N(SG(n))} w(v), where w_o(n) represents the overhead of the function call, as analyzed in the benchmarks.

3. If n ∈ N_I, w(n) = w_o(n, L) + L · Σ_{v ∈ N(SG(n))} w(v), where w_o(n, L) represents the overhead of the loop, as analyzed in the benchmarks, and L is the iteration count.
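As a concrete illustration, the three rules above amount to a recursive, bottom-up traversal of the hierarchy. The node encoding, weights, and overheads in the sketch below are invented for the example; actual costs come from the benchmark profiles described earlier.

```python
# Bottom-up computation-time estimation over a hierarchical flow graph.
# Node kinds mirror Definition 1: "prim" (N_P), "func" (N_F), "iter" (N_I).
# All costs and overheads here are illustrative, not benchmark values.

def w(node):
    """Estimated computation time of a node (Definition 1)."""
    kind = node["kind"]
    if kind == "prim":                       # Rule 1: predefined cost
        return node["cost"]
    body = sum(w(v) for v in node["subgraph"])
    if kind == "func":                       # Rule 2: call overhead + body
        return node["overhead"] + body
    if kind == "iter":                       # Rule 3: loop overhead + L * body
        return node["overhead"] + node["count"] * body
    raise ValueError("unknown node kind: %r" % kind)

add  = {"kind": "prim", "cost": 1}
mult = {"kind": "prim", "cost": 2}
loop = {"kind": "iter", "count": 10, "overhead": 4, "subgraph": [add, mult]}
root = {"kind": "func", "overhead": 3, "subgraph": [loop, add]}
# w(loop) = 4 + 10*(1+2) = 34, and w(root) = 3 + 34 + 1 = 38
```

A single pass over the graph therefore prices every hierarchical node once its children are priced, which is exactly the bottom-up accumulation described above.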

The estimation strategy has been applied to the two multiprocessor systems currently supported by McDAS. The first is the SMART (Switchable Multiprocessor Architecture supporting Real-Time applications) [29] system, composed of 10 AT&T DSP32C DSP processors connected by a configurable bus. The second system is the Sequent machine [30], a shared-bus multiprocessor containing 14 Intel 386 processors. Table 1 shows the estimated and actual completion times of a number of DSP applications on the DSP32C; each cycle corresponds to 20 ns. Table 2 shows the computation time estimates on the Intel 386. The errors on both benchmarks show that the estimated computation time is accurate to within 5% of the actual time.

Estimating Memory Requirements. A communication between two processors incurs memory storage at the destination processor to buffer the data. It is possible to keep track of buffer memory usage during scheduling so that solutions which violate the memory limit can be discarded. An edge between two nodes assigned to different processors represents an interprocessor communication. Each communication is supported by a FIFO in our compilation strategy (Section 7). The buffer memory requirement of a node n on a processor p, denoted bm(n, p), depends on the size of the data on the edge and the difference between the source and destination pipeline stages. The parameter bm(n, p) is used by the scheduler to prohibit a node from being assigned to a processor if executing the node would overflow the processor's buffer memory. To do this, each processor p also has to keep track of its remaining buffer memory during scheduling. This parameter is denoted bm_avail(p). Section 6 will describe how bm(n, p) and bm_avail(p) are used by the scheduling algorithm.

There are a number of other memory parameters which can be estimated. These include


program memory, memory for static and global variables, and stack memory. While not yet implemented, the same technique used in computation estimation should be extendable to estimate memory usage as well.

5.3 Communication Delay Estimation

The communication costs depend on the amount of data being sent and the distance between the source and destination processors. When a node is scheduled on a processor, the data transfers needed to bring the input variables to the processor (if the data is non-local) are also scheduled on the appropriate bus or buses. This is done by building time-slots for the proposed data transfers and merging them onto the time-slots already scheduled on the buses. If a proposed time-slot conflicts with an existing time-slot, it is scheduled in the next available time-slot. This allows the scheduler to take bus congestion into account when calculating the arrival time of the data. More details on the time-slot model can be found in [31].
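A minimal sketch of this conflict-resolution step, assuming slots are half-open (start, end) intervals and ignoring the details of the actual time-slot model in [31]:

```python
# Place a proposed bus transfer at or after its requested start time, sliding
# it past any existing slots it conflicts with. All numbers are illustrative.

def schedule_transfer(bus_slots, start, duration):
    """Append a (start, end) slot for the transfer and return it."""
    t = start
    for s, e in sorted(bus_slots):
        if t + duration <= s:        # fits entirely in the gap before this slot
            break
        t = max(t, e)                # conflict: push past the occupied slot
    slot = (t, t + duration)
    bus_slots.append(slot)
    return slot

bus = [(0, 4), (6, 9)]              # transfers already scheduled on the bus
# A 3-unit transfer requested at time 2 is pushed past both slots to (9, 12),
# so bus congestion directly delays the data's arrival time.
```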

Definition 2. The Arrival time t_arv(n_s, p) denotes the time at which data computed in a source node n_s is available at processor p. We assume the node n_s is already scheduled on a source processor and the data transfer(s) are scheduled on the appropriate bus or buses. t_arv(n_s, p) gives the time the last data package arrives at p, given the state of the bus congestion at that moment in time.

Definition 3. The Available time t_avail(n, p) is the time at which all input data to node n is available at processor p. It is calculated over the set I(n) of all input nodes of n as:

t_avail(n, p) = max{ t_arv(n_i, p) | n_i ∈ I(n) }    (1)

Definition 4. The Ready time t_ready(p) is the time at which processor p has finished executing its last assigned node.

For a node n to start on a processor p, all of its input data must be available at processor p, and the processor must have completed any previously assigned computation.

Definition 5. The Earliest Starting time t_es(n, p) of node n on processor p is defined as:

t_es(n, p) = max{ t_avail(n, p), t_ready(p) }    (2)

t_es(n, p) effectively abstracts the underlying architecture to the level of the starting times of nodes on processors. The scheduling algorithm is only concerned with this information to make its decisions, irrespective of the architecture. As a result, t_es(n, p) serves as the interface between the architecture-dependent estimations and the scheduling algorithm. This modularity allows the scheduler to deal with any architecture in a unified manner; each architecture only has to provide its own calculation of t_es(n, p).
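Equations (1) and (2) compose directly. A small sketch, with arrival and ready times invented for the example (in McDAS they would come from the time-slot model):

```python
# Earliest starting time per Definitions 3-5. The arrival times t_arv would
# come from the bus time-slot model; here they are simply tabulated by hand.

def t_avail(node_inputs, t_arv, p):
    """Eq. (1): time at which all input data of a node has arrived at p."""
    return max(t_arv[(ni, p)] for ni in node_inputs)

def t_es(node_inputs, t_arv, t_ready, p):
    """Eq. (2): a node may start once its data is available and p is free."""
    return max(t_avail(node_inputs, t_arv, p), t_ready[p])

t_arv   = {("n1", "p0"): 12, ("n2", "p0"): 20}   # hypothetical arrival times
t_ready = {"p0": 15}                              # p0 is busy until time 15
# For a node with inputs n1, n2 on p0: t_es = max(max(12, 20), 15) = 20
```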

6 Flowgraph Scheduler

The Flowgraph Scheduler module is the most interesting module of McDAS. It can run in two modes: scheduling for fixed throughput, and scheduling for maximum throughput. The first mode is used in a real-time implementation, where the throughput is dictated by the sampling rate of the application. The second mode is appropriate for speeding up simulation. The input to the scheduler is a CDFG and a target architectural description, including the number of processors P. The key feature of the scheduler is its ability to simultaneously consider pipelining, retiming, and parallel execution in its search. Furthermore, it can traverse the different hierarchies of the CDFG, allowing concurrency exploitation to take place at a granularity level suited to the available hardware resources. Both scheduling formulations fall in the class of NP-complete problems [32]. For this reason, algorithms which obtain optimal solutions are discarded in favor of heuristics which obtain a fairly good suboptimal solution in a reasonable time. We will first describe the scheduling-for-fixed-throughput algorithm, and then extend it to schedule for maximum throughput.

6.1 Scheduling for Fixed Throughput

Definition 6. The stagetime T is defined as the reciprocal of the throughput of the system.


[Figure: an acyclic graph partitioned into pipeline Stages 0-3.]

Fig. 11. Scheduling given stagetime T = 10.

The stagetime equals the time allocated to each pipeline stage in the system, and thus to each processor in that stage. It is possible to derive some bounds on the stagetime. If W_total is the computation time of the entire graph, an upper bound T_ub on the stagetime is W_total, and a lower bound T_lb is W_total/P. The user-specified stagetime can be checked against T_lb at the start of scheduling to see if it is achievable. Furthermore, during scheduling, the following condition must always be satisfied:

Maximum Granularity Condition. The computation time W_max of the largest node in the graph (at a given level of granularity) must satisfy W_max ≤ T.

The reason for this is clear: a node with computation time greater than T cannot be scheduled on a processor with only time T to execute. Hence, after checking that T is feasible, we must go through a graph expansion phase in which all hierarchical nodes whose computation times violate the maximum granularity condition are decomposed into smaller nodes.

Given the stagetime T, the scheduling algorithm traverses the graph from input to output, partitioning the graph into pipeline stages. Nodes are scheduled onto a processor until the total computation cost of the nodes plus the communication cost of the output edges exceeds the stagetime T. Once a pipeline stage is filled, the scheduler proceeds to schedule the remaining nodes on the next pipeline stage. At any stage, the algorithm may decide to use multiple processors working in parallel. At the end, the graph is partitioned into a number of pipeline stages, where some stages may have more than one processor.

The total number of processors needed is returned. An example of how the algorithm works on a simple acyclic graph is shown in Figure 11. Values inside the nodes represent estimates of their computation costs, and values on the edges represent the additional delays for communication (for the sake of simplicity, the time-slot model is not used here). When two nodes are assigned to the same processor, communication between them incurs no cost.
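For a purely serial chain of nodes, a heavy simplification of the graphs McDAS actually handles, the stage-filling step can be sketched as follows; communication costs, parallel stages, and the time-slot model are all omitted:

```python
# Greedy pipeline partitioning of a linear chain under a fixed stagetime T,
# in the spirit of the Fig. 11 example. Costs are illustrative node weights.

def partition_chain(costs, T):
    """Assign each node of a serial chain to a pipeline stage."""
    stages, current, used = [], [], 0
    for c in costs:
        if c > T:                   # would violate the max granularity condition
            raise ValueError("node larger than stagetime; decompose it first")
        if used + c > T:            # stage is full: open the next pipeline stage
            stages.append(current)
            current, used = [], 0
        current.append(c)
        used += c
    if current:
        stages.append(current)
    return stages

# With T = 10, nodes [6, 3, 7, 4, 5] fall into stages [6, 3], [7], [4, 5].
```

The real scheduler additionally charges the output edges' communication costs against the stagetime and may assign several processors to one stage.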

In scheduling a node n_i, the algorithm examines the Earliest Starting Time t_es(n_i, p_k) for the different feasible processors p_k to pick the best candidate. t_es(n_i, p_k) is set to K_inf, a very large constant, if any of the following is true:

1. There is insufficient buffer memory in processor p_k to execute node n_i, i.e., bm_avail(p_k) < bm(n_i, p_k).

2. p_k was already assigned to a pipeline stage which is different from the stage needed to execute n_i.

3. There is insufficient time left on p_k to execute n_i within the stagetime limit.

A processor p for which t_es(n, p) < K_inf is called a feasible processor for n. Condition 1 assumes the buffer memory is local to the processor. If it resides in a centralized memory, the condition becomes bm_avail < bm(n, p), where bm_avail is the remaining buffer memory of the entire system.
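The three conditions act as a screen in front of the earliest-starting-time comparison. A sketch with invented numbers (the helper name and the K_INF value are not from the paper):

```python
K_INF = 10**9   # stands in for the "very large constant" K_inf

def screened_est(est, bm_need, bm_avail, proc_stage, node_stage, slack):
    """Return est, or K_INF if any of conditions 1-3 makes the proc infeasible."""
    if bm_avail < bm_need:                                    # condition 1
        return K_INF
    if proc_stage is not None and proc_stage != node_stage:   # condition 2
        return K_INF
    if slack < 0:                                             # condition 3
        return K_INF
    return est

# Pick the feasible processor with the smallest earliest starting time:
candidates = {"p0": screened_est(14, 8, 16, 1, 1, 2),
              "p1": screened_est(11, 8, 4, 1, 1, 2)}  # p1 lacks buffer memory
best = min(candidates, key=candidates.get)            # "p0" wins despite 14 > 11
```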

Exploiting both pipelining and parallel execution at the same time makes the scheduling task much more difficult. Since the number of processors is fixed, not all parallelism can be exploited, and the algorithm must decide which


[Figure: the scheduled CDFG feeds the Memory Mapper; a C code emitter and C compiler produce floating-point and bit-true simulations, while a DSP code emitter produces the real-time implementation.]

Fig. 12. Code generator.

operations deserve extra processors and which do not. The exact criteria for node scheduling, the retiming of flowgraph cycles, the node decomposition, and other features of the scheduler are beyond the scope of this paper. The reader is referred to [31] for a complete description of the scheduling algorithm.

6.2 Scheduling for Maximum Throughput

To schedule for maximum throughput, we use the algorithm above as a probing function in a binary search to minimize the stagetime T. Given the bounds T_ub and T_lb, a candidate stagetime T is chosen between the two bounds and used as the input to the scheduling algorithm above. This routine, as a side-effect of scheduling the flowgraph, returns the number of processors needed. By comparing the number of processors required to the actual number of processors available, T_ub or T_lb can be updated accordingly, and a new stagetime T is checked for feasibility. This iterative refinement process terminates with the minimal feasible stagetime.
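The search skeleton can be sketched as below, with the fixed-throughput scheduler replaced by a toy monotone probe (the function names and tolerance are illustrative, not from the paper):

```python
import math

def min_stagetime(w_total, P, procs_needed, tol=1.0):
    """Binary-search the minimal feasible stagetime T in [W_total/P, W_total]."""
    t_lb, t_ub = w_total / P, float(w_total)   # bounds from Section 6.1
    while t_ub - t_lb > tol:
        T = (t_lb + t_ub) / 2
        if procs_needed(T) <= P:
            t_ub = T          # schedule fits: try a tighter stagetime
        else:
            t_lb = T          # too many processors needed: relax the stagetime
    return t_ub

# Toy probe: total work 100 splits perfectly, so ceil(100 / T) processors.
probe = lambda T: math.ceil(100 / T)
# min_stagetime(100, 4, probe) converges to just above the bound T = 25.
```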

Note that since there is no user-specified stagetime, it is not necessary to decompose any node at the start of the search. Instead, the algorithm starts at the top level of the hierarchy and systematically decomposes nodes only when necessary. The maximum granularity condition is now used to decide when a node should be decomposed. Specifically, when there are nodes in the graph with costs as large as T, and the search decides that T can be decreased further, it will break up these nodes to satisfy the maximum granularity condition.

Once a schedule is available, the scheduling information is annotated onto the CDFG for code generation.

7 Code Generation

The Code Generator module takes the scheduled CDFG and generates a program for each processor. The organization of the code generator is shown in Figure 12.

From the CDFG, the Memory Mapper allocates buffers for data transfers between processors and issues local synchronizations. For behavioral simulation of the application, C code is generated which can perform floating-point or bit-true simulations. This allows an algorithm designer to verify functionality, optimize application parameters, and assess the effects of finite word-length implementations. If real-time implementation is desired, code for a DSP multiprocessor engine can be generated. Currently, McDAS uses commercial C compilers to generate code for real-time implementations. However, other direct code generation approaches (starting from the CDFG) such as [23], [33] can be easily integrated when available.

Memory Mapper. The Memory Mapper mod- ule scans the CDFG for edges whose input and


[Figure: the Stage 0-3 partitioning of Fig. 11 annotated with interprocessor buffers and a local synchronization.]

Fig. 13. Buffer allocation.

output nodes are mapped to different processors. These represent interprocessor communications. They are implemented as FIFOs whose sizes are determined by the amount of data sent and the pipeline stages of the source and destination processors. The use of FIFOs to support the passing of data between tasks can be found in the BLOSIM system [34]. The physical residence of the FIFOs depends on whether the multiprocessor supports message passing or shared memory communication, and whether the memory is centralized or distributed. For message passing or distributed memory multiprocessors, these buffers reside in the local memory of the destination processor. For centralized memory systems, they would, of course, reside in the central memory. Buffers for data broadcast across several pipeline stages only need to be allocated once in centralized memory systems, but have to be replicated in distributed systems. Figure 13 shows how these buffers are derived from the CDFG of Figure 11.

A FIFO is assigned to each interprocessor communication. The length of the buffer is equal to PipeStage(Destination Processor) - PipeStage(Source Processor) + 1. A pointer keeps track of the current free data block. The source processor always writes to the free data block, while the destination processor always reads from the oldest block. If all processors globally synchronize at the beginning of each sample period, this scheme allows the source processor to write to the destination processor without ever corrupting unread data. When two processors in the same pipeline stage need to

communicate, the length of the buffer is one and the processors need to synchronize locally to ensure no data corruption. If the multiprocessor system does not have built-in synchronization instructions, semaphores can be used. In message passing systems, the send and receive primitives may implicitly perform the necessary synchronizations themselves. Note that if it were possible to predict the computation and communication times exactly, no local synchronizations would be necessary, as nodes are only scheduled after all their input data have arrived. Unfortunately, since such accurate estimates are not available, the synchronizations are necessary to guarantee correct execution.
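A sketch of the buffer scheme above, assuming one write and one read per global sample period (the class and method names are invented for illustration):

```python
# Interprocessor FIFO sized by pipeline distance. The source writes the free
# block, the destination reads the oldest block, and the pointer advances at
# the global synchronization point, so unread data is never overwritten.

class StageFIFO:
    def __init__(self, src_stage, dst_stage):
        self.length = dst_stage - src_stage + 1   # blocks, per the formula above
        self.blocks = [None] * self.length
        self.free = 0                             # index of the current free block

    def write(self, frame):
        """Source processor: write this period's frame into the free block."""
        self.blocks[self.free] = frame

    def read(self):
        """Destination processor: read the oldest block."""
        return self.blocks[(self.free + 1) % self.length]

    def tick(self):
        """Advance the pointer at the global sample-period synchronization."""
        self.free = (self.free + 1) % self.length

# Across one pipeline stage (length 2), the reader sees each frame one sample
# period after it was written; two processors in the same stage get length 1.
```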

Once the buffers are determined, the flow graph is physically partitioned into sub-graphs according to the processor assignment, and code is generated for each processor.

Floating-point and Fixed-point Simulation. The generated C code supports both floating-point and fixed-point simulations. The mapping is accomplished by implementing all primitive operations in the C code as macros, which can be expanded to different computations depending on which type of simulation is desired. Two header files are available: highlevel.h and bittrue.h. The highlevel.h header file is included for floating-point simulation, while fixed-point simulation uses bittrue.h. Each file defines the same primitive macros differently: one performing a floating-point operation, the other fixed-point. Note that each simulation model



Fig. 14. DFT example.

should have its own architecture description to provide correct estimates of computation times to the scheduler.
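The macro mechanism can be mimicked outside C as two interchangeable definitions of the same primitive. The 4-bit fractional fixed-point format below is purely illustrative, not the format McDAS uses:

```python
# Analogue of the highlevel.h / bittrue.h pair: the same primitive name maps
# either to floating-point arithmetic or to a quantized, bit-true version.

FRAC_BITS = 4                                  # hypothetical Q-format

def _quantize(x):
    return int(round(x * (1 << FRAC_BITS)))

highlevel = {"MULT": lambda a, b: a * b}       # floating-point expansion
bittrue   = {"MULT": lambda a, b:              # fixed-point expansion
             (_quantize(a) * _quantize(b) >> FRAC_BITS) / (1 << FRAC_BITS)}

ops = bittrue                                  # "include" one header or the other
# highlevel gives 0.3 * 0.7 = 0.21; bittrue gives the quantized 0.1875,
# exposing the finite word-length effect the designer wants to assess.
```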

[Figure: bar chart of completion time (ms, 0-1600) per processor 0-5, estimated vs. actual.]

Fig. 15. DFT load balancing on 6 processors.

8 Some Examples

A few examples are presented to demonstrate the basic capabilities of the compiler. The target architectures are the Sequent and SMART multiprocessor systems.

Discrete Fourier Transform (DFT). A 256-point DFT algorithm is partitioned and scheduled on the Sequent multiprocessor. The input data consists of an array of 256 real-valued samples. The real and imaginary components of the Fourier transform are calculated, and the magnitude signal is derived and output. The top-level CDFG is shown in Figure 14. The values inside the nodes represent their estimated computation times in milliseconds (ms). For ease of illustration, the first target architecture configuration consists of 6 processors. The completion times on the processors as estimated by the scheduler and as actually measured on the Sequent are shown in Figure 15. Overall, the load is well balanced across the processors, and the estimated time agrees very well with the actual running time. The distribution of the computation is shown in Figure 16. The program is pipelined into 2 stages, with processor 0 in stage 0 and processors 1-5 in stage 1. Processor 0 reads in the array and performs the

DFT on the first 25 samples. The remaining processors process the remaining samples and send their results to processor 4, which is also assigned the task of outputting the result. The same example was then scheduled and executed on different numbers of processors to assess how the speedup varies with the processor count. Figure 17 shows the speedup as estimated by the scheduler and the actual speedup on the Sequent, both as functions of the number of processors. For this example, we observe that McDAS is able to consistently achieve a faster throughput with each additional processor. For comparison, the ideal speedup is shown. This can only result if there is perfect load balancing and no cost for interprocessor communication.

CORDIC. A CORDIC algorithm is scheduled onto the SMART multiprocessor. The algorithm converts Cartesian to polar coordinates iteratively in 20 steps. Since the computations are serial in nature, McDAS achieves the speedup by pipelining the loop, assigning successive processors the successive loop iterations. In contrast to the histogram example, the CORDIC program is communication intensive, with each iteration passing its results to the next. The amount of interprocessor communication as a


[Figure: the DFT graph partitioned with Proc 0 in pipeline stage 0 (input) and Procs 1-5 in pipeline stage 1 (output via Proc 4). Buffer memory usage: Proc 0: 0 data; Procs 1, 2, 3, 5: 512 data each; Proc 4: 1153 data.]

Fig. 16. DFT partitioned on 6 processors.

[Figure: DFT speedup (2-12) vs. number of processors (4-12), actual and estimated curves.]

Fig. 17. DFT speedup.

function of the number of available processors is shown in Figure 18. When 20 processors are available, each processor computes 1 iteration step and sends its result to the adjacent processor, increasing the overall interprocessor communication significantly. A linear array architecture allows these neighbor-to-neighbor

[Figure: percentage of interprocessor communication (2-18%) vs. number of processors (2-28).]

Fig. 18. CORDIC % communication.

communications to occur simultaneously, while a single shared bus architecture is soon saturated. This is demonstrated in Figure 19, where the speedup for the CORDIC algorithm on a


[Figure: CORDIC speedup (2-26) vs. number of processors (2-28) for the ideal, linear array, and shared bus architectures.]

Fig. 19. CORDIC speedup.

linear array and a shared bus architecture is plotted. In addition, we note a "staircase" effect in the speedup. This results from the fact that McDAS does not currently partition in the middle of a loop iteration. As a result, when there are 12 to 18 processors available, the scheduler still assigns at least one processor 2 iterations to execute. The throughput stalls in this range until there are enough processors to allow each iteration to be executed by its own processor. This occurs at 20 processors, after which a great leap in throughput is attained. The decision to retain iteration boundaries in the scheduler was made to keep the number of nodes considered by the scheduler low and maintain reasonable user response time.

Table 3 shows the buffer memory usage of the CORDIC example on the SMART system as a function of the available processors. The maximum buffer size gives the largest buffer memory usage on any processor, and the total buffer size gives the total amount of memory used by all the processors. Since the application is extensively pipelined, the buffer memory requirements in the last stages of the pipeline can be quite large. Additional examples exhibiting a wide range of concurrency types and communication patterns can be found in [31].

Table 3. CORDIC buffer memory usage.

# Procs   # Pipelines   Size of Maximum Buffer   Size of Total Buffer
2         2             86                       86
6         6             106                      308
10        10            206                      1134
14        11            206                      1165
18        11            206                      1168
22        21            406                      4302

9 Conclusions and Future Work

The McDAS environment was developed to allow rapid real-time implementations of DSP applications onto multiple processors. Its main goal was to provide the user with maximum design flexibility and minimum implementation effort. As such, the laborious tasks of scheduling and code generation are automated, allowing the user to explore different implementations quickly. The scheduler exploits all types of concurrency at different levels of granularity to maximize the throughput of the resultant implementation. The resultant parallel code can execute the algorithm in quasi-infinite or bit-true precision. The key features of the system are its interactive user-interface and modular design. For the user, different architecture configurations can be explored quickly and easily. In addition, new tools (transformations, schedulers, code emitters, user interfaces) can be easily incorporated through the CDFG interface.

While the current environment is complete enough to allow a designer to carry a design from specification to implementation, there are a number of areas which can be improved. Firstly, the current commercial C compilers for the DSP processors do not generate code efficient enough for real-time applications. This is due mainly to the specialized architectures of the DSP processors and the lack of support for DSP primitives in C. The work by Genin [23] and Kim [33] to generate code directly from DSP flowgraphs


is an important contribution. Secondly, experiments have shown that a number of loop transformations can dramatically improve the speedup or memory usage for a class of examples. Transformations such as loop merging, loop interchanging, and loop unrolling are prime candidates. For example, merging two loops in series reduces the amount of buffering between the two loops. Moving an inner loop which has many iterations to the outermost position allows the scheduler to better balance loads. Next, the scheduling strategy must be augmented to handle data-dependent computations without incurring the cost of dynamic scheduling. A scheduling paradigm called quasi-static scheduling [35] attempts to retain as much static scheduling as possible, resorting to dynamic scheduling only when absolutely necessary. In this paradigm, Ha [36] proposes to derive a number of possible local schedules for each data-dependent computation at compile-time, and have the static scheduler select the profile which minimizes the expected run-time cost. Finally, the scheduler can be extended to incorporate heterogeneous processors, mapping computations to the processors that execute them fastest. Ultimately, McDAS can provide a powerful CAD environment where the user has the ability to make architectural trade-offs based on scheduling results, perform transformations to optimize designs, perform simulations, and generate efficient parallel code for real-time implementation.

References

1. DSP96002 User's Manual, Motorola, 1990.
2. K.A. Frenkel, "HDTV and the computer industry," Communications of the ACM, November 1989.
3. J.P. Campbell Jr., V. Welch, and T. Tremain, "An Expandable Error-Protected 4800 bps CELP Coder," ICASSP, March 1985.
4. "Description of Reference Model 8 (RM8)," Document 525, CCITT SGXV, Working Party XV/4, Specialist Group for Visual Telephony, 1989.
5. G.K. Wallace, "Technical Description of the Proposed JPEG Baseline Standard for Color Image Compression," EI'89, Boston, October 1989.
6. E.A. Lee, et al., "Gabriel: A Design Environment for DSP," IEEE Trans. on ASSP, vol. 37, November 1989.
7. M.A. Zissman, G.C. O'Leary, and D.H. Johnson, "A block diagram compiler for a digital signal processing MIMD computer," DSP Workshop Presentation, Chatham, MA, October 1986.
8. V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, 1989.
9. M.J. Gonzalez, Jr., "Deterministic Processor Scheduling," Computing Surveys, vol. 9, no. 3, September 1977.
10. M.L. Campbell, "Static Allocation for a Data Flow Multiprocessor," Proc. Int. Conf. Parallel Processing, 1985, pp. 511-517.
11. B. Greenblatt and C.J. Linn, "Branch and Bound Algorithms for Scheduling Communicating Tasks in a Distributed System," Compcon, pp. 12-16, 1987.
12. S.W. Bollinger and S.F. Midkiff, "Processor and Link Assignment in Multicomputers Using Simulated Annealing," Proc. Int. Conf. Parallel Processing, August 1988, pp. 1-7.
13. G. Sih and E.A. Lee, "Dynamic-Level Scheduling for Heterogeneous Processor Networks," IEEE Symposium on Parallel and Distributed Processing, December 1989.
14. H. Printz, "Automatic Mapping of Large Signal Processing Systems to a Parallel Machine," Ph.D. Thesis, Carnegie Mellon University, May 1991.
15. M. Annaratone, et al., "Warp Architecture: From Prototype to Production," Proceedings of the 1987 National Computer Conference, Chicago, Illinois, June 1987.
16. D.A. Schwartz, "Synchronous Multiprocessor Realizations of Shift-Invariant Flow Graphs," Doctoral Thesis, Georgia Institute of Technology Technical Report, June 1985.
17. S.H. Lee, C.J.M. Hedges, and T.P. Barnwell, III, "An SSIMD Compiler for the Implementation of Linear Shift-Invariant Flow Graphs," ICASSP, March 1985.
18. K.K. Parhi, "Rate-Optimal Fully-Static Multiprocessor Scheduling of Data-Flow Signal Processing Programs," ISCAS'89.
19. H.R. Forren, "Multiprocessor Design Methodology for Real-Time DSP Systems Represented by Shift-Invariant Flow Graphs," Doctoral Thesis, Georgia Institute of Technology Technical Report, 1988.
20. S. Bokhari, "Assignment Problems in Parallel and Distributed Computing," Parallel Processing and Fifth Generation Computing, Kluwer Academic Publishers, 1988.
21. G. Sih and E.A. Lee, "A Multiprocessor Scheduling Strategy," Electronics Research Laboratory Memorandum, December 1990.
22. P. Hilfinger, "SILAGE, A High Level Language and Silicon Compiler for Digital Signal Processing," Proceedings IEEE CICC Conference, Portland, May 1985.
23. D.R. Genin, et al., "System Design, Optimization, and Intelligent Code Generation for Standard DSP," ISCAS, May 1989.
24. J. Rabaey, et al., "Fast Prototyping of Datapath-Intensive Architectures," IEEE Design & Test of Computers, June 1991.
25. D.C. Chen, "A Field Programmable Architecture and CAD Framework for High Speed DSP Applications," Internal Report, U.C. Berkeley, 1991.
26. A.V. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley, 1986.
27. C. Scheers, "User Manual for the S2C Silage to C Compiler," Internal Document, IMEC Laboratory, Leuven, Belgium, August 1988.
28. P. Hoang, "Compiling Real-Time DSP Applications onto Multiprocessor Systems," Ph.D. Thesis, U.C. Berkeley, May 1992.
29. W. Koh, A. Yeung, P. Hoang, and J. Rabaey, "A Configurable Multiprocessor System for DSP Behavioral Simulation," ISCAS'89.
30. A. Osterhaug, "Guide to Parallel Programming on Sequent Computer Systems," Sequent Computer Systems, Inc.
31. P. Hoang and J. Rabaey, "Scheduling of DSP Programs onto Multiprocessors for Maximum Throughput," IEEE Trans. on ASSP, June 1993.
32. M.R. Garey and D.S. Johnson, "Complexity Results for Multiprocessor Scheduling under Resource Constraints," SIAM Journal on Computing, vol. 4, pp. 397-411, 1975.
33. B.M. Kim and T.P. Barnwell, III, "Resource Allocation and Code Generation for Pointer-Based Pipelined DSP Multiprocessors," ISCAS'90.
34. D.G. Messerschmitt, "A Tool for Structured Functional Simulation," IEEE J. Select. Areas Commun., vol. SAC-2, January 1984.
35. E.A. Lee, "Recurrences, Iterations, and Conditionals in Statically Scheduled Block Diagram Languages," VLSI Signal Processing III, IEEE Press, 1988.
36. Soonhoi Ha, "Compile-Time Scheduling of Dataflow Program Graphs with Dynamic Constructs," Ph.D. Thesis, University of California, Berkeley, 1992.

36. Soonhoi Ha, "Compile-Time Scheduling of Dataflow Pro- gram Graphs with Dynamic Constructs," Ph.D. Thesis, University of California, Berkeley, 1992.

Phu D. Hoang was born in Saigon, Vietnam on January 21, 1965. He received the B.S. degree in Electrical Engineering and Mathematics in 1985 from the University of Maryland, College Park, and his M.S. and Ph.D. degrees in Electrical Engineering and Computer Science in 1987 and 1992, respectively, from the University of California, Berkeley. His research interests include multiprocessor scheduling and compilation, high-level synthesis, and design-exploration frameworks. He is currently with Redwood Design Automation, engaged in research and development of high-level design automation tools.

Jan Rabaey received the EE and Ph.D. degrees (summa cum laude) in applied sciences in 1978 and 1983, respectively, from the Katholieke Universiteit Leuven, Belgium. In 1987, he joined the faculty of the University of California, Berkeley, where he is now a professor. His current research interests are the study of architectures, the computer-aided analysis, and the automated design of digital signal processing systems.

Jan Rabaey has authored or co-authored more than 100 papers in the area of signal processing. In 1986, he received the 1985 IEEE Transactions on Computer-Aided Design Best Paper Award (Circuits and Systems Society). In 1989, he received the Presidential Young Investigator award. He was also a recipient of the Analog Devices Career Development Professorship award. He has served as associate editor of the IEEE Journal of Solid-State Circuits, and he is or has been on the program committees of the ISSCC, EDAC, ICCD, ICCAD, High Level Synthesis, and VLSI Signal Processing conferences.