Architecture-Driven Synthesis Techniques for VLSI Implementation of DSP Algorithms

HUGO DE MAN, FRANCKY CATTHOOR, GERT GOOSSENS, JAN VANHOOF, JEF VAN MEERBERGEN, STEFAAN NOTE, AND JOS HUISKEN

Invited Paper

Digital signal processing (DSP) is a rapidly growing discipline as VLSI technology makes real-time digital algorithms for speech, audio, image processing, video, and control systems economically feasible. Due to the competitiveness of the application field, cutting design time is a key issue for DSP. Silicon compilation is a way to achieve this. In this paper the state of the art of compiling DSP algorithms into silicon is discussed. First it is indicated how digital signal processing differs from numerical data processing, including the consequences for the synthesis tools. Unlike compilers generating general-purpose microprocessors, DSP synthesis requires tools for analysis, optimization and simulation of the bit-true behavior of the algorithm at the highest level. An applicative input language for specifying the behavior of DSP systems is advocated. Based on a wide span of DSP applications, four classes of architectures are distinguished to serve as templates for four different synthesis systems. Although each of these four silicon compilers is tuned to a specific class of applications in order to generate area-efficient chips, they all accept as input the same behavioural DSP specification. The four selected architectural styles are best characterized by the following keywords: hard-wired bit-serial data-paths, microcoded multiprocessors, cooperating bit-parallel data-paths and, finally, regular arrays.

Each of the CATHEDRAL compilers is based on a mixture of knowledge-based architecture generation techniques and algorithmic optimizations. Silicon is generated from technology-updatable libraries of primitive cells, by means of structured module generators or by using a standard-cell design system. Attention is paid to the assembly of test patterns for the synthesized chips. The CATHEDRAL programs support interactive synthesis for the four above-mentioned architectures, all the way from the applicative bit-true specification to silicon. For each compiler the design trajectory starting from a high-level specification down to layout is analyzed in the paper. Each of the CATHEDRALs and their underlying methodology is illustrated with the complete design of a representative example.

I. INTRODUCTION

Digital signal processing (DSP) systems constitute an increasingly important class of applications in electronic design. Evolving from linear, time-invariant digital filters, state-of-the-art DSP applications are today characterized by an increasing arithmetic complexity combined with time-variant or decision-making operations. No longer do they merely perform transformations on one-dimensional data-streams, but they also employ advanced vector and matrix operations [1]. The application domain ranges from medium-throughput speech, audio, and user-end telecommunications on the low-frequency side of the spectrum, to image, video, and radar processing on the high-frequency side. For these complex systems, and especially for those designed for the consumer-electronics market, a sufficiently large design efficiency has to be achieved in terms of throughput, area, power consumption, and packaging. At the same time, the design cycle from algorithm to working system should be reduced from a few years to a few weeks in order to respond to the rapidly evolving market.

Manuscript received April 3, 1989; revised September 19, 1989. This research has been sponsored in part by the ESPRIT 97 project of the European Economic Community (EEC) and industrial partners Philips, Siemens, Bell Telephone Mfg. Co., and Silvar-Lisco.

H. De Man is with the Elektrotechnic Institute, Katholieke Universiteit Leuven, Belgium.

F. Catthoor, G. Goossens, J. Vanhoof, and S. Note are with IMEC Laboratory, B-3030 Leuven, Belgium.

J. Van Meerbergen and J. Huisken are with Philips Research Labs, NL-5600 JA Eindhoven, The Netherlands.

IEEE Log Number 8934100.

These objectives can be achieved by implementing DSP systems in application-specific ICs (ASICs) using CAD support during the complete design trajectory from algorithm to system. This requires both high-level synthesis programs and module generation tools, which result in efficient architectures and dense layouts. Efficient silicon compilation of industrial-size designs becomes feasible if the synthesis strategy is based on a well-controlled user interaction strategy and on a clearly defined target architecture, as demonstrated in the CATHEDRAL/PIRAMID projects [14], [15].

In section II, a number of basic characteristics of the DSP domain are presented and the difference between synthesis methods for DSP and for scientific and numerical processing is discussed. In addition, the development of ASIC methodologies for DSP applications is motivated. In section III, SILAGE [28] is proposed as a suitable specification language. The importance of bit-true specification and optimization for DSP applications is stressed. In section IV, four efficient architectural strategies oriented towards DSP are outlined. In section V, the characteristics of architectural synthesis for DSP are discussed in general. This topic is treated in more detail with three different CATHEDRAL synthesis environments. CATHEDRAL-I is oriented towards hard-wired bit-serial chips (section VI), CATHEDRAL-II towards microcoded processors (section VII) and CATHEDRAL-III [29], [6] towards bit-sliced cooperating data-paths (section VIII). CATHEDRAL-IV, aiming at the synthesis of regular arrays, is still in an early stage of development and will not be discussed in this paper. The claims are substantiated with a typical design for every compiler. Section IX provides the final discussion.

II. CAD JUSTIFIES ASIC DESIGN FOR DIGITAL SIGNAL PROCESSING

A. Evolving from Numerical Processing Towards DSP

Hardware compilation first started out in the 70s as an extension to software compilation. At that time the goal was to automatically convert a register-transfer description of a system into a physical implementation, in terms of an interconnection list of commodity ICs. Designs were described rather than specified using a hardware description language. Nowadays, one tends to classify these systems as hardware assemblers rather than compilers. They are commercially available at this moment. Some of them are reported in [20].

Soon after that time, full-fledged compilers were reported to synthesize general-purpose TTL computers starting from a specification of the instruction set of the machine rather than from a register-transfer description of the processor. Instruction sets were specified using special-purpose languages. The potential of VLSI shifted this aim towards compiling special-purpose programs described in a high-level general-purpose procedural language such as PASCAL, into application-specific computers, integrated on a single chip (ASICs). The term silicon compilation was introduced. The specification of such an ASIC computer is to run its built-in program as fast as possible on typical data. If a computation is finished earlier than expected, the next computation can be started already. Since one cannot afford to build large machines for every specific application, such designs are highly area-constrained. Compilers typically minimize the average program execution time, subject to an area constraint. Numerous compilers belonging to this class have been reported in literature: SILC [19], APOLLON [31], ALGIC [32], S(P)LICER [37], CAMAD [40], MACPITTS [48], FACET [50], and YORKTOWN SILICON COMPILER [2].

During the 80s once again a shift has taken place in compiler specifications, due to the growing economic importance of the digital signal processing market. With the impact of DSP, people realized the importance of real-time computing. Unlike a general-purpose microprocessor capable of running a variety of computer programs on numerical data, or unlike an ASIC microprocessor running its built-in program as fast as possible on typical data, a signal processor must be specially designed to execute a repetitive real-time algorithm with a constant data rate. Indeed the data rate of a digital signal processor is essentially fixed by the DSP algorithm itself and by the surrounding system functions.

To illustrate this point, consider a digital low-pass filter with a data rate of 44.1 kHz and a cut-off frequency of 10 kHz, which can be implemented as an ASIC containing a single ALU. By adding more arithmetic power to that chip (without exceeding the area constraints defined by the designer), a data rate of 88.2 kHz may be obtained. However, the resulting filter has a cut-off frequency of 20 kHz.

Although the "shape" of the transfer function in the frequency domain is unchanged, the frequency axis is warped and the filter behaves essentially differently from the specifications. On the other hand, if a single ALU does not provide sufficient arithmetic power to compute the output at the specified rate, it makes no sense to allow a relaxation of the data rate to, say, 22.05 kHz, because the cut-off frequency will now shift to 5 kHz, again making the filter useless. In the latter case, the only solution is to provide more hardware on the chip.
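The warping in this example is just a linear rescaling of the frequency axis: a fixed coefficient set designed for one sample rate, when clocked at another rate, has its cut-off frequency scaled by the same ratio. A minimal C sketch of that arithmetic, using the 44.1-kHz/10-kHz numbers from the example (illustrative only, not part of any compiler):

    #include <stdio.h>

    /* Cut-off frequency obtained when a coefficient set designed
       for rate fs_design is run at rate fs_actual.               */
    double effective_cutoff(double fc_design, double fs_design, double fs_actual)
    {
        return fc_design * (fs_actual / fs_design);
    }

    int main(void)
    {
        double fc = effective_cutoff(10.0e3, 44.1e3, 88.2e3);
        printf("cut-off at 88.2 kHz:  %.1f kHz\n", fc / 1e3);   /* 20.0 kHz */
        fc = effective_cutoff(10.0e3, 44.1e3, 22.05e3);
        printf("cut-off at 22.05 kHz: %.1f kHz\n", fc / 1e3);   /* 5.0 kHz  */
        return 0;
    }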

As this example shows, digital signal processing implies a periodic evaluation of the algorithm in terms of the sample or "frame" period. The input of a DSP compiler is a high-level algorithmic description, which includes a notion of real time. Real-time constraints make compiling a DSP algorithm into silicon a very difficult job. On the one hand it must be guaranteed under all circumstances that all computations of one frame can be completed within one frame period. The average execution time is no longer of interest. Only the worst-case situations are relevant. On the other hand it obviously makes no sense to generate a DSP chip running faster than the input data are presented, because after finishing its computations it will remain idle until the next data sample becomes available. One should therefore try to exchange these residual clock cycles for hardware.

Therefore, efficient DSP compilers typically minimize the chip area subject to a worst-case frame rate. Compilers belonging to this class are LAGER [41], [42], SEHWA [38], the BSL-compiler [27], HAL [39], SPAID [26], and the CATHEDRAL compilers discussed in sections III-VIII.

The optimization criterion used by silicon compilers effectively separates them into two major compiler classes. It should be noted though that, in principle, a compiler of one class may be used to solve a design problem situated in the complementary class, by using the compiler iteratively. Many systems support such an iterative mode.

B. Finite Word-Length Effects

Finite word-length issues play an important part in DSP. The accuracy of the results produced by a digital computer depends on its word-length. For general computing we use standard 32-bit or 64-bit floating-point, but for DSP applications this is often overkill. For many applications fixed-point arithmetic is quite sufficient. Nevertheless the behavior of important classes of DSP algorithms is inherently sensitive to finite word-length effects. For example, the signal-to-noise ratio is an integral part of the specification of an audio system. DSP implementations can thus be optimized towards speed and area by minimizing the word-length under certain constraints to avoid limit-cycles and round-off noise (see section III).

The optimized word-lengths must be correctly implemented in the chip architecture, i.e., the compiler must guarantee a bit-true behavior.
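To make "bit-true" concrete, the sketch below quantizes a real value to a signed fixed-point format with w total bits and f fractional bits, in the spirit of the num<16,12> type used in the SILAGE example of Table 1. The round-to-nearest and saturation policy is an assumption chosen for illustration; the essential point is that whatever quantization and overflow behavior is simulated must be reproduced exactly by the synthesized architecture.

    #include <stdio.h>
    #include <math.h>

    /* Quantize x to a signed fixed-point format with w total bits,
       of which f are fractional bits (cf. SILAGE num<w,f>).
       Rounding to nearest and saturation are assumed here.        */
    double quantize(double x, int w, int f)
    {
        double scale = ldexp(1.0, f);              /* 2^f                  */
        long long max = (1LL << (w - 1)) - 1;      /* largest code         */
        long long min = -(1LL << (w - 1));         /* smallest code        */
        long long code = llround(x * scale);
        if (code > max) code = max;                /* saturate on overflow */
        if (code < min) code = min;
        return (double)code / scale;
    }

    int main(void)
    {
        /* 0.589 as it would be stored in a num<16,12> signal (cf. Table 1) */
        printf("%.6f\n", quantize(0.589, 16, 12));
        return 0;
    }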

C. Other Characteristics of State-of-the-Art DSP Applications

Besides the real-time aspect and the finite word-length limitations, DSP algorithms exhibit other typical features. Signal processing algorithms typically involve large amounts of data. Compared to conventional DSP of the first generation (linear time-invariant digital filters), three different types of extensions are supported in the 2nd and 3rd generation DSP [1]:

- Algorithms not only use linear arithmetic, but also nonlinear operations are involved, such as modulo arithmetic, logarithmic processing, and bit manipulation.
- In many advanced algebraic DSP applications, multidimensional signals are involved which necessitate vector, block and matrix operations in addition to scalar processing.
- In almost every DSP system a regular signal flow as present in digital filters is combined with data-dependent branching. Parabolic interpolation and pattern recognition are typical examples (subsection VII-B).

D. Compiled ASIC‘s Outperform General-purpose DSPs and Full-Custom Designs in the Consumer-Electronics Market

Compared to an ASIC solution, systems based on commodity DSP processors can be designed in a relatively short time, since no IC processing is performed. Most of the design time is spent in writing the application program, as DSP code generators starting from a truly behavioral specification are currently not available. In addition, their field programmability allows the application to be adapted to the latest modifications.

Commodity DSPs however fail to handle the complex algorithms used in today's consumer electronics and telecommunication applications, as described earlier in this section. Many video and image processing systems require so high a throughput that the sample rate cannot be maintained by a commodity DSP processor. Complex systems using several processors in parallel suffer from communication overhead. Several domain-specific video signal processors are currently entering the market, but they still lack the speed required by many front-end subsystems. In addition, the power consumption of the powerful versions of programmable DSPs is rather high. Finally, these DSPs cannot be parameterized according to the designer's needs such as I/O, on-chip memory and (bit-true) signal word-length requirements.

In contrast, the area efficiency and performance achievable with a full-custom design can be fully exploited for the range of high-speed applications such as video and radar. But for most DSP applications envisioned in this paper, the long time-to-market due to the excessive design time of a full-custom chip cannot be justified. Fast reaction to market changes is of prime importance to outpace the competition.

These considerations motivate the choice for a semi-custom approach, where on the one hand, power, area-efficiency, and customizability are improved over the programmable DSP processors, and on the other hand the design time is orders of magnitude less than for full-custom. This can be achieved by designing ASIC architectures, which exploit the specific properties of the implemented algorithms, such as the inherent parallelism, and by matching the hardware to the algorithmic complexity. The main advantage of semi-custom gate-array or standard-cell chips is the remarkably small size of the cell library these systems are built from (e.g., 20 cells with an average complexity of 10 transistors). In contrast to general-purpose solutions, these semi-custom chips are able to meet the computation rate and customization needed by the applications addressed in this paper. Unfortunately they require a relatively large chip area for regular structures such as data-paths. For a macro-cell ASIC design, the cell library is certainly larger (100 cells of about the same complexity), but still maintainable. The library development effort is justified by the large number of designs which are possible, with an increased area-efficiency. For irregular parts of a macro-cell ASIC, such as the controller or glue logic, standard-cell subdesigns can easily be included. First exercises show that the resulting efficiency can approach that of full-custom designs within 50%, while CAD allows for design times of less than a month. Especially for volumes above 100K chips per year, an ASIC solution will eventually be the cheapest.

E. The “Meet-in-the-Middle” Methodology

Exercises show that the design-time issue can be solved by adopting a so-called "meet-in-the-middle" methodology (Fig. 1) [13]. In this methodology the designer is assisted by interactive synthesis software on all levels of abstraction, while the physical implementation of the design is based on a restricted and pre-characterized library of silicon modules. The library should be pre-characterized in order to effectively decouple synthesis from physical implementation, and it should be restricted for reasons of maintenance. Fortunately, even the broad range of DSP algorithms described above can be efficiently implemented by using a restricted set of functional building blocks such as register-files, multiplexers, ALUs, RAMs, PLAs, and multipliers.

Fig. 1. Illustration of the "meet-in-the-middle" design methodology.

A module is the physical implementation of a functional building block. Modules are constructed from lower-level primitive cells or leaf-cells by a module generator. This is a computer program written by a silicon designer working "below the line." It is crucial that all the modules are composed of their leaf-cells in a parameterizable way, either as bit-sliced, matrix-like, or even as logarithmic structures. The development of the module library is supported with an interactive module-generation environment and with verification tools, while the library of the leaf-cells is created using a symbolic-layout editor and automatic compaction.

The CAD supporting the system designer "above the line" includes the architectural synthesis from an applicative, high-level behavioral description to a detailed netlist of functional building blocks, the chip assembly to a floorplan and the generation of a test program.

At the IMEC and Philips laboratories such a set of synthesis, optimization, and verification tools, collected in the CATHEDRAL/PIRAMID environment, is currently being developed to compile DSP algorithms into ASICs [13], [14], [24], [29], [30], [35], [36], [44], [56].

III. BIT-TRUE SPECIFICATION AND OPTIMIZATION

In view of the strong requirements of signal typing in most DSP applications, it is crucial to provide specifications of all signal formats up to the bit-level (i.e., bit-true specification) including all signal quantization and overflow characteristics. Also the difficult step of converting abstract floating-point descriptions into concrete fixed-point descriptions has to be supported. In addition, some other requirements on description languages will be discussed.

A. Algorithmic Description Languages

Many description languages have been proposed for the specification of hardware or algorithmic behavior. One of the most important distinctions is between procedural or imperative and applicative or functional description styles. A procedural program computes an algorithm by performing incremental changes to variables. A variable in an imperative language refers to a named storage location whose value can be modified by means of a sequenced assignment statement. By contrast, applicative languages describe relations between signals. Each signal has a unique name and is uniquely defined by an equation. This property is known as referential transparency [17]. As long as a signal has not been computed, it is unknown. An operation to compute a signal may only be executed whenever all its input signals are known, which introduces precedences. No additional sequencing is needed or even allowed. The parallelism in an applicative program is only limited by the precedences.

For our purposes, namely architectural synthesis for DSP, we have chosen an applicative language, SILAGE [28]. In this way, the algorithmic parallelism is made explicit and no biases are introduced towards the eventual realization. Moreover the precedences in most DSP algorithms are data-independent. They can be computed at compile time. This justifies the development of a graph language like SILAGE [28] or SIGNAL [21], instead of using a general-purpose functional language like LISP. A DSP algorithm may be represented as a fixed graph with operators as vertices and with arcs representing the flow of data. This does not eliminate the use of data-dependent branching, though. The difference with the IF-THEN-ELSE constructs of conventional programming languages lies in the use of "multiplexer"-type equations for the branch definitions in SILAGE. In addition, DSP algorithms require some special constructs, which should be supported as much as possible in the language. These include rigorous behavior at the bit-level, timing aspects and multidimensional signals. All these aspects are supported in the current version of SILAGE. As an example, Table 1 contains the SILAGE description of the amplitude function in a pitch-extraction system [47], to be presented in section VII. Observe the use of arrays of signals, conditional definitions, and signal types.
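The contrast between sequenced assignments and applicative signal definitions can be made concrete with a small sketch. Below, a hypothetical first-order smoother is written so that every signal value is defined exactly once per sample: the "multiplexer"-type branch selects a value instead of modifying a stored variable, and a delayed signal is simply the value from the previous iteration. This is only an illustration of the semantics in C, not code produced by any of the tools described here.

    /* Single-assignment reading of a conditional signal definition,
       roughly:  y[k] = if (x[k] > y[k-1]) -> x[k] || (x[k] + y[k-1])/2 fi */
    #include <stdio.h>

    int main(void)
    {
        double x[8] = {0, 4, 1, 7, 3, 3, 9, 2};
        double y_prev = 0.0;                       /* the delayed signal y@1, initial value */
        for (int k = 0; k < 8; k++) {
            /* multiplexer-type definition: select a value, no side effects */
            double y = (x[k] > y_prev) ? x[k] : (x[k] + y_prev) / 2.0;
            printf("y[%d] = %.2f\n", k, y);
            y_prev = y;                            /* becomes y@1 of the next sample */
        }
        return 0;
    }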

Behavioral descriptions in SILAGE can be simulated efficiently by means of a compiled-code simulator, S2C, developed at IMEC. The SILAGE description of the DSP algorithm is translated into C code, which can subsequently be compiled and executed on general-purpose computers (e.g., Apollo or Sequent). S2C either takes the exact finite word-length specifications into account, or it can abstract from them for efficient simulation using the floating-point capabilities of the host machine.

B. Word-Length Optimization in DSP-Algorithms

The step of translating a floating-point description of a DSP algorithm into a fixed-point finite word-length description is in general not supported by CAD tools. Hence, currently extensive simulation is the only alternative. However, for a limited class of applications, namely, linear DSP such as digital filters, software techniques have been developed, which aid the designer in coming up with an optimized finite word-length for both the coefficients and the signals. A powerful set of tools has been collected in the DSPDIGEST environment [9], which is oriented towards the analysis, evaluation, and optimization of digital filters. The DSPDIGEST tools are largely independent of the target architectures to be defined in section IV. Only some of the objective functions for the optimization tasks should be chosen accordingly. DSPDIGEST includes a frequency-domain analysis program to compute transfer functions, noise, and other characteristics, even for multi-rate systems, and tools that carry out an optimization to discretize the multiplier coefficients in the filter [5]. These optimizations lead to a significant chip-area reduction. The signal word-length decision is supported by norm and bound information on overflow and quantization effects (including correlated effects such as limit cycles).

IV. DSP-ORIENTED ARCHITECTURAL STYLES

When implementing a DSP algorithm in silicon, the first design decision is to select a suitable architecture, based on the characteristics of the algorithm (modularity, parallelism, signal flow mechanism) and on the throughput specifications. Rather than conceiving a synthesis system capable of exploiting any kind of architecture in an efficient way, we selected four different architectural styles (Fig. 2). These four classes represent in a sense four extremes of the set of imaginable alternatives. Each of these styles contains enough flexibility to span a significantly large design space. They are all supported by a specific synthesis approach and CAD environment. Fortunately, many software techniques can be shared among these CAD environments [6]. The four styles are described below.

Fig. 2. ASIC architectural styles for DSP. (a) Microcoded multi-processors. (b) Multiplexed co-operating data-paths. (c) Regular arrays.

1) Hard-wired bit-serial architectures are well suited to implement linear filters with a relatively low sample rate, as used in audio, speech and some telecom applications [54], [55]. Each arithmetic operation of the algorithm is implemented as a separate 1-bit operator in the architecture. Signals are processed bit by bit, the least significant bit first. This implies that the data rate should approximate the clock rate divided by the signal word-length for optimal hardware exploitation (e.g., 10-MHz clock rate and 20-bit signals imply an optimal data rate of 500 kHz). The amount of logic and arithmetic operators is independent of the signal word-length for a pure bit-serial approach. However, the overhead in terms of bit-serial registers increases with the signal word-length. Therefore, efficient tools for word-length optimization and for register minimization in the architecture are of crucial importance in a bit-serial compiler.

If the specified data rate is lower than the optimal one, additional hardware sharing can be obtained by multiplexing independent signal sequences [54]. On the other hand, if the specifications require a data rate higher than the optimum, bits can be grouped to arrive at so-called digit-serial approaches [27], [54], in which a signal is processed in chunks of N bits by N-bit operators instead of 1-bit operators. The chip area consumed by the controller of a time-invariant bit-serial filter is negligible. In order to maintain this advantage, decision-making is limited to filter-coefficient programmability, where the coefficients are downloaded prior to the operation. Nonlinear extensions to the architecture cause substantial arithmetic overhead.

This architectural methodology is supported in the CATHEDRAL-I synthesis environment [30] (section VI). The corresponding library of bit-serial elements is limited to a set of about 20 leaf-cells implementing 1-bit registers, adders, scalers, and logic operators.
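As an illustration of the hard-wired bit-serial style, the following C sketch models a single bit-serial adder: one bit of each operand is consumed per clock cycle, least significant bit first, and the carry lives in a single flip-flop, so the amount of arithmetic hardware is independent of the word-length. This is a behavioural model for illustration, not a CATHEDRAL-I library cell.

    #include <stdio.h>
    #include <stdint.h>

    /* Model of a bit-serial adder over W clock cycles, LSB first.
       One full adder plus one carry flip-flop, independent of W. */
    #define W 20                          /* signal word-length */

    uint32_t serial_add(uint32_t a, uint32_t b)
    {
        uint32_t sum = 0;
        unsigned carry = 0;               /* the single carry flip-flop */
        for (int cycle = 0; cycle < W; cycle++) {
            unsigned ai = (a >> cycle) & 1;        /* bit arriving this cycle */
            unsigned bi = (b >> cycle) & 1;
            unsigned s  = ai ^ bi ^ carry;
            carry = (ai & bi) | (ai & carry) | (bi & carry);
            sum |= (uint32_t)s << cycle;           /* bit leaving this cycle  */
        }
        return sum & ((1u << W) - 1);              /* wrap to W bits */
    }

    int main(void)
    {
        printf("%u\n", serial_add(123456u, 654u)); /* prints 124110 */
        return 0;
    }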

2) Microcoded processors, composed of application-specific data-path modules and a multiple branch controller [3], are oriented towards low- to medium-rate algorithms combining heavy decision-making and computation-intensive array or matrix manipulations. Applications range from audio, speech, telecom, and algebraic processing to back-end image and video processing. Typical algorithms in this class require sample rates of 10 kHz up to 1 MHz. The achievable data rate is equal to the clock rate divided by the number of machine cycles needed by the processor to compute one frame of the algorithm. The achievable clock rate is determined by the worst-case delay of an operator of maximal word-length. Typically, for a 24-bit operator in a 3-µm CMOS technology, a delay of about 100 ns can be obtained, which results in a 10-MHz clock. The number of available machine cycles therefore varies between ten and a few thousand, such that these applications have to rely on (heavily) time-multiplexed and thus programmable operators. For simple algorithms of the class of CATHEDRAL-I, these programmable processors result in overkill.

In this style, a processor is a dedicated interconnection of bit-parallel execution units (EXUs) controlled by a multiple branch controller. Although customized towards a specific set of operations and heavily parameterizable in terms of word-length, register size, and so on, all execution units have the same basic structure. They consist of two input register-files and input multiplexers, a set of programmable arithmetic operators (building blocks), and output buffers. The library of EXUs is limited: at present the software supports ALUs, multiplier-accumulators, RAMs, address-computation units (ACUs), comparators, and ROMs. A processor may contain as many types and instances of execution units as the application demands. Matching the design towards the exact specifications thus allows designers to avoid the traditional bottle-necks of general-purpose signal processors. The key issues in the design methodology are efficient code generation and scheduling of the code in time on a customized and parallel data path.

This architectural methodology provides the target architecture in the CATHEDRAL-II/PIRAMID [15], [24], [29], [44], [56] synthesis environment (see section VII). It allows for an interactive search through a wide design space ranging from a single time-multiplexed ALU to a pipelined data-path composed of a set of EXUs.
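The cycle budget mentioned above follows directly from the clock and sample rates: the number of machine cycles available per frame is the clock rate divided by the sample rate. A small C sketch reproducing the ten-to-a-few-thousand range quoted in the text, assuming the 10-MHz clock of the example:

    #include <stdio.h>

    /* Machine cycles available per frame = clock rate / sample rate. */
    int main(void)
    {
        double f_clock = 10.0e6;                        /* 10-MHz clock     */
        double f_sample[3] = {10.0e3, 100.0e3, 1.0e6};  /* 10 kHz ... 1 MHz */
        for (int i = 0; i < 3; i++)
            printf("sample rate %8.0f Hz -> %6.0f cycles per frame\n",
                   f_sample[i], f_clock / f_sample[i]);
        return 0;
    }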

3) Bit-sliced multiplexed data-paths cooperate under control of a hierarchically decomposed controller [4], and are optimized in terms of critical timing paths. These can be successfully used to implement irregular and recursive high-speed algorithms as used in medium-level image, video, and radar processing applications. Also front-end audio and telecom applications require such high rates. The possible time-sharing of hardware in these algorithms is limited and typically varies between one and ten. Only pipelined parallel architectures can cope with the higher data rates, ranging from 1 MHz to 10 MHz. As for bit-serial, the data-path is matched to the signal flow in the algorithm. Hence this type of architecture will usually lead to a cluster of application-specific units (Fig. 2(b)). Application-specific units are fully dedicated interconnections of functional building blocks such as registers, adders, multiplexers, and so on. In this way they differ from the execution units introduced in item 2). Although they can be parameterized (e.g., to set the word-length), the latter have a predefined internal topology.

Because of the high throughput requirements, the kernel of the design methodology consists of tools to derive a customized data-path and to optimize the critical paths in order to maximize the clock rate. As opposed to the application domain in item 2), the control flow is simple, thereby avoiding a complex microcoded controller. The timing constraints are much more stringent, typically leading to a controller implemented as a partitioned finite state machine.

The architectural synthesis for this target style is supported in the CATHEDRAL-III environment [35] (section VIII).

4) Regular array architectures with a regular network of processing elements, mostly localized communication and distributed storage (Fig. 2(c)) are especially suited for the front-end modules in image, video, and radar processing. These algorithms are very modular in nature and operate on data rates very close to the achievable clock rate, as opposed to the other categories above [4].

If the array is fully pipelined and modular, systolic architectures result [33]. However, useful extensions include: processing elements with several "modes," broadcasting of signals which are not time-critical, and partly irregular networks located at the boundaries [4].

The architectural synthesis for this target style will be supported in the CATHEDRAL-IV environment, currently under development.

V. CHARACTERISTICS OF ARCHITECTURAL SYNTHESIS FOR DSP

A. Task Decomposition and Software Techniques

The complexity of the architectural synthesis process in the different CATHEDRAL compilers is mastered by a decomposition into a sequence of design tasks. For each task, specialized software techniques are developed, operating on signal-flow graph models. Basically the following types of software techniques are used:

1. Rule-based synthesis techniques are applied to match high-level constructs in the input description to low-level architectural templates. This matching is easy for simple arithmetic operations, but far more complicated for algorithmic delays, storage of matrices, RAM addressing, and double-precision operations. Internally, certain rules may include local optimizations. The construction of a rule base is not straightforward, since it must capture much of the knowledge of architectural designers. It needs to be flexible and expandable to cope with architectural changes. The use of rule-based synthesis techniques in the CATHEDRAL compilers will be illustrated in sections VI to VIII.

2. Analytic, algorithm-based optimization techniques are employed to optimize chip area and execution time. An important step in the definition of a synthesis methodology is therefore the abstraction of a number of optimization criteria at the circuit-design level (in terms of chip area, execution time, and eventually, power dissipation) to the architectural level. The optimization techniques are often based on combinatorial optimization and graph theory. Heuristics may be incorporated to deal with the exponential algorithmic complexity of most of these problems. Examples of analytic optimization tools in CATHEDRAL are presented in sections VI to VIII.

3. In addition, the CATHEDRAL systems rely on user interaction to evaluate certain design decisions and guide the synthesis process. In this way, delicate trade-offs between area and throughput can be made in very short time. The importance of user interaction in a compiler typically increases with the complexity and flexibility of the underlying architectural model. The methods for user interaction in CATHEDRAL are described in subsection V-B.

As a result of the task decomposition, there is no formal guarantee about the optimality of the final design, even if the separate tools would guarantee optimality for their specific task (which they generally do not). However, owing to a careful sequencing of the different tasks, possibly including local iteration loops, high-quality designs have already been obtained for a reasonable class of practical applications, in very short design time.

B. User-Interaction Mechanisms

During the architectural exploration stage, user interaction is allowed with the CATHEDRAL high-level synthesis tools. This is done by annotating the behavioral specification language SILAGE [28] with structural hints called "pragma" statements. Due to the very high cost of system verification by simulation, even at this high level of description, the algorithmic verification step should be repeated as few times as possible. Pragmas should therefore be entered and interpreted in such a way that the principle of "correctness by construction" remains valid. This cannot always be accomplished in full, but still, much design time can be saved in this way. The types of pragmas that are currently allowed include hints for the allocation of the type and number of building blocks (including their parameters), for the assignment (binding) of operators to operations and of signals to memory, for partitioning, and for the ordering of operations [44]. In addition, timing specifications between signals and also the global sample rate can be entered this way. Examples will be given in section VII. Note that pragma statements are optional; they only serve to overrule certain default decisions made by the compiler.

Although the pragma construct provides an elegant and efficient means of user interaction, certain specialized synthesis tasks require more powerful interaction mechanisms, especially at the highest design levels. An illustrative example is the optimized in-place mapping of signal arrays on RAM memory [53]. It would be extremely difficult and cumbersome to influence this decision using only simple pragmas at the SILAGE level. For this purpose, in a future version of our compilers we will allow a limited form of direct user interaction with a new internal intermediate description language used during synthesis. On the one hand this will allow the designer to prefix whatever is desired before a specific subtask is activated, but it will also allow us to modify the result of any subtask afterwards. In order to keep the "correct by construction" paradigm, the designer can only modify specific architectural attributes in this internal language, which affect the efficiency of the realization but not the algorithmic behaviour. Still this description will be simulatable at any intermediate level during the synthesis trajectory.

Finally, a number of hardware optimizations are directly steered by user-specified constraints or tolerances on throughput or area. Examples are the optimization or "merging" of the bus-interconnection network, the optimization of the inter-module communication hardware, and the reduction of foreground memory (registers and register-files) in CATHEDRAL-II (see section VII). In addition, it should be possible to weigh the relative importance of throughput and area. These tasks cannot simply be controlled by pragmas in the original description, as the tolerances depend on the final mapping results.

C. Application-Driven Synthesis

Developing a suitable design methodology should be approached from the point of view of system and architectural designers. The experience with several realistic demonstrator applications has been crucial to the development of efficient and useful synthesis methodologies. These demonstrators have been selected from many different application domains and have allowed us to identify and clarify the distinction between the CATHEDRALs in terms of both architectural and synthesis issues. Most of these demonstrators will not be explained here due to lack of space. Moreover, many of them have already been published elsewhere (see e.g., [3], [4], [7], [30], [35], [43], [44], [54], [55]). They substantiate our efficiency claims presented above, in the sense that these, often manual, exercises have resulted in architecture realizations which would be acceptable in a practical, i.e., industrial context.

In the sequel, the architectural synthesis methodology of the CATHEDRAL-I, -II, and -III compilers will be outlined. Each of these systems will be illustrated with a test-vehicle which is typical for its application domain.

VI. BIT-SERIAL DATA-PATH SYNTHESIS IN CATHEDRAL-I

CATHEDRAL-I [30] is an automatic synthesis system for linear DSP algorithms such as digital filters. It provides an automatic mapping into a hard-wired bit-serial implementation.

A. The CATHEDRAL-I Tool-Box

The flow chart of the CATHEDRAL-I tool-box is shown in Fig. 3. The signal-flow graph is converted into a bit-serial architecture by the rule-based synthesis program AMAI [30]. Typical tasks include the expansion of coefficient multiplications into shifts and additions and the optimization of signal inversions.

Fig. 3. Outline of the CATHEDRAL-I tool-box for architectural synthesis.

Next the delay management task is solved analytically by the program COMPASS [23]. It optimally redistributes fractions of the sample delay operations, to compensate for the inherent delay of the bit-serial operators in the architecture. The objective is to minimize the total shift-register area, while maintaining correct input/output behavior. Delay management is based on a retiming transformation [34] by means of a linear programming routine. The resulting delay redistribution is used by AMAI to generate the controller.

The bit-serial architecture produced by AMAI and COMPASS is described in terms of a netlist of bit-serial standard-cells. The designer chooses between a flexible and automated standard-cell place-and-route package or a gate-array design system to interactively route the chip. The latter method generates more compact layouts by using abutment ("LEGO"-style [54], [55]).

B. Design of a 5th Order Wave-Digital Filter

The digital filter benchmark proposed at the 3rd ACM/IEEE High-Level Synthesis Workshop has been adopted as a test-vehicle [10]. The wave digital lattice filter synthesis tool FALCON [22] has been used to generate a suitable filter structure. The result is a canonical 5th order lattice filter with circulator sections (Fig. 4). It meets the frequency specs as verified with the DSPDIGEST analysis program [9], and exhibits many advantages compared to alternative filter structures: the arithmetic complexity is low, the scaling properties are favorable [18], and the register count after delay management is small. Moreover, all the other desirable properties of wave digital filters, such as reduced coefficient sensitivity and the provable absence of limit-cycles under certain conditions [18], continue to hold.

Fig. 4. Signal-flow graph of 5th order wave-digital lattice filter, synthesized with FALCON.

In a first design step, the nominal real-valued filter coefficients have to be represented with a limited number of bits. A shift-add based multiplication scheme is used, for which the canonical-signed-digit code leads to a very compact result. As every nonzero bit directly contributes to the implementation cost, the total number of coefficient bits is minimized while satisfying the frequency-domain specifications [5]. In this design, the number of bits is reduced to only 9 for 5 coefficients.
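For readers unfamiliar with it, the canonical-signed-digit (CSD) code writes a coefficient with digits -1, 0, +1 such that no two nonzero digits are adjacent, which minimizes the number of shift-and-add/subtract terms. The C sketch below is a generic textbook CSD recoder for a non-negative integer coefficient; it only illustrates the representation and is not the DSPDIGEST discretization step, which in addition trades coefficient bits against the frequency-domain specifications.

    #include <stdio.h>

    /* Recode a non-negative integer coefficient into canonical signed
       digit (CSD) form, i.e. the non-adjacent form with digits in
       {-1, 0, +1}.  digits[i] is the digit weighted by 2^i; returns
       the number of digits produced.                                  */
    int to_csd(int n, int digits[], int max_digits)
    {
        int i = 0;
        while (n != 0 && i < max_digits) {
            int d = 0;
            if (n & 1)
                d = 2 - (n & 3);      /* +1 if n mod 4 == 1, -1 if n mod 4 == 3 */
            digits[i++] = d;
            n = (n - d) >> 1;
        }
        return i;
    }

    int main(void)
    {
        int d[16];
        int len = to_csd(23, d, 16);  /* 23 = 32 - 8 - 1 */
        for (int i = len - 1; i >= 0; i--)
            printf("%s ", d[i] > 0 ? "+1" : (d[i] < 0 ? "-1" : "0"));
        printf("\n");                 /* prints: +1 0 -1 0 0 -1 */
        return 0;
    }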

Next the signal word-lengths are determined. Analytic techniques available in the DSPDIGEST tool-box allow us to compute "templates" [9] which collect most of the information needed to optimize the word-length of all signal nodes. For the example, 20-bit signals prove to be sufficient to obtain a signal-to-noise ratio of 96 dB.

Finally, the standard-cell floorplan shown in Fig. 5 has been generated automatically. It should be noted that flattening the design to the level of leaf-cells during the floorplanning stage allows for additional optimization but makes it almost impossible to identify the hierarchy present in the original signal-flow graph (Fig. 4). Since the bit-serial architectures are hard-wired and as only 1-bit operations are implemented, the resulting chip area is very small. Actually, the area obtained for this implementation is the smallest of the 3 target architectures envisioned in this paper [10]. A sampling frequency up to 1 MHz can be reached, provided that the frequency characteristic of the filter is scaled with respect to the original specification.

Fig. 5. Standard-cell floorplan of bit-serial 5th order wave-digital lattice filter, generated with CATHEDRAL-I.

VII. MICROCODED PROCESSOR SYNTHESIS IN CATHEDRAL-II

CATHEDRAL-II is oriented towards microcoded multi-processor architectures as defined in section IV.

A. The CATHEDRAL-II Tool-Box

The first task in the architectural synthesis process is the partitioning of the SILAGE description into different processors. This should be done in such a way that the individual processors can be synthesized independently, whenever possible. Currently, partitioning is done manually, based on criteria such as data dependencies in the algorithm, balancing of the arithmetic and memory load of partitions, and the type of control flow in different parts of the description. After the detailed synthesis of the different processors (see below), the inter-processor communication network is generated automatically [12]. A selection between a RAM- or FIFO-like communication protocol is made, and the size of these modules is optimized.

Within every processor the architectural primitives or EXUs have to be determined and bound to the high-level SILAGE operations. This process amounts to the following tasks:

- EXU-type selection: the task of determining which types of EXUs, available in the module library, are best suited to implement a given operation.
- EXU allocation: the determination of the number of instances of each EXU type that is required in the processor data-path.
- EXU assignment: the detailed binding of EXU instances to operations.
- Scheduling: ordering of the register-transfers in time, taking into account resource constraints.
- Bus network minimization: merging of buses while trading off machine cycles for chip area, due to bus contention.
- Memory minimization: sharing of storage locations by different signals based on life-time analysis. This allows the size of register-files and RAMs to be reduced (a small illustration of life-time based sharing is given after this list).
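The life-time based sharing mentioned in the last item can be illustrated as follows: signals whose life-times do not overlap may occupy the same storage location. The C sketch below uses a greedy, left-edge style assignment on hypothetical life-time intervals; it is a generic illustration of the principle, not the actual CATHEDRAL-II memory minimization algorithm.

    #include <stdio.h>

    /* A signal is alive from the cycle it is produced (birth) up to
       the last cycle it is consumed (death).                        */
    struct life { int birth, death; };

    /* Greedy assignment of signals (sorted by birth time) to storage
       locations: reuse a location as soon as its previous occupant
       has died.  Returns the number of locations needed (<= 64 here). */
    int assign_storage(const struct life *sig, int n, int loc_of[])
    {
        int free_at[64];                 /* cycle at which each location frees up */
        int n_loc = 0;
        for (int i = 0; i < n; i++) {
            int l;
            for (l = 0; l < n_loc; l++)
                if (free_at[l] < sig[i].birth)   /* previous occupant already dead */
                    break;
            if (l == n_loc) n_loc++;             /* need a fresh location */
            free_at[l] = sig[i].death;
            loc_of[i] = l;
        }
        return n_loc;
    }

    int main(void)
    {
        /* hypothetical life-times, sorted by birth cycle */
        struct life sig[5] = {{0,3},{1,2},{3,6},{4,5},{6,8}};
        int loc[5];
        int n = assign_storage(sig, 5, loc);
        for (int i = 0; i < 5; i++)
            printf("signal %d -> location %d\n", i, loc[i]);
        printf("%d locations instead of 5\n", n);   /* 2 locations */
        return 0;
    }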

The synthesis process for a single processor is illustrated in Fig. 6. The processor architecture is synthesized iteratively, with the help of user interaction. In a single iteration, starting from the behavioral description, a first version of a data path is generated by the rule-based compiler JACK-THE-MAPPER [56]. Based on expert knowledge about the available libraries, the high-level SILAGE operations and the architectural primitives (execution units) are matched. This includes the generation of addressing schemes for RAMs and the optimization of delay lines. The tool decides on EXU-type selection and EXU allocation, according to a default strategy, which corresponds to the smallest but slowest implementation alternative. These default decisions can be overruled in additional design iterations, as will be indicated below. JACK-THE-MAPPER also generates a contention-free bus network to connect the different EXUs. During the mapping, new local signals or operations are often created. For example, multiplications or divisions on an ALU are expanded into a sequence of shifts and additions; when iteration constructs occurring in the SILAGE code are implemented as procedural loops, loop counters are generated and maintained in the data path.

Fig. 6. Outline of the CATHEDRAL-II tool-box for architectural synthesis.

The output of JACK-THE-MAPPER is a netlist of execution units which is further refined in the subsequent stages, and a set of instructions and parameters for each execution unit. In accordance with the resulting structure, the SILAGE description is transformed into an untimed register-transfer description. This description corresponds to an expanded data or signal-flow graph representation engrafted on the generated data-path.

The design iteration is then completed by the analytic optimization tools ATOMICS and MICROMECCANO. ATOMICS [24], [25] is a microcode compiler, which transforms the register-transfer description into the microcode tables for the multiple branch controller. ATOMICS first schedules the register-transfer code on the time axis, taking into account the data path generated by JACK-THE-MAPPER. An extended list-scheduling algorithm [11] is used, which in comparison to other techniques (e.g., [39]) exhibits a lower algorithmic complexity and is better suited for interactive use since constraints on hardware allocation can easily be taken into account. When sufficient hardware is allocated, ATOMICS is able to produce highly concurrent schedules. To reach this goal, both the algorithmic parallelism in the flow graph is exploited and iterative sequences of operations are pipelined. The latter is achieved by optimizing the loop control-flow in the register-transfer code. During scheduling the tool assigns all transfers to particular execution unit instances, and finally it decides on register utilization, before the microcode is generated. The interconnection network is then optimized by MICROMECCANO [56]. By merging buses, the tool reduces the interconnect area at the expense of machine cycles.
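As background for the scheduling step, the following C sketch shows a generic resource-constrained list scheduler: in every machine cycle it starts the ready operations, in a fixed priority order, on the execution units that are still free. The small data-flow graph and the two identical single-cycle EXUs are hypothetical, and the sketch deliberately omits what ATOMICS adds on top of plain list scheduling (multi-cycle operations, loop pipelining, EXU assignment).

    #include <stdio.h>

    #define N_OPS 6
    #define N_EXU 2                      /* two identical single-cycle EXUs */

    /* pred[i][j] != 0 means operation j must finish before operation i. */
    static const int pred[N_OPS][N_OPS] = {
        /* op0 */ {0,0,0,0,0,0},
        /* op1 */ {0,0,0,0,0,0},
        /* op2 */ {1,1,0,0,0,0},         /* op2 needs op0 and op1 */
        /* op3 */ {1,0,0,0,0,0},         /* op3 needs op0         */
        /* op4 */ {0,0,1,1,0,0},         /* op4 needs op2 and op3 */
        /* op5 */ {0,0,0,1,0,0},         /* op5 needs op3         */
    };

    int main(void)
    {
        int done[N_OPS] = {0}, start[N_OPS];
        int scheduled = 0, cycle = 0;

        while (scheduled < N_OPS) {
            int finished_before[N_OPS];
            int used = 0;
            for (int i = 0; i < N_OPS; i++)
                finished_before[i] = done[i];     /* results of earlier cycles only */

            /* start ready operations; lowest index = highest priority */
            for (int i = 0; i < N_OPS && used < N_EXU; i++) {
                int ready = !done[i];
                for (int j = 0; j < N_OPS; j++)
                    if (pred[i][j] && !finished_before[j])
                        ready = 0;
                if (ready) {
                    start[i] = cycle;
                    done[i] = 1;
                    scheduled++;
                    used++;
                }
            }
            cycle++;
        }
        for (int i = 0; i < N_OPS; i++)
            printf("op%d starts in cycle %d\n", i, start[i]);
        printf("schedule length: %d cycles\n", cycle);   /* 3 cycles here */
        return 0;
    }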

Fundamental design figures, such as the resulting throughput and the estimated chip area, can now be evaluated by the designer. This is based on the data path and machine-cycle count determined by the compiler, and on accurate area and timing views of the execution unit instances provided by the module generators.

The design can be improved in additional iterations, by specifying compiler directives or pragmas (see examples in subsection VII-B). With these pragmas, a designer can influence two basic parameters in the hardware-binding process: the EXU-type selection and the EXU allocation. Experience has shown that a designer often has good insight in the complexity and computational bottle-necks of a DSP algorithm, and is therefore well placed to judge these parameters. The more complex and error-prone design tasks like scheduling and EXU assignment, on the other hand, are completely automated. Still, the expert designer is given the possibility to influence these results with appropriate pragmas.

B. Design of a Pitch-Extraction Processor for Speech Recognition

The CATHEDRAL-II synthesis tools have been integrated into a complete design system called PIRAMID, including a module library and a floorplanning environment [29]. To illustrate the possibilities of the PIRAMID compiler, we have chosen a complex speech-processing application, more in particular a system for pitch extraction [47]. This application is a typical representative of the class of decision-making algorithms. It combines a complex FOR-loop structure with a large amount of data-dependent conditional operations. The pitch of the incoming speech signal is extracted from its frequency spectrum using pattern-matching techniques.

1) The pitch-extraction algorithm: The algorithm is block-oriented, operating on speech segments of 40 msec. Every 10 msec, a pitch signal is extracted. Hence the blocks have an overlap of 30 msec. The sampling frequency is 6.4 kHz and every segment contains 256 speech samples. Per segment, 64 real and 64 imaginary components of the frequency spectrum are calculated, which are the inputs for the pitch-extraction algorithm.

A block diagram of the pitch-extraction algorithm is shown in Fig. 7. In a first step, 64 signal amplitudes are derived and a threshold value is defined. In a second step, based on parabolic interpolation, the frequency points are determined that correspond to the 8 largest maxima of the amplitude spectrum which are above the threshold. In a third step, these 8 frequency points are used in a pattern-matching operation called "harmonic sieve detection." The aim is to determine the best match between the maxima of the frequency-domain spectrum and one out of 40 predefined spectral patterns. The characteristics of the best matching pattern are finally used in a fourth step, to compute the pitch of the speech signal. To illustrate the complexity of the algorithm, the first and third step will be discussed in more detail.

The amplitude of a signal is calculated from its real and imaginary components. Nonlinear functions like the amplitude can be implemented via table lookup, iterative algorithms, etc. We implemented a piecewise linear approximation, described in [47]:

a1 := max(abs(real), abs(imag))
a2 := min(abs(real), abs(imag))
if 4*a2 < a1 then c1 := 0.996 and c2 := 0.123
else if 2*a2 >= a1 then c1 := 0.817 and c2 := 0.589
else c1 := 0.941 and c2 := 0.348
ampl(real, imag) := c1*a1 + c2*a2

This is a typical "multi-way" decision function, since the appropriate values for the coefficients c1 and c2 have to be selected from three different possibilities. It can be realized very efficiently using the multiple branch controller in the CATHEDRAL-II architecture. In addition, one threshold value must be calculated, defined as the maximum of all amplitude values in a data block, divided by 16. As an example, the SILAGE code for the amplitude function is given in Table 1.
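A floating-point behavioral model of the same piecewise-linear approximation is given below, mainly to show how close it is to the exact magnitude. It deliberately omits the W16 fixed-point coercions of the SILAGE version, and the test values are arbitrary.

#include <stdio.h>
#include <math.h>

/* Piecewise-linear magnitude approximation used in the pitch extractor:
   ampl = c1*max(|re|,|im|) + c2*min(|re|,|im|), with (c1,c2) selected
   by a three-way decision on the ratio of the two components.          */
static double ampl_approx(double re, double im)
{
    double a1 = fmax(fabs(re), fabs(im));
    double a2 = fmin(fabs(re), fabs(im));
    double c1, c2;

    if (4.0 * a2 < a1)       { c1 = 0.996; c2 = 0.123; }
    else if (2.0 * a2 >= a1) { c1 = 0.817; c2 = 0.589; }
    else                     { c1 = 0.941; c2 = 0.348; }

    return c1 * a1 + c2 * a2;
}

int main(void)
{
    const double test[][2] = { {1.0, 0.1}, {0.7, 0.3}, {0.5, 0.5} };

    for (int i = 0; i < 3; i++) {
        double re = test[i][0], im = test[i][1];
        printf("re=%.2f im=%.2f  approx=%.4f  exact=%.4f\n",
               re, im, ampl_approx(re, im), sqrt(re * re + im * im));
    }
    return 0;
}

For the three test points the approximation stays within about one percent of the true magnitude, which is why it is attractive compared to a table lookup or an iterative square-root computation.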

In the third step, an "optimal harmonic sieve" must be selected out of 40 predefined candidates. Every harmonic sieve is characterized by its fundamental frequency. A harmonic sieve represents a pattern in the frequency domain consisting of meshes placed at the multiples j*f0 of its fundamental frequency f0 (for j from 1 to 8). The optimal sieve is the one whose meshes best match the 8 spectral maxima of the incoming speech signal. It corresponds to the sieve for which Ci = Ki/(Ni + Li) is maximum, where

Ni is the number of spectral maxima (obtained in step 2 of the algorithm) which do not exceed the upper bound of mesh number 8 in harmonic sieve i;

Fig. 7. Block diagram of pitch-extraction system for speech analysis. Between the blocks the communicating signals are indicated. In case of array signals, the dimension is shown as well.


Table 1. SILAGE Description of the Amplitude Function in a Pitch-Extraction System (see first block of Fig. 7)

#define W16 num<16,12>   /* define signal type */

func main(freal, fimag : W16[64]) ampl : num[]; threshold : num =
/* freal and fimag are arrays of input signals, ampl is the output array.
   All three are of length 64 and have type W16 */
begin
  thresh[0] = W16(0);
  (k1 : 0..63) ::   /* iterate to define 64 amplitude signals */
  begin
    /* define coefficients for the piecewise linear approximation.
       The 'if'-statement corresponds to a multiplexing function.
       'abs' returns the absolute value (see below) */
    a1[k1] = if (abs(freal[k1]) >= abs(fimag[k1])) -> abs(freal[k1])
             || abs(fimag[k1]) fi;
    a2[k1] = if (abs(freal[k1]) >= abs(fimag[k1])) -> abs(fimag[k1])
             || abs(freal[k1]) fi;
    c1[k1] = if (W16(a2[k1]<<2) < a1[k1]) -> W16(0.996)
             || (W16(a2[k1]<<1) >= a1[k1]) -> W16(0.817)
             || W16(0.941) fi;
    c2[k1] = if (W16(a2[k1]<<2) < a1[k1]) -> W16(0.123)
             || (W16(a2[k1]<<1) >= a1[k1]) -> W16(0.589)
             || W16(0.348) fi;
    /* define amplitude by means of the piecewise linear approximation.
       The 32-bit result of every multiplication is coerced into 16 bits */
    ampl[k1] = W16(a1[k1]*c1[k1]) + W16(c2[k1]*a2[k1]);
    thresh[k1 + 1] = if (thresh[k1] < ampl[k1]) -> ampl[k1]
                     || thresh[k1] fi;
  end;
  /* define threshold as the maximum amplitude divided by 16 */
  threshold = thresh[64] >> 4;
end;

/* 'abs' is a user-defined function, valid for any signal type */
func abs(a : num) : num =
begin
  return = if (a < 0) -> -a || a fi;
end;

Li is the number of meshes in sieve i that have a lower bound below the Ni-th spectral maximum. Ki is the number of spectral maxima that pass through one of the lowest Li meshes of sieve i.
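The selection itself then reduces to evaluating Ci = Ki/(Ni + Li) for every candidate sieve and keeping the maximum, as the sketch below illustrates. How Ni, Li, and Ki are derived from the spectral maxima is abstracted into synthetic arrays here, so this is only an illustration of the decision step, not the CATHEDRAL-II implementation.

#include <stdio.h>

#define N_SIEVES 40

int main(void)
{
    /* Illustrative counts only; in the real algorithm N, L and K are
       derived per sieve from the 8 spectral maxima of the input frame. */
    int N[N_SIEVES], L[N_SIEVES], K[N_SIEVES];
    for (int i = 0; i < N_SIEVES; i++) {        /* synthetic test data */
        N[i] = 8;
        L[i] = (i % 5) + 1;
        K[i] = (i % 7) + 1;
    }

    int best = -1;
    double best_c = -1.0;
    for (int i = 0; i < N_SIEVES; i++) {
        double c = (double)K[i] / (double)(N[i] + L[i]);  /* Ci = Ki/(Ni+Li) */
        if (c > best_c) { best_c = c; best = i; }
    }

    printf("best sieve: %d  (C = %.3f)\n", best, best_c);
    return 0;
}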

2) Architecture design and layout: Two alternative implementations of this algorithm have been generated. The results are summarized in Table 2. In the first alternative, JACK-THE-MAPPER by default allocates one ALU, one ACU, and one small ROM to contain constants. Multiplications are automatically expanded into shifts and additions on the ALU. The ACU is used to keep track of the FOR-loop counters and to compute RAM addresses. The following EXU allocation pragma has been added in the SILAGE description to allocate 2 RAM memories:

%alloc(ram, 2);

The SILAGE description is now translated into RT code, which is scheduled by ATOMICS in 10034 machine cycles. Initially the data path contained seven buses, but this figure has been reduced to two by MICROMECCANO, at the expense of a 3% growth of the cycle count. We rejected an alternative single-bus solution, because it required 10% more cycles. In order to estimate the data rate, the module generators are called to compute a timing view of every EXU instance. This is based on accurate timing formulas, taking into account the internal delay paths for every set of parameters. In this design the clock period (machine cycle), which is determined by the slowest EXU, equals 113 nsec. The results are shown in Table 2, together with some basic parameters of the EXUs and the controller, which are determined by the synthesis tools. An estimate of the total chip area is given as well.


Table 2. Comparison Between Two Implementations of the Pitch-Extraction Algorithm (all figures hold for a 1.6 um 2-metal CMOS process)

                                         Alternative 1   Alternative 2
RAM 1        size (words x bits)         267 x 16        268 x 16
             register file 1             2               3
             register file 2             5               6
             timing view (ns)            79              79
RAM 2        size (words x bits)         32 x 8          32 x 8
             register file 1             1               1
             register file 2             3               4
             timing view (ns)            77              77
ALU          register file 1             8               9
             register file 2             6               6
             timing view (ns)            113             71
ACU          register file 1             7               10
             register file 2             2               3
             timing view (ns)            84              86
Constant ROM number of words             17              19
             timing view (ns)            35              35
Multiplier   size (bits x bits)          -               16 x 16
             register file 1             -               4
             register file 2             -               3
             timing view (ns)            -               130
Controller   number of words             288             223
             width of the instr. (bits)  92              108
             timing view (ns)            55              48
Execution cycles                         10034           7794
Chip area                                46.1 mm2        52.2 mm2


In a second alternative, a parallel multiplier has been added in an attempt to speed up execution. For this purpose the following EXU-type selection pragma is added:

%assign((_*_), mult, _);

The underscore character "_" represents a wild card, so that all multiplications in the SILAGE code are mapped onto a multiplier. The default allocation for this multiplier is one instance. This solution is scheduled in only 7794 machine cycles. The clock period, however, increases to 130 nsec, due to the long delay path in the multiplier (see Table 2). Moreover, this second alternative has a 10% larger chip area. Therefore, the first alternative is preferred as the final solution. Its architecture is shown in Fig. 8. The data path is 16 bits wide, except for the ACU, which only requires 9 bits.
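A rough consistency check of this trade-off can be made from the cycle counts and clock periods of Table 2 alone; it assumes that one schedule execution corresponds to one 10 ms analysis frame, which the figures suggest but the text does not state explicitly.

#include <stdio.h>

int main(void)
{
    /* figures taken from Table 2 */
    long   cycles1 = 10034, cycles2 = 7794;
    double tclk1_ns = 113.0, tclk2_ns = 130.0;

    double t1_us = cycles1 * tclk1_ns / 1000.0;   /* ~1134 us per frame */
    double t2_us = cycles2 * tclk2_ns / 1000.0;   /* ~1013 us per frame */

    printf("alternative 1: %.0f us per frame\n", t1_us);
    printf("alternative 2: %.0f us per frame\n", t2_us);
    printf("execution-time gain of alternative 2: %.1f%%\n",
           100.0 * (t1_us - t2_us) / t1_us);
    return 0;
}

Under that assumption both alternatives finish well within the 10 ms real-time budget, and the roughly 11% gain in execution time of the multiplier version does not compensate for its longer clock period and larger area.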

Fig. 8. Data path of the pitch-extraction system as generated by CATHEDRAL-II.

Finally, the PIRAMID module generators produce the detailed layouts of the modules, and the final chip is composed with a floorplanner. The result is shown in Fig. 9. Note that the module generator for the ALU automatically strips the saturation unit, which is not required by the algorithm, and selects the appropriate shifter options. Due to the large amount of decision-making operations, the controller modules consume a significant portion of the total chip area. The total area, including bonding pads, equals 46.1 mm2 (6.5 mm x 7.1 mm) in a 1.6-um 2-metal CMOS process.

VIII. COOPERATING DATA-PATH SYNTHESIS IN CATHEDRAL-III

CATHEDRAL-III is oriented towards cooperating data-path architectures (see Section IV). In these architectures, hard-wired bit-sliced data paths communicate over buses with distributed memories, steered by a hierarchical controller. These data paths can be viewed as application-specific units, which can be shared by several clusters of operations [35].

A. The CATHEDRAL-III Tool-Box

The synthesis process in CATHEDRAL-III consists of two major tasks: a high-level synthesis task and a low-level assembly task. In the first task the behavioral description is optimized and transformed into a control structure and a cooperating data-path architecture. Although this task is currently performed manually, CAD tools are under development. It will not be further discussed in this paper.

The goal of the assembly task is to optimize the organization of the data path and to assemble the final layout, taking area and performance requirements into account. The input description for this task is a signal flow graph, which corresponds to a data-path specification. A tool called CHOPIN [35] has been developed to support the assembly process. The assembly task consists of two main parts: a first part where the system takes a number of decisions which affect the area and throughput of the design, and a second part which involves user interaction in order to tune the performance. An overview of this tool-box is presented in Fig. 10. In the design stage the following subtasks can be identified:

1. In a first step, the hardware assignment task, functional building blocks are selected from the library and assigned to every operator in the signal flow graph.

2. The next task is the linear placement of the functional building blocks. The algorithm in CHOPIN is based on a simulated annealing routine with adaptive parameters [5], and is able to optimize a cost function which takes the total interconnection length into account [45]. However, a new penalty function is introduced which takes into account that the number of connections over a functional block is limited by the available feed-throughs for that functional block. The placement algorithm guarantees 100% routability.

3. The third step is a track assignment step. Each net still has to be assigned to one of the possible physical feed-through tracks, or to an equivalent input or output terminal. An optimal track assignment will yield the smallest possible routing area between all the functional blocks.

4. In order to evaluate the performance of the design, a capacitance estimation is needed, which takes input capacitances and over-the-cell routing capacitances into account. These capacitances are used in a timing analysis program, which is based on a critical path algorithm (a minimal sketch of such a critical-path check is given after this list). If the delay of the critical path is smaller than the clock period Tclock, a valid implementation of the data path is obtained. Otherwise, a number of performance optimization steps [16] have to be performed, as described in the next point.

5. At the architectural level, we define performance optimizations at 3 sublevels (see the feedback arcs in Fig. 10). The first transformation is possible at the register-transfer level. Given a maximum clock period, this step involves pipelining the circuit by replacing and inserting registers at the appropriate places, so that the throughput increases without affecting the functionality. For this purpose the technique of retiming [34] is exploited. Pipelining is supported both at the word level and the bit level in a fully automated way; bit-level pipelining can result in a bit-sliced systolic solution in the extreme case. In contrast to the nonrecursive case, the introduction of additional registers into loops or recursive flow graphs requires a lot of overhead. In this case, the throughput can, however, be increased by selecting and assigning faster functional building blocks. This is the optimization task at the functional level. Another important feature at this level is the automatic conversion of a data path from the carry-propagate approach to the carry-save approach [51], [52]. At the circuit level, the driving capability of a number of leaf-cells can be optimized depending on their load capacitances. If the maximal load of a cell is exceeded, it is replaced by a functionally equivalent cell with a higher driving capacity.

6. The last part of the assembly task is the conversion of the internal data-path representation into a netlist which can be sent to the module generation environment [46].
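As announced in step 4, the essence of the timing analysis is a longest-path computation over the data-path network. The sketch below does exactly that for a tiny combinational example; the cell delays, topology, and clock constraint are invented and have nothing to do with CHOPIN's internal delay models.

#include <stdio.h>

#define N_CELLS 5

/* delay of every cell instance (ns), e.g. obtained from the leaf-cell library */
static const double delay_ns[N_CELLS] = { 2.0, 3.5, 1.0, 4.0, 2.5 };

/* fanin[i][j] = 1 if the output of cell j drives an input of cell i.
   The cells are assumed to be listed in topological order.            */
static const int fanin[N_CELLS][N_CELLS] = {
    {0,0,0,0,0},
    {1,0,0,0,0},
    {1,0,0,0,0},
    {0,1,1,0,0},
    {0,0,1,1,0}
};

int main(void)
{
    double arrival[N_CELLS];
    double critical = 0.0;
    const double t_clock = 15.0;     /* required clock period (ns) */

    /* longest-path propagation: arrival(i) = delay(i) + max arrival(fanin) */
    for (int i = 0; i < N_CELLS; i++) {
        double worst_in = 0.0;
        for (int j = 0; j < i; j++)
            if (fanin[i][j] && arrival[j] > worst_in)
                worst_in = arrival[j];
        arrival[i] = delay_ns[i] + worst_in;
        if (arrival[i] > critical)
            critical = arrival[i];
    }

    printf("critical path: %.1f ns -> %s\n", critical,
           critical <= t_clock ? "valid implementation"
                               : "needs performance optimization");
    return 0;
}

When the critical path exceeds the clock period, the feedback loops of step 5 (retiming, faster building blocks, carry-save conversion, drive sizing) are invoked until the constraint is met.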


Fig. 9. Layout of the pitch-extraction chip generated with CATHEDRAL-II/PIRAMID.



B. Design of a Bit-Parallel Video Signal Convertor

To demonstrate the proposed methodology, the design of a digital video matrix system with decimation filters (DMF) is presented (Fig. 11). The algorithmic specifications and a first implementation can be found in [49].



Fig. 10. Assembly task overview as implemented in CATHEDRAL-III.

Fig. 11. Digital video matrix with decimation filters. The sample rates at the successive stages are 27 MHz, 13.5 MHz, 6.75 MHz, and 3.375 MHz.

Video sources normally generate three color signals: a red, a green, and a blue component (R, G, and B). Subsequent processing typically requires these signals to be converted into a luminance signal (Y) and two chrominance signals (U and V). Since the red, green, and blue color signals are oversampled to 27 MHz and then converted by the RGB matrix at this rate, the resulting Y, U, and V signals need to be decimated again. Depending on the output frequency of the chrominance signals, the resulting signals are filtered by a number of decimation stages. Each of these stages filters the signal to half of its frequency bandwidth and decimates it by a factor of two.

The three color signals are oversampled to 27 MHz and converted to 8 bits. The actual signal transformation is then defined by the following formulas, where a word-length of 16 bits is assumed in the matrix:

Y = (77*R + 150*G + 29*B)/256      (1)

U = (-44*R - 87*G + 131*B)/256     (2)

V = (131*R - 110*G - 21*B)/256     (3)
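A straightforward behavioral model of equations (1)-(3) in C is shown below; it uses plain integer arithmetic and ignores the 16-bit internal word-length and rounding strategy of the actual chip, so it is only a functional reference.

#include <stdio.h>

/* Behavioral model of the RGB-to-YUV matrix, eqs. (1)-(3). */
static void rgb_to_yuv(int r, int g, int b, int *y, int *u, int *v)
{
    *y = ( 77 * r + 150 * g +  29 * b) / 256;
    *u = (-44 * r -  87 * g + 131 * b) / 256;
    *v = (131 * r - 110 * g -  21 * b) / 256;
}

int main(void)
{
    int y, u, v;
    rgb_to_yuv(255, 255, 255, &y, &u, &v);   /* white: Y = 255, U = V = 0 */
    printf("white -> Y=%d U=%d V=%d\n", y, u, v);
    rgb_to_yuv(255, 0, 0, &y, &u, &v);       /* pure red                  */
    printf("red   -> Y=%d U=%d V=%d\n", y, u, v);
    return 0;
}

Note that the Y coefficients sum to exactly 256, so white maps to full luminance, and the U and V coefficients sum to zero, so a gray-scale input produces zero chrominance.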

The decimation filter for the luminance signal and the decimation filters for the two chrominance signals are 10th- and 6th-order FIR half-band filters, respectively, characterized by the following transfer functions:

H(z) = sum_{n=-5}^{5} a_n z^(-n+5),  with a_n = a_{-n}   (for the Y signal)          (4)

H(z) = sum_{n=-3}^{3} a_n z^(-n+3),  with a_n = a_{-n}   (for the U and V signals)   (5)

The implementation of the digital matrix and the decimation filters elegantly demonstrates some of the features of the synthesis system. In the sequel, only the implementation of the Y path is considered. The U and V paths and the filters can be treated in a similar way.
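The behavior of one decimation stage can be modeled as below. The symmetric coefficients are placeholders with the typical half-band structure (center tap 0.5, even-offset taps zero), not the coefficients of the actual chip, which the paper does not list; the decimation by two is obtained by evaluating the FIR output only at every second input position.

#include <stdio.h>

#define N_TAPS 11                    /* 10th-order filter: 11 taps */
#define N_IN   32

/* Placeholder symmetric (linear-phase) coefficients, a[n] = a[-n]. */
static const double a[N_TAPS] = {
    0.01, 0.00, -0.06, 0.00, 0.30, 0.50, 0.30, 0.00, -0.06, 0.00, 0.01
};

int main(void)
{
    double x[N_IN], y[N_IN / 2];

    for (int n = 0; n < N_IN; n++)           /* arbitrary test input */
        x[n] = (n % 4 < 2) ? 1.0 : -1.0;

    /* Filter and decimate by two: compute the convolution only at
       every second output position.                                 */
    for (int m = 0; m < N_IN / 2; m++) {
        double acc = 0.0;
        for (int k = 0; k < N_TAPS; k++) {
            int idx = 2 * m - k;
            if (idx >= 0 && idx < N_IN)
                acc += a[k] * x[idx];
        }
        y[m] = acc;
    }

    for (int m = 0; m < N_IN / 2; m++)
        printf("y[%2d] = %6.3f\n", m, y[m]);
    return 0;
}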

The coefficient multiplications (with 77/256, 150/256, and 29/256) in the RGB matrix (see eq. (1)) are first expanded into shift and add operations (a minimal sketch of such an expansion is given after the list below), and a signal flow graph description is generated. Starting from this intermediate-level description, an interactive session contains the following steps:

- By default, the add and shift operations are mapped onto a ripple adder and a hard-wired shifter, respectively.
- In a second implementation, all the ripple adders are converted to carry-save adders to speed up the design. At the end, a 16-bit merging adder is needed, though, to convert the carry and sum into one final signal needed for further processing.
- The data path is pipelined at the word level, and the merging adder is pipelined at the bit level. As a result of this pipelining strategy, the critical path is reduced from a 16-bit to only a 4-bit addition.
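As an illustration of the coefficient expansion mentioned above (not necessarily the decomposition chosen by the compiler), the coefficient 77/256 of eq. (1) can be rewritten as (64 + 8 + 4 + 1)/256, i.e. four shifted versions of R added together and the sum shifted right by 8 bit positions:

#include <stdio.h>

/* Multiply by 77 using shifts and adds: 77 = 64 + 8 + 4 + 1.
   The compiler performs this kind of expansion for every constant
   coefficient before mapping onto adders and hard-wired shifters.  */
static int times77(int r)
{
    return (r << 6) + (r << 3) + (r << 2) + r;
}

int main(void)
{
    int r = 200;                             /* arbitrary 8-bit sample */
    int via_shifts = times77(r) >> 8;        /* (77 * r) / 256          */
    int via_mult   = (77 * r) >> 8;

    printf("shift-add: %d, multiply: %d\n", via_shifts, via_mult);
    return 0;
}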

Fig. 12 shows the layout of the complete video convertor chip [36].

IX. DISCUSSION

In this paper the feasibility of interactive synthesis of DSP algorithms into efficient silicon layout has been shown.

Fig. 12. Layout of the RGB convertor for video, generated with CATHEDRAL-III.


We have stated that efficient silicon implementation requires a clear definition of target architectural styles, tuned to a number of application classes. Fixed digital filters in the medium-rate audio and telecom spectrum on the one hand, and decision-rich speech processing algorithms or adaptive algorithms on the other hand, require very different architectures. The former are well served by bit-serial techniques, while the latter are suited for implementation in microcoded processors. Video and image processing applications at higher rates require customized data paths with hard-wired controllers.

It has been shown that different architectures require different optimization tools. This has culminated in three CATHEDRAL silicon compilers, which all start from the bit-true applicative language SILAGE.

DSP designers spend a large part of the design time in optimizing the behavior of an algorithm at the bit level, as this strongly affects performance and silicon area. Therefore we have indicated the existence of tools to optimize the word-lengths of signals and coefficients in linear flow algorithms.

The CATHEDRAL-I system demonstrates the design of multirate bit-serial wave digital filters from filter specs to silicon, based on a library of 20 low-complexity standard cells. Optimization at all levels of the design leads to digital filters which are as compact as analog switched-capacitor filters.

The CATHEDRAL-II system is intended for complex audio algorithms and supports interactive synthesis of microcoded multiprocessor systems. It relies on a mixture of knowledge-based techniques for mapping the SILAGE code onto the execution units in the silicon library, and of algorithmic optimization techniques for microcode scheduling and register and interconnect optimization. User interactivity is achieved by the concept of pragmas. Testability is achieved by tuning the test strategy to the architecture [8]. This compiler has been used for many applications, all the way from specs to silicon. Experience with industrial applications indicates that further work is necessary in the area of memory-management optimization and user interactivity.

CATHEDRAL-III illustrates these points for very high speed applications. In this case, key features include the definition of customized data paths, the matching of the blocks in the algorithm to these application-specific units, and performance-driven optimization of the detailed data-path structure. Typically, the designer also wants a very close interaction with the silicon level to meet the stringent timing requirements.

DSP is a very important application area for silicon synthesis techniques in view of its short, fashion-driven design cycles. This is especially true in the area of consumer electronics, automotive systems, and telecommunications. Silicon-area efficiency, algorithmic complexity, and low-power requirements, as well as the ability to cointegrate analog interfaces on the same chip, cannot be economically satisfied by commodity processors. Their general-purpose features (e.g., floating-point support, a high-accuracy multiplier/accumulator) represent overkill for many DSP requirements.

ACKNOWLEDGMENT

We thank all our colleagues in the ESPRIT 97 project, at IMEC, Philips, Ruhr University of Bochum, Siemens, and Bell Telephone Manufacturing Co., for the many fruitful discussions and suggestions.

REFERENCES

[1] J. Allen, "Computer Architecture for Digital Signal Processing," Proc. IEEE, vol. 73, no. 5, pp. 854-873, May 1985.
[2] R. K. Brayton, R. Camposano, G. De Micheli, R. H. J. M. Otten, and J. van Eijndhoven, "The Yorktown Silicon Compiler," in D. Gajski, ed., Silicon Compilation, pp. 204-310, 1988.
[3] F. Catthoor, J. Rabaey, G. Goossens, J. Van Meerbergen, R. Jain, H. De Man, and J. Vandewalle, "Architectural Strategies for an Application-Specific Synchronous Multiprocessor Environment," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, no. 2, pp. 265-284, Feb. 1988.
[4] F. Catthoor and H. De Man, "Customized Architectural Methodologies for High-Speed Image and Video Processing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing. New York, NY: IEEE, 1988, pp. 1985-1988.
[5] F. Catthoor, J. Vandewalle, and H. De Man, "Simulated-Annealing Based Optimisation of Coefficient and Data Word-Lengths in Digital Filters," Int. J. Circuit Theory and Applications, vol. 16, pp. 371-390, Sep. 1988.
[6] F. Catthoor, J. Rabaey, and H. De Man, "Target Architectures in the CATHEDRAL Synthesis Systems: Objectives and Design Experience," in Proc. IEEE Int. Symp. on Circ. and Systems (Portland, OR, May 1989).
[7] F. Catthoor, D. Lanneer, and H. De Man, "Application-Specific Microcoded Architectures for Efficient Fixed-Rate DFT and FFT," in Proc. IEEE Int. Symp. on Circ. and Systems (Portland, OR, May 1989).
[8] F. Catthoor, J. Van Sas, L. Inze, and H. De Man, "Testability Strategy for Multi-Processor Architecture in a Silicon Compilation Environment," IEEE Design and Test of Computers, Apr. 1989.
[9] L. Claesen, F. Catthoor, H. De Man, J. Vandewalle, S. Note, and K. Mertens, "A CAD Environment for the Thorough Analysis, Simulation and Characterization of VLSI Implementable DSP Systems," in Proc. IEEE Int. Conf. on Computer Design (Port Chester, NY, Oct. 1986), pp. 72-75.
[10] L. Claesen, F. Catthoor, D. Lanneer, G. Goossens, S. Note, J. Van Meerbergen, and H. De Man, "Automatic Synthesis of Signal Processing Benchmark using the CATHEDRAL Silicon Compilers," in Proc. IEEE Custom Integrated Circuits Conf. (Rochester, NY, May 1988), paper 14.7/1-4.
[11] E. G. Coffman, Jr., Computer and Job Scheduling Theory. New York, NY: Wiley, 1976.
[12] J. Decaluwe, J. Rabaey, J. Van Meerbergen, and H. De Man, "Inter-Processor Communication in Synchronous Multi-Processor Digital Signal Processing Chips," to be published in IEEE Trans. Acoust., Speech, Signal Processing, 1989.
[13] H. De Man, "Evolution of CAD-Tools towards Third Generation Custom VLSI-Design," in Digest Europ. Conf. on Solid-State Circuits (Toulouse, France, Sep. 1985), pp. 256-256c.
[14] H. De Man, J. Rabaey, P. Six, and L. Claesen, "CATHEDRAL-II: A Silicon Compiler for Digital Signal Processing," IEEE Design and Test Magazine, pp. 13-25, Dec. 1986.
[15] H. De Man, J. Rabaey, J. Huisken, and J. Van Meerbergen, "Silicon Compilation of DSP Systems with CATHEDRAL-II," ESPRIT Technical Week, Brussels, Sep. 1987.
[16] G. De Micheli, "Performance-Oriented Synthesis in the Yorktown Silicon Compiler," in Proc. IEEE Int. Conf. on Comp.-Aided Design, 1986, pp. 138-141.
[17] S. Eisenbach, Functional Programming: Languages, Tools and Architectures. Chichester, UK: Ellis Horwood Limited Publ., 1987.
[18] A. Fettweis, "Wave Digital Filters: Theory and Practice," Proc. IEEE, vol. 74, no. 2, pp. 270-327, Feb. 1986.
[19] J. R. Fox and J. A. Fried, "Telecommunications Circuit Design using the SILC Silicon Compiler," in Proc. IEEE Int. Conf. Computer Design (Port Chester, NY, Oct. 1985), pp. 213-219.
[20] D. Gajski, ed., Silicon Compilation. Reading, MA: Addison-Wesley, 1988.
[21] T. Gautier and P. Le Guernic, "SIGNAL, a Declarative Language for Synchronous Programming of Real-Time Systems," Lecture Notes in Computer Science, vol. 274 (Functional Languages and Computer Architectures), pp. 257-277, 1987.
[22] L. Gazsi, "Explicit Formulas for Lattice Wave Digital Filters," IEEE Trans. Circuits Syst., vol. CAS-32, pp. 68-88, Jan. 1985.
[23] G. Goossens, R. Jain, J. Vandewalle, and H. De Man, "An Optimal and Flexible Delay Management Technique for VLSI," in C. I. Byrnes and A. Lindquist, eds., Computational and Combinatorial Methods in Systems Theory. Amsterdam, The Netherlands: Elsevier, 1986, pp. 409-418.
[24] G. Goossens, J. Rabaey, J. Vandewalle, and H. De Man, "An Efficient Microcode-Compiler for Custom DSP-Processors," in Proc. IEEE Int. Conf. Comp.-Aided Design (Santa Clara, CA, Nov. 1987), pp. 24-27.
[25] G. Goossens, J. Vandewalle, and H. De Man, "Loop Optimization in Register-Transfer Scheduling for DSP-Systems," in Proc. 26th ACM/IEEE Design Automation Conf. (Las Vegas, NV, June 1989), pp. 826-831.
[26] B. S. Haroun and M. I. Elmasry, "Automatic Synthesis of a Multi-Bus Architecture for DSP," in Proc. IEEE Int. Conf. Comp.-Aided Design (Santa Clara, CA, Nov. 1988), pp. 44-47.
[27] R. I. Hartley and J. R. Jasica, "Behavioral to Structural Translation in a Bit-Serial Silicon Compiler," IEEE Trans. Computer-Aided Design, vol. CAD-7, no. 8, pp. 877-886, Aug. 1988.
[28] P. N. Hilfinger, "A High-Level Language and Silicon Compiler for Digital Signal Processing," in Proc. IEEE Custom Integrated Circuits Conf. (Portland, OR, May 1985), pp. 213-216.
[29] J. Huisken, H. Janssens, P. Lippens, O. McArdle, R. Segers, P. Zegers, A. Delaruelle, and J. van Meerbergen, "Design of DSP Systems using the PIRAMID Library and Design Tools," in Proc. Int. Workshop on Logic and Arch. Synthesis for Silicon Compilers (Grenoble, France, May 1988).
[30] R. Jain, F. Catthoor, J. Vanhoof, B. De Loore, L. Claesen, J. Van Ginderdeuren, H. De Man, and J. Vandewalle, "Custom Integration of a PCM-FDM Transmultiplexer using a Computer-Aided Design Methodology," IEEE Trans. Circuits Syst., vol. CAS-33, pp. 183-195, Feb. 1986.
[31] R. Jamier, N. Bekkara, and A. Jerraya, "The Automatic Synthesis of Data Processing Systems," in Proc. IEEE Int. Conf. on Computer Design (Port Chester, NY, Oct. 1986), pp. 64-67.
[32] H. Joepen and M. Glesner, "Architecture Construction for a General Silicon Compiler System," in Proc. IEEE Int. Conf. on Computer Design (Port Chester, NY, 1985), pp. 312-316.
[33] H. T. Kung, "Why Systolic Architectures?," IEEE Computer Magazine, vol. 15, no. 1, pp. 37-46, Jan. 1982.
[34] C. Leiserson and J. Saxe, "Optimizing Synchronous Circuitry by Retiming," in R. Bryant, ed., Third Caltech Conference on VLSI. Boston, MA: Computer Science Press, 1983.
[35] S. Note, J. Van Meerbergen, F. Catthoor, and H. De Man, "Automated Synthesis of a High-Speed CORDIC Algorithm with the CATHEDRAL-III Compilation System," in Proc. IEEE Int. Symp. on Circuits and Systems (Helsinki, Finland, June 1988), pp. 581-584.
[36] S. Note, J. Van Meerbergen, F. Catthoor, and H. De Man, "Hard-Wired Data-Path Synthesis for High-Speed DSP Systems with the CATHEDRAL-III Compilation Environment," in Proc. Int. Workshop on Logic and Architecture Synthesis (Grenoble, France, May 1988).
[37] B. M. Pangrle and D. Gajski, "State Synthesis and Connectivity Binding for Microarchitecture Compilation," in Proc. IEEE Int. Conf. Comp.-Aided Design (Santa Clara, CA, Nov. 1986), pp. 210-213.
[38] N. Park and A. C. Parker, "SEHWA, a Software Package for Synthesis of Pipelines from Behavioral Specifications," IEEE Trans. Computer-Aided Design, vol. CAD-7, no. 3, pp. 356-370, Mar. 1988.
[39] P. G. Paulin and J. P. Knight, "Force-Directed Scheduling in Automatic Data Path Synthesis," in Proc. 24th IEEE/ACM Design Automation Conf. (Miami, FL, July 1987), pp. 195-202.
[40] Z. Peng, "Synthesis of VLSI Systems with the CAMAD Design Aid," in Proc. 23rd ACM/IEEE Design Automation Conf. (Las Vegas, NV, June 1986), pp. 278-284.
[41] S. Pope, "Automated Generation of Signal Processing Circuits," Doctoral Dissertation, U.C. Berkeley, Feb. 1985.
[42] J. Rabaey, S. Pope, and R. Brodersen, "An Integrated Automatic Layout Generation System for DSP Circuits," IEEE Trans. Computer-Aided Design, vol. CAD-4, no. 3, pp. 285-296, July 1985.
[43] J. Rabaey, J. Vanhoof, G. Goossens, F. Catthoor, and H. De Man, "CATHEDRAL II: Computer Aided Synthesis of Digital Signal Processing Systems," in Proc. IEEE Custom Integrated Circuits Conf. (Portland, OR, May 1987), pp. 157-160.
[44] J. Rabaey, H. De Man, J. Vanhoof, G. Goossens, and F. Catthoor, "CATHEDRAL-II: A Synthesis System for Multi-Processor DSP Systems," in D. Gajski, ed., Silicon Compilation, pp. 311-360, 1988.
[45] C. Sechen and A. Sangiovanni-Vincentelli, "The TimberWolf Placement and Routing Package," in Proc. IEEE Custom Integr. Circuits Conf. (Rochester, NY, May 1984), pp. 522-527.
[46] P. Six, L. Claesen, J. Rabaey, and H. De Man, "An Intelligent Module Generator Environment," in Proc. 23rd ACM/IEEE Design Automation Conf. (Las Vegas, NV, June 1986), pp. 730-735.
[47] R. J. Sluyter, H. J. Kotmans, and A. van Leeuwarden, "A Novel Method for Pitch Extraction from Speech and a Hardware Model Applicable to Vocoder Systems," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (Denver, CO, Apr. 1980), pp. 45-48.
[48] J. Southard, "MacPitts: An Approach to Silicon Compilation," IEEE Computer Magazine, vol. 16, no. 12, pp. 74-87, Dec. 1983.
[49] U. Totzek and F. Matthiesen, "Integrierte Video-Matrix mit Filtern zur Abtastratenkonversion," ITG-Fachberichte, Proc. of ITG-Fachtagung, Berlin, Oct. 1988.
[50] C.-J. Tseng and D. Siewiorek, "Automated Synthesis of Data Paths in Digital Systems," IEEE Trans. Computer-Aided Design, vol. CAD-5, no. 3, pp. 379-395, July 1986.
[51] W. Ulbrich, "MOS Digital Filter Design," in B. Goodwin, ed., MOS IC Design for Telecommunications. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[52] W. Ulbrich and T. G. Noll, "Design of Dedicated MOS Digital Filters for High-Speed Applications," in Proc. IEEE Int. Symp. on Circ. and Systems (Kyoto, Japan, June 1985), pp. 255-258.
[53] I. Verbauwhede, F. Catthoor, J. Vandewalle, and H. De Man, "Background Memory Management for the Synthesis of Algebraic Algorithms on Multi-Processor DSP Chips," in Proc. IFIP Int. Conf. on VLSI (Munich, West Germany, Aug. 1989), pp. 209-218.
[54] J. Van Ginderdeuren, H. De Man, N. Goncalves, and W. Van Noije, "Compact NMOS Building Blocks and a Methodology for Dedicated Digital Filter Applications," IEEE J. Solid-State Circuits, vol. SC-18, pp. 306-316, June 1983.
[55] J. Van Ginderdeuren, H. De Man, F. Catthoor, and S. Beckers, "A Design Methodology for Compact Integration of Wave Digital Filters," in Digest Europ. Solid-State Circuits Conf. (Edinburgh, UK, Sep. 1984), pp. 210-213.
[56] J. Vanhoof, J. Rabaey, and H. De Man, "A Knowledge-Based CAD System for Synthesis of Multi-Processor Digital Signal Processing Chips," in Proc. IFIP Int. Conf. on VLSI (Vancouver, Canada, Aug. 1987), pp. 41-46.
[57] J. Van Meerbergen and H. De Man, "A True Silicon Compiler for the Design of Complex ICs for Digital Signal Processing," Philips Technical Review, vol. 44, no. 7, pp. 228-241, Feb. 1989.

Hugo J. De Man (Fellow, IEEE) was born in Boom, Belgium, on September 19, 1940. He received the electrical engineering degree and the Ph.D. degree in Applied Sciences from the Katholieke Universiteit Leuven, Heverlee, Belgium, in 1964 and 1968, respectively.

In 1968 he became a member of the staff of the Laboratory for Physics and Electronics of Semiconductors at the University of Leuven, working on device physics and integrated circuit technology. From 1969 to 1971 he was at the Electronic Research Laboratory, University of California, Berkeley, as an ESRO-NASA Postdoctoral Research Fellow, working on computer-aided device and circuit design. In 1971 he returned to the University of Leuven as a Research Associate of the NFWO (Belgian National Science Foundation). In 1974 he became a Professor at the University of Leuven. During the winter quarter of 1974-1975 he was a Visiting Associate Professor at the University of California,


Berkeley. He was an Associate Editor for the IEEE JOURNAL OF SOLID-STATE CIRCUITS from 1975 to 1980 and was European Associate Editor for the IEEE TRANSACTIONS ON CAD from 1982 to 1985. He received a Best Paper Award at the 1973 ISSCC for work on bipolar device simulation and at the 1981 ESSCIRC conference for work on an integrated CAD system. His current field of research is the design of integrated circuits and computer-aided design. Since 1984 he has been Vice-President of the VLSI systems design group at IMEC (Leuven, Belgium).

Jan Vanhoof was born in Balen, Belgium, on June 19, 1961. He received the degree in Electrical Engineering from the Katholieke Universiteit Leuven, Belgium, in 1984.

Currently he is a research assistant at the Interuniversity Micro-Electronics Centre (IMEC), Leuven, Belgium. He is working on silicon compilation, more specifically on the synthesis of multiprocessor architectures for digital signal processing, and on the mapping of signal processing algorithms onto these architectures.

Francky Catthoor was born in Temse, Belgium, in October 1959. He received the engineering degree and a Ph.D. in electrical engineering from the Katholieke Universiteit Leuven (KUL), Belgium, in 1982 and 1987, respectively.

From September 1983 to June 1987 he was a researcher in the group involved in VLSI design methodologies for digital signal processing, with Prof. H. De Man and Prof. J. Vandewalle as thesis advisors; first, until December 1984, at ESAT, KUL, and from January 1985 at the Interuniversity Micro-Electronics Center (IMEC), Heverlee, Belgium. In the summer of 1987 he spent a 2-month postdoctoral NFWO research fellowship at the University of California, Berkeley. Currently, he is heading the Applications and Architectural Strategies Group in the VSDM division at IMEC. His research activities mainly belong to the field of architecture design for application-specific ICs intended for DSP algorithms, including design for testability. He is also involved in the development of computer-aided design tools for the high-level (behavioral) synthesis and optimization of general DSP applications. In these fields, he has authored or coauthored about 30 papers, of which 2 received a Best Paper Award.

In 1986 he received the Young Scientist Award from the Marconi International Fellowship.

Gert Goossens was born in Genk, Belgium, on October 10, 1960. He received the degree in electrical engineering and the Ph.D. degree in Applied Sciences from the Katholieke Universiteit Leuven, Belgium, in 1984 and 1989, respectively.

Since 1984 he has been working as a research assistant in the VLSI Design Methodologies Division of the Interuniversity Micro-Electronics Centre (IMEC), Leuven, Belgium, where he is currently heading the Architectural Synthesis Group. His interests and activities are mainly in the field of computer-aided synthesis of signal-processing architectures and in the application of analytical optimization techniques in VLSI design. He received a Best Paper Award at the 26th ACM/IEEE Design Automation Conference in 1989.

Jef L. Van Meerbergen was born in Aarschot, Belgium, in 1951. He received the engineering degree from the University of Leuven in 1975 and the Ph.D. degree from the same university in 1980, for work on bandgap narrowing in silicon solar cells.

In 1979 he joined the Philips Research Laboratories, Eindhoven, The Netherlands. He was involved in the design of MOS digital circuits, general-purpose signal processors, and architectures for DSP. He is presently engaged in the design of silicon compilers for DSP applications in general, and architectural-level synthesis in particular.

Stefaan Note was born in Turnhout, Belgium, on March 12, 1962. He received the Electrical Engineering degree in 1985 at the Katholieke Universiteit Leuven, Belgium.

In 1985 he became a member of the CAD group at the ESAT Laboratory of the Katholieke Universiteit Leuven, where he worked on the development of a bit-true simulator for digital systems. For this work, he received an award of the National Association of Engineers in Belgium (KVIV). In 1986 he joined the IMEC Laboratory, where he is doing research on the development of a silicon compiler for high-speed digital signal processing (DSP) algorithms. Currently, he is preparing his Ph.D. in this domain of silicon compilation.

Jos A. Huisken was born in Purmerend, The Netherlands, in 1958. He received the Ingenieur (Master's) degree in Electrical Engineering from Twente University of Technology, Enschede, The Netherlands.

In 1984 he joined the Philips Research Laboratories, where he is now engaged in the design and design methodology of digital signal processors. His interests are architectural and logic synthesis, and layout generation.
