
J Sign Process Syst (2009) 56:125–139 DOI 10.1007/s11265-008-0223-5

Exploiting Varying Resource Requirements in Wavelet-based Applications in Dynamic Execution Environments

Bert Geelen · Vissarion Ferentinos · Francky Catthoor · Spyridon Toulatos · Gauthier Lafruit · Thanos Stouraitis · Rudy Lauwereins · Diederik Verkest

Received: 18 March 2008 / Revised: 18 March 2008 / Accepted: 23 April 2008 / Published online: 22 May 2008
© 2008 Springer Science + Business Media, LLC. Manufactured in The United States

Abstract In the context of future dynamic applications, systems will exhibit unpredictably varying platform resource requirements. To deal with this, they will not only need to be programmable in terms of instruction set processors, but also at least partial reconfigurability will be required. In this context, it is important for applications to optimally exploit the memory hierarchy under varying memory availability. This article presents a mapping strategy for wavelet-based applications: depending on the encountered conditions, it switches to different memory optimized instantiations or localizations, permitting up to 51% energy gains in memory accesses. Systematic and parameterized mapping guidelines indicate which localization should be selected when, for varying algorithmic wavelet parameters. The results have been formalized and generalized to be applicable to more general wavelet-based applications.

B. Geelen (B) · F. Catthoor · G. Lafruit · R. Lauwereins · D. Verkest
IMEC vzw, Kapeldreef 75, 3001 Leuven, Belgium
e-mail: [email protected]

B. Geelen · F. Catthoor · R. Lauwereins · D. Verkest
Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium

V. Ferentinos · S. Toulatos · T. Stouraitis
Department of Electrical and Computer Engineering, University of Patras, Patras, Greece

R. Lauwereins
Interdisciplinary Institute for BroadBand Technology (IBBT), Ghent, Belgium

D. Verkest
Department of Electrical Engineering, Vrije Universiteit Brussel, Brussels, Belgium


Keywords Wavelets · Memory optimization · Dynamism · Loop transformations

1 Introduction

Portable multimedia applications impose stringent and diverse requirements on the platforms they run on, as they should be energy-efficient for extended battery life, while providing high performance. Many multimedia applications have huge data storage requirements, in addition to their pressing computational requirements. Previous studies show that high off-chip memory latencies and energy consumptions are likely to be the limiting factor for future embedded systems [1, 2].

Memory hierarchy (MH) design has been introduced long ago to improve the data access bandwidth to cope with the ever growing performance mismatch between processing units and the memory subsystem (see e.g. [3, 4]). Moreover, an SRAM-based domain or application specific MH can be used to minimize the power consumption, as data memory power consumption depends primarily on the access frequency and the size of the data memory [5]. Performance and power savings can be obtained by accessing heavily used data from smaller Level 1 memories instead of large background memories.


The Wavelet Transform (WT) produces multi-resolution representations of signals, forming an important but complex algorithmic component for a new class of scalable applications. In these applications it is possible to successively refine the quality of the reconstructed signal (e.g., an image) using increasing subsets of the transformed signal. This allows connecting heterogeneous systems to the same network with dynamically varying execution conditions, where each end user can download a varying subset of the transformed signal and still achieve a reconstructed signal of optimal quality, according to the system’s technical capabilities and the encountered conditions. On the system itself, the application will also have to deal with dynamism at task level by competing for resources with other dynamically generated tasks with real-time constraints, meaning not only the amount of available signal content varies, but also these resources. This offers the additional freedom to dynamically scale the mapping requirements to the available resources so the battery life is adapted to these varying resources. Further energy savings can be obtained by also exploiting the freedom offered by reconfigurable systems and modifying the algorithm mappings to the changing system configurations, e.g. extra Level 1 memory can be activated under a sudden heavy work load. The mapping can then be adapted to this new configuration, by selecting an implementation with higher memory requirements, but 18% less misses, as will be shown in Section 4. Obviously, this necessitates real-time mechanisms and mapping guidelines, derived by the compiler flow and added to the middleware.

This article demonstrates that the optimal miss rate performance at various cache sizes is obtained using different memory-optimized WT versions or localizations. This phenomenon can be exploited when dynamically varying Level 1 memory occurs, by switching between these localizations at run-time to achieve the lowest possible miss rate. This article is a significant extension of previous work [6] focusing on various algorithmic WT parameters, which can be used to derive mapping guidelines indicating at run-time which WT localization offers the best performance. Moreover, the mapping guidelines are extendable to general WT-like algorithms, such as hierarchical filterbanks and iterative up- and downsampling as in Scalable Video Coding (SVC) standardized in MPEG [7].

This article is organized as follows: Section 2 presents the related work. Section 3 gives an overview of the WT and its possible implementation methods, while Section 4 shows a motivating example of the energy gains that can be obtained by exploiting the dynamically varying MH resources. Section 5 presents a detailed description of the miss rate characteristics of two localizations in function of different algorithmic wavelet parameters and uses this to derive run-time mapping guidelines. Finally, conclusions are drawn in Section 6.

2 Related Work

Catthoor et al. [1, 8] present an extended Data Transfer and Storage Exploration methodology developed at IMEC and consisting of multiple steps. Of special interest for these experiments are the Data Reuse Exploration step [9] and the MH Layer Assignment step [10], combined in the MH tool to find data reuse in the code at compile-time and to explore how it can be optimally exploited. A similar compiler-assisted scratchpad management scheme is presented in [11]. These optimization schemes are, however, not compatible with the complex index and loop code of the WT.

Meerwald et al. [12], Bernabe et al. [13], Chrysafis and Ortega [14] and Lafruit et al. [15] present different memory-optimized execution orders or localizations of the WT, offering various methods to avoid off-chip misses: [12] reduces conflict misses in the vertical WT filtering by modifying the data layout and improves the spatial locality by modifying the execution order. Bernabe et al. [13] reduce the cache misses during vertical filtering by computing tiles of merged horizontal and vertical filtering, [14] further avoids misses during the higher WT levels by merging lines of computation over all the WT levels, while [15] offers the same advantages, but by merging in a block-based manner, which corresponds well to further processing blocks. Chaver et al. [16] finally realize a trade-off between the in-placing freedom and spatial locality present in certain implementation styles of the WT. None of these implementations consider dynamically varying memory access or storage requirements, or their impact on the mapping.

In contrast, this article focuses on energy gains related to the miss rates of these different execution orders for wavelet-based applications. It demonstrates that the execution order leading to the lowest memory-transfer energy cost depends on the encountered Level 1 memory space, which varies at run-time in environments with dynamically introduced tasks competing for shared resources. It derives systematic mapping guidelines to select the optimal localization at run-time.


3 The Wavelet Transform

Wavelet-based coding is a powerful enabling technology for scalable applications, including a.o. the JPEG2000 [17] and MPEG4-VTC [18] compression standards and several similar proposed up- and downsampling schemes for scalable video coding in MPEG [7, 19]. Wavelet-based video coding schemes have proven that they can preserve excellent rate-distortion behavior, while offering full scalability [20, 21]. It is based on the WT, which is a special case of a subband transform [22] producing a type of localized time-frequency analysis [23]. Figure 1 shows an example of the transformed Lena image, as a hierarchy of subimages, grouped in levels. Multiresolution coding can be easily achieved by selective decoding of transform coefficients related to a certain frequency range or subband image.

In digital compression applications, the WT is traditionally computed as an iterated filter bank [24]. Figure 2 shows a schematic representation of the 1D multi-level Forward Wavelet Transform (FWT) procedure. The 1D WT can be generalized to a higher dimensional WT by interleaving the filtering in various dimensions during each level of the WT.

Each level of the FWT consists of a lowpass (L) and a highpass (H) filtering followed by a subsampling operation. The output of the lowpass filtering, after subsampling, represents the input to the next level of the transform, while the highpass samples are directly sent to the output. The final output of an n level FWT is the result of the nth lowpass filtering, combined with all the highpass filtering results.
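
For illustration, the following small C program (our addition, not from the paper) iterates such a filter bank over a 1D signal; it assumes the trivial Haar filter pair for brevity, whereas the paper's experiments use the lifting-based 5/3 and 9/7 filters.

#include <stdio.h>

/* One FWT level: in[0..n-1] -> low[0..n/2-1] and high[0..n/2-1]. */
static void fwt_level(const float *in, float *low, float *high, int n) {
    for (int i = 0; i < n / 2; i++) {
        low[i]  = (in[2 * i] + in[2 * i + 1]) / 2;  /* lowpass + subsample  */
        high[i] =  in[2 * i] - in[2 * i + 1];       /* highpass + subsample */
    }
}

int main(void) {
    float buf[8] = {1, 2, 3, 4, 5, 6, 7, 8}, low[4], high[4];
    /* The lowpass output of each level becomes the input of the next level;
       the highpass samples are sent straight to the output. */
    for (int level = 0, n = 8; level < 3; level++, n /= 2) {
        fwt_level(buf, low, high, n);
        for (int i = 0; i < n / 2; i++) buf[i] = low[i];
        printf("level %d: %d highpass samples\n", level + 1, n / 2);
    }
    return 0;
}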

The iterative filtering or lifting typically leads to code with complex data dependencies and non-linear loop and index expressions, as is illustrated in Fig. 3. This makes automated optimization strategies difficult to apply to wavelet-based applications, as was shown for the automated MH mapping tool [6].

Figure 1 Lena input image, 2 level transformed Lena and output subband organization (LL2 (DC), HL2, LH2, HH2, HL1, LH1, HH1).

Figure 2 FWT filtering procedure (L: forward lowpass filter, H: forward highpass filter, ↓2: subsampling).

3.1 Execution Order

The execution order of the filtering instructions has a big influence on the memory and energy requirements of the WT. Two important cases are the traditional level-by-level and the block-based localized execution orders.

3.1.1 Level-By-Level

The iterative filtering creates strong dependencies between the operations of the various levels of the transformation. The simplest scheduling order to respect these dependencies is the traditional level-by-level implementation. Here, the WT is sequentially calculated level by level, starting from the original input signal. However, this schedule requires large amounts of memory to store results at the intermediate levels. More precisely, in the level-by-level algorithm, it is only after all the data of an intermediate level have been calculated (and temporarily stored in memory) that the calculation of the next (higher) level may start. Consequently, the required memory size to store the temporary data is equal to the input length. These large amounts of intermediate data are unlikely to fit in small, efficient Level 1 (L1) memories and will need to be stored in large L2 memories or even off-chip SDRAM, if the on-chip space is restricted for area or power reasons.

for (level = 0; level < nrLevels; level++) {
    // horizontal filtering …

    // vertical filtering
    for (col = 0; col < W >> level; col += 2) {
        for (hor = 0; hor < 2; hor++) {
            for (row = 0; row < H >> level; row += 2) {
                data[level+1][1+2*hor][row/2][col/2] = data[level][0][row+1][col+hor]
                    - γ * (data[level][0][row][col+hor] + data[level][0][row+2][col+hor]);
                data[level+1][0+2*hor][row/2][col/2] = data[level][0][row][col+hor]
                    + δ * (data[level+1][1+2*hor][row/2-1][col/2]
                         + data[level+1][1+2*hor][row/2][col/2]);
            }
        }
    }
}

Figure 3 Pseudocode for the WT, showing complex index and loop expressions.


3.1.2 Block-Based

The block-based ordering tries to avoid these costly background accesses [15]. It is an “as soon as possible” scheduling algorithm, which consumes inputs and intermediate results as soon as possible after their creation, so that immediate storage in SDRAM can be avoided. The block-based schedule takes advantage of block-based constrained applications (motion estimation/compensation and entropy coding in video coding), where the data has to be processed on a block-by-block basis.

Ideally, for each input block calculation step, the execution schedule prioritizes the vertical order through the wavelet levels. However, because of the wavelet filtering data dependencies mentioned earlier, this is not feasible, and thus a “skewed” schedule is followed. For the 1D WT, the difference between the ideal and the best possible execution schedule is illustrated in Fig. 4. The combination of loop merging with these data dependencies leads to even more complex code, partially illustrated in Fig. 5. To resume the calculations after the first front line of the schedule, the last processed lowpass samples must be stored in a temporary memory.

The space required for this data is more likely to fit in more efficient Level 1 memory, leading to a potentially faster and more energy efficient implementation of the WT in comparison to the level-by-level approach, but then the available opportunities for data reuse have to be exploited very effectively.

Figure 4 Block-based 1D-WT, with the skewed as-soon-as-possible execution schedule compared to the ideal (but not feasible) schedule.

for (rowC = 0; rowC < (H+TILE)/TILE; rowC++)
    for (colC = 0; colC < (W+TILE)/TILE; colC++)
        for (level = 0; level < nrLevels; level++)
            for (rowF = 0; rowF < (TILE>>level); rowF += 2)
                for (colF = 0; colF < (TILE>>level); colF += 2) {
                    // horizontal filtering …

                    // vertical filtering
                    for (hor = 0; hor < 2; hor++) {
                        data[level+1][1+2*hor][((TILE>>level)*rowC+rowF-vrowBUMP[L])/2]
                                              [((TILE>>level)*colC+colF-colBUMP[L])/2] = …
                    }
                }

Figure 5 Partial block-based WT pseudocode illustrating complex index expressions & loop structures.

4 Motivating Example

Future embedded application execution conditions will be more and more dynamic. Tasks such as audio players and 3D visualisation will unpredictably pop up and get shut down at run-time. Moreover, the applications themselves will also present more and more dynamism: a scalable video coder can switch to a lower quality video mode with a lower resolution and frame rate, depending on the user’s wishes, the network conditions, . . . This will obviously result in different requirements being placed on the system architecture, as the processing requirements for these different modes can vary orders of magnitude [25]. If this is handled using static worst-case design strategies, the designs will typically be severely overdimensioned, making inefficient use of system resources. If the design is instead adapted to the encountered execution conditions, the system resources will be exploited more efficiently and energy gains can be obtained. This will be demonstrated in this section, by switching between different memory-optimized implementations or localizations of a 2D WT, which are more suited for specific Level 1 memory size ranges. At design time, the system can then be profiled to determine the most likely set of operating conditions or scenarios [26, 27], allowing the selection of a minimal set of different memory-optimized implementations. At run-time the middleware can then determine what the actual execution scenario is and, using mapping guidelines derived at design time, switch to the most compatible localization, providing an optimized division of work over design-time and run-time. This requires knowledge of the behavior of the different localizations under different execution conditions and for all possible algorithmic settings, as will be presented in Section 5.

The two WT execution orders described above result in different MH performance: the level-by-level style needs to access the samples between the different WT levels from off-chip, as the required size for this intermediate data is typically too large to still store in the foreground layers. This results in many more misses to off-chip memory than for the block-based localized execution order, which can keep the intermediate data on-chip and more or less only needs to go off-chip to access the input and output data.

However this presumes that a certain amount of space is available in the Level 1 memory to achieve this performance. The behavior described for level-by-level assumes that only the temporal filtering reuse within each level is exploited, so that enough space corresponding to the filter size should be present in Level 1. This space can be reused for the different levels (and directions in an N-dimensional WT). In contrast, in a block-based localization style, all WT levels are simultaneously alive and “blocks” of samples should be stored for each level, meaning the Level 1 space required for the desired block-based MH performance is much larger than for level-by-level.

When dynamically introduced, pre-emptable tasks are present, especially in a multi-threading context, this Level 1 space can be traded off at run-time with other applications, meaning the amount of space varies over time and cannot be statically predicted. How the two execution orders behave when less (or more) Level 1 space than in the typical configuration is available, is different for both and results in a crossover point at smaller Level 1 sizes where level-by-level performs better. This can be exploited at run-time by switching to the Pareto-optimal execution order for the encountered size, where Pareto-optimal signifies a solution is optimal in at least one trade-off direction, when all other directions are fixed. Figure 6 illustrates this for a 4 level forward WT of a 512 × 512 image, using the biorthogonal 9/7 wavelet filters, as employed in the JPEG2000 lossy mode [17]. The performance is compared using miss rate as the evaluation metric. A miss tries to access data which is not present in Level 1 memory and which first needs to be fetched from costly L2 memory. A hit, in contrast, accesses data which is present in Level 1 memory [4].

Figure 6 Relative miss rate differences for 2 localizations of a 4 level, 9/7 filter WT.

From these results, we can see that the level-by-level execution order indeed offers a better miss rate at small Level 1 sizes, but that its performance does not scale with a growing Level 1 size: due to the processing order, the miss rate only starts improving when complete wavelet levels are stored on-chip, which is typically for unrealistically large Level 1 sizes. When the available Level 1 space exceeds about 2 kB, the block-based localized WT on the other hand is able to exploit more data reuse, between the horizontal and vertical filtering and between the various levels of the WT. As a consequence, at Level 1 sizes larger than 4 kB the block-based localized WT offers a miss rate 18% lower than the level-by-level WT miss rate. Accordingly, it makes sense to select this execution order for implementation.

If less Level 1 space is available at run-time due to task level dynamism and a very heavy workload, the block-based WT performs significantly worse. It is then not only impossible to exploit the data reuse between horizontal and vertical filtering and between the WT levels, but also the temporal filtering reuse is underexploited. This is not the case for the level-by-level execution order. Accordingly, at very small sizes, corresponding to a heavy workload, level-by-level can have up to a 52% lower miss rate.

To determine how these relative miss rate differences translate to relative memory energy differences, we will now derive a translation factor linking a relative gain in miss rate to a relative gain in energy consumption: we assign a fixed cost E_L1 to a Level 1 memory access and a fixed cost E_L2 to a Level 2 memory access [28]. This results in a cost E_L1 for each cache hit and a cost 2 · E_L1 + E_L2 for each cache miss (due to one Level 2 read and one Level 1 write and read), or a total energy cost:

$$E = (2 \cdot E_{L1} + E_{L2}) \cdot \mathit{miss} + E_{L1} \cdot \mathit{hit} = E_{L1} \cdot (\mathit{hit} + \mathit{miss}) + (E_{L1} + E_{L2}) \cdot \mathit{miss} = E_{L1} \cdot \mathit{accesses}_{\mathrm{total}} + (E_{L1} + E_{L2}) \cdot \mathit{miss}$$

leading to a relative energy difference (assuming miss_2 > miss_1 and E_2 > E_1):

$$\frac{E_2 - E_1}{E_2} = \frac{(E_{L1} + E_{L2}) \cdot (\mathit{miss}_2 - \mathit{miss}_1)}{E_{L1} \cdot \mathit{accesses}_{\mathrm{total}} + (E_{L1} + E_{L2}) \cdot \mathit{miss}_2} \qquad (1)$$

$$= \frac{\mathit{miss}_2 - \mathit{miss}_1}{\mathit{miss}_2} \times \frac{1}{1 + \frac{E_{L1}}{E_{L1} + E_{L2}} \cdot \frac{\mathit{accesses}_{\mathrm{total}}}{\mathit{miss}_2}} \qquad (2)$$

$$= \frac{\mathit{miss}_2 - \mathit{miss}_1}{\mathit{miss}_2} \times \tau \qquad (3)$$


When converted to a relative gain in energy, a gain in miss rate is reduced by the translation factor τ given in Eq. (3) above. It indicates that when the miss rate is relatively low and E_L2 is not much higher than E_L1, most of the gains related to the miss rate reduction are hidden by the constant E_L1 · accesses_total term in the energy formula. However, in a good MH design for embedded applications, the total energy consumption is reduced by making E_L1 low, requiring small Level 1 sizes. The Level 2 size and the corresponding E_L2 will then be much larger, as it still needs to be large enough to store large data in background. In such cases, the translation factor τ is close to one, even for small miss rates, and a reduction in miss rate translates directly to a reduction in energy.

According to the energy numbers of Table 1, a 4 kB Level 1 SRAM memory costs 0.045 nJ/access, whereas a 128 MB Level 2 SDRAM memory on average leads to an energy cost of 4.8 nJ/access, if we assume 20% page misses. For these energy costs, and with the miss rates given in Fig. 6, the translation factor τ for the miss rate reduction of 18% is 96.4%, leading to an energy gain of 17%, while for the miss rate reduction of 52%, τ is 98.3%, giving an energy gain of 51%. This shows that within the embedded multimedia and wireless context, these miss rate reductions indeed lead to nearly proportional energy reductions.
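
As a small cross-check of these numbers (our addition; the Level 1 and Level 2 costs are taken from Table 1, while the miss and access counts are approximate values read off Fig. 6, not exact measurements), Eq. (3) can be evaluated directly:

#include <stdio.h>

int main(void) {
    const double eL1 = 0.045;    /* nJ/access, 4 kB SRAM (Table 1)          */
    const double eL2 = 4.8;      /* nJ/access, SDRAM incl. 20% page misses  */
    const double total = 6.96e6; /* total accesses = miss count at size 0   */
    double miss2 = 1.73e6;       /* level-by-level misses at ~4 kB (Fig. 6) */
    double miss1 = 0.82 * miss2; /* block-based: 18% fewer misses           */
    double tau  = 1.0 / (1.0 + eL1 / (eL1 + eL2) * (total / miss2));
    double gain = (miss2 - miss1) / miss2 * tau;
    printf("tau = %.3f, energy gain = %.1f%%\n", tau, 100 * gain);
    /* prints roughly: tau = 0.964, energy gain = 17.4% */
    return 0;
}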

In a static environment, a designer knows how much Level 1 space is available and can choose a suitable execution order. As level-by-level is only optimal for quite limited Level 1 sizes, a block-based execution order will typically be chosen. In a dynamic environment we cannot predict at design time how much Level 1 space is available. We can then either keep only one block-based execution order, which will have an optimal miss rate most of the time, or we can switch to a level-by-level execution order when there is a heavy workload and Level 1 space becomes limited. If we assume enough memory for the block-based execution order is available 80% of the time, switching over to the level-by-level execution order in the remaining 20% would still deliver a 20% energy saving, compared to always staying in one block-based execution order, if the switching cost may be ignored.

5 Systematic Mapping Guidelines

In order to achieve the 20% energy saving mentioned above under general algorithmic conditions, the middleware needs to know which execution order offers the best miss rate under all these conditions. In this section, the miss rate performance of both the above-mentioned execution orders will be evaluated over an extended range of Level 1 memory sizes and for various algorithmic parameters, such as image size, filter size and number of WT levels. This is essential to derive mapping guidelines, allowing the middleware to know which execution order to choose.

To describe the miss rate performance for applications under varying Level 1 sizes, we use a simulator based on the concept of reuse distance, a metric expressing locality: the (backwards) reuse distance of a certain memory access represents the number of distinct memory addresses accessed between the current access and the previous one (if any). A one-to-one relationship exists between the miss rate of a fully associative cache of a certain size with a least recently used (LRU) replacement policy and the reuse distance histogram at that size [29]. This implies that the performance over various Level 1 sizes for caches of this configuration is automatically given.
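
A minimal sketch of such a backwards reuse distance computation (our addition, not the paper's actual simulator, which also models write backs) keeps an LRU stack of distinct addresses; the depth at which an address is found equals its reuse distance:

#include <stdio.h>

#define MAX_DISTINCT 4096         /* assumed trace footprint (illustrative) */

static long stack_[MAX_DISTINCT]; /* stack_[0] is the most recently used */
static int  depth = 0;

/* Returns the reuse distance of addr, or -1 for a cold (first) access,
   and updates the LRU stack. O(N*M), fine for small traces. */
static int reuse_distance(long addr) {
    int i, d = -1;
    for (i = 0; i < depth; i++)
        if (stack_[i] == addr) { d = i; break; }
    if (d < 0) {                  /* cold access: grow the stack */
        if (depth < MAX_DISTINCT) depth++;
        i = depth - 1;
    }
    for (; i > 0; i--)            /* move addr to the top */
        stack_[i] = stack_[i - 1];
    stack_[0] = addr;
    return d;
}

int main(void) {
    long trace[] = {1, 2, 3, 1, 2, 3};   /* distances: -1 -1 -1 2 2 2 */
    for (int t = 0; t < 6; t++)
        printf("addr %ld: distance %d\n", trace[t], reuse_distance(trace[t]));
    /* A fully associative LRU cache of more than 2 words hits all 3 reuses. */
    return 0;
}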

When a real cache hierarchy has to be chosen, fully associative caches are not an option though, as they are very area and power-inefficient. However, using the proper code and data layout techniques [30, 31], it has been shown that the miss rate of a direct mapped cache can be substantially reduced. Figure 7 illustrates this for a direct mapped cache, simulated using the Dinero IV cache simulator [32], after applying the layout transformations mentioned in [12]. The trends observed for fully associative LRU caches also hold here.

Table 1 SRAM & SDRAM energy per access.

                              Size (bytes)   E (nJ/access)   E (nJ/row activate)
On-chip SRAM energy [35]      1024           0.024
                              4096           0.045
                              16384          0.122
                              65536          0.311
                              262144         0.961
                              1048576        1.894
Off-chip SDRAM energy [36]    134217728      3.5             10


Figure 7 Comparison of behavior under a fully associative and a direct mapped cache for a 4 level, 5/3 filter WT with level-by-level and block-based execution orders.

In low power contexts, direct mapped caches, software controlled caches or scratchpad memories are preferred due to their higher energy efficiency and because they allow sharing Level 1 space between threads more easily. Nevertheless, this article derives mapping guidelines for fully associative caches, since Fig. 7 illustrates they can be easily extended towards other cache systems.

Originally the simulator evaluated the effect of load costs on the miss rate. Using techniques similar to those of [33], the simulator has been extended to incorporate the effects of write accesses as well as write back costs.

5.1 Level-by-Level Behavior

Figure 8 illustrates the miss rate behavior of the typical level-by-level execution order style. It will be compared to the block-based miss rate behavior to derive which execution order offers the best behavior for varying WT parameters. This info can then be used at run-time by the middleware to switch to the optimal version. The level-by-level behavior can be summarized at 3 cache sizes: the smallest size, at which no temporal reuse is exploited, a certain small size at which the horizontal and vertical filtering temporal reuse is exploited, and a large size at which all possible reuse is exploited.

Figure 8 Characteristic points of a 4 level, 9/7 filter, level-by-level WT.

To explain the miss rate behavior in terms of varying filter sizes, it is useful to examine our lifting-based implementation. The lifting scheme is a method for simplifying the WT by decomposing the filters into a set of prediction and update lifting stages [34]. By exploiting the redundancy between the highpass and lowpass filters, this results in computationally more efficient implementations. Moreover, lifting scheme algorithms have the advantage that they do not require temporary arrays in the calculation steps, as the output of the lifting stage can be directly stored in place of the input. Therefore it has been applied in our experiments.

Each pair of filtering kernels of sizes 2n + 1 (lowpass) / 2n − 1 (highpass) is decomposed into a set of n lifting stages, as shown in Fig. 9 for 4 lifting stages. Each of these lifting stages reads 3 inputs and writes 1 output which is directly read by the next lifting stage. This is referred to as intra-kernel reuse, which will be important to derive the relative miss rate differences at small Level 1 sizes. Furthermore, when the final lifting output at position T has been computed, the “filtering kernel” moves to position T + 2, where it reloads a number of samples used in the previous position. This corresponds to the temporal filtering reuse which also occurs in FIR-filter implementations, and is referred to as inter-kernel reuse.
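
For illustration, one such predict/update lifting pair can be sketched in C as follows (our addition; the coefficients correspond to the 5/3 filter, i.e. Lift = 1, and boundary handling is omitted for brevity):

#include <stdio.h>

/* One predict/update lifting pair applied in place on x[0..n-1]. */
static void lifting_stage(float *x, int n, float p, float u) {
    for (int i = 1; i < n - 1; i += 2)   /* predict: odd samples -> highpass */
        x[i] += p * (x[i - 1] + x[i + 1]);
    for (int i = 2; i < n - 1; i += 2)   /* update: even samples -> lowpass,
           reading the highpass outputs just produced (intra-kernel reuse) */
        x[i] += u * (x[i - 1] + x[i + 1]);
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    lifting_stage(x, 8, -0.5f, 0.25f);   /* 5/3 predict/update coefficients */
    for (int i = 0; i < 8; i++) printf("%g ", x[i]);
    printf("\n");
    return 0;
}

The 9/7 filter chains four such stages (Lift = 4) with different coefficients, as in the γ/δ stages of Fig. 3.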

Figure 9 9/7 Filtering kernel decomposed in 4 lifting stages.

Since all accesses miss at cache size 0, the miss rates are equal to the total load and write back counts for the lifting implementation described above. We assume a fetch-on-write-miss policy, giving 4 load misses and 1 write back miss per lifting stage, of Lift number of stages in total. Each kernel is applied on every other column and every other row during the horizontal and the vertical filtering, respectively. Finally, each consecutive WT level operates on the previously produced LL subband, meaning the sizes of the levels are reduced by four each time. Representing the image width and height by W and H, the number of WT levels by L and the initial load and write back miss rates by M_no reuse,l and M_no reuse,w, this gives

$$M_{\text{no reuse},l} = (\mathit{Lift} \cdot 4) \cdot 2 \cdot \frac{W \cdot H}{2} \cdot \sum_{k=0}^{L-1} \frac{1}{2^k} \cdot \frac{1}{2^k}$$

$$M_{\text{no reuse},w} = (\mathit{Lift} \cdot 1) \cdot 2 \cdot \frac{W \cdot H}{2} \cdot \sum_{k=0}^{L-1} \frac{1}{2^k} \cdot \frac{1}{2^k}$$
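
As a sanity check (our own evaluation, assuming Lift = 4 lifting stages for the 9/7 filter of Fig. 9), for the 512 × 512, 4 level WT used throughout this gives

$$M_{\text{no reuse},l} = 16 \cdot 2 \cdot \frac{512 \cdot 512}{2} \cdot \left(1 + \frac{1}{4} + \frac{1}{16} + \frac{1}{64}\right) \approx 5.57 \cdot 10^6$$

$$M_{\text{no reuse},w} = \frac{1}{4} \cdot M_{\text{no reuse},l} \approx 1.39 \cdot 10^6$$

or roughly 7.0 · 10^6 misses in total, consistent with the no-reuse point of Fig. 8.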

The following opportunity for reuse is the earlier described intra-kernel reuse, where the samples produced by one particular lifting stage are read by the consecutive lifting stage. This intra-kernel reuse occurs at a size of 3 words, corresponding to 1 lifting stage. It avoids the 1 write and 4 load misses per lifting stage. Instead, the miss rate is determined by the in-placing choices: in the horizontal filtering the output directly overwrites the input, as is allowed by the lifting scheme, whereas for vertical filtering the output is written to a new output array (similar to [16]). This has the advantage of offering a higher spatial locality for the following WT levels, as exploiting the lifting scheme in-placing options requires writing the various WT subbands in an interleaved manner. The drawback is a higher amount of compulsory misses and potentially more space required to exploit this reuse. More precisely, for varying filter sizes we define the intra reuse misses as

$$\mathit{Intra}_{H,\mathit{Lift}=1} = 3 \qquad \mathit{Intra}_{V,\mathit{Lift}=1} = 4$$

$$\mathit{Intra}_{H,\mathit{Lift}\geq 2} = \mathit{Lift} + 2 \qquad \mathit{Intra}_{V,\mathit{Lift}\geq 2} = \mathit{Lift} + 4$$

reducing the load miss rates (only) to

$$M_{\text{intra},l} = (\mathit{Intra}_H + \mathit{Intra}_V) \cdot \frac{W \cdot H}{2} \cdot \sum_{k=0}^{L-1} \frac{1}{2^k} \cdot \frac{1}{2^k}$$

The next characteristic point occurs when all the lifting stages fit in Level 1 memory, so that both the reuse between the lifting stages in one kernel and the horizontal and vertical filtering reuse can be exploited. The space required differs for the horizontal and the vertical filtering due to the in-placing configuration

$$S_{\text{inter,hor}} = \mathit{Lift} + 2 \qquad S_{\text{inter,ver}} \approx \mathit{Lift} + 4$$

leading to miss rates where, for each WT level, during the horizontal filtering the input/output array is loaded and written back once, while during the vertical filtering both an input and a different output array are loaded, and this output array is also written back.

$$M_{\text{inter},l} = (1 + 2) \cdot \sum_{k=0}^{L-1} \frac{W}{2^k} \cdot \frac{H}{2^k} \qquad M_{\text{inter},w} = (1 + 1) \cdot \sum_{k=0}^{L-1} \frac{W}{2^k} \cdot \frac{H}{2^k}$$

The final characteristic cache size is reached when all possible reuse is exploited. This occurs when all data produced during the horizontal filtering remains in Level 1 before being consumed by the vertical filtering, requiring sufficient space to store the input and the output array for the first WT level. At this size, only the original input array and all the output arrays will miss

$$S_{\text{max reuse}} = 2 \cdot W H$$

$$M_{\text{max reuse},l} = M_{\text{max reuse},w} = W H + W H \cdot \sum_{k=0}^{L-1} \frac{1}{2^k} \cdot \frac{1}{2^k}$$

Before this point is reached, the miss rate gradually decreases. This is caused by the mismatch in the direction between the horizontal and vertical filtering. Typically, the sizes required for this reuse are too large for on-chip storage. Nevertheless, their miss rate behavior can be accurately approximated. For the load misses between the horizontal and vertical filtering, the reuse distance at each location corresponds to the points accessed after the last horizontal filtering access and before the first vertical filtering access, as is shown in Fig. 10. This implies that the reuse distance d = WH − 1 − row · (W − 2 − col) + H · col, or that the miss rate at a certain size decreases by the amount of array locations whose reuse distance is equal to that size.

Figure 10 Reuse distance resulting from the orientation mismatch between horizontal and vertical filtering.


Figure 11 Reuse distance distribution resulting from the filtering orientation mismatch. a Horizontal to vertical filtering mismatch. b Vertical to horizontal filtering mismatch.

As Fig. 11a indicates, this sum of discrete array locations can be well approximated by the integral under

$$col = \frac{d - WH + 1 + (W - 2) \cdot row}{H + row}$$

or

$$area(d) = \int_{\beta}^{H} \frac{d - WH + 1 + (W - 2) \cdot row}{H + row} \, d\,row$$

where $\beta = \max(0, row(col = 0)) = \max\left(0, \frac{WH - 1 - d}{W - 2}\right)$. Then for $(W - 1) + 2 \cdot (H - 1) < d < WH - 1$,

$$area(d) = (WH - 1 - d) + (2H(W-1) - d - 1) \cdot \ln\left|\frac{2H(W-1) - 1 - d}{2H(W - 2)}\right|$$

while for $WH - 1 < d < 2 \cdot (WH - 1)$,

$$area(d) = H \cdot (W - 2) + (d + 1 - 2H(W-1)) \cdot \ln(2)$$

This equation describes how the miss rate drops due to the reuse of horizontal filtering outputs, read as input for vertical filtering. By modifying the size ranges in this formula, it also describes the corresponding effect in the higher WT levels. Moreover, a final component of data reuse remains, where the output of vertical filtering is loaded again and overwritten in the following WT level. Figure 11b shows the resulting reuse distance distribution over the output image of d = row · (col − 3) − col · (4H − 1) + 2WH − 1.

For $W/2 + 4H - 2 < d < \frac{W}{2}\frac{H}{2} + 2H - 2$, this gives

$$area(d) = \left(\frac{W}{2} - 3\right) \frac{2d - W - 8H + 4}{W - 8} + (2H(W - 6) + 2 - d) \cdot \ln\left|\frac{2H(W - 6) + 2 - d}{(4H - 1)(W/2 - 4)}\right|$$

while for $\frac{W}{2}\frac{H}{2} + 2H - 2 < d < 2WH - 1$,

$$area(d) = (W/2 - 3) \cdot H/2 + (2H(W - 6) + 2 - d) \cdot \ln\left|\frac{7H}{2(4H - 1)}\right|$$

These formulas describe the corresponding effect on the loads as well as the write back miss rates, which can again be directly adapted for the higher WT levels by modifying the W and H parameters. Figure 12 shows a portion of the load miss rate curve and the corresponding approximation, which is practically indistinguishable. This experimentally confirms the accuracy of our formalization.

Figure 12 Load miss rate and its approximation.

5.2 Block-Based Behavior

The miss rate behavior of a block-based WT execution order style is given in Fig. 13. This execution order proceeds directly through the WT levels, by calculating output values of the highest WT level as soon as possible. When extended to 2D, the calculation of a new LL subband value at the highest WT level L requires filtering 2 × 2 new samples in WT level L − 1, 2² × 2² new samples in WT level L − 2 and in general a block of 2^l × 2^l new samples in WT level L − l. The reuse between the various WT levels can be exploited once all these blocks can be stored in Level 1 memory. For the level-by-level execution order the miss rate behavior corresponding to this inter level reuse occurred at extremely large Level 1 memory sizes.

Figure 13 Characteristic points of a 4 level, 9/7 filter, block-based WT.

Each of these blocks is individually filtered horizontally and vertically. Within each block, this horizontal and vertical filtering is merged together by vertically filtering groups of 2 × 2 samples directly after horizontal filtering. This allows capturing the reuse between horizontal and vertical filtering at much smaller sizes than without this merging, where it would require storing a complete block.

These localization features can be used to explain the characteristic points of the block-based localization miss rate curves.

At cache size 0, the block-based WT does not exploit any reuse either, resulting in the same miss rate as for level-by-level

$$M_{\text{no reuse},l} = (\mathit{Lift} \cdot 4) \cdot 2 \cdot \frac{W \cdot H}{2} \cdot \sum_{k=0}^{L-1} \frac{1}{2^k} \cdot \frac{1}{2^k}$$

$$M_{\text{no reuse},w} = (\mathit{Lift} \cdot 1) \cdot 2 \cdot \frac{W \cdot H}{2} \cdot \sum_{k=0}^{L-1} \frac{1}{2^k} \cdot \frac{1}{2^k}$$

The intra-kernel reuse again occurs at the same size of 3 words as for level-by-level, and reduces the load miss rates to

$$M_{\text{intra},l} \approx (\mathit{Intra}_H + \mathit{Intra}_V) \cdot \frac{W \cdot H}{2} \cdot \sum_{k=0}^{L-1} \frac{1}{2^k} \cdot \frac{1}{2^k}$$

Furthermore, depending on the examined filter sizes and their alignment, the miss rate can be slightly lower for the block-based curve: the reuse between horizontal and vertical filtering is brought to much lower distances by doing a close merging between horizontal and vertical kernels. This way the 4 new samples produced by 2 horizontal filtering operations can be directly consumed by the vertical filtering. For one of these samples, this reuse can already be exploited at a Level 1 size of 3, effectively decreasing Intra_V by half.

Compared to the level-by-level execution order, the inter-kernel reuse now occurs at significantly larger sizes due to the tight merging of horizontal and vertical filtering: instead of accommodating space for 1 kernel, 2 horizontal and 2 vertical kernels are now required. If we assume these kernels to be aligned so that the produced horizontal filtering outputs are directly consumed, 4 locations are reused and the required space is

$$S_{\text{inter}} \approx 2 \cdot (S_{\text{kernel,hor}} + S_{\text{kernel,ver}}) - 4 = 4 \cdot \mathit{Lift} + 8$$

This allows exploiting all of the horizontal filtering reuse within a block, as well as reuse due to horizontal production and vertical consumption. Consequently, since each of the (W/2^L) · (H/2^L) blocks is filtered in a row major order, we only need to load the data for an entire merged kernel at each first column position of the block. Afterwards the horizontal filtering only requires 4 new samples, but the vertical filtering still requires 2 · S_kernel,ver, reduced by 4 thanks to the exploited production consumption reuse. The write back miss rate drops in a similar manner

$$M_{\text{inter},l} \approx \frac{W}{2^L} \cdot \frac{H}{2^L} \cdot \sum_{k=0}^{L-1} \frac{2^{L-k}}{2} \cdot \left(2 \cdot (S_{\text{kernel,hor}} - 2) + 2^{L-k} \cdot S_{\text{kernel,ver}}\right)$$

$$M_{\text{inter},w} \approx \frac{W}{2^L} \cdot \frac{H}{2^L} \cdot \sum_{k=0}^{L-1} \frac{2^{L-k}}{2} \cdot \left(2 \cdot (\mathit{Lift} - 2) + 2^{L-k} \cdot (2 + \mathit{Lift})\right)$$

The exploitation of vertical filtering reuse requires space to avoid flushing the data loaded while filtering one subrow within a block, before proceeding to the next subrow. Since the block size is WT level dependent, so is the size required for this reuse

$$S_{\text{subrow},k} \approx 2 \cdot (S_{\text{kernel,hor}} - 2) + 2^{L-k} \cdot S_{\text{kernel,ver}}$$

This will additionally make use of the vertical filtering reuse within each block. When it is exploited for all levels, the resulting miss rate is

$$M_{\text{subrow},l} \approx \frac{W}{2^L} \cdot \frac{H}{2^L} \cdot \sum_{k=0}^{L-1} 2^{L-k} \left(2^{L-k} \cdot 2 + (S_{\text{kernel,hor}} - 2) + (S_{\text{kernel,ver}} - 4)\right)$$

$$M_{\text{subrow},w} \approx \frac{W}{2^L} \cdot \frac{H}{2^L} \cdot \sum_{k=0}^{L-1} 2^{L-k} \left(2^{L-k} \cdot 2 + (S_{\text{kernel,hor}} - 4) + (S_{\text{kernel,ver}} - 6)\right)$$

Proceeding towards a block-based size, we exploit 2 additional sources of reuse: production consumption between the WT levels and filtering reuse at the boundaries of horizontally neighboring blocks. The latter requires the space to store the blocks for all levels, while the former can already be exploited once blocks corresponding to specific levels fit. To exploit both these sources of reuse, the space corresponding to blocks over all levels is

$$S_{\text{block}} \approx 2^L \cdot 2^L + \sum_{k=0}^{L-1} \frac{2^L}{2^k} \left(S_{\text{kernel,hor}} - 2 + \frac{2^L}{2^k} + S_{\text{kernel,ver}} - 4\right)$$

The corresponding miss rate consists of each new row of blocks that has to be loaded, as well as the additional vertical filtering overlap between the rows of blocks

$$M_{\text{block},l} = \frac{H}{2^L} \left(W \cdot 2^L + \sum_{k=0}^{L-1} \frac{W}{2^k} \left(2^{L-k} + S_{\text{kernel,ver}} - 4\right)\right)$$

$$M_{\text{block},w} = \frac{H}{2^L} \left(W \cdot 2^L + \sum_{k=0}^{L-1} \frac{W}{2^k} \left(2^{L-k} + S_{\text{kernel,ver}} - 6\right)\right)$$

Finally, when a complete row of blocks over all levels fits in Level 1 memory, the filtering reuse between the rows of blocks is also exploited. This requires a size of

$$S_{\text{max reuse}} = W \cdot 2^L + \sum_{k=0}^{L-1} \frac{W}{2^k} \left(2^{L-k} + S_{\text{kernel,ver}} - 4\right)$$

The corresponding miss rate is the minimal possible miss rate and consists of all data that is ever loaded or written back

$$M_{\text{max reuse},l} = M_{\text{max reuse},w} = W H + W H \cdot \sum_{k=0}^{L-1} \frac{1}{2^k} \cdot \frac{1}{2^k}$$
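
The characteristic sizes derived above are easy to evaluate numerically. The C sketch below (our addition; sizes in words, with S_kernel,hor = Lift + 2 and S_kernel,ver = Lift + 4 as in Section 5.1) computes them for the running 512 × 512, 4 level, 9/7 example:

#include <stdio.h>

int main(void) {
    const int W = 512, L = 4, Lift = 4;          /* 9/7 filter: 4 stages    */
    const int Shor = Lift + 2, Sver = Lift + 4;  /* kernel sizes (Sec. 5.1) */
    double s_inter   = 2.0 * (Shor + Sver) - 4;  /* merged kernels, 4*Lift+8 */
    double s_subrow0 = 2.0 * (Shor - 2) + (1 << L) * Sver;  /* level 0 subrow */
    double s_block   = (double)(1 << L) * (1 << L);
    double s_max     = (double)W * (1 << L);
    for (int k = 0; k < L; k++) {
        int b = (1 << L) >> k;                   /* block side at level k */
        s_block += (double)b * (Shor - 2 + b + Sver - 4);
        s_max   += (double)(W >> k) * ((1 << (L - k)) + Sver - 4);
    }
    printf("S_inter=%g  S_subrow,0=%g  S_block=%g  S_max=%g words\n",
           s_inter, s_subrow0, s_block, s_max);
    /* prints: S_inter=24  S_subrow,0=136  S_block=836  S_max=22912,
       in line with the characteristic points of Fig. 13 */
    return 0;
}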

We have now fully and systematically described the miss rate behavior for both the block-based and the level-by-level execution orders. This will be used in the next section to determine the size region where a particular execution order is optimal, as well as the amount of misses by which it is better.

5.3 Guidelines

To efficiently exploit the varying system resources, mapping guidelines should contain information on the size region where a certain localization is better, and the gains that can be obtained by switching to another localization. Figure 14 illustrates this for the 9/7 filter size employed in JPEG2000 and varying numbers of WT levels.

Figure 14 Evolution of miss rate behavior for a 9/7 filter based WT, over 4 WT levels.

In the complete region of interest the behavior of the level-by-level execution order remains rather constant, as can be verified using the formulas from Section 5.1. Moreover, in the initial region where level-by-level is also optimal, the behavior of the block-based execution order isn’t heavily influenced by the number of WT levels either. From the miss rate formulas we derived that, depending on the amount of exploited production-consumption reuse, the difference between both localizations for small Level 1 sizes is heavily dependent on the filter size and varies between

$$\Delta_{\text{small},1} = (2 \cdot \mathit{Lift} + 3) \cdot \sum_{k=0}^{L-1} \frac{W}{2^k} \cdot \frac{H}{2^k}$$

$$\Delta_{\text{small},2} = (2 \cdot \mathit{Lift} + 2) \cdot \sum_{k=0}^{L-1} \frac{W}{2^k} \cdot \frac{H}{2^k}$$

In the crossover region the behavior of the block-based curve is strongly dependent on the number of levels. In our example it determines whether the location of this crossover point corresponds to block-based or subrow-based reuse: for the 3 level WT, the crossover point lies in the high cache size ranges corresponding to the block-based reuse, while for more WT levels it is situated at the subrow-based reuse sizes, thereby increasing in size by a factor of two with each further WT level.

Finally, the miss rate difference between level-by-level and block-based for block-based Level 1 sizes is given by

$$\Delta_{\text{large}} = W H + 3 \cdot \sum_{k=1}^{L-1} \frac{W}{2^k} \cdot \frac{H}{2^k} - \frac{2 \cdot \mathit{Lift} - 1}{2^L} \sum_{k=0}^{L-1} \frac{W H}{2^k}$$

where it is primarily the final term which determines the dependence on the algorithmic parameters of filter size and number of WT levels.
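
As an illustration (our own evaluation, not from the paper), for the running 512 × 512, 4 level, 9/7 (Lift = 4) example this evaluates to

$$\Delta_{\text{large}} = 512^2 + 3 \cdot 512^2 \left(\frac{1}{4} + \frac{1}{16} + \frac{1}{64}\right) - \frac{7}{16} \cdot 512^2 \left(1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8}\right) \approx 3.1 \cdot 10^5$$

misses, consistent with the 18% miss rate gap between the two curves at large Level 1 sizes in Fig. 6.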

This behavior can indeed be observed in Fig. 14: initially the level-by-level execution order offers the best miss rate, with gains Δ_small that are not heavily dependent on the varying amounts of WT levels. This holds until the cross-over region, after which block-based offers miss rate gains Δ_large, varying with the amount of WT levels as described above.

Using this compile-time generated information about the potential miss rate gains, the middleware can make a run-time evaluation, trading off these gains with costs such as the overhead of switching between various execution orders or a modified Level 1 access cost after activating or deactivating Level 1 memory during a platform reconfiguration.
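
A hypothetical middleware hook for this run-time evaluation might look as follows (our sketch; the names, the single crossover threshold and the miss-count switch cost are illustrative assumptions, whereas real guidelines would carry the full Δ_small/Δ_large information per Level 1 size region):

#include <stdio.h>

enum order { LEVEL_BY_LEVEL, BLOCK_BASED };

struct guidelines {
    long crossover_words;   /* L1 size below which level-by-level wins */
    long delta_misses;      /* predicted miss gain of the better order */
};

/* Pick the localization for the next frame from the granted L1 space,
   switching only when the predicted gain outweighs the switch cost. */
static enum order select_order(const struct guidelines *g, long l1_words,
                               enum order current, long switch_cost) {
    enum order best = (l1_words < g->crossover_words) ? LEVEL_BY_LEVEL
                                                      : BLOCK_BASED;
    if (best != current && g->delta_misses <= switch_cost)
        return current;              /* gain does not pay for the switch */
    return best;
}

int main(void) {
    struct guidelines g = { 512, 300000 };       /* illustrative numbers */
    enum order o = select_order(&g, 128, BLOCK_BASED, 10000);
    printf("selected: %s\n", o == LEVEL_BY_LEVEL ? "level-by-level"
                                                 : "block-based");
    return 0;
}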

6 Conclusion

Future dynamic applications will lead to dynamically and unpredictably varying platform resource requirements. For this context it is important that applications optimally exploit the MH under varying memory availability. This article demonstrates how to achieve this for wavelet-based applications by trading off the Level 1 memory space required for various execution orders with their potential miss rate gains, permitting up to 51% energy gains in memory accesses. Systematic and parameterized mapping guidelines are given, indicating which localization should be selected when. Using this compile-time generated information, the middleware can make an informed decision at run-time about switching. The results have been formalized and generalized to be applicable to more general wavelet-based applications.

Acknowledgements This work was supported in part by the Institute for Promotion of Innovation through Science, Technology-Flanders (IWT-Vlaanderen, PhD bursary B. Geelen). We thank the European Social Fund (ESF), Operational Program for Educational and Vocational Training II (EPEAEK II), and particularly the Program PYTHAGORAS II, for partially supporting this work.

References

1. Catthoor, F., Wuytack, S., et al. (1998). Custom memory management methodology. Deventer: Kluwer.

2. Vijaykrishnan, N., Kandemir, M., et al. (2000). Energy-driven integrated hardware-software optimizations using SimplePower. In Proc. of the 27th annual int. symp. on computer architecture (pp. 95–106). New York: ACM Press.

3. Hill, M. (1987). Aspects of cache memory and instruction buffer performance. Ph.D. thesis, Berkeley, Univ. of California Press.

4. Patterson, D., & Hennessy, J. (1996). Computer architecture: A quantitative approach. San Mateo: Morgan Kaufmann Publ.

5. Amrutur, B., & Horowitz, M. (2000). Speed and power scaling of SRAM's. In IEEE journal of solid-state circuits (Vol. 35, pp. 175–185).

6. Geelen, B., Ferentinos, V., et al. (2006). Software-controlled scratchpad mapping strategies for wavelet-based applications. In Proc. IEEE workshop on sig. proc. sys.

7. Tran, T. D., Liu, L., et al. (2007). Advanced dyadic spatial re-sampling filters for SVC. http://ftp3.itu.ch/av-arch/jvt-site/2007_01_Marrakech/JVT-V031.zip.

8. Catthoor, F., Danckaert, K., et al. (2002). Data access and storage management for embedded programmable processors. Deventer: Kluwer.

9. Van Achteren, T., Lauwereins, R., et al. (2002). Data reuse exploration techniques for loop-dominated applications. In Proc. 5th ACM/IEEE DATE conf. (pp. 428–435).

10. Masselos, K., Catthoor, F., et al. (2001). Memory hierarchy layer assignment for data re-use exploitation in multimedia algorithms realized on predefined processor architectures. In IEEE int. conf. on electronics, circ. and syst. (pp. 281–287).

11. Steinke, S., Wehmeyer, L., et al. (2002). Assigning program and data objects to scratchpad for energy reduction. In Proc. 5th ACM/IEEE DATE conf. (pp. 409–415).

12. Meerwald, P., Norcen, R., et al. (2002). Cache issues with JPEG2000 wavelet lifting. In Proc. SPIE, visual communications and image processing (Vol. 4671, pp. 626–634).

13. Bernabe, G., Garcia, J., et al. (2005). Reducing 3D fast wavelet transform execution time using blocking and the streaming SIMD extensions. The Journal of VLSI Signal Processing, 41(2), 209–223.

14. Chrysafis, C., & Ortega, A. (2000). Line-based, reduced memory, wavelet image compression. IEEE Transactions on Image Processing, 9, 378–389.

15. Lafruit, G., Nachtergaele, L., et al. (1999). Optimal memory organization for scalable texture codecs in MPEG-4. IEEE Transactions on Circuits and Systems for Video Technology, 2, 218–243.

16. Chaver, D., Tenllado, C., et al. (2003). Vectorization of the 2D wavelet lifting transform using SIMD extensions. In Proc. of the 17th int. symp. on parallel and distributed processing (p. 228.2). Washington, DC: IEEE Computer Society.

17. Skodras, A., Christopoulos, C., et al. (2001). The JPEG2000 still image compression standard. IEEE Signal Processing Magazine, 9(7), 36–58.

18. Sodagar, I., Lee, H., et al. (1999). Scalable wavelet coding for synthetic/natural hybrid images. IEEE Transactions on Circuits and Systems for Video Technology, 9(2), 244–254.

19. Ohm, J.-R. (1994). Three-dimensional subband-coding with motion compensation. IEEE Transactions on Image Processing, 3, 559–571.

20. Van der Auwera, G., Munteanu, A., et al. (2002). Bottom-up motion compensated prediction in the wavelet domain for spatially scalable video coding. Electronics Letters, 38, 1251–1253.

21. Woods, J. W. (2001). A resolution and frame-rate scalable subband/wavelet video coder. IEEE Transactions on Circuits and Systems for Video Technology, 11(9), 1035–1044.

22. Woods, J. W. (1991). Subband image coding. Deventer: Kluwer Academic.

23. Vetterli, M., & Kovacevic, J. (1995). Wavelets and subband coding. New York: Prentice-Hall.

24. Mallat, S. G. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 674–693.

25. Verdicchio, F., Andreopoulos, Y., et al. (2004). Scalable video coding based on motion compensated temporal filtering: Complexity and functionality analysis. In Proceedings of IEEE international conference on image processing, Singapore.

26. Palkovic, M., Corporaal, H., et al. (2005). Global memory optimisation for embedded systems allowed by code duplication. In Proc. of the 9th int. workshop on software and compilers for embedded systems (pp. 72–79). New York: ACM Press.

27. Marchal, P., Gomez, J. I., et al. (2003). SDRAM-energy-aware memory allocation for dynamic multimedia applications on multi-processor platforms. In Proc. of DATE conf. (p. 10516). Washington, DC: IEEE Computer Society.

28. Kamble, M. B., & Ghose, K. (1997). Analytical energy dissipation models for low-power caches. In ISLPED '97: Proceedings of the 1997 international symposium on low power electronics and design (pp. 143–148). New York, NY: ACM.

29. Beyls, K., & D'Hollander, E. (2002). Reuse distance-based cache hint selection. In The 8th international Euro-Par conf. (pp. 265–274).

30. Kulkarni, C., Miranda, M., et al. (2001). Cache conscious data layout organization for embedded multimedia applications. In Proc. 4th ACM/IEEE DATE conf.

31. Vander Aa, T., Jayapala, M., et al. (2003). Instruction buffering exploration for low energy embedded processors. In Proc. of the 13th int. PATMOS workshop (pp. 409–419).

32. Hill, M. S. (1998). Dinero IV, release 7, trace-driven uniprocessor cache simulator. www.cs.wisc.edu/~markhill/DineroIV.

33. Thompson, J. G., & Smith, A. J. (1989). Efficient (stack) algorithms for analysis of write-back and sector memories. ACM Transactions on Computer Systems, 7(1), 78–117.

34. Sweldens, W. (1995). The lifting scheme: A new philosophy in biorthogonal wavelet constructions. In A. F. Laine & M. Unser (Eds.), Wavelet applications in signal and image processing III, Proc. SPIE 2569 (pp. 68–79).

35. Papanikolaou, A., Miranda, M., et al. (2003). Global interconnect trade-off for technology over memory modules to application level: Case study. In Proc. of SLIP. New York: ACM.

36. Micron (1999). 128M SDRAM MT48LC16M8A2. www.micron.com.

Bert Geelen is a member of the NES (Nomadic Embedded Systems) Multimedia group at the Inter-University Microelectronics Center (IMEC) and a PhD student at the Katholieke Universiteit Leuven. His research interests include parallelization, low-power embedded-systems design and computer architectures. Bert Geelen has a MEng in electrical engineering from the Katholieke Universiteit Leuven.

Vissarion Ferentinos received the electrical and computer engineering degree in 2000 and the M.Sc. degree in hardware-software integrated systems in 2002 (Award of Performance by the National Institute of Scholarships in Greece), both from the University of Patras, Patras, Greece.

During the elaboration of the M.Sc. thesis he was a student in the Multimedia Image Compression Systems group in the Design Technology for Integrated Information and Communication Systems (DESICS) division, at the Inter-university Micro-Electronics Center (IMEC), Leuven, Belgium, studying the field of wavelet-based systems. Since 2002, he has collaborated with the same group working on the area of low-power design and implementation of embedded scalable video codecs. Subsequently (since 2003), he has been a Ph.D. student with the Electrical and Computer Engineering Dept. of the University of Patras.

Francky Catthoor is a fellow at the Inter-university Micro-Electronics Center (IMEC), Heverlee, Belgium. He received the engineering degree and a Ph.D. in electrical engineering from the Katholieke Universiteit Leuven, Belgium, in 1982 and 1987, respectively. From September 1983 until June 1987 he was a researcher in the area of VLSI design methodologies for Digital Signal Processing, with Prof. Hugo De Man and Prof. Joos Vandewalle as Ph.D. thesis advisors.

Since 1987, he has headed several research domains in the area of high-level and system synthesis techniques and architectural methodologies, all within the Design Technology for Integrated Information and Telecom Systems (DESICS, formerly VSDM) division at IMEC. He has been an assistant professor at the EE department of the K.U. Leuven since 1989 and a full professor (part-time) since 2000.

His current research activities belong to the field of architecture design methods and system-level exploration for power and memory footprint within real-time constraints, oriented towards data storage management, global data transfer optimization and concurrency exploitation. The major target application domains are real-time signal and data processing algorithms in image, video and end-user telecom applications, and data-structure-dominated modules in telecom networks. Platforms that contain both customized architectures (potentially on an underlying configurable technology) and (parallel) programmable instruction-set processors are targeted. Deep-submicron technology issues are also addressed.

In 1986 he received the Young Scientist Award from the Marconi International Fellowship Council.

He was an associate editor for the “IEEE Transactions on VLSI Systems” from 1995 to 1998, and for the “IEEE Transactions on Multi-Media” in 1999–2001. Since 2002 he has been an associate editor of the “ACM Transactions on Design Automation for Embedded Systems (TODAES)”, and since 1996 an editor for Kluwer’s “Journal of VLSI Signal Processing”. In early 1997 he became a member of the steering board of the VLSI Technical Committee of the IEEE Circuits & Systems Society, and since 1999 he has also served on the IEEE Transactions on VLSI Systems steering board.

He was the program chair of the 1997 IEEE Intl. Symposium on System Synthesis (ISSS) and the general chair of the 1998 ISSS. He was also the program chair and main organizer of the 2001 IEEE Signal Processing Systems (SIPS) conference. He was elected an IEEE fellow in 2005.

Spyridon Toulatos is studying electrical and computer engineering at the University of Patras, Greece. During the elaboration of his master’s thesis, “Wavelet-based applications: mapping strategies and localizations in dynamic execution environments”, he was a student researcher in the Multimedia Image Compression Systems group in the Design Technology for Integrated Information and Communication Systems (DESICS) division at the Inter-university Micro-Electronics Center (IMEC), Leuven, Belgium.

Gauthier Lafruit was a research scientist with the Belgian National Foundation for Scientific Research from 1989 to 1994, mainly active in the areas of NMR image acquisition, wavelet image compression and VLSI implementations for image transmission. From 1995, he was a research assistant at the department of Electronics at the Vrije Universiteit Brussel, Belgium. In 1996, he joined IMEC, where he was first involved as a senior scientist in projects on the low-power VLSI implementation of combined JPEG/wavelet compression engines for the European Space Agency (ESA). His main current activities concern progressive transmission in still image and video coding, scalability in 3D image rendering (meshes and texture rendering) and resource monitoring in 3D virtual reality. In this role, he has made decisive contributions to the standardisation of 3D-implementation complexity management in MPEG-4.

Thanos Stouraitis, an IEEE fellow, is a Professor of the ECE Dept. at the University of Patras. He has served as a member of the Administrative Committee of the University of Sterea Hellas in Greece. He has served on the faculty of The Ohio State University and has visited the University of Florida and Polytechnic University.

He received his Ph.D. in Electrical Engineering from the University of Florida, an M.Sc. from the University of Cincinnati, an M.S. from the University of Athens, Greece, and a B.S. in Physics from the University of Athens, Greece.

His current research interests include signal and image processing systems, application-specific processor technology and design, computer arithmetic, and design and architecture of optimal digital systems. He has authored and co-authored over 150 technical papers. He holds one patent on DSP processor design. He has authored several books and book chapters.


He has led several DSP processor design projects funded by the European Union, American organizations, and the Greek government and industry. He serves as Regional Editor for Europe for the Journal of Circuits, Systems, and Computers, as Associate Editor for several IEEE Transactions, and as a consultant for industry. He regularly reviews for the IEEE SP, C&S, C, and Education Transactions, for IEE Proceedings E, F, G, and for conferences like IEEE ISCAS, ICASSP, VLSI, SiPS, Computer Arithmetic, Euro DAC, etc. He also reviews proposals for NSF, the European Commission, and other agencies.

He was the general chair of the IEEE ISCAS 2006 and of the IEEE SiPS 2005. He has served as the Chair of the VLSI Systems and Applications (VSA) Technical Committee and as a member of the DSP and the Multimedia Technical Committees of the IEEE Circuits and Systems Society. He is a founder and Chair of the IEEE Signal Processing chapter in Greece. He was the General Chair of the 1996 IEEE Int. Conf. on Electronics, Circuits, and Systems (ICECS) and the Technical Program Chair of Eusipco ’98 and ICECS ’99. He has served as Chair or as a member of the Technical Program Committees of a multitude of IEEE Conferences, including ISCAS (Program Committee Track Chair). He received the 2000 IEEE Circuits and Systems Society Guillemin-Cauer (best paper) Award for the paper “Multi-Function Architectures for RNS Processors.”

Rudy Lauwereins is vice-president of IMEC, Belgium’s Interuniversity Micro-Electronic Centre, which performs research and development, ahead of industrial needs by 3 to 10 years, in microelectronics, nano-technology, enabling design methods and technologies for ICT systems. He leads the Nomadic Embedded Systems division of 190 researchers, focused on the implementation of nomadic embedded systems as required for smart networked mobile devices. He is also a part-time Professor at the Katholieke Universiteit Leuven, Belgium, where he teaches Computer Architectures in the Master of Science in Electrotechnical Engineering program, and a director of the Institute for BroadBand Technologies (IBBT).

Before joining IMEC in 2001, he had held a tenured professorship in the Faculty of Engineering at the Katholieke Universiteit Leuven since 1993. He obtained a Ph.D. in Electrical Engineering in 1989. In 2001, he was elected by students as the best teacher of the engineering faculty’s Master’s curriculum.

Rudy Lauwereins has served on numerous international program committees and organizational committees, and has given many invited and keynote speeches. He will be the general chair of the DATE (Design, Automation and Test in Europe) conference in 2007. He is a senior member of the IEEE. Professor Lauwereins has authored and co-authored more than 300 publications in international journals and conference proceedings.

Diederik Verkest received the Master’s and Ph.D. degrees in Applied Sciences from the Katholieke Universiteit Leuven (Belgium) in 1987 and 1994, respectively. He has been working in the VLSI design methodology group of the IMEC laboratory (Leuven, Belgium) on several topics related to formal methods, system design, hardware/software co-design, reconfigurable systems, and multi-processor systems. He is currently in charge of the research at IMEC on design technology for nomadic embedded systems.

Diederik Verkest is a Professor at the University of Brussels (VUB) and at the University of Leuven (KU-Leuven). He is a member of the IEEE and a Golden Core Member of the IEEE Computer Society.

Diederik Verkest has published and presented over 100 articles in international journals and at international conferences. Over the past years he has been a member of the programme and/or organisation committees of several major international conferences such as ISSS, CODES, FPL, DATE, and DAC. He was the General Chair of the Design, Automation and Test in Europe Conference, DATE’03.