An Efficient Architecture for a Lifted 2D Biorthogonal DWT


Journal of VLSI Signal Processing 40, 335–342, 2005. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.


MEHBOOB ALAM, WAEL BADAWY, VASSIL DIMITROV AND GRAHAM JULLIEN

ATIPS Laboratory, Department of Electrical and Computer Engineering, University of Calgary, Calgary, Alberta, T2N 1N4, Canada

Received September 18, 2003; Revised December 24, 2003; Accepted February 26, 2004

Abstract. This paper presents a new algorithm for a 2D non-separable lifted bi-orthogonal wavelet transform. The algorithm is derived by factoring complementary pairs of 2D wavelet transform filters. The results are efficient architectures for real-time signal processing, which do not require transpose memory for the 2D processing of data. The proposed architecture exploits the in-place implementation inherent in the algorithm, and can take advantage of both vertical and horizontal parallelism in the direct implementation. The processing in our architecture is scheduled by carefully pipelining the lifting steps, which allows for up to four times faster processing than the direct implementation. The proposed architecture operates at high speed, consumes low power and has reduced computational complexity compared to previously published filter-based and lifting-based bi-orthogonal wavelet architectures.

Keywords: discrete wavelet transforms, lifting, biorthogonal transform, wavelet architectures, image compression, lifted architectures, Mallat’s algorithms

1. Introduction

The Discrete Wavelet Transform (DWT) is widely used in solving practical problems of signal processing, with applications in image compression [1, 2]. A major impetus came from the work of Mallat [3] and Daubechies [4]. Daubechies introduced the concept of compactly supported wavelets and the theory of frames. Mallat introduced a multi-resolution signal decomposition of the DWT and provided the basic foundation for its implementation. Sweldens [5] proposed a wavelet transform construction via a lifting scheme. These initial breakthrough investigations gave rise to the study of efficient algorithms and architectures for the construction of wavelet transforms.

The 2D DWT architectures are classified according to the method used for decomposition of the input signal. They are grouped into separable (indirect approach) and non-separable (direct approach) architectures. There exist very few 2D architectures which employ a non-separable approach. Marino [6, 7] proposed a well-known design using the direct approach. In this approach the architecture performs a complete dyadic decomposition in $N^2/4$ clock cycles (cc). However, the architecture is based on a filter design approach and uses zero padding. The filters have relatively short impulse responses (typically the case for the bi-orthogonal DWT); therefore reducing the filter complexity does not have a large impact on reducing the overall bi-orthogonal DWT implementation complexity. Chakrabarti [8] also proposed an architecture, based on parallel filters, for the 2D DWT, which uses the MRPA (modified recursive pyramid algorithm) for its implementation. Two other architectures for the 2D direct form DWT have been proposed by Yu [9] and Sheu [10]. These architectures assume equal length filters and cannot be generalized for bi-orthogonal wavelet transforms used for image compression. In addition, Meng [11] proposed a spatial combinative lifting algorithm using 9/7 filters. The design reduces the number of multiplications; however, it is an application-specific solution (the algorithm is based on a 9/7 transform) and the final VLSI implementation [12] is not scalable. Our solution, as presented in this paper, can be generalized to any type or size of wavelet transform filter, and we present an example (using a 9/3 filter) of a scalable architecture, which can alternatively be used for high or low bit rate applications.

The paper is organized as follows. The 2D algorithm and preliminaries are presented in Section 2. Section 3 presents the proposed algorithm and architecture. The performance analysis is presented in Section 4, and the paper is concluded in Section 5.

2. 2D DWT Algorithm

The 2D DWT computational kernel, using the direct approach, is given in Eq. (1).

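$$
\begin{aligned}
W^{j}_{hh}(m, n) &= \sum_{n_1=0}^{L-1} \sum_{n_2=0}^{L-1} hh(n_1, n_2)\, W^{j-1}_{hh}(2m - n_1, 2n - n_2) \\
W^{j}_{hg}(m, n) &= \sum_{n_1=0}^{L-1} \sum_{n_2=0}^{M-1} hg(n_1, n_2)\, W^{j-1}_{hg}(2m - n_1, 2n - n_2) \\
W^{j}_{gh}(m, n) &= \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{L-1} gh(n_1, n_2)\, W^{j-1}_{gh}(2m - n_1, 2n - n_2) \\
W^{j}_{gg}(m, n) &= \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{M-1} gg(n_1, n_2)\, W^{j-1}_{gg}(2m - n_1, 2n - n_2)
\end{aligned}
\tag{1}
$$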

where $W^{j}_{hh}(m, n)$ is the input $N \times N$ point sequence, which is zero outside $0 \le m \le N - 1$, $0 \le n \le N - 1$. $hh(n_1, n_2)$, $hg(n_1, n_2)$, $gh(n_1, n_2)$ and $gg(n_1, n_2)$ are the Low-Low, Low-High, High-Low and High-High ($L \times M$) tap filter coefficients respectively, and $j$ is the current decomposition level, $0 \le j \le J - 1$. $J$ is the maximum level of decomposition. $L$ is the width of the low pass filter and $M$ is the width of the high pass filter. $W^{j}_{hh}(m, n)$, $W^{j}_{hg}(m, n)$, $W^{j}_{gh}(m, n)$ and $W^{j}_{gg}(m, n)$ are the Low-Low, High-Low, Low-High and High-High sub bands respectively, produced at decomposition level $j$.
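As an illustration of the direct kernel, one line of Eq. (1) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `direct_subband` is hypothetical, the input is treated as zero outside its bounds (as stated above), and only the $N/2 \times N/2$ block of outputs is computed rather than the full region of support.

```python
def direct_subband(x, f):
    """Direct form of one line of Eq. (1).

    y(m, n) = sum over (n1, n2) of f(n1, n2) * x(2m - n1, 2n - n2)

    x : square image as a list of rows (the level j-1 input),
        treated as zero outside its bounds.
    f : 2D filter mask as a list of rows (hypothetical example values).
    """
    size = len(x)
    half = size // 2
    rows, cols = len(f), len(f[0])
    out = [[0.0] * half for _ in range(half)]
    for m in range(half):
        for n in range(half):
            acc = 0.0
            for n1 in range(rows):
                for n2 in range(cols):
                    r, c = 2 * m - n1, 2 * n - n2
                    if 0 <= r < size and 0 <= c < size:  # zero outside the image
                        acc += f[n1][n2] * x[r][c]
            out[m][n] = acc
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
mask = [[1, 1],
        [1, 1]]
subband = direct_subband(image, mask)
```

Every output point costs one multiply-accumulate per filter tap, which is the cost that the operation counts in this section quantify.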

To illustrate the computational complexity of the direct implementation of $hg(n_1, n_2)$, the output of the system is expressed as

$$
W^{j}_{hg}(m, n) = \sum_{n_1=0}^{L-1} \sum_{n_2=0}^{M-1} hg(n_1, n_2)\, W^{j-1}_{hg}(2m - n_1, 2n - n_2) \tag{2}
$$

Figure 1. 2D DWT using 2D non-separable filtering: (a) the input to decomposition level $j - 1$, (b) the $L \times M$ tap filter and (c) the resultant DWT output.

Assuming $N \gg M$ and $N \gg L$ (image size much greater than filter size), the region of support (ROS) of $(n_1, n_2)$, where the input is $W^{j-1}_{hg}(2m - n_1, 2n - n_2)$ and the filter response is $hg(n_1, n_2)$, is shown in Fig. 1.

The direct computation of the output $W^{j}_{hg}(m, n)$ will take approximately $(\frac{N}{2} + M)^2 ML$ arithmetic operations. This result is based on the area of the output ROS, which is $(\frac{N}{2} + L - 1) \times (\frac{N}{2} + M - 1)$, with $ML$ operations per output point. An arithmetic operation here refers to one multiplication and one addition. If $hg(n_1, n_2)$ is separable then $hg(n_1, n_2) = h(n_1) g(n_2)$, where $h(n_1) = 0$ outside $0 \le n_1 \le L - 1$ and $g(n_2) = 0$ outside $0 \le n_2 \le M - 1$.
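The separability condition can be checked numerically. The sketch below assumes hypothetical integer filters `h` and `g` and zero extension outside the image; it compares the direct double sum over the mask $h(n_1) g(n_2)$ with a two-pass evaluation (a 1D pass along $n_2$ followed by a 1D pass along $n_1$). Function names are illustrative only.

```python
def direct_2d(x, h, g):
    """Direct double sum with the separable mask hg(n1, n2) = h(n1) * g(n2)."""
    size = len(x)
    half = size // 2
    out = [[0] * half for _ in range(half)]
    for m in range(half):
        for n in range(half):
            for n1, hv in enumerate(h):
                for n2, gv in enumerate(g):
                    r, c = 2 * m - n1, 2 * n - n2
                    if 0 <= r < size and 0 <= c < size:
                        out[m][n] += hv * gv * x[r][c]
    return out


def separable_2d(x, h, g):
    """Same result computed as two 1D passes."""
    size = len(x)
    half = size // 2
    # Pass 1: filter every row along n2 with g, keeping even positions only.
    t = [[sum(gv * x[r][2 * n - n2]
              for n2, gv in enumerate(g) if 0 <= 2 * n - n2 < size)
          for n in range(half)]
         for r in range(size)]
    # Pass 2: filter the intermediate result along n1 with h.
    return [[sum(hv * t[2 * m - n1][n]
                 for n1, hv in enumerate(h) if 0 <= 2 * m - n1 < size)
             for n in range(half)]
            for m in range(half)]


x = [[(r * 7 + c * 3) % 11 for c in range(8)] for r in range(8)]
h = [1, -2, 1]          # hypothetical 3-tap filter
g = [1, 2, 3, 2, 1]     # hypothetical 5-tap filter
same = direct_2d(x, h, g) == separable_2d(x, h, g)
```

The two-pass form replaces a product of the two filter lengths per output with roughly their sum, which is the source of the savings in operation count.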

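Equation (2) can be simplified as shown in Eq. (3):

$$
\begin{aligned}
W^{j}_{hg}(m, n) &= \sum_{n_1=0}^{L-1} \sum_{n_2=0}^{M-1} hg(n_1, n_2)\, W^{j-1}_{hg}(2m - n_1, 2n - n_2) \\
&= \sum_{n_1=0}^{L-1} h(n_1) \left[ \sum_{n_2=0}^{M-1} g(n_2)\, W^{j-1}_{hg}(2m - n_1, 2n - n_2) \right]
\end{aligned}
\tag{3}
$$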

where $\sum_{n_2=0}^{M-1} g(n_2)\, W^{j-1}_{hg}(2m - n_1, 2n - n_2)$ is the 1D convolution over the input signal. Since there are $M$ different values of $n_2$ for which the above result is nonzero, the 1D evaluation requires $(\frac{N}{2}) M (\frac{N}{2} + L - 1)$ arithmetic operations. Since there are $(\frac{N}{2} + M - 1)$ different values which need to be computed, we require $L(\frac{N}{2} + M - 1)^2$ arithmetic operations. Hence the computation of $W^{j}_{hg}(m, n)$ for level 1, where the input matrix is $N \times N$, requires approximately $(\frac{N}{2}) M (\frac{N}{2} + M - 1) + L(\frac{N}{2} + M - 1)^2$ arithmetic operations. Compared with the direct implementation computational complexity, $[(\frac{N}{2} + L - 1)(\frac{N}{2} + M - 1) ML]$, this is a considerable reduction.

Figure 2. Four-level decomposition of a non-separable 2D DWT.

The 2D DWT implementation, using the direct approach for $J$ levels of decomposition, is shown in Fig. 2. The direct computation of $N \times N$ inputs processed with $J$ levels of decomposition of a 2D DWT requires the number of filtering computations given by Eq. (4).

$$
N^2 + \left(\frac{N}{2}\right)^2 + \left(\frac{N}{4}\right)^2 + \cdots + \left(\frac{N}{2^{J-1}}\right)^2 = \frac{4}{3}\left(1 - 4^{-J}\right) N^2 \tag{4}
$$

Using the simplifying assumptions $N \gg M$ and $N \gg L$ in Eq. (4), we obtain:

$$
\left[ \left(\frac{N}{2} + M - 1\right) \left(\frac{N}{2} + L - 1\right) LM \right] \approx \frac{N^2}{4} \tag{5}
$$

For the entire image (including HH, LH and LL) the computation will require $N^2$ filtering computations. The computation at each sublevel is reduced by a factor of $\frac{1}{2^{J-1}}$.
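The closed form on the right-hand side of Eq. (4) can be checked against the level-by-level sum. The sketch below assumes that level $k$ (counting from 0) processes an $(N/2^k) \times (N/2^k)$ input, so that the terms form a geometric series with ratio 1/4; exact rational arithmetic is used to avoid rounding. The function names are illustrative.

```python
from fractions import Fraction


def filtering_computations(n, levels):
    """Level-by-level sum: (n / 2**k)**2 for k = 0 .. levels - 1."""
    return sum(Fraction(n, 2 ** k) ** 2 for k in range(levels))


def closed_form(n, levels):
    """Right-hand side of Eq. (4): (4/3) * (1 - 4**-levels) * n**2."""
    return Fraction(4, 3) * (1 - Fraction(1, 4 ** levels)) * n * n


total_4_levels = filtering_computations(512, 4)
```

For any number of levels the total stays below $\frac{4}{3} N^2$, so the higher decomposition levels add at most one third to the cost of the first level.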

3. The New Non-Separable Lifting Scheme and Architecture

Our non-separable 2D DWT lifting scheme is based on performing lifting row-wise and convolution column-wise. The operation can also be performed using column-wise lifting and row-wise convolution. It is well known that 2D convolution by an $L \times M$ tap filter is equivalent to performing $L$ 1D convolutions with an $M$ tap filter, the filter coefficients being obtained from the rows of the 2D filter and the 1D input sequence being obtained from the rows of the 2D input data set. Conversely, the 2D filter mask for an $L \times M$ tap filter can be expressed as an $L \times M$ matrix of the analysis filters. Letting $g$ and $h$ be the analysis filters of the DWT, where $g(z) = b_n z^{-n} + b_{n-1} z^{-(n-1)} + \cdots + b_0$ and $h(z) = a_m z^{-m} + a_{m-1} z^{-(m-1)} + \cdots + a_0$, the 2D filter mask is given as:

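$$
hg =
\begin{bmatrix}
a_m b_n & a_m b_{n-1} & \cdots & a_m b_0 \\
a_{m-1} b_n & a_{m-1} b_{n-1} & \cdots & a_{m-1} b_0 \\
\vdots & \vdots & \ddots & \vdots \\
a_0 b_n & a_0 b_{n-1} & \cdots & a_0 b_0
\end{bmatrix}
\begin{bmatrix}
z^{-n} \\ z^{-(n-1)} \\ \vdots \\ 1
\end{bmatrix}
\tag{6}
$$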

Performing lifting row-wise and convolution column-wise leads to the following result.

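$$
hg =
\begin{bmatrix}
c_m & 0 \\
c_{m-1} & 0 \\
\vdots & \vdots \\
c_0 & 0
\end{bmatrix}
\prod
\begin{bmatrix}
1 & u(z) \\
0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 \\
p(z) & 1
\end{bmatrix}
\tag{7}
$$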

In Eq. (7), the length of the DWT filter determines the number of predicting and updating steps. The final equation involves three matrices. The first holds the scaling coefficients, which can be implemented using FIR filters; the second and third are the update and prediction steps of the lifting scheme.
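The predict and update matrices of a lifting factorization can be illustrated with a single 1D lifting pair. The sketch below is not the paper's 9/3 factorization: as a stand-in it uses the well-known 5/3 lifting coefficients (predict $-\frac{1}{2}$, update $\frac{1}{4}$), an even-length input, and a simple clamped boundary, all of which are assumptions made for illustration. It shows the in-place structure and that the inverse is obtained by reversing the steps with opposite signs.

```python
def lift_forward(x, predict=-0.5, update=0.25):
    """One predict step followed by one update step on the even/odd split."""
    even, odd = x[0::2], x[1::2]
    n = len(odd)
    # Predict: detail = odd sample plus a prediction from neighbouring evens.
    d = [odd[i] + predict * (even[i] + even[min(i + 1, len(even) - 1)])
         for i in range(n)]
    # Update: smooth = even sample plus a correction from neighbouring details.
    s = [even[i] + update * (d[max(i - 1, 0)] + d[min(i, n - 1)])
         for i in range(len(even))]
    return s, d


def lift_inverse(s, d, predict=-0.5, update=0.25):
    """Undo the update, then undo the predict, then interleave."""
    n = len(d)
    even = [s[i] - update * (d[max(i - 1, 0)] + d[min(i, n - 1)])
            for i in range(len(s))]
    odd = [d[i] - predict * (even[i] + even[min(i + 1, len(even) - 1)])
           for i in range(n)]
    x = [0.0] * (len(s) + n)
    x[0::2], x[1::2] = even, odd
    return x


signal = [1, 2, 3, 4, 5, 6, 7, 8]
smooth, detail = lift_forward(signal)
restored = lift_inverse(smooth, detail)
```

On a linear ramp the interior detail coefficients vanish, and the reconstruction is exact because each lifting step is undone by subtracting exactly what was added.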

3.1. Example: Daubechies 9/3 Filter Bank (Floating Point)

An example of a wavelet transform is presented and the computation of the non-separable lifting step, based on our proposed factoring algorithm, is explained. In this example, the popular Daubechies 9/3 filter bank (floating point) is used, as defined in the MPEG-4 standard. The results can be generalized to other integer and floating point bi-orthogonal wavelet transforms.


The floating point 9/3 forward transform uses two analysis filters: L (low-pass) and H (high-pass).

$$
\begin{aligned}
L_{(9,3)} &= [\,-0.35355 \quad 0.70711 \quad -0.35355\,]; \\
H_{(9,3)} &= [\,0.03315 \quad -0.06629 \quad -0.17678 \quad 0.41984 \quad 0.99437 \quad 0.41984 \quad -0.17678 \quad -0.06629 \quad 0.03315\,].
\end{aligned}
$$

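Applying our algorithm, the following results are obtained for $HL_{(9,3)}$, $LL_{(9,3)}$, $LH_{(9,3)}$ and $HH_{(9,3)}$.

$$
\begin{aligned}
HL_{(9,3)} &=
\begin{bmatrix} a(1) & 0 \\ a(2) & 0 \\ a(3) & 0 \end{bmatrix}
\times
\begin{bmatrix} 1 & U(1)Z^{-1} + U(2) + U(3)Z^{1} + U(4)Z^{2} \\ 0 & 1 \end{bmatrix}
\times
\begin{bmatrix} 1 & 0 \\ P(1)Z^{-1} + P(2) & 1 \end{bmatrix}
\\[1ex]
LL_{(9,3)} &= S
\begin{bmatrix} b(1) & 0 \\ b(2) & 0 \\ b(3) & 0 \\ b(4) & 0 \\ b(5) & 0 \\ b(6) & 0 \\ b(7) & 0 \\ b(8) & 0 \\ b(9) & 0 \end{bmatrix}
\times
\begin{bmatrix} 1 & U(1)Z^{-1} + U(2) + U(3)Z^{1} + U(4)Z^{2} \\ 0 & 1 \end{bmatrix}
\times
\begin{bmatrix} 1 & 0 \\ P(1)Z^{-1} + P(2) & 1 \end{bmatrix}
\\[1ex]
LH_{(9,3)} &=
\begin{bmatrix} 0 & b(1) \\ 0 & b(2) \\ 0 & b(3) \\ 0 & b(4) \\ 0 & b(5) \\ 0 & b(6) \\ 0 & b(7) \\ 0 & b(8) \\ 0 & b(9) \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ P(1)Z^{-1} + P(2) & 1 \end{bmatrix}
\\[1ex]
HH_{(9,3)} &= S
\begin{bmatrix} 0 & a(1) \\ 0 & a(2) \\ 0 & a(3) \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ P(1)Z^{-1} + P(2) & 1 \end{bmatrix}
\end{aligned}
\tag{8}
$$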

An inherent parallelism can be found in the above equations for HL, HH, LL and LH, which can be exploited in different ways.

3.2. Proposed 2D Architecture

The proposed lifted pipelined-parallel architecture for the computation of the direct 2D DWT is based on our new algorithm. The architecture will be referred to as 2DWT-A; it is a direct mapping of our algorithm onto a pipelined-parallel framework. The high-speed sequential flow of input data, and a throughput approaching two to four data outputs every cycle, make it an attractive option for real-time signal processing in high-speed applications. Adding extra hardware to a low-complexity algorithm in order to increase the throughput is an option worth considering. The same architecture can be used for low-power applications by reducing the clock frequency and supply voltage while still achieving reasonable throughput (at least 1 data output/cc).

To explain the 2DWT-A architecture, we will use the results from the Daubechies 9/3 filter bank example. Equation (8), obtained from the example above, shows coefficients, generic multipliers and scaling factors. The proposed algorithm can also be extended to the design of other bi-orthogonal wavelet transforms.

3.3. The 2DWT-A Architecture

To show our 2D DWT architecture we will use the 9/3 filter example discussed above. After applying our algorithm, the Daubechies 9/3 filters can be represented as the set of four equations (Eq. (8)) for the computation of the LL, LH, HH and HL DWT coefficients, respectively. The conceptual model defined by this set of equations can be easily mapped to a pipelined-parallel architecture. A top-level diagram of the 2DWT-A architecture is shown in Fig. 3. The architecture has two stages. Stage 1 deals with the level-1 architecture, and stage 2 shows the general form of the architecture for the higher decomposition levels (level ≥ 2), where down sampling by two takes place at each subsequent stage.


Figure 3. The direct 2D computation Stage-I using a pipelined-parallel architecture.

3.3.1. Stage I: 2DWT-A Design. Stage-I (Level 1) consists of two parallel modules, $PM^1_1$ and $PM^1_2$, with each module being responsible for the pipelined computation of HH, HL and LL, LH respectively. The routing network performs the decimation of the input data, $x(n_1, n_2)$, into $L^0_{odd}$ and $L^0_{even}$, as shown in Fig. 4(a). The sampling frequency can be 2 or 4 times faster than the processing frequency. The 2DWT-A architecture, in Fig. 5, can process data sampled at twice the module frequency, and the processing block has extra hardware to process the data. The architecture of 2DWT-A consists of two main components. $PM^1_1$ processes the input line $L^0_{odd}$ (odd row input of decomposition level 0) and the LL and LH outputs, and $PM^1_2$ processes the input line $L^0_{even}$ (even row input of decomposition level 0) and the HH and HL outputs. Each module further consists of a finite impulse response (FIR) filter and a lifting step. The architecture clearly differs from the typical lifting step: in our architecture, the input is passed through an L or M-tap FIR filter, depending upon the processing block. The scheduling in this case is simple, with the serial input data being decimated to the $L^0_{odd}$ and $L^0_{even}$ input terminals. The FIR filter is responsible for appropriately scheduling the input data to the predicting and update stages.

Figure 4. The scanning path allows the sampling frequency to be two or four times the processing frequency.

Figure 5. 2DWT-A architecture showing parallel inputs processed through parallel filters and parallel-pipelined lifted stages.

The decimation of the 1D input data, row-wise by a factor of 2, at the input stage is advantageous in dividing the entire level 1 module into parallel modules ($PM^1_1$ and $PM^1_2$). The decimation in the other dimension (column-wise) helps in achieving pipelining within the processing modules. High speed is therefore achieved by parallel processing, which, combined with a pipelined architecture within lifting, achieves a high throughput of 2 data outputs per clock cycle.
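The row-wise routing described above can be sketched in software. This is only a behavioural illustration, not a hardware description: the names `route_rows` and `cycles_for_level1` are hypothetical, rows are assumed to be counted from 1 (so the "odd" stream holds rows 1, 3, 5, ...), and the cycle count simply assumes a fixed number of samples consumed per clock cycle.

```python
def route_rows(image):
    """Split an image (list of rows) into the two row streams.

    Assumption: rows are counted from 1, so the odd stream holds
    indices 0, 2, 4, ... and the even stream holds indices 1, 3, 5, ...
    """
    l_odd = image[0::2]
    l_even = image[1::2]
    return l_odd, l_even


def cycles_for_level1(n, samples_per_cycle):
    """Clock cycles needed to stream an n x n image at a fixed rate."""
    return (n * n) // samples_per_cycle


image = [[r * 4 + c for c in range(4)] for r in range(4)]
l_odd, l_even = route_rows(image)
two_module_cycles = cycles_for_level1(512, 2)
```

With two parallel modules each taking one sample per cycle, a 512 × 512 image is consumed in $N^2/2$ cycles; four samples per cycle gives $N^2/4$.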

A variation of the architecture, 2DWT-A(V-I), is given by the input sampling diagram shown in Fig. 4(b). The input sampling rate will be four times higher than the processing frequency; the processing block, therefore, will require additional hardware to process at the input data rate. The architecture will result in an output data rate of 4 outputs/cc. The complete dyadic operation of the processing of an $N \times N$ image is computed by the 2DWT-A(V-I) architecture in $N^2/4$ clock cycles. This modified 2DWT-A architecture will have a lifting step of 4 instead of 2. This architectural variation will not be discussed here; however, results of its computational complexity and comparisons with other architectures will be provided in Section 4.

3.3.2. Stage II: 2DWT-A Design. In order to achieve 100% hardware utilization in the 2nd and subsequent stages, a polyphase decomposition is used. The technique is similar to the one proposed by Chen [13].


Figure 6. Polyphase decomposition used in the Level 2 architecture for coefficient computation.

Due to the decimation operation in Level 1, the quantity of data required for filtering at Level 2 reduces to one half. Figure 6 shows the polyphase decomposition and splitting of even and odd order parts for the second stage. The internal clock rate is half the input clock rate, hence we can double the clock rate and the second stage will only take half of the time of Level 1 to compute the second level coefficients. Here $f/2$ implies that the output frequency is one half of the input frequency. The architecture therefore constantly increases the speed of the transform computation, with each subsequent stage taking half of the time of the previous stage.
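The idea behind a polyphase decomposition can be shown with the textbook 1D identity: filtering followed by down sampling by two gives the same outputs as filtering the even and odd phases of the signal with the even and odd phases of the filter. The sketch below assumes an even-length input that is zero outside its bounds and a hypothetical integer filter; it illustrates the general identity and is not claimed to reproduce the exact structure of Fig. 6 or of [13].

```python
def filter_then_decimate(x, h):
    """y[n] = sum_k h[k] * x[2n - k], with x taken as zero outside its bounds."""
    half = len(x) // 2
    return [sum(hk * x[2 * n - k]
                for k, hk in enumerate(h) if 0 <= 2 * n - k < len(x))
            for n in range(half)]


def polyphase(x, h):
    """Same outputs computed from the even and odd phases of x and h."""
    x_even, x_odd = x[0::2], x[1::2]
    h_even, h_odd = h[0::2], h[1::2]
    out = []
    for n in range(len(x) // 2):
        acc = 0
        # Even taps k = 2i meet even samples x[2(n - i)].
        for i, hv in enumerate(h_even):
            if 0 <= n - i < len(x_even):
                acc += hv * x_even[n - i]
        # Odd taps k = 2i + 1 meet odd samples x[2(n - i - 1) + 1].
        for i, hv in enumerate(h_odd):
            if 0 <= n - i - 1 < len(x_odd):
                acc += hv * x_odd[n - i - 1]
        out.append(acc)
    return out


x = [3, 1, 4, 1, 5, 9, 2, 6]
h = [1, 2, 3, 4, 5]      # hypothetical 5-tap filter
y_direct = filter_then_decimate(x, h)
y_poly = polyphase(x, h)
```

Each phase runs at half the input sample rate and no computed output is discarded, which is why a polyphase structure can keep the hardware fully utilized.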

4. Performance Analysis

A typical comparison of DWT architectures includes evaluation of latency, hardware complexity and the efficiency (hardware utilization) of the architectures. For an ($L \times M$) tap filter (9/3 transform) with an $N \times N$ input image, a comparison with two of the fastest architectures (to the best of our knowledge) for the 2D direct DWT is given in Table 1.

The Low Power architecture is achieved by keeping the same computational complexity but reducing the computing performance, by reducing the clock frequency by a factor of 4 compared to that used for the other architectures [8]. We still maintain a computing performance higher than that of the comparison architecture ($N^2/4$ cycles as compared to $N^2 + N$).

Our architecture is flexible in that, by maintaining the same clock frequency, we can outperform the other architectures in terms of speed [8] or, by reducing the computational complexity, we can maintain the same speed [6]. In addition, our architecture maintains its high performance with different filter sizes, a case quite frequently required in DWTs used for image compression.

Table 1. Comparison with other published techniques. Complexity figures are for a 9 × 3 tap filter.

| Architecture | Computing performance (cycles) | Multipliers | Adders | Efficiency/Utilization |
|---|---|---|---|---|
| Chakrabarti [8] | $N^2 + N$ | 54 | 52 | Low |
| Marino [6] | $N^2/4$ | 144 | 120 | Medium* |
| Proposed 2DWT-A | $N^2/2$ | 24 | 18 | High |
| Proposed 2DWT-A (V-I) | $N^2/4$ | 36 | 26 | High |

*The efficiency quoted in [6] takes into account equal length filters; however, in the case of a bi-orthogonal DWT the sizes of the filters (high and low pass) are different and the efficiency may be reduced considerably.

Note: In Table 1 computing performance refers to the number of cycles required to complete a dyadic operation. The proposed algorithm can be used in two ways (2DWT-A and 2DWT-A(V-I)).
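The hardware figures in Table 1 can be turned into relative savings with a few lines of arithmetic. The sketch below only restates the multiplier and adder counts from the table; the dictionary layout and the function name `saving` are illustrative choices, not part of the original work.

```python
# Multiplier and adder counts copied from Table 1 (9 x 3 tap filter).
TABLE_1 = {
    "Chakrabarti [8]": {"multipliers": 54, "adders": 52},
    "Marino [6]": {"multipliers": 144, "adders": 120},
    "Proposed 2DWT-A": {"multipliers": 24, "adders": 18},
    "Proposed 2DWT-A (V-I)": {"multipliers": 36, "adders": 26},
}


def saving(proposed, reference, resource):
    """Percentage of a resource saved by `proposed` relative to `reference`."""
    p = TABLE_1[proposed][resource]
    r = TABLE_1[reference][resource]
    return 100.0 * (1 - p / r)


# 2DWT-A(V-I) and Marino [6] share the same N^2/4 cycle count,
# so this pair is the like-for-like comparison at equal speed.
mult_saving = saving("Proposed 2DWT-A (V-I)", "Marino [6]", "multipliers")
adder_saving = saving("Proposed 2DWT-A (V-I)", "Marino [6]", "adders")
```

At the same $N^2/4$ cycle count, the table values correspond to a 75% saving in multipliers and roughly 78% in adders relative to [6].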

To summarize, in this work we have developed an algorithm for a lifting-based direct approach implementation of a 2D DWT. The algorithm has been used to develop an architecture which is scalable and can be applied to any general bi-orthogonal wavelet transform. Our architecture is able to compute a complete dyadic decomposition in $N^2/4$ cycles with more than 50% savings in hardware compared to previous techniques. A portion of this work has already appeared in [14]. Our approach has the following steps:

1. Exploit in-place implementation. This allows the transform to be calculated without allocating auxiliary memory [9].
2. The transform architecture can be directly used for the inverse transform.
3. Exploit vertical and horizontal parallelism in non-separable transforms by performing lifting row-wise and convolution column-wise.
4. Eliminate the use of zero padding for filters having shorter sizes (typically the case for the bi-orthogonal DWT) by performing lifting row-wise.
5. Eliminate transpose memory (direct approach) to reduce latency.
6. Reduce the computational complexity of performing lifting for the non-separable approach in order to offset the main disadvantage of using the direct approach.

5. Conclusion

A solution for a 2D non-separable lifted bi-orthogonal wavelet transform is presented. The technique can be easily adapted to any efficient architecture where inherent parallelism in the solution can be exploited. The proposed architecture is low-power/high-performance with higher hardware utilization, reduced dyadic operation computational time ($N^2/4$) and low hardware complexity compared to other published approaches. Both our new algorithm and architecture are eminently suitable for next generation image compression standards for multimedia applications.

Acknowledgments

The authors acknowledge financial support from iCORE (Alberta), the Micronet Network of Centres of Excellence, and the Natural Sciences and Engineering Research Council of Canada.

References

1. C. Christopoulos, A. Skodras, and T. Ebrahimi, “The JPEG2000 Still Image Coding System: An Overview,” IEEE Transactions on Consumer Electronics, vol. 46, no. 4, 2000, pp. 1103–1127.

2. M.D. Adams and F. Kossentini, “Reversible Integer-to-Integer Wavelet Transforms for Image Compression: Performance Evaluation and Analysis,” IEEE Transactions on Image Processing, vol. 9, no. 6, 2000, pp. 1010–1024.

3. S.G. Mallat, “A Theory for Multiresolution Signal Decomposition: The Wavelet Representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, 1989, pp. 674–693.

4. I. Daubechies and W. Sweldens, “Factoring Wavelet Transforms into Lifting Steps,” Journal of Fourier Analysis and Applications, vol. 4, no. 3, 1998, pp. 247–269.

5. W. Sweldens, “The Lifting Scheme: A Custom-Design Construction of Biorthogonal Wavelets,” Journal of Applied and Computational Harmonic Analysis, vol. 3, 1996, pp. 186–200.

6. F. Marino, “Efficient High-Speed/Low-Power Pipelined Architecture for the Direct 2-D Discrete Wavelet Transform,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, 2000, pp. 1476–1491.

7. F. Marino, “Two Fast Architectures for the Direct 2-D Discrete Wavelet Transform,” IEEE Transactions on Signal Processing, vol. 49, 2001, pp. 1248–1259.

8. C. Chakrabarti and M. Vishwanath, “Efficient Realizations of the Discrete and Continuous Wavelet Transforms: From Single Chip Implementations to Mappings on SIMD Array Computers,” IEEE Transactions on Signal Processing, vol. 43, 1995, pp. 759–771.

9. Chu Yu and Sao-Jie Chen, “Design of an Efficient VLSI Architecture for 2-D Discrete Wavelet Transforms,” IEEE Transactions on Consumer Electronics, vol. 45, 1999, pp. 135–140.

10. M.H. Sheu, M.D. Shieh, and S.W. Liu, “A Low Cost VLSI Architecture Design for Non-Separable 2-D Discrete Wavelet Transform,” in Proc. 40th Midwest Symp. Circuits and Systems, vol. 2, 1997, pp. 1217–1220.

11. H. Meng and Z. Wang, “Fast Spatial Combinative Lifting Algorithm of Wavelet Transform Using the 9/7 Filter for Image Block Compression,” Electronics Letters, vol. 36, 2000, pp. 1766–1767.

12. L. Liu, X. Wang, H. Meng, L. Zhang, Z. Wang, and H. Chen, “A VLSI Architecture of Spatial Combinative Lifting Algorithm Based 2-D DWT/IDWT,” in Asia-Pacific Conference on Circuits and Systems, October 2002, pp. 299–304.

13. Po-Cheng Wu and L.G. Chen, “An Efficient Architecture for Two-Dimensional Discrete Wavelet Transform,” IEEE Transactions on Circuits & Systems for Video Tech., vol. 11, no. 4, 2001.

14. M. Alam, W. Badawy, V. Dimitrov, and G. Jullien, “Efficient Direct 2D Architecture for Lifted Biorthogonal DWT,” IEEE Workshop on Signal Processing Systems, 2003, pp. 340–345.

M. Alam (Student) is currently an M.Sc. student in the Department of Electrical and Computer Engineering at the University of Calgary. His research interests include VLSI signal processing. He is a recipient of the iCORE International Graduate [email protected]

Wael Badawy (Ph.D. 00, M.Sc. 98, 97; B.Sc. 94) is an associate professor in the Department of Electrical and Computer Engineering. He holds an adjunct professorship in the Department of Mechanical Engineering, University of Alberta.

Dr. Badawy’s research interests are in the areas of microelectronics, VLSI architectures for video applications with low-bit rate applications, digital video processing, low power design methodologies, and VLSI prototyping. His research involves designing new models, techniques, algorithms, architectures and low power prototypes for novel systems and consumer products. Dr. Badawy has authored and co-authored more than 100 peer reviewed journal and conference papers and about 30 technical reports. He is the Guest Editor for the special issue on System on Chip for Real-Time Applications in the Canadian Journal on Electrical and Computer Engineering, the Technical Chair for the 2002 International Workshop on SoC for real-time applications, and a technical reviewer for several IEEE journals and conferences. He is currently a member of the IEEE-CAS Technical Committee on Communication. Dr. Badawy was honored with the “2002 Petro Canada Young Innovator Award”, the “2001 Micralyne Microsystems Design Award” and the “1998 Upsilon Pi Epsilon Honor Society and IEEE Computer Society Award for Academic Excellence in Computer Disciplines”. He is currently the Chairman of the Canadian Advisory Committee (CAC) and Head of the Canadian Delegation on ISO/IEC/JTC1/SC6 “Telecommunications and Information Exchange Between Systems”; a member of the Canadian Advisory Committee for the Standards Council of Canada, Subcommittee 29: Coding of Audio, Picture, Multimedia and Hypermedia Information; and a Canadian Delegate to the ISO/IEC MPEG standard committee. He is a voting member of the VSI Alliance. He is also the Chair of the IEEE-Southern Alberta Society-Computer [email protected]

Vassil S. Dimitrov was born in Plovdiv, Bulgaria, in 1964. He received his Ph.D. degree in mathematics in 1995 from the Mathematical Institute of the Bulgarian Academy of Sciences. Since then, he has spent two years as a postdoctoral fellow at the VLSI Research Group, University of Windsor, Canada, one year as a research scientist at the Reliable Software Technology Corporation, Virginia, USA, one year as a chief research scientist at the Signal Processing and Computer Technology Laboratory, Helsinki University of Technology, Finland, and one year as an Associate Professor at the University of Windsor, Canada. Since July 2001 he has held the position of Associate Professor at the Department of Electrical and Computer Engineering, University of Calgary, Canada. His main interests are in the areas of number theoretic algorithms, computational complexity, cryptography, optimization theory, fast algorithms for digital signal processing and related topics. Dr. Dimitrov is a member of the New York Academy of Sciences.

Graham Jullien (Fellow IEEE) was educated in the United Kingdom, receiving degrees in Electrical Engineering from the Universities of Loughborough, Birmingham and Aston (Ph.D., 1969). He was a student engineer and data processing engineer at English Electric Computers, UK, from 1961 to 1966, and a visiting senior research engineer at the Central Research Laboratories of EMI Ltd., UK, from 1975 to 1976. From 1969 until 2000 he was with the Department of Electrical and Computer Engineering at the University of Windsor, Ontario, Canada, where he held the rank of University Professor and was the Director of the VLSI Research Group. Since January 2001, he has been with the Department of Electrical and Computer Engineering at the University of Calgary, where he holds the iCORE Research Chair in Advanced Technology Information Processing Systems. He is a member of the Board of Directors of the Canadian Microelectronics Corporation (CMC) and is a member of the Steering Committee and Board of Directors of the Micronet Network of Centres of Excellence. He has published widely in the fields of Digital Signal Processing, Computer Arithmetic, Neural Networks and VLSI Systems, and teaches courses in related areas. He has served on the technical committees of many international conferences; he currently serves on the Editorial Board of the Journal of VLSI Signal Processing; and he is a past Associate Editor of the IEEE Transactions on Computers. He hosted and was program co-chair of the 11th IEEE Symposium on Computer Arithmetic, was program chair for the 8th Great Lakes Symposium on VLSI, and was the technical program chair for the 1999 Asilomar Conference on Signals, Systems and Computers. He is general chair for the 2003 Asilomar Conference and general co-chair of the International Workshop on System-on-Chip for Real-Time Systems, Calgary, Alberta [email protected]