
European Journal of Scientific Research ISSN 1450-216X Vol.35 No.3 (2009), pp.378-392 © EuroJournals Publishing, Inc. 2009 http://www.eurojournals.com/ejsr.htm

Design and Implementation of Parallel and Pipelined Distributive Arithmetic Based Discrete Wavelet Transform IP Core

M. Nagabushanam Department of Electronics and Communication Engineering

M.S. Ramaiah Institute of Technology, VTU, Bangalore – 560054 E-mail: [email protected]

Tel: 09986019083, +91-080-23600822 / 23603123; Fax: +91-080-23600822

Cyril Prasanna Raj P Asst. Professor, VSDC, MSRSAS, Bangalore

E-mail: [email protected]

S. Ramachandran Professor, Dept of ECE, SJBIT, Bangalore

E-mail: [email protected]

Abstract

The Discrete Wavelet Transform (DWT) has gained a reputation as a very effective signal analysis tool for many practical applications. This paper presents an approach to VLSI implementation of the Discrete Wavelet Transform for image compression. The design conforms to the JPEG2000 standard and can be used for both lossy and lossless compression. In the Discrete Wavelet Transform, the filter implementation plays the key role. A poly-phase structure is proposed for the filter implementation, which uses the Distributive Arithmetic (DA) technique. The implementation of the DA based DWT IP core in ASIC exploits the lookup table-based architecture, which is popular in FPGA implementations. To exploit the available resources on FPGAs, a new technique which incorporates pipelining and parallel processing of the input samples is proposed. The proposed DA based DWT architecture is faster than the conventional architecture. The RTL design works on both FPGA and ASIC platforms. The soft IP core design was targeted to Xilinx Spartan3E and Virtex II Pro devices, and the same design was carried out on an ASIC platform.

Keywords: Discrete Wavelet Transforms (DWT), Distributive Arithmetic (DA), Poly-phase structure, and convolution.

1. Introduction

The Discrete Wavelet Transform (DWT) is being increasingly used for image coding. This is because the DWT can decompose signals into different sub-bands with both time and frequency information. It also supports features like progressive image transmission, compressed image manipulation, and region of interest coding [1]. Recently, several VLSI architectures have been proposed to realize single chip designs for DWT [2]-[7]. Traditionally, such algorithms are implemented using programmable DSP chips for low-rate applications, or VLSI application specific integrated circuits (ASICs) for higher rates.

In wavelet transforms, the original signal is divided into frequency resolution and time resolution contents. For this purpose, a cutting window, known as the “Mother Wavelet”, is used. Cutting the signal corresponds to a convolution between the signal and the cutting window: the signal is convolved with the specified filter coefficients to give the required frequency information. The decomposition of an image using a 2-level DWT is shown in Figure 1 [8].

Figure 1: Decomposition of Image [8].

In order to perform the convolution, we require a fast multiplier that performs the operations efficiently and quickly. Because the filter coefficients remain constant during the entire duration of the transform, constant coefficient multipliers (CCMs) are considered for the design. In this work, a Distributed Arithmetic based multiplier is used. Distributed arithmetic is a bit level rearrangement of a multiply-accumulate to hide the multiplications. It is a powerful technique for reducing the size of a parallel hardware multiply-accumulate.

The paper is organized as follows: Section 2 describes in brief the concept of the traditional DWT and the convolution-based filter architectures specified in the JPEG2000 standard. Section 3 shows MATLAB results obtained from a software reference model. Section 4 gives a detailed analysis of the distributive arithmetic algorithm and filter architecture, followed by the FPGA and ASIC implementation results. Section 5 presents the conclusions.

2. Discrete Wavelet Transform

Wavelets convert the image into a series of wavelets that can be stored more efficiently than pixel blocks. Although wavelets also have rough edges, they are able to render pictures better by eliminating the “blockiness” that is a common feature of DCT based compression [7]. Not only does this make for smoother color toning and clearer edges where there are sharp changes of color, it also gives smaller file sizes than a JPEG image with the same level of compression.

Compression is accomplished through the use of the encoder, which is presented in Figure 2. This is similar to other transform based coding schemes. The transform is first applied to the source image data. The transform coefficients are then quantized and entropy coded to generate the compressed bit stream. The decoder is just the inverse of the encoder. Unlike other coding schemes, JPEG 2000 can be both lossy and lossless. This depends on the wavelet transform and the quantization applied [8]-[10].


Figure 2: JPEG2000 Block Diagram [8].

The JPEG 2000 standard works on image tiles. The source image is partitioned into rectangular non-overlapping blocks in a process called tiling. These tiles are compressed independently as though they were entirely independent images. All operations, including component mixing, wavelet transform, quantization, and entropy coding, are performed independently on each different tile. The nominal tile dimensions are powers of two, except for those on the boundaries of the image. Tiling is done to reduce memory requirements, and since each tile is reconstructed independently, they can be used to decode specific parts of the image, rather than the whole image. Each tile can be thought of as an array of integers in sign-magnitude representation. This array is then described in a number of bit planes. These bit planes are a sequence of binary arrays with one bit from each coefficient of the integer array. The first bit plane contains the most significant bit (MSB) of all the magnitudes. The second array contains the next MSB of all the magnitudes, continuing in this fashion until the final array, which consists of the least significant bits of all the magnitudes.
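The bit-plane description above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `bit_planes` and the separate sign list are hypothetical choices made here, not part of the JPEG 2000 standard text or of the authors' design.

```python
def bit_planes(coeffs, num_bits):
    """Split integer coefficients into sign bits and magnitude bit planes.

    Returns (signs, planes). planes[0] holds the most significant bit of
    every magnitude and planes[-1] holds the least significant bit.
    """
    signs = [1 if c < 0 else 0 for c in coeffs]
    mags = [abs(c) for c in coeffs]
    if any(m >= (1 << num_bits) for m in mags):
        raise ValueError("a magnitude does not fit in num_bits")
    planes = []
    for bit in range(num_bits - 1, -1, -1):  # walk from MSB down to LSB
        planes.append([(m >> bit) & 1 for m in mags])
    return signs, planes


# Four coefficients with 3-bit magnitudes: 5 = 101, 3 = 011, 0 = 000, 6 = 110
signs, planes = bit_planes([5, -3, 0, 6], num_bits=3)
print(signs)   # sign bit per coefficient
print(planes)  # one binary array per bit plane, MSB plane first
```

Each inner list is one binary array containing one bit from every coefficient, which matches the ordering described in the paragraph.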

Before the forward discrete wavelet transform, or DWT, is applied to each tile, all image tiles are DC level shifted by subtracting the same quantity, such as the component depth, from each sample. DC level shifting involves moving the image tile to a desired bit plane, and is also used for region of interest coding, which is explained later. This process is pictured in Figure-3.

Figure 3: Tiling, DC Level Shifting, and DWT on Each Tile [8].

As mentioned earlier, both reversible integer-to-integer and nonreversible real-to-real wavelet transforms can be used. Since lossless compression requires that no data be lost due to rounding, a reversible wavelet transform that uses only rational filter coefficients is used for this type of compression. In contrast, lossy compression allows for some data to be lost in the compression process, and therefore nonreversible wavelet transforms with non-rational filter coefficients can be used. In order to handle filtering at signal boundaries, symmetric extension is used. Symmetric extension adds a mirror image of the signal to the outside of the boundaries so that large errors are not introduced at the boundaries. The default irreversible transform is implemented by means of the biorthogonal Daubechies 9-tap/7-tap filter. The Daubechies wavelet family is one of the most important and widely used wavelet families. The analysis filter coefficients for the Daubechies 9-tap/7-tap filter, which are used for the dyadic decomposition, are given in Table 1 [8]. The default reversible transform is implemented by means of the Le Gall 5-tap/3-tap filter, the coefficients of which are given in Table 2 [8].


Table 1: Daubechies 9/7 Analysis Filter Coefficients [8]

k      Low pass Filter (hk)      High pass Filter (gk)
0      0.6029490182363579        1.115087052456994
±1     0.2668641184428723        -0.5912717631142470
±2     -0.07822326652898785      -0.05754352622849957
±3     -0.01686411844287495      0.09127176311424948
±4     0.02674875741080976

Table 2: Le Gall 5/3 Analysis Filter Coefficients [8]

k      Low pass Filter (hk)      High pass Filter (gk)
0      6/8                       1
±1     2/8                       -1/2
±2     -1/8
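The symmetric extension used at the signal boundaries can be sketched as follows. The sketch assumes whole-sample mirroring, in which the boundary sample itself is not repeated; the text only says that a mirror image is added, so this choice and the name `symmetric_extend` are assumptions made for illustration.

```python
def symmetric_extend(x, pad):
    """Whole-sample symmetric extension of a 1-D signal.

    The signal is mirrored about its first and last samples without
    repeating them: [a b c d] with pad=2 becomes [c b | a b c d | c b].
    """
    if pad >= len(x):
        raise ValueError("pad must be smaller than the signal length")
    left = [x[i] for i in range(pad, 0, -1)]        # x[pad], ..., x[1]
    right = [x[-1 - i] for i in range(1, pad + 1)]  # x[-2], x[-3], ...
    return left + list(x) + right


extended = symmetric_extend([10, 20, 30, 40], pad=2)
print(extended)
```

For a 9-tap filter centred on the current sample, a pad of 4 samples on each side would be the natural choice, since the filter then never reads past the extended signal.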

3. Matlab Implementation

In this paper, a software reference model is developed to realize image compression using DWT. The purpose of the software reference model is to help in understanding the image properties before and after compression and decompression. It also helps in understanding the wavelet filter properties. Figures 4 to 8 demonstrate the results obtained using the software reference model.

Figure 4: Original image


Figure 5: First level decomposition using 9/7 wavelet filter

Figure 6: Second level decomposition using 9/7 wavelet filter


Figure 7: Two level decomposition results

Figure 8: Decompressed results

The results shown in the figures above demonstrate that the 9/7 wavelet filter is a perfect reconstruction filter. The results of decompression are shown in Figure 8.


4. Hardware Realization of DWT for Image Compression

To implement the DWT, the architecture shown in Figure 9 is adopted. This architecture is a combination of line by line and level by level decomposition architectures [11]. It is adopted for its high throughput and low latency.

Figure 9: DWT architecture [11].

Figure 10 shows the top level architecture of the data flow in the DWT computation. External memory stores the original image, which is fed into the row processor and the intermediate output is stored in the intermediate memory. The column processor computes the 2-D DWT for level one. The row processor and column processor are reused to find the three level decomposition output. The approximation and detail coefficients are obtained as shown in Figure 10.

Figure 10: Proposed hardware architecture for DWT implementation


4.1. Computation Complexity of DWT Architecture

The computational complexity for hardware implementation of the DWT is as follows. For an M x M image, the computational complexity of the DWT is

$$C = \frac{16\, M^{2} L \left(1 - 4^{-J}\right)}{3}$$

where L is the filter length and J is the number of decomposition levels. For example, a 256 x 256 image decomposed using J = 5 and L = 10 requires about 3.5 million operations.

The proposed architecture has the following features: image size of 512 x 512; symmetric extended 9/7 filter coefficients; distributive arithmetic algorithm for hardware simplification; signed number representation.
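As a quick check of the complexity expression, the following Python sketch evaluates C = 16 M^2 L (1 - 4^(-J)) / 3 for the example quoted in the text. The function name `dwt_operation_count` is hypothetical and is used here only for illustration.

```python
def dwt_operation_count(m, filter_len, levels):
    """Operation count C = 16 * M^2 * L * (1 - 4^-J) / 3 for an M x M image."""
    return 16 * m * m * filter_len * (1 - 4 ** (-levels)) / 3


# Example from the text: 256 x 256 image, L = 10, J = 5
ops = dwt_operation_count(256, filter_len=10, levels=5)
print(ops)  # about 3.5 million operations
```

The factor (1 - 4^(-J)) reflects that each further decomposition level operates on a quarter of the samples of the previous level, so the total approaches 16 M^2 L / 3 as J grows.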

4.2. Convolution Based Discrete Wavelet Transform

The traditional DWT can be realized by a convolution based implementation [10] [11]. In the forward transform, the input sequence x[n] is filtered by the low-pass filter h[k] and the high-pass filter g[k] and down-sampled by two to obtain the low-pass and high-pass DWT coefficients, s[n] and d[n]. The equations may be written as follows:

$$s[n] = \sum_{k} h[k]\, x[2n - k]$$

$$d[n] = \sum_{k} g[k]\, x[2n - k]$$
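The two sums above can be evaluated directly, as in the sketch below. It assumes causal filter indexing (k = 0 ... L-1) and treats x as zero outside its range instead of applying symmetric extension. The two-tap averaging and differencing pair is chosen only so that the arithmetic is easy to check by hand; it is not the 9/7 filter used in this paper, and the function name `analysis_filter` is hypothetical.

```python
def analysis_filter(x, coeffs):
    """Compute y[n] = sum_k coeffs[k] * x[2n - k], with x zero outside its range."""
    # Every second sample of the full convolution (length len(x) + len(coeffs) - 1)
    out_len = (len(x) + len(coeffs)) // 2
    y = []
    for n in range(out_len):
        acc = 0.0
        for k, c in enumerate(coeffs):
            idx = 2 * n - k
            if 0 <= idx < len(x):
                acc += c * x[idx]
        y.append(acc)
    return y


x = [2, 4, 6, 8]
h = [0.5, 0.5]    # illustrative low-pass (average) filter
g = [0.5, -0.5]   # illustrative high-pass (difference) filter

s = analysis_filter(x, h)  # low-pass coefficients s[n]
d = analysis_filter(x, g)  # high-pass coefficients d[n]
print(s, d)
```

Only the even-indexed convolution outputs are ever computed, which is the sense in which filtering and down-sampling by two are combined in a single step.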

The poly-phase structure for the convolution based DWT is shown in Figure 11. The poly-phase form takes advantage of splitting the input signal into odd and even samples, which automatically decimates the input by 2. The filter coefficients are likewise split into even and odd components, so that Xeven convolves with G0,even of the filter and Xodd convolves with G0,odd of the filter. The two phases are added together at the end to produce the low pass output. A similar method is applied to the high pass filter, which is split into even and odd phases H0,even and H0,odd.

Figure 11: Poly-phase high pass filter bank
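The equivalence between the poly-phase form and filtering followed by decimation can be checked numerically. The sketch below is an illustration under assumptions: it uses zero padding at the boundaries instead of symmetric extension, causal tap indexing, and hypothetical helper names. The one-sample delay on the odd branch follows from x[2n - (2m+1)] = x_odd[n - m - 1].

```python
def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def decimated_direct(x, h):
    """Filter at the full rate, then keep every second output sample."""
    return convolve(x, h)[::2]


def decimated_polyphase(x, h):
    """Filter the even and odd phases at half rate and add the two branches."""
    x_even, x_odd = x[0::2], x[1::2]
    h_even, h_odd = h[0::2], h[1::2]
    even_branch = convolve(x_even, h_even)
    odd_branch = [0.0] + convolve(x_odd, h_odd)  # one-sample delay on the odd phase
    n = max(len(even_branch), len(odd_branch))
    even_branch += [0.0] * (n - len(even_branch))
    odd_branch += [0.0] * (n - len(odd_branch))
    return [e + o for e, o in zip(even_branch, odd_branch)]


# 9-tap low-pass filter of Table 1, written out for taps 0..8
LOW_PASS = [0.02674875741080976, -0.01686411844287495, -0.07822326652898785,
            0.2668641184428723, 0.6029490182363579, 0.2668641184428723,
            -0.07822326652898785, -0.01686411844287495, 0.02674875741080976]

x = [float(v) for v in range(1, 9)]
direct = decimated_direct(x, LOW_PASS)
poly = decimated_polyphase(x, LOW_PASS)
print(max(abs(a - b) for a, b in zip(direct, poly)))  # difference is at rounding level
```

The poly-phase version performs the same multiplications as the retained outputs of the direct version but never computes the samples that decimation would discard, and its two branches can run in parallel.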

The forward DWT (FDWT) can be performed on a signal using different types of filters such as db9, db7, db4 or Haar. The forward transform can be done in two ways: by the matrix multiply method or by linear equations. In the FDWT, each step calculates a set of wavelet averages (approximation or smooth values) and a set of details. If a data set s_0, s_1, ..., s_{N-1} contains N elements, there will be N/2 averages and N/2 detail values. The averages are stored in the upper half and the details are stored in the lower half of the N element array. In this paper, Daubechies 9/7 filter coefficients are considered. The filter coefficients are given in Table 3.


Table 3: Daubechies 9/7-tap filter coefficients

Taps   Low Pass Filter           Taps   High Pass Filter
4      0.6029490182363579        3      1.115087052456994
3,5    0.2668641184428723        2,4    -0.5912717631142470
2,6    -0.078223266528987        1,5    -0.0575435262284995
1,7    -0.016864118442874        0,6    0.09127176311424948
0,8    0.02674875741080976
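The storage layout of one FDWT step, with N/2 averages followed by N/2 details in the same N element array, can be illustrated with the Haar filter mentioned in the text. This is a sketch under assumptions: the unnormalized Haar pair is used because it is the simplest case, "upper half" is read as the first N/2 entries, and the function name `haar_step` is hypothetical. The paper's design uses the 9/7 coefficients of Table 3 instead.

```python
def haar_step(data):
    """One forward Haar step: averages in the first half, details in the second."""
    if len(data) % 2:
        raise ValueError("input length must be even")
    half = len(data) // 2
    out = [0.0] * len(data)
    for i in range(half):
        a, b = data[2 * i], data[2 * i + 1]
        out[i] = (a + b) / 2         # average (approximation or smooth value)
        out[half + i] = (a - b) / 2  # detail
    return out


result = haar_step([2, 4, 6, 8])
print(result)
```

Applying the same step again to the first half only would give the next decomposition level.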

Using the proposed filter structure, the three-level image decomposition is as shown in Figure 12.

Figure 12: Image Decomposition

4.3. Distributive Arithmetic Based Architecture

In this section, we first outline how to perform multiplication by using memory based architecture. Following this, we briefly explain architecture for DWT filter bank. Using this we show complete design for block based DWT.

In computing the DCT, DFT and DWT, multipliers are the fundamental computing elements. Since these multipliers consume significant area, the number of multipliers and adders that can be employed on a chip is limited. The memory based approach provides an efficient way to replace multipliers by small ROM tables such that the DWT filter can attain high computing speeds with a small silicon area.

Traditionally, multiplication is performed using logic elements such as adders and registers. However, multiplication of two n-bit input variables can also be performed by a ROM table with 2^(2n) entries. Each entry stores the pre-computed result of a multiplication (see Figure 13). The ROM table lookup is faster than hardware multiplication if the look-up table is stored in on-chip memory. In general, a 2^(2n)-word ROM table is too large to be practical. In the DWT, however, one of the input variables of the multiplier is fixed. Therefore, a multiplier can be realized by a ROM of 2^n entries, as shown in Figure 13.

Figure 13: ROM table approach for multiplication
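The ROM table approach with one fixed operand can be sketched as a pre-computed list indexed by the variable operand. The constant 154 and the names `build_rom` and `rom` are illustrative assumptions; the variable operand is assumed to be an unsigned n-bit value.

```python
def build_rom(coefficient, n_bits):
    """Pre-compute coefficient * value for every unsigned n-bit value."""
    return [coefficient * value for value in range(1 << n_bits)]


# One fixed integer coefficient and an 8-bit variable input need 2^8 = 256 entries
rom = build_rom(154, n_bits=8)
product = rom[200]   # a table lookup replaces the multiplication 154 * 200
print(len(rom), product)
```

With both operands variable, the table would need 2^(2n) = 65536 entries for n = 8, which illustrates why fixing one operand makes the approach practical.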

4.4. Design of Filters

The computational modules of the DWT consist of the filters H(z) and G(z), as defined in the previous section. The transfer functions of these filters can be represented as

$$G(z) = g_0 + g_1 z^{-1} + \cdots + g_{l-1} z^{-(l-1)}$$

$$H(z) = h_0 + h_1 z^{-1} + \cdots + h_{l-1} z^{-(l-1)}$$

The DWT filters consist of filter coefficients g(i) and delays z^(-i). The DWT coefficients are generated after applying the high-pass and the low-pass filter. Assume that the size of the input variables is n bits. If we multiply the l input variables with the filter coefficients in parallel, l n-bit multipliers and (l-1) 2n-bit adders are needed.

In the proposed memory based architecture, the multiplications are performed using ROM tables. The size of the ROM table needed to implement a filter with n input variables is 2^n. The table consists of all the possible combinations of the inputs. The table for three filter coefficients is shown in Table 4.

Table 4: Look-up table

Address   Content      Address   Content
000       0            100       C2
001       C0           101       C2 + C0
010       C1           110       C2 + C1
011       C0 + C1      111       C2 + C1 + C0

To access the look-up table, we have the same number of registers as filter coefficients. The input x[n] enters a serial shift register, which is used to access the look-up table. When the next input comes into the first register, the old value is pushed into the next register. In the same way, as further values come into the registers, the oldest values are shifted out.

To form the address from the input values, we take the bit at a given bit position from every register. For example, to get the first address, we take the LSBs of all the serial registers. This address returns the look-up table value for the first bit position. In the same manner, an address is formed for every bit position and the corresponding value is read from the look-up table. Before adding, each value is shifted by its bit position and then given to the adder. The final result is the convolution of the filter coefficients and the inputs. The accessing of the look-up table is shown in Figure 14.

Figure 14: Accessing ROM table [12].
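The look-up procedure described above can be sketched as a bit-serial loop. This is a minimal illustration under assumptions: the inputs are unsigned n-bit integers (signed two's complement inputs would need the sign-bit term to be subtracted), the coefficients are integers, the look-up table follows the addressing of Table 4 (address bit i selects Ci), and all function names are hypothetical.

```python
def build_da_lut(coeffs):
    """LUT entry at address a is the sum of coeffs[i] for every set bit i of a."""
    lut = []
    for address in range(1 << len(coeffs)):
        lut.append(sum(c for i, c in enumerate(coeffs) if (address >> i) & 1))
    return lut


def da_inner_product(samples, coeffs, n_bits):
    """Compute sum_i coeffs[i] * samples[i] with one LUT access per bit position."""
    lut = build_da_lut(coeffs)
    acc = 0
    for bit in range(n_bits):                      # LSB first, one "clock" per bit
        address = 0
        for i, sample in enumerate(samples):
            address |= ((sample >> bit) & 1) << i  # same bit taken from each register
        acc += lut[address] << bit                 # weight the entry by its bit position
    return acc


coeffs = [3, -5, 7]        # illustrative integer coefficients C0, C1, C2
samples = [10, 200, 37]    # register contents, unsigned 8-bit values
lut = build_da_lut(coeffs)
result = da_inner_product(samples, coeffs, n_bits=8)
print(lut)
print(result, sum(c * s for c, s in zip(coeffs, samples)))
```

The loop runs n_bits times regardless of the number of coefficients; the price is a look-up table that doubles in size with every additional coefficient.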

The same architecture will be used for the both high-pass and low-pass filters. If the input is 8-bit length, then we require 8 clock cycles to get the convolved value. In computing the wavelet coefficients the filter operations are specified using floating point arithmetic. However, integer arithmetic is used in practice. Thus, the filter coefficients are truncated. This truncation reduces the accuracy of the computed coefficients and hence affects the reconstructed image quality. By using a wide length multiplier, we can reduce the truncation error and can potentially improve the image quality.

From Table 3, it is clear that the filter coefficients are in fractional form. Representing them directly in binary form would increase the number of representation bits and hence the number of computations. We therefore convert them to integer values by multiplying each coefficient by a known constant and rounding off the result. In this paper, the constant 256 is used. At the final stage, the results are divided by the same value, 256.
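The scaling by 256 can be sketched as follows. The low-pass coefficients are taken from Table 1; rounding to the nearest integer and the use of floor division for the final division by 256 are assumptions made for this illustration, as are the function names.

```python
SCALE = 256

# 9/7 low-pass analysis coefficients (Table 1), taps 0..8
LOW_PASS = [0.02674875741080976, -0.01686411844287495, -0.07822326652898785,
            0.2668641184428723, 0.6029490182363579, 0.2668641184428723,
            -0.07822326652898785, -0.01686411844287495, 0.02674875741080976]


def quantize(coeffs, scale=SCALE):
    """Scale the fractional coefficients and round them to integers."""
    return [round(c * scale) for c in coeffs]


def filter_sample(window, q_coeffs, scale=SCALE):
    """Integer multiply-accumulate followed by the final division by the scale."""
    acc = sum(w * c for w, c in zip(window, q_coeffs))
    return acc // scale   # floor division; equivalent to a right shift by 8 bits


q_low = quantize(LOW_PASS)
print(q_low, sum(q_low))
print(filter_sample([100] * 9, q_low))  # a constant input passes through unchanged
```

Because 256 is a power of two, the final division reduces to discarding the eight least significant bits, so no divider is needed in hardware.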

To speed up the process, a parallel implementation of the Distributive Arithmetic (DA) can be used. The structure is shown in Figure 15. In the parallel implementation, we divide the input data into even samples and odd samples based on their position. The filter coefficients can likewise be split into even and odd samples. The even samples convolve with the even and odd filter coefficients, and at the same time the odd samples convolve with the same coefficients. In this way, the results for both the even and the odd input samples are obtained at the same time.

Figure 15: Parallel implementation of DA technique [12].

Figure 16 shows the Modelsim simulation result for the DWT. The architecture proposed and discussed in the previous section is modeled using HDL, and a suitable test bench is developed for verification. The test vectors are taken from image samples available in Matlab. Figure 17 shows the comparison with the MATLAB results. We find that the MATLAB and Modelsim results match exactly, confirming the validity of our architecture.


Figure 16: DWT Modelsim simulation results

Figure 17: Comparisons with MATLAB Results

The design is synthesized using Xilinx ISE; the targeted device is a Virtex II Pro consisting of 30 million gates. We have chosen this device family because the image compression algorithm requires large memory, and storage of large data is supported by the Virtex II Pro device. The design is optimized for area, power and timing.


Figure 18: FPGA synthesis schematic

4.5. FPGA Synthesis Results

Selected Device: 2vp30ff896-7
Number of Slices: 6059 out of 13696 44%
Number of Slice Flip Flops: 5222 out of 27392 19%
Number of 4 input LUTs: 8025 out of 27392 29%
Number of bonded IOBs: 120 out of 556 21%
Number of GCLKs: 1 out of 16 6%
Speed Grade: -7
Minimum period: 6.248ns (Maximum Frequency: 160.051MHz)
Minimum input arrival time before clock: 4.367ns
Maximum output required time after clock: 3.922ns
Maximum combinational path delay: No path found

The device utilization is 44%, which implies that the design requires 13.2 million gates out of 30 million gates. This leaves enough space for further improvement and for multiple functions to be implemented on the selected FPGA. The maximum frequency at which the design works is 160 MHz; this can be further improved by changing the architecture complexity.

4.6. ASIC Implementation

The coding was done using a bottom-up design flow, i.e., the individual components were designed using behavioural coding and the individual blocks were then connected using a structural description.

The design was implemented on SPARTAN-2E FPGA board with the 32x32 image. For the implementation the image was stored in a ROM and mapped on to the FPGA. The DWT code was mapped on to another FPGA and interfaced together. The results were observed on the LEDs. The same code was simulated in ModelSim simulator.

After simulating the designs extensively, they were synthesized using the Synopsys Design Compiler with TSMC’s 0.13-micron CMOS standard cell library at a frequency of 125 MHz. The designs were optimized for delay with a maximum output capacitance of 0.1, an operating voltage of 1.2 V, and a temperature of 25 °C.

The synthesized code was compared with the reference design and the two were found to be the same. Synopsys Formality was used for this purpose. The tool takes reference points in the implemented design and compares them with the reference design.

Static timing analysis was carried out using Primetime. In Primetime, block level timing analysis is carried out with the same constraints as in Design Compiler. The timing analysis was done twice, before and after the layout. In the pre-layout timing analysis, wire load models were used to calculate the delays. In the post global route timing analysis, the original parasitic values were considered. Finally, placement and routing were carried out in Synopsys Astro. In this process, the placement was carried out manually, and timing analysis was done for the placed design. The final placed and routed design is shown in Figure 19.

Figure 19: Placed and Routed design from Astro

5. Conclusions

The Discrete Wavelet Transform provides a multi-resolution representation of images. The transform has been implemented using filter banks. For the design, the area, power and timing performance were obtained based on the constraints. Depending on the application and the constraints imposed, the appropriate architecture can be chosen. For the Daubechies 9/7 biorthogonal filter, the poly-phase architecture with the DA technique was implemented. It is seen that in applications which require low area, low power consumption and high throughput, e.g., real-time applications, the poly-phase with DA architecture is more suitable. The biorthogonal wavelets, with a different number of coefficients in the low pass and high pass filters, increase the number of operations and the complexity of the design, but they have better SNR than the orthogonal filters. The DWT IP core was designed using the Synopsys ASIC design flow. First, the code was written in VHDL and implemented on the FPGA using a 32x32 random image. Then, the code was taken through the ASIC design flow. For the ASIC design flow, an 8x8 memory was considered to store the image. This architecture enables fast computation of the DWT with parallel processing. It has low memory requirements and consumes low power. The same concepts are useful in designing the Inverse Discrete Wavelet Transform (IDWT).


References

[1] David S. Taubman and Michael W. Marcellin, “JPEG 2000 – Image Compression, Fundamentals, Standards and Practice”, Kluwer Academic Publishers, Second printing, 2002.
[2] G. Knowles, “VLSI Architecture for the Discrete Wavelet Transform”, Electronics Letters, vol. 26, pp. 1184-1185, 1990.
[3] M. Vishwanath, R. M. Owens, and M. J. Irwin, “VLSI Architectures for the Discrete Wavelet Transform”, IEEE Trans. Circuits and Systems II, vol. 42, no. 5, pp. 305-316, May 1995.
[4] A. S. Lewis and G. Knowles, “VLSI Architectures for 2-D Daubechies Wavelet Transform without Multipliers”, Electronics Letters, vol. 27, pp. 171-173, Jan. 1991.
[5] K. K. Parhi and T. Nishitani, “VLSI Architecture for Discrete Wavelet Transform”, IEEE Trans. VLSI Systems, vol. 1, pp. 191-202, June 1993.
[6] M. Vishwanath, R. M. Owens, and M. J. Irwin, “VLSI Architecture for the Discrete Wavelet Transform”, IEEE Trans. Circuits and Systems, vol. 42, pp. 305-316, May 1996.
[7] C. Chakrabarti and M. Vishwanath, “Architectures for Wavelet Transforms: A Survey”, Journal of VLSI Signal Processing, Kluwer, vol. 10, pp. 225-236, 1995.
[8] David S. Taubman and Michael W. Marcellin, “JPEG 2000 – Image Compression, Fundamentals, Standards and Practice”, Kluwer Academic Publishers, Second printing, 2002.
[9] Charilaos Christopoulos, Athanassios Skodras, and Touradj Ebrahimi, “The JPEG2000 Still Image Coding System - An Overview”, IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 1103-1127, November 2000.
[10] Majid Rabbani and Rajan Joshi, “An Overview of the JPEG2000 Still Image Compression Standard”, Signal Processing: Image Communication, vol. 17, pp. 3-48, 2002.
[11] K. Seth and S. Srinivasan, “VLSI Implementation of 2D DWT/IDWT Cores using 9/7-tap filter banks based on Non-Expansive Symmetric Extension Scheme”, Proceedings of ASP-DAC/VLSI Design 2002, Bangalore, India, 7-11 January 2002.
[12] J. Fridman and E. S. Manolakos, “Discrete Wavelet Transform: Data Dependence Analysis of Distributed Memory and Control Array Architectures”, IEEE Transactions on Signal Processing, vol. 45, pp. 1291-1308, May 1997.