A 1.2 mW video-rate 2D color subband decoder

1510 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 30, NO. 12, DECEMBER 1995

A 1.2 mW Video-Rate 2-D Color Subband Decoder

Benjamin M. Gordon and Teresa H. Meng, Senior Member, IEEE

Abstract: A 2-D subband decoder has been designed for real-time video decompression in low-power applications. The chip is capable of reconstructing four levels of hierarchical subband structures for images up to 256 pixels wide. At maximum performance, the chip decodes 60 Mpixels per second of color RGB video output with a 5 V supply and a 120 MHz clock frequency. For portable applications with a resolution of 176 pixels wide, 240 lines, and 30 frames/s color video, the chip dissipates less than 1.2 mW with a 1 V supply. Besides the reduced supply voltage, the low power consumption is achieved through algorithmic innovations for energy conservation. Higher-resolution images can be decoded by cascading multiple chips in parallel, each operating on a 256-pixel-wide image slice.

Fig. 1. 2-D subband filtering structure.

I. INTRODUCTION

TODAY, a growing number of computer systems are incorporating multimedia capabilities for displaying and manipulating video data. This interest in multimedia, combined with the popularity of portable devices, provides the impetus for designing a real-time, low-power video decoder. The single-chip subband decoder described in this paper implements subband decoding used in many video compression algorithms [1] with very low power consumption. Previously designed subband decoders were not optimized for low-power operation and additionally required large external memory support, which consumes even more power [2]-[6]. The single-chip decoder described in this paper does not require any external hardware support such as off-chip memory or video control for the delivery of real-time video signals to a color display. Also, this chip, unlike other designs, supports both 2-D data and multiple levels of subband decomposition.

A. Subband Coding

Subband coding is a data transformation used in video compression to compact image information into smaller bands through filtering and decimation. In our 2-D subband decomposition scheme, the image passes through both low- and high-pass filters, vertically and then horizontally, creating four subbands: low-pass vertical/low-pass horizontal (LL), low-pass vertical/high-pass horizontal (LH), high-pass vertical/low-pass horizontal (HL), and high-pass vertical/high-pass horizontal (HH), as shown in Fig. 1. After the filtering, the bands are decimated by two in each direction, as shown by the down arrows in Fig. 1. In decode, the data is upsampled (illustrated by the up arrows), filtered, and combined.

Manuscript received May 15, 1995; revised September 4, 1995. This work was supported in part by ARPA Contract DAFT63-91-K 0002.

The authors are with the Center for Integrated Systems, Stanford University, Stanford, CA 94305 USA.

IEEE Log Number 9415839.

Because of the filtering, the bands have reduced spatial frequency content, allowing the decimation without loss of information. The filters are specially designed so that aliasing from the decimation is perfectly or nearly perfectly cancelled in the subband reconstruction [7]. Each band is one-fourth the size of the original image but contains different spatial frequencies. The LL band appears as a subsampled version of the original while the other bands hold the high-frequency details, as illustrated in Fig. 2. Since most of the image information is contained within the LL band, the subband filtering process is reapplied to this band, creating another level of four subbands. This continues through four levels of filtering for the luminance (Y) data, resulting in 13 subbands as shown in Fig. 3. The chrominance components (U and V), which are one-fourth the size of the luminance data, are filtered through three levels, forming 10 bands each. Multilevel subband decomposition is critical for achieving higher compression efficiency by compacting the majority of image information into smaller and smaller bands.
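The band counts quoted above follow from the fact that each additional level replaces one LL band with four bands, a net gain of three. A minimal sketch of that arithmetic is given here; the function name `subband_count` is an illustrative assumption, not something defined in the paper.

```python
def subband_count(levels: int) -> int:
    """Number of subbands after `levels` rounds of 2-D decomposition."""
    bands = 1  # the undecomposed image counts as a single band
    for _ in range(levels):
        bands += 3  # one LL band is replaced by LL, LH, HL, and HH
    return bands


luma_bands = subband_count(4)    # luminance: four levels
chroma_bands = subband_count(3)  # each chrominance component: three levels
print(luma_bands, chroma_bands)
```

The two values agree with the 13 luminance bands and 10 bands per chrominance component stated in the text.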

The decoding process recursively reconstructs LL bands from their respective lower-level bands. For a single level of decoding, the LL is horizontally upsampled and low-pass filtered while the LH data is upsampled and high-pass filtered. These filter results are then added to form the low-pass vertical data (L). The same process is applied to the HL and HH data to form the high-pass vertical data (H). This L and H data are then upsampled and filtered in the vertical direction and added for reconstruction of that level. When all the levels have been reconstructed, the original image is produced. The designed subband decoder chip implements this decoding process and supports the multiple levels of 2-D color subband decomposition.
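The single-level reconstruction described above (horizontal upsample, filter, and add, followed by the same steps vertically) can be sketched as follows. This is only an illustration of the data flow: it uses a two-tap Haar filter pair as a stand-in, not the chip's filter, and all function names are assumptions made for this sketch.

```python
def upsample2(x):
    """Insert a zero after every sample (upsampling by two)."""
    out = []
    for v in x:
        out.extend([v, 0])
    return out


def fir(x, taps):
    """Causal FIR filter; the output is truncated to len(x)."""
    return [sum(t * x[n - k] for k, t in enumerate(taps) if n - k >= 0)
            for n in range(len(x))]


def synth_1d(low, high, g_low, g_high):
    """Upsample and filter both bands, then add the two results."""
    lo = fir(upsample2(low), g_low)
    hi = fir(upsample2(high), g_high)
    return [a + b for a, b in zip(lo, hi)]


# Stand-in synthesis filters (Haar). These are NOT the chip's coefficients.
G_LOW, G_HIGH = (1, 1), (1, -1)


def reconstruct_level(LL, LH, HL, HH):
    """One level of 2-D reconstruction: horizontal pass, then vertical pass."""
    # Horizontal: LL + LH give the L rows, HL + HH give the H rows.
    L = [synth_1d(a, b, G_LOW, G_HIGH) for a, b in zip(LL, LH)]
    H = [synth_1d(a, b, G_LOW, G_HIGH) for a, b in zip(HL, HH)]
    # Vertical: combine the L and H data column by column.
    width = len(L[0])
    cols = [synth_1d([row[c] for row in L], [row[c] for row in H],
                     G_LOW, G_HIGH)
            for c in range(width)]
    # Transpose the reconstructed columns back into rows.
    return [[cols[c][r] for c in range(width)] for r in range(len(cols[0]))]


# Bands of the 2x2 image [[1, 2], [3, 4]] under a Haar analysis that halves
# sums and differences (an assumption matching the stand-in filters above).
image = reconstruct_level(LL=[[2.5]], LH=[[-0.5]], HL=[[-1.0]], HH=[[0.0]])
print(image)
```

Applying `reconstruct_level` recursively, with each result serving as the LL band of the next level, mirrors the multilevel decoding described in the text.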

0018-9200/95$04.00 © 1995 IEEE

B. Low Power Design

Designing for low power first requires an understanding of the major sources of power dissipation. An equation to estimate the power consumption in CMOS circuits is given below

$$P = C_{\text{eff}} V^2 f \tag{1}$$

where $C_{\text{eff}}$ is the effective switching capacitance, $V$ is the supply voltage, and $f$ is the frequency of switching [8]. This equation unfortunately does not provide any insight into which sources of power dissipation are the largest or how to reduce them. Alternatively, the power consumption can be represented as the sum, over all operation types, of the operation throughput multiplied by the energy per operation

$$P = \sum_{\text{ops}} \frac{\text{ops}}{\text{sec}} \times \frac{\text{energy}}{\text{op}} \tag{2}$$

From a power perspective, the most power-consuming operations are memory access, computation, and I/O, while power for control circuitry is usually negligible. The algorithm and architecture determine the throughputs of the various operations and, with careful reformulation and design, major reductions can be made in the dominant power-consuming operations. The key, however, is in choosing reductions which do not compromise the performance of the overall system.

The energy per operation from the power equation is determined by technology, circuits, and supply voltage. The technology available is typically fixed for the designer. Instead, circuit techniques can provide modest savings, especially for memory or analog circuits. Supply scaling provides the largest reductions in operation energy due to the squared dependency.

Fig. 2. Subband decomposed image.

Fig. 3. Luminance 4-level band decomposition (axes: horizontal (x) frequency and vertical (y) frequency; the dc band is marked).

Fig. 4. Datapath architecture.

Fig. 5. Filter implementation. X: LL, LH, HL, or HH; Y: L or H; R: register (master/slave latch); L: latch; i: horizontal index; <<: left shift.

II. ARCHITECTURE

This section describes the architecture of the decoder chip and the techniques used to achieve power savings. First, the energy per operation is reduced by using a lower supply voltage and providing enough performance to compensate for the increased gate delay. Next, the computation operation throughput is minimized by careful selection of the subband filter. Finally, data ordering and representation limit the I/O required as well as the amount of memory and the number of accesses. In addition, with a systemwide view of power consumption, low-power implementations of numerous external support functions were incorporated into the decoder chip, including a video controller, color converter, and output frame buffer.

Fig. 4 illustrates the overall chip architecture. A small input buffer stores the subband data and passes it to the horizontal filter unit. The line delay memory stores the horizontal filter outputs and sends them to the vertical filter, which is followed by the scale unit. Lower-level results go to the intermediate result memory, from which they are passed back to the input buffer for reconstruction of the next level. Top-level subband results are stored in the final result memory buffer before conversion from YUV to RGB color space. The RGB results are sent off chip to a digital-to-analog converter (DAC) and then to the display.

A. Supply Scaling

The design exploits the natural parallelism of the subband algorithm to achieve high peak performance, providing excess throughput that can be traded off for lower power by reducing the supply voltage [9]. Lowering the supply to 1.5 V from 5 V reduces the power by 12 times for a constant processing throughput. The gate delay increases by seven times, for which the excess performance compensates. The design uses concurrent high- and low-pass horizontal and vertical filters which are efficiently pipelined to allow for a faster clock rate. All of the circuit designs, except for the memory, use static CMOS because of their ability to function correctly at low voltages with minimal leakage current.
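The effect of supply scaling can be checked against the simple switching-power model of (1). The sketch below is a back-of-the-envelope calculation under that model only; the helper name `dynamic_power` is an assumption, and the operating points (1.2 W at 5 V and 120 MHz, and the 1 V, 3.2 MHz portable target) are the figures reported in the performance section.

```python
def dynamic_power(c_eff: float, v: float, f: float) -> float:
    """P = C_eff * V^2 * f, the CMOS switching-power estimate of (1)."""
    return c_eff * v ** 2 * f


# Power ratio at fixed capacitance and frequency when moving from 5 V to 1.5 V.
ratio = dynamic_power(1.0, 5.0, 1.0) / dynamic_power(1.0, 1.5, 1.0)

# Effective capacitance implied by the reported peak point (1.2 W, 5 V, 120 MHz).
c_eff = 1.2 / (5.0 ** 2 * 120e6)

# Ideal projection of that capacitance to the portable point (1 V, 3.2 MHz).
p_portable_mw = dynamic_power(c_eff, 1.0, 3.2e6) * 1e3

print(round(ratio, 1), round(p_portable_mw, 2))
```

The pure $V^2$ ratio comes out near 11, close to the factor of 12 quoted above, and the projected portable power is about 1.28 mW, the same order as the measured figure of under 1.2 mW. The model ignores everything except switching power, so exact agreement is not expected.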

B. Filter Implementation

A major consideration in subband coding is the choice of filter. This determines the amount of computation required and the number of line delays needed, as well as affecting the algorithm's compression performance. Extensive simulations were performed to select the filter, resulting in a short 4-tap asymmetric wavelet filter (3, 6, 2, -1) as the low-pass kernel filter [10]. This filter performs comparably to a 9-tap quadrature mirror filter (QMF) [11] but uses only one third the power and greatly reduces the line delay memory requirements.

The filter implementation, shown in Fig. 5, uses shifts and adds to implement multiplications by the simple filter coefficients, requiring only a single 3-2 adder. Only two values are needed when applied to the four-tap filter since the upsampling fills in two zeros between the data values. A small input buffer holds two horizontally consecutive values from each of the four band types. These two inputs generate two output values, one from the even filter taps and the other from the odd filter taps, as shown in the following equation:

$$\begin{aligned} Y_i &= 6X_i - X_{i+1} \\ Y_{i+1} &= 3X_i + 2X_{i+1} \end{aligned} \tag{3}$$

where $X_i$ and $X_{i+1}$ are the two consecutive values and $Y_i$ and $Y_{i+1}$ are the even and odd filter outputs.
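The shift-and-add form of (3) can be sketched in software as follows. This only illustrates how the coefficients 6, 3, and 2 decompose into shifts; it is not a model of the chip's 3-2 adder, and the function name `filter_pair` is an assumption.

```python
def filter_pair(x_i: int, x_next: int) -> tuple[int, int]:
    """Return the (even, odd) filter outputs of (3) for two consecutive inputs."""
    six_x = (x_i << 2) + (x_i << 1)  # 6*x = 4*x + 2*x
    three_x = (x_i << 1) + x_i       # 3*x = 2*x + x
    two_next = x_next << 1           # 2*x_next
    even = six_x - x_next            # taps 6 and -1
    odd = three_x + two_next         # taps 3 and 2
    return even, odd


print(filter_pair(5, 7))
```

For a constant input both outputs equal five times the input, since 6 - 1 = 3 + 2 = 5.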

The same hardware also implements the high-pass filter by reversing the coefficients and negating the (6, -1) coefficient pair. The reversal is accomplished by switching the order of the two input values. The negation occurs when the low-pass and high-pass values, still in carry-save form, are combined with a 4-2 adder and then a carry-select adder. The input data alternates between the four band types while moving horizontally across a line, as shown in Fig. 6(a). The LL and LH data are combined to form two values of L data while the HL and HH data form the H data, as shown in Fig. 6(b). An incrementer rounds these results back to the input precision of 10 b. This achieves the same signal-to-quantization-noise ratio as truncation to 12 b but saves memory space and power. The rounded results are stored in the line delay memory and passed to the vertical filter.

Fig. 6. Subband input data flow. (a) Input data. (b) Processing.

The vertical filter operates nearly identically to the horizontal filter but with one input from the line delay memory and the other from the output of the horizontal filter, forming two vertically oriented values. The filter input alternates between L and H data, requiring a reordering of the horizontal filter results, which consist of two L values followed by two H values at every other cycle. The L and H vertical filter results are added and passed through another shift-and-add operation for unity gain normalization. A single 3-2 adder and a carry-select adder approximate the required 1/25 multiplication with 41/1024. Left shifts of 5, 3, and 0 added together implement the 41, while a right shift of 10 performs the division by 1024. These results are then rounded to 10 b and stored in the intermediate result memory. If the data forms the top-level reconstruction, then the results are rounded to 8 b and stored in the final result memory.
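The 41/1024 approximation of 1/25 described above can be written as a few shifts and adds. The following is a minimal sketch of the arithmetic only: it truncates rather than rounds, and the function name `scale_unity_gain` is an assumption.

```python
def scale_unity_gain(v: int) -> int:
    """Approximate v / 25 as (41 * v) >> 10 using shifts and adds only."""
    times_41 = (v << 5) + (v << 3) + v  # 32v + 8v + v = 41v
    return times_41 >> 10               # divide by 1024, truncating


approx_factor = 41 / 1024  # compare with the exact 1 / 25
print(scale_unity_gain(2500), approx_factor)
```

The approximate factor 41/1024 is about 0.04004, within 0.1% of the exact 0.04.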

A final advantage of using the short filter manifests in the increased quality at the edge of the image. At the horizontal edge, the input data must circularly wrap around for correct operation. For longer filters this is impractical and approximations are frequently used. However, with only four taps and the upsampling, the filter does not extend over the beginning of the line and extends only one value over the end. Thus the chip saves the first value from each line and reuses it at the end, improving image quality at the edges over longer filter designs. This introduces a pipeline bubble which is easily handled by the control logic. The extra cycle is not wasted since the controller uses it for a complicated state transition calculation required at the end of a line.
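The end-of-line wraparound can be illustrated by applying the pairwise filter of (3) along a whole line and reusing the saved first value for the last pair. This is a software sketch under the assumption that each input position forms a pair with its right-hand neighbor; the function name `filter_line` is hypothetical.

```python
def filter_line(x):
    """Apply the pairwise filter of (3) along a line, wrapping at the end."""
    first = x[0]  # saved when the line starts
    out = []
    for i, cur in enumerate(x):
        # The last position has no right-hand neighbor, so reuse the first value.
        nxt = x[i + 1] if i + 1 < len(x) else first
        out.append(6 * cur - nxt)      # even-tap output
        out.append(3 * cur + 2 * nxt)  # odd-tap output
    return out


print(filter_line([1, 2, 3]))
```

Each input value yields two outputs, so the output line is twice as long as the input line.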

C. Memory

The size of the line delay, intermediate, and final result memories is critical for achieving a low-power implementation. First, the power consumed in the memory increases with its size. Second, the size of these memory units determines whether they can be kept on chip, greatly reducing the I/O power. For instance, holding the intermediate reconstructed level results on chip eliminates the need to write these values out only to be read in later. Also, a power-expensive full frame buffer is eliminated by generating the output data in raster scan order and storing it in the final result buffer. The data can therefore be passed directly to the display device through a DAC. As previously mentioned, the short filter length reduces the number of lines of subband data required for the vertical filtering and, by using the values just created by the horizontal filter, only a single delay line is required.

The ordering of subband level processing also affects the amount of memory needed. Processing one line at the lowest level of the subband hierarchy generates two lines of 1-D results (one line of L, one line of H) and two lines of intermediate results, as seen in Fig. 6(b). If both of the intermediate result lines are processed, this results in four lines at the next level, followed by eight and finally 16 lines at the top level, as seen in Fig. 7(a). In our implementation, the order of levels is interleaved such that processing of the next line from a given level only occurs when both lines at the higher level have already been processed. This allows memory to be reused when the next results are generated, as shown in Fig. 7(b). The actual sequence of levels is shown below, with zero representing the lowest frequency level and three representing the highest level.

Interleaved level sequence:

0, 1, 2, 3, 3, 2, 3, 3, 1, 2, 3, 3, 2, 3, 3

Thus, only two lines of line delay and two lines of 2-D store are required per level. Additionally, the chip decodes two chrominance bands along with the luminance data. The chrominance bands are also interleaved and inserted into the luminance band stream so that the data sets maintain synchronization.
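The interleaved ordering amounts to a depth-first traversal: each processed line produces two lines at the next level up, and both are consumed before the current level advances. The sketch below is one way to generate such an ordering in software; it is not a description of the chip's state machine, and the function name `level_sequence` is an assumption.

```python
def level_sequence(level: int = 0, top: int = 3) -> list[int]:
    """Depth-first order: each line at `level` yields two lines one level up."""
    seq = [level]
    if level < top:
        seq += level_sequence(level + 1, top)  # first resulting line
        seq += level_sequence(level + 1, top)  # second resulting line
    return seq


sequence = level_sequence()
print(sequence)
```

With four levels (0 through 3), one line at the lowest level expands into 15 processing steps, eight of them at the top level.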

The total amount of memory for the line delays is 20 Kb organized as eight banks of 128 high by two words of 10 b. The intermediate-store memory, totalling 10 Kb (128 x 4 x 20), holds two lines of subband coefficients for each level and then feeds them back to the input buffer for processing of the next level. The final result buffer was increased to four lines to prevent buffer underflow and holds 20 Kb (128 x 8 x 20).
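The quoted memory sizes can be cross-checked from the stated organizations. The short calculation below assumes that 1 Kb means 1024 bits, which is what the quoted totals imply; the variable names are illustrative.

```python
KB = 1024  # bits per Kb, as the quoted totals imply

line_delay_bits = 8 * 128 * 2 * 10  # eight banks, 128 rows, two 10-b words
intermediate_bits = 128 * 4 * 20    # as quoted: 128 x 4 x 20
final_result_bits = 128 * 8 * 20    # as quoted: 128 x 8 x 20

total_kb = (line_delay_bits + intermediate_bits + final_result_bits) / KB
print(line_delay_bits / KB, intermediate_bits / KB, final_result_bits / KB, total_kb)
```

The three memories come to 20 Kb, 10 Kb, and 20 Kb, or 50 Kb of on-chip storage in total.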

The memory uses a six transistor cell with a cross coupled sense amplifier capable of operation with a supply voltage as low as 1 V. In the first clock phase, a word line is selected and a differential voltage forms on the bit lines. In the second phase, the sense amplifier evaluates, and the bit lines are precharged high. The sense amplifier result is passed to a push-pull, tri-state driver which puts the value onto the output bus. A completion signal turns the sense amplifier off and connects it back to the bit-lines for precharging. A local word line activates only the desired memory bank, saving the unnecessary power of activating other banks [12]. Fig. 8 contains a simplified schematic of the bit column circuitry.

Fig. 8. Memory circuitry.

The intermediate and final result memories also perform a transpose of the data from the vertical filter since the two values generated in a cycle are oriented vertically but need to be accessed horizontally. Two vertically filtered outputs are combined before each horizontal pair is written to its own bank. The memory also has separate read and write buses, eliminating the problem of disabling the read output before the write data is activated. Furthermore, with dedicated buses, as opposed to shared buses, only the necessary capacitance is switched, reducing the power consumption.

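D. Color Conversion

The chip includes the YUV-to-RGB color conversion operation because of its potentially high power dissipation. The conversion algorithm is greatly simplified but with no visual degradation in image quality [13]. The original matrix coefficients, shown in (4), are rounded to only 2 b of precision, shown in (5), reducing the entire conversion to only five full adds and one carry-save add per RGB output

$$\begin{bmatrix} \text{Red} \\ \text{Green} \\ \text{Blue} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1.402 \\ 1 & -0.34414 & -0.71414 \\ 1 & 1.772 & 0 \end{bmatrix} \begin{bmatrix} Y \\ U \\ V \end{bmatrix} \tag{4}$$

$$\begin{bmatrix} \text{Red} \\ \text{Green} \\ \text{Blue} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \text{?} \\ 1 & -0.375 & \text{?} \\ 1 & 1.5 & 0 \end{bmatrix} \begin{bmatrix} Y \\ U \\ V \end{bmatrix} \tag{5}$$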

A video timing controller included in the chip regulates reading the YUV data from the final result buffer and the generation of the RGB outputs. Programmable timing parameters for vertical and horizontal sync and blanking intervals specify the video synchronization signal needed by the display device. When the final result buffer is full, the chip executes power-down stall cycles, using gated clocks, until buffer space is available. Stalling occurs during vertical and horizontal blanking periods since no data is read from the buffer. Since the data is buffered and formed in raster scan order, the RGB output can be sent directly through a DAC to the display device without the need for a high-power frame buffer.

E. Zero Processing

When processing the highest frequency bands, the design takes advantage of the large percentage of zeros in these bands. The input data for the top-level luminance LH and HL bands is zero run-length encoded. The top-level luminance HH as well as all the top-level bands for the chrominance are set to zero. This reduces the number of external inputs by almost a factor of 4, lowering the overall system power by reducing the power dissipated in the external input source. The control logic detects when zero values are present and skips their horizontal processing. The controller uses the run lengths to determine when a value is zero and stores this information in a shift register. Processing is skipped when two consecutive values of the HL band are zero. When this occurs, the LL and LH data are processed as usual, but the HL and HH data are skipped, moving to the next LL and LH data. The vertical processing uses the L data but skips the H processing, using the L filtered result as the final result. To allow skipping of the vertical H cycle, the two HL values from the previous line must also be zero, as determined by the shift register. With this process skipping, the average number of clock cycles required per output pixel, including generation of all intermediate level and final image data for all three components, is reduced from 1.98 to 1.23, resulting in a 15% reduction of total chip power by increasing the number of stall cycles. Because the processing of multiple levels does not generate an even flow of output results, the chip, operating at twice the pixel rate, may not meet the original 1.98 cycles/pixel requirement. Therefore, the reduction in processing cycles is also important to ensure that output buffer underflow cannot occur.
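The skip conditions described above can be summarized in a few lines of software. This is a minimal sketch assuming that the run lengths have already been expanded into one zero flag per HL value; the function name `plan_skips` and the flag-list representation are assumptions, not the chip's shift-register implementation.

```python
def plan_skips(hl_zero_prev, hl_zero_cur):
    """Decide, per pair of consecutive HL values, which work can be skipped.

    Each argument is a list of booleans (True means the HL value is zero),
    one list for the previous line and one for the current line.
    Returns a list of (skip_horizontal, skip_vertical_h) tuples.
    """
    decisions = []
    for i in range(0, len(hl_zero_cur) - 1, 2):
        # Horizontal HL/HH processing is skipped when both HL values are zero.
        skip_h = hl_zero_cur[i] and hl_zero_cur[i + 1]
        # The vertical H cycle also needs the previous line's pair to be zero.
        skip_v = skip_h and hl_zero_prev[i] and hl_zero_prev[i + 1]
        decisions.append((skip_h, skip_v))
    return decisions


prev_line = [True, True, False, True]
cur_line = [True, True, True, True]
decisions = plan_skips(prev_line, cur_line)
print(decisions)
```

In this example both pairs skip the horizontal HL/HH processing, but only the first pair can also skip the vertical H cycle, because the previous line holds a nonzero HL value under the second pair.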

F. Control Flow

Fig. 9 contains a simplified diagram of the controller's functionality, which has been greatly increased to implement all of the datapath and memory optimizations for energy conservation. The run-length section keeps track of the zero data positions and determines when processing is required. The skip processing further complicates the pipeline control since two different operation sequences are now possible. One sequence alternates between all four band types, while the other alternates only between the LL and LH types. The current state must be passed down the pipeline to ensure the correct operation of the computational and memory units.

Fig. 9. Control flow.

The main state machine keeps track of the band type, current position within a line, the color component, and the current level in the interleaving scheme. Additionally, it generates addresses for the memories. The video controller determines the display timing and controls reading from the final result buffer. This read address is compared to the state controller's final result buffer write address to determine whether the buffer is full and the chip must be stalled.

III. PERFORMANCE

The peak performance at 5 V generates 60 Mpixels/s of RGB components with a 120 MHz operating frequency while dissipating 1.2 W. For a target rate of 3.2 MHz for use with a small portable display, the chip provides significant excess throughput, allowing it to meet the requirements with a 1 V supply while dissipating under 1.2 mW. Fig. 10 illustrates the power dissipation at the maximum operating frequency for various supply voltages. Fig. 11 illustrates the variation in energy and delay as a function of the supply voltage. The energy curve grows at approximately the voltage squared while the delay curve displays the expected $1/(V_{dd} - V_{th})$ behavior.

Fig. 10. Frequency versus power.

Fig. 11. Energy and delay versus voltage.

Table I displays the breakdown of total power consumption among the different sections of the chip at the 3.2 MHz rate. As expected, the computation, I/O, and memory dominate the power, each consuming approximately one third of the total power. Power usage in the control section remains a small percentage despite the increased complexity required to implement all of the power saving strategies. The estimated power numbers were derived by multiplying the operation throughputs by simulation-based energy per operation numbers. These estimates were used in making high-level algorithmic design decisions and are very close to the measured numbers, indicating the validity of this design approach.

TABLE I
POWER BREAKDOWN FOR 1 V, 3.2 MHz

Operation                  | Mops/s | Energy/op (pJ) | Estimated (mW) | Measured (mW) | % of Total
External access (16 bits)  | 2.7    | 80             | 0.22           | 0.38          | 31
Total datapath             |        |                | 0.35           | 0.34          | 29
  Add (16 bits)            | 17.7   | 7.0            | 0.12           |               |
  3-2 add (16 bits)        | 25     | 2.0            | 0.05           |               |
  Latching (16 bits)       | 100    | 1.8            | 0.18           |               |
Internal memory            |        |                | 0.26           | 0.39          | 33
  Internal read (16 bits)  | 2.4    | 36             | 0.09           |               |
  Internal write (16 bits) | 2.4    | 71             | 0.17           |               |
Control                    |        |                | 0.13           | 0.09          | 7
Total                      |        |                | 1.0            | 1.2           | 100
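The estimation method described above, operation throughput multiplied by energy per operation and summed as in (2), can be reproduced with a short calculation. The sketch below uses the throughput and energy entries as read from Table I; treat the exact values as assumptions of this sketch, and note that the control contribution is entered as a lump figure because the table lists only its total.

```python
# (operation, throughput in Mops/s, energy in pJ per operation)
OPS = [
    ("external access", 2.7, 80.0),
    ("add", 17.7, 7.0),
    ("3-2 add", 25.0, 2.0),
    ("latching", 100.0, 1.8),
    ("internal read", 2.4, 36.0),
    ("internal write", 2.4, 71.0),
]
CONTROL_MW = 0.13  # control estimate, listed only as a total


def estimated_power_mw(ops, control_mw=0.0):
    """Sum of throughput times energy per operation, in milliwatts."""
    # Mops/s * pJ gives microwatts; divide by 1000 for milliwatts.
    return sum(rate * energy for _, rate, energy in ops) / 1000.0 + control_mw


total_mw = estimated_power_mw(OPS, CONTROL_MW)
print(round(total_mw, 2))
```

The sum comes to roughly 0.96 mW, in line with the estimated total of about 1.0 mW and somewhat below the measured 1.2 mW.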

For higher-resolution images, multiple chips would be cascaded, each operating on a slice at most 256 pixels wide, producing a final image without boundary artifacts. Table II illustrates the power dissipation at the required clock frequency when used for decompressing high-resolution images. The operating voltages are determined by the real-time computation requirements. This additional chip-level parallelism keeps the operating frequency and thus the supply voltage low, resulting in extremely low power dissipation even for HDTV applications.

TABLE II
HIGH RESOLUTION CONFIGURATIONS

Format        | Size            | Chips | Frequency in MHz (Supply) | Power (mW)
Hires Monitor | 1024 x 768 x 75 | 4     | 37 (2.0 V)                | 192

Fig. 12 is a chip micrograph of the subband decoder, which contains 415 000 transistors in a 9.5 x 8.7 mm² area implemented in a 0.8 µm, three-level metal CMOS technology with a threshold voltage of 0.7 V. The three memory blocks dominate the area, emphasizing the importance of minimizing their size.

Fig. 12. Chip micrograph.

IV. CONCLUSIONS

High-level algorithmic and architectural design leads to the largest power savings by evaluating tradeoffs between algorithm performance and flexibility versus power. These decisions can be effectively estimated by their resulting operation throughputs and energy per operation. Minimizing the memory and I/O operations through architectural and data flow transformations is crucial. Further, by integrating otherwise high-power functionality into a single design, the overall system power can be managed. Finally, providing excess throughput allows the design to operate at lower voltages with the associated savings in power. The circuits must be able to function correctly at these lower voltages while the design compensates for the increased gate delay. These modifications can result in increased control complexity, but since the control power is only a small fraction of the total power consumption, the impact on total power dissipation is negligible. These ideas have been demonstrated with the described 2-D, multilevel, color subband decoder achieving extremely low levels of power consumption for portable applications.

ACKNOWLEDGMENT

The authors would like to thank E. Tsern, N. Chaddha, B. Amrutur, and R. Gonzalez for their help on the design and implementation.

REFERENCES

[1] J. Woods, Ed., Subband Image Coding. Boston: Kluwer, 1991.

[2] M. Winzker et al., “VLSI chip set for 2D HDTV subband filtering with on-chip line memories,” IEEE J. Solid-State Circuits, vol. 28, pp. 1354-61, Dec. 1993.

[3] G. Van Der Wal and P. Burt, “A VLSI pyramid chip for multiresolution image analysis,” Int. J. Comput. Vision, pp. 177-189, Sept. 1992.

[4] M. Vishwanath and C. Chakrabarti, “A VLSI architecture for real-time hierarchical encoding/decoding of video using the wavelet transform,” in Proc. ICASSP 1994, Apr. 1994, vol. 2, pp. 401-404.

[5] A. Lewis and G. Knowles, “VLSI architecture for 2-D Daubechies wavelet transform without multipliers,” Electron. Lett., pp. 171-173, Jan. 1991.

[6] J. Kowalczuk et al., “A VLSI filter architecture for digital HDTV codecs,” in 1992 IEEE Int. Symp. Circuits Syst., May 1992, vol. 3, pp. 1077-1080.

[7] P. Vaidyanathan, Multirate Systems and Filter Banks. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[8] A. P. Chandrakasan et al., “Low-power CMOS digital design,” IEEE J. Solid-State Circuits, vol. 27, pp. 473-484, Apr. 1992.

[9] M. Horowitz, T. Indermaur, and R. Gonzalez, “Low-power digital design,” in 1994 IEEE Symp. Low Power Electron., Oct. 1994, vol. 1, pp. 8-11.

[10] A. Akansu, “Multiplierless suboptimal PR-QMF design,” SPIE, vol. 1818, pp. 723-734, Nov. 1992.

[11] T. Senoo and B. Girod, “Vector quantization for entropy coding of image subbands,” IEEE Trans. Image Processing, vol. 1, pp. 526-533, Oct. 1992.

[12] B. Amrutur and M. Horowitz, “Techniques to reduce power in fast wide memories,” in Proc. 1994 Symp. Low-Power Electron., Oct. 1994, pp. 92-93.

[13] B. Gordon, N. Chaddha, and T. Meng, “A low power multiplier-less YUV to RGB converter based on human vision perception,” in 1994 IEEE Workshop VLSI Signal Processing, Oct. 1994, pp. 408-417.

Benjamin M. Gordon was born in State College, PA, in 1965. He received the B.S. degree in electrical engineering from MIT, Cambridge, MA, in 1987 and the M.S. degree from Stanford University, Stanford, CA, in 1992.

From 1987 to 1991 he worked as a systems engineer for Advanced Processing Labs., Inc., San Diego, CA, on real-time signal processing systems. He is currently finishing the Ph.D. degree at Stanford University, where he is working on low-power video compression for portable applications.

Teresa H. Meng, for a photograph and biography, see this issue, p. 1301.
