SANTA CLARA UNIVERSITY Department of Computer Engineering
Date: January 26, 2004
I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY
Nien-Tsu Wang
ENTITLED
Processing and Storage Models for MPEG-2 Main Level and High Level Video Decoding
— A Block-Level Pipeline Approach
BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE
OF
DOCTOR OF PHILOSOPHY IN COMPUTER ENGINEERING
Thesis Advisor
Thesis Reader
Thesis Reader
Thesis Reader
Thesis Reader
Chairman of Department
Processing and Storage Models for MPEG-2
Main Level and High Level Video Decoding
— A Block-Level Pipeline Approach
By
Nien-Tsu Wang
DISSERTATION
Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
in Computer Engineering in the School of Engineering of Santa Clara University, 2004
Santa Clara, California
Dedicated to
my mother Mei-Ying and father Hsieh-Chung, and
my wife Mei-Chuan, and children Terrance and Angelica
for their love and care
Acknowledgements
I would first like to thank Professor Nam Ling for serving as my advisor
during my time at SCU. His total support of my project and countless
contributions to my technical and professional development made for a truly
enjoyable and fruitful experience.
To Professors Silvia Figueira, Tokunbo Ogunfunmi, Weijia Shang, and Shoba
Krishnan for serving on my Ph.D. committee. Their detailed and illuminating
comments strengthened this dissertation considerably and widened my
research knowledge.
To all past and current graduate students, and Mrs. Duan-Juat Ho in our
research group, for their valuable interaction and technical discussion.
To my wife Mei and family, for their unconditional support and everlasting
love.
To Medianix Semiconductor Inc. and NJR Corporation, for their financial
support of my project.
TABLE OF CONTENTS
Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1 Introduction
  1.1 Introduction
  1.2 Overview of the Dissertation
  1.3 Terminology of MPEG
  1.4 Overview of the MPEG-2 Video Decoding Process
  1.5 The Design of an MPEG-2 Video Decoder
  1.6 Research Objectives
Chapter 2 Processing and Storage Models for MPEG-2 MP@ML Video Decoding — Review of Prior Art
  2.1 Introduction
  2.2 Review of Related Work
    2.2.1 Processing Model
    2.2.2 Memory Storage Organization and Interface
    2.2.3 External Memory Access Scheduling
    2.2.4 Variable-Length Decoder (VLD)
    2.2.5 Inverse Discrete Cosine Transform (IDCT)
    2.2.6 Motion Compensator (MC)
  2.3 Motivations and Challenge
  2.4 Research Direction
Chapter 3 Block-Level Pipeline Scheme for MPEG-2 MP@ML Video Decoding — Processing, Storage, and Scheduling
  3.1 Introduction
  3.2 Designing for Data Transfer Efficiency
  3.3 The BLP Processing Model
    3.3.1 Semantics of the BLP Processing Model
    3.3.2 Comparison with the Macroblock-Level Processing Model
  3.4 Memory Storage Organization
    3.4.1 Data Storing Profile
    3.4.2 Features of SDRAM
    3.4.3 Data Storage Organization in SDRAM
  3.5 External Memory Access Scheduling
    3.5.1 Review of Related Work
    3.5.2 Fixed-Priority Scheduling Model
    3.5.3 The Proposed Bus Scheduling and Internal Buffer Size Reduction
  3.6 Conclusion
Chapter 4 Design of a Video Decoder for DVD: Block-Level Pipeline Scheme Application Example I
  4.1 Introduction
  4.2 Design Procedure
  4.3 Overall Decoding System
  4.4 BLP Controller Mechanism
  4.5 Architectures of Video Processing Units
    4.5.1 Variable-Length Decoder (VLD)
    4.5.2 Inverse Quantization Unit (IQ)
    4.5.3 Inverse Discrete Cosine Transform Unit (IDCT)
    4.5.4 Motion Compensation Unit (MC)
  4.6 Display Model
  4.7 Performance Simulation Model
  4.8 Simulation Results
Chapter 5 Processing and Storage Models for MPEG-2 MP@HL Video Decoding — Review of Prior Art
  5.1 Introduction
  5.2 Overview of the Grand Alliance HDTV System
  5.3 Review of Related Work
    5.3.1 Processing Model
    5.3.2 Memory Storage Organization and Interface
    5.3.3 External Memory Access Scheduling
    5.3.4 Variable-Length Decoder (VLD)
    5.3.5 Inverse Discrete Cosine Transform (IDCT)
    5.3.6 Motion Compensator (MC)
  5.4 Motivations and Challenge
  5.5 Research Direction
Chapter 6 Design of a Video Decoder for HDTV: Block-Level Pipeline Scheme Application Example II
  6.1 Introduction
  6.2 Overview of the Proposed Decoding Approach
  6.3 Overall Decoding System
  6.4 BLP Controller Mechanism
    6.4.1 Overall Controller Scheme
    6.4.2 Memory I/O Scheduling
  6.5 Memory Interface Scheme
  6.6 Architecture of Video Processing Units
    6.6.1 Inverse Discrete Cosine Transform Unit (IDCT)
    6.6.2 Motion Compensation Unit (MC)
  6.7 Performance Simulation Model
  6.8 Simulation Results
  6.9 Conclusion
Chapter 7 Conclusions
  7.1 Additional Applications of BLP
  7.2 Conclusions and Future Research
References
Publications
Biographical Sketch
LIST OF FIGURES
Figure 1.1 Data hierarchy and functionality of MPEG-2 video bitstream
Figure 1.2 Two methods for scanning DCT coefficients are available in MPEG-2
Figure 1.3 Motion compensation interpolation using bi-directional prediction
Figure 1.4 A simplified and high-level functional diagram of the MPEG-2 video decoding process
Figure 1.5 Analyzing phases for MPEG-2 video decoder architecture design
Figure 1.6 Data flow of the Block-Level Pipeline processing scheme
Figure 2.1 Data flow of the macroblock-level pipeline decoding scheme
Figure 2.2 Data flow of the amended macroblock-level pipeline decoding scheme
Figure 2.3 Three typical memory mapping structures for the frame buffer
Figure 2.4 Storage structure of picture data in DRAM
Figure 2.5 State diagram of data bus scheduling for distributed FSM scheme
Figure 2.6 State diagram of data bus scheduling for polling scheme
Figure 2.7 Block diagram of the Lei-Sun VLD architecture
Figure 2.8 Block diagram of the Lee Motion Compensator architecture
Figure 3.1 Generic timing diagram for decoding non-intra macroblocks under the BLP scheme and the proposed bus scheduling scheme
Figure 3.2 Generic timing diagram for decoding non-intra macroblocks under MB-level pipeline scheme and fixed-priority bus scheduling scheme
Figure 3.3 Data bus utilization comparison for the BLP scheme and the amended macroblock-level scheme
Figure 3.4 Comparison of reading cycles for EDO DRAM and SDRAM (source: adapted from Micron Technology)
Figure 3.5 Reference macroblock storage configuration in 64-bit and 32-bit data-word SDRAM and corresponding redundant data overhead
Figure 3.6 VBV buffer data storing configuration and accessing pattern in SDRAM
Figure 3.7 Reference macroblock access for motion compensation
Figure 3.8 Interlaced macroblock-row memory mapping for the frame buffer
Figure 3.9 Reference macroblock access pattern under the interlaced macroblock-row storage structure
Figure 3.10 Specifying the memory addresses of reference blocks
Figure 3.11 Worst-case page-breaks during reference data access under macroblock-level processing
Figure 3.12 Data flow model of the bus and internal buffers for an MPEG-2 video decoder
Figure 3.13 State diagram of the proposed bus scheduling scheme
Figure 3.14 Average number of filling requests for different VLD buffer sizes
Figure 4.1 Proposed design methodology of the MPEG-2 video decoder
Figure 4.2 Data flow block diagram of DVD-Video
Figure 4.3 Block diagram of the proposed DVD video decoder
Figure 4.4 Flow chart of the BLP decoding process for non-intra macroblocks
Figure 4.5 Flow chart of the BLP decoding process for intra macroblocks
Figure 4.6 Block diagram of the Variable-Length Decoder
Figure 4.7 The FSM for VLD processing and error handling
Figure 4.8 Block diagram of the Inverse Quantization Unit
Figure 4.9 Block diagram of the IDCT unit and word lengths for interconnections
Figure 4.10 A novel read-write sequence for transpose RAM in the IDCT unit
Figure 4.11 Output timing diagram for the proposed IDCT unit
Figure 4.12 Outline of motion compensation
Figure 4.13 Block diagram of the Motion Vector Decoder
Figure 4.14 Block diagram of the MC unit
Figure 4.15 Data processing pattern, pipeline stages, and output timing diagrams for MC processing of B- and P-type macroblocks
Figure 4.16 Timing diagram of displaying order, decoding order, and the proposed recovery mechanism
Figure 4.17 Processing diagram of the proposed DVD video decoder performance simulation model
Figure 4.18 Timing diagrams for I-, P-, and B-type macroblocks
Figure 5.1 Transport packet format
Figure 5.2 Structure of a video decoding approach using slice-level scheme
Figure 5.3 Examples of dual memory bus interface and corresponding data storage structure
Figure 5.4 Reordering memory access sequences to avoid page-breaks and latency of read/write switches
Figure 5.5 Codeword-length tree for two-level concurrent decoding (source: [Hsieh96])
Figure 5.6 Architecture diagram for two-level concurrent-decoding VLD (source: [Hsieh96])
Figure 5.7 Block diagram of the Masaki Motion Compensator architecture
Figure 6.1 Basic set-top box architecture for DVB-T digital TV
Figure 6.2 Block diagram of the proposed HDTV video decoder architecture
Figure 6.3 Flow chart of the controller setting the demultiplexor
Figure 6.4 Flow chart of HDTV BLP decoding process for non-intra macroblocks
Figure 6.5 Flow chart of HDTV BLP decoding process for intra macroblocks
Figure 6.6 Block diagram of memory interface scheme
Figure 6.7 Block diagram of IDCT core processor for HDTV video decoder
Figure 6.8 Writing and reading order in the transpose RAM
Figure 6.9 Output timing diagram for the proposed IDCT unit for HDTV decoder
Figure 6.10 Block diagram of the MC unit for the HDTV video decoder
Figure 6.11 Data processing pattern, pipeline stages, and output timing diagram for the MC processing of B- and P-type macroblocks
Figure 6.12 Processing diagram of the proposed HDTV video decoder performance simulation model
Figure 6.13 Average number of filling requests for different VLD buffer sizes (the threshold for VLD buffer refilling is at 15 bytes)
Figure 6.14 Average number of filling requests for different VLD buffer sizes (the threshold for VLD buffer refilling is half the VLD buffer size)
Figure 6.15 Timing diagram for I-, P-, and B-type macroblocks for Women.m2v
Figure 6.16(a) Statistical distributions of macroblock decoding cycles for I-, P-, and B-pictures for women.m2v
Figure 6.16(b) Statistical distributions of macroblock decoding cycles for I-, P-, and B-pictures for flowers.m2v
LIST OF TABLES
Table 1.1 Parameter bounds of video streams for the five MPEG-2 profiles
Table 2.1 Comparison of computational complexity of various IDCT algorithms for an 8x8-point block
Table 3.1 Characteristics of I/O processes on the memory bus
Table 3.2 Procedure for determining the memory address of reference blocks
Table 3.3 Comparison of average page-break occurrence under different reference picture storage structures
Table 3.4 Comparison of different internal buffer sizes under macroblock-level decoding mode and the proposed BLP decoding mode
Table 3.5 Average data amount per macroblock within I-, P-, and B-pictures
Table 4.1 DVD-Video parameters summary and comparisons with MPEG-2 MP@ML
Table 4.2 Sizes of internal buffers adopted for the simulation model for the proposed DVD architecture
Table 4.3 Number of decoding cycles per macroblock and bus utilization under different VLD buffer sizes: Mobile.m2v bitstream at 10 Mbps
Table 4.4 Number of decoding cycles per macroblock and bus utilization under different VLD buffer sizes: Gi_bitstream.m2v at 15 Mbps
Table 4.5 Comparison of the proposed MPEG-2 MP@ML video decoder LSI and other video decoder designs using macroblock-level processing
Table 5.1 GA-HDTV video parameters summary
Table 6.1 Upper bounds for picture resolution and allowable processing time for each macroblock in MPEG-2 MP@ML and GA-HDTV
Table 6.2 Sizes of internal buffers adopted for the simulation model for the proposed HDTV architecture
Table 6.3 Average data amount per macroblock within I-, P-, and B-pictures
Table 6.4 Bus utilization and percentage of MBs exceeding 221 decoding cycles for the two video bitstreams
Processing and Storage Models for MPEG-2 Main Level and High Level Video Decoding
— A Block-Level Pipeline Approach
Nien-Tsu Wang
Department of Computer Engineering
Santa Clara University, 2003
ABSTRACT
A novel MPEG-2 video processing model, termed the Block-Level Pipeline (BLP) processing
scheme, is introduced. Under BLP control, each processing unit in a video decoder not only
processes video data on a block-by-block basis but also accesses the external frame memory for
motion compensation on a block-by-block basis. Because the BLP processing model distributes
data bus traffic evenly in time, the required data bus width and the associated internal buffer
sizes can be minimized. Beyond enabling a compact data bus and internal buffers, the BLP
scheme also simplifies the architecture of each processing unit, since block-by-block operation
relieves its computational load. The BLP design methodology is complete and precise: it takes
the processing model, resource management, and process management into account, and it can
provide valuable estimates of system requirements in the early design stages of MPEG-2 products.
An efficient interlaced frame memory storage organization and a deterministic fairness-priority
bus scheduling scheme are also presented. This simple storage pattern efficiently lowers the
probability of page breaks when accessing the external frame memory; reducing DRAM access
latency is an important issue in a bandwidth-limited system design. Unlike in other real-time
systems, the bus scheduler can be simplified to a deterministic, fairness-priority approach,
because only one block of data is conveyed on the bus at a time. With such short-duration data
transfers, a complicated bus scheduler to prevent starvation is not needed.
Based on this work, two designs of MPEG-2 video decoders, for DVD and HDTV applications,
are demonstrated in this dissertation.
CHAPTER ONE
Introduction
1.1 Introduction
In recent years, many significant improvements in algorithms and architectures for
signal processing of still images, video, and audio have allowed multimedia information to
be easily stored, transmitted, and manipulated. These improvements have also fueled the
growth of multimedia industries such as the telecommunications industry, the consumer
electronics industry, and the electronic games industry. With these multimedia applications,
however, comes the need for a common audio-video coding standard. With such a standard
in place, all multimedia industries can accelerate their digital audio-video technology
development, reduce the costs associated with redundant development, and, more
fundamentally for users, guarantee the flow of content unrestricted by defensive technical
barriers [Chia97].
In 1988, in response to this growing need, the International Organization for
Standardization established the Moving Pictures Expert Group (MPEG) to develop
standards for the coded representation of moving pictures and associated audio information
on digital storage media. Because the goal of MPEG is to standardize audiovisual coding
for a wide range of applications, the MPEG-established standardization principles should
be generic, specifying minimum criteria. Thus, the MPEG standards do not specify an
encoding process. Instead, they only specify formats for representing data input to a
decoder and a decoding process. Therefore, every standards-compliant decoder should be
able to understand the syntax of an incoming bitstream and decode it. How to encode the
bitstream is irrelevant. This decoder-only specification provides enough flexibility for
manufacturers to design encoders of different complexities for different applications.
Although the MPEG standards specify the decoding process, this approach is
nevertheless different from specifying a decoder implementation. According to Haskell et
al. [Hask97], “The rules for interpreting the data are called decoding semantics; an ordered
set of decoding semantics is referred to as a decoding process.” Therefore, even after the
standards are established, manufacturers can still continually improve and optimize
decoding implementation algorithms or specific elements in a decoder if these
improvements comply with the semantics defined in the MPEG standards.
1.2 Overview of the Dissertation
Chapter 1 presents a brief description of MPEG-2 video decoding and its design
issues, including the hierarchy definition of MPEG-2 video stream data and an overview of
the video decoding process. Different types of architecture implementation for the video
decoder are mentioned. A well-defined analyzing paradigm for a decoder design is
proposed. This paradigm consists of four main design phases: processing model, process
management, resource management, and optimal architecture. Design considerations for
and characteristics of each phase will be discussed in detail. A general outline of the
research objectives is also introduced.
Chapter 2 features a review of published work on MP@ML video decoding. This
review will cover the processing models, memory storage organizations and interfaces, and
external memory access scheduling schemes. Under the video specification of MPEG-2, the
architecture designs of some of the functional units in a video decoder are straightforward,
with the design concept and design performance defined in the specification. Examples are
the Inverse Quantization Unit and the Motion Compensation Unit. Other functional units
can be implemented by using any of a variety of design approaches. Examples are the
Variable-Length Decoder and the Inverse Discrete Cosine Transform Unit. A review is also
given on existing architecture design approaches for the latter two major functional units.
The limitations of existing overall design approaches are then discussed, followed by a
look at the resulting motivations and challenges in developing a new, efficient, overall
design approach. Research directions for how to overcome the limitations of existing
approaches are also presented.
Chapter 3 shows a complete description of the framework for and techniques of the
proposed design approach, which is called the Block-Level-Pipeline (BLP) processing
scheme. The BLP scheme is a full-range design approach that consists of three major
techniques: an efficient data processing model, an efficient memory storage structure, and
an efficient bus scheduling approach for MPEG-2 applications. The strategic design
direction for each technique will be discussed in detail. A comparison of decoding
performance between the BLP approach and other decoding approaches is also presented.
In Chapter 4, an MP@ML application example (a DVD video decoder) is provided to
illustrate how the BLP scheme is applied to produce an efficient architecture design. First,
a guide for determining the efficiency of the video decoder architecture is discussed, which
takes advantage of the proposed analyzing paradigm to balance interrelated design factors
such as the width of the data bus, the sizes of internal buffers, and the degree of complexity
of the bus controller. Second, the proposed DVD architecture and the specific
implementation of the BLP scheme are presented. Third, the architectures of the key
functional units are illustrated. Finally, a simulation model and simulation results are
presented showing decoding performance under the proposed architecture with the BLP
scheme.
Chapters 5 and 6 then focus on an MPEG-2 MP@HL application (an HDTV video
decoder). In Chapter 5, a review of existing design work is presented, which covers the
following design issues: processing models, external memory interfaces, external memory
access scheduling schemes, and architecture of functional units. The disadvantages and
limitations of these design approaches are discussed. The resulting motivation and a
research direction for overcoming the limitations are also presented.
Chapter 6 presents how the BLP scheme is applied to the decoding process for the
HDTV application under the proposed dual-decoder architecture. A novel external memory
interface is presented in order to adopt the minimum data bus width that can accommodate
the heaviest bus traffic. For the proposed HDTV decoder design, one of the important
advantages is that its functional units can re-use the designs for MP@ML applications
(presented in Chapter 4), reducing the manufacturing cost. However, due to the requirement
for high data processing throughput, the Motion Compensation Unit and the Inverse
Discrete Cosine Transform Unit need minor improvements. These circuit modifications and
the new corresponding timing diagrams will be presented. Also presented will be
simulation results showing the decoding performance under a low processing frequency
with the proposed dual-decoder architecture and the BLP scheme.
Finally, in Chapter 7, the advantages of the BLP scheme are further emphasized by
implementing an MPEG decoder on portable devices. In addition to decoder applications,
the BLP scheme can benefit MPEG encoder design. These benefits will be illustrated.
Future research directions are also discussed in this final chapter.
1.3 Terminology of MPEG
To date, MPEG has developed three standards to meet the needs of different
applications. The first standard developed by MPEG, nicknamed MPEG-1 [ISO92],
is intended to code moving pictures and associated audio for intermediate data rates on the
order of 1.5 Mbits/sec. This standard is motivated by the need for storing video signals on a
compact disc with a quality comparable to VHS cassettes. The second standard, nicknamed
MPEG-2 [ISO94], is a syntactic superset of MPEG-1 that provides more input-format
flexibility, higher data rates (up to 18 Mbits/sec as required by high-definition TV), and
better error resilience. The third standard, nicknamed MPEG-4 [ISO99], is a coding
standard for very low bit rates (about 64 Kbits/sec or less). It is intended to support
interactivity (based on audiovisual data content), universal decoding downloadability, and
better coding efficiency.
This dissertation focuses on MPEG-2 video decoder implementation analysis and
architecture design. There are a number of basic terms that will be used throughout. A brief
explication of these terms may be helpful to further discussion. More extensive discussions
of these terms are found in Haskell, Mitchell, or Wiseman [Hask97, Mitc96, Wise98].
(1) Profiles/Levels: the MPEG-2 standard intends to satisfy a wide range of applications;
however, design complexity and cost will increase if a decoder is designed to meet the
requirements of all applications [Okub95]. Therefore, MPEG defines five distinct
profiles to specify subsets of the MPEG-2 video syntax and functionality for different
purposes. Each profile can be further constrained on its parameters (e.g., picture size)
by levels. Four levels are defined in MPEG-2. Table 1.1 shows a snapshot of level
bounds for different profiles. This dissertation will be only concerned with the
required functionality and parameter bounds on MP@ML (main profile at main level)
and MP@HL (main profile at high level).
(2) The Hierarchy of MPEG Video Stream Data: For ease of error handling, random
search and editing, and synchronization, MPEG video consists of several well-defined
hierarchical layers with header and data field as shown in Figure 1.1. The first layer is
known as the video sequence layer, which contains one or more groups of pictures.
The second layer from the top is the group of pictures (GOP), which is composed of
one or more intra (I) pictures and/or non-intra predicted (P) and bidirectional (B)
pictures. I-pictures are coded independently, P-pictures are coded with respect to the
immediately previous I- or P-picture, and B-pictures are coded with respect to the
immediately previous and/or immediately following I- or P-pictures. The third layer
is the picture layer itself, and the layer beneath that is
called the slice layer. Each slice is a contiguous sequence of raster-ordered
macroblocks. A macroblock (MB) is one 16x16 array of luminance (Y) pixels with
two 8x8 arrays of associated chrominance (Cb and Cr) pixels, which forms a 4:2:0
format. Two 8x16 arrays of chrominance pixels form a 4:2:2 format. The MBs can be
further divided into distinct 8x8 blocks for further processing, such as transform
coding.
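The 4:2:0 block structure described above can be made concrete with a short sketch (the helper names and the synthetic sample array are illustrative only, not part of the MPEG-2 specification): a 16x16 luminance array splits into four 8x8 blocks, which, together with the two 8x8 chrominance arrays, gives six 8x8 blocks per 4:2:0 macroblock.

```python
# Minimal sketch of the 4:2:0 macroblock structure described above.
# A macroblock holds a 16x16 luminance (Y) array plus two 8x8 chrominance
# (Cb, Cr) arrays; for transform coding the Y array is split into four
# 8x8 blocks, giving 4 + 1 + 1 = 6 blocks per 4:2:0 macroblock.

def split_into_blocks(mb, block=8):
    """Split a square 2-D list (e.g., a 16x16 Y array) into block x block
    tiles, returned in raster order: top-left, top-right, bottom-left,
    bottom-right."""
    n = len(mb)
    blocks = []
    for by in range(0, n, block):
        for bx in range(0, n, block):
            blocks.append([row[bx:bx + block] for row in mb[by:by + block]])
    return blocks

def blocks_per_macroblock(chroma_format="4:2:0"):
    """Number of 8x8 blocks per macroblock for the two formats in the text."""
    y_blocks = 4                  # 16x16 luminance -> four 8x8 blocks
    chroma_blocks = {"4:2:0": 2,  # one 8x8 Cb + one 8x8 Cr
                     "4:2:2": 4}  # one 8x16 Cb + one 8x16 Cr -> two 8x8 each
    return y_blocks + chroma_blocks[chroma_format]

# Example: a synthetic 16x16 Y array whose value encodes (row, col).
y = [[r * 16 + c for c in range(16)] for r in range(16)]
tiles = split_into_blocks(y)   # four 8x8 tiles in raster order
```

This raster-order tiling matters in practice because the IDCT and motion compensation stages of a decoder operate on exactly these 8x8 units.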
Simple Profile (SP)
  Main Level (ML): 720 pels/line, 576 lines/frame, 30 frames/sec; 15 Mbit/sec

Main Profile (MP)
  Low Level (LL): 352 pels/line, 288 lines/frame, 30 frames/sec; 4 Mbit/sec
  Main Level (ML): 720 pels/line, 576 lines/frame, 30 frames/sec; 15 Mbit/sec
  High-1440 Level (H-1440): 1440 pels/line, 1152 lines/frame, 60 frames/sec; 60 Mbit/sec
  High Level (HL): 1920 pels/line, 1152 lines/frame, 60 frames/sec; 80 Mbit/sec

SNR Scalable Profile (SNR)
  Low Level (LL): 352 pels/line, 288 lines/frame, 30 frames/sec; 4 Mbit/sec both layers, 3 Mbit/sec base layer
  Main Level (ML): 720 pels/line, 576 lines/frame, 30 frames/sec; 15 Mbit/sec both layers, 10 Mbit/sec base layer

Spatially Scalable Profile (SPATIAL)
  High-1440 Level (H-1440): 1440 pels/line, 1152 lines/frame, 60 frames/sec; 60 Mbit/sec all 3 layers, 40 Mbit/sec base + middle, 15 Mbit/sec base layer

High Profile (HP)
  Main Level (ML): 720 pels/line, 576 lines/frame, 30 frames/sec; 20 Mbit/sec all 3 layers, 15 Mbit/sec base + middle, 4 Mbit/sec base layer
  High-1440 Level (H-1440): 1440 pels/line, 1152 lines/frame, 60 frames/sec; 80 Mbit/sec all 3 layers, 60 Mbit/sec base + middle, 20 Mbit/sec base layer
  High Level (HL): 1920 pels/line, 1152 lines/frame, 60 frames/sec; 100 Mbit/sec all 3 layers, 80 Mbit/sec base + middle, 25 Mbit/sec base layer

Table 1.1 Parameter bounds of video streams for the five MPEG-2 profiles
Figure 1.1 Data hierarchy and functionality of MPEG-2 video bitstream: (a) video stream data hierarchy — video sequence, group of pictures, picture, slice, macroblock (a 16x16-pel Y array with 8x8 Cb and Cr in 4:2:0 format, or 8x16 Cb and Cr in 4:2:2 format), and 8x8-pel block; (b) function of each layer — sequence layer: random access unit (context); group of pictures layer: random access unit (video); picture layer: primary coding unit; slice layer: resynchronization unit; macroblock layer: motion compensation unit; block layer: DCT unit.
(3) Discrete Cosine Transform: In general, neighboring pixels within an image tend to be
highly correlated; this correlation is known as spatial redundancy. Therefore, MPEG uses invertible
block-based discrete cosine transform (DCT) coding of 8x8 pel blocks to decompose
the signal into underlying spatial frequencies for energy concentration and
decorrelation of image data [Ahme74]. The general idea in the computation of DCT is
that each block of image data is represented as a set of basis functions and scaling
factors. The results from the DCT computation process are called DCT coefficients. In
each block, the DCT coefficient located at the extreme upper left-hand corner is
called the DC coefficient, while other DCT coefficients are called AC coefficients. In
the video decoder, the inverse computation is implemented in an inverse
discrete cosine transform (IDCT) unit to transform the DCT coefficients back into the
original image data.
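The separable 8x8 DCT/IDCT pair described above can be sketched directly from the DCT-II basis functions. This is a straightforward floating-point reference form for illustration, not the fast or hardware-oriented algorithms used in real decoders:

```python
import math

N = 8  # block dimension

def _c(u):
    # Normalization factor of the DCT-II basis
    return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)

def dct2(block):
    """Forward 8x8 DCT: pel block -> coefficient block (coeffs[u][v])."""
    return [[_c(u) * _c(v) * sum(
                block[x][y]
                * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                for x in range(N) for y in range(N))
             for v in range(N)]
            for u in range(N)]

def idct2(coeffs):
    """Inverse 8x8 DCT (the decoder's IDCT): coefficients -> pels."""
    return [[sum(_c(u) * _c(v) * coeffs[u][v]
                 * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                 * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                 for u in range(N) for v in range(N))
             for y in range(N)]
            for x in range(N)]

# A flat block concentrates all of its energy in the DC coefficient.
flat = [[128] * 8 for _ in range(8)]
coeffs = dct2(flat)
print(round(coeffs[0][0]))  # 1024 (the DC coefficient); all AC terms are ~0
```

Applying `idct2` to the output of `dct2` recovers the original block to floating-point precision, which is the invertibility property the text relies on.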
(4) Quantization: The human visual system is less sensitive to high frequency signals.
Hence, it is desirable that the DCT coefficients belonging to higher frequency parts
are more coarsely quantized in their representation. The process of quantization is
described as follows. Each DCT coefficient is divided by a corresponding quantization
matrix value that is supplied from an intra-quantization matrix. Each value in this
matrix is pre-scaled by multiplying by a single value, known as the quantiser scale
code. This quantiser scale is modifiable on a macroblock basis, making it useful as a
fine-tuning parameter for the bit-rate control. The goal of this operation is to force as
many of the DCT coefficients to zero as possible within the boundaries of the
prescribed bit-rate and video quality parameters. When a bitstream is decoded in a
decoder, the inverse processing of quantization is implemented in an inverse
quantization (IQ) unit.
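A minimal sketch of the quantization arithmetic described above, for a single coefficient with quantization-matrix entry `w` and a quantiser scale. The separate intra-DC rule, mismatch control, and saturation of the real MPEG-2 arithmetic are deliberately omitted:

```python
def quantize(coeff, w, quantiser_scale):
    """Forward quantization of one DCT coefficient (encoder side)."""
    return round((16 * coeff) / (w * quantiser_scale))

def inverse_quantize(level, w, quantiser_scale):
    """Inverse quantization (the decoder's IQ unit) of one quantized level."""
    return (level * w * quantiser_scale) // 16

# High-frequency entries of the intra matrix are larger, so high-frequency
# coefficients are quantized more coarsely and often forced to zero.
print(quantize(6, w=8, quantiser_scale=4))    # 3: low-frequency value survives
print(quantize(6, w=56, quantiser_scale=4))   # 0: high-frequency value is dropped
print(inverse_quantize(3, w=8, quantiser_scale=4))  # 6: IQ recovers the value
```

Raising the quantiser scale on a macroblock pushes more coefficients to zero, which is exactly the bit-rate control lever the text describes.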
(5) Run Length Coding and Zigzag Scanning Order: After quantization, most of the
energy (non-zero DCT coefficients) is concentrated within the lower frequency
portion (upper left-hand corner) of the matrix, and most of the higher frequency
coefficients have been quantized to zero. Hence run length coding can be used to
represent the large number of zero coefficients in a more effective manner, and a
zigzag scanning pattern can be used to maximize the probability of achieving long
runs of consecutive zero coefficients. This zigzag pattern is shown in the left portion
of Figure 1.2. An alternate scanning pattern defined in MPEG-2 is shown in the right
portion of the figure. This scanning pattern may be chosen by the encoder on a frame
basis, and has been shown to be effective on interlaced video images. This
dissertation will focus only on usage of the standard zigzag pattern.
[Figure 1.2 shows (a) the normal zig-zag scan, which is mandatory in MPEG-1 and
optional in MPEG-2, and (b) the alternate zig-zag scan, which is not used in MPEG-1
and optional in MPEG-2. Both scans start at the DC coefficient. For frame DCT coding
of interlaced video, more energy exists in the region favored by the alternate scan, so
run length coding is more efficient there.]

Figure 1.2 Two methods for scanning DCT coefficients are available in MPEG-2
(6) Motion Compensation: In general, there are many similarities between adjacent
pictures; this is called temporal redundancy. MPEG-2 exploits this redundancy by
computing interframe differences relative to areas that are shifted with respect to the
area being coded. The whole process is known as motion compensation (MC) and the
interframe difference is called the prediction error. An example of motion
compensation is sketched in Figure 1.3. The encoder uses a motion estimation
technique to find the set of displaced MBs in the reference pictures that best matches
the current coded MB. The motion vectors (MV) that are then encoded and transmitted
to the decoder as part of the bitstream indicate the positions of these displaced MBs.
The prediction error is then transmitted using the DCT encoding technique as
described above. The decoder then knows which areas of the reference pictures were
used for each prediction, and adds the decoded prediction errors to this motion
compensation prediction to obtain the output.
[Figure 1.3 shows a current B picture between a previous reference picture and a
future reference picture along the time axis. The current MB to be coded is aligned
to the MB grid; the "best match" MBs in the reference pictures are located to half-pel
accuracy and need not be aligned to the grid, e.g., with a motion vector of
[-10.5, -5.5].]

Figure 1.3 Motion compensation interpolation using bi-directional prediction. A displaced MB in the previous picture is used as one prediction of the coded MB in the current B picture, and a displaced MB in the future picture is used as a second prediction. One, or an average of both, can be used as the final prediction.
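A one-dimensional sketch of the half-pel interpolation and bi-directional averaging illustrated in Figure 1.3. The function names and the 1-D simplification are illustrative; real motion compensation operates on two-dimensional macroblocks:

```python
def fetch_block(ref_row, start, size, half_pel):
    """Fetch `size` pels starting at `start`; average each pel with its
    right neighbor when the vector points to a half-pel position."""
    if not half_pel:
        return ref_row[start:start + size]
    return [(ref_row[start + i] + ref_row[start + i + 1] + 1) // 2
            for i in range(size)]

def bidirectional_prediction(fwd, bwd):
    """Average the forward and backward predictions (rounded up)."""
    return [(f + b + 1) // 2 for f, b in zip(fwd, bwd)]

row = list(range(0, 160, 10))                  # a toy reference scan line
print(fetch_block(row, 2, 3, half_pel=True))   # [25, 35, 45]
print(bidirectional_prediction([25, 35, 45], [27, 37, 47]))  # [26, 36, 46]
```

The decoder adds the transmitted prediction error to this prediction to reconstruct the output pels.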
(7) Variable Length Coding: For reducing the coding redundancy, MPEG uses a Huffman
type entropy coding [Huff52] to encode a sequence of symbols, such as MB
addressing, MB type, motion vectors, and DCT coefficients, to the shortest possible
bitstream. The basic coding principle is shorter codewords assigned to more probable
symbols. Therefore, at the MPEG-2 decoder, there is a variable length decoder (VLD)
to recover these codewords, recreating the original data.
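The prefix-matching principle behind variable length decoding can be sketched with an illustrative code table. These codewords and symbol names are invented for the example; the actual MPEG-2 VLC tables are defined in the standard and are typically decoded with hardware lookup tables:

```python
CODE_TABLE = {      # shorter codewords are assigned to more probable symbols
    "1": "sym_a",
    "01": "sym_b",
    "001": "sym_c",
    "000": "sym_d",
}

def vld(bits):
    """Decode a bitstring into symbols; works because no codeword
    is a prefix of another."""
    symbols, code = [], ""
    for b in bits:
        code += b
        if code in CODE_TABLE:
            symbols.append(CODE_TABLE[code])
            code = ""
    if code:
        raise ValueError("truncated bitstream")
    return symbols

print(vld("101000"))  # ['sym_a', 'sym_b', 'sym_d']
```

The prefix property lets the decoder emit a symbol the moment a codeword completes, with no separators in the bitstream.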
(8) Video Buffer Verifier: MPEG-2 has one important encoder restriction, namely, a
limitation on the variation in bits per picture, especially in the case of constant bitrate
operation. Hence, MPEG defines an idealized model of the decoder called the video
buffer verifier (VBV). The VBV is used to constrain the instantaneous bitrate of an
encoder such that the average bitrate is met without an overflow or underflow of the
decoder’s compressed data buffer.
1.4 Overview of the MPEG-2 Video Decoding Process

As mentioned in Section 1.1, the MPEG-2 standard only specifies the decoding
process such that all decoders shall produce numerically identical results with the
exception of the IDCT [ISO94]. The IDCT is defined statistically in order for different
implementations of this function to be allowed. A simplified and high-level functional
diagram of the MPEG-2 video decoding process [Isnr98] is shown in Figure 1.4 and
described below:
1. A compressed video bitstream supplied from the system demultiplex is written to a
VBV buffer on an external DRAM through the channel FIFO and data bus.
2. This compressed video bitstream is then read from the DRAM into a bitstream-
parsing unit, extracting the fixed-length and variable-length coded data. The fixed-
length coded data belongs to the high layer syntax of a video stream, including the
sequence header, the GOP header, and the slice header. The variable-length coded
data includes MB headers and quantized DCT coefficients and will be decoded in a
VLD unit. The decoded DCT coefficients will then be transferred to IZZ, IQ, and
IDCT units for further processing.
3. If the current decoded MB is a non-intra MB, motion vectors are extracted from the
MB header by the VLD unit and sent to an addressing unit for deriving the actual
addresses of reference MBs. A video may be encoded in a progressive or interlaced
scanning pattern, while the reference pictures also can be stored in one of these
patterns. Therefore, the actual address computing depends on the field/frame
prediction signal [Puri93]. The MVs for chrominance pixels are derived from the
luminance MVs by a scaling that depends on the chrominance sampling density. For
example, the chrominance MVs of 4:2:0 video are derived from dividing both
horizontal and vertical components of the luminance MVs by two. If MVs are given in
the half-pixel boundary, the reference MBs need an interpolating computation.
Finally, if the decoded MB depends on more than one reference MB, their average is
used as the final prediction.
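The chrominance vector derivation in step 3 can be sketched as follows. Vectors are held in half-pel units; the truncation toward zero used here is a simplification, and the exact rounding rule is defined by the standard:

```python
def chroma_mv(luma_mv):
    """Derive the 4:2:0 chroma motion vector by halving both components
    of the luminance vector (half-pel units)."""
    def half(v):
        return int(v / 2)   # truncates toward zero for either sign
    dx, dy = luma_mv
    return (half(dx), half(dy))

def needs_half_pel(mv):
    """A component in half-pel units points between pels when it is odd."""
    dx, dy = mv
    return dx % 2 != 0, dy % 2 != 0

print(chroma_mv((-21, 10)))       # (-10, 5)
print(needs_half_pel((-21, 10)))  # (True, False): interpolate horizontally only
```

When `needs_half_pel` flags a component, the reference block is interpolated as described above before being used as a prediction.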
4. The quantized DCT coefficients are first de-zigzagged and inverse-quantized in the
IZZ and IQ units. These values are then forwarded to the IDCT unit to recover the
original pixels or residual values. Finally, if the current decoded MB is a non-intra
MB, the outputs are added to an anchor MB coming from an MC unit to produce a
reconstructed MB. The reconstructed MB is then stored into external DRAM. For an
intra-MB, the IDCT results directly compose an original MB that can be immediately
stored into DRAM.
5. After one picture is decoded, the reconstructed image may be re-read from the DRAM
to a display processor for displaying, or to the decoder chip again to become the
reference data.
[Figure 1.4 shows the decoding path: the MPEG-2 bitstream enters a FIFO and a
parsing unit; the VLD feeds DCT coefficients through the IZZ (controlled by the
zig-zag scan mode), the IQ (using the quant scale factor and quant matrices), and
the IDCT. Motion vectors pass through vector predictors, scaling for chroma, DRAM
addressing with field/frame prediction selection, half-pel prediction filtering, and
combining of predictions. The IDCT output and the prediction are added, and the
decoded pixels are written to external DRAM for displaying.]

Figure 1.4 A simplified and high-level functional diagram of the MPEG-2
video decoding process
1.5 The Design of an MPEG-2 Video Decoder

MPEG-2 targets a wide range of applications that may reside in workstations,
personal computers, and consumer products. Basically there are three types of
implementation for the decoding process [Lin95]:
(1) Generic processor
(2) Custom data path engine
(3) Application-specific processing engine
A generic processor may be based on reduced instruction set computing (RISC) or
other all-purpose PC-based architecture. Recently, the rapid progress of generic processor
technology [White93, Trem95, Saav95] has given new impulse to software MPEG decoding
[Ikek97, Hsiau97]. Software decoders usually have advantages in shortening the design
time compared to hardware decoders and providing versatile and adaptive functions
[Lee99]. However, unless special programming solutions are adopted [Bhas95, Bagl96], the
software decoders turn out to be extremely inefficient. This is because no compiler
can automatically detect these optimizations and generate efficient machine code. Moreover,
the computing power of present generic processors still cannot satisfy the requirements for
a digital HDTV (high-definition television) video decoder performing real-time decoding of
MPEG-2 MP@HL pictures, which requires a computing power of about four to five billion
operations per second [Lee96].
Custom data path engines are special-purpose processors that are based on
application-specific instruction sets. Typical examples of such custom data path engines are
today’s digital signal processors (DSPs) [Veen94, Brin96]. Usually, a DSP core that is
refined to constitute a decoder needs a pixel I/O controller and a specific parallel
functional unit [Balm94]. The advantage of this approach is flexibility, since codec
functions are completely realized by microcodes. Therefore, developers can quickly
respond to changes and improvements in compression algorithms, even after the silicon is
built. However, the disadvantage is the cost in terms of silicon area. A DSP core can take
up to five times the area and dissipate more power than its dedicated hardware counterparts
[Ackl94]. The user-programmable part may also incur substantial software development
costs since these special instructions are less convenient to learn and are difficult to
optimize automatically.
Application-specific processing engines are hardware-dedicated data processing units
for specific functions. For example, an MPEG-2 decoder may be constituted of different
dedicated processing units, such as a VLD unit for decoding Huffman codes, an IDCT unit
for efficient IDCT processing, and an IQ unit for inverse quantization. The application-
specific processing engines, significantly differing from DSP cores and generic processors,
can move the processing along in hardware instead of demanding instruction cache and data
paths for decoding instructions; therefore, they are compact and highly efficient. This
property clearly leads to tradeoffs in efficiency, flexibility, and cost. However, such
applications as HDTV decoders are suitable for implementation with these dedicated
processing engines because the required computing power is very high, the cost
constraints are tight, and standards are settled before products become available.
This dissertation is only concerned with the implementation of application-specific
processing engines to the video decoding architecture of MPEG-2 MP@ML and MP@HL.
Given the implementation approach of the MPEG-2 video decoder, there are many
ways to build a video processing IC for specific applications. To obtain an optimal design
for high-end applications, or to obtain a more cost-efficient design, an analyzing paradigm
can be derived from parts of a well-defined analysis model for multimedia operation system
proposed by Steinmetz [Stei95]. The proposed paradigm consists of four main phases:
processing model, process management, resource management, and optimal architecture.
The interdependence of these phases is depicted in Figure 1.5. Any design change or design
problem occurring in any phase may require designers to return to the preceding phases for
appropriate modification or clarification. A short discussion of the four phases is presented
below:
[Figure 1.5 shows four phases arranged in sequence: Processing Model, Process
Management, Resource Management, and Optimal Architecture.]

Figure 1.5 Analyzing phases for MPEG-2 video decoder architecture design
1. Processing Model: This phase presents a model of how a video decoder processes the
video data in an application. It depends closely on the characteristics of the video
data, such as frame size, frame rate, incoming bitrate, and the data processing rate of
each functional unit. Generally a processing model includes two different
viewpoints—implementation architecture and bitstream structure. From an
implementation architecture viewpoint, this model can be linear pipeline style or
parallel processing style. From the bitstream structure viewpoint, following the specs
of MPEG-2 bitstream hierarchical structure, this model may start from slice layer,
macroblock layer, or block layer.
2. Process Management: This phase shows how process management must take into
account the timing requirement imposed by the processing model and then apply
appropriate scheduling approaches. A proper scheduling scheme must consider timing
and logical dependencies, both internal and external, among different, related tasks
processed at the same time. Therefore, the responsibility of process management is
not only to be a guide for errorless computations in each functional unit in the video
decoder according to the specification, but also to direct the output of each
processing unit to arrive on time.
3. Resource Management: To accommodate timing requirements, resource management
treats each single component as a resource reserved prior to data processing. In the
MPEG-2 video decoder, the resources are frame memory, data bus, and internal
buffers. As described in Section 1.4, the frame memory is for storing an incoming
compressed bitstream and reconstructed pictures, while the data bus is for delivering
these video data. An internal buffer is associated with each processing unit for
buffering the processed data because the throughput rate of every processing unit is
different. Each resource has a capacity measured by a task’s ability to perform in a
given time-span using the resource. In this context, “capacity” refers to each
functional unit’s data processing rate, frequency range, or amount of storage.
4. Optimal Architecture: The main characteristic of a multimedia system is the need for
correct response time. For example, the playback of a video sequence in multimedia
is acceptable only when the video is presented neither too fast nor too slow.
Furthermore, research at IBM Heidelberg [Stei96] shows that users may not perceive
a slight jitter in a media presentation, depending on the medium and the application.
Therefore, the best method to achieve an optimal architecture in the MPEG-2 video
decoder system is not to focus on the architecture’s processing speed, but to ensure
that the most video data can be decoded by a specific deadline. Resolving the above
three issues will lead to an optimal architecture for the MPEG-2 video decoder
system.
1.6 Research Objectives

Visual communication is a rapidly evolving field for the media, computer,
and telecommunication industries. Hence, many decoding algorithms/implementations are
being proposed and developed for solving problems in these areas. Among these proposals,
the DCT/IDCT and VLD techniques are well developed for different applications. However,
proper processing models developed for decoding controllers, and resource analyses such as
mapping memory storage and determining internal buffer size, are still rare and have some
limitations (see Chapter 2 and Chapter 5). The main objective of this research is developing
a solid and efficient processing model, memory storage mapping organization, and bus
scheduling approach so that users can combine the proposed techniques with already well-
developed MPEG-2 video processing units, or derive an optimal architecture design for
every processing unit to better satisfy different applications. Essential for the introduction
of new video communication services is low cost. It is also our intention to minimize the
data bus width and internal buffer sizes so as to reduce the chip size and thus lower the
manufacturing cost.
The proposed processing model is called a Block-Level-Pipeline (BLP) processing
scheme [Ling98, Wang98, Wang99b] due to the fact that the video decoding process/path is
based on the block layer that is defined as the lowest partition unit in the hierarchical
syntax of the MPEG-2 video bitstream. Under BLP control, each processing unit not
only processes video data on a block-by-block basis but also accesses external frame
memory for motion compensation on a block-by-block basis. Also developed is an interlaced frame
memory storage organization and a deterministic fairness-priority bus scheduling scheme to
cooperate with the BLP scheme. A video decoder designer can apply this BLP scheme to a
set of processing units to derive a suitable MPEG-2 video decoding architecture for any
application. Figure 1.6 depicts the data flow of this BLP scheme.
[Figure 1.6 shows the five stages VLD, IQ/IZZ, IDCT, MC, and Write-Back operating
on blocks 0 through 5 of a macroblock in staggered time slots: while the VLD works
on block n, the downstream units work on blocks n-1, n-2, and so on. The figure
marks the maximum time for processing one block and the maximum time for a MB
processing.]

Figure 1.6 Data flow of the Block Level Pipeline processing scheme
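The staggered timing of Figure 1.6 can be sketched as a toy schedule in which every unit spends one uniform slot per block. The unit names come from the figure; the uniform one-slot-per-block timing is a simplification for illustration:

```python
UNITS = ["VLD", "IQ/IZZ", "IDCT", "MC", "Write-Back"]

def blp_schedule(num_blocks):
    """Map each time slot to {unit: block index} under a block-level
    pipeline in which every unit spends one slot per block."""
    slots = {}
    for stage, unit in enumerate(UNITS):
        for block in range(num_blocks):
            slots.setdefault(block + stage, {})[unit] = block
    return slots

sched = blp_schedule(6)  # the six blocks of a 4:2:0 macroblock
print(sched[2])          # {'VLD': 2, 'IQ/IZZ': 1, 'IDCT': 0}
```

In any slot each busy unit holds a different block, and at most one block at a time needs the frame memory, which is the even bus-traffic property claimed for BLP below.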
The main objective of this research is to develop the BLP scheme to serve as a
decoding model. Given a set of video processing units, one can specify the design in six
areas. The first three areas apply to both MP@ML and MP@HL applications:
1. The description of this design is complete and precise because it takes processing
model, resource management, and process management into account.
2. The data bus traffic is more evenly distributed in time due to the block-by-block
access method for frame memory. Minimization of the bus width requirement is the
result.
3. The associated internal buffer size of each processing unit (IQ, IDCT, and MC units)
can be reduced to a minimum because each time, every processing unit only decodes
one block of video data.
The next three areas are for MP@HL applications such as HDTV, in which multiple
decoding paths for parallel processing are needed on account of the larger amount of video
data characterizing these kinds of applications. The BLP scheme can still be applied to this
parallel architecture design because every decoding path decodes alternating blocks while
the whole decoding process works on the same MB. The three design considerations are as
follows:
4. Frame memory I/O contention occurs when every decoding path needs to access
external frame memory for motion compensation at the same time. This contention
can be avoided with the alternate switching process. Frame memory access will
follow the same alternation pattern.
5. The synchronization among these decoding paths can be simplified because they work
on the same macroblock.
6. This processing model can balance the computation load of a video decoder because
each decoding path receives evenly dispensed blocks of data during decoding of one
macroblock. Therefore, the system performance of a decoder can reach an optimum.
The second objective of this research is to develop both an interlacing frame memory
storage organization and a deterministic fairness-priority bus scheduling scheme. These two
techniques can help BLP succeed in the following ways:
1. This simple frame-storing pattern can efficiently lower the probability of occurrence
of page-break when accessing the external frame memory. Reducing DRAM access
latencies is an important issue in a limited bandwidth system design.
2. Unlike other real-time systems, the bus scheduler can be simplified to a deterministic
and fairness-priority approach due to only one block of data being conveyed on the
bus at one time. With this short-duration data transfer approach, a complicated bus
scheduler to prevent starvation conditions is not needed.
Besides all the above, another advantage of the BLP scheme is minimizing power
consumption of the video decoder. During decoding of a macroblock, not all processing
units are always in operation; hence, a clock supply controller can be designed to supply
clock signals to processing units and the associated internal buffers only when they are
working. With the BLP scheme, this clock supply controller can easily predict which
processing units are going to be idle. In summary, the above objectives for and advantages
of BLP make it practical for the cost-effective design environment of the consumer
electronics industry.
CHAPTER TWO
Processing and Storage Models for MPEG-2 MP@ML Video Decoding — Review of Prior Art
2.1 Introduction
Visual applications, services, and equipment play important roles in people’s lives as
a preferred means of communication. As a core technology in the digital compression
system for consumer electronic devices, MPEG decoders are rapidly growing in popularity
as they are adopted by the consumer electronics industry. For the designer of a decoder, the
major issue is not only how to decode each received bitstream in real-time but also how to
reduce the silicon area of the decoder and how to integrate functional units on a single chip
with low power consumption. In the last decade, significant improvements in VLSI
technology have relieved hardware problems caused by hardware systems having high
complexity; but the high demands for video decoding still require special architecture
approaches adapted to the video-decoding scheme. As explained in Chapter 1, the MPEG
standards do not specify a decoder implementation but define the decoding process.
Therefore, much research on processing models, memory storage organization structures,
bus scheduling schemes, and hardwired functional units has been done. Before presenting
the proposal for the Block-Level-Pipeline (BLP) processing scheme, it is worthwhile to
review related MP@ML work from other researchers in order to easily point out the
differences between the BLP model and other processing models. However, no design
exists that provides a "total" solution for all applications. The advantages or disadvantages
of a design vary with the different needs of different applications.
[Figure 2.1 shows the stages VLD, IQ/IZZ, IDCT, MC, and Write-Back processing
macroblocks MB 0 through MB 8 in staggered time slots T0 through T8, with a
pixel-level pipeline (Y0, Y1, Y2, Y3, Cb, Cr) inside each stage. The figure marks
the maximum time for processing one MB.]

Figure 2.1 Data flow of the macroblock-level pipeline decoding scheme
2.2 Review of Related Work
2.2.1 Processing Model
Due to the characteristics of the MPEG algorithm and the huge computational
demands for video processing, all MPEG video decoders adopt two-level or three-level
parallelism and pipeline structure in their design. For applications using MPEG-2 MP@ML,
such as DVD players, the macroblock-level pipeline scheme combined with a pixel-level
pipeline scheme is a common processing model adopted by designers [Fern96, Iwata97,
Toyo94, Yasu97].
As shown in Figure 2.1, the video data of a macroblock is processed in pipeline style
between functional units, and a pixel-level pipeline scheme is performed within each
functional unit. The pixel-level pipeline is obtained by means of a conventional pipeline
design, which optimizes the ratio of operations per second to silicon area by balancing the
throughput of all functional units. This macroblock-level pipeline is obtained by scheduling
operations and data between functional units and external DRAMs. Therefore, to maintain
the correct pipelining, the decoder needs to couple with a global pipeline controller to
delimit the processing time of each functional unit. The decoder also needs to be equipped
with many buffers of reasonable size (usually holding two or three macroblocks)
associated with the functional units, for buffering data due to the difference in
processing rates between consecutive functional units.
Based on the above conventional macroblock-level-pipeline decoding scheme, Lin
proposed an amended macroblock-level-pipeline decoding scheme [Lin96], as illustrated in
Figure 2.2. In this decoding scheme, all functional units and I/O transactions still operate
on a macroblock basis, but they must wait for other functional units to finish their tasks
before beginning to decode a new macroblock. This scheme can minimize the problem of
huge internal buffer size in the conventional macroblock-level decoding scheme. But, from
a resource utilization viewpoint, it is not an efficient design because the functional units
are often in an idle state.
[Figure 2.2 shows the stages VLD, IQ/IZZ, IDCT, MC, and Write-Back each processing
MB 0, MB 1, and MB 2 in turn over time slots T0 through T8: all units finish one
macroblock before any unit begins the next, so the maximum time for processing one
MB spans several slots.]

Figure 2.2 Data flow of the amended macroblock-level pipeline decoding scheme
In general, the processing models described above are fixed-capability models
because they are running at a specific constant clock frequency for decoding. The selection
of a clock frequency for decoding is made according to the worst-case performance
requirement among all stages within the process of pipeline decoding. Hence, system
resource utilization is sometimes low. To solve this problem, a frequency-scaling
processing model was proposed [Kim96]. This processing model adjusts the clock
frequency for decoding according to the amount of coded data in a picture. If the job load
is heavy, the clock frequency is increased to satisfy the performance requirements;
otherwise, the clock frequency is decreased to adjust the throughput and save on the power
consumption. However, this frequency-scaling scheme encounters two problems. First, the
amount of data in a picture is hard to predict; hence, a precise clock frequency is difficult
to choose in advance. The goals of this model, such as increasing system resource
utilization and lowering power consumption, are very difficult to reach. Second, the
hardware design of each functional unit must be sophisticated because the wide range of
clock frequencies will easily cause timing issues between gates. Therefore, this clock
frequency scaling model is rarely employed in industry products.
2.2.2 Memory Storage Organization and Interface
For MPEG-2 decoding systems, there are at least three frame memories required for
storing two reference pictures and an output B-picture. Therefore, three data streams
frequently cross the memory interface during the decoding process: reading reference
macroblocks, writing decoded macroblocks, and transferring display pictures. To cope with
such a memory access bottleneck, many new memory architectures to improve bandwidth
have been constructed [Prin96]. Among these new designs, conventional DRAM
architecture with Fast-Page mode [Lee95, Uram95] and synchronous DRAM (SDRAM)
architecture [Hama99, Onoy95, Taka99, Winz95] are widely adopted in MPEG-2 MP@ML
applications. Besides the architectural properties of DRAMs, the characteristics of picture
data access have to be taken into account. Thus, frame memory storage organization needs
to be concerned with reducing access latencies such as bank pre-charge time and page
breaks.
Scan-line storage organization is a simple and easy method for storing the picture
data. From the viewpoint of a display processor, it is straightforward for the display
processor to access the picture data because the display process is in scan-line style.
However, motion compensation, for reading pre-decoded reference data and for writing the
current decoded data, consumes more than 60 percent of memory bandwidth [Liu96]. This
reading and writing accesses the frame memory on a macroblock basis. There are three data
storage organization structures commonly proposed, which are based on the macroblock-
type accessing pattern. One structure sequentially stores macroblocks in a conventional
DRAM [Uram95], as shown in Figure 2.3(a). The pel data in a macroblock is put in a
memory page in order to take advantage of the features of Fast Page mode. The second
structure sequentially stores macroblock-rows in multiple banks of SDRAM [Winz95], as
shown in Figure 2.3(b). This storage structure takes advantage of the alternative bank
access feature in the special SDRAM architecture in order to reduce precharge latency. The
third structure sequentially stores 2x2 macroblock-sets in multiple banks of SDRAM
[Taki01], as shown in Figure 2.3(c). Figure 2.4 (a) shows an example of macroblock-basis
access, in which a DRAM word (e.g. 64 bits) contains either eight horizontal neighboring
luminance pels, or their corresponding four Cb and four Cr chrominance pels [Demu94].
For dealing with interlaced pictures, top-field and bottom-field separately stored into
different banks of a frame memory is a common approach [Onoy95, Winz95], as shown in
Figure 2.4 (b).
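The field-per-bank storage idea just described can be sketched as a simple address mapping. The arithmetic is illustrative, not a particular chip's layout:

```python
def field_bank_address(x, y, pels_per_line=720):
    """Map a luminance pel (x, y) of an interlaced frame to (bank, offset):
    even lines (top field) go to bank 0, odd lines (bottom field) to bank 1."""
    bank = y % 2
    field_line = y // 2   # line index within the chosen field
    return bank, field_line * pels_per_line + x

print(field_bank_address(10, 4))  # (0, 1450): top field, its line 2
print(field_bank_address(10, 5))  # (1, 1450): bottom field, its line 2
```

With this split, a field-prediction reference fetch touches only one bank, so the other bank can be precharged in parallel.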
[Figure 2.3 shows three memory mappings of a pre-decoded reference picture:
(a) a sequential storage structure in EDO DRAM, where one macroblock row occupies
one page of data-word length; (b) a sequential storage structure alternating
macroblock rows between Bank 0 and Bank 1 of an SDRAM; and (c) a 2x2
macroblock-set storage structure in the two SDRAM banks.]

Figure 2.3 Three typical memory mapping structures for the frame buffer
[Figure 2.4(a) shows the storage structure of pixel data in DRAM: a 64-bit
luminance word holds eight 8-bit Y pels of a macroblock, and a 64-bit chrominance
word holds four Cb and four Cr pels. Figure 2.4(b) shows the storage structure for
the interlaced picture format: bank 0 holds the top field of Y pixels of a reference
picture together with its Cb and Cr pixels, while bank 1 holds the bottom field of
Y pixels and its Cb and Cr pixels.]

Figure 2.4 Storage structure of picture data in DRAM
The frame size and frame rate of typical MP@ML applications, within the bounds
indicated in Table 1.1, are 720x480 at 30 frames per second. The memory size needs to be at least 11.9 Mbits,
which includes a frame buffer for storing two reference frames and one B-picture output
buffer. The VBV buffer size for MP@ML is 1.7 Mbits. Thus, 16 Mbits for the total DRAM
size will be enough. Under the macroblock-level processing model, the motion
compensation unit in a video decoder needs to read one or two macroblocks of reference
data from external DRAM. To speed bus response and avoid starvation of other functional
units, a 64-bit data bus is a common choice for the video decoder [Demu94, Lee95, Lin96,
Toyo94].
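These buffer figures can be checked with a short calculation (a sketch that assumes binary megabits, i.e. 1 Mbit = 2^20 bits, and uses the 1.7-Mbit VBV size quoted above):

```python
# Memory budget for MP@ML decoding (4:2:0 format), with 1 Mbit = 2**20 bits.
MBIT = 2 ** 20
frame_bits = 720 * 480 * 8 + 2 * (360 * 240 * 8)   # Y plane plus Cb and Cr planes
buffer_bits = 3 * frame_bits     # two reference frames + one B-picture buffer
vbv_bits = 1.7 * MBIT            # VBV buffer size for MP@ML (from the text)

print(round(buffer_bits / MBIT, 1))               # -> 11.9
print(round((buffer_bits + vbv_bits) / MBIT, 1))  # -> 13.6 (fits in 16 Mbits)
```

The total of about 13.6 Mbits is why a single 16-Mbit DRAM suffices.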
2.2.3 External Memory Access Scheduling
In MP@ML applications, the buffers storing the incoming compressed bitstream,
reference pictures, and display pictures are usually together in the external DRAM to
reduce the number of DRAMs and the number of pins. Because many functional units,
during video decoding, need to read or write data through the same data bus to or from
external DRAM, there must be efficient bus scheduling to arrange timely data delivery for
the functional units. Sharing the data bus also implies that the processing model is an
important factor in the bus scheduling issue. In the macroblock-level processing model, the
amount of data exchanged between external DRAM and the functional units of the decoder
would be one macroblock (16x16 pels) each time. A long duration of data transfer means a
bus with high utilization but slow response, which can either cause an increase of internal
buffer size or leave hardware idle. Hence, under a given processing model, a careful
design for external memory access scheduling is essential to balance the needs of bus
utilization and bus response time.
A traditional and straightforward data bus arbiter is the fixed-priority scheme
[Faut94]. Given the algorithm characteristics of MPEG-2 at ITU-R and higher resolutions,
the data bus is heavily loaded, so this approach may starve functional units unless large
internal buffers are provided for I/O buffering.
Ling’s and Uramoto’s priority schemes are a similar approach, but they add a
centralized controller [Ling97, Uram95]. Here, different priorities are assigned to the five
requests for memory access, and distributed finite state machines (FSMs) are assigned to
control individual requests on a cycle-by-cycle basis. The centralized controller
synchronizes the entire architecture on a macroblock-by-macroblock basis. This is shown in
Figure 2.5.
[Figure 2.5 depicts a central controller that synchronizes and communicates with the distributed FSMs (a baseline FSM and an MC FSM) based on FIFO status, DRAM status, and display status.]

Figure 2.5 State diagram of data bus scheduling for distributed FSM scheme
A more sophisticated scheme to reduce the internal buffer requirement is a
combination of priority assignment and polling [Demu94]. As shown in Figure 2.6, five
requests for memory access are classified into three priority groups and a grant to use the
data bus is given corresponding to the priority group. If requests are issued by more than
one member of the first priority group, the grant is allocated by polling.
[Figure 2.6 depicts the polling scheme: reading from the display buffer, reading from the VBV buffer, and writing to the VBV buffer form the first priority group, polled in turn; when no first-rank request is pending and a request exists for writing the reconstructed picture, writing to the reference frame buffer is granted; when neither a first-rank request nor a write request is pending and a request exists for reading the reference frame, reading from the reference frame buffer is granted.]

Figure 2.6 State diagram of data bus scheduling for polling scheme
2.2.4 Variable-Length Decoder (VLD)
In an MPEG-2 bitstream, macroblock addressing, macroblock types, coded block
pattern, motion vectors, and DCT coefficients are all variable length codes. With the VLD,
the input bitstream is parsed and interpreted. In MP@ML applications, the maximum
throughput requirement for a VLC codec is as follows: 720 pixels/line x 480 lines/frame x
30 frames/sec, or 10,368,000 symbols (or pixels) per second. Each pixel comprises luminance
and chrominance values, which are decoded separately. Thus, for the 4:2:0 format, the
maximum throughput is 15,552,000 symbols per second. From the MPEG standard,
codeword lengths vary from 2 bits to 16 bits and the average codeword length is 2.85
bits/symbol [Voor74]. The equivalent average throughput for compressed input video data
is about 15,552,000 x 2.85, or 44 Mbits/sec. Although, in practice, the bit rate is
reduced by the DCT and quantization processes, the architecture design needs to consider the
worst case. Therefore, a VLC decoder should be designed for a maximum processing rate of
44 Mbits/sec.
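These throughput figures follow directly from the frame parameters; a quick check (using the 2.85 bits/symbol average cited from [Voor74]):

```python
# Worst-case VLD symbol and bit rates for MP@ML (4:2:0 format).
luma_rate = 720 * 480 * 30          # luminance symbols (pels) per second
total_rate = luma_rate * 3 // 2     # 4:2:0: chroma adds half as many symbols
avg_bits = 2.85                     # average codeword length [Voor74]

print(total_rate)                          # -> 15552000
print(round(total_rate * avg_bits / 1e6))  # -> 44 (Mbits/sec)
```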
There are two classes of VLC decoders: constant-input-rate decoders and constant-
output-rate decoders. The constant-input-rate decoder, such as the tree-search-based
decoder, processes input bits at a fixed rate, but codewords are decoded at a variable output
rate. There are two kinds of implementations on this constant-input-rate decoder. One
implementation is a sequential decoding process that can be considered as traversing down
a path of the Huffman tree from the root, the path determined by the encoded bitstream
being input [Chang92, Mukh91, Park93]. This implementation has a processing throughput
of about 40 Mbps. The other implementation is also a tree search, but traces multiple bits at
a time rather than one bit at a time. This implementation is based on dividing the Huffman
table into two parts, leading-0 bits and following bits [Hash94, Ooi94, Park99]. In other
words, a binary tree can be transformed into many clusters based on the number of leading-
0’s. Within a cluster, the following bits determine the offset of the decoded symbol. This
kind of implementation can achieve decoding rates of 162 Mbps.
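The cluster idea can be illustrated with a toy decoder (the table below is hypothetical, not an actual MPEG-2 VLC table; each codeword is modeled here as a run of leading zeros, a terminating one, and a fixed number of following bits that select the symbol within the cluster):

```python
# Toy leading-0 cluster decoder. CLUSTERS maps the number of leading zeros to
# (bits to read after the terminating '1', symbols of that cluster).
CLUSTERS = {
    0: (1, ["A", "B"]),            # codewords 10, 11
    1: (1, ["C", "D"]),            # codewords 010, 011
    2: (2, ["E", "F", "G", "H"]),  # codewords 00100 .. 00111
}

def decode(bits):
    out, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos + zeros] == "0":   # first step: count leading zeros
            zeros += 1
        width, symbols = CLUSTERS[zeros]
        pos += zeros + 1                  # skip the zeros and the '1' marker
        offset = int(bits[pos:pos + width], 2)  # second step: offset lookup
        out.append(symbols[offset])
        pos += width
    return out

print(decode("10" + "011" + "00111"))  # -> ['A', 'D', 'H']
```

The leading-zero count selects the cluster in one step, so several bits are consumed per iteration instead of one, which is the source of the speedup.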
The constant-output-rate VLD is a kind of lookup-table-based method that yields a
constant symbol-decoding rate. These codeword lookup tables are constructed at the
decoder from the symbol-to-codeword mapping table and can be implemented by
programmable logic arrays (PLAs), content-addressable memories (CAMs), read only
memories (ROMs), or random access memories (RAMs). Due to the trade-off between the
referencing speed and die size cost of the lookup table, many constant-output-rate VLD
constructions are based on the PLA implementation proposal from Lei and Sun, which uses
parallel operations to decode each codeword in one cycle regardless of its length [Lei91].
Figure 2.7 shows a block diagram of the Lei-Sun VLD that includes two data registers, a
barrel shifter, a set of VLC tables, and an adder. An incoming codeword is stored into the
upper and lower registers and then the decoder operates on these two registers
simultaneously. The barrel shifter is controlled by the adder, which accumulates the lengths
of the decoded codewords. At each cycle, the output of the barrel shifter is matched in
parallel with all the entries in the codeword table. When a match is found, the codeword
table outputs the corresponding source symbol and the length of the decoded codeword, and
then the barrel shifter is shifted to the beginning of the next codeword. If the adder
overflows, which indicates the upper register has been fully decoded, the content of the
lower register is transferred into the upper register. Then, the decoder loads new data into
the lower register, and operations continue. Lei and Sun’s VLD has a constant output rate of
52 million codewords per second, which equals an average processing throughput of 145
Mbps when average codeword length is 2.8 bits.
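The register-pair mechanism can be sketched in software (a simplified model: the toy prefix-free table is hypothetical, and the sequential dictionary scan stands in for the parallel match the PLA performs in one cycle):

```python
# Software sketch of the Lei-Sun double-register VLD scheme.
TABLE = {"0": "a", "10": "b", "110": "c", "111": "d"}  # toy prefix-free table
MAXLEN = max(len(code) for code in TABLE)

def vld(bitstream, width=16):
    stream = bitstream + "0" * (2 * width)       # padding for the final loads
    upper, lower = stream[:width], stream[width:2 * width]
    next_load = 2 * width
    symbols, shift, consumed = [], 0, 0          # 'shift' is the adder's sum
    while consumed < len(bitstream):
        view = (upper + lower)[shift:shift + MAXLEN]  # barrel shifter output
        for code, sym in TABLE.items():          # "parallel" codeword match
            if view.startswith(code):
                symbols.append(sym)
                shift += len(code)               # adder accumulates lengths
                consumed += len(code)
                break
        else:
            raise ValueError("no matching codeword")
        if shift >= width:                       # adder carry-out: refill
            upper, lower = lower, stream[next_load:next_load + width]
            next_load += width
            shift -= width
    return symbols

print(vld("0" + "10" + "111" + "0"))  # -> ['a', 'b', 'd', 'a']
```

Because the window always spans both registers, a codeword that straddles the register boundary is still matched in a single step, which is what allows one codeword per cycle regardless of length.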
[Figure 2.7 depicts the Lei-Sun VLD: data input is loaded into an upper and a lower register feeding a barrel shifter; the barrel shifter output is matched against the VLC tables, implemented as a codeword table (AND-plane), a decoded-symbol table (OR-plane), and a symbol-length table (OR-plane); the code length feeds an adder whose sum controls the barrel shifter and whose carry-out triggers a register load, while the decoded symbol forms the data output.]

Figure 2.7 Block diagram of the Lei-Sun VLD architecture
2.2.5 Inverse Discrete Cosine Transform (IDCT)
The discrete cosine transform (DCT) and the inverse discrete cosine transform
(IDCT) are transforms between the spatial domain and the frequency domain. Due to the
importance of spatial compression in digital image processing, many fast algorithms and
architectures have been proposed for their implementations. In this section, several fast
implementations of the IDCT algorithm for MPEG video decoders will be briefly reviewed.
Detailed comparisons and in-depth analysis can be found in the literature [Bhas96, Hung94,
Pirs95].
The 2-D IDCT applied to an 8x8 block is expressed as

x_{ij} = \frac{1}{4} \sum_{k=0}^{7} \sum_{l=0}^{7} c(k)\, c(l)\, y_{kl} \cos\!\left(\frac{(2i+1)k\pi}{16}\right) \cos\!\left(\frac{(2j+1)l\pi}{16}\right)    (2.1)

where x_{ij} denotes pixel data associated with spatial coordinates i and j (i, j = 0, 1, …, 7) in
the pixel domain, y_{kl} represents DCT coefficients with respect to coordinates k and l (k, l =
0, 1, …, 7) in the transform domain, and the normalization coefficients, c(k) and c(l), are
defined as

c(k),\, c(l) = \begin{cases} \dfrac{1}{\sqrt{2}}, & \text{if } k, l = 0 \\ 1, & \text{if } k, l \neq 0 \end{cases}

The 2-D IDCT transformation can also be expressed in vector-matrix form as

x_{ij} = T\, y_{kl}    (2.2)

where x_{ij} and y_{kl} are denoted as the pixel data and DCT coefficients, respectively, and T is a
64x64 transform matrix whose elements are the products of the cosine functions defined in
Eq. (2.1).
Various implementation algorithms and architectures have been proposed for speeding
up this process. These approaches can be categorized into two main groups. The first group
includes algorithms from polynomial transforms. For example, Vetterli’s and Duhamel’s
algorithms mapped the 2-D DCT into a 2-D DFT plus a number of rotations [Duha90,
Vett85]. Although polynomial transforms require fewer multiplications, they have irregular
structure and complex interconnection schemes among processing elements. Therefore,
these algorithms may be suitable for a software implementation on a general-purpose
processor.
The second group includes techniques from linear matrix analysis and decomposition.
Most of the fast DCT/IDCT implementation algorithms in this group exploit the properties
of the transformation matrix T of Eq. (2.2). Basically, the matrix T can be factorized so that
T = T1T2…Tk, where each of the matrices T1, T2,…,Tk is sparse, which means most of the
elements of the matrix are zero. Thus, the calculation of Eq. (2.2) can be performed in a
sequential manner, xij = T1T2…Tkykl; and then, due to the sparseness property, the number of
operations for performing the 2-D IDCT can be reduced. Another important property of the
2-D DCT and IDCT transforms is separability. From Eq. (2.1), the 2-D IDCT formulation
can be expressed as
x_{ij} = \sum_{k=0}^{7} \frac{c(k)}{2} \cos\!\left(\frac{(2i+1)k\pi}{16}\right) \sum_{l=0}^{7} \frac{c(l)}{2}\, y_{kl} \cos\!\left(\frac{(2j+1)l\pi}{16}\right)    (2.3)

Let

z_{kj} = \sum_{l=0}^{7} \frac{c(l)}{2}\, y_{kl} \cos\!\left(\frac{(2j+1)l\pi}{16}\right), \quad k = 0, 1, \ldots, 7    (2.4)
denote the output of the 1-D IDCTs from the rows of ykl . Eq. (2.3) and Eq. (2.4) imply that
the implementation of 2-D IDCT can be obtained by first performing 1-D IDCTs on the
rows of ykl followed by 1-D IDCTs on the columns of zkj. This kind of implementation is
also called the row-column decomposition approach. Table 2.1 shows the computational
complexity of some of the most commonly used row-column and direct 2-D IDCT
algorithms. From the standpoint of computational demands caused by multiplication and
addition, the direct 2-D approaches are faster than the row-column decomposition
approaches. On the other hand, the 2-D approaches suffer the problem of irregular data
addressing, which introduces additional overhead from address calculations and leads to a
difficult wiring problem for VLSI design. Therefore, most VLSI architecture
implementations, including that of this dissertation, adopt the row–column approach for
their IDCT unit design.
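The row-column decomposition can be checked against the direct form of Eq. (2.1) with a small pure-Python sketch:

```python
import math

N = 8

def c(k):
    # Normalization coefficient of Eq. (2.1): 1/sqrt(2) for index 0, else 1.
    return 1 / math.sqrt(2) if k == 0 else 1.0

def idct_1d(v):
    # 1-D 8-point IDCT of Eq. (2.4).
    return [sum(c(k) / 2 * v[k] * math.cos((2 * j + 1) * k * math.pi / 16)
                for k in range(N)) for j in range(N)]

def idct_2d_rowcol(y):
    # Row-column decomposition: 1-D IDCTs on the rows, then on the columns.
    z = [idct_1d(row) for row in y]
    cols = [idct_1d([z[k][j] for k in range(N)]) for j in range(N)]
    return [[cols[j][i] for j in range(N)] for i in range(N)]

def idct_2d_direct(y):
    # Direct evaluation of Eq. (2.1).
    return [[sum(c(k) * c(l) / 4 * y[k][l]
                 * math.cos((2 * i + 1) * k * math.pi / 16)
                 * math.cos((2 * j + 1) * l * math.pi / 16)
                 for k in range(N) for l in range(N))
             for j in range(N)] for i in range(N)]

# A DC-only block: y[0][0] = 8 should reconstruct to a flat block of 1.0.
y = [[0.0] * N for _ in range(N)]
y[0][0] = 8.0
x = idct_2d_rowcol(y)
print(round(x[0][0], 6), round(x[7][7], 6))  # -> 1.0 1.0
```

The two paths agree to floating-point precision, which is the separability property Eq. (2.3) and Eq. (2.4) express.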
Algorithm              Method        1-D Mult.  1-D Add.  2-D Mult.  2-D Add.  Total ops (8x8 IDCT)
Chen's [Chen77]        Row-column    16         26        256        416       672
B.G. Lee's [Lee84]     Row-column    12         29        192        464       656
Loeffler's [Loef89]    Row-column    11         29        176        464       640
Kamangar's [Kama82]    Direct 2-D    --         --        128        430       558
Cho's [Cho91]          Direct 2-D    --         --        96         466       562
Feig's [Feig92]        Direct 2-D    --         --        54         462       516

Table 2.1 Comparison of computational complexity of various IDCT algorithms for an 8x8 point block
2.2.6 Motion Compensator (MC)
For an intra macroblock, the prediction errors decoded from the IDCT unit directly
form a decoded macroblock. But, for an inter macroblock, a current decoded macroblock is
formed by adding the prediction errors and the previously decoded reference macroblocks.
Hence, the functionality of the motion compensator is to load reference macroblocks,
perform half-pixel interpolation according to the decoded header information, and then add
them to the prediction errors. The performance of a motion compensator depends mainly
on latency of loading reference macroblocks because the computations of the motion
compensator itself are very straightforward. Figure 2.8 shows an architecture design by
Yung-Pin Lee for a motion compensator [Lee95]. It includes some operations of adding and
shifting. The latency of loading reference macroblocks is in turn affected by the factors of
memory storage organization and memory access scheduling.
[Figure 2.8 depicts the motion compensator datapath: forward and backward reference pels pass through cascaded add/shift (AS) stages for horizontal (H) and vertical (V) half-pel averaging, are selected by multiplexers controlled by macroblock_motion_forward and macroblock_motion_backward, averaged together, added to the prediction errors, and finally saturated to form the output.]

Figure 2.8 Block diagram of the Lee Motion Compensator architecture
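The two arithmetic steps of the motion compensator can be sketched as follows (a one-dimensional model for illustration; MPEG-2 rounds the half-pel average as (a + b + 1) >> 1, and the final clamp corresponds to the saturation stage of Figure 2.8):

```python
def half_pel_horizontal(ref_row):
    """Average each pair of horizontally adjacent pels, rounding half up."""
    return [(a + b + 1) >> 1 for a, b in zip(ref_row, ref_row[1:])]

def reconstruct(prediction, errors):
    """Add the IDCT prediction errors and saturate to the 8-bit pel range."""
    return [max(0, min(255, p + e)) for p, e in zip(prediction, errors)]

ref = [10, 20, 30, 40, 50]
pred = half_pel_horizontal(ref)             # -> [15, 25, 35, 45]
print(reconstruct(pred, [3, -30, 250, 0]))  # -> [18, 0, 255, 45]
```

Vertical and two-dimensional half-pel cases follow the same pattern with the averaging applied across rows as well.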
2.3 Motivations and Challenge
Digitizing audiovisual information for multimedia transmission is an efficient way to
exploit the bandwidth of delivery systems and an easy way to preserve the quality of the
original audio and/or video material. The appearance of MPEG provides a standard,
sophisticated source coding scheme for reliably transferring audiovisual information
between various multimedia applications. Inevitably, for both encoders and decoders, this
source coding scheme incurs a complicated hardware system design and increases the cost
of manufacture. When introducing new multimedia products or services, especially
consumer electronics, price is one of the most important factors. From a VLSI viewpoint,
reducing costs essentially relies on reducing system chip size and lowering system power
consumption. Reducing chip size is a direct way to decrease cost of manufacture, and
lowering power consumption makes a system more stable and safe without involving extra
design effort or cooling devices. Although much has been proposed in these two areas, most
of it has the following limitations:
1. For reducing chip size, most existing work only focuses on simplifying and refining the
algorithms applied to an individual functional unit. It does not provide system analyses
or evaluation models to determine applicability to whole video decoder design.
2. Most existing research does not present a bus architecture design that accounts for data
transfer traffic, nor analyses of bus arbitration strategies and scheduling algorithms.
3. Most existing processing models for decoders adopt the macroblock-level pipeline
scheme. The naturally long duration for data-transfer under this scheme results in the
need for a wide data bus, which makes the 64-bit width common. The wide bus design
not only increases chip size but also increases power consumption.
Developing a new processing model for decoding, an efficient memory storage
organization, and reliable memory access scheduling are the main challenges taken on by
this dissertation. However, for most multimedia rendering applications, real-time decoding
is all-important. Hence, when an optimized processing model and an architecture design are
proposed to lower processing rate or reduce chip size and thereby decrease manufacturing
cost, the fundamental requirements of real-time decoding must still be provided for. In fact,
the current proposal provides for real-time functionality with ease.
2.4 Research Direction
Based on the objectives stated in Section 1.6, the current effort is to develop a cost-
efficient processing model and optimal architecture model for MPEG-2 MP@ML real-time
video decoding applications. At the same time, the limitations of many current techniques
need to be overcome, as follows:
1. This dissertation provides a complete and precise analysis paradigm for MPEG-2
application design and sound reasoning for verification. The analysis paradigm contains
a processing model, process management, resource management, and optimal
architecture, as described in Section 1.5.
2. The processing model is a high-level decoding process controller and fully conforms to
the MPEG-2 standard. Therefore, it can not only lower the video decoder system
requirements, such as bus width and buffer size associated with functional units, but
can also be applied to the various existing architecture designs of functional units in
order to lower the design time and cost.
3. By taking advantage of all the unique features of SDRAM, the proposal provides an
efficient memory storage organization for reference pictures, which can reduce the
probability of occurrence of page-breaks when accessing the reference pictures, thereby
increasing the data transfer rate.
By following the objectives and skirting the limitations cited above, a strategic
direction has been chosen for developing an efficient processing model, an efficient
memory storage organization, and an efficient bus scheduling approach for MPEG-2
MP@ML applications. This strategy is called the Block-Level-Pipeline (BLP) processing
scheme [Ling98, Wang99a], since it is based on the block layer that is defined as the lowest
partition unit in the hierarchical syntax of MPEG-2 video bitstreams. A complete
description of the framework for and techniques of the BLP scheme is given in Chapter 3.
The corresponding simulations have been run to verify decoding performance under
different processing models, different buffer sizes for functional units, and different
memory storage organizations. These simulation results are also detailed in the next
chapter.
CHAPTER THREE
Block Level Pipeline Scheme for MPEG-2 MP@ML Video Decoding — Processing, Storage, and Scheduling
3.1 Introduction
In this chapter, the new video decoding model, a scheme named Block Level Pipeline
(BLP), is formally introduced. First, Section 3.2 introduces two key design factors, the
video decoding model and the external memory interface, that affect the performance and cost
of a video decoder. Also introduced is a calculated value, data bus cycle utilization, for
measuring design efficiency. In Section 3.3, a semantic description and analysis of the
decoding process with BLP is presented. We also compare the difference between the
macroblock-level pipeline scheme and the BLP scheme. For implementing BLP, we need
not only efficient memory storage organization and a data locative profile, which are
explained in Section 3.4, but also a simple and effective bus scheduling scheme for managing
functional units that access data from external memory, which is illustrated and proved in
Sections 3.5.1 and 3.5.2. In Section 3.5.3, the discussion turns to how the BLP scheme can
lower the bus width and size requirement of functional internal buffers in a video decoder.
3.2 Designing for Data Transfer Efficiency
For real-time playback, the decoding process of each macroblock (MB) should be done
within a specific time. This upper bound, which is measured in cycles, can be calculated as
follows:
\text{upper bound for decoding one MB in cycles} = \frac{\text{clock rate of video decoder}}{(\text{no. of macroblocks in a frame}) \times (\text{display rate})}    (3.1)
Under this upper bound limitation of decoding one MB in real-time while keeping an
optimized video-decoder architecture, one of the key measurements is the decoding cycle
utilization. An optimized video decoding hardware solution is one where every dedicated,
hardwired functional unit, as well as the external memory interface, can utilize 100% of the
decoding cycles during each decoding of a macroblock.
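For MP@ML, Eq. (3.1) gives a concrete budget (the 54 MHz decoder clock below is an illustrative assumption, not a figure from the text):

```python
# Per-macroblock cycle budget from Eq. (3.1) for MP@ML real-time decoding.
mb_per_frame = (720 // 16) * (480 // 16)   # 45 x 30 = 1350 macroblocks
display_rate = 30                          # frames per second
clock_hz = 54_000_000                      # assumed decoder clock rate

upper_bound = clock_hz / (mb_per_frame * display_rate)
print(int(upper_bound))  # -> 1333 cycles available to decode one macroblock
```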
For functional unit design, the desired data processing rate should be concerned not
only with the time for data processing itself, but also with the arrival delay
latency of related data, which results from waiting for data transferred from/to external
memory or waiting for results from other functional units. Otherwise, the logic
design must increase the data processing rate of each functional unit in order to compensate
for data transfer latency and still meet the time requirement for real-time decoding. This
increase usually results in the need for a complicated and large-size architecture design for
functional units. But the proposed block-level processing model can minimize the waiting
period while a functional unit is executing. Shorter waiting periods relax the
performance constraints on the functional units and allow every functional unit to achieve high
decoding cycle utilization with a simple hardware module design that still meets the real-
time decoding requirement.
An optimized external memory interface design which includes the factors of data bus
width, data storage organization in DRAM, and bus access scheduling must make sure the
data can be delivered to the right places on time in order to guarantee the performance of
each functional unit. This capability can be measured with a calculated quantity called the
decoding cycle utilization, also called the data bus cycle utilization.
From a bus system viewpoint, this calculated value can be used to evaluate the overall
efficiency of the memory interface design. This data bus cycle utilization, Ubus, can be
estimated by the following equation,
U_{bus} = \frac{1}{C_{MB}} \sum_{i=1}^{n} \left( \frac{M_i}{R_{bus}} + C_{init} + B_i\, C_{break} \right) \leq 1    (3.2)
where
• n denotes the number of tasks which need to access external DRAM for transferring
data
• CMB denotes the decoding cycles of each macroblock under the real-time decoding
limitation
• Mi denotes the amount of transferring data (in bytes) during each memory access
• Rbus denotes the transaction rate of a data bus (bytes/cycle), depending on the width
of the bus
• Cinit denotes the initial setup cycles at the beginning of each memory access
• Bi denotes the number of occurring page-breaks during each memory access
• Cbreak denotes the data access delay (in cycles) due to a page-break
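Eq. (3.2) can be evaluated directly (the task sizes, setup cycles, and page-break costs below are illustrative values, not measurements from the text):

```python
# Direct evaluation of the data bus cycle utilization of Eq. (3.2).
def bus_utilization(tasks, c_mb, r_bus, c_init, c_break):
    """tasks: list of (M_i bytes transferred, B_i page-breaks) per I/O task."""
    cycles = sum(m / r_bus + c_init + b * c_break for m, b in tasks)
    return cycles / c_mb   # must be <= 1 for real-time decoding

# Three I/O tasks on a 32-bit bus (4 bytes/cycle) with a 1333-cycle MB budget.
u = bus_utilization([(384, 2), (384, 1), (64, 0)],
                    c_mb=1333, r_bus=4, c_init=6, c_break=4)
print(round(u, 3))  # -> 0.179, well within the budget for these figures
```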
Obviously, the total number of memory accessing cycles for all I/O tasks must be equal to
or less than the upper bound for decoding one macroblock in real-time. And, if the total
number of memory accessing cycles is closer to this upper bound, the memory interface
design is closer to an optimum condition because it means the design doesn’t waste system
resources such as data bus width and video decoding clock rate. Also from this equation, it
is clear that wider data bus width can reduce data transfer delay, and better data storage
organization can minimize page-breaks. These two factors play an important and explicit
role in memory interface design. However, the bus access scheduling scheme also plays an
important role, but an implicit one. A better bus access scheduling scheme should deliver
data from memory before a functional unit needs it. Thus, this scheme can increase the data
bus cycle utilization, and there is no need for a wider data bus. On the other hand, if a bus
access scheduling scheme is designed to deliver data after a functional unit requests it, the
waiting period will increase before a functional unit can process data, necessitating a wider
data bus design to reduce this waiting period; data bus cycle utilization will then decrease.
Efficient data storage structures for all I/O tasks can minimize unnecessary page-
breaks, allowing all data transactions to be easily finished within the time limit for
decoding a macroblock in real-time. To make a suitable data bus width selection, a balance
between bus cycle utilization and bus response time must be achieved. A wider data bus can
provide a quick bus response for data delivery, which means decreasing bus cycle
utilization (but will also consume more power). On the other hand, a narrower data bus can
keep a data bus busy all the time, which means increased bus cycle utilization, but may
slow the bus response time, requiring an increase in internal buffer size to prevent
functional units from often being idle. As described in Section 1.4, many internal buffers
are included in a video decoder for buffering the processed data between functional units
in order to prevent the functional units from starving. However, large internal buffer
memories not only increase the chip size, but also consume more power. From the above
analysis, the design goal is to build a new video decoding model, new data storage
organization, and new external memory bus access scheduling in order to lower the
requirements for both data bus width and internal buffer size.
3.3 The BLP Processing Model
3.3.1 Semantics of the BLP Processing Model
The decoding semantics of BLP are based on the block layer syntax of the MPEG
standard. BLP processes data in each video-decoding functional unit one block at a time.
The detailed definitions for data partition in the MPEG video stream and video-decoding
functional units have been presented in Section 1.3. Figure 3.1 illustrates the generic
decoding timing diagram for memory bus activities and functional units in a video decoder
under the BLP scheme and the proposed memory bus access scheduling scheme. In this
figure, we can see how the BLP scheme applies to each functional unit.
1. In BLP, the lowest level of control of the video decoder is done by the block decoding
sequence in the variable length decoding (VLD) unit. According to the MPEG-2 video
stream syntax definition, each coded block will end with an EOB (end of block) VLC
symbol. Therefore, after VLD decodes the EOB symbol, it can directly or indirectly
(through the system controller) inform the inverse discrete cosine transform (IDCT) and
motion compensation (MC) units to proceed with processing this block. The VLD and
inverse zigzag and inverse quantization (IZZ/IQ) units do not decode the next coded
block until IDCT and MC finish their tasks.
[Figure 3.1 depicts, on a common time line, the DRAM access activity (compressed bitstream written to the VBV buffer, VLD buffer reading from the VBV buffer, display buffer reading data, macroblock header information decoding, calculating motion vectors and generating addresses of reference blocks, and the alternating reference-block reads R0–R5 and decoded-block writes W0–W5) together with the block-by-block operation of the MV decoder, VLD, IZZ/IQ, IDCT, and MC units on blocks 0–5 of MBn and the start of MBn+1. Rn denotes reading reference blocks for MC of block n; Wn denotes writing decoded block n to the frame buffer; the processing time of each block in a functional unit depends on the amount of coded data, algorithms, and architecture design.]

Figure 3.1 Generic timing diagram for decoding non-intra macroblocks
under the BLP scheme and the proposed bus scheduling scheme

2. A symbol decoded from the VLD unit is scanned based on the zigzag or alternate
scanning order and is represented by run and level, where level denotes a nonzero value
and run indicates the number of successive zero entries preceding this nonzero value.
These nonzero values are also called quantized DCT coefficients. These coefficients
then enter the inverse quantizer, where a quantized DCT coefficient is multiplied by the
quantizer step size that is the product of a quantizer scale and a weighting matrix
element. The weighting matrix can be accessed in an inverse scanning order. Therefore,
the processes of VLD and IZZ/IQ can be pipelined. In a nutshell, at the first stage a
symbol’s run and level are decoded from the VLD unit, and then, at the second stage,
the level is multiplied by the quantizer scale and the corresponding weighting matrix
element that can be found according to the value of run.
3. The decoding of header information attached to the next macroblock is performed
during IDCT and/or MC unit processing of the last coded block of the currently
decoding macroblock. Obviously, this scheme can raise the efficiency of the pipeline
decoding circuits because the VLD unit can analyze the next header information while
the pipelined IDCT and MC units are decoding the block data. Compared with the
conventional scheme of decoding the header along with the associated coded video data
of a macroblock, at least 7% of the operating cycles can be saved. This is especially
true when a coded bitstream has a lot of stuffing bits inside.
4. The MPEG-2 specification clearly defines that data processing in the IDCT unit be done
one block at a time. Thus, after the processing of the current block, the IDCT unit will
be in an idle state until the VLD unit or system controller informs it to process the next
block.
5. The main task of the motion compensation unit is to average each pair of adjacent pels
horizontally and/or vertically within each reference macroblock if the motion vector is
specified to an accuracy of one half sample, and to add the prediction errors decoded
from the IDCT unit to the reference macroblocks. Although a motion vector points
to a 16x16 macroblock, we can easily locate the position of each block in memory
by developing a special addressing formula to make the averaging process on a block-
by-block basis. On the other hand, the adding process is simply done block by block
because the prediction error output from the IDCT unit is in block mode.
6. A combination bus scheduler that consists of time-line scheduling and fixed-priority
scheduling is adopted to cooperate with the BLP decoding scheme. The above analysis
provides an in-depth knowledge of the data flow in a video decoder and the target
functions of each functional unit. From this, the bus accessing order can be scheduled
for each functional unit and the required data transfer duration can be allocated in
advance. This bus scheduling approach will create a bus system that acts as a dedicated
channel to a functional unit at a specific time. There is also detailed information about
the data location accessed by the functional units. Hence, on the memory bus accessing
order, a pair of I/O tasks can be arranged whose data may be in the same memory page,
in order to eliminate initial addressing setup cycles for the second one, such as when
reading reference blocks for motion compensation of block 1 and block 2. This
arrangement can save about 5% of the operation cycles. Under these sophisticated
designs, bus scheduling can improve data bus cycle utilization for a video decoder
system. The detailed description of the proposed bus scheduling scheme and its effect
on the different internal buffer sizes will be discussed in Section 3.5.3.
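The run and level handling described above (a VLD stage followed by an IZZ/IQ stage) can be sketched as a two-stage pipeline (a toy 4x4 scan and a flat weighting matrix are assumed for brevity; MPEG-2 uses 8x8 blocks and additionally applies saturation and mismatch control, omitted here):

```python
# Two-stage sketch of the VLD -> IZZ/IQ pipeline: stage 1 yields (run, level)
# pairs; stage 2 places each level in scan order and rescales it.
ZIGZAG = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2),
          (2, 1), (3, 0), (3, 1), (2, 2), (1, 3), (2, 3), (3, 2), (3, 3)]
W = [[16] * 4 for _ in range(4)]          # flat toy weighting matrix

def izz_iq(run_level_pairs, quant_scale):
    block = [[0] * 4 for _ in range(4)]
    pos = 0
    for run, level in run_level_pairs:    # run = zeros preceding this level
        pos += run                        # skip the zero entries
        r, c = ZIGZAG[pos]                # inverse zigzag placement
        block[r][c] = level * quant_scale * W[r][c] // 16
        pos += 1
    return block

out = izz_iq([(0, 5), (2, -3), (0, 1)], quant_scale=4)
print(out[0][0], out[2][0], out[1][1])    # -> 20 -12 4
```

Because stage 2 needs only the current (run, level) pair and a running scan position, it can start as soon as the VLD emits each symbol, which is the pipelining noted in step 2.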
From the above analysis, we know that every functional unit in a video decoder is suitable
for the BLP scheme if we supply an addressing formula for an MC unit in order to easily
access reference data block by block. This addressing formula has to be straightforward and
simple without incurring extra computational effort, which will be discussed in the next
section.
3.3.2 Comparison with the Macroblock Level Processing Model
Figure 3.2 shows the generic decoding timing diagram for memory bus and functional
unit activity in a video decoder under the macroblock-level-pipeline decoding scheme and
the fixed-priority memory bus access scheduling scheme. Figure 3.2 (a) illustrates the
conventional MB-level-pipeline scheme [Fern96, Iwata97, Toyo94, Yasu97] where all
functional units and I/O tasks operate macroblock by macroblock through one slice of a
picture or through a whole picture. This figure shows that the conventional decoding
scheme allows all functional units and I/O tasks to almost fully utilize the decoding clock
cycles. However, the biggest disadvantage of this decoding scheme is that the size of each
internal buffer is huge. For example, a size of at least four macroblocks (1536 bytes) is
needed for the MC buffer in order to store reference data, and a two-macroblock size (768
bytes) is needed for storing results from the IQ process, which are waiting for IDCT unit
processing. Compared to the BLP scheme and the proposed bus scheduling scheme, only a
four-block size (about 298 bytes) is needed for the MC buffer, and only a two-block size
(128 bytes) is needed for the IQ process buffer.
[Figure 3.2 depicts two timing diagrams on a common time line, each showing the DRAM access activity (reading reference macroblocks for MBn, writing back decoded data of MBn, compressed bitstream written to the VBV buffer, VLD buffer and display buffer reads) together with the macroblock-by-macroblock operation of the microprocessor, VLD, IZZ/IQ, IDCT, and MC units: (a) a conventional macroblock-level pipeline decoding scheme, in which successive macroblocks MBn through MBn+3 overlap across the units; and (b) an amended macroblock-level pipeline decoding scheme, in which the units advance from MBn to MBn+1 in lockstep. The processing time of each macroblock in a functional unit depends on the amount of coded data, algorithms, and architecture design.]

Figure 3.2 Generic timing diagram for decoding non-intra macroblocks under
MB-level pipeline scheme and fixed-priority bus scheduling scheme
47
An amended macroblock-level-pipeline decoding scheme [Lin96] is illustrated in
Figure 3.2 (b). In this decoding scheme, all functional units and I/O tasks still operate on a
macroblock basis, but they must wait for other functional units to finish their tasks before
beginning to decode a new macroblock. This scheme can minimize the problem of huge
internal buffer size in the conventional decoding scheme. The IQ process buffer needs at least 192 bytes of storage space (half-macroblock size) and the MC buffer needs about 800 bytes (two-macroblock size); these are still larger than under the BLP decoding scheme. From the viewpoint of data bus cycle utilization, this amended design has lower bus utilization than either the conventional decoding scheme or the BLP scheme. This results from the 64-bit data bus width commonly adopted in the macroblock-level scheme. In the macroblock-level scheme, the amount of transferred data is large because I/O transactions stem from requests by functional units for entire macroblocks, and these I/O requests may occur at the same time under the fixed-priority bus access scheme. Therefore, the macroblock-level scheme needs a wider data bus to relieve this traffic jam. Under the BLP scheme, however, the I/O transfers are only blocks, and the transactions are scheduled in a specific order under the proposed bus access scheme, which evenly distributes the peak bandwidth and eliminates traffic jam conditions. Hence, the BLP scheme needs only a narrow data bus.
For an I, P, or B picture, Figure 3.3 shows data bus cycle utilizations for each
macroblock. Figure 3.3 (a) shows I-, P-, and B-picture data bus cycle utilization of a video
decoder adopting 32-bit width of data bus under the BLP scheme and the proposed bus
accessing scheme while Figure 3.3 (b) shows data bus cycle utilization with 64-bit width of
data bus under the amended macroblock-level scheme and fixed-priority bus accessing
scheme. The B-frames in the test bitstream, mobile.m2v, consist mainly of intensive, bi-
directionally predicted macroblocks, and hence lay stress on the bus bandwidth test. Under
the BLP scheme, the average data bus cycle utilization is 0.86 for a B-frame, 0.71 for a P-
frame, and 0.54 for an I-frame. All values are higher than those under the amended
macroblock-level scheme, which are 0.59 for a B-frame, 0.47 for a P-frame, and 0.35 for an
I-frame. Basically, the data bus utilization during decoding of B-pictures is very high
because loading two reference macroblocks is often needed during the motion compensation
process. Data bus utilization during decoding of P-pictures is intermediate among the three
types of picture decoding because only one reference macroblock is needed for the motion
compensation process. Data bus utilization during decoding of I-pictures is lowest because
there is no motion compensation process in I-picture decoding. In conclusion, the BLP
scheme can more efficiently utilize the resource of a video decoder than macroblock-level
schemes.
[Plots of per-macroblock data bus utilization for mobile.m2v (15 Mbps):]
(a) 32-bit-wide data bus under the BLP scheme and the proposed bus scheduling scheme: I picture (u = 0.54), P picture (u = 0.71), B picture (u = 0.86)
(b) 64-bit-wide data bus under the amended macroblock-level scheme and the fixed-priority bus scheduling scheme: I picture (u = 0.35), P picture (u = 0.47), B picture (u = 0.59)
Figure 3.3 Data bus utilization comparison for the BLP scheme and the amended
macroblock-level scheme
3.4 Memory Storage Organization
3.4.1 Data Storing Profile
In conventional design of MPEG-2 decoders, external DRAM space is mapped into
various regions:
• VBV buffer: a small region that stores the incoming compressed bitstream cyclically. The size of the VBV buffer is specified as 1.75 Mbits at the Main Level.
• Frame buffer: a two-picture memory space reserved to accommodate decoded reference pictures (I and P pictures), which will be accessed for motion compensation. These reference pictures can be stored in frame mode or field mode. The two-picture space should be about 8 Mbits for 720x480 resolution and 4:2:0 sampling.
• Display memory: a one-picture memory space (4 Mbits) reserved for decoupling decoding from display, converting between frame and field format, and frame rate conversion. Usually this display memory is combined with the frame buffer region to make a three-picture space.
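The sizes quoted in the list above can be checked with simple arithmetic. The following sketch (constants chosen to match the Main Level figures cited here) shows that three pictures plus the VBV buffer fit comfortably in 16 Mbits:

```python
WIDTH, HEIGHT = 720, 480
BYTES_PER_PEL = 1.5          # 4:2:0 sampling: 1 luma byte + 0.5 chroma byte per pel

picture_bits = WIDTH * HEIGHT * BYTES_PER_PEL * 8   # about 4.1 Mbits per picture
frame_buffer_bits = 2 * picture_bits                # two reference pictures
display_bits = picture_bits                         # one more picture for display
vbv_bits = 1.75e6                                   # Main Level VBV buffer

total_mbits = (frame_buffer_bits + display_bits + vbv_bits) / 1e6
print(f"picture: {picture_bits/1e6:.1f} Mbits, total: {total_mbits:.1f} Mbits")
```

Running this gives about 14.2 Mbits in total, which is why a single 16-Mbit SDRAM device suffices.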
Thus, 16 Mbits of total DRAM size should be enough. The memory interface design is
significantly influenced by the performance and capacity of a specific external memory
type. Hence, selection of DRAM must take into consideration a high data access rate and a
high storage capacity, while at the same time using only a few memory devices to ensure
low cost design. These demands can be met by 16 Mbit synchronous DRAM (SDRAM).
3.4.2 Features of SDRAM
SDRAM can latch onto I/O information from a processor under the control of the
system clock. The processor can be told how many clock cycles it takes for the SDRAM to
complete its task, so the processor can safely go off and do other tasks while the SDRAM is
processing its requests. Therefore, the performance of the whole system can be improved.
Besides that, SDRAM also offers substantial advances over DRAM operating performance,
including the ability to synchronously burst data at a high data rate with automatic column-
address generation, the ability to interleave between internal banks in order to hide
precharge time, and the capability to randomly change column address on each clock cycle
during a burst access.
(a) Read cycle timing in fast-page mode of EDO DRAM (access time = 60 ns). Key parameters: tRASP, RAS# pulse width (60 ns to 125,000 ns); tCSH, CAS# hold time (min 45 ns); tPC, EDO-page-mode READ or WRITE cycle time (min 25 ns); tRCD, RAS# to CAS# delay time (min 14 ns); tCP, CAS# precharge time (min 10 ns); tRAD, RAS# to column-address delay time (min 12 ns); tCAC, access time from CAS# (max 15 ns); tAA, access time from column address (max 30 ns); tRAC, access time from RAS# (max 60 ns).
(b) Read cycle timing of two-bank operation in an SDRAM, with ACTIVE, READ, and PRECHARGE commands alternating between banks A and B. Key parameters: tRCD, ACTIVE to READ or WRITE delay (min 30 ns); tRAS, ACTIVE to PRECHARGE command period (60 ns to 120,000 ns); tRP, PRECHARGE command period (min 30 ns); tRC, ACTIVE to ACTIVE command period (min 96 ns).
Figure 3.4 Comparison of reading cycles for EDO DRAM and SDRAM (source: adapted from Micron Technology)
Figure 3.4 illustrates that, in SDRAM, all of the data in one row is accessible after a
single column address changes; on the other hand, in conventional DRAM, the column
address must be changed every time data is output. In SDRAM architecture, a READ command can be initiated on any clock cycle following a previous READ command; hence, full-speed random read access within a page can be performed. With alternating bank access, a subsequent ACTIVE command to the other bank can be issued while the first bank is being accessed, reducing precharge overhead and providing seamless high-speed random access operation. Therefore, it is an important goal in video decoder design
to develop an efficient data storing arrangement in SDRAM for practical use of these
SDRAM features in order to accelerate data access.
3.4.3 Data Storage Organization in SDRAM
While we discuss the data storing arrangement in SDRAM, there are two kinds of data
access overhead to be considered: redundant data transferring overhead and page-break
overhead.
First, the redundant data overhead is more serious with a wider memory data bus. If the memory data word is 64 bits long, meaning each column (or data word) stores 8 pels, the memory bus often transfers 24 pels of data for each macroblock row while reading reference macroblock data, as shown in Figure 3.5. Eight of the 24 retrieved pels are redundant, and this redundant data wastes about 30% of memory bandwidth. With a 32-bit memory bus, this overhead can be reduced to only about 15% of memory bandwidth. From the discussion in Section 3.3, we know that one of the key
advantages of the BLP scheme is to spread out the peak memory bandwidth requirement.
Therefore, we can use a 32-bit memory bus to reduce the redundant data overhead.
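The 30% and 15% figures can be reproduced by averaging over the possible alignments of a 16-pel macroblock row within the data words. This is a back-of-envelope sketch; the uniform-offset assumption is mine, not stated in the text:

```python
def avg_redundancy(word_pels, row_pels=16):
    """Average fraction of redundant pels fetched per macroblock row,
    assuming the row's start offset within a data word is uniform."""
    total_fetched = 0
    for off in range(word_pels):
        words = -(-(row_pels + off) // word_pels)   # ceiling division
        total_fetched += words * word_pels
    avg_fetched = total_fetched / word_pels
    return (avg_fetched - row_pels) / avg_fetched

print(f"64-bit word (8 pels): {avg_redundancy(8):.0%}")   # about 30%
print(f"32-bit word (4 pels): {avg_redundancy(4):.0%}")   # about 16%
```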
Second, when succeeding data is on a different row of the same bank, DRAM will
suffer a page-break latency to close the current row and then precharge the other row for
subsequent data. A page-break latency will result in an average six-clock cycle delay
(estimated in a 27 MHz time domain). However, SDRAM provides multiple-bank
architecture to hide this precharge latency. Hence, we can minimize the page-break
overhead if we fully take advantage of multiple-bank architecture for organizing the stored
data. This storing organization needs to be capable of avoiding unnecessary page-breaks
and easily determining the data location for retrieval.
(a) 64-bit data-word SDRAM: for the desired reference macroblock (spanning MBm, MBm+1, MBn, MBn+1), the retrieved data contains redundant pels alongside the desired data at the word boundaries.
(b) 32-bit data-word SDRAM: the narrower data word shrinks the redundant portion of the retrieved data.
Figure 3.5 Reference macroblock storage configuration in 64-bit and 32-bit data-word SDRAM and corresponding redundant data overhead
Figure 3.1 clearly shows that there are five decoding processes having I/O
transactions with external DRAM during decoding of one macroblock. They are compressed
bitstream writing (bits_write) to VBV buffer, VLD FIFO reading (vld_read) from VBV
buffer, MC reference data reading (mc_read) from frame buffer, MC reconstructed data
writing (mc_write) to frame buffer, and display buffer reading (display_read) from frame
buffer. Each of these processes has been depicted in Section 1.4 in detail. The bandwidth
requirement and characteristics of each memory I/O transaction in the MPEG MP@ML
video decoder are listed in Table 3.1. The processes of bits_write and vld_read actually
access the same region in SDRAM, the VBV buffer, while the other three processes
(mc_read, mc_write, and display_read) access another region, the frame buffer. Therefore,
the analysis of data storage organization can be narrowed down to two categories.
Type of memory I/O process           Access pattern   Bandwidth (bytes/s)   Notes
Compressed bitstream write           Stochastic       1.875 M               Upper bound of the input bitrate
Compressed bitstream read            Stochastic       1.875 M               Actual bitrate must avoid VBV buffer overflow and underflow
Reference data read for MC process   Deterministic    36.5 M                Worst case in half-pel prediction
Decoded data write                   Deterministic    15.5 M                Constant value for write-back of one decoded macroblock
Display data read                    Deterministic    15.5 M                Constant value to meet the 30 frames/s display rate
Table 3.1 Characteristics of I/O processes on the memory bus
For convenience and simplification of the discussion, dual bank SDRAM architecture
can be adopted to illustrate the proposed data organization structure, where each bank
includes 1024 rows and each row has 256 columns (or data words) and each column is 32-
bit width. The same structure can be easily applied to quad-bank SDRAM architecture.
First, 1.7 Mbits of memory space can be allocated for the VBV buffer, which lies across the
two banks and uses circular addressing as shown in Figure 3.6. Under the proposed bus
access scheduling scheme, a sufficient time period is allocated during each macroblock decoding to empty the compressed bitstream FIFO and refill the VLD buffer, and the two I/O transactions execute successively. Hence, the amount of data moved by the bits_write and vld_read processes is small each time, so the data accessed by the two processes should lie on the same memory page; only one row-activation latency is then needed to process these two I/O transactions.
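The cyclic addressing of the VBV region can be sketched as two wrapping pointers. The class and method names are illustrative, not from the thesis; only the 1.75-Mbit size and the cyclic write/read behavior come from the text:

```python
class CircularVBV:
    """Cyclic addressing over a fixed VBV region: bits_write advances the
    write pointer, vld_read advances the read pointer, both modulo the
    buffer size."""
    def __init__(self, size_bytes):
        self.size = size_bytes
        self.wr = 0   # next byte address for the bitstream FIFO write
        self.rd = 0   # next byte address for the VLD buffer read

    def level(self):
        # Current fill level; must stay within (0, size) to avoid
        # VBV underflow and overflow.
        return (self.wr - self.rd) % self.size

    def write(self, nbytes):
        self.wr = (self.wr + nbytes) % self.size

    def read(self, nbytes):
        self.rd = (self.rd + nbytes) % self.size

vbv = CircularVBV(1_750_000 // 8)   # 1.75 Mbits, as specified at Main Level
vbv.write(4096)
vbv.read(1024)
```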
[The VBV buffer occupies 109 pages in each of bank 0 and bank 1; each bank has 1024 pages (rows) of 256 columns (data words), 32 bits per data word. Bitstream FIFO writes and VLD buffer reads proceed one row at a time.]
Figure 3.6 VBV buffer data storing configuration and accessing pattern in SDRAM
In the frame buffer, the data is read to the display buffer in scan-line style while the
processes, mc_read and mc_write, access data in macroblock style. From Table 3.1, the
bandwidth for the motion compensation process, including reading reference data and
writing decoded data, consumes about 60% of memory bandwidth. Therefore, in the frame
buffer, the data storage scheme should be considered from the viewpoint of the motion
compensation unit.
Depending on the motion vector, a 16x16 or 17x17 pel region (called the reference macroblock) in a reference picture has to be read for motion compensation. This reference macroblock often overlaps four different macroblocks, which may lie in different memory pages, as shown in Figure 3.7.
[A reference macroblock, located by the motion vector, overlaps macroblocks MBn-1, MBn, MBn+1 and MBm-1, MBm, MBm+1; the overlapped reference MBs may lie in different banks or different pages, so page-breaks can occur.]
Figure 3.7 Reference macroblock access for motion compensation
[Memory map: for a 720-pel/line (45-MB) by 240-line (30-MB) field, the top and bottom fields of the two reference pictures (45 x 30 field-MBs each; one field-MB is 16 x 8 pels) are stored, together with the VBV buffer, across bank 0 and bank 1, with successive macroblock-rows alternating between the two banks.]
Figure 3.8 Interlaced macroblock-row memory mapping for the frame buffer
In this macroblock-type memory access, there is a high probability of suffering page-
break if the data is stored in scan-line structure. To relieve this page-break penalty from
macroblock-type memory access, an interlaced macroblock-row storage structure is derived
for the frame buffer. In this structure, even and odd macroblock-rows of a picture are
stored into different banks and pel data from each macroblock is stored in scan-line
structure (every 4 pels of data are stored in a data word), as shown in Figure 3.8.
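In code, the interlaced macroblock-row structure reduces to two small mapping rules. This is a sketch; the function names are mine, while the even/odd bank rule and the 4-pels-per-word scan-line layout follow the text and Figure 3.10 (a):

```python
def bank_of_mb_row(mb_row):
    # Interlaced macroblock-row storage: even MB-rows in bank 0, odd in
    # bank 1, so vertically adjacent macroblocks always sit in different banks.
    return mb_row % 2

def column_within_mb(pel_x, pel_y):
    # Inside a macroblock, pels are stored in scan-line order, 4 pels per
    # 32-bit data word: relative column addresses 0..63 (cf. Figure 3.10 (a)).
    return (pel_y % 16) * 4 + (pel_x % 16) // 4
```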
[The reference macroblock overlaps MBn and MBn+1 in one bank and MBm and MBm+1 in the other; page breaks divide it into blocks 0-3, whose parts are labeled a0-a3, b0-b1, c0-c1, and d0.]
Figure 3.9 Reference macroblock access pattern under the interlaced macroblock-row storage structure
In the proposed storage organization, a maximum of two page-breaks is necessary for reading one reference macroblock. As shown in Figure 3.9, macroblocks MBn and MBn+1 are both located in one bank while MBm and MBm+1 are located in the other bank. As described in Section 3.3, under the BLP scheme the reference data is read in a block-by-block pattern for the motion compensation process. Therefore, during the reading of block 0, the a0 part of block 0 is read first; at the same time, the other bank is precharged. Thus,
after reading a0, a1 can be immediately read without a page-break occurring. The same
situation occurs in the reading of the rest of block 0, parts a2 and a3, and in reading the
next block (b0 and b1). Only two page-breaks occur, both associated with block 2: one at the beginning of the c0 read (because b1 and c0 are in the same bank but a different page), and the other between the reads of c0 and c1 (because c0 and c1 are likewise in the same bank but a different page). In short, the proposed storage structure takes advantage of the dual banks in SDRAM to reduce page-breaks.
One important factor in making the BLP scheme successful stems from the memory
position of each reference block being easily located. A simple procedure for addressing
each reference block in the frame buffer can be outlined as follows.
In the proposed storage structure for storing reference pictures, the physical memory
address in the frame buffer is fixed for each reference-picture macroblock. Hence, the
memory address of the upper-left pel of every macroblock within a reference picture can be
considered as the base address, and then the relative address of other pels within a
macroblock can be simply represented as the column address denoted from 0 to 63 as shown
in Figure 3.10 (a).
A 32-bit data word length SDRAM configuration is adopted in this illustration; hence,
each column in SDRAM can store 4 pels. Figure 3.10 (a) shows an example of the
geographic relationship between a retrieved reference macroblock and the four reference-
picture macroblocks it lies within. In Figure 3.10 (b), the position of block 0 is the position
of the desired reference MB that can be decoded from the motion vector. The location of
block 1 is horizontally adjacent to block 0; thus, they are in the same bank. The row
address of block 1 depends on the value of x. If the value of x is less than or equal to
seven, block 0 and block 1 have the same row address; otherwise, since one memory row stores eight MBs, if MBn mod 8 equals seven a page-break occurs and the row address of block 1 is the row address of block 0 plus one. The column address of block 1
also depends on the value of x. Figure 3.10 (a) clearly shows that the column address of
block 1 is the column address of block 0 plus two if the value of x is less than or equal to
seven; otherwise, the column address of block 1 is that of block 0 minus two. As for the
position of block 2, its row address is the same as that of block 0 because it is vertically
adjacent to block 0. However, its bank location and column address depend on the value of
y. In a similar way, we can specify the position of block 3. After determining the relative
row and column address of each reference block, the physical memory address of that
reference block can be specified by adding the relative address to the base address of the
corresponding macroblock. The method of determining each block’s position in the video
memory is very straightforward; only three comparisons and four additions are needed to determine the positions of the three additional blocks. Table 3.2 summarizes the procedure for determining the position of each block in the frame buffer under the proposed data storage organization.
[Within each macroblock, the 16x16 luma pels map to relative column addresses 0 through 63 in scan-line order, four pels per 32-bit data word. The reference macroblock, offset (x, y) from the upper-left pel of MBn, overlaps MBn and MBn+1 in one bank and MBm and MBm+1 in the other, dividing into blocks 0 through 3.]
(a) Relative column addresses of an MB in the frame buffer
(b) Relationship between reference blocks and the motion vector
Figure 3.10 Specifying the memory addresses of reference blocks
                   Block 1                          Block 2                         Block 3
Bank selection     same as block 0                  if (y ≤ 7) same as block 0;     same as block 2
                                                    else the other bank
Row address        if (x ≤ 7) same as block 0;      same as block 0                 same as block 2
                   else if ((MBn mod 8) == 7)
                   row address of block 0 + 1;
                   else same as block 0
Column address     if (x ≤ 7) column address of     if (y ≤ 7) column address of    if (y ≤ 7) column address of
                   block 0 + 2; else column         block 0 + 32; else column       block 1 + 32; else column
                   address of block 0 - 2           address of block 0 - 32         address of block 1 - 32
Table 3.2 Procedure for determining the memory address of reference blocks
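The procedure in Table 3.2 can be transcribed almost literally. This is a sketch: the encoding of banks and addresses as relative deltas from block 0 is my own convention, and mb_n is MBn's index within its memory row of eight MBs:

```python
def reference_block_addresses(x, y, mb_n):
    """Relative (bank, row, column) positions of reference blocks 1-3 with
    respect to block 0, following Table 3.2. x, y are the pel offsets of the
    reference MB inside MBn. bank 0 means 'same bank as block 0', 1 means
    'the other bank'; row and col are deltas from block 0's address."""
    blk1 = {
        "bank": 0,                                  # same bank as block 0
        "row": 1 if (x > 7 and mb_n % 8 == 7) else 0,
        "col": 2 if x <= 7 else -2,
    }
    blk2 = {
        "bank": 0 if y <= 7 else 1,                 # may be the other bank
        "row": 0,                                   # same row as block 0
        "col": 32 if y <= 7 else -32,
    }
    blk3 = {
        "bank": blk2["bank"],                       # same bank as block 2
        "row": blk2["row"],                         # same row as block 2
        "col": blk1["col"] + blk2["col"],           # block 1's column +/- 32
    }
    return blk1, blk2, blk3
```

As the text notes, only three comparisons (on x, y, and MBn mod 8) and a handful of additions are involved.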
There are three typical data storage organization structures in macroblock-type data
access, which are depicted in Section 2.2.2. All three structures are designed for the
macroblock-level processing model, which involves loading whole reference macroblocks
and then activating the motion compensation process. Figure 3.11 (a) shows that a
maximum of three page-breaks will be encountered when accessing one reference
macroblock if the macroblocks of a reference picture are sequentially stored in
conventional DRAM [Uram95]. Clearly, this structure causes higher probabilities of
encountering page-breaks during data access than the proposed interlaced macroblock-row
storage organization. On the other hand, Figure 3.11 (b) shows that a maximum of one
page-break will be encountered if the macroblocks are sequentially stored into dual banks
of SDRAM [Winz95] or are stored into SDRAM in a 2x2 macroblock-set [Taki01].
Although there may be a lower probability of page-break occurrence when reading whole
reference macroblocks under the macroblock-level processing model, this probability will
increase if these storage structures are applied to the BLP scheme; in the worst case, as
many as six page-breaks may be encountered. Compared to the macroblock-level processing
model, under the BLP scheme the MC buffer only holds two blocks of data, instead of two
macroblocks, and the MC process can be activated as soon as reference block 0 data
arrives, instead of after whole macroblocks arrive.
[In (a), macroblocks are stored sequentially in conventional DRAM: a reference macroblock spanning MBn-1, MBn, MBn+1 and MBm-1, MBm, MBm+1 incurs up to three page-breaks. In (b), sequential or 2x2 MB-set storage in SDRAM incurs at most one page-break.]
(a) Three page-breaks occurring during reading of reference data
(b) One page-break occurring during reading of reference data
Figure 3.11 Worst-case page-breaks during reference data access under macroblock-level processing
Table 3.3 shows a field-based simulation of the average number of page-breaks occurring during the reading of one reference macroblock under five structures: the proposed storage structure, a scan-line structure, and the three typical macroblock-type storage structures mentioned above. In the MPEG-2 specification, a picture structure can be field-based or frame-based, but the encoder and the display use field-based picture access in most MPEG-2 applications. Therefore, for efficiency, the reference pictures are stored in the frame buffer on a field basis in the proposed data storage organization.
Storage structure (processing scheme)                 Luma (Y)   Chroma (Cb, Cr)
Scan-line structure (MB-level)                        11.1       5.2
Sequential structure in DRAM [Uram95] (MB-level)      1.73       0.86
Sequential structure in SDRAM [Winz95] (MB-level)     0.24       0.10
2x2 MB-set structure [Taki01] (MB-level)              0.703      0.328
Proposed interlaced-MB structure (BLP)                0.32       0.17
Table 3.3 Comparison of average page-break occurrence under different reference picture storage structures
3.5 External Memory Access Scheduling
As described in Section 2.2.3, most designers have adopted fixed-priority scheduling
for external memory accessing. However, they have not provided a bus scheduling model to
explore such design issues as the effect of different priority level assignments, the effect of
task period transformation, and the effect of DRAM refresh periods. Without this system
level bus scheduling model, it is difficult work to optimize the decoder architecture at the
hardware level and also tedious work to develop a real-time controller at the firmware
level. The next sub-section reviews the relative amount work of involved in the fixed-
priority scheduling model. The second sub-section proposes a generic mathematical model
for fixed priority bus-scheduling in MPEG-2 applications, and addresses the size
requirements of different internal buffers under this model. Finally, the third sub-section
proposes a combination of fixed time-line scheduling and fixed-priority scheduling to
address problems of buffer size and traffic jams.
3.5.1 Review of Related Work
A generic fixed-priority assignment for single resource scheduling has been used for
a long time [Liu73, Leho89]. Let τ1, τ2, …,τn be a fixed-priority ordered task set with τn
being the lowest priority task. Each task τ i in this set is defined by a worst-case execution
time, Ci, a fixed priority level, Pi, a period, Ti, and a deadline, Di, where Di ≤ Ti. According
to the above researchers’ theorems, the task set can be scheduled if the following inequality
is met:
\min_{0 < t \le D_i} \frac{1}{t} \sum_{j=1}^{i} C_j \left\lceil \frac{t}{T_j} \right\rceil \le 1, \qquad \forall i,\ 1 \le i \le n
This inequality accumulates the workload of the i-th level task together with all tasks of higher priority. We evaluate each task τ_i over its period, but only up to its deadline. For the i-th level task, if the minimum over t of the cumulative workload normalized by t is no greater than unity, meaning the cumulative work can be completed within the elapsed time, then task τ_i is schedulable. If all n tasks meet the inequality, the entire task set is schedulable.
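The test can be applied numerically by evaluating the cumulative demand only at the scheduling points (multiples of the periods up to each deadline), where the minimum over t is attained. A minimal sketch, assuming integer cycle counts and indices ordered by decreasing priority:

```python
import math

def schedulable(C, T, D):
    """Fixed-priority schedulability test: task i is schedulable if, for
    some t in (0, D[i]], the demand sum_{j<=i} C[j]*ceil(t/T[j]) fits in t.
    C = worst-case execution times, T = periods, D = deadlines (D[i] <= T[i]),
    all in integer cycles, index 0 = highest priority."""
    n = len(C)
    for i in range(n):
        # Scheduling points: multiples of higher-or-equal-priority periods
        # up to D[i], plus the deadline itself.
        points = sorted({k * T[j]
                         for j in range(i + 1)
                         for k in range(1, D[i] // T[j] + 1)} | {D[i]})
        if not any(sum(C[j] * math.ceil(t / T[j]) for j in range(i + 1)) <= t
                   for t in points):
            return False
    return True
```

For example, `schedulable([1, 1], [3, 5], [3, 5])` holds, while `schedulable([2, 3], [3, 5], [3, 5])` fails because the total utilization exceeds one.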
The above model assumes a perfect pre-emption with no overhead or blocking costs
associated with the mechanisms of an operating system or underlying hardware. Therefore,
some researchers expanded the above model by including non-ideal effects [Rajk89,
Katc93]. The generic scheduling model that incorporates these non-ideal contributions is
shown below:
\min_{0 < t \le D_i} \frac{1}{t} \left[ \sum_{j=1}^{i} \left( C_j + Overhead_j \right) \left\lceil \frac{t}{T_j} \right\rceil + Overhead_{sys} \left\lceil \frac{t}{T_{sys}} \right\rceil + Blocking_i \right] \le 1, \qquad \forall i,\ 1 \le i \le n
The blocking effect, Blocking_i, represents a delay in starting to process task τ_i due to the non-preemptable execution of lower priority tasks. Two kinds of overhead are modeled: Overhead_j is incurred by the execution of a task τ_j itself, such as context switching and interrupt handling, while Overhead_sys is not directly attributable to an individual task but to the system, such as bus arbitration and the timer.
bus structure, specifically a bus carrying continuous media and using fixed priority
scheduling [Kett94]. Kettler’s model introduced such analysis parameters as DRAM
refreshing periods, the effect of task period transformation, and the effect of different task
transaction modes.
3.5.2 Fixed Priority Scheduling Model
Kettler’s generic model, which describes fixed-priority scheduling in the personal
computer environment, can provide a well-defined foundation for external memory bus
scheduling in an MPEG-2 video decoder, but modifications must be made. In order to
acquire high data throughput and simplify the system design, the proposed scheduling model is constrained by the following assumptions:
(a) On the memory bus, all I/O transaction tasks are non-preemptive.
(b) The I/O transactions will access DRAM by burst in order to take advantage of the page
mode feature.
(c) Each of the I/O processes supports only one task whose characteristics remain fixed
through time.
(d) The priority level of each I/O process is static.
The notations used in the proposed bus scheduling model to be introduced in the
following sections are as follows:
n: the number of tasks in the task set
τ j: a task belonging to the task set τ1, τ2, …, τn, a fixed-priority ordered set with τn being the lowest priority task
Pj: the priority level of task τ j in the task set
Mj: the number of bytes to transfer for task τ j
Rbus: the transaction rate of a data bus (bytes/cycle), depending on the width of the bus
Cj: the cycles for transferring data for task τ j, calculated by Mj ⁄ Rbus
Dj: the deadline of task τ j
Tj: the cycles between two transactions of task τ j, with Dj ≤ Tj
Oj: overhead directly bound to the task τ j (In MPEG-2 applications, this overhead
includes bus arbitration, memory address generation, and DRAM page break)
Oref: overhead of the DRAM refreshing cycles
Bi: the worst-case blocking cycles associated with the i th level task
Li: the length (in bytes) of the internal buffer associated with task τ i
Ri: the filling or emptying rate (bytes/cycle) of the internal buffer associated with task τ i
THi: a threshold (in bytes) for refilling the internal buffer associated with task τ i
The following is the proposed MPEG-2 fixed-priority bus scheduling model:
\min_{0 < t \le D_i} \frac{1}{t} \left[ \sum_{j=1}^{i} \left( C_j + O_j \right) \left\lceil \frac{t}{T_j} \right\rceil + O_{ref} \left\lceil \frac{t}{T_{ref}} \right\rceil + B_i \right] \le 1, \qquad \forall i,\ 1 \le i \le n \qquad (3.3)
Because of the above assumptions, this real-time scheduling model for I/O processes on the
memory bus is in small burst mode and in non-preemptive mode. For MPEG-2 video
decoding, there is a set of five I/O processes τ1, …,τ5 with priority level P1, …, P5, where
P1 > P2 >…> P5. Task τ i will not miss its deadline for any transaction release time under
fixed-priority scheduling if the above inequality is met.
[Data flow: the compressed bitstream FIFO (length L_bits_write, threshold TH_bits_write, fill rate R_bits_write), the VLD FIFO (L_vld_read, TH_vld_read, drain rate R_vld_read), the display buffer (L_display_read, TH_display_read, drain rate R_display_read), and the motion compensation buffer for reference and reconstructed data all exchange data with external memory over a shared data bus at rate R_bus.]
Figure 3.12 Data flow model of the bus and internal buffers for an MPEG-2 video decoder
Most MPEG-2 video decoders are designed with single bus architecture in order to
reduce the complexity of DRAM interface circuitry. Therefore, a suitable I/O buffer size
for each I/O process is needed in order to keep the associated functional unit from
starvation. Figure 3.12 shows the bus and internal buffer model for a generic MPEG-2
video decoder. The five buffers are associated with the five I/O processes, and these
buffers have different buffer size requirements and threshold settings which depend on the
characteristics of the applied MPEG-2 Profile@Level, the underlying bus scheduling, and
the data bus architecture. However, the motion compensation functional unit must read two
reference macroblocks under the amended macroblock-level scheme (or read four reference
macroblocks under the conventional macroblock scheme) and write back one decoded
macroblock. Therefore, its buffer size is fixed at about 884 bytes for storing reference data
under the amended macroblock-level scheme (or about 1768 bytes under the conventional
macroblock-level scheme) and 384 bytes for storing decoded data; both sizes are estimated
for the macroblock-level decoding model. The other buffers will be filled or emptied by
their associated functional units with a rate Ri and by external memory (through the data
bus) with a rate Rbus. While a buffer is filling or emptying, the buffer data level will
change. If this data level is lower or higher than the threshold of the buffer, this buffer will
generate a bus request to refill buffer data or remove data from the buffer, respectively.
From the above MPEG-2 fixed-priority bus scheduling formula Eq. (3.3), we can
derive a buffer model that includes a suitable buffer size and threshold setting for each I/O
process if all I/O processes can be scheduled.
First, the deadline Di (in cycles) of an I/O process τi can be defined as the time at which the buffer associated with τi enters an underflow or overflow condition. Then Di is:

    Di = THi / Ri                                           (3.4)

Secondly, the worst-case blocking time Bi (in cycles) of τi can be defined as the maximum execution time, including overhead, over all lower-priority I/O processes:

    Bi = max_{i+1 ≤ k ≤ n} (Ck + Ok)                        (3.5)
Lastly, let t = Di; then, under the conditions of Eq. (3.4) and Eq. (3.5), the bus scheduling model Eq. (3.3) can be re-written as:

    Σ_{j=1}^{i−1} (Cj + Oj)/Tj + Oref/Tref + [max_{i+1 ≤ k ≤ n} (Ck + Ok)] × Ri/THi ≤ 1
Rearranging the terms, the buffer threshold THi for an I/O process can be estimated by:
    THi ≥ [max_{i+1 ≤ k ≤ n} (Ck + Ok)] × Ri / (1 − Σ_{j=1}^{i−1} (Cj + Oj)/Tj − Oref/Tref)    (3.6)
To avoid data loss or discontinuation of the decoding process, every I/O process buffer is
designed as a dual port buffer, which means sending data and receiving data can occur
simultaneously. Hence, the buffer size Li can also be estimated by:

    Li = Ci × (Ri − Rbus) + THi                             (3.7)
To provide a smooth viewing experience for the audience, display buffer reading should have the highest priority level among the I/O processes. Under a frame rate requirement of 30 frames/sec, viewers cannot easily notice the loss of a few macroblocks; hence, the priority levels of MC-reference data reading and MC-reconstructed data writing can be set lower than the other I/O priority levels. Therefore, the priority levels of the five I/O processes can be set to Pdisplay_read > Pbits_write > Pvld_read > Pmc_read > Pmc_write. With these priority settings, the buffer size estimation (Eq. 3.7), and the buffer filling/emptying threshold estimation (Eq. 3.6), the lower bounds of the buffer sizes for display buffer reading, compressed bitstream FIFO writing, and VLD FIFO reading can be determined by the following:
    Ldisplay_read ≥ Cdisplay_read × (Rdisplay_read − Rbus)
        + max(Cbits_write + Obits_write, Cvld_read + Ovld_read,
              Cmc_read + Omc_read, Cmc_write + Omc_write)
          × Rdisplay_read / (1 − Oref/Tref)                                     (3.8)

    Lbits_write ≥ Cbits_write × (Rbits_write − Rbus)
        + max(Cvld_read + Ovld_read, Cmc_read + Omc_read,
              Cmc_write + Omc_write)
          × Rbits_write / (1 − (Cdisplay_read + Odisplay_read)/Tdisplay_read
                             − Oref/Tref)                                       (3.9)

    Lvld_read ≥ Cvld_read × (Rvld_read − Rbus)
        + max(Cmc_read + Omc_read, Cmc_write + Omc_write)
          × Rvld_read / (1 − (Cdisplay_read + Odisplay_read)/Tdisplay_read
                           − (Cbits_write + Obits_write)/Tbits_write
                           − Oref/Tref)                                         (3.10)
In these three equations, the first term to the right of the inequality accounts for buffer
capacity resulting from filling and emptying via bursts of data going to and from external
memory and the associated functional unit. The second term accounts for buffer capacity
resulting from the blocking and overhead effects that are contributed by other I/O
transactions.
From the equations, the disadvantages of the macroblock-level-pipeline decoding
model and the fixed-priority bus scheduling scheme can be clearly seen. To acquire high
data throughput, each I/O transaction is in non-preemptive mode; hence, each I/O
transaction request is subject to blocking by other I/O requests. This blocking delay will be
proportional to the transferred data burst size. If the macroblock-level decoding model is adopted, the delay is more serious because each data burst is at its maximum size. As a
result, the real-time decoding requirement may not be met because critical data may not be
delivered to decoding blocks in time. Therefore, increasing buffer size or increasing bus
width are the only two ways to solve this problem. Unfortunately, both larger buffer
memories and wider data buses not only increase chip size, but also consume more power.
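Equations (3.4) to (3.7) can be turned into a small sizing calculator. The process parameters below (burst lengths C, overheads O, periods T, rates R, and the refresh term Oref/Tref) are invented placeholders, and the rate-difference term of Eq. (3.7) is taken in magnitude; the sketch only illustrates the structure of the computation, not the actual numbers of this design:

```python
# Sketch of the buffer threshold and size rules of Eqs. (3.4)-(3.7).
# Processes are listed highest priority first; all numbers are illustrative.
#   C: burst length (cycles), O: arbitration overhead (cycles),
#   T: period (cycles), R: functional-unit fill/empty rate (bytes/cycle).

O_REF, T_REF = 10, 1000   # assumed overhead and period of DRAM refresh
R_BUS = 4.0               # assumed bus transfer rate (bytes/cycle)

procs = [
    dict(name="display_read", C=48, O=6, T=667, R=0.72),
    dict(name="bits_write",   C=12, O=6, T=667, R=0.07),
    dict(name="vld_read",     C=10, O=6, T=667, R=0.30),
    dict(name="mc_read",      C=75, O=6, T=667, R=1.33),
    dict(name="mc_write",     C=32, O=6, T=667, R=0.58),
]

def threshold(i):
    """Eq. (3.6): threshold for the i-th highest-priority I/O process."""
    blocking = max((p["C"] + p["O"] for p in procs[i + 1:]), default=0)
    higher = sum((p["C"] + p["O"]) / p["T"] for p in procs[:i])
    slack = 1.0 - higher - O_REF / T_REF
    assert slack > 0, "the I/O processes are not schedulable"
    return blocking * procs[i]["R"] / slack

def buffer_size(i):
    """Eq. (3.7), using the magnitude of the rate difference."""
    p = procs[i]
    return p["C"] * abs(p["R"] - R_BUS) + threshold(i)

for i, p in enumerate(procs):
    print(f'{p["name"]}: TH >= {threshold(i):.1f}, L >= {buffer_size(i):.1f}')
```

Note that the lowest-priority process has no lower-priority blocker, so its threshold term collapses to zero; its buffer size is then set by the burst term alone.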
3.5.3 The Proposed Bus Scheduling and Internal Buffer Size Reduction
According to the characteristics of the five I/O tasks, a bus-scheduling scheme is
proposed that combines time-line scheduling and fixed-priority scheduling. The time-line
schedule allocates fixed non-preemptable execution sequences for the deterministic I/O
tasks. The bus arbiter only needs to monitor a few stochastic I/O tasks, which will reduce
the blocking effect in Eqs. (3.8), (3.9), and (3.10). Therefore, this scheduler can accommodate larger data transaction lengths for each I/O process without the need to increase the associated buffer size.
[Figure 3.13: states include "request to empty bitstream FIFO", "request to refill VLD FIFO", "time for reading display data", "time for writing reconstructed data", "time for reading reference data", "time for emptying bitstream FIFO", and "time for refilling VLD FIFO". Transitions: request for a buffer refill/empty under fixed-priority scheduling; request for a buffer refill/empty under fixed time-line scheduling; transaction ends.]
Figure 3.13 State diagram of the proposed bus scheduling scheme
Conventional fixed-priority scheduling is a kind of pure stochastic scheduling
scheme. However, from Table 3.1, it can be seen that only the I/O processes of compressed
bitstream writing and VLD FIFO reading are stochastic; the other three I/O transactions,
MC-reference data reading, MC-reconstructed data writing, and display buffer reading, are
deterministic. These deterministic transactions dominate most of the transfer time on the data bus. Therefore, an off-line schedule analysis can be made in advance for these three I/O tasks during the decoding of one macroblock, allocating the required duration and order of bus access for each task under the worst-case data transfer condition.
A time slot is also allocated for filling the VLD FIFO and emptying the compressed bitstream FIFO, in order to reduce the chance of interrupts caused by buffer overflow/underflow. Figure 3.13 shows the state diagram of the bus scheduling scheme.
Normally, the bus arbiter monitors the requests from compressed bitstream FIFO and VLD
FIFO. When it is time for scheduled I/O transactions, the bus will be allocated to those
processes until the scheduled transaction ends. The bus arbiter will then resume monitoring
the requests from compressed bitstream FIFO and VLD FIFO.
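The arbiter behavior in Figure 3.13 can be sketched as follows; the slot boundaries and task names are illustrative placeholders for the off-line schedule, not the actual schedule derived in this section:

```python
# Sketch of the combined time-line + fixed-priority bus arbiter: the three
# deterministic tasks (and a reserved FIFO slot) own fixed, non-preemptable
# windows inside one macroblock period; outside those windows the arbiter
# serves the two stochastic requests with P(bits_write) > P(vld_read).

TIMELINE = [                         # (start_cycle, end_cycle, task)
    (0,   96,  "mc_read"),
    (96,  160, "mc_write"),
    (160, 288, "display_read"),
    (288, 352, "fifo_slot"),         # reserved slot for the two FIFOs
]
STOCHASTIC_PRIORITY = ["bits_write", "vld_read"]   # high to low

def grant(cycle, pending):
    """Return the task owning the bus at `cycle` of one macroblock period;
    `pending` is the set of stochastic requests currently raised."""
    for start, end, task in TIMELINE:          # fixed time-line first
        if start <= cycle < end:
            return task
    for task in STOCHASTIC_PRIORITY:           # then fixed priority
        if task in pending:
            return task
    return None                                # bus idle

print(grant(10, set()))                        # a scheduled slot wins
print(grant(400, {"vld_read", "bits_write"}))  # bits_write outranks vld_read
```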
Under the BLP decoding model and the proposed bus scheduling, the buffer sizes of
the five I/O tasks can be reduced, compared to the sizes under the two macroblock-level-
pipeline decoding models and fixed-priority bus scheduling. Assuming a video decoder
running at 27 MHz, 15 Mbits/sec bitstream input rate, 30 frames/sec display rate, and
720x480 frame resolution, Table 3.4 summarizes the differences in buffer
size requirements under the three decoding approaches. A 64-bit bus width and 64-bit data-
word SDRAM are used in this simulation for the two macroblock-level decoding modes,
while a 32-bit bus width and 32-bit data-word SDRAM are used for the BLP decoding
mode.
As shown in Figure 3.2, the buffer space for the MC process needs to store four
reference macroblocks under the conventional macroblock-level-pipeline decoding model.
Hence, it needs about 1768 bytes, which is large enough to include the redundant data
described in Section 3.4.3. An 884-byte MC buffer is needed for the amended macroblock-level decoding scheme in order to store two reference macroblocks and the redundant data.
On the other hand, under the BLP decoding model, only four blocks of space (about 298
bytes) are needed, which is large enough to accommodate the smaller amount of redundant
data. Two blocks of space store reference data for MC processing of the current block, and
at the same time the other two blocks of space store reference data for the next block’s MC
processing. Writing back a reconstructed macroblock in the conventional macroblock-level
decoding mode requires a two-macroblock space, which is 768 bytes. (Or a one-macroblock
space for the amended macroblock-level mode.) In the BLP model, only two blocks of
space (128 bytes) are needed.
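The byte counts quoted above follow directly from the 4:2:0 sample layout; the short check below reproduces the write-back figures (the MC reference buffers of 1768, 884, and 298 bytes are larger than the plain four-, two-, and four-block counts because they also hold the redundant data described in Section 3.4.3):

```python
# Byte counts for 4:2:0 video with 8-bit samples: a block is 8x8 pels and a
# macroblock is four luma blocks plus one Cb and one Cr block.

BLOCK = 8 * 8                  # 64 bytes per block
MACROBLOCK = (4 + 2) * BLOCK   # 384 bytes per macroblock

print(2 * MACROBLOCK)  # 768: two-macroblock write-back space (conventional)
print(1 * MACROBLOCK)  # 384: one-macroblock write-back space (amended)
print(2 * BLOCK)       # 128: two-block write-back space (BLP)
```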
    Buffer              Conventional MB-level   Amended MB-level   Proposed BLP
    MC buffer                   1768                   884               298
    Write-back buffer            768                   384               128
    Display buffer               512                   384               192
    Bitstream FIFO                96                    96                48
    VLD buffer                   152                   152                40
    IQ/IZZ buffer                768                   256               128
    IDCT buffer                  432                   216                72

Table 3.4 Comparison of internal buffer sizes (in bytes) under the macroblock-level decoding modes and the proposed BLP decoding mode
Under the conventional decoding model, the display buffer must theoretically read at
least 512 bytes each time, according to Eq. (3.8). In practice, this buffer size is usually
larger in order to guarantee smooth display; for example, a 768-byte display buffer is used in Demura's design [Demu94]. In the amended macroblock-level scheme, this size can be
reduced to 384 bytes because one time slot is allocated for data transfer to the display
buffer during each period of macroblock decoding. However, under the proposed bus
scheduling scheme, two time slots are allocated to the display buffer for reading data
during each period of macroblock decoding; hence, the display buffer only needs 192 bytes.
A time slot can be allocated for data transfer of compressed bitstream FIFO and VLD
FIFO. The length of compressed bitstream FIFO can be estimated by:
    (longest macroblock decoding time, in cycles) × Rbits_write        (3.11)

Because of real-time display constraints, one macroblock must be decoded within 667 cycles at a 27 MHz video decoder speed, and Rbits_write is 0.07 bytes/cycle at a 15
Mbits/sec bitstream input rate. Hence, the space for compressed bitstream FIFO only needs
to be 48 bytes. According to Eq. (3.9), both the conventional and amended macroblock-
level decoding modes need at least 96 bytes for bitstream FIFO. To avoid data loss, the
industry adopts larger buffer sizes such as 192 bytes [Demu94].
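The 667-cycle and 48-byte figures can be checked with the arithmetic behind Eq. (3.11):

```python
# Arithmetic behind Eq. (3.11): cycles available per macroblock at 27 MHz,
# and the bitstream arrival rate at the 15 Mbits/sec input bit-rate.

CLOCK = 27_000_000                           # decoder clock (Hz)
MB_PER_FRAME = (720 // 16) * (480 // 16)     # 45 x 30 = 1350 macroblocks
FPS = 30                                     # display rate (frames/sec)

cycles_per_mb = CLOCK / (MB_PER_FRAME * FPS)   # ~666.7 -> 667 cycles
r_bits_write = (15_000_000 / 8) / CLOCK        # ~0.069 bytes/cycle

fifo_bytes = cycles_per_mb * r_bits_write      # Eq. (3.11): ~46.3 bytes,
print(round(cycles_per_mb), round(fifo_bytes, 1))  # rounded up to 48 in use
```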
A sufficient buffer size for the VLD FIFO is one that can hold one macroblock of data, in order to reduce the probability of issuing an interrupt to request VLD buffer refilling while one macroblock is being decoded. The three bitstream simulations
(mobile.m2v, flowers.m2v, and susie.m2v) in Figure 3.14 show the average number of extra
requests for refilling the VLD buffer during decoding of the I, P, and B pictures under
different VLD buffer sizes. The extra requests are in addition to the proposed regular
schedule of VLD filling at the beginning of decoding for each macroblock. Basically, the
larger the VLD buffer size chosen, the fewer the extra requests for refilling encountered.
But if a larger buffer is chosen, it will increase chip area and consume more power.
[Figure 3.14: three plots of the average number of VLD buffer refill requests versus VLD buffer size (16 to 64 bytes), with separate curves for I, P, and B pictures: (a) mobile.m2v (bit-rate: 15 Mbps), (b) flowers.m2v (bit-rate: 15 Mbps), (c) susie.m2v (bit-rate: 15 Mbps).]
Figure 3.14 Average number of filling requests for different VLD buffer sizes
Figure 3.3 clearly showed that data bus utilization during decoding of B-pictures
is very high. Any extra bus access requests such as VLD buffer refilling may decrease the
performance of other functional units during B-picture decoding. Therefore, a suitable VLD buffer size is one just large enough to contain one macroblock of B-picture data, so that extra requests for VLD buffer refilling do not disturb the normal bus access schedule. Table 3.5 shows the data characteristics for the three
picture types in the three test bitstreams. The I-type macroblocks contain the most data, the
P-type macroblocks contain less data, and the B-type macroblocks contain the least data.
The proposed VLD buffer size is 40 bytes, which is a large enough buffer size to contain
one macroblock of data for a B-picture. Although this buffer size is not large enough to
contain one macroblock of data for I- or P-pictures, frequent requests for VLD buffer
refilling will not affect decoding performance because data bus utilization in these two
types of picture decoding is low. Hence, VLD buffer refilling will infrequently delay the
delivery of data to other functional units. Under the macroblock-level decoding modes, a
theoretical 152-byte minimum space is derived by Eq. (3.10). To cover this minimum, a
larger size, 256 bytes, has been adopted by the industry [Demu94].
               I picture                  P picture                  B picture
               Avg. data   % of MB data   Avg. data   % of MB data   Avg. data   % of MB data
               per MB      ≥ 40 bytes     per MB      ≥ 40 bytes     per MB      ≥ 40 bytes
    Mobile     48 bytes    100%           23 bytes    92%            6 bytes     0.51%
    Flowers    28 bytes    98%            17 bytes    85%            7 bytes     0.13%
    Susie      31 bytes    99%            22 bytes    90%            7 bytes     0.17%

Table 3.5 Average data amount per one macroblock within I-, P-, and B-pictures
In addition to the buffers described above, two internal buffers, the IQ/IZZ buffer and
the IDCT buffer, are also required during the video decoding process. DCT coefficients
produced by the inverse quantization process are stored in the IQ/IZZ buffer to await
the IDCT process. Similarly, the data output from the IDCT functional unit is stored in the
IDCT buffer to await being added to the output of the MC unit. Under both macroblock-
level decoding modes, the two buffer sizes mainly depend on the data processing rate of the
IDCT and MC functional units. To accommodate worst-case processing rates, a two-
macroblock IQ/IZZ buffer (768 bytes) and a one-macroblock IDCT buffer (432 bytes) are
needed for the conventional macroblock-level mode. For the amended macroblock-level
mode, a four-block IQ/IZZ buffer (256 bytes) and a three-block IDCT buffer (216 bytes)
are needed. On the other hand, in the BLP mode, each functional unit only processes one
block of data each time. Therefore, the two internal buffers can be reduced to 128 bytes for
the IQ/IZZ buffer and 72 bytes for the IDCT buffer. As shown in Table 3.4, an average
savings of 85% of internal buffer space can be achieved under the BLP scheme.
3.6 Conclusion
The BLP decoding scheme associated with special frame buffer storage organization
and data bus access scheduling has been discussed in detail here. In short, the architecture
of a BLP MPEG-2 video decoder can be tailored to include a narrow data bus (32-bit wide),
much smaller internal buffers, and a simple bus access mechanism. Therefore, this decoder
has two main advantages: small silicon area and low power. The two advantages can not
only reduce the price of the product, but can also benefit the design of mobile applications.
In the next chapter, design guidelines for constructing a DVD video decoder under the BLP
scheme will be presented with simulation results to verify the efficiency of the scheme.
CHAPTER FOUR
Design of a Video Decoder for DVD: Block-Level Pipeline Scheme Application Example I
4.1 Introduction
The large storage capacity of DVD (Digital Versatile Disc) finds wide application in both the computer and consumer electronics industries. As DVD mainly targets the low-cost consumer market, it is important that its architecture be efficient and cost-effective. A limited version of MPEG-2 MP@ML is used in the DVD video format.
However, due to the nature of inter-frame coding in the MPEG-2 algorithm, the
relatively low access speed for traditional DRAMs, and the large data transfer delay
between processing units and external DRAMs under the macroblock-level processing
model, video decoder architectures in most of the reported literature [Demu94, Fern96,
Li97, Lin96, Iwata97, Toyo94, Yasu97] use a 64-bit data bus to communicate with the
external DRAM, the display, and the incoming FIFO. Moreover, long data transfer
durations also delay functional blocks in the processing unit from accessing the bus, and
thus a more complex bus arbitration scheme combining priority assignment and polling has
been adopted to resolve conflicts on the bus [Demu94, Ling97].
To overcome the problem described above, a low-cost MPEG-2 video decoding
system is proposed for DVD, which uses a high performance single-chip MPEG-2 decoder
with a Block-Level-Pipeline (BLP) decoding scheme. Moreover, individual processes are
also classified as either deterministic or stochastic. The proposed BLP decoding model and
data bus scheduling controller allocates DRAM access for deterministic processes of
functional units according to a pre-determined schedule, and each time it only loads one or two reference blocks from the DRAM for the motion compensator. As a result, the peak bus
bandwidth (Mbytes/sec) can be lowered and the system bus width can be reduced from 64
bits to 32 bits. Additionally, the controller complexity is significantly simpler than most
existing ones. Computation and I/O operation are also balanced to minimize the sizes of
embedded buffers in the decoder. Clock frequency is chosen to be 27 MHz, a simple
multiple of the video sampling rate, for low power consumption. The proposed architecture
also employs 2 Mbytes of SDRAM running at 81 MHz to store two reference pictures, one B-
picture, and the incoming compressed bitstream.
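A quick budget check (assuming 8-bit 4:2:0 samples, i.e. 1.5 bytes per pel) confirms that the three picture stores and the VBV buffer fit in the 2-Mbyte SDRAM:

```python
# Memory budget for the 2-Mbyte SDRAM: two reference pictures, one
# B-picture, and the VBV buffer for the incoming compressed bitstream.

PICTURE = int(720 * 480 * 1.5)   # 518,400 bytes per 4:2:0 picture store
VBV = 1_853_500 // 8             # DVD VBV buffer, ~1.8535 Mbits, in bytes

total = 3 * PICTURE + VBV        # ~1.79 Mbytes in all
print(total, total <= 2 * 1024 * 1024)
```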
4.2 Design Procedure
A DVD player is a consumer electronic device marketed at a low price, so it is
important to note that an efficient and condensed design for the video decoder architecture
directly affects the product price. The width of the data bus, the sizes of the internal buffers, the complexity of the system and bus controllers, and the performance constraints of the functional units determine the efficiency of the video decoder architecture. However, these factors often trade off against one another; for example, as described in Section 3.2, a narrower data bus can reduce the silicon area but may require larger internal buffers. Therefore, the analysis paradigm for the MPEG system depicted in Section 1.5 provides a complete design methodology to balance the requirements of these factors and implement an optimum video decoder. This analysis paradigm, with its four interconnected but sequential phases (processing model, process management, resource management, and optimal architecture), can serve as a guide for design.
Figure 4.1 gives an overview of the design procedure for the MPEG-2 video decoder using
this analysis paradigm. In each phase, the main design issues are listed with the design
constraints. The arrows in this figure depict design iteration loops.
[Figure 4.1: four design phases, each with its main design issues and constraints:
1. Selection of processing models (MB-level model or BLP model): system-level data transfer profile analysis; processing rate analysis of each functional unit; processing clock rate analysis; data bus width analysis. Constraints: real-time decoding, low memory bandwidth, low power consumption, small chip size.
2. Process management evaluation: decoding path configuration; decoding process controller design; external memory access scheduling scheme. Constraints: target clock rate, target processing rate, low control complexity, small chip size.
3. Resource management evaluation: external memory space estimation; external memory configuration evaluation; data storage organization; external memory clock rate; internal buffer sizes. Constraints: real-time decoding, low memory bandwidth, low page-break latency, small chip size.
4. Optimal architecture design: evaluation of different algorithms; requirements analysis; evaluation of design trade-offs. Constraints: real-time decoding, target clock rate, target processing rate, small chip size, low power consumption, VLSI technology.
A performance simulation tool supports the iteration loops among the phases.]
Figure 4.1 Proposed design methodology of the MPEG-2 video decoder
[Figure 4.2: data flows from the disc through 8/16 demodulation (26.16 Mbps), error correction (13.08 Mbps), and buffer memory (11.08 Mbps) to the demux, which delivers a variable-rate stream (max 10.08 Mb/s for audio + video + subpicture) to the audio decoder (≤ 6.144 Mbps), the video decoder (≤ 9.8 Mbps), and the sub-picture decoder (≤ 3.36 Mbps), all under a system controller.]
Figure 4.2 Data flow block diagram of DVD-Video
First, the processing model phase is about designing an efficient video data
processing model. This model will of course affect the other three phases of design. As
described in Section 3.2, the processing model will affect the choice of data bus width, and
some internal buffer sizes such as the IQ buffer and MC buffer. It also will affect the
performance constraints of functional units, which can alter the degree of architectural
complexity of functional units. Hence, an in-depth knowledge of the system data transfer
requirement is required in order to tailor an efficient design. The video specification of
DVD-ROM from the DVD Forum states that MPEG-2 MP@ML video is used for the DVD video format. There are two main differences between the MP@ML format and the DVD format: the incoming video bitstream rate and the VBV buffer size. The maximum input bit-rate
of compressed video data is less than or equal to 9.8 Mbits/sec, instead of the 15 Mbits/sec
defined in the original MP@ML specification. The VBV buffer size in the DVD
specification is 1.8535 Mbits, which is a little larger than the 1.75 Mbits defined in the
original MP@ML specification. Figure 4.2 shows a simplified DVD-Video data-flow block
diagram. Compared to MP@ML, DVD has more restrictions on the incoming bitstream rate.
    Parameter          DVD-Video                                        Comparison Notes
    Coded              MPEG-1 or MPEG-2 (MP@ML)
    Representation
    Frame Rate         29.97 fps or 25 fps                              MP@ML also includes 23.976, 24, and 30 fps
    TV System          525/60 or 625/50
    Aspect Ratio       4:3 (for all frame sizes); 16:9 (for all         MP@ML also includes 2.21:1
                       frame sizes except 352 pels/line)
    Display Mode       Pan Scan or Letterbox
    Coded Frame Size   525/60: 720x480, 704x480, 352x480, 352x240;
                       625/50: 720x576, 704x576, 352x576, 352x288
                       (MPEG-1 is only allowed in 352x240 or 352x288)
    GOP Size           Max 36 fields or 18 frames (NTSC);               MP@ML has no GOP size restriction
                       max 30 fields or 15 frames (PAL)
    VBV Buffer Size    MPEG-2: 1.8535008 Mbits;                         MP@ML is 1.75 Mbits
                       MPEG-1: max 327689 bits
    Transfer Method    MPEG-2: VBR or CBR; MPEG-1: CBR
    Maximum Bitrate    9.8 Mbits/sec                                    MP@ML is 15 Mbits/sec

Table 4.1 DVD-Video parameters summary and comparisons with MPEG-2 MP@ML
When the video stream is 9.8 Mbits/sec, the bit-rate for the audio stream should be lower.
Table 4.1 then summarizes the MP@ML subset that comprises the video data specifications
that are defined by the DVD Forum. The BLP decoding scheme proposed in Chapter 3 can
be applied to the DVD video decoder because the DVD specifications are a subset of
MP@ML. Under this processing model, the corresponding system architecture of the video
decoder can be specified, which will be presented in Section 4.3.
The second phase, process management, is about how the processing model manages a
system operation profile for functional units and the corresponding external memory access
scheduling. A microprocessor is needed in a decoder to control data flow and operations of
the functional units. According to the decoding sequence in a processing model, the
microprocessor will activate a functional unit at the right time. According to external
memory access scheduling, the microprocessor will also schedule external memory access
for a functional unit or handle an interrupt signal for data access from a functional unit.
Therefore, straightforward process management and external memory access scheduling
will simplify this microprocessor on the hardware and firmware levels. Section 4.4 will
present the decoder operation profile for functional units under the BLP decoding scheme.
The proposed external memory access scheduling has been discussed in Section 3.5.3.
The third phase, resource management, is about the requirements of external memory
space and internal buffer sizes, and the selection of data bus width. Resource management
must take into account the first two phases, especially the specifications for a given
application, the processing model, and external memory access scheduling. Because the
DVD video decoder takes its specifications from MPEG-2 MP@ML, external memory
space should be 16 Mbits as described in Section 3.4.1. The requirements for internal
buffer sizes under the BLP scheme have been discussed in Sections 3.3.2 and 3.5.3. They
will be applied in Section 4.7 (Performance Simulation Model) to the DVD application.
Data bus width affects data bus utilization (an important factor in determining the
efficiency of decoder architectures) and internal buffer sizes, as described in Section 3.2.
Under the BLP scheme and the proposed bus access scheme shown in Figure 3.3 (a), data
bus cycle utilization of a video decoder adopting a 32-bit data bus is higher than other
macroblock-level decoding schemes. For video decoder architectures, the most important requirement is real-time decoding: each macroblock in a picture must be decoded within the time limit for decoding one macroblock (Eq. 3.1). Simulation results
demonstrating successful resolution of real-time decoding issues under the proposed video
decoder architecture and BLP processing model are presented in Section 4.8.
The fourth phase of the analysis paradigm, optimal architecture, is about deriving the
processing rate of each functional unit after taking into account the determinants of
processing model, process management, and resource management. The proposed
architecture design for each functional unit will be presented in Section 4.5.
4.3 Overall Decoding System
A block diagram of the proposed decoder architecture is shown in Figure 4.3. The
architecture consists of one external memory device, an SDRAM interface, a 32-bit wide
data bus, a microprocessor, and one baseline unit. The functionality and configuration of
key units in this decoding system are briefly discussed as follows:
• One external memory device accommodates two required reference pictures, one picture
size of display memory, and the video buffer verifier (VBV) buffer (for incoming
compressed bitstream). SDRAM can be used for this memory device and this adopted
SDRAM is internally configured as a dual bank with 32-bit wordlength for each bank.
Total memory size for this video decoder is 2 Mbytes.
• The SDRAM interface is an external memory interface circuit for SDRAM access
operations. It includes one set of data pins and one set of address pins. Its functions
are, firstly, to automatically generate a row address strobe/column address strobe
(RAS/CAS) for accessing or refreshing memory cells; and secondly, to buffer data
transactions under two different clock speeds – that of the SDRAM and that of the
video decoder. The decoding simulation runs at 27 MHz and uses 81 MHz SDRAMs.
• The microprocessor is responsible for setting up decoding parameters, such as the current macroblock types and addresses, and for calculating the actual motion vectors. It also acts as a controller, directing the flow of operations among the functional units as well as the flow of data to/from the DRAM.
• The baseline unit consists of a VLD, the IQ/IZZ unit, the IDCT unit, the MC unit, and
the associated internal buffers.
To simplify the discussion but without losing generality, NTSC-format bitstreams are used
as the decoding target in this dissertation. The data specifications of the NTSC-format are a
frame size of 720x480 pels (1,350 macroblocks in a frame) and a display rate of 30 frames
per second. Therefore, under real-time playback restriction, each MB should be decoded
within 667 cycles at a 27 MHz video decoder clock rate, and 1,334 cycles at a 54 MHz
clock rate. Obviously, the video decoder can have a larger margin of decoding time if it is
running at 54 MHz, but it will consume 10% more power in comparison with using the
lower clock speed of 27 MHz. As described in Section 2.3, power consumption is one of the
key factors for consumer products. The proposed MPEG MP@ML video decoder will adopt
the 27 MHz clock rate to contribute to the low power consumption of the whole
architecture.
[Figure 4.3: the decoder is organized around a 32-bit data bus and a command bus. Blocks include the microprocessor (with instruction cache, data cache, and register file), a host interface, a display data buffer, a 2-Mbyte SDRAM (32-bit) behind the SDRAM interface, an SDRAM controller (scheduling controller and address generator), an MV decoder, a display interface and display engine (vertical and horizontal scaling filter, color convertor, sub-picture decoder, and overlay controller), a bitstream FIFO for the coded bitstream input, and the block decoding engine containing the baseline unit (VLD and VLD buffer, IQ/IZZ and IQ/IZZ buffer, IDCT and IDCT buffer, MC unit and MC/WB buffer) with its data buffer.]
Figure 4.3 Block diagram of the proposed DVD video decoder
4.4 BLP Controller Mechanism
The pipelined processing flow chart for decoding P-type or B-type macroblocks under
the proposed decoder architecture (Figure 4.3) is illustrated in Figure 4.4. The flow chart in
Figure 4.5 then shows the pipelined decoding process for an I-type macroblock. The only
difference between the two flow charts is that there is no motion compensation process
during the decoding of I-type macroblocks. The decoding of a B-type macroblock requires
more processing than any other type and is thus used here to illustrate the controller
technique. In this discussion, the BLP scheme is illustrated by the 4:2:0 chroma sampling
bitstream, where each macroblock consists of 6 blocks, designated as numbers 0 to 5. The
BLP decoding scheme can be easily extended and applied to a 4:2:2 chroma sampling
bitstream.
Macroblock decoding in MPEG-2 follows a specific sequence. The required tasks (in
order) are the Bitstream FIFO write, VLD buffer read, VLD process, IQ/IZZ process, and
IDCT process. If motion compensation is required, the MC task is also scheduled after the
VLD unit has decoded the macroblock header. However, as described in Section 3.3.1, the
decoding of header information attached to the current macroblock is performed during
IDCT and/or MC unit processing of the last coded block of the previously decoded
macroblock. With this decoding order, at least 7% of the operation cycles can be saved. The
results from IDCT and MC units are then combined to form decoded data and written back
to the memory for display and as future reference if necessary. The controller synchronizes
the MC and the IDCT units on a block basis and also manages the synchronization of the
tasks between blocks. In summary, the baseline functional units process video data in a
pipelined fashion and on a block-by-block basis. After finishing the decoding of one block
according to the fixed schedule, the functional units in the baseline then begin the decoding
process for the next block. The strategy of the BLP decoding scheme is to take advantage
of this sequence and impose two fixed schedules on the data bus transactions to minimize
buffer requests and waiting cycles.
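The resulting pipeline can be sketched schematically; the helper blp_schedule below is hypothetical and only lists which stages are active at each step, without modeling cycle counts:

```python
# Schematic block-level pipeline for one 4:2:0 macroblock (6 blocks):
# VLD+IQ/IZZ leads, IDCT(+MC) runs one block behind, and write-back runs
# two blocks behind, so the pipeline drains two steps after the last block.

def blp_schedule(num_blocks=6, inter=True):
    """Yield (step, actions) pairs for one macroblock's decoding."""
    for step in range(num_blocks + 2):
        actions = []
        if step < num_blocks:
            if inter:
                actions.append(f"load reference for block {step}'s MC")
            actions.append(f"VLD+IQ/IZZ: block {step}")
        if 0 <= step - 1 < num_blocks:
            stage = "IDCT+MC" if inter else "IDCT"
            actions.append(f"{stage}: block {step - 1}")
        if 0 <= step - 2 < num_blocks:
            actions.append(f"write back: block {step - 2}")
        yield step, actions

for step, actions in blp_schedule():
    print(step, "; ".join(actions))
```

Calling blp_schedule(inter=False) drops the MC stages, matching the intra-macroblock flow.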
[Figure 4.4: flow chart. After START, the bitstream FIFO write and VLD buffer read are performed, the SDRAM interface decodes the motion vectors, and the VLD unit processes the next macroblock's header information. For each block n (0 to 5), the reference blocks for block n's MC are loaded, the VLD and IQ/IZZ units process block n, the IDCT and MC units process block n, and decoded block n is written back, with display buffer reads interleaved into the schedule. Each step begins only after all the preceding units are completed, and decoding ends after decoded block 5 is written back.]
Figure 4.4 The flow chart of the BLP decoding process for non-intra macroblocks
[Figure 4.5: the same flow as Figure 4.4 but without the motion compensation steps: after the bitstream FIFO write and VLD buffer read, each block n (0 to 5) is processed by the VLD and IQ/IZZ units and then by the IDCT unit before being written back, with display buffer reads interleaved; the VLD unit processes the next macroblock's header information, and the SDRAM interface decodes the motion vectors. Each step begins only after all the preceding units are completed.]
Figure 4.5 The flow chart of the BLP decoding process for intra macroblocks
89
With the BLP decoding scheme, two fixed schedules, time-line scheduling and fixed-
priority scheduling, are adopted to create a scheduling scheme for data transfer between
functional units and external memory, as described in Section 3.5.3. The time-line schedule
allocates non-preemptable execution sequences for the tasks of video decoding because
these tasks are deterministic. Therefore, the bus-scheduling program in the SDRAM
controller performs the I/O process arbitration sequence as shown in the DRAM access
steps in Figures 4.4 and 4.5. For example, when it is time to read reference data for the MC
task, the bus-scheduling program allocates the data bus to that task until this transaction
ends. Under this time-line scheduling scheme, the SDRAM controller only needs to monitor
Compressed Bitstream FIFO overflow and VLD buffer underflow. If overflow and
underflow occur at the same time, the proposed fixed-priority scheduling handles the two
I/O requests in the order Pbits_write > Pvld_read, where P refers to the priority. When reading
reference data for the MC task for block 0, for example, if the fullness of the Bitstream
FIFO is over its threshold and the VLD buffer is under its threshold, both buffers request
data transfer to/from external SDRAM at the same time, and the SDRAM controller will act
according to the following sequential order. It will first finish reading reference data for
block 0, then transfer FIFO data to SDRAM, then transfer data from SDRAM to the VLD
buffer, and then go on to reading reference data for the MC task for block 1.
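The sequential order described above can be sketched as a small priority arbiter (a behavioral sketch with hypothetical request names; the actual logic is the bus-scheduling program inside the SDRAM controller):

```python
def arbitrate(pending):
    """Order pending I/O requests under the fixed-priority scheme.

    The time-line transaction already on the bus always runs to
    completion first; then P_bits_write outranks P_vld_read.
    """
    priority = ["mc_reference_read",     # current time-line transaction
                "bitstream_fifo_write",  # P_bits_write
                "vld_buffer_read"]       # P_vld_read
    return [req for req in priority if req in pending]

# Both buffers request service while block 0's reference read is active:
order = arbitrate({"vld_buffer_read", "bitstream_fifo_write",
                   "mc_reference_read"})
```

With all three requests outstanding, the arbiter reproduces the sequence given in the text: finish the reference read, service the Bitstream FIFO, then refill the VLD buffer.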
4.5 Architectures of Video Processing Units
4.5.1 Variable-Length Decoder (VLD)
In a hierarchical MPEG-2 bitstream, the DCT coefficients in a block layer, and the
header information in a macroblock layer (such as addressing, macroblock types, coded
block pattern, and motion vectors) are all variable-length codes. The other header
information above the macroblock layer is encoded in fixed-length codes. These variable-
length codes have no explicit boundaries between codewords. The VLD unit therefore parses
the input bitstream and interprets each codeword in turn. As described in
Section 2.2.4, there are two kinds of implementations for the VLD unit: constant-input-rate
decoders and constant-output-rate decoders. The proposed VLD architecture is based on
one of the constant-output-rate approaches, Lei-Sun’s design [Lei91], and then adds header
information decoding and error recovery mechanism circuits for MP@ML applications.
Figure 4.6 shows the block diagram of the proposed VLD architecture. Its key components
are a bitstream feeder; VLC tables for interpreting all variable-length data; a header
analyzer, which includes a start-code detector for locating the various start codes and
look-up tables for interpreting fixed-length data; and a finite state machine (FSM) that
monitors the variable-length and fixed-length decoding processes and indicates the
current processing position in the bitstream.
The bitstream feeder feeds aligned 32-bit compressed data that comes from external
DRAM and is kept in the VLD buffer. The upper and lower registers, which are each 32 bits
wide, contain the current data to be processed. The barrel shifter operates like a sliding
window on the contents of these two registers. The window size (the length of output from
the shifter) is 32 bits, the same length as the start codes of the sequence layer, the GOP
layer, the picture layer, and the slice layer, which is the maximum code-length among the
codewords. During VLC decoding, the output of the barrel shifter is matched, in parallel,
with all entries in the VLC tables that consist of three PLAs containing the decoded data
and corresponding length information. When a match is found, the corresponding source
symbol and the length of the decoded data are output. The barrel shifter is then shifted to
the beginning of the next codeword according to the accumulated code length. When the
carry-out signal goes to “high,” it indicates that the upper register has been fully consumed.
The content of the lower register is transferred to the upper register and a new 32-bit data
unit is loaded into the lower register. The decoded macroblock header information is sent to
the finite state machine and the microprocessor while the motion vectors are sent to the
SDRAM interface for DRAM address generation. The decoded DCT coefficients are sent to
the next pipeline stage of IZZ/IQ for further processing. Consequently, the VLD unit can
achieve the decoding capability of one symbol per cycle, which is the performance required
for an MP@ML video decoder with a 27 MHz clock speed, even in a worst-case situation
involving decoding of long symbols. The longest-length symbol is 28 bits in MPEG-1 and
is 24 bits in MPEG-2.
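The sliding-window behavior of the upper and lower registers and the barrel shifter can be sketched as follows (a behavioral model with a hypothetical class name; the hardware realizes the same steps with a 64-to-32 barrel shifter and a carry-out signal):

```python
class BarrelShifterFeeder:
    """Behavioral sketch of the bitstream feeder and barrel shifter."""

    def __init__(self, words):
        self.words = iter(words)       # aligned 32-bit words from the VLD buffer
        self.upper = next(self.words)  # current data being decoded
        self.lower = next(self.words)  # next 32 bits, pre-loaded
        self.offset = 0                # accumulated code length in the upper register

    def window(self):
        """The 32-bit sliding window matched against the VLC tables."""
        combined = (self.upper << 32) | self.lower   # 64-bit view of both registers
        return (combined >> (32 - self.offset)) & 0xFFFFFFFF

    def consume(self, length):
        """Shift past a decoded codeword of `length` bits."""
        self.offset += length
        if self.offset >= 32:          # "carry-out": upper register fully consumed
            self.offset -= 32
            self.upper = self.lower    # lower register moves up
            self.lower = next(self.words, 0)   # load a new 32-bit data unit
```

After a 28-bit worst-case symbol is consumed, the window straddles the register boundary exactly as the carry-out mechanism in the text describes.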
The proposed VLD unit design also implements decoding of fixed-length header
information. At each cycle, the FSM needs to determine which tables, variable-length or
fixed-length, are to be used according to the current position in a bitstream. Figure 4.7
illustrates a simplified FSM for error handling and determining decoding tables. When the
VLD unit detects an invalid variable-length symbol or illegal header parameter, it will
issue an error signal to the FSM. Then, depending on the current position, the FSM
commands the Start_code Detector to look for an appropriate start-code. This start-code
searching process is an error handling mechanism that can be called re-synchronization.
The sequence_start_code, GOP_start_code, picture_start_code, and slice_start_code in the
MPEG-2 bitstream syntax are used for re-synchronization. With this VLD architecture, no
external control is needed to achieve re-synchronization, which results in minimum
recovery time from error detection.
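The re-synchronization search can be sketched as a byte-aligned scan for the 0x00 0x00 0x01 start-code prefix (a hypothetical helper; the start-code values follow the MPEG-2 syntax, where slice_start_code takes values 0x01 through 0xAF):

```python
def find_start_code(data, begin, wanted):
    """Return the byte offset of the next start code whose value byte is
    in `wanted`, or -1 if none is found before the end of the data."""
    i = begin
    while i + 3 < len(data):
        if data[i] == 0 and data[i + 1] == 0 and data[i + 2] == 1 \
                and data[i + 3] in wanted:
            return i                   # found 0x00 0x00 0x01 <wanted>
        i += 1
    return -1

SLICE_CODES = set(range(0x01, 0xB0))   # slice_start_code values in MPEG-2

# One byte of damaged data followed by a slice start code:
stream = bytes([0x12, 0x00, 0x00, 0x01, 0x05, 0xFF])
```

On an error inside a slice, the FSM would direct such a scan toward the next slice_start_code, resuming decoding without any external control.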
Figure 4.6 Block diagram of the Variable Length Decoder. (The figure shows the bitstream feeder with its 32-bit upper and lower registers and 64-to-32 barrel shifter with carry-out, the VLC tables, the header analyzer with its Start_code Detector and fixed-length coding tables, and the finite state machine, together with the decoded-data path to IZZ/IQ, the symbol-length paths to the adder, and the status paths to the FSM and microprocessor.)
Figure 4.7 The FSM for VLD processing and error handling. (The figure shows the fixed-length decoding states for the sequence, GOP, picture, and slice start codes and their header information, the variable-length decoding states for MB header information and the DCT coefficients of each block up to EOB, and, on any error, the Start_code Detector searching for the next appropriate start code before decoding resumes.)
4.5.2 Inverse Quantization Unit (IQ)
The process of IQ defined in the MPEG specification is straightforward. Therefore,
the design of the proposed IQ architecture (as shown in Figure 4.8) possesses an elegant
structure. A quantized DCT coefficient is multiplied by the quantizer step size, which is
the product of a quantizer scale (Q) and one element from the weighting matrices (W). The
user-defined weighting matrix is stored in a RAM, and the default weighting matrix is
stored in a ROM. These RAMs and ROMs are accessed in the inverse scanning order used
for the inverse-zigzag process. The inverse quantized DCT coefficients are saturated in the
range from –2048 to 2047. Then, the saturated coefficients of the block enter a correction
process called mismatch control to produce the final DCT coefficients.
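The per-coefficient datapath can be sketched as follows (a behavioral sketch with a hypothetical function name; k follows the "QF × 2 + k" stage shown in Figure 4.8, with k = 0 for intra blocks and k = sign(QF) for non-intra blocks, and the mismatch-control stage is omitted):

```python
def inverse_quantize(qf, w, quantizer_scale, intra):
    """Inverse-quantize one AC coefficient (mismatch control omitted)."""
    # k = 1, 0, or -1, as in the "QF x 2 + k" multiplier stage
    k = 0 if intra else (1 if qf > 0 else -1 if qf < 0 else 0)
    val = (2 * qf + k) * w * quantizer_scale
    f = (abs(val) // 32) * (1 if val >= 0 else -1)   # "/ 32", truncated toward zero
    return max(-2048, min(2047, f))                  # saturation stage
```

A large coefficient with a large quantizer scale lands in the saturation stage and is clipped to 2047, as the text describes.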
Figure 4.8 Block diagram of the Inverse Quantization Unit. (The figure shows the user-defined and default intra and inter weighting matrix RAMs and ROMs, accessed in inverse-scanning order and selected by the intra flag; the quantizer_scale table indexed by quantizer_scale_code and q_scale_type; the QF × 2 + k multiplier stage with k = 1, 0, or −1; the W′ × Q × QF′ / 32 product; saturation to the range −2048 to 2047; and the final mismatch-control stage.)
4.5.3 Inverse Discrete Cosine Transform Unit (IDCT)
As described in Section 2.2.5, there have been many implementation algorithms and
architectures for the DCT/IDCT process. In the proposed DVD decoder, the IDCT
architecture is based on Chen’s algorithm [Chen77] due to its regularity, its reduced
arithmetic operations, and its ability to retain accuracy for limited word lengths [Li99].
These attributes are suitable for VLSI implementation. The eight-point 1-D IDCT algorithm
is described in Eqs. 4.1 and 4.2.
\[
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
= \frac{1}{2}\left(
\begin{bmatrix} A & B & A & C \\ A & C & -A & -B \\ A & -C & -A & B \\ A & -B & A & -C \end{bmatrix}
\begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix}
+
\begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix}
\begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix}
\right) \tag{4.1}
\]

\[
\begin{bmatrix} Y_7 \\ Y_6 \\ Y_5 \\ Y_4 \end{bmatrix}
= \frac{1}{2}\left(
\begin{bmatrix} A & B & A & C \\ A & C & -A & -B \\ A & -C & -A & B \\ A & -B & A & -C \end{bmatrix}
\begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix}
-
\begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix}
\begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix}
\right) \tag{4.2}
\]

where A = cos(π⁄4), B = cos(π⁄8), C = sin(π⁄8), D = cos(π⁄16), E = cos(3π⁄16), F = sin(3π⁄16), and G = sin(π⁄16).
From the matrix operation, the eight-point IDCT results are easily obtained. Figure 4.9
shows the overall architecture for 2D-IDCT operations using row-column decomposition
from 1-D IDCTs.
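Written out directly, Eqs. 4.1 and 4.2 give the following floating-point sketch (the hardware uses fixed-point MACs and the word lengths discussed below; `idct8` is a hypothetical helper name):

```python
import math

A = math.cos(math.pi / 4);      B = math.cos(math.pi / 8)
C = math.sin(math.pi / 8);      D = math.cos(math.pi / 16)
E = math.cos(3 * math.pi / 16); F = math.sin(3 * math.pi / 16)
G = math.sin(math.pi / 16)

def idct8(x):
    """Eight-point 1-D IDCT via Chen's even/odd factorization."""
    even = [A*x[0] + B*x[2] + A*x[4] + C*x[6],    # rows of the even matrix
            A*x[0] + C*x[2] - A*x[4] - B*x[6],
            A*x[0] - C*x[2] - A*x[4] + B*x[6],
            A*x[0] - B*x[2] + A*x[4] - C*x[6]]
    odd  = [D*x[1] + E*x[3] + F*x[5] + G*x[7],    # rows of the odd matrix
            E*x[1] - G*x[3] - D*x[5] - F*x[7],
            F*x[1] - D*x[3] + G*x[5] + E*x[7],
            G*x[1] - F*x[3] + E*x[5] - D*x[7]]
    y = [0.0] * 8
    for n in range(4):
        y[n]     = (even[n] + odd[n]) / 2         # Eq. 4.1
        y[7 - n] = (even[n] - odd[n]) / 2         # Eq. 4.2
    return y
```

The butterfly at the end is what makes the factorization cheap: the four even and four odd inner products are shared between Y0–Y3 and Y7–Y4.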
The proposed IDCT architecture is implemented by a multiplier-adder architecture
[Li99], rather than a distributed arithmetic architecture that needs to be operated at a higher
clock rate than the 27 MHz clock rate of the proposed system. Furthermore, a higher clock
frequency means more register stages are required and power consumption is increased
proportionally. The cycle time at 27 MHz is enough for the data to go through a multiply-
accumulate operation. A systolic multiplier-accumulator (MAC) array is used to carry out
the IDCT implementation. Compared to a design with an SIMD (single instruction multiple
data) approach, this systolic array avoids complex wiring and results in a highly regular
96
and modular chip design. The word lengths of interconnecting buses in the IDCT
architecture are determined in order to minimize the hardware cost under the IEEE standard
accuracy test.
Figure 4.9 Block diagram of the IDCT unit and word lengths for interconnections. (The figure shows the row-wise and column-wise 1-D IDCT stages, each built from a cosine ROM, a systolic array of four MACs computing Yout = X × C + Y with one cycle of delay per MAC, an adder/shifter, and a round-and-clip stage, connected through the transpose RAM. The MAC interconnect and cosine ROM outputs are 19 and 13 bits wide in the row-wise stage and 20 and 12 bits wide in the column-wise stage, with 15-bit intermediate results and 9-bit final outputs Y0 through Y7.)
The proposed design for an eight-point 1-D IDCT uses four MACs to form a systolic
array. Data is pumped into this computing array in a special sequence on every clock cycle.
The X’s (inputs), C’s (cosine values from the cosine ROM), and 0’s form three data streams
flowing into this array, where the intermediate results of the inner products are transferred
to the next, neighboring MACs and accumulated along the MAC paths. Determination of
optimal word length for different computing stages in the IDCT operation is important in
order to minimize hardware cost while maintaining maximum accuracy to satisfy the IEEE
specification. The optimum word length determined by Kim [Kim98b] is adopted in the
proposed IDCT design. The width of the interconnection bus between MACs and the output
from the cosine ROM in the row-wise 1-D IDCT process are 19 bits wide and 13 bits wide,
respectively. For the column-wise 1-D IDCT process, the widths are 20 bits and 12 bits,
respectively. The word length of the intermediate results from the row-wise 1-D IDCT is 15
bits after the rounding and clipping process. The word length of the final results from the
IDCT unit is 9 bits.
A pair of intermediate results flows out every other cycle after the fourth cycle from
the beginning, and then is written into the transpose RAM. The transpose RAM in the
proposed design is a dual-port RAM, where reading and writing operations can be active at
the same time. For the separable 2-D IDCT implementation, the transpose RAM is used to
keep the intermediate results from the first 1-D IDCT unit, which will then be processed by
the second 1-D IDCT unit. The first 1-D IDCT unit writes to the transpose RAM in a
row/column-wise manner and the second 1-D IDCT unit reads from the transpose RAM in
the column/row-wise manner. The earliest time for the second IDCT unit to fetch data from
the transpose RAM is determined in order to reduce the total latency of 2-D IDCT. The
read-write sequence is shown in Figure 4.10. The earliest time for the second 1-D IDCT
unit to read data from the transpose RAM is the fiftieth writing cycle of the first 1-D IDCT
unit. Through correct timing and sequencing, the read/write operations for each unit are
carried out in the manner of chasing each other without destroying the data in the transpose
RAM or getting incorrect data from it. The data output timing diagram is shown in Figure
4.11. The processing time for the first block is 120 cycles, and 64 cycles for the
remaining blocks.
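The role of the transpose RAM between the two passes reduces to the following behavioral sketch (the dual-port read/write chase and the 50-cycle offset are timing details not modeled here; `transpose8x8` is a hypothetical helper):

```python
def transpose8x8(rows):
    """Read the row-wise 1-D IDCT results back in column order, which is
    what the second 1-D IDCT unit sees through the transpose RAM."""
    return [[rows[r][c] for r in range(8)] for c in range(8)]

# Row-wise results written by the first 1-D IDCT unit (illustrative values):
first_pass = [[8 * r + c for c in range(8)] for r in range(8)]
second_pass_input = transpose8x8(first_pass)
```

In the hardware, this transposition costs no extra pass: the second unit simply reads in the orthogonal direction while the first unit is still writing.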
Figure 4.10 A novel read-write sequence for the transpose RAM in the IDCT unit. (The figure shows the first 1-D IDCT unit writing row/column-wise while the second 1-D IDCT unit reads column/row-wise, the read cells of the current block and the write cells of the next block chasing each other through the RAM.)

Figure 4.11 Output timing diagram for the proposed IDCT unit. (After 4 cycles the first 1-D IDCT results begin to be written to the transpose RAM; after 50 cycles the second 1-D IDCT unit can read from the transpose RAM; 4 cycles later the first two results emerge from the second 1-D IDCT unit, and after 62 more cycles all 64 results of block 1 have been output; each subsequent block takes 64 cycles.)
4.5.4 Motion Compensation Unit (MC)
As described in Chapter 1, P- and B-pictures in an MPEG bitstream use macroblock-
based motion compensation to reduce inter-picture temporal redundancy, where motion
estimation is used to search for the spatial difference between the predicted macroblock and
the reference macroblock, and the DCT is used to compress the content difference
(prediction error). The spatial differences are specified by the motion vectors that are
coded differentially with respect to previously decoded motion vectors. A video decoder
then constructs a predicted macroblock pixel by pixel from one or two reference
macroblocks from within one or two previously decoded pictures. Figure 4.12 shows an
outline of the motion compensation process in the proposed decoder. Motion vectors (MV)
from the VLD are sent to an MV Decoder to reconstruct the original motion vectors. These
original motion vectors are then sent to an Address Generator, which generates the
physical SDRAM addresses used to fetch reference macroblocks in block-by-block order.
The reference data from SDRAM is read out to the MC unit, where temporal interpolation
and half-pixel manipulations are performed if necessary. The output from the MC unit is
combined with prediction errors from the IDCT unit in order to obtain the reconstructed
blocks, which are then sent to SDRAM for further reference or display.
Figure 4.13 illustrates the proposed architecture for the motion vector decoder. As
mentioned above, MVs are coded differentially with respect to a previously transmitted MV
called the Prediction Motion Vector (PMV). In order to decode the MVs, the MV decoder
must maintain four motion vector predictors (each with a horizontal and vertical
component) denoted PMV[r][s][t], where r represents the first or the second MV in a MB, s
represents forward or backward MV, and t represents the horizontal or vertical component.
In a straightforward process, the parameters in the bitstream such as motion_code and
motion_residual are combined to derive the differential motion vector, delta, which
must lie in the range of [low:high]. The final reconstructed motion vector, vector’[r][s][t],
is then derived for the luminance component of the MB. It is then scaled depending on the
sampling structure (e.g., 4:2:0 or 4:2:2) for each chrominance component.
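The reconstruction described above can be sketched for one vector component (following the MPEG-2 motion vector reconstruction procedure; predictor updates and chrominance scaling are omitted, and the function name is hypothetical):

```python
def decode_mv(motion_code, motion_residual, f_code, pmv):
    """Reconstruct one motion vector component from its coded parameters
    and the corresponding predictor PMV."""
    r_size = f_code - 1
    f = 1 << r_size
    high, low, rng = 16 * f - 1, -16 * f, 32 * f
    if f == 1 or motion_code == 0:
        delta = motion_code
    else:
        delta = (abs(motion_code) - 1) * f + motion_residual + 1
        if motion_code < 0:
            delta = -delta
    vector = pmv + delta
    if vector < low:       # wrap back into [low, high]
        vector += rng
    elif vector > high:
        vector -= rng
    return vector
```

The wrap-around at the end is the "if vector' < low ... if vector' > high ..." correction shown in Figure 4.13; it keeps the reconstructed vector inside the range implied by f_code.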
Figure 4.12 Outline of motion compensation. (The figure shows MV information from the VLD feeding the on-chip MV Decoder and Address Generator, the memory interface and data buffer connecting to off-chip SDRAM, and the MC input buffer, MC unit, and MC output buffer, with control signals such as MB type and half-pel flags from the microprocessor and prediction errors arriving from the IDCT unit.)
Figure 4.13 Block diagram of the Motion Vector Decoder. (The figure shows the PMV storage indexed by r, s, and t; the computation of |delta|, range, high, and low from motion_code, motion_residual, and f_code; the sign recovery from sign(motion_code); and the final wrap-around correction: if vector′ < low, then vector′ += range; if vector′ > high, then vector′ −= range.)

After all of the motion vectors present in the MB have been decoded, it is sometimes
necessary to update other motion vector predictors because, in some prediction modes,
fewer than the maximum possible number of MVs are used. The remainder of the predictors
that may be used in the following decoding process must retain "sensible" values in case
they are subsequently used.

The proposed MC unit is based on a pipelined architecture that can minimize the
hardware cost and also meet the requirements of the NTSC and PAL systems at MP@ML.
Figure 4.14 shows the proposed architecture for the motion compensation unit. There are
two data paths from the MC input buffer, one for forward prediction and one for backward
prediction, leading to the F-register set and the B-register set. Each set is 4 pixels wide and
serves as a data pool for pre-loading pixels from the MC buffer. If a motion vector has
half-pel precision, spatial interpolation is performed in add/shifter units. In the case of bi-
directional prediction, the results from both the forward prediction and backward prediction
paths are added and shifted to obtain the reference pixels. Finally, the reference pixels are
combined with the prediction errors that come from the IDCT unit to form the reconstructed
pels.
Figure 4.15 illustrates the timing diagram for the MC unit. The MC unit contains three
pipelines: a 4-stage and a 3-stage pipeline for bi-directional prediction (B-type)
macroblocks, and a 2-stage pipeline for prediction (P-type) macroblocks. For B-type
macroblock processing, the first pixel of each row in a reconstructed block is produced in
the 4-stage pipeline. The first stage of this pipeline is a loading stage which reads 4 pixels
at a time to the forward register. Likewise, the second stage is a loading stage which reads
4 pixels to the backward register. The last two stages are computing stages. The remaining
pixels of each reconstructed row are produced in the 3-stage pipeline. The first stage is the
only loading stage. It reads 2 pixels to the forward register and 2 pixels to the backward
register. There is a latency of two cycles between rows in a block. This latency results from
loading a new row of data from the MC input buffer to the F- and B-registers. In a 2-stage
pipeline for P-type macroblock processing, the first stage is for loading reference pixels
and computing half-pel precision, and the last stage is for producing the reconstructed
pixels. Because a P-type macroblock only references one frame (I- or P-picture), only one
pixel-loading stage is needed, and the bi-directional prediction computing stage can be
removed. With the pixel-level pipeline, parallelism is achieved, which provides high
throughput. The processing time for each reconstructed block is 74 cycles for B-type
macroblock data, and 65 cycles for P-type macroblock data.
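The arithmetic performed in the add/shifter units reduces to rounding-up integer averages, sketched here (hypothetical helper names; register loading and block geometry are omitted):

```python
def half_pel(a, b):
    """Half-pel interpolation between two neighboring reference pixels."""
    return (a + b + 1) >> 1            # rounding-up average

def bidirectional(fwd, bwd):
    """Average of the forward and backward predictions."""
    return (fwd + bwd + 1) >> 1

def reconstruct(pred, idct_error):
    """Combine the prediction with the IDCT prediction error."""
    return max(0, min(255, pred + idct_error))   # clip to the 8-bit pixel range
```

These three operations are exactly the computing stages of the pipelines described above: half-pel interpolation in the loading stages, bi-directional averaging in the middle stage, and reconstruction in the final stage.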
Figure 4.14 Block diagram of the MC unit. (The figure shows the two data paths from the MC input buffer through the F- and B-register sets and their 2-pixel registers into the adder/shifter units, the adder combining the prediction with the results from the IDCT unit to produce reconstructed pixels, and the control logic driven by signals from the microprocessor.)
Figure 4.15 Data processing pattern, pipeline stages, and output timing diagrams for MC processing of B- and P-type macroblocks. ((a) The block-level data processing pattern in the proposed MC unit. (b) The pipeline stages and timing for bi-directional non-intra (B-type) MB processing: the 4-stage pipeline loads 4 pixels to the forward and then the backward register while computing half-pixel precision, computes the bi-directional prediction, and reconstructs; the 3-stage pipeline loads 2 pixels to each register in a single stage. (c) The pipeline stages and timing for non-intra (P-type) MB processing: 4 or 2 pixels are loaded to the F-register while computing half-pixel precision, then one row of data is reconstructed.)
4.6 Display Model
In the DVD format, the restrictive GOP sequence (compared to MPEG-2) is an
IBBPBBPBBP…. sequence. However, the picture order for decoding is different from the
order for displaying. Figure 4.16 illustrates the picture order relationship. The first I
picture, I1, is decoded, followed by P4, B2, then B3. The display order is I1, B2, B3, P4, etc.
As described in Section 3.4, there is one extra picture space in the frame buffer to store B-
pictures for display in addition to I- and P-pictures for prediction. The extra space allows
for decoupling of decoding and displaying.
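The reordering between the two sequences can be sketched as follows (a hypothetical helper: B-pictures are displayed as soon as they are decoded, while each I- or P-picture is held back until the next reference picture arrives):

```python
def display_order(decoding_order):
    """Convert MPEG decoding order to display order."""
    out, held = [], None
    for pic in decoding_order:
        if pic.startswith("B"):
            out.append(pic)            # B-pictures display immediately
        else:
            if held is not None:       # I/P: emit the previously held reference
                out.append(held)
            held = pic                 # hold this reference for later
    if held is not None:
        out.append(held)
    return out
```

Applied to the decoding order of Figure 4.16, this reproduces the display order given in the text.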
In NTSC format, the display rate is a constant 30 pictures per second, so the display
time of one frame (T) is fixed at 33 ms. However, the decoding time varies according to the
picture type and characteristics. The decoding time for a B-picture is the longest, followed
by that for a P- and then an I-picture. During decoding of a picture, decoding time
sometimes exceeds the real-time decoding constraint of 33 ms. This is usually caused either
by long duration due to page-breaks associated with transferring large quantities of
reference data for motion compensation, or caused by delays from processes requesting
DRAM access due to various internal buffer underflow/overflow conditions occurring at the
same time. In these cases, many macroblocks in a picture require more than 667 cycles
to process. A recovery mechanism must be adopted in order to guarantee a smooth display.
There are two common mechanisms [Stei96]: dropping the picture currently being decoded
and continuing with the next picture, or dropping the picture currently being decoded and
repeating the previously decoded picture.
A new recovery mechanism is proposed here. Taking advantage of the DVD format,
where two B pictures occur between I- and P-pictures, or between two P-pictures, the
decoding and display order can be synchronized with a single set of three pictures. The real
time decoding constraint changes to tP4 + tB2 + tB3 < 3T instead of tP4 < T, tB2 < T, and tB3 < T
(where tP4, tB2 and tB3 refer to the decoding times for the P4, B2 and B3 frames,
respectively). If a picture exceeds the real-time decoding constraint, this overhead can be
easily absorbed into the time left over from decoding the other two pictures with this
scheme. With the two conventional recovery mechanisms, when the decoding time for a P-
picture exceeds 33 ms, the P-picture may be dropped, degrading the quality of the two
following B-pictures. However, with the proposed recovery mechanism, the excess time for
decoding a P-picture may be absorbed by decoding the two following B-pictures. Even if
the excess time cannot be absorbed because the decoding tasks for the following two B-
pictures are also heavy, only the last B-picture is dropped and the overall video quality can
be kept higher.
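The proposed three-picture budget can be sketched as a simple scheduling decision (hypothetical helper; times in milliseconds, with T = 33 ms as in the text):

```python
T = 33.0  # ms per displayed picture (NTSC)

def schedule_triple(t_p, t_b1, t_b2):
    """Decide which pictures of one P/B/B set can be displayed.

    The set shares a combined budget of 3T; only when the budget is
    exceeded is the last B-picture dropped.
    """
    if t_p + t_b1 + t_b2 <= 3 * T:
        return ["P", "B1", "B2"]       # overrun absorbed within the set
    return ["P", "B1"]                 # drop only the last B-picture
```

A P-picture that overruns its 33 ms slot (e.g., 40 ms) is still displayed as long as the two following B-pictures leave enough slack, whereas the conventional mechanisms would have dropped it immediately.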
Figure 4.16 Timing diagram of display order, decoding order, and the proposed recovery mechanism. (Display order: I1, B2, B3, P4, B5, B6, P7, …; decoding order: I1, P4, B2, B3, P7, B5, B6, …; the decoding times tP4, tB2, and tB3 share the combined budget 3T, where T is the display time of one frame.)
4.7 Performance Simulation Model
After the architecture design is completed, a performance simulation model is
required in order to evaluate whole system performance. Performance metrics include clock
rate determination, various buffer usage statistics, buffer overflow/underflow conditions,
data bus task scheduling determination, bus bandwidth utilization, hardware/software
module utilization, and excessive decoding time frequency analysis. Good performance on
many of these metrics is necessary to ensure smooth display appearance. The simulation
model should be simple to build, flexible for changing simulation targets, and fast in
showing results. Thus, designers can quickly adjust architecture design and resolve
performance issues in an early design stage.
As described in Section 4.2, a real-time decoding system is complex, with many
mutually interacting design factors. If the performance analysis is
performed with paper and pencil, the results are neither precise nor reliable. If the
simulation is done at the RT level in an HDL, the simulation process is very slow and costly,
and usually focuses on verifying the performance of hardware modules themselves, not the
performance of the whole system.
The proposed performance simulation model is a C-code software simulator. This
simulator monitors and operates the decoding process at the level of block data transfers
according to the proposed architecture and controller scheme that are described above.
Figure 4.17 shows the block diagram of the proposed simulation model. It actually parses
bitstreams to build an accurate timing model for data dependent operations such as
processing time of VLD and IQ/IZZ units. From the proposed architecture configuration of
the VLD and IQ/IZZ units and the amount of compressed data in each macroblock, the
decoding cycle time used by the two functional units can be measured. Some functional
units, such as IDCT and MC, consume a fixed processing cycle time for processing data,
which is independent of the actual amount of data in a macroblock. These modules are
assigned fixed delay times that are estimated from the corresponding architecture designs
and timing diagrams that have been described above.
Figure 4.17 Processing diagram of the proposed DVD video decoder performance simulation model. (The figure shows the input bitstream feeding the VLD & IQ/IZZ, IDCT, and MC unit models, which report their decoding cycles to a decoding cycle counter; an internal buffer space/capacity monitor for the VLD, IQ, IDCT, and MC buffers issuing buffer-refilling requests; and an SDRAM data transfer cycle counter driven by the bus arbiter, the SDRAM configuration, the data storage structure, codeword lengths, and MV information.)
Data, including picture types, motion vector information, and coded DCT
coefficients, is precisely interpreted during bitstream parsing. The reason is that some
information, such as picture types and the coded EOB symbol, is critical to governing the
operation of the decoder under the BLP scheme, as described in Sections 3.3.1 and 4.3. On
the other hand, some information helps in constructing an accurate timing model of data
transfer between the video decoder and external DRAM. An example of this is motion
vectors. The transfer cycle time (including page-break and SDRAM refresh latencies) used
for loading the reference macroblocks for the MC process can also be measured for a given
SDRAM configuration, storage organization of the frame buffer, and decoded motion
vectors.
In addition to processing time estimation for data routing among functional units and
utilization estimation for data bus bandwidth, this simulation model also provides a means
to verify the space sufficiency of various internal buffers, and the efficiency of bus access
scheduling. For example, the bitstream parser of the simulation model can monitor, for a
given VLD buffer size, the extra buffer-refilling requests issued during decoding of each
macroblock. The decoder performance delay resulting from these refilling requests can be
estimated by combining analyses of the information from data bus access scheduling and
information from the data transfer timing model.
In real hardware implementation, the activities of all functional units and memory
accesses are in parallel. However, the C programming language is sequential. Hence, one of
the greatest difficulties in the design of this simulation model is simulating this parallel
working style with a sequential tool. Another design option for the model is to use an HDL
language, Verilog or VHDL. One advantage of these HDL languages is that they already
contain the hardware parallel mechanisms. But, in the initial design stage, engineers may
want to evaluate different algorithms or implementation approaches for each functional unit
under different system environments, such as different internal buffer sizes or memory
configurations. Under these complicated conditions, HDL languages will require longer
coding time than C language, especially when each implementation approach is not yet
fully understood. Therefore, the performance simulation model uses C language not only to
provide extremely fast simulation (several frames a second), but also to allow designers to
quickly obtain estimations for such critical system level issues as clock rate, data bus
width, buffer sizes, and storage structures. These estimations can help designers to make
design trade-offs.
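The core accounting idea — simulating parallel units with a sequential program — can be sketched in a few lines (shown in Python for brevity, though the proposed simulator itself is written in C; the cycle counts are illustrative values taken from the timing discussion above, and the real model also tracks buffer fullness and DRAM latencies):

```python
def step(clock, unit_cycles):
    """Advance the global cycle counter by the slowest unit in this
    pipeline step, modeling units that actually run in parallel."""
    return clock + max(unit_cycles.values())

clock = 0
# One pipeline step per block: the units run concurrently in hardware,
# so the step costs only as much as its slowest unit.
clock = step(clock, {"VLD+IQ/IZZ": 72, "IDCT": 64, "MC": 74})  # one block
clock = step(clock, {"VLD+IQ/IZZ": 66, "IDCT": 64, "MC": 65})  # next block
```

Advancing a single global counter by the maximum over concurrent activities is what lets a sequential language reproduce the timing of parallel hardware without an HDL's event scheduler.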
4.8 Simulation Results
A 2MByte 81 MHz SDRAM and a 27 MHz video decoder processing rate are used in
the proposed performance simulation model. Input bitstreams are simulated at 10 Mbps, a
real-world worst case. The sizes of internal buffers adopted in the performance simulation
are listed in Table 4.2. The detailed analyses for determining different buffer sizes are
presented in Section 3.5.3. The decoder performance is evaluated with various sizes of VLD
buffer.
Table 4.2 Sizes of internal buffers adopted for the simulation model for the proposed DVD architecture

    Bitstream FIFO       48 bytes
    VLD buffer           16, 24, or 40 bytes
    IQ/IZZ buffer        128 bytes
    IDCT buffer          72 bytes
    MC buffer            298 bytes
    Write-back buffer    128 bytes
    Display buffer       192 bytes
The tested bitstream, Mobile.m2v at MP@ML, has 150 frames and each frame has
1320 macroblocks. This movie has 11 I frames, 40 P frames, and 99 B frames. Figure 4.18
shows the decoding timing diagrams for I-, P-, and B-type macroblocks produced by the
proposed performance simulation model.
[Figure 4.18 Timing diagrams for I-, P-, and B-type macroblocks, as produced by the performance simulation model. (a) Decoding the 61st MB in picture 0 (I-picture); compressed data per block: block 0: 146 bits, block 1: 60 bits, block 2: 126 bits, block 3: 119 bits, block 4: 19 bits, block 5: 27 bits. (b) Decoding the 61st MB in picture 1 (P-picture); compressed data per block: block 0: 86 bits, block 1: 182 bits, block 2: 95 bits, block 3: 191 bits, block 4: 41 bits, block 5: 19 bits. (c) Decoding the 22nd MB in picture 2 (B-picture); compressed data per block: block 0: 90 bits, block 1: 71 bits, block 2: 145 bits, block 3: 119 bits, block 4: 4 bits, block 5: 0 bits. Each diagram plots the activity of the MV decoder, VLD, IZZ/IQ, IDCT, MC, and DRAM-access units over time, including VLD buffer refills, compressed bitstream writes to the VBV buffer, display buffer reads, motion vector calculation and reference block address generation, and macroblock header decoding. Rn: reading reference blocks for MC of block n; Wn: writing decoded block n to the frame buffer; *: the processing time of each block in a functional unit depends on the amount of coded data, algorithms, and architecture design.]
Simulation results, such as average decoding cycles per macroblock and bus
utilizations under different VLD buffer sizes, are shown in Table 4.3. The results clearly
show that even with a VLD buffer of only 16 bytes, bus utilization stays at or below 0.85
for all three types of pictures, because the BLP controller can efficiently arrange I/O
tasks. From a system design viewpoint, the data bus utilization of the video portion should
remain below 0.9, since the remaining 0.1 must be reserved for audio and other system data
transfers. Average decoding cycles are well below the 667-cycle upper bound for real-time
decoding. Less than 1% of the macroblocks exceed 667 decoding cycles, and these are
easily absorbed into the time left over from other, less processing-intensive macroblocks,
so they cause no delay in real-time decoding. Further simulation shows that
none of the frames take more than 30 msec to decode. The simulation also shows that the
Bitstream FIFO buffer does not overflow at 48 bytes.
Another test bitstream, Gi_bitstream, is also used to test the robustness of the
proposed decoder architecture, even though its input rate of 15 Mbps far exceeds the upper
bound of the DVD specification. For B frames, this bitstream consists entirely of predicted
macroblocks, unlike those of Mobile.m2v. This bitstream composition means that the
motion compensation process is needed for every macroblock (there is no SKIP mode);
hence, this bitstream can provide a strenuous test of the efficiency of the proposed bus
interface and scheduling scheme. On the other hand, the amount of compressed data in each
macroblock is smaller than that in Mobile.m2v. Less data means there are fewer requests
for extra VLD buffer refilling. The results are shown in Table 4.4.
                                I picture      P picture      B picture
                                Avg.   Max.    Avg.   Max.    Avg.   Max.
VLD buffer: 40 bytes
  Macroblock decoding cycles    565    742     545    753     550    750
  Avg. bus utilization          0.42           0.71           0.84
  Avg. compressed data in Bitstream FIFO while decoding one MB: 30 bytes
  Macroblocks exceeding 667 decoding cycles: I 0.04 %, P 0.17 %, B 0.03 %
VLD buffer: 24 bytes
  Macroblock decoding cycles    569    903     551    830     552    853
  Avg. bus utilization          0.42           0.71           0.84
  Avg. compressed data in Bitstream FIFO while decoding one MB: 30 bytes
  Macroblocks exceeding 667 decoding cycles: I 0.07 %, P 0.65 %, B 0.21 %
VLD buffer: 16 bytes
  Macroblock decoding cycles    577    923     558    862     555    882
  Avg. bus utilization          0.45           0.72           0.85
  Avg. compressed data in Bitstream FIFO while decoding one MB: 30 bytes
  Macroblocks exceeding 667 decoding cycles: I 0.11 %, P 1.30 %, B 0.49 %

Table 4.3 Number of decoding cycles per macroblock and bus utilization under different VLD buffer sizes: Mobile.m2v bitstream @ 10 Mbps
                                I picture      P picture      B picture
                                Avg.   Max.    Avg.   Max.    Avg.   Max.
VLD buffer: 40 bytes
  Macroblock decoding cycles    561    981     555    950     556    760
  Avg. bus utilization          0.44           0.72           0.85
  Avg. compressed data in Bitstream FIFO while decoding one MB: 44 bytes
  Macroblocks exceeding 667 decoding cycles: I 0.60 %, P 0.90 %, B 0.03 %
VLD buffer: 24 bytes
  Macroblock decoding cycles    580    1056    567    1000    558    854
  Avg. bus utilization          0.48           0.73           0.86
  Avg. compressed data in Bitstream FIFO while decoding one MB: 44 bytes
  Macroblocks exceeding 667 decoding cycles: I 0.80 %, P 2.50 %, B 0.40 %
VLD buffer: 16 bytes
  Macroblock decoding cycles    594    1211    581    1101    563    899
  Avg. bus utilization          0.50           0.74           0.86
  Avg. compressed data in Bitstream FIFO while decoding one MB: 46 bytes
  Macroblocks exceeding 667 decoding cycles: I 0.90 %, P 5.00 %, B 1.10 %

Table 4.4 Number of decoding cycles per macroblock and bus utilization under different VLD buffer sizes: Gi_bitstream.m2v @ 15 Mbps
The results again show that even with a VLD buffer of only 16 bytes, bus utilization
stays at or below 0.86 for all three types of pictures. Although about 1% of the
macroblocks in the worst-case B frames exceed the 667-cycle limit, this overhead is easily
absorbed into the time left over from other, less processing-intensive macroblocks: the
maximum decoding time for the worst-case B frames is only about 32 msec, just under the
33 msec real-time decoding limit, so real-time display suffers no delay.
One of the important advantages of the BLP scheme is the reduction in power
consumption achieved through smaller internal buffer sizes and a smaller bus width. The
specifications of the proposed video decoder are summarized in Table 4.5. This decoder is
implemented in a 0.25 µm triple-metal CMOS process. The table also lists two other
MP@ML video decoder designs, both using macroblock-level processing. The advantages
of the block-level scheme, a small transistor count of about 1 million and low power
consumption of about 800 mW, are readily apparent.
                        Proposed BLP decoder        Yasuda's [Yasu97]   Uramoto's [Uram95]
Technology              0.25 µm CMOS                0.35 µm CMOS        0.5 µm CMOS
Transistor count        About 1 million             About 1.8 million   About 1.2 million
                        (including test circuit)
Video processing rate   27 MHz                      54 MHz              27 MHz
Power supply            3.3 V                       3.3 V               3.3 V
Power consumption       About 800 mW                About 1 W           About 1.4 W

Table 4.5 Comparison of the proposed MPEG-2 MP@ML video decoder LSI and other video decoder designs using macroblock level processing
CHAPTER FIVE
Processing and Storage Models for MPEG-2 MP@HL Video Decoding — Review of Prior Art
5.1 Introduction
Today, television provides more than just entertainment; it also supports applications
such as video conferencing, desktop video, telemedicine, and distance learning. These
newer applications and viewing habits are exposing various limitations of the current
television system. The Advisory Committee on Advanced Television Service (ACATS) was
therefore formed in 1987 to advise the United States Federal Communications Commission
(FCC) on the technology and systems suitable for delivering high definition television
(HDTV) service over terrestrial broadcast channels. Twenty-three systems were proposed
to ACATS; after rigorous tests, only 5 proposals survived [Chal95]. The proponents of
these proposals eventually agreed to combine their efforts, and the resulting organization
was called the "Digital HDTV Grand Alliance". The GA-HDTV system is based on the
MPEG-2 MP@HL video compression standard with several enhancements to the encoder.
Its MPEG compatibility has led this HDTV video decoder standard to be adopted not just
in North America, but in most countries of the world.
5.2 Overview of the Grand Alliance HDTV System
The Technical Subgroup of the ACATS has approved the specifications of the Grand
Alliance HDTV system [Grand94]. The input video conforms to SMPTE proposed standards
for the 1920x1080 system or the 1280x720 system. In either case, the number of horizontal
picture elements, 1920 or 1280, results in square pixels because the aspect ratio is 16:9.
With 1080 active lines, the display rate can be 60 fields per second with interlaced scan, or
30 frames or 24 frames per second with progressive scan. With 720 active lines, the display
rate can be 60, 30, or 24 frames per second with progressive scan. Video compression is
accomplished in accordance with the MPEG-2 MP@HL video standard. The reason for
adopting the MPEG-2 syntax is that it has been accepted worldwide, which can smooth the
path of the HDTV specification toward computer and multimedia compatibility. Audio
compression is accomplished using the AC-3 system [ATSC95], which includes full
surround sound.
The video and audio encoder output is packetized in variable-length packets of data
called Packetized Elementary Stream (PES) packets. The video and audio PES packets are
presented to a multiplexer. The output of the multiplexer is a stream of fixed-length 188-
byte MPEG-2 Transport Stream packets, as shown in Figure 5.1. At the receiver side, a
demultiplexer sorts the encoded video and audio data to the video and audio decoders. The
video decoding procedures are the same as those described in Chapter 1. A summary of the
video specifications of the Grand Alliance HDTV system is listed in Table 5.1 [Hopk94].
[Figure 5.1 Transport packet format: a stream of fixed-length 188-byte packets carrying video, audio, and auxiliary data. Each packet has a 4-byte header (packet synchronization, type of data in packet, packet loss/misordering protection, encryption control, optional priority) and a 184-byte payload that may begin with a variable-length adaptation header (time synchronization, media synchronization, random-access flag, bit-stream splice point flag).]
Video Parameters            Format 1                      Format 2
Active pixels               1280 (H) x 720 (V)            1920 (H) x 1080 (V)
Total samples               1600 (H) x 787 (V)            2200 (H) x 1125 (V)
Frame rate                  60 Hz progressive /           60 Hz interlaced /
                            30 Hz progressive /           30 Hz progressive /
                            24 Hz progressive             24 Hz progressive
Chrominance sampling        4:2:0
Aspect ratio                16:9
Data rate                   Selected fixed rate (10 - 45 Mbits/sec) / Variable
Picture coding type         Intra coded (I) / Predictive coded (P) /
                            Bi-directionally predictive coded (B)
Picture structure           Frame                         Frame / Field (interlaced only)
Coefficient scan pattern    Zigzag                        Zigzag / Alternate zigzag
DCT modes                   Frame                         Frame / Field (interlaced only)
Motion compensation modes   Frame                         Frame / Field (interlaced only) /
                                                          Dual Prime (interlaced only)
Motion vector precision     1/2 pixel precision
DC coefficient precision    8 bits / 9 bits / 10 bits
Film mode processing        Automated 3:2 pulldown detection and coding
Maximum VBV buffer size     8 Mbits
Intra / Inter quantization  Downloadable matrices (scene dependent)
VLC coding                  Separate intra and inter run-length tables

Table 5.1 GA-HDTV video parameters summary
5.3 Review of Related Work
5.3.1 Processing Model
Compared to standard definition television (SDTV), digital HDTV provides
significantly better visual and audio resolution at the expense of higher bandwidth
requirement and decoder cost. One of the most commonly adopted HDTV formats is
1920x1080, interlaced at 30 frames per second. MPEG-2 MP@HL video decoding therefore
involves processing roughly six times as much data as MP@ML decoding. There exist two
common design approaches to meet this high computational requirement.
[Figure 5.2 Structure of a video decoding approach using the slice-level scheme: a system controller distributes an HDTV video picture to several video decoder modules, each consisting of a FIFO, a video decoder, and a local memory.]
The first approach is a multiple-decoder design where multiple MP@ML video
decoders are used for decoding the data of one video picture and each decoder is
responsible for decoding multiple macroblock slices of a picture [Cugn95, Lee96, Duar97,
Yu98]. This design approach can be described as a slice-level processing scheme between
decoding engines with a macroblock-level decoding scheme in each decoding engine, as
described in Chapter 2. A simplified architectural configuration of this decoding scheme is
shown in Figure 5.2. The main disadvantage of this design approach is the large FIFO
buffer needed in each decoding engine for storing the compressed video data of one or more
slice bars. Likewise, each decoding engine needs a large local memory for storing reference
macroblocks in order to reduce bus contention. Additionally, a sophisticated system
controller is needed for synchronizing internal processor communication and memory I/O
scheduling. All of these requirements increase decoder die size and power consumption.
The second approach is a single-decoder design. Within the single decoder, there are
two processing models: macroblock-level-pipeline architecture, and dataflow architecture.
In the macroblock-level-pipeline processing model [Masa95, Sita98, Deis98,
Yama01], each functional unit has to be redesigned to have a higher data processing rate
than that in a regular MP@ML decoder. Moreover, Sita’s macroblock-level-pipeline design
is different from others using the same model. In his proposal, the MB-level-pipeline is
implemented with a concurrent VLD that can decode two source symbols at a time and with
dual decoding paths that are each composed of an IQ, an IDCT, and an MC functional unit.
When decoding a macroblock, the even and odd pixels are separately decoded into either
decoding path. The even-odd separation enables the two IDCT units to be implemented as
two independent partial IDCTs, each of which is realized using distributed arithmetic
techniques; and the IDCT units are followed by a special reformatter memory that combines
the even and odd pixels into a macroblock and then produces two parallel outputs in a 16x8
format for motion compensation processing.
In the other single-decoder design, data flow architecture [Kim98a, Wang01b], the
basic idea is that the operation of each functional unit for each macroblock is automatically
executed as soon as all the needed data have arrived. Data availability detection is done by
comparing the data tags that indicate which current data belongs to which macroblock.
Although the data flow architecture can eliminate the need for a central controller, every
functional unit introduces an extra delay for tag matching [Hwan93]. Also, a local finite-
state-machine logic circuit and a buffer-status-checking logic circuit must be associated
with each functional unit in order to synchronize the operations among functional units.
No matter which of the three processing models is adopted, each functional unit requires
complex algorithms and elaborate architecture design in order to reach the high computing
power requirement.
5.3.2 Memory Storage Organization and Interface
Due to the demands of high-resolution pictures, the memory bandwidth required in
the MPEG-2 MP@HL decoding system is about 720 Mbytes/sec [Kim98a]. To relieve this
serious memory bus traffic, a dual-memory-bus design is an almost unavoidable choice,
with either dual 64-bit buses [Onoy96, Duar97], dual 32-bit buses [Yama01], or a 64-bit
bus coupled with a 32-bit bus [Kim98a].
Some designers have used a kind of downscaling algorithm to reduce the frame size
so that they can still adopt a single-bus solution [Sita98, Peng99]. The idea is that the
reference frame size is decimated in the horizontal lines so that the bandwidth requirement
of the motion compensation process can be lowered. However, this approach requires an
additional, complicated up-scaling process in the MC unit or the display path.
Figure 5.3 shows examples of dual-bus design with dual port memory devices. The
pixel data from forward and backward reference pictures is transferred to the MC via a pair
of bi-directionally accessed buses followed by a pair of MC buffers.
[Figure 5.3 Examples of dual memory bus interfaces and the corresponding data storage structures. (a) Top-field and bottom-field video data accessed by separate 32-bit buses; (b) luma (Y) and chroma (Cb, Cr) video data accessed by a 64-bit and a 32-bit bus. In both arrangements, two off-chip frame memories feed the on-chip MC unit through pixel I/O buffers under an I/O controller.]
To reduce the probability of page breaks and thereby maintain high-speed data
transfer, all the video data storage structures described in Section 2.2.2 are also
adopted in HDTV decoder designs. Some designers take advantage of dual bus
architecture to make both buses simultaneously load top-field and bottom-field video data
[Onoy96, Duar97], as shown in Figure 5.3 (a), or simultaneously load luma and chroma
video data [Kim98a, Yama01], as shown in Figure 5.3 (b).
5.3.3 External Memory Access Scheduling
The fixed-priority schemes that are outlined in Chapter 2 are also commonly used in
the HDTV application [Lee96, Bruni98, Duar99]. However, memory access scheduling for
multiple decoding paths, as described above, is more complicated than for a single
decoding path, because the multiple-path organization is a kind of MIMD (Multiple
Instruction, Multiple Data) scheme [Flynn66]. Pure MIMD architectures provide a private
control unit and data memory for each data path, so there is no performance degradation
due to a limited number of instructions or data issued in parallel. In reality, because of
cost considerations, multiple decoding paths need to share an external DRAM that stores
compressed bitstreams, reference frames, and display frames. Therefore, the implicit
asynchronous nature of MIMD increases the difficulty of developing a memory accessing
arbiter. Duardo’s research has shown that distributed internal storage of sufficient size is
needed for buffering data to keep processes active until the SDRAM bus is granted. And, a
careful priority-order selection can make some contribution to minimizing the storage size
requirement for each functional unit [Duar97]. But such priority-order selection again
increases the complexity of the arbiter.
A dynamic memory arbitration mechanism has been developed in order to reorder
memory access sequences dynamically to avoid page breaks and latency of read/write
switches [Taki01]. Figure 5.4 shows reordering of memory accesses can minimize access
overhead for different functional units. The basic strategy of this mechanism is to assign
multi-level priority to each memory access task.
[Figure 5.4 Reordering memory access sequences to avoid page breaks and the latency of read/write switches. Same-bank access reordering groups accesses to the same bank to remove page breaks, and same read/write access reordering groups reads with reads and writes with writes to remove read/write switch latency; in each case the access pattern improves from Case 1 (worst) to Case 3 (best).]
The priority level of the current task will rise when the prior task addressed a
different bank from the current task, and/or the prior task has the same read/write access as
the current task. However, this priority level is also affected by the waiting duration for
accessing memory in order to avoid overflow or underflow of the corresponding buffer.
Obviously this bus scheduling mechanism is not easy to implement because the decoder
needs to be equipped with a sophisticated timer. Alternatively, one could use larger buffers
to avoid overflow/underflow conditions. Furthermore, the access sequences sometimes
cannot be changed because the decoding process must follow a fixed sequence, for
example, in MC processing, when the write-back data task must come after the read
reference data task.
5.3.4 Variable-Length Decoder (VLD)
In the HDTV application, the largest picture size is 1920 pixels/line x 1080
lines/frame = 2,073,600 pixels. When the frame-display rate is 30 frames/sec, the video
decoder must be able to output 62.2M pixels per second. Thus, for the 4:2:0 format, the
maximum throughput for the VLD-IQ/IZZ decoder is 93.3M symbols per second. In
practice, additional time is required for video syntax parsing and control information
decoding from the input bitstream. Also, the idle time of the VLD for globally
synchronizing with other functional units in the video decoder also needs to be considered.
Based on these requirements, a minimum sustained VLD-IQ/IZZ decode rate for HDTV
could be 100M symbols per second [Park95, Sita98]. The equivalent average throughput is
about 285Mbits/sec assuming that the average codeword length of the source symbols is
about 2.85 bits, which has been discussed in Chapter 2. Therefore, a VLC decoder should
be constructed to have a 100M symbols per second minimum capability.
Based on the above estimation, many designers have suggested the importance of
concurrent decoding for the VLC decoder, in which it can decode two source symbols at a
time under best-case conditions [Lin92, Hsieh96, Bae98, Sita98]. As discussed in
Chapter 1, VLC coding is based on the probability distributions of the input source symbols:
more common data receives shorter codewords, and shorter codewords therefore
appear more often. Exploiting this property, concurrent VLC decoding algorithms match
two or more short codewords concurrently in order to speed up the decoding process.
Figure 5.6 presents a simplified concurrent-decoding VLD architecture diagram designed
by Hsieh and Kim [Hsieh96]. The basic operation is shown in Figure 5.5. It is a general
two-level codeword matching tree for decoding two codewords concurrently. Each level in
the tree represents one codeword to be decoded. At the first level, the decoding process is
the same as regular bit-pattern matching. Thus, the input length at the first level is the
maximum length in a codebook (k bits in Figure 5.5). The decoding path for the second
level of matching is chosen by following the matched node of the first level. The matching
length range for the second level is determined by the tradeoff between system performance
and hardware cost. Usually the codes with shorter length (higher probability) are preferred
in order to minimize extra silicon area. Other concurrent-decoding VLDs adopt similar
methods with different grouping approaches. Figure 5.6 also illustrates the main drawback
of concurrent-decoding VLD architectures: the extra silicon area consumed by the second-
level matching blocks.
[Figure 5.5 Codeword-length tree for two-level concurrent decoding: the first-level matching covers up to the maximum codeword length k in the codebook, and the matched node selects the decoding path for the second-level matching; the number in each node is the codeword length. (Source: [Hsieh96])]
[Figure 5.6 Architecture diagram for the two-level concurrent-decoding VLD: an alignment buffer feeds a first-level matching block for 2-bit to 16-bit codewords and several second-level matching blocks for 2-bit to 10-bit codewords operating on shifted bit windows; combinational logic and barrel shifters combine the group and length data of both levels into the overall shifting length, and symbol RAMs with buffers output the decoded symbols. (Source: [Hsieh96])]
In addition to the concurrent decoding approach, there exist improved parallel
decoding approaches that are based on Lei and Sun’s design, which has been discussed in
Chapter 2 [Park95, Wei95]. The improvements are made in algorithms and circuit
implementation in order to raise the decoding throughput so that it can support an HDTV
rate.
5.3.5 Inverse Discrete Cosine Transform (IDCT)
In the MPEG video decoder system, the IDCT process, which requires many
multiplications and additions, continuously attracts considerable research attention aimed
at reducing its calculation complexity. In the HDTV application, the IDCT process faces
two additional technical challenges: the high input rate for pixel processing (a minimum of
100M symbols/sec, as described in the above section) and the rigorous computational
accuracy requirement. Most of the related research focuses on advanced algorithm
improvements and corresponding architecture implementations for handling the most
stringent part of the high processing rate requirement of the HDTV application. In this
section, some advanced IDCT technologies for the HDTV application are reviewed.
The most commonly proposed high-speed IDCT implementations are based on a
distributed arithmetic Look-Up-Table (LUT) architecture which can result in accurate,
high-speed performance. Sun et al. were among the first to present a DCT implementation
based on this architecture [Sun87]. Uramoto et al. developed a fast algorithm along with
the distributed arithmetic LUT technique to achieve a 100 MHz processing rate [Uram92].
Matsui et al. also used this technique with low-swing differential logic to achieve higher
speed (200 MHz) and smaller size [Mats94]. Choi et al. proposed a new distributed
arithmetic architecture featuring a predefined addition architecture that exploits the multi-
bit coding of the IDCT cosine weighting matrix coefficients in order to eliminate the need
of LUT ROMs [Choi97]. These ROMs have been replaced by hardwired matrix
multiplication. Hence, the silicon area of the IDCT core can be reduced by almost 40%
relative to other distributed arithmetic architectures, and the processing speed can be
pushed to 400M pixels per second.
In the traditional design, the IDCT process cannot begin calculation until the whole
block of DCT coefficients is available. Moreover, the traditional IDCT process must still
process zero-value coefficients and cannot exploit them to reduce its computation. Yang et
al. developed a novel direct 2-D IDCT algorithm and architecture based on coefficient-by-
coefficient implementation for HDTV receivers [Yang95], which eliminates these
disadvantages. The algorithm divides the original 8x8 cosine weighting matrix into four
4x4 matrices; each DCT coefficient decoded by the VLD unit is sent to one of four
processing kernel units according to its run-length information. The throughput of this
algorithm depends on the number of non-zero DCT coefficients, but its average
performance is about six times higher than that of row-column based methods. Its
architecture, however, is more complicated; for example, it needs eighty adders, about six
times more than row-column based methods.
The above research focuses only on performance improvements that allow a single
IDCT unit to reach the high throughput requirement of HDTV applications. From the
viewpoint of the whole decoder design, however, each functional unit along a decoding
path should have a relatively similar processing capacity. If only one or two functional
units have outstanding processing capacity, they do not benefit decoder performance, and
designers must still make extra efforts to smooth the processing flow (usually by increasing
internal buffer sizes and adopting a sophisticated controlling scheme).
5.3.6 Motion Compensator (MC)
Just as the designs of other functional units had to be enhanced for HDTV
applications, the MC design must have an increased data processing rate in order to
produce decoded video pictures in real time. Figure 5.7 shows a high-throughput MC
architecture [Masa95].
[Figure 5.7 Block diagram of the Masaki motion compensator architecture. (a) Outline of the motion compensator: a motion vector decoder and a predicted picture generator built from four pixel generators, fed from the off-chip frame memory through an on-chip pixel buffer. (b) Block diagram of the pipelined pixel generator: three pipeline stages in which half-pel manipulators process the forward and backward prediction pixels, and a 4-input Wallace-tree adder with CSA/CPA stages combines them with the prediction error from the IDCT.]
This MC design is constructed of four pixel generators for decoding four pixels in
parallel. In this design, the high processing rate is clearly achieved at the expense of gate
count. This design does not take advantage of reusing input pixels for reconstructing
decoded pixels.
5.4 Motivations and Challenge
The HDTV video decoder research reviewed above can be grouped into two distinct
decoding approaches: the multiple-decoder approach and the single-decoder approach.
In the multiple-decoder approach, each picture is divided into a number of horizontal
bars by a bitstream parser using the slice layer syntax of the MPEG-2 standard. The bars
are then dispatched to multiple decoding engines. The engines are either sets of MPEG-2
MP@ML ASIC functional units or programmable digital signal processors (DSPs). With
either ASIC design or DSP design, every decoding engine processes data in the
macroblock-level processing model. The four shortcomings that emerge from this approach
are listed below [Ling03].
1. Memory I/O contention: Every decoding engine needs to access external memory for
motion compensation. This causes serious memory bus contention and increases
decoding delay.
2. Computation load balancing: For the sake of performance, each decoding engine should
be kept as busy as possible. However, the processing time for slice bars can vary
significantly. It is difficult to balance the computational load among these decoding
engines.
3. Synchronization problem: The tasks in each decoding engine are executed
independently and in parallel. Additional overhead is necessary for the sophisticated
system controller that needs to synchronize inter-processor communication and memory
I/O scheduling when an interrupt occurs.
4. High local memory and embedded buffer requirement: A large FIFO buffer is needed in
each decoding engine to store the compressed video data of a slice bar. Likewise, each
decoding engine needs a large local memory for storing reference macroblocks in order
to reduce bus contention.
In the single-decoder approach, two decoding architectures have been adopted: a
modified macroblock-level-pipeline architecture and a dataflow architecture. As described
above, complex hardware and firmware design is involved in either architecture. These
sophisticated designs negatively impact engineering design cycles and manufacturing costs.
Compressed video streams for both DVD and HDTV applications are in the MPEG-2
format. The only differences are input bit rates and picture resolutions. Therefore, if the
video decoder design for DVD applications can be re-used for HDTV applications, design
cycles and costs can be reduced. But, as described in Section 5.3.1, a dual-MP@ML-
decoder architecture configuration is an unavoidable choice under the re-use framework.
Therefore, if the current proposal can overcome the four disadvantages that emerge from
such multiple decoding engines, the main design challenge can be met: constructing a
simple, low cost, but high-efficiency video decoder.
5.5 Research Direction
The details of the BLP decoding scheme for MP@ML applications have been
discussed and validated in Chapters 3 and 4. For HDTV applications, however, how to map the
BLP decoding scheme to a dual-MP@ML-decoder architecture configuration is the first
research issue. Applying the BLP decoding scheme to a dual-decoder configuration should
not require many hardware changes to the original MP@ML decoder design. Moreover, it
should eliminate the need for a large FIFO buffer to store compressed data under the slice-
level decoding scheme.
The second research issue is how to simplify the memory access scheduling scheme
for a dual-decoding path configuration. Section 5.3.1 clearly shows that traditional dual-
decoding path configurations for HDTV applications need various large internal buffers in
order to avoid performance degradation resulting from the starvation condition of
functional units. The main cause of the starvation condition is that the decoding process on
each decoding path is independent. In other words, each decoding process accesses memory
only according to its own decoding status. If the two processes access memory at the same
time and both are in a critical situation, one of them must be sacrificed. As a result, the
whole system performance must be affected. Therefore, the research direction should focus
on putting both decoding paths under control of the same decoding process. Consequently,
the complexity of memory access scheduling can be simplified and internal buffer size can
be reduced.
The third research issue is to derive an optimal architecture for each functional unit.
The essence of optimal architecture means every functional unit having an appropriate data
processing rate for accomplishing the fundamental requirements of real-time decoding in
HDTV applications. Faced with processing the much larger quantity of data in HDTV video streams (six
times the size of MP@ML video streams), designers must construct a real-time HDTV
video decoder while keeping minimal power consumption and chip size. A review of prior
art reveals that traditional designs usually adopt two methods at the same time to reach the
high speed decoding requirements of HDTV. The first method is using complicated
architectures to increase the data throughput rate of each functional unit, which causes
increased decoder gate count. The other method is running the decoder and its memory at
high frequency rates. Both methods use more power and produce more heat. Power
consumption in general is an important safety issue for television applications because a
television is usually in a continuous-use environment that can easily lead to the problem of
high heat. Many components in a television system consume power and produce heat,
including the display device, the video and audio decoders, the graphic engine, and the
sound system. For an HDTV video decoder, heat generation is no less a problem. As
described in Chapters 3 and 4, many design considerations affect decoder power
consumption, including such interrelated factors as the operation frequencies of the video
decoder and memory, the size of internal buffers, and the data bus configuration. To
evaluate the trade-off among these factors, the proposed analysis paradigm described in
Section 1.5, and the design procedure for the DVD application described in Section 4.2, are
a good starting point.
Based on the proposed BLP scheme, a novel dual-path architecture for HDTV video
decoding is detailed in Chapter 6. An efficient dual-bus configuration for lowering external
memory operation frequency rate is also discussed. Complete simulations have been run to
verify decoding performance under the new architecture. These simulation results are also
presented in the chapter.
CHAPTER SIX
Design of a Video Decoder for HDTV: Block Level Pipeline Scheme Application Example II
6.1 Introduction
With the establishment of the European Digital Video Broadcasting (DVB) standard
and the American Advanced Television Systems Committee (ATSC) standard, digital TV
(DTV) broadcasting is now a reality in several countries. In Japan, much effort has been
put into the development of the Integrated Services Digital Broadcasting (ISDB) standard.
A similar effort has taken place in China, where the government defined a new digital
terrestrial (DTT) HDTV standard in 2003, scheduled deployment of the broadcasting system
for 2005, and planned the launch of HDTV channels ahead of the 2008 Beijing Olympics. Terrestrial
broadcast DTV programs are transmitted in either the standard definition television (SDTV)
format or in the high definition television (HDTV) format. HDTV provides significantly
better visual and audio resolution at the expense of higher bandwidth requirement and set-
top box cost.
A typical set-top box system architecture for DVB-T (terrestrial) is presented in
Figure 6.1. Orthogonal frequency division multiplexing (OFDM) signals are demodulated
using the fast Fourier transform (FFT) technique [Fres99]. The resulting channel signals are
decoded by the convolutional (inner) and Reed-Solomon (outer) decoders, are
demultiplexed by MPEG-2 transport demultiplexors, and finally, are decompressed to
generate video and audio information for display by the MPEG-2 video and audio decoders.
The ATSC set-top box is generally similar to that of Figure 6.1, with the FFT replaced by a
vestigial side band (VSB) demodulator and the MPEG-2 audio decoder replaced by a Dolby
AC-3 audio decoder.
Both the U.S. Grand Alliance HDTV standard and the European DVB standard adopt
the MPEG-2 MP@HL video standard [ISO94] for the encoding and decoding of video
information. One of the most commonly adopted HDTV formats is the 1920x1080
pixels/frame format interlaced at 60 fields (or 30 frames) per second, with a compressed
bitstream from 18 to 20 Mbps. This bitstream has six times the amount of video data as
SDTV, which is MPEG-2 MP@ML. A major task for researchers in building HDTV
receiving and decoding systems is thus to design a set of video decoding chips that can
handle this heavy data load in real time, but still be competitive in terms of manufacturing
cost and power consumption [Ling02].
[Figure omitted: tuner, SAW filter, and A/D with IF AGC and reference oscillator; IQ demodulator and FFT with framing, coarse and fine carrier recovery, coarse and fine symbol timing, pilot detection and removal, and equalizer; demapper and convolutional de-interleaver; Viterbi (inner) decoder, inner de-interleaver, Reed-Solomon decoder, and de-randomizer; MPEG-2 system DEMUX feeding the MPEG-2 video decoder, the MPEG-2 audio decoder, and data/control information]

Figure 6.1 Basic set-top box architecture for DVB-T digital TV
6.2 Overview of the Proposed Decoding Approach
An ideal solution for constructing an HDTV video decoder would be based on an
existing MP@ML video design in order to reduce design cost. Table 6.1 shows the upper
bounds for picture resolution, display rate, macroblock processing rate, and allowable
processing time for each macroblock in MPEG-2 MP@ML and GA-HDTV format 2 (as
described in Table 5.1). Due to the large quantity of data in an HDTV picture, multiple-
decoder parallel processing is necessary to achieve the required computing performance
and to lower processing frequency.
                                                      MP@ML          GA-HDTV format 2
Picture resolution                                    720x480        1920x1080
Display frame rate (frames/sec)                       30             30
Macroblock processing rate (macroblocks/sec)          40,500         244,800
Allowable processing time for each macroblock (µs)    24.7           4.08
Allowable processing time for each macroblock         667 @ 27 MHz   111 @ 27 MHz
(cycles)                                                             221 @ 54 MHz
                                                                     333 @ 81 MHz

Table 6.1 Upper bounds for picture resolution and allowable processing time for each macroblock in MPEG-2 MP@ML and GA-HDTV
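The rates and cycle budgets in Table 6.1 follow directly from the frame geometry and the clock rate. The small sketch below (Python; the helper name is ours, and the 1088 coded lines for the 1080-line format is our assumption based on the 8,160 macroblocks per frame quoted later) reproduces them:

```python
import math

def mb_budget(width, height, fps, clock_mhz):
    """Macroblock rate and per-macroblock cycle budget for one format."""
    mbs = math.ceil(width / 16) * math.ceil(height / 16)  # 16x16 macroblocks
    mb_rate = mbs * fps                                   # macroblocks/sec
    cycles = clock_mhz * 1e6 / mb_rate                    # cycle budget per MB
    return mb_rate, cycles

# MP@ML: 720x480 at 30 frames/sec, 27 MHz decoder clock
print(mb_budget(720, 480, 30, 27))     # (40500, ~667)
# GA-HDTV format 2: 1920x1080 (coded as 1088 lines) at 30 frames/sec, 54 MHz
print(mb_budget(1920, 1088, 30, 54))   # (244800, ~221)
```

Rounding the cycle budgets up gives the table's 667, 111, 221, and 333 entries at the respective clock rates.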
Multiple-decoder design often includes three or four MP@ML decoders, but from a
consideration of such manufacturing costs as increased chip area and extra cooling devices
for the increased power consumption, a dual-decoder configuration can be an excellent
compromise between the computing power requirement and manufacturing cost
considerations. However, Table 6.1 clearly shows HDTV pictures require six times more
processing performance than MP@ML pictures. Hence, just doubling the computing
performance by using dual existing MP@ML decoders cannot provide the high data
processing throughput of the HDTV application. These MP@ML decoders still need some
modifications.
The dual-decoder configuration will face the same disadvantages as the multiple-
decoder approach described above if its controller mechanism also adopts the macroblock-
level processing model. Therefore, the proposed dual-decoder configuration adopts the BLP
scheme, where both decoding paths work on the same macroblock while each decoding path
processes a block. The proposed design approach offers four improvements, listed below
[Wang99a, Wang01a].
1. No memory I/O contention: The BLP scheme can eliminate the memory I/O contention
problem because both decoding paths work on the same macroblock, after which
memory I/O scheduling can be the same as that of the DVD decoder (a combination of
fixed time-line scheduling and fixed priority scheduling) described in Chapter 4.
2. Good computational load balancing: The computational load in both decoding paths
can be kept largely the same because the data quantity between blocks in the same
macroblock varies much less than the data quantity between different slices.
3. No synchronization problem: With both decoding engines working on the same
macroblock and no memory I/O contention, both decoding paths can be synchronized
under a simple control mechanism.
4. Low local memory and embedded buffer requirement: The BLP approach can relieve the
requirements of high internal buffer sizes and wide data-bus width because memory I/O
contention has been eliminated and data are captured in each decoding path in a block
manner.
In addition to adopting the BLP scheme, a writing-back scheme is designed to
separate the bus traffic of retrieved picture data for the display process from the bus traffic
of other decoding processes. This scheme can furthermore minimize data bus loading and
allow the proposed decoding architecture to operate at 54 MHz with a 64-bit data bus.
6.3 Overall Decoding System
Figure 6.2 shows the proposed dual-path decoder architecture. The architecture
consists of two external memory devices, an SDRAM interface, a 64-bit wide data bus, a
micro-controller, a variable-length decoder (VLD), a one-to-two demultiplexor (DEMUX),
and two sets of baseline units. The functionality and configuration of the key units in this
decoding system are as follows:
• Two groups of external memory devices, one for storing two pictures for display and
the other for storing two required reference pictures separately and accommodating the
video buffer verifier (VBV) buffer (basically for incoming compressed bits).
Synchronous DRAM (SDRAM) can be used for these memory devices, with each
SDRAM internally configured as a dual bank and a 32-bit wordlength for each bank.
Total memory size for this video decoder is 13 Mbytes.
• The SDRAM interface is the external memory interface circuit for SDRAM access
operations. It includes two sets of data pins and one set of address pins. Its functions
are, firstly, to automatically generate row address strobe/column address strobe
(RAS/CAS) for accessing or refreshing memory cells; and secondly, to buffer data
transactions under two different clock speeds – that of the SDRAM and that of the
video decoder. A 54 MHz decoding speed and 162 MHz SDRAMs are adopted in the
proposed architecture.
• The microprocessor takes the responsibility of setting up decoding parameters, such as
the current macroblock (MB) types and addresses, or calculating the actual motion
vectors. Another important microprocessor function is to synchronize the processing of
the two baselines and to trigger the processing of the Inverse Discrete Cosine
Transform (IDCT) and Motion Compensation (MC) units in each baseline.
• The responsibility of the VLD is to decode variable-length coded data from macroblock
headers and quantized discrete cosine transform (DCT) coefficients.
• Each baseline unit consists of three functional units: the IQ/IZZ (for inverse
quantization and inverse zigzag ordering), the IDCT (for inverse DCT operation), and
the MC (for motion compensation). The basic internal structure of each functional unit
was discussed in Chapter 4.
To simplify the discussion but without losing generality, GA-HDTV format 2
bitstreams are used as the decoding target in this dissertation. The data specifications of
format 2 (as described in Table 6.1) are a frame size of 1920x1080 pels (8,160 macroblocks
in a frame) and a display rate of 30 frames per second. Therefore, under real-time playback
restriction, each MB should be decoded within 221 cycles at a 54 MHz video decoder clock
rate, or 333 cycles at an 81 MHz clock rate. Obviously, the video decoder can have a larger
margin of decoding time if it is running at 81 MHz, but it will consume 15% more power
in comparison with using the lower clock speed of 54 MHz. As described in Section 2.3,
power consumption is one of the key factors for consumer products. The proposed HDTV
video decoder will adopt the 54 MHz clock rate to contribute to the low power consumption
of the whole architecture.
Real-time decoding is the all-important consideration in this architecture design. The
proposed performance simulation model is used to ensure this low frequency rate will not
sacrifice real-time decoding requirements.
[Figure omitted: a 64-bit data bus links a 7-Mbyte SDRAM (through the SDRAM interface and data buffer), the bitstream FIFO receiving the coded bitstream, the VLD buffer and VLD, a one-to-two DEMUX, and the block decoding engine containing Baseline Units 0 and 1, each with IQ/IZZ buffer, IQ/IZZ, IDCT buffer, IDCT, MC/WB buffer, and MC unit; a microprocessor with instruction cache, data cache, and register file drives a command bus; the SDRAM controller contains a scheduling controller and address generator; a separate 6-Mbyte SDRAM and display buffer serve the display engine (vertical and horizontal scaling filter, color space convertor, OSD decoder, overlay controller) through the display and host interfaces over two 32-bit connections]

Figure 6.2 Block diagram of the proposed HDTV video decoder architecture
6.4 BLP Controller Mechanism
6.4.1 Overall Controller Scheme
Macroblock decoding in MPEG-2 follows a specific sequence. The required tasks (in
order) are Bitstream FIFO writing, VLD buffer reading, VLD decoding, IQ/IZZ, and IDCT.
If motion compensation is required, the MC task is also scheduled after the VLD unit has
decoded the MB header. The results from IDCT and MC units are then combined to form
decoded data and written back to the memory for display and as future reference if
necessary. The proposed HDTV decoder takes advantage of this sequence and applies the
BLP scheme to the dual-decoding path for HDTV decoding. The two baseline units in the
dual-decoding path process video data on a block-by-block basis in parallel. Within each
baseline, the functional units process data in a pipelined fashion. In a word, the BLP
scheme synchronizes the dual-decoding paths between blocks and also manages
synchronization of the decoding tasks on a block basis. In addition to the BLP scheme, the
two fixed schedules (as described in Chapter 3) are also applied to bus transactions in order
to minimize buffer requests and waiting cycles. However, due to the large amount of video
data in an HDTV picture, an efficient memory interface scheme is proposed to reduce
processing time. The detailed description of the memory interface scheme is discussed in
Section 6.5.
According to the syntax definition of an MPEG-2 video stream, each coded block will
end with an end of block (EOB) symbol. The proposed controller takes advantage of this
feature to synchronize the two baseline decoding paths. In other words, each baseline unit
decodes video data on a block-by-block basis in the manner shown in the demultiplexor
mechanism of Figure 6.3. For a non-coded block, the proposed controller is signaled in
advance through information in CBP (coded block pattern), which is included in the MB
header. Therefore, the controller can still assign a decoding path for the motion
compensation of this non-coded block without losing the overall decoding pattern.
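The demultiplexor behavior just described can be sketched as follows (Python used only as pseudocode; the function and callback names are illustrative, not part of the hardware design):

```python
def route_macroblock(block_count, cbp, decode_block, motion_compensate):
    """Alternate the blocks of one macroblock between the two baselines.

    cbp[i] is 1 if block i is coded (its data end with an EOB symbol),
    0 for a non-coded block signaled in advance by the CBP field.
    decode_block / motion_compensate are callbacks taking (block, path).
    """
    demux = 0                            # every macroblock starts on Baseline 0
    for i in range(block_count):
        if cbp[i]:
            decode_block(i, demux)       # IQ/IZZ and IDCT on the assigned path
        motion_compensate(i, demux)      # MC is assigned even for non-coded blocks
        demux ^= 1                       # toggle 0/1 so the pattern is kept

# A 4:2:0 macroblock with block 3 non-coded still alternates 0,1,0,1,0,1:
trace = []
route_macroblock(6, [1, 1, 1, 0, 1, 1],
                 lambda i, d: None,
                 lambda i, d: trace.append((i, d)))
print(trace)  # [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0), (5, 1)]
```

The key point is that the toggle fires for non-coded blocks too, so the overall decoding pattern is preserved exactly as the controller requires.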
[Figure omitted: flow chart. Starting with Demux = 0, the VLD decodes symbols of block i; on each EOB symbol the controller is signaled, checks whether (i += 1) > block_count and whether pattern_block[i] = 1, reverses the demultiplexor between 1 and 0, and routes the IQ/IZZ and IDCT processing and the MC processing of block i to Baseline 0 or Baseline 1 accordingly; the same procedure then starts for the next macroblock]

Figure 6.3 Flow chart of the controller setting the demultiplexor
6.4.2 Memory I/O Scheduling
In each baseline, two fixed schedules (time-line scheduling and fixed priority
scheduling) are adopted, as in the descriptions in Chapters 3 and 4. The following
paragraph summarizes information from Sections 3.5.3 and 4.4.
The time-line schedule allocates fixed non-preemptable execution sequences for the
tasks of video decoding because these tasks are deterministic. When it is time for reading
reference data for the MC task, for example, the bus-scheduling program allocates the data
bus to that task until this transaction ends. Under this time-line scheduling scheme, the
SDRAM controller only needs to monitor Compressed Bitstream FIFO overflow and VLD
buffer underflow (display data transfer has been excluded from the I/O process of
macroblock decoding). If these two situations occur at the same time, fixed priority
scheduling is adopted to handle these I/O requests in the order P(Bitstream FIFO) > P(VLD buffer),
where P refers to the priority. During the period when reading reference data for the MC
task for block 0, for example, if the fullness of Bitstream FIFO and VLD buffer are
over/under their respective thresholds and both request data transfer to/from external
SDRAM at the same time, the SDRAM controller will act according to the following
sequential order. It will first finish reference data reading for block 0, then transfer FIFO
data to SDRAM, then transfer data from SDRAM to the VLD buffer, and then continue with
reference data reading for the MC task for block 1. The order of the processing between
functional units is thus maintained, which eliminates the need for complex bus arbitration
schemes.
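A minimal model of this fixed-priority servicing order (our own sketch; the request names are invented for illustration) behaves as described above:

```python
# Fixed priority: P(Bitstream FIFO) > P(VLD buffer), as in this section.
PRIORITY = ("bitstream_fifo_to_sdram", "sdram_to_vld_buffer")

def bus_order(current_transfer, pending_requests):
    """Finish the time-line-scheduled transfer first (non-preemptive),
    then serve any pending buffer requests in fixed priority order."""
    order = [current_transfer]
    order += [r for r in PRIORITY if r in pending_requests]
    return order

# Reference read for block 0 in progress; both buffers cross their
# thresholds and request transfers at the same time:
print(bus_order("ref_read_block_0",
                {"sdram_to_vld_buffer", "bitstream_fifo_to_sdram"}))
# ['ref_read_block_0', 'bitstream_fifo_to_sdram', 'sdram_to_vld_buffer']
```

After this sequence completes, the time-line schedule simply resumes with the reference read for block 1, which is why no general-purpose bus arbiter is needed.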
Figure 6.4 shows the flow chart of the proposed deterministic controller schedule for
two processing baselines decoding a non-intra MB (a non-intra MB is an MB that is
decoded using a reference MB from a reference picture). The MC buffer and the Write-back
buffer are filled/emptied from/to the bus according to the fixed schedule scheme. The VLD
unit immediately decodes the header information of the next macroblock after decoding all
of the DCT coefficients in the current macroblock. With this decoding order, at least 25%
of the operation cycles can be saved.
[Figure omitted: flow chart. After the bitstream FIFO write, the VLD decodes block i and, once (i+1) > block_count, the VLD/FLD decodes the next MB header information. Baseline 0 loads reference blocks for blocks 0, 2, and 4, runs IQ/IZZ on each, and triggers the IDCT and MC to process each block, signaling the controller as blocks 0, 2, and 4 complete; when Demux = 0, Baseline 1 performs the same sequence for blocks 1, 3, and 5. Reconstructed blocks 2i and 2i+1 are written back to external DRAM as they complete, until 2(i+1) and 2(i+1)+1 exceed block_count; the same procedure then starts for the next macroblock]

Figure 6.4 Flow chart of HDTV BLP decoding process for non-intra macroblocks
[Figure omitted: flow chart. After the bitstream FIFO write, the VLD decodes block i and, once (i+1) > block_count, the VLD/FLD decodes the next MB header information. Baseline 0 runs IQ/IZZ, triggers the IDCT, and writes back blocks 0, 2, and 4 in turn; when Demux = 0, Baseline 1 performs the same sequence for blocks 1, 3, and 5; the same procedure then starts for the next macroblock]

Figure 6.5 Flow chart of HDTV BLP decoding process for intra macroblocks
A similar method is also applied to decoding an intra MB (an intra MB is an MB that is
decoded without referencing any MB in the reference picture), as shown in Figure 6.5. The
main difference from the former case is that each reconstructed block can immediately be
written back to external SDRAM because the data bus is usually in the idle state when
decoding an intra MB. Under the tight time constraints of HDTV video decoding, the
design strategy for memory I/O scheduling is to avoid any factor that may cause data access
delay. For example, unlike the write-back scheme during non-intra MB decoding in the
DVD application (in Section 4.4), this scheme is amended to group together the six I/O
processes of writing back reconstructed block data in order to eliminate unnecessary
memory-page open latency.
6.5 Memory Interface Scheme
During MPEG-2 video decoding, there are many required DRAM accesses such as
compressed bitstream FIFO writing, VLD buffer reading, reference macroblock reading,
and reconstructed video data writing. Among these memory I/O transactions, the required
bandwidth for reference macroblock reading is much higher than the bandwidth
requirements of other I/O transactions. One of the advantages of the BLP decoding
approach is that it can spread out the peak bandwidth requirement for reference macroblock
reading, owing to the fact that only one or two blocks of anchor data are loaded each time.
Hence, a 64-bit wide data bus is sufficient.
In conventional design of MPEG-2 decoders, external RAM space basically has to
accommodate two reference pictures, one display picture, and a VBV buffer for the
incoming compressed bitstream. The DRAM size for a typical HDTV application is thus
about 10 Mbytes. In earlier designs [Duar97, Geib97], when DRAMs were still relatively
expensive, display and reference pictures occupied the same physical external DRAM.
Therefore, during the decoding process, the display engine had to compete with the decoder
for the bus and external DRAM in order to extract a picture for display. Such
implementation uses up a lot of bus cycles and is quite inefficient for real-time HDTV
decoding.
In the proposed memory interface scheme, as shown in Figure 6.6, the display
memory is physically separated from the anchor memory, and is added to the display
engine, not to the decoder-memory bus. The size of SDRAM for the video decoder is 7
Mbytes (6 Mbytes for storing two reference pictures, 1 Mbyte for the VBV buffer). The
display memory takes 6 Mbytes. The two-picture space included in the display memory is
needed because the decoding order and the displaying order are different in MPEG. With
the low cost of DRAMs today, such sizes are acceptable. In the memory interface scheme,
the reconstructed macroblocks belonging to reference pictures (I and P) are sent to both the
SDRAM frame buffer that stores reference pictures and (at the same time) to the SDRAM
display memory (for immediate or later display). On the other hand, the reconstructed
macroblocks of B-pictures are only written to the display memory, and not to the frame
buffer. Although this scheme requires more memory space, the separation of display and
reference memories reduces bus traffic and contention between the decoder and the display
engine, as compared to many conventional schemes. It saves about 60 clock cycles per
macroblock (measured at 54 MHz), which is 27% of video decoding cycles. This result is
worthwhile for real-time HDTV decoding.
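The quoted buffer sizes can be checked from the picture geometry (a sketch; we assume 4:2:0 sampling, i.e. 1.5 bytes per pel, and 1088 coded lines per the 8,160 macroblocks quoted in Section 6.3):

```python
def picture_bytes(width, height):
    """Bytes per decoded picture at 4:2:0 sampling (one luma byte
    plus half a chroma byte per pel)."""
    return int(width * height * 1.5)

MB = 2 ** 20
pic = picture_bytes(1920, 1088)      # ~3 Mbytes per HDTV picture
frame_buffer = 2 * pic               # two reference pictures (~6 Mbytes)
vbv = 1 * MB                         # ~1 Mbyte VBV buffer -> 7-Mbyte SDRAM
display = 2 * pic                    # two display pictures (~6 Mbytes)
total = frame_buffer + vbv + display
print(round(total / MB))             # ~13, the quoted total memory size
```

The 7-Mbyte decoder SDRAM (frame buffer plus VBV) and 6-Mbyte display memory of Figure 6.2 thus account for the 13-Mbyte total given in Section 6.3.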
The proposed storage organization for the VBV buffer and for reference pictures (the
interlaced macroblock-row memory structure) in the frame buffer is the same as the
description in Section 3.4.3. The average probability of page-break occurrence when
reading reference macroblocks is only 1.5% during decoding one picture. The storage
organization for the display buffer can be the same scan-line pattern used by the display
device.
[Figure omitted: the video decoder writes reconstructed I/P macroblocks to the frame buffer (two reference pictures plus the VBV buffer) and writes I/P/B macroblocks to the display memory (two display pictures), which feeds the display engine]

Figure 6.6 Block diagram of memory interface scheme
6.6 Architecture of Video Processing Units
The architecture designs of the VLD and IQ/IZZ functional units for DVD
applications can be directly applied to the proposed HDTV decoder without any hardware
changes. Because two parallel decoding paths with IQ/IZZ functional units follow the VLD
unit, the data processing rate of the VLD unit becomes 108M symbols per second at a 54
MHz decoding frequency rate. This output rate meets the VLD processing rate requirement
of 100M symbols per second for HDTV applications (Section 5.3.4).
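The throughput claim is a one-line check (a sketch of the arithmetic only, assuming each path consumes one symbol per cycle):

```python
decoder_clock_hz = 54_000_000    # proposed decoding frequency
parallel_paths = 2               # two IQ/IZZ paths consume the VLD output
vld_rate = parallel_paths * decoder_clock_hz
print(vld_rate)                  # 108000000 symbols/sec
assert vld_rate >= 100_000_000   # HDTV requirement from Section 5.3.4
```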
Therefore, this section focuses on the architecture designs of the IDCT and MC
functional units. The proposed HDTV video decoder is running at 54 MHz, which is a very
low processing rate compared to other designs. For real-time decoding of such a large
quantity of data as HDTV pictures, the architectures of the two units need to be amended in
order to increase data output rate from one decoded pixel per cycle (for DVD applications)
to two decoded pixels per cycle. In order to minimize design cost, the proposed
modifications only involve minor changes to the original designs.
6.6.1 Inverse Discrete Cosine Transform Unit (IDCT)
The simplified overall 2-D IDCT architecture being proposed is shown in Figure 6.7.
Basically, this architecture is the same as the proposed IDCT unit for the DVD decoder
described in Section 4.5.3, except for the transpose RAM. Each 8-point 1-D IDCT uses four
multiply-accumulate (MAC) units arranged in a systolic array to carry out the IDCT calculation.
To reduce the latency from the read/write operations in the transpose RAM, the I/O
port of RAM needs a minor modification, such that read and write operations can be active
at the same time and two intermediate results can be transferred in each operation. The
read-write sequence is shown in Figure 6.8. In this way, a pair of intermediate results are
put into the transpose RAM every cycle after the fourth cycle from the beginning. Then, the
earliest time for the second 1-D IDCT unit to read data from the transpose RAM is the 25th
waiting cycle of the first 1-D IDCT unit. Therefore, the total processing time for 3 blocks
(in one baseline) is 128 cycles. The timing diagram is shown in Figure 6.9.
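The 128-cycle figure follows from the timing just described (a sketch; the per-phase counts come from the paragraph above):

```python
def idct_cycles(blocks):
    """Cycles for the pipelined 2-D IDCT to output a run of 8x8 blocks.

    First block: 4 cycles to write the first 1-D results to the
    transpose RAM, 25 cycles before the second 1-D unit may read,
    4 cycles until its first output pair, and 31 more cycles for the
    remaining results. Each later block streams one result pair per
    cycle, i.e. 32 cycles per block.
    """
    first_block = 4 + 25 + 4 + 31        # 64 cycles
    return first_block + 32 * (blocks - 1)

print(idct_cycles(3))  # 128 cycles for the 3 blocks of one baseline
```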
[Figure omitted: two 1-D IDCT processing units, each containing four MACs with a cosine ROM and adder & shifter stages, connected through the transpose RAM; round & clip stages produce the two outputs]

Figure 6.7 Block diagram of the IDCT core processor for the HDTV video decoder
[Figure omitted: interleaved writing and reading sequences in the transpose RAM, with even-numbered positions 2 through 32 and odd-numbered positions 1 through 31 paired so that read and write operations can proceed concurrently]

Figure 6.8 Writing and reading order in the transpose RAM
[Figure omitted: timing diagram. 4 cycles to write the first 1-D IDCT results to the transpose RAM; after 25 cycles, the second 1-D IDCT unit can read from the transpose RAM; 4 cycles until its first two results are output; after 31 more cycles, all 64 results of block 1 are output. In pipeline processing, a pair of results is output every cycle, so blocks other than the first are decoded in 32 cycles each]

Figure 6.9 Output timing diagram for the proposed IDCT unit for the HDTV decoder
6.6.2 Motion Compensation Unit (MC)
Basically, this MC architecture is the same as the proposed MC unit for the DVD
decoder described in Section 4.5.4, except for the F- and B-register size. The simplified
overall MC architecture being proposed is shown in Figure 6.10. Two paths that begin from
the MC buffer separately connect to the F- and B- registers, which serve as forward
prediction and backward prediction. Each register is expanded to 6 pixels wide. The
adder/shifter unit immediately following the register sets is for interpolating the half-pel
precision of a motion vector. If bi-directional prediction is needed, the results from both
the forward- and backward-prediction paths are added and shifted to obtain the reference
pixels.
The timing diagram is shown in Figure 6.11. The MC process still follows the two
kinds of pipeline model (4-stage and 3-stage) for bi-directional non-intra (B-type)
macroblocks, and one 2-stage pipeline model for non-intra (P-type) macroblocks. In a 4-
stage pipeline for B-type macroblock processing, the first two stages are for loading
reference pixels and computing half-pel precision. The third stage is for bi-directional
prediction computing. The last stage is for producing the reconstructed pixels by adding the
prediction errors from the IDCT unit. In a 2-stage pipeline for P-type macroblock
processing, the first stage is for loading reference pixels and computing half-pel precision.
Computing bi-directional prediction is not needed because the MC process of P-type
macroblock only references one frame (I- or P-picture). The last stage is for producing the
reconstructed pixels. Therefore, the total number of cycles needed to process 3 blocks (in
one baseline) of B-type macroblock data is 122 cycles, and 95 cycles for 3 blocks of P-type
macroblock processing.
[Figure omitted: the MC input buffer feeds 6-pixel-wide F- and B-registers (three 2-pixel registers each) under the direction of the control logic, which receives control signals from the microprocessor; adder & shifter units interpolate half-pel precision, an adder and shifter pair combines the forward and backward predictions, and final adders merge the results from the IDCT unit to produce two reconstructed pixels]

Figure 6.10 Block diagram of the MC unit for the HDTV video decoder
156
(a) block-level data processing pattern in the proposed MC unit
(b) pipeline stages and timing diagram of the proposed MC unit for bi-directional non-intra (B-type) MB processing
Figure 6.11(a)(b) Data processing pattern, pipeline stages, and output timing diagram
for the MC processing of B- and P-type macroblocks
(c) pipeline stages and timing diagram of the proposed MC unit for non-intra (P-type) MB processing
Figure 6.11(c) Data processing pattern, pipeline stages, and output timing diagram for the MC
processing of B- and P-type macroblocks
6.7 Performance Simulation Model
Software to simulate and monitor the proposed HDTV video decoder has been
developed. This simulator is an extended version of the performance simulation model
(Section 4.7) that has been designed for MP@ML single decoding-path applications such as
the DVD video decoder. The processing diagram of this simulation model is shown in
Figure 6.12.
There are three processing blocks to which minor changes have been made to suit the
proposed HDTV architecture. The first block is the decoding cycle counter, which has been
extended to monitor the decoding cycle input from dual decoding paths. The second block
is the internal buffer capacity monitor, which has been extended to monitor the internal
buffers (IQ, IDCT, and MC) of two baselines. The last block is the bus arbiter, in which the
memory accessing order of the functional units has been re-arranged in order to minimize
the data transfer delay caused by opening different memory pages. The general bus arbiter
scheme is described in Section 6.4.2. This access order change does not increase the
complexity of the SDRAM data transfer cycle counter. Although both decoding paths in
the proposed HDTV architecture must access memory, this counter performs the same
operations as it does in the simulation model for the MP@ML single decoding-path
architecture, because the two decoding paths work on the same macroblock and their
memory accesses can be activated one by one, just as in the single decoding-path BLP
architecture.
Compared to the complicated external memory access scheduling schemes for multiple
decoding paths proposed in other research work (described in Section 5.3.3), the BLP
decoding model achieves an elegant simplicity.
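The dual-path cycle accounting can be sketched as follows. This is a deliberately simplified, illustrative fragment (the names and the additive treatment of memory cycles are assumptions made here); the actual simulator overlaps computation with data transfer under the BLP schedule.

```c
#include <assert.h>

#define NUM_BASELINES 2   /* two decoding paths, one MB shared */

/* Illustrative per-baseline cycle totals reported to the counter. */
struct baseline {
    long iq_cycles;
    long idct_cycles;
    long mc_cycles;
};

static long max2(long a, long b) { return a > b ? a : b; }

/* The decoding cycle counter charges one macroblock with the slower
   of the two baselines, plus the serialized memory transfer cycles
   granted by the bus arbiter (accesses are activated one by one). */
static long mb_decode_cycles(const struct baseline b[NUM_BASELINES],
                             long memory_cycles)
{
    long path0 = b[0].iq_cycles + b[0].idct_cycles + b[0].mc_cycles;
    long path1 = b[1].iq_cycles + b[1].idct_cycles + b[1].mc_cycles;
    return max2(path0, path1) + memory_cycles;
}
```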
Figure 6.12 Processing diagram of the proposed HDTV video decoder performance
simulation model
6.8 Simulation Results
Two 162 MHz SDRAMs and a 54 MHz video decoder processing rate are used in the
proposed performance simulation model. The sizes of the internal buffers adopted in the
performance simulation are listed in Table 6.2. The table shows the buffer sizes for the
IQ/IZZ, IDCT, MC, and Write-back buffers to be equipped on one decoding path (i.e.,
one baseline). Because the proposed dual decoding-path operates much like
the proposed single decoding-path BLP architecture, the IQ/IZZ, IDCT, and MC buffers are
the same size as those for the proposed DVD video decoder (described in Section 4.8).
Because of the rigid time constraints for decoding HDTV video, the I/O processes of
writing-back reconstructed block data are grouped together in order to reduce memory-page
open delay. Therefore, the Write-back buffer in each baseline has to be expanded to the
capacity needed to hold three blocks of data (192 bytes).
Compared to the proposed DVD video decoder, the bitstream FIFO size is reduced by
half and the VLD buffer size is doubled. According to Eq. 3.11, the size of
bitstream FIFO can be derived from the FIFO refilling rate multiplied by the maximum
number of macroblock decoding cycles. In the HDTV standard [Grand94] and the digital
television standard [ATSC01], the video bit rate in the high data-rate mode is about 40
Mbps. Hence, the FIFO filling rate is about 0.1 bytes/cycle. Under real-time display
constraints, one macroblock must be decoded within 221 cycles at 54 MHz video decoder
speed. Therefore, the compressed bitstream FIFO needs only 24 bytes of space.
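The sizing arithmetic above can be reproduced directly; the round-up to an 8-byte boundary is an assumption made here to match the stated 24-byte result.

```c
#include <assert.h>

/* Eq. 3.11 in spirit: FIFO size = refill rate (bytes per decoder
   cycle) times the maximum macroblock decoding cycle budget. */
static int fifo_size_bytes(double bitrate_bps, double clock_hz,
                           int mb_cycle_budget)
{
    double bytes_per_cycle = bitrate_bps / 8.0 / clock_hz;  /* ~0.09 */
    double needed = bytes_per_cycle * mb_cycle_budget;      /* ~20.5 */
    int whole = (int)needed + 1;        /* round up to whole bytes   */
    return ((whole + 7) / 8) * 8;       /* align to an 8-byte word   */
}
```

With a 40 Mbps bitstream, a 54 MHz decoder clock, and a 221-cycle budget, this gives 24 bytes.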
Bitstream FIFO  VLD buffer  IQ/IZZ buffer  IDCT buffer  MC buffer  Write-back buffer
24 bytes        88 bytes    128 bytes      72 bytes     298 bytes  192 bytes
Table 6.2 Sizes of internal buffers adopted for the simulation model for the proposed
HDTV architecture
The main reason for increasing the VLD buffer space is to completely eliminate
VLD buffer refilling requests during B-type macroblock decoding, the process
with the heaviest bus traffic owing to the reading of reference macroblocks. This means that all
of the bus bandwidth can be devoted to, and efficiently used for, the regular schedule for
data transfer, and the decoding process for one macroblock can be finished within the
stringent time constraint, 221 cycles. Design of a larger VLD buffer requires consideration
of both data transfer scheduling for VLD buffer reading and a suitable threshold setting for
VLD buffer re-filling. The implementation approaches are different from those for the
smaller VLD buffer design in the DVD application (Chapter 4). Detailed discussion of
these approaches is below.
To find a suitable VLD buffer size, two test bitstreams (women.m2v and flowers.m2v)
have been input to the performance simulation model. These bitstreams have bit
rates of 18 and 22 Mbps, respectively; the second is well above the stated maximum of 19.4
Mbps for terrestrial broadcast applications [ATSC01]. This second bitstream makes a good
test of the robustness of the proposed HDTV dual-path decoder configuration. Table 6.3
shows the data characteristics for the two picture types in these bitstreams. The proposed
VLD buffer size is 88 bytes, which is a large enough buffer size to contain most of the
macroblocks of data for B-pictures. Although this buffer size is not large enough to contain
one macroblock of data for I- or P-pictures, frequent requests for VLD buffer refilling will
not affect decoding performance because bus traffic in these types of picture decoding is
low. The bitstream “women.m2v” consists mainly of process-intensive, bi-directionally
predicted MBs, and hence stresses the bandwidth test. The bitstream “flowers.m2v” is
used to test the VLD unit against large amounts of header information, including a
frequent slice pattern. Therefore, in addition to testing the efficiency of the proposed
hardware architecture, both bitstreams can be used to test the efficiency of the bus
scheduling scheme.
                   I picture               P picture               B picture
                   Avg./MB   % ≥ 88 bytes  Avg./MB   % ≥ 88 bytes  Avg./MB   % ≥ 88 bytes
Flowers (22 Mbps)  116 bytes 61.86%        67 bytes  34.99%        42 bytes  12.73%
Women (18 Mbps)    102 bytes 55.89%        55 bytes  18.21%        25 bytes  1.1%
Table 6.3 Average data amount per macroblock within I-, P-, and B-pictures
The number of filling requests for various buffer sizes of the VLD buffer is shown in
Figure 6.13. In the figure, the x-axis shows the various VLD buffer sizes from 24 bytes to
136 bytes, while the y-axis shows the average number of filling requests per macroblock.
As expected, when the size of the VLD buffer is larger, the number of VLD buffer filling
requests is lower. Fewer requests mean less of the limited decoding time is wasted on data
transfer. For the same reason, the schedule for VLD buffer refilling is also changed, from
a fixed refill at the beginning of each macroblock decoding process in the DVD application
to a dynamic request issued when the data remaining in the VLD buffer falls under a
certain threshold. Figure 6.13 clearly shows that the number of filling requests cannot be
significantly reduced when VLD buffer size is greater than 88 bytes. Of course, larger
buffers mean increased chip size and power consumption. Therefore, the proposed VLD
buffer size is set at 88 bytes. According to the performance simulation model, the proposed
88-byte VLD buffer makes, on average, one request for every three B-type macroblock
decoding processes for the 18 Mbps bitstream, and one request for every two B-type
macroblock decoding processes for the 22 Mbps bitstream. And, the simulation results
show that this size is big enough to support real-time HDTV pictures decoded under the
proposed architecture. After the VLD buffer makes a refilling request, the bus scheduler
will fill the request according to the memory I/O scheduling scheme as described in Section
6.4.2.
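The refill policy described above can be sketched as two small predicates. The buffer constants are taken from this section; the request/grant mechanism is simplified for illustration.

```c
#include <assert.h>

#define VLD_BUF_SIZE  88   /* bytes, proposed VLD buffer size */
#define VLD_THRESHOLD 15   /* bytes, refilling threshold      */

/* A refill request is raised only when the data remaining in the
   VLD buffer drops under the threshold, instead of at a fixed point
   in every macroblock decoding process as in the DVD design. */
static int vld_needs_refill(int bytes_remaining)
{
    return bytes_remaining < VLD_THRESHOLD;
}

/* When the bus scheduler grants the request, the refill amount is
   bounded by the free space in the buffer. */
static int vld_refill_amount(int bytes_remaining)
{
    return VLD_BUF_SIZE - bytes_remaining;
}
```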
(a) Women.m2v (bit rate @ 18 Mbps)
(b) Flowers.m2v (bit rate @ 22 Mbps)
Figure 6.13 Average number of filling requests for different VLD buffer sizes
(The threshold for VLD buffer refilling is at 15 bytes)
The approach for determining the VLD buffer threshold setting in the HDTV
application is also different from that in the DVD application. An appropriate VLD buffer
threshold setting must reserve enough data to keep the VLD unit processing continuously
until the VLD buffer is refilled. Determination of the proper threshold position must take into
account the memory I/O scheduling scheme. The proposed memory I/O scheduling scheme
is the mix of time-line scheduling and fixed priority scheduling described in Section 6.4.2.
In the worst case, the VLD buffer is going to be refilled after 40 cycles (30 cycles for
loading two reference blocks for the motion compensation of block 0 and ten cycles for
bitstream FIFO writing). During this 40-cycle period, the VLD unit will decode 40
codewords (the processing rate of the proposed VLD architecture is one codeword per
cycle). As described in Section 5.3.4, the average length of one codeword is 2.85 bits.
Hence, the 15-byte threshold setting is large enough to avoid starvation of the VLD unit. The
average number of filling requests shown in Figure 6.13 is based on this 15-byte threshold
setting. Figure 6.14 also shows the average number of filling requests per macroblock
under different VLD buffer sizes. However, the threshold for refilling is set to half the size
of the VLD buffer. Compared to Figure 6.13, the number of filling requests has increased
between 25% and 80% for both test bitstreams. For example, when the VLD buffer is 88
bytes, the number of filling requests increases 66%. Frequent filling requests disturb
normal data transfer scheduling and worsen the problem of heavy data bus traffic.
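The worst-case arithmetic behind the 15-byte threshold can be checked directly: 40 codewords at an average of 2.85 bits each is 114 bits, i.e. 14.25 bytes, which fits under the threshold.

```c
#include <assert.h>

/* Worst-case data consumed by the VLD unit while it waits for a
   refill: one codeword per cycle over the wait period, at the
   average codeword length from Section 5.3.4. */
static double worst_case_bytes(int wait_cycles,
                               double avg_bits_per_codeword)
{
    return wait_cycles * avg_bits_per_codeword / 8.0;
}
```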
(a) Women.m2v (bit rate @ 18 Mbps)
(b) Flowers.m2v (bit rate @ 22 Mbps)
Figure 6.14 Average number of filling requests for different VLD buffer sizes
(The threshold for VLD buffer refilling is half the VLD buffer size)
Figure 6.15 shows the decoding timing diagram for I-, P-, and B-type macroblocks for
the bitstream women.m2v, as produced by the proposed performance simulation model.
This figure clearly shows that the decoding time of a macroblock in the proposed dual-path
decoding architecture and BLP scheme is mainly affected by only two factors: the
processing cycles of the VLD/IQ units for the first two coded blocks (e.g. block0 and
block1), and reference block loading cycles from external DRAM. No other HDTV video
decoder processing model can be simplified to just two major performance issues.
These two factors are critical because they determine the starting time of the
pipelined tasks in each decoding path where there are three blocks to be processed. In each
decoding path, the completion of the first coded block’s VLD process triggers the three
blocks’ IDCT processes. The three IDCT processes can usually be concatenated
because the VLD processes of the other two blocks are hidden within the preceding one or
two IDCT processes. Completion of the loading of one or two reference blocks triggers
the corresponding motion compensation process. Among data transaction tasks, only
reference data loading is of concern because it consumes the most time within the whole
data transaction period, and the effect of other transaction tasks, such as data transfer to
avoid buffer underflow/overflow, can be made negligible under the BLP scheme.
Thus, running performance analyses of the BLP HDTV video decoder at a few critical
points can help system designers easily locate performance bottlenecks in the early design
and simulation stages, and debug quickly in the RTL implementation stage.
(a) Timing diagram for decoding the 928th MB in picture 0 (I-picture). Amount of compressed data in each block: block 0 having 261 bits, block 1 having 146 bits, block 2 having 199 bits, block 3 having 111 bits, block 4 having 136 bits, block 5 having 117 bits
(b) Timing diagram for decoding the 928th MB in picture 1 (P-picture). Amount of compressed data in each block: block 0 having 22 bits, block 1 having 83 bits, block 2 having 31 bits, block 3 having 19 bits, block 4 having 60 bits, block 5 having 94 bits
Figure 6.15(a)(b) Timing diagram for I-, P-, and B-type macroblocks for Women.m2v
(c) Timing diagram for decoding the 928th MB in picture 2 (B-picture). Amount of compressed data in each block: block 0 having 8 bits, block 1 having 10 bits, block 2 having 0 bits, block 3 having 0 bits, block 4 having 0 bits, block 5 having 0 bits
Rn: reference blocks read for the MC process of block n (for half-pel precision, one or two 9x9 reference blocks loaded); Wn: writing decoded block n to the frame buffer; *: stochastic (no asterisk denotes deterministic)
Figure 6.15(c) Timing diagram for I-, P-, and B-type macroblocks for Women.m2v
The statistical distributions of macroblock decoding cycles for I-, P-, and B-pictures
in the two test bitstreams are given in Figure 6.16 and a summary of the simulation for each
bitstream is presented in Table 6.4. In the figure, each x-axis shows the number of
decoding cycles taken for each MB, while the y-axis shows the percentage of MBs in I-, P-,
and B-pictures decoded at a specific number of MB decoding cycles. Extended from the
architecture performance analyses of Figure 6.15, Figure 6.16 charts the characteristics of
intra/non-intra macroblocks in bitstreams of two different bit-rates using two statistical
values: the average decoding cycles, µ, and the standard deviation, σ.
The average number of decoding cycles for I-type macroblocks is larger than those of
P- or B-type macroblocks in both low and high bit-rate streams because most of the I-type
macroblocks contain more compression data than the other two types of macroblocks. The
standard deviation values for I-type macroblocks are also the largest because their decoding
cycles are only determined by the VLD working cycles of the first two coded blocks (no
effect from the motion compensation process). That is, the amount of compressed data in
these blocks can vary across a wide range.
In decoding the low bit-rate stream, the average number of decoding cycles of P-type
macroblocks is smaller than that of B-types, and vice versa in decoding the high bit-rate
stream. On the other hand, the standard deviation values for P-types are larger than those of
B-type macroblocks in both high and low streams. The reason stems from the amount of
data contained in P- and B-type macroblocks. In a low bit-rate stream, although the amount
of data contained in each P-type macroblock is larger on average than in a B-type, the first
two coded blocks’ VLD processing cycles are fewer than the cycles of the first two tasks of
reference block loading. Hence, the decoding cycles of most of the P-type macroblocks are
mainly determined by reference block reading cycles. And, these reading cycles for P-type
macroblock decoding are usually fewer than for B-types because only one reference
macroblock is needed. Therefore, the average number of decoding cycles for P-type macroblocks
is smaller than that for B-types in the low bit-rate stream. However, the decoding cycles for
P-type macroblocks that contain more data in the first two coded blocks are determined by
the VLD working cycles; hence, the standard deviation values are larger than those of B-
type macroblocks in a low bit-rate stream. On the other hand, in the high bit-rate stream,
most of the P-type macroblocks contain so much compressed data that the number of
decoding cycles is mainly determined by the VLD working cycles. Hence, the average
number of decoding cycles and standard deviation values for P-types are larger than those
of B-type macroblocks.
The B-type macroblocks contain the least compressed data, so the number of
decoding cycles is mainly affected by the number of reference block reading cycles, within
which the variance in data reading time from macroblock to macroblock is usually small.
Two main factors result in this variance: the number of page-break occurrences and some
internal buffers’ refilling requests. However, the effects of these two factors have been
minimized by the proposed video data storage structure and the data bus scheduling
scheme. Hence, the standard deviation values of B-type macroblocks are the smallest in
both high and low bit-rate streams, which means their decoding cycles concentrate in a
small range.
From the above discussion of simulation results, the related issues of memory
interface, such as data storage structures and bus scheduling schemes, become important for
decoder performance. Both P- and B-type performance will be affected by these memory
interface issues. The analyses show that BLP can provide the most efficient memory
interface solution.
women: I picture (µ = 216, σ = 50)
women: P picture (µ = 165, σ = 13)
women: B picture (µ = 182, σ = 6)
Figure 6.16(a) Statistical distributions of macroblock decoding cycles for I-, P-, and B-pictures for women.m2v
flowers: I picture (µ = 231, σ = 80)
flowers: P picture (µ = 208, σ = 45)
flowers: B picture (µ = 205, σ = 26)
Figure 6.16(b) Statistical distributions of macroblock decoding cycles for I-, P-, and
B-pictures for flowers.m2v
It can be seen from this figure that the decoding cycles for most macroblocks of P-
and B-type pictures are well below the 221 cycles upper bound for real-time HDTV
decoding. Very few macroblocks exceed 221 decoding cycles. For those that do, the
overhead is easily absorbed into the time left over from other, less process-intensive
macroblocks, and thus causes no delay in overall real-time decoding. Although
the macroblocks in I-pictures are associated with the largest dispersion (the highest variation
in the number of decoding cycles required), I-pictures occur least frequently in the bitstream
(typically only one in every 15 pictures), so the additional overhead is easily absorbed
into the time left from P- and B-pictures. This can also be shown by the one-sided
Chebyshev inequality, P{X ≥ μ + bσ} ≤ 1/(1 + b²) (where X is a non-negative random
variable with mean μ and standard deviation σ, and b > 0), when it is applied to
these simulation data. The probabilities of exceeding 221 cycles for each MB’s
decoding in I-, P-, and B-pictures are less than 0.88, 0.05, and 0.02, respectively. The
proposed decoder architecture, with the BLP decoding model and the bus controller
scheme, is thus suitable for real-time HDTV bitstreams.
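The bound can be evaluated numerically from the (µ, σ) pairs in Figure 6.16(a); the snippet below is a sketch of that computation (the values quoted in the text are rounded).

```c
#include <assert.h>

/* One-sided Chebyshev bound: P{X >= mu + b*sigma} <= 1/(1 + b^2),
   with b = (limit - mu) / sigma, valid for b > 0. */
static double chebyshev_bound(double mu, double sigma, double limit)
{
    double b = (limit - mu) / sigma;
    return 1.0 / (1.0 + b * b);
}
```

For women.m2v, P-pictures (µ = 165, σ = 13) give a bound of about 0.051 at 221 cycles, and B-pictures (µ = 182, σ = 6) about 0.023.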
Another point to note in Table 6.4 is that, unlike in other existing schemes, the bus
utilization of the proposed memory interface for B-pictures can stay below 0.9, owing to
the special memory write-back scheme. The proposed memory interface scheme reduces
overall system load by processing B-pictures more efficiently.
“Women.m2v” bitstream              I picture  P picture  B picture
Ave. bus utilization               0.32       0.55       0.81
% of MBs exceeding 221 cycles      24.37%     0.21%      0.00%

“Flowers.m2v” bitstream            I picture  P picture  B picture
Ave. bus utilization               0.35       0.57       0.84
% of MBs exceeding 221 cycles      52.27%     12.01%     4.50%

Table 6.4 Bus utilization and percentage of MBs exceeding 221 decoding cycles
for the two video bitstreams.
6.9 Conclusion
This research proposes a novel real-time HDTV video decoding scheme based on
MPEG-2 MP@HL. With an efficient block-based fixed schedule controller approach as well
as a memory interface/storage scheme, this new architecture with dual-pipeline-decoding
paths can decode 18 – 22 Mbps HDTV 4:2:0 video bitstreams in real-time at a relatively
low 54 MHz clock rate. The design meets the need to process large amounts of HDTV data
in real-time, and re-uses most of the original functional unit designs of MP@ML
applications, resulting in a low overall architecture cost. The relatively low clock rate for
the video decoding process also helps to reduce power consumption. The decoder is
suitable for digital DVB or ATSC HDTV set-top boxes.
This BLP decoding architecture can be easily expanded by adding more decoding
paths in order to meet the performance requirements of the 4:2:2 and 4:4:4 chroma
sampling formats for professional editing applications. The bus access schedule and
memory interface need not be changed. Therefore, this architecture is not a costly solution.
CHAPTER SEVEN
Conclusions
7.1 Additional Applications of BLP
Modern communication technologies such as telephone and E-mail can provide
efficient verbal communication tools. But interpersonal communications also consist of
such non-verbal components as gestures and facial expressions. This visual modality can
provide information about a person’s mood and feelings, which can be important factors in
avoiding misunderstandings between people and making correct decisions. For the end-
user’s convenience, communication devices that include the visual modality, such as the
videophone, are usually combined with RF transceivers in order to construct handheld,
wireless multimedia appliances.
To meet the potential market demand, two compression standards have been selected.
MPEG-4 [ISO99] has been approved by ISO/IEC, and H.264 has been approved by ITU-T
as a Recommendation. Both standards have become focal points for mobile video communicators.
Beyond the preceding MPEG-1 and MPEG-2 standards, MPEG-4 provides many new
data manipulation algorithms for a wide-range of applications, such as interactivity,
universal accessibility, and error resilience. There are two major types of applications
envisioned by the MPEG-4 specification and its designers: frame-based applications and
object-based applications. Until now, only the frame-based subset of the MPEG-4
specification has been widely adopted for use in these mobile video appliances. The video
data representations in the MPEG-4 frame-based subset are the same as those in the MPEG-
2 specification, but MPEG-4 adopts some new compression tools in order to achieve high
data compression ratios and high video quality.
On the other hand, H.264 provides a far more efficient mechanism for compressing
motion video by introducing compression tools entirely different from those of any
preceding MPEG standard. Of these advanced tools, some can enhance the ability to
predict the values of the content of a picture to be encoded, such as variable block-size
motion compensation with small block size, multiple reference picture motion
compensation, and weighted prediction. Some of the other advanced tools can improve
coding efficiency, such as small block-size transform, hierarchical block transform, and
context-adaptive entropy coding [Weig03]. Recently, H.264 has been adopted by ISO/IEC
as International Standard 14496-10 (MPEG-4 part 10) Advanced Video Coding (AVC).
These portable devices give rise to new implementation challenges, particularly with
respect to design of a power-efficient hardware codec appropriate for small-size chip
packages for embedded applications and for the battery-powered environment. Much
research has shown that the dominant power consumption in video decoder design stems
from the data transfer bandwidth requirement and internal data storage size [Itoh95,
Nach98]. Therefore, analysis and minimization of bus bandwidth and internal buffer size is
the first priority for low-power video architectures.
From the discussion in previous chapters, the BLP decoding model has shown an
outstanding capability to efficiently spread the peak data transfer bandwidth in order to
minimize the requirements of bus width and internal buffer space for real-time video
decoding. Furthermore, the BLP decoding model allows for a simpler bus arbiter design.
These advantages enable the BLP model to be readily applied to mobile video appliances.
7.2 Conclusions and Future Research
This dissertation documents a novel MPEG-2 video decoding model called Block-
Level-Pipeline (BLP) that has been developed for construction of efficient hardware video
decoders. This model exploits the MPEG decoding sequence and the EOB (end-of-block)
symbol in the MPEG-2 video bitstream to provide a data flow control mechanism between
external SDRAM and the functional units, and internally among the functional units. This
control mechanism can lower the peak data flow from one macroblock of data to one block
of data, and can simplify the state machine design of the bus arbiter. Hence, the bus width,
internal buffer space, and the logic circuit of the bus arbiter can be minimized in size. In
addition to the bus control mechanism, the BLP model also proposes a new video data
storage structure that can allow the SDRAM interface to easily extract reference video data
on a block-by-block basis and can minimize the data transfer delay caused by page-break
latency. To verify the performance of the video decoder, a C-code software simulator is
proposed. The unit of measurement within the proposed simulation model is the instruction
cycle rather than the clock cycle, because the timing models of the functional units are
not designed at the RTL level. This research also proposes the hardware
architecture designs of functional units that are used in the MPEG-2 MP@ML DVD video
decoder and the MP@HL HDTV video decoder. These architecture designs take into
consideration the trade-off between the real-time decoding performance requirement and
the processing throughput of each functional unit. In other words, every functional unit is
constructed to meet the minimum performance requirement in order to save silicon area and
minimize the control effort required in balancing different processing rates between
functional units.
To extend this BLP decoding model, research can continue in the following possible
areas and for the following possible emerging applications:
1. As described in previous chapters, an MPEG-2 decoder requires a relatively large
amount of memory in its system design. This problem is more serious in low-end
embedded systems with a simple Unified Memory Access architecture and in high-
end HDTV systems with large memory demands. Therefore, a memory compression
mechanism is needed in MPEG video decoder design for directly reducing memory
requirements during video decoding and for indirectly reducing power consumption.
However, any introduced memory compression mechanism should not sacrifice
decoded video quality too much, significantly increase MPEG decoding latency, or
sacrifice random accessibility for the motion compensation process.
2. Much research indicates the clock signal consumes a large percentage (15% - 45%) of
system power [Tell94, Najm92]. In a normal clock tree, the clock signal arrives
regularly at all of the clock sinks. However, clock signals are not needed when the
circuits are idle. According to the data flow mechanism in the BLP model, an
activity pattern for functional units can be built. This activity pattern allows a clock
gating control to stop the clock signal with high accuracy after each task is
completed by a functional unit operating on the block level, and to readily restart
when activity needs to be resumed. This clock gating control can effectively
minimize the system’s power consumption.
3. To minimize the power consumption of MPEG codec applications, both circuit-level
and architecture-level analyses are important research directions. Several
approaches at each level have been proposed. At the circuit level, for example,
low-power on-chip I/O buffers composed of ASIC RAMs have been proposed that use a
Selective Bit Line Precharge scheme to reduce the bit line current [Miura95]. At the
architecture level, for example, the power consumption of VLD functional units can
be minimized by reducing switched capacitance based on a fine grain look-up table
design and a prefix pre-decoding scheme [Cho99]. In addition to these proposals,
research can be done on other components of the MPEG codec system, in areas such as
special memory circuit design and IDCT computational complexity reduction.
4. For data-intensive applications such as MPEG, most of the power is
consumed by memory access [Gonz96, Meng95]. Some research has found that the
switching component dominates the power consumed in memory access
[Chan95]. Several approaches to reducing this high power consumption for memory
access in video decoding systems have been proposed, either (1) reducing the bit
switching probability on the data bus or (2) reducing the data transfer
bandwidth (i.e., minimizing the bus width requirement). (1) For reducing bit
switching probability, a bus-encoding scheme is usually proposed that exploits data
correlation in the transferred sequence [Stan95, Panda99]. (2) For reducing data
transfer bandwidth, most of the research introduces an extra data compression
scheme between the video decoder and external memory [Sun97, Shih99]. Although
the BLP decoding scheme can effectively spread the peak bandwidth of data
transfer in order to minimize the data bus width, research can still be done on
further reducing data transfer bandwidth and reducing bit switching probability for
low-power applications. While promising, these approaches must be kept simple
so that they neither delay the normal decoding process nor increase chip size
excessively, and they must be accurate enough that there is no apparent
decline in video fidelity.
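As a concrete example of such a bus-encoding scheme, the bus-invert code of [Stan95] drives the complement of a word onto the bus, asserting an extra "invert" line, whenever more than half of the bus lines would otherwise toggle. A minimal sketch, with an assumed 8-bit bus and illustrative data:

```python
# A minimal sketch of bus-invert coding [Stan95] on an assumed 8-bit bus.
# If more than half the lines would toggle, the complement is driven on
# the bus and an extra "invert" line is asserted; the decoder undoes it.

N = 8                      # bus width in bits (assumed)
MASK = (1 << N) - 1

def bus_invert_encode(words):
    """Return (bus_value, invert_flag) pairs and total line transitions."""
    prev_bus, prev_inv = 0, 0
    transitions, encoded = 0, []
    for w in words:
        if bin((w ^ prev_bus) & MASK).count("1") > N // 2:
            bus, inv = (~w) & MASK, 1      # cheaper to send the complement
        else:
            bus, inv = w & MASK, 0
        transitions += bin(bus ^ prev_bus).count("1") + (inv ^ prev_inv)
        encoded.append((bus, inv))
        prev_bus, prev_inv = bus, inv
    return encoded, transitions

def bus_invert_decode(encoded):
    """Recover the original words from bus values and invert flags."""
    return [(bus ^ MASK) if inv else bus for bus, inv in encoded]
```

For the sequence 0x00, 0xFF, 0x0F, 0xF0, the encoder causes 7 line transitions (invert line included) versus 20 on an unencoded bus, and decoding recovers the original words exactly.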
5. The proposed performance simulation model is a top-down design model that
provides designers with an effective tool for determining such system-level issues
as decoding clock rate, internal buffer sizes, and data bus width. Hence, the next
development stage of this model should take two major directions: providing a
design platform for RTL-level testing of functional units, and providing a
simulation model for the hardware/software co-design environment. In the current
simulation model, the processing time of each functional unit is simulated
according to both the corresponding implementation algorithm and architecture, and
the data access and interrupt delays. The functionality and implementation time of a
functional unit are verified and calculated separately and in advance. Hence, for
further development of RTL-level testing, this simulation model can provide
synthesizable RTL tools for functional units, ultimately reducing design time.
These tools allow designers to reach two design goals simultaneously:
dynamically changing and verifying the configuration of a functional unit, and
collecting all the corresponding impact information on the system level factors. Up
to now, this simulation model only considers video. A complete multimedia
application usually includes video and audio signals, with the audio signal being
decoded in software by a DSP chip or CPU. In addition to audio decoding, a CPU
also needs to manage audio-video synchronization, which involves CPU
interrupts and process context switching. Hence, for further
development of the hardware/software co-design environment, this model can
provide simulation tools for CPU behavior simulation and communication
simulation between a CPU and hardware modules.
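A minimal sketch of the kind of system-level estimate the model produces (all cycle counts and delays below are illustrative placeholders, not values from this work): combine each unit's base processing time with its data-access delay, take the slowest stage as the block-level pipeline bound, and convert that bound into the decoding clock rate needed for real time.

```python
# Illustrative sketch of a top-down clock-rate estimate.  Per-unit cycle
# counts, access counts, and the memory delay are assumed placeholders,
# not values from the dissertation.

UNIT_BASE_CYCLES = {"VLD": 40, "IQ": 32, "IDCT": 50, "MC": 48}  # assumed
MEM_ACCESSES     = {"VLD": 1,  "IQ": 0,  "IDCT": 0,  "MC": 2}   # assumed
MEM_ACCESS_DELAY = 6   # cycles per external-memory access (assumed)

def cycles_per_block(base=UNIT_BASE_CYCLES, acc=MEM_ACCESSES,
                     mem_delay=MEM_ACCESS_DELAY):
    """In a block-level pipeline the slowest stage sets the rate."""
    return max(base[u] + acc[u] * mem_delay for u in base)

def required_clock_hz(width, height, fps, cpb):
    """Real-time bound for 4:2:0 video: 6 blocks per 16x16 macroblock."""
    macroblocks = (width // 16) * (height // 16)
    blocks_per_second = macroblocks * 6 * fps
    return blocks_per_second * cpb

cpb = cycles_per_block()                       # 60 with the values above
clk = required_clock_hz(1920, 1088, 30, cpb)   # 1080 lines coded as 1088
print(f"{cpb} cycles/block -> {clk / 1e6:.0f} MHz")
```

The same skeleton extends naturally toward the hardware/software co-design direction above, by adding terms for CPU interrupt latency and context-switch overhead per frame.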
References

[Ackl94] B. Ackland, "The Role of VLSI in Multimedia," IEICE Trans. on Electronics, Vol. E77-C, No. 5, pp. 711-718, May 1994.
[Ahme74] N. Ahmed, T. Natarajan, and K.R. Rao, "Discrete Cosine Transform," IEEE Trans. on Computers, Vol. C-23, pp. 90-94, Jan. 1974.
[ATSC01] United States Advanced Television Systems Committee, "Digital Television Standard, Revision B, with Amendment 1," ATSC Doc. A/53B, 7 Aug. 2001.
[ATSC95] United States Advanced Television Systems Committee, "Digital Audio Compression (AC-3) Standard," ATSC Doc. A/52, 20 Dec. 1995.
[Bae98] S.-O. Bae and K.-S. Kim, "Symbol-Parallel VLC Decoding Architecture for HDTV Application," Proc. of the IEEE Int. Conf. on Consumer Electronics, pp. 52-53, June 1998.
[Bae99] S.-O. Bae et al., "A Single-Chip HDTV A/V Decoder for Low Cost DTV Receiver," IEEE Trans. on Consumer Electronics, Vol. 45, No. 3, pp. 887-892, Aug. 1999.
[Bagl96] P. Baglietto, M. Maresca, M. Migliardi, and N. Zingirian, "Image Processing on High-Performance RISC Systems," Proceedings of the IEEE, Vol. 84, No. 7, pp. 917-930, July 1996.
[Balm94] K. Balmer, N. Ing-Simmons, P. Moyse, and I. Robertson, "A Single Chip Multimedia Video Processor," Proc. of the IEEE Custom Integrated Circuits Conf., pp. 91-94, May 1994.
[Bhas96] V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards, Kluwer Academic Publishers, 1996.
[Bhas95] V. Bhaskaran, K. Konstantinides, R.B. Lee, and J.P. Beck, "Algorithm and Architecture Enhancements for Real-Time MPEG-1 Decoding on a General Purpose RISC Workstation," IEEE Trans. on Circuits and Systems for Video Tech., Vol. 5, No. 5, pp. 380-386, Oct. 1995.
[Brin96] D. Brinthaupt, J. Knobloch, J. Othmer, and B. Petryna, "A Programmable Audio/Video Processor for H.320, H.324, and MPEG," IEEE Int. Solid-State Circuits Conf. Digest of Technical Papers, pp. 244-245, Feb. 1996.
[Bruni98] R. Bruni et al., "A Novel Adaptive Vector Quantization Method for Memory Reduction in MPEG-2 HDTV Decoders," IEEE Trans. on Consumer Electronics, Vol. 44, No. 3, pp. 537-544, Aug. 1998.
[Chal95] K. Challapali et al., "The Grand Alliance System for US HDTV," Proceedings of the IEEE, Vol. 83, No. 2, pp. 158-174, Feb. 1995.
[Chan95] A.P. Chandrakasan and R.W. Brodersen, "Minimizing Power Consumption in Digital CMOS Circuits," Proceedings of the IEEE, Vol. 83, No. 4, pp. 498-523, April 1995.
[Chang92] S.-F. Chang and D.G. Messerschmitt, "Designing High-Throughput VLC Decoder Part I – Concurrent VLSI Architectures," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June 1992.
[Chen77] W.-H. Chen, C.H. Smith, and S.C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform," IEEE Trans. on Communications, Vol. COM-25, No. 9, pp. 1004-1009, Sep. 1977.
[Chia97] L. Chiariglione, "MPEG and Multimedia Communications," IEEE Trans. on Circuits and Systems for Video Tech., Vol. 7, No. 1, pp. 5-18, Feb. 1997.
[Cho91] N.I. Cho and S.U. Lee, "Fast Algorithm and Implementation of 2-D Discrete Cosine Transform," IEEE Trans. on Circuits and Systems, Vol. 38, No. 3, pp. 297-305, March 1991.
[Cho99] S.-H. Cho et al., "A Low Power Variable Length Decoder for MPEG-2 Based on Nonuniform Fine-Grain Table Partition," IEEE Trans. on VLSI Systems, Vol. 7, No. 2, pp. 249-257, Jun. 1999.
[Choi97] J.R. Choi et al., "A 400Mpixels/s IDCT for HDTV by Multibit Coding and Group Symmetry," IEEE Int. Solid-State Circuits Conf. Digest of Technical Papers, pp. 262-263, Feb. 1997.
[Cugn95] A. Cugnini and R. Shen, "MPEG-2 Video Decoder for the Digital HDTV Grand Alliance System," IEEE Trans. on Consumer Electronics, Vol. 41, No. 3, pp. 748-753, Aug. 1995.
[Deis98] M.S. Deiss, "MP@HL MPEG2 Video Decoder IC for Consumer ATSC Receivers," Proc. of the IEEE Int. Conf. on Consumer Electronics, pp. 48-49, June 1998.
[Demu94] T. Demura et al., "A Single-Chip MPEG2 Video Decoder LSI," IEEE Int. Solid-State Circuits Conf. Digest of Technical Papers, pp. 72-73, Feb. 1994.
[Duar97] O. Duardo et al., "An HDTV Video Coder IC for ATV Receivers," IEEE Trans. on Consumer Electronics, Vol. 43, No. 3, pp. 628-632, Aug. 1997.
[Duar99] O. Duardo et al., "A Cost Effective HDTV Decoder IC with Integrated System Controller, Down Converter, Graphics Engine and Display Processor," IEEE Trans. on Consumer Electronics, Vol. 45, No. 3, pp. 879-883, Aug. 1999.
[Duha90] P. Duhamel and C. Guillemot, "Polynomial Transform Computation of the 2-D DCT," IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1515-1518, 1990.
[Faut94] T. Fautier, "VLSI Implementation of MPEG Decoder," IEEE Int. Symp. on Circuits and Systems, tutorial paper, May 1994.
[Feig92] E. Feig and S. Winograd, "Fast Algorithms for the Discrete Cosine Transform," IEEE Trans. on Signal Processing, Vol. 40, No. 9, pp. 2174-2193, Sept. 1992.
[Fern96] J.M. Fernandez, F. Moreno, and J.M. Meneses, "A High-Performance Architecture with a Macroblock-Level-Pipeline for MPEG-2 Coding," Real-Time Imaging, pp. 331-340, 1996.
[Flynn66] M.J. Flynn, "Very High Speed Computing Systems," Proceedings of the IEEE, Vol. 54, pp. 1901-1909, Dec. 1966.
[Fres99] F. Frescura et al., "DSP Based OFDM Demodulator and Equalizer for Professional DVB-T Receivers," IEEE Trans. on Broadcasting, Vol. 45, No. 3, pp. 323-332, Sep. 1999.
[Geib97] H. Geib et al., "Reducing Memory in MPEG-2 Video Decoder Architecture," Proc. of the IEEE Int. Conf. on Consumer Electronics, pp. 176-177, April 1997.
[Gonz96] R. Gonzalez and M. Horowitz, "Energy Dissipation in General-Purpose Microprocessors," IEEE J. Solid-State Circuits, Vol. SC-31, No. 9, pp. 1277-1283, Sep. 1996.
[Grand94] Grand Alliance, Grand Alliance HDTV System Specification, Version 2.0, Dec. 1994.
[Hama99] Y. Hamamato et al., "A Low-Power Single-Chip MPEG2 (Half-D1) Video Codec LSI for Portable Consumer-Products Applications," IEEE Trans. on Consumer Electronics, Vol. 45, No. 3, pp. 496-499, Aug. 1999.
[Hash94] R. Hashemian, "Design and Hardware Implementation of a Memory Efficient Huffman Decoding," IEEE Trans. on Consumer Electronics, Vol. 40, No. 3, pp. 345-352, Aug. 1994.
[Hask97] B.G. Haskell, A. Puri, and A.N. Netravali, Digital Video: An Introduction to MPEG-2, Chapman & Hall, 1997.
[Hopk94] R. Hopkins, "Digital Terrestrial HDTV for North America: the Grand Alliance HDTV System," IEEE Trans. on Consumer Electronics, Vol. 40, No. 3, pp. 185-198, Aug. 1994.
[Hsiau97] D.Y. Hsiau and J.L. Wu, "Real-Time PC-based Software Implementation of H.261 Video Codec," IEEE Trans. on Consumer Electronics, Vol. 43, No. 4, pp. 1234-1244, Nov. 1997.
[Hsieh96] C.-T. Hsieh and S.P. Kim, "A Concurrent Memory-Efficient VLC Decoder for MPEG Application," IEEE Trans. on Consumer Electronics, Vol. 42, No. 3, pp. 439-446, Aug. 1996.
[Huff52] D.A. Huffman, "A Method for the Construction of Minimum Redundancy Codes," Proc. IRE, Vol. 40, No. 9, pp. 1098-1101, Sep. 1952.
[Hung94] A.C. Hung and T.H.-Y. Meng, "A Comparison of Fast Inverse Discrete Cosine Transform Algorithms," ACM Multimedia Systems, Vol. 2, pp. 204-217, 1994.
[Hwan93] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, Inc., 1993.
[Ikek97] M. Ikekawa, D. Ishii, E. Murata, K. Numata, Y. Takamizawa, and M. Tanaka, "A Real-Time Software MPEG-2 Decoder for Multimedia PCs," Proc. of the IEEE Int. Conf. on Consumer Electronics, pp. 2-3, June 1997.
[Isnr98] M.A. Isnardi, "Understanding the ATSC Digital Television Standard," Tutorials of the IEEE Int. Conf. on Consumer Electronics, May 1998.
[ISO92] ISO/IEC JTC1 CD 11172, Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s, International Organization for Standardization, 1991.
[ISO94] ISO/IEC JTC1 CD 13818, Generic Coding of Moving Pictures and Associated Audio, International Organization for Standardization, 1994.
[ISO99] ISO/IEC DIS 14496, Coding of Audio-Visual Objects, International Organization for Standardization, 1999.
[Itoh95] K. Itoh, K. Sasaki, and Y. Nakagome, "Trends in Low-Power RAM Circuit Technologies," Proceedings of the IEEE, Vol. 83, No. 4, pp. 524-543, April 1995.
[Iwata97] E. Iwata et al., "A 2.2 GOPS Video DSP with 2-RISC MIMD, 6-PE SIMD Architecture for Real-Time MPEG2 Video Coding/Decoding," IEEE Int. Solid-State Circuits Conf. Digest of Technical Papers, pp. 258-259, Feb. 1997.
[Kama82] S. Kamangar and K.R. Rao, "Fast Algorithms for the 2-D Discrete Cosine Transform," IEEE Trans. on Computers, Vol. C-31, No. 9, pp. 899-906, Sept. 1982.
[Katc93] D.I. Katcher, H. Arakawa, and J.K. Strosnider, "Engineering and Analysis of Fixed Priority Schedulers," IEEE Trans. on Software Engineering, Vol. 19, No. 9, pp. 920-934, Sep. 1993.
[Kett94] K.A. Kettler and J.K. Strosnider, "Scheduling Analysis of the Micro Channel Architecture for Multimedia Applications," Proc. of the Int. Conf. on Multimedia Computing and Systems, pp. 403-414, May 1994.
[Kim96] J.M. Kim and S.I. Chae, "New MPEG2 Decoder Architecture Using Frequency Scaling," IEEE Int. Symp. on Circuits and Systems, Vol. 4, pp. 253-256, May 1996.
[Kim98a] J.M. Kim and S.I. Chae, "A Cost-Effective Architecture for HDTV Video Decoder in ATSC Receivers," IEEE Trans. on Consumer Electronics, Vol. 44, No. 4, pp. 1353-1359, Nov. 1998.
[Kim98b] S. Kim and W. Sung, "Fixed-Point Error Analysis and Word Length Optimization of 8x8 IDCT Architectures," IEEE Trans. on Circuits and Systems for Video Tech., Vol. 8, No. 8, pp. 935-940, Dec. 1998.
[Lee84] B.G. Lee, "A New Algorithm to Compute the Discrete Cosine Transform," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 6, pp. 1243-1245, Dec. 1984.
[Lee95] Y.-P. Lee, L.-G. Chen, and C.-W. Ku, "Architecture Design of MPEG-2 Decoder System," Proc. of the IEEE Int. Conf. on Consumer Electronics, pp. 258-259, May 1995.
[Lee96] C.L. Lee et al., "Implementation of Digital HDTV Video Decoder by Multiple Multimedia Video Processors," IEEE Trans. on Consumer Electronics, Vol. 42, No. 3, pp. 395-401, Aug. 1996.
[Lee99] P. Lee, "Performance Analysis of an MPEG-2 Audio/Video Player," IEEE Trans. on Consumer Electronics, Vol. 45, No. 1, pp. 141-150, Feb. 1999.
[Leho89] J. Lehoczky, L. Sha, and Y. Ding, "The Rate Monotonic Scheduling Algorithm: Exact Characterization and Average Case Behavior," IEEE Real-Time Systems Symposium, 1989.
[Lei91] S.-M. Lei and M.-T. Sun, "An Entropy Coding System for Digital HDTV Applications," IEEE Trans. on Circuits and Systems for Video Tech., Vol. 1, No. 1, pp. 147-155, March 1991.
[Li97] Jui-Hua Li and Nam Ling, "An Efficient Video Decoder Design for MPEG-2 MP@ML," IEEE Int. Conf. on Application-Specific Systems, Architectures and Processors, pp. 509-518, July 1997.
[Li99] J.-H. Li and Nam Ling, "Architecture and Bus Arbitration Schemes for MPEG-2 Video Decoder," IEEE Trans. on Circuits and Systems for Video Tech., Vol. 9, No. 5, pp. 727-736, Aug. 1999.
[Lin92] H.-D. Lin and D.G. Messerschmitt, "Designing High-Throughput VLC Decoder Part II – Parallel Decoding Methods," IEEE Trans. on Circuits and Systems for Video Tech., Vol. 2, No. 2, pp. 197-206, June 1992.
[Lin95] H.D. Lin, Microsystems Technology for Multimedia Applications: An Introduction, IEEE Press, 1995.
[Lin96] C.-H. Lin et al., "The VLSI Design of MPEG2 Video Decoder," Proc. of Int. Conf. on Computer Systems Technology for Industrial Applications, 1996.
[Ling97] Nam Ling and Jui-Hua Li, "A Bus-Monitoring Model for MPEG Video Decoder Design," IEEE Trans. on Consumer Electronics, Vol. 43, No. 3, pp. 526-530, Aug. 1997.
[Ling98] Nam Ling, Nien-Tsu Wang, and Duan-Juat Ho, "An Efficient Controller Scheme for MPEG-2 Video Decoder," IEEE Trans. on Consumer Electronics, Vol. 44, No. 2, pp. 451-458, May 1998.
[Ling02] Nam Ling and Nien-Tsu Wang, "Real-time Video Decoding Scheme for HDTV Set-Top Boxes," IEEE Transactions on Broadcasting, Vol. 48, No. 4, pp. 353-360, Dec. 2002.
[Ling03] Nam Ling and Nien-Tsu Wang, "A Real-Time Video Decoder for Digital HDTV," Journal of VLSI Signal Processing Systems, Vol. 33, No. 3, Kluwer Academic Publishers, pp. 295-306, Mar. 2003.
[Liu73] C.L. Liu and J.W. Layland, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment," Journal of the ACM, Vol. 20, No. 1, pp. 46-61, Jan. 1973.
[Liu96] M.N. Liu, "MPEG Decoder Architecture for Embedded Applications," IEEE Trans. on Consumer Electronics, Vol. 42, No. 4, pp. 1021-1028, Nov. 1996.
[Loef89] C. Loeffler, A. Ligtenberg, and G.S. Moschytz, "Practical Fast 1-D DCT Algorithms with 11 Multiplications," IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 988-991, 1989.
[Masa95] T. Masaki et al., "VLSI Implementation of Inverse Discrete Cosine Transformer and Motion Compensator for MPEG2 HDTV Video Decoding," IEEE Trans. on Circuits and Systems for Video Tech., Vol. 5, No. 5, pp. 387-395, Oct. 1995.
[Mats94] M. Matsui et al., "200 MHz Compression Macrocells Using Low-Swing Differential Logic," IEEE Int. Solid-State Circuits Conf. Digest of Technical Papers, pp. 254-255, Feb. 1994.
[Meng95] T.H. Meng et al., "Portable Video-on-Demand in Wireless Communication," Proceedings of the IEEE, Vol. 83, No. 4, pp. 659-680, April 1995.
[Mitc96] J.L. Mitchell, W.B. Pennebaker, C.E. Fogg, and D.J. LeGall, MPEG Video Compression Standard, Chapman & Hall, 1996.
[Miura95] K. Miura et al., "A 600mW Single Chip MPEG2 Video Decoder," IEICE Trans. Electron., Vol. E78, No. 12, pp. 1691-1696, Dec. 1995.
[Mukh91] A. Mukherjee and N. Ranganathan, "Efficient VLSI Design for Data Transformation of Tree-Based Codes," IEEE Trans. on Circuits and Systems, Vol. 38, No. 3, pp. 306-314, March 1991.
[Nach98] L. Nachtergaele et al., "Low Power Data Transfer and Storage Exploration for H.263 Video Decoder System," IEEE Journal on Selected Areas in Communications, Vol. 16, No. 1, pp. 120-129, Jan. 1998.
[Okub95] S. Okubo, K. McCann, and A. Lippman, "MPEG-2 Requirements, Profiles and Performance Verification – Framework for Developing a Generic Video Coding Standard," Signal Processing: Image Commun., Vol. 7, pp. 201-209, 1995.
[Onoy95] T. Onoye et al., "HDTV Level MPEG2 Video Decoder VLSI," IEEE Int. Conf. on Microelectronics and VLSI, pp. 468-471, Nov. 1995.
[Onoy96] T. Onoye et al., "Single Chip Implementation of MPEG2 Decoder for HDTV Level Picture," IEICE Trans. Fundamentals, Vol. E79-A, No. 3, pp. 330-338, Mar. 1996.
[Ooi94] Y. Ooi, A. Taniguchi, and S. Demura, "A 162 Mbit/s Variable Length Decoding Circuit Using an Adaptive Tree Search Technique," IEEE Custom Integrated Circuits Conference, pp. 107-110, 1994.
[Panda99] P.R. Panda and N.D. Dutt, "Low-Power Memory Mapping Through Reducing Address Bus Activity," IEEE Trans. on VLSI Systems, Vol. 7, No. 3, pp. 309-320, Sep. 1999.
[Park93] H. Park and V.K. Prasanna, "Area Efficient VLSI Architectures for Huffman Coding," IEEE Trans. on Circuits and Systems – II: Analog and Digital Signal Processing, Vol. 40, No. 9, pp. 568-575, Sep. 1993.
[Park95] H. Park, J.-C. Son, and S.-R. Cho, "Area Efficient Fast Huffman Decoder for Multimedia Applications," IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 3279-3281, 1995.
[Park99] S. Park, H. Cho, and J. Cha, "High Speed Search and an Area Efficient Huffman Decoder," IEICE Trans. Fundamentals, Vol. E82-A, No. 6, pp. 1017-1020, June 1999.
[Peng99] S.S. Peng and K. Challapali, "Low-Cost HD to SD Video Decoding," IEEE Trans. on Consumer Electronics, Vol. 45, No. 3, pp. 874-878, Aug. 1999.
[Pirs95] P. Pirsch, N. Demassieux, and W. Gehrke, "VLSI Architectures for Video Compression – A Survey," Proceedings of the IEEE, Vol. 83, No. 2, pp. 220-246, Feb. 1995.
[Prin96] B. Prince, High Performance Memories, John Wiley & Sons Ltd., 1996.
[Puri93] A. Puri, R. Aravind, and B.G. Haskell, "Adaptive Frame/Field Motion Compensated Video Coding," Signal Processing: Image Commun., Vol. 1-5, pp. 39-58, Nov. 1993.
[Rajk89] R. Rajkumar, "Task Synchronization in Real-Time Systems," Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, 1989.
[Saav95] R.H. Saavedra and A.J. Smith, "Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes," IEEE Trans. on Computers, Vol. 44, pp. 1223-1235, Oct. 1995.
[Shih99] C.-W. Shih, Nam Ling, and Tokunbo Ogunfunmi, "Memory Reduction by Haar Wavelet Transform for MPEG Decoder," IEEE Trans. on Consumer Electronics, Vol. 45, No. 3, pp. 867-872, Aug. 1999.
[Sita98] R. Sita et al., "A Single-Chip HDTV Video Decoder Design," IEEE Trans. on Consumer Electronics, Vol. 44, No. 3, pp. 519-526, Aug. 1998.
[Stan95] M.R. Stan and W.P. Burleson, "Bus-Invert Coding for Low-Power I/O," IEEE Trans. on VLSI Systems, Vol. 3, No. 1, pp. 49-58, Mar. 1995.
[Stei95] R. Steinmetz, "Analyzing the Multimedia Operating System," IEEE Multimedia, pp. 68-84, Spring 1995.
[Stei96] R. Steinmetz, "Human Perception of Jitter and Media Synchronization," IEEE Journal on Selected Areas in Communications, Vol. 14, No. 1, pp. 61-72, Jan. 1996.
[Sun87] M.T. Sun, L. Wu, and M.L. Liou, "A Concurrent Architecture for VLSI Implementation of Discrete Cosine Transform," IEEE Trans. on Circuits and Systems, Vol. CAS-34, No. 8, pp. 992-994, Aug. 1987.
[Sun97] H. Sun et al., "A New Approach for Memory Efficient ATV Decoder," IEEE Trans. on Consumer Electronics, Vol. 43, No. 3, pp. 517-525, Aug. 1997.
[Taka99] M. Takahashi et al., "A Low-Power MPEG-2 Codec LSI for Consumer Cameras," IEEE Trans. on Consumer Electronics, Vol. 45, No. 3, pp. 501-506, Aug. 1999.
[Taki01] T. Takizawa and M. Hirasawa, "An Efficient Memory Arbitration Algorithm for a Single Chip MPEG2 AV Decoder," IEEE Trans. on Consumer Electronics, Vol. 47, No. 3, pp. 660-665, Aug. 2001.
[Toyo94] M. Toyokura et al., "A Video DSP with a Macroblock-Level-Pipeline and a SIMD Type Vector-Pipeline Architecture for MPEG2 CODEC," IEEE Journal of Solid-State Circuits, Vol. 29, No. 12, pp. 1474-1481, Dec. 1994.
[Trem95] M. Tremblay and P. Tirumalai, "Partners in Platform Design," IEEE Spectrum, Vol. 32, No. 4, pp. 20-26, Apr. 1995.
[Uram92] S.-I. Uramoto et al., "A 100 MHz 2-D Discrete Cosine Transform Core Processor," IEEE J. Solid-State Circuits, Vol. 27, pp. 492-499, Apr. 1992.
[Uram95] S.-I. Uramoto et al., "An MPEG2 Video Decoder LSI with Hierarchical Control Mechanism," IEICE Trans. Electron., Vol. E78-C, No. 12, pp. 1697-1708, Dec. 1995.
[Veen94] H. Veendrick, O. Popp, and G. Postuma, "A 1.5 GIPS Video Signal Processor (VSP)," Proc. of the IEEE Custom Integrated Circuits Conf., pp. 95-98, May 1994.
[Vett85] M. Vetterli, "Fast 2-D Discrete Cosine Transform," IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1538-1541, 1985.
[Voor74] D.C. Voorhis, "Constructing Codes with Bounded Codeword Lengths," IEEE Trans. on Information Theory, Vol. IT-20, pp. 288-290, March 1974.
[Wang98] Nien-Tsu Wang, Chen-Wei Shih, Duan-Juat Wong-Ho, and Nam Ling, "MPEG-2 Video Decoder for DVD," The 8th Great Lakes Symposium on VLSI, pp. 157-160, Lafayette, LA, Feb. 18-21, 1998.
[Wang99a] Nien-Tsu Wang and Nam Ling, "A Novel Dual-Path Architecture for HDTV Video Decoding," IEEE Data Compression Conference, p. 557, Snowbird, Utah, March 29-31, 1999.
[Wang99b] Nien-Tsu Wang and Nam Ling, "Architecture for Real-time HDTV Video Decoding," Tamkang Journal of Science and Engineering, Vol. 2, No. 2, pp. 53-60, Nov. 1999.
[Wang01a] Nien-Tsu Wang and Nam Ling, "A Real-Time HDTV Video Decoder," IEEE Workshop on Signal Processing Systems (SiPS), Antwerp, Belgium, pp. 259-270, Sep. 26-28, 2001.
[Wang01b] H. Wang et al., "A Novel HDTV Video Decoder and Decentralized Control Scheme," IEEE Trans. on Consumer Electronics, Vol. 47, No. 4, pp. 723-728, Nov. 2001.
[Wei95] B. Wei and T.H. Meng, "A Parallel Decoder of Programmable Huffman Codes," IEEE Trans. on Circuits and Systems for Video Tech., Vol. 5, No. 2, pp. 175-178, April 1995.
[White93] S.W. White, P.D. Hester, J.W. Kemp, and G.J. McWilliams, "How Does Processor Performance MHz Relate to End-User Performance?," IEEE Micro, Vol. 13, No. 4, pp. 8-16, Aug. 1993.
[Winz95] M. Winzker, P. Pirsch, and J. Reimers, "Architecture and Memory Requirements for Stand-Alone and Hierarchical MPEG2 HDTV-Decoders with Synchronous DRAMs," IEEE Int. Symp. on Circuits and Systems, pp. 609-612, Apr. 1995.
[Wise98] J. Wiseman, An Introduction to MPEG Video Compression, Internet: members.aol.com/symbandgrl/, 1998.
[Yama01] H. Yamauchi et al., "Single Chip Video Processor for Digital HDTV," IEEE Trans. on Consumer Electronics, Vol. 47, No. 3, pp. 394-404, Aug. 2001.
[Yang95] J.-F. Yang, B.-L. Bai, and S.-C. Hsia, "An Efficient Two-Dimensional Inverse Discrete Cosine Transform Algorithm for HDTV Receivers," IEEE Trans. on Circuits and Systems for Video Tech., Vol. 5, No. 1, pp. 25-30, Feb. 1995.
[Yasu97] M. Yasuda et al., "MPEG2 Video Decoder and AC-3 Audio Decoder LSIs for DVD Player," IEEE Trans. on Consumer Electronics, Vol. 43, No. 3, pp. 462-468, Aug. 1997.
[Yu98] Z. Yu et al., "Design and Implementation of HDTV Source Decoder," IEEE Trans. on Consumer Electronics, Vol. 44, No. 2, pp. 384-387, May 1998.
Publications

1. Nam Ling, Nien-Tsu Wang, "A Real-Time Video Decoder for Digital HDTV,"
Journal of VLSI Signal Processing Systems, Vol. 33, No. 3, Kluwer Academic
Publishers, pp. 295-306, Mar. 2003.
2. Nam Ling, Nien-Tsu Wang, “Real-time Video Decoding Scheme for HDTV Set-
Top Boxes,” IEEE Transactions on Broadcasting, Vol. 48, No. 4, pp. 353-360,
Dec. 2002.
3. Nien-Tsu Wang and Nam Ling, “A Real-Time HDTV Video Decoder,” IEEE
Workshop on Signal Processing Systems (SiPS), Antwerp, Belgium, pp. 259-270,
Sep. 26-28, 2001.
4. Nien-Tsu Wang and Nam Ling, “Architecture for Real-time HDTV Video
Decoding,” Tamkang Journal of Science and Engineering, Vol. 2, No. 2, pp. 53-
60, Nov. 1999.
5. Nien-Tsu Wang and Nam Ling, “A Novel Dual-Path Architecture for HDTV
Video Decoding,” IEEE Data Compression Conference, pp. 557, Snowbird, Utah,
March 29-31, 1999.
6. Nam Ling, Nien-Tsu Wang, and Duan-Juat Ho, “An Efficient Controller Scheme
for MPEG-2 Video Decoder,” IEEE Trans. on Consumer Electronics, Vol.44,
No. 2, pp. 451-458, May 1998.
7. Nien-Tsu Wang, Chen-Wei Shih, Duan Juat Wong-Ho, and Nam Ling, “MPEG-2
Video Decoder for DVD,” The 8th Great Lakes Symposium on VLSI, pp. 157-160,
Lafayette, LA, Feb. 18-21, 1998.
Biographical Sketch
Nien-Tsu Wang received a B.Eng. degree in Mechanical Engineering from Tamkang
University, Taiwan, in 1988. He received an M.S. degree in Computer Science from George
Washington University, Washington, D.C., U.S.A., in 1994. He is currently pursuing a Ph.D.
degree in Computer Engineering at Santa Clara University, California, U.S.A.
From 1997 to 1999 he was with Medianix Semiconductor and NJR Corporation as a
Research Assistant, responsible for defining DVD and HDTV video decoder architectures.
From 2000 to 2001 he was with Oak Technology as a Senior Design Engineer, responsible
for implementing and maintaining DVD video and audio cores. From 2001 to 2002 he was
with Rise Technology as a Senior Software Engineer, responsible for managing real-time
MPEG-4 codec development and for next-generation multimedia CPU architecture design.
He is currently an R&D manager at Telewise Communications, responsible for managing
real-time H.264 codec development and for product management. His research interests
include multimedia data compression, VLSI architecture design for video/audio processing,
parallel computing, and computer graphics.