2017

Towards High-Quality and Resource-Efficient Mobile Streaming

Zakerinasab, Mohammad Reza

Zakerinasab, M. R. (2017). Towards High-Quality and Resource-Efficient Mobile Streaming (Unpublished doctoral thesis). University of Calgary, Calgary, AB. doi:10.11575/PRISM/28481

http://hdl.handle.net/11023/3851

doctoral thesis

University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission.

Downloaded from PRISM: https://prism.ucalgary.ca

UNIVERSITY OF CALGARY

Towards High-Quality and Resource-Efficient Mobile Streaming

by

Mohammad Reza Zakerinasab

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

DEGREE OF DOCTOR OF PHILOSOPHY

GRADUATE PROGRAM IN COMPUTER SCIENCE

CALGARY, ALBERTA

May, 2017

© Mohammad Reza Zakerinasab 2017

Abstract

Video streaming is one of the most popular applications on Internet-connected devices. In

particular, the increasing deployment of LTE/4G technologies and the advancements in

display quality and computing power of modern smartphones and tablets have led to a

significant growth in video streaming traffic in mobile networks. In these networks, commu-

nication and computational resources are not as abundant as those of wired networks or IPTV

systems. Therefore, numerous challenges need to be addressed to provide a low cost, high

quality video streaming service. Most importantly, adequate computational and networking

resources must be available and the streaming service must be adaptive to heterogeneous

end user devices and varying network conditions. Towards high-quality and resource-efficient

video streaming in mobile networks, in this thesis we propose innovative techniques to improve
computational resource efficiency in the cloud and on end user devices. Furthermore, we

address the network fluctuation and noise using a carefully-designed forward error correction

technique that also considers the energy limitations of end user devices such as smartphones.

Finally, we propose significant improvements over the state-of-the-art techniques to promote

collaborative video streaming in smartphones.


Acknowledgments

My deepest gratitude to my wife Sarah, for her continuous and unparalleled love, help and

support. She encouraged me to start this journey years ago and stood beside me to the end.

She has been my inspiration and motivation for continuing to improve my knowledge and

move my research forward. She is my rock, and I gratefully dedicate this thesis to her. I

also thank my son Ali, for bringing more joy, color and motivation to our lives.

I am forever indebted to my parents for giving me the opportunities and experiences that

have made me who I am. They selflessly encouraged me to explore new directions in life and

seek my own destiny. This journey would not have been possible if not for them.

Finally, I owe my gratitude to my supervisor Dr. Mea Wang. Without her enthusiasm

and continuous support this thesis would hardly have been completed. I express my warmest

gratitude to my supervisory committee members, Professor Carey Williamson and Dr. Peter

Hoyer. Their guidance and support have been valuable assets towards the completion of this

thesis.


Table of Contents

Abstract . . . ii
Acknowledgments . . . iii
Table of Contents . . . iv
List of Tables . . . vi
List of Figures . . . vii
List of Symbols . . . x
1 Introduction . . . 1
1.1 Objectives . . . 5
1.2 Contributions . . . 6
1.3 Structure of the Thesis . . . 9
2 Background and Related Works . . . 10
2.1 Video Coding and Compression . . . 10
2.1.1 Preliminaries . . . 10
2.1.2 Single-layered Video Coding: H.264/AVC . . . 14
2.1.3 Layered Video Coding: H.264/SVC . . . 25
2.2 Related Works . . . 30
2.2.1 Analyzing the Performance of Scalable Video Coding . . . 30
2.2.2 Distributed Video Transcoding in the Cloud . . . 32
2.2.3 Unequal Error Protection for Streaming Layered Videos . . . 33
2.2.4 Cooperative Ad-Hoc Networks and WiFi Offloading . . . 38
3 Detailed Analysis of Layered Video Coding . . . 42
3.1 Experiment Setup . . . 45
3.1.1 Experiment Testbed . . . 45
3.1.2 Selecting the Raw Video Dataset . . . 45
3.1.3 Performance Metrics . . . 49
3.2 Performance Analysis . . . 52
3.2.1 The Effect of Frame Size . . . 53
3.2.2 The Effect of Temporal Scalability . . . 56
3.2.3 The Effect of Spatial Layering . . . 60
3.2.4 The Effect of Quality Layering . . . 63
3.2.5 The Effect of Quantization Parameter . . . 66
3.3 Summary and Discussion . . . 68
4 Preparing Video in the Cloud . . . 70
4.1 Distributed Video Transcoding in the Cloud . . . 73
4.1.1 The Necessity of Considering GOP Dependencies . . . 75
4.2 Dependency-Aware Distributed Video Transcoding . . . 77
4.2.1 GOP-Dependency Graph . . . 77
4.2.2 Dependency-Aware Distributed Video Transcoding in the Cloud . . . 84
4.3 Performance Evaluation . . . 87
4.3.1 The overhead of the transcoding scheme . . . 88
4.3.2 Bitrate and Transcoding Time . . . 89
4.4 Summary . . . 92
5 Video Transmission in Wireless Networks . . . 93
5.1 Coding and Prediction in SVC . . . 94
5.2 Coding-Aware UEP for Layered Video Streaming . . . 98
5.2.1 Coding and Dependency-Aware Unequal Error Protection . . . 100
5.2.2 Performance Evaluation . . . 110
5.3 Adaptive FEC for Layered Video Multicast . . . 119
5.3.1 Adaptive FEC for Video Multicast . . . 120
5.3.2 Case Study: Application in a Mobile Network . . . 125
5.3.3 Performance Evaluation . . . 128
5.4 Summary . . . 136
6 Video Reception in Smartphones . . . 139
6.1 Energy-Efficient Collaborative Streaming . . . 142
6.1.1 General transmission scheme . . . 142
6.1.2 Two-Level Coding Scheme . . . 143
6.1.3 Distributed Scheduling Algorithm . . . 147
6.2 Optimal Resource Allocation and Scheduling . . . 148
6.2.1 Modeling the Cooperative Streaming System . . . 150
6.2.2 The Power Consumption Minimization Problem . . . 155
6.2.3 The Rate Allocation and Scheduling (RAS) Algorithm . . . 160
6.2.4 Overhead Analysis . . . 163
6.3 Performance Evaluation . . . 164
6.3.1 Cooperative Streaming using Different Coding Strategies . . . 166
6.3.2 Centralized Optimal RAS vs. Distributed Heuristic Algorithms . . . 169
6.3.3 Impact of the Session Elongation Constraint . . . 171
6.4 Summary . . . 175
7 Concluding Remarks and Future Works . . . 178
Bibliography . . . 182
A High Efficiency Video Coding . . . 204


List of Tables

3.1 Selected video sequences and their properties. . . . 49
3.2 Comparing the performance of SVC when DTQ = (1, 4, 1) and H.264/AVC for full HD video coding. . . . 55
3.3 The effect of additional I-pictures on performance of SVC when DTQ = (2, 4, 3) and Intra Period = GOP size. . . . 60
3.4 Dyadic vs. non-dyadic spatial layering results. Subcolumns show the respective overhead for one and two spatial layers (NDY1 vs. DY1 and NDY2 vs. DY2) respectively. . . . 63
3.5 Encoding configuration for quality (SNR) layering study. . . . 64
3.6 caption . . . 67
4.1 Reference videos and their visual properties. . . . 88
4.2 Overhead of the proposed algorithm. . . . 89
4.3 Comparing bitrate and average chunk size . . . 90
4.4 Comparing transcoding time . . . 91
5.1 Reference video sequences and their properties. . . . 113
5.2 Dependency statistics for different video sequences using a fixed layering configuration of DTQ = (2,4,0). . . . 114
5.3 Computational overhead of the proposed UEP model compared to the video encoding time . . . 118
5.4 Y-PSNR of the transmitted videos when varying the video specification. . . . 118
5.5 Packet loss rate of the multicast groups . . . 126
5.6 The specification of PA layered video substreams (full-HD, 24 fps) . . . 126
5.7 PA substream specification for different quality layers . . . 127
5.8 Energy profile of the reference mobile device. . . . 130
6.1 Summary of notations . . . 154
6.2 Energy consumption optimization problem for video streaming in a cooperative network . . . 158
6.3 Throughput and energy efficiency of wireless transmissions and coding operations . . . 165
6.4 Specification of heterogeneous nodes for experiment II . . . 171
6.5 Specification of heterogeneous nodes . . . 172


List of Figures and Illustrations

1.1 The life cycle of a video streaming episode. . . . 2
2.1 Block diagram of AVC encoder / decoder [145]. . . . 17
2.2 The temporal hierarchy of frames and the concept of group of pictures in H.264/AVC. The number on each frame specifies the encoding order. . . . 18
2.3 Dividing a frame into slice groups using flexible macroblock ordering. . . . 19
2.4 H.264/AVC prediction directions for Intra 4 × 4 prediction [145]. . . . 21
2.5 Block diagram of a SVC encoder for two spatial layers [112]. . . . 26
2.6 Layered design of SVC. The numbers on each frame specify the coding order inside the spatial layer. . . . 27
2.7 Block diagonal and ladder shaped coefficient matrices for two video layers L1 and L2, in which each video segment is divided into k1 and k2 data blocks, respectively. These matrices are multiplied into k1 + k2 data blocks to create k1 + k2 reconstruction blocks and d1 + d2 redundant coded blocks for forward error correction. . . . 37
3.1 Sample frames from the selected video sequences. . . . 50
3.2 Comparing the performance of H.264/AVC, SVC and Simulcast over the video sequence Big Buck Bunny (BB) when the frame size is varied from 512 × 288 pixels to 1920 × 1080 pixels. . . . 53
3.3 The effect of increasing the GOP size from 2 to 16 on the performance of H.264/SVC for encoding test video sequences. . . . 57
3.4 The effect of varying Intra Period parameter on the performance of H.264/SVC when encoding different layered representations of Pedestrian Area (PA) video sequence. . . . 59
3.5 The effect of SVC spatial layering on (a) the streaming server side and (b) the receiver side. . . . 61
3.6 The effect of varying the number of quality layers from zero to four on the performance of H.264/SVC for different video sequences. . . . 65
3.7 The effect of varying quantization parameter on the performance of H.264/SVC when different layering structure is used to encode Pedestrian Area (PA) video sequence. The horizontal axis is the value of the highest quantization parameter used in the layered structure. . . . 67
4.1 Workflow of distributed video transcoding in the cloud. . . . 74
4.2 The effect of increasing the size of video chunks from 1 GOP to 64 GOPs on the video bitrate and transcoding time. The numbers are adjusted according to the video chunks with size of unit GOP. . . . 76
4.3 Top: Prediction dependency links between two consecutive GOPs in the base layer (layer S0) of the SVC video from Fig. 2.6. Bottom: Macroblock dependency graph modelling inter-GOP prediction. . . . 79
4.4 Different types of dependencies between macroblocks in SVC. (a) Using a full macroblock as a reference, (b) Using a macroblock created from portions of 2 or 4 macroblocks as a reference, (c) Using a submacroblock as a reference (after proper upsampling), and (d) Using multiple macroblocks as references. . . . 82
4.5 Converting a macroblock-dependency graph Gm (a) to a frame-dependency graph Gf (b), to a GOP-dependency graph Gg (d), and at last to a GOP-distance graph (e). . . . 83
4.6 The modified JSVM encoder software. Components in gray are modified JSVM components. Components in white are added to JSVM. . . . 88
5.1 Prediction tree of the scalable video coding standard. The blocks with dashed lines may or may not exist at the discretion of the encoder. . . . 95
5.2 Spatial prediction with dyadic settings in SVC. Each 16 × 16 rectangle represents a single macroblock. . . . 97
5.3 An example of a dependency graph with 6 nodes, where m1 serves as an absolute reference macroblock and m6 is not used by any other macroblock. . . . 101
5.4 Different types of dependencies among macroblocks in SVC. (a) Using a full macroblock as a reference, (b) Using a macroblock created from portions of 2 or 4 macroblocks as a reference, (c) Using a submacroblock as a reference (after proper upsampling), and (d) Using multiple macroblocks as reference. . . . 102
5.5 An example of a 10-node weighted dependency graph G. Nodes represent macroblocks and arcs represent the dependencies. (a) Before propagating the weights. (b) After propagating the weights by traversing the nodes in topological order and updating the weight of reference nodes according to Eq. 5.5. . . . 105
5.6 The prediction dependencies in a SVC video sequence with two spatial and three temporal layers. Dependency links between key pictures are shown in black. The grey links represent dependency among pictures between two consecutive key pictures. . . . 108
5.7 The architecture of the performance evaluation system. Components in gray are modified JSVM components and those in white are developed from scratch. . . . 111
5.8 Performance of different UEP models over a packet erasure channel with varying packet loss rate and fixed layering configuration of DTQ = (1,3,1). . . . 117
5.9 The proposed coding scheme for FEC blocks in layered video streaming. . . . 123
5.10 Assigning layers in OFDMA. . . . 128
5.11 Block diagonal and ladder shaped coefficient matrices for two video layers L1 and L2, in which each video segment is divided into k1 and k2 data blocks, respectively. These matrices are multiplied into k1 + k2 data blocks to create k1 + k2 reconstruction blocks and d1 + d2 redundant coded blocks for forward error correction. . . . 129
5.12 Three minutes trace of the packet loss rate. . . . 131
5.13 The objective video quality when using different layered protection mechanisms and varying the loss rate from 0% to 20%. . . . 132
5.14 Transmission overhead of different multicast groups using different layered protection mechanisms. . . . 133
5.15 Transmission delay and waiting delay of different multicast groups using different layered protection mechanisms. . . . 134
5.16 Energy consumed by the reference mobile device per hour of streaming session. . . . 135
5.17 Time needed to prepare all the redundant coded blocks. . . . 137
6.1 An overview of a collaborative streaming system for smartphones . . . 140
6.2 Streaming segment i from the Cloud to all collaborative nodes in the proposed transmission scheme . . . 144
6.3 Network model. . . . 151
6.4 Modeling the cooperative streaming system. . . . 152
6.5 Impact of cooperation arrangements and coding strategies on average energy consumption . . . 167
6.6 A break-down of the average energy consumption . . . 168
6.7 Effectiveness of the RAS algorithm . . . 168
6.8 Average transmission delay of video segments offered by different scheduling algorithm . . . 172
6.9 Length of the streaming session when varying shared session elongation coefficient ψ . . . 173
6.10 Average energy consumption for different values of the shared session elongation coefficient . . . 174
6.11 Average transmission delay of video segments when varying the shared session elongation coefficient . . . 175
A.1 The block diagram of HEVC encoder / decoder (with decoder elements shaded in light gray) [125]. . . . 206
A.2 Subdivision of a coding tree block (CTB) into coding blocks (CB) and transform blocks (TB). Solid lines indicate CB boundaries and dotted lines indicate TB boundaries. (a) CTB with its partitioning. (b) Corresponding quadtree [125]. . . . 206


Symbols, Abbreviations and Nomenclature

Symbol Definition

3G 3rd Generation (of mobile telecommunications technology)

4CIF 4x Common Intermediate Format, 704 x 576 pixels

4G 4th Generation (of mobile telecommunications technology)

AAC Advanced Audio Coding

ACM Association for Computing Machinery

AMVP Advanced Motion Vector Prediction

AVC Advanced Video Coding

AVI Audio Video Interleave

B-frame/picture Bi-predictive frame/picture

CAVLC Context-Adaptive Variable Length Coding

CB Coding Block

CBP Constrained Baseline Profile (H.264/AVC)

CCITT Consultative Committee for Int. Telephony and Telegraphy

CGS Coarse Grain Scalability

CIF Common Intermediate Format, 352 x 288 pixels

CTB Coding Tree Block

CTU Coding Tree Unit

CU Coding Unit

DCT Discrete Cosine Transform

DSCQS Double Stimulus Continuous Quality Scale

DVB Digital Video Broadcasting

EEP Equal Error Protection

ESS Extended Spatial Scalability (H.264/SVC)


EWFC Expanding Window Fountain Codes

FEC Forward Error Correction

FGS Fine Grain Scalability

FHD Full High Definition, 1080p, 1920 x 1080 pixels

FLV Flash Video

FMO Flexible Macroblock Ordering

FR-IQA Full Reference Image Quality Assessment

GF Galois Field

GOP Group of Pictures

H.264 Advanced Video Coding

H.265 High Efficiency Video Coding

HD High Definition, 720p, 1280 x 720 pixels

HDMI High-Definition Multimedia Interface

HDTV High Definition TV

HEVC High Efficiency Video Coding, H.265/HEVC

HiP High Profile (H.264/AVC)

IDR Instantaneous Decoding Refresh

IEEE Institute of Electrical and Electronics Engineers

I-frame/picture Intra-coded frame/picture

ITU-T The International Telecommunication Union - Telecommunication Standardization Sector

JSVM Joint Scalable Video Model

LDPC Low Density Parity Check

LTE Long Term Evolution

LW-UEP Layer Weighted Unequal Error Protection

MB Macroblock

MGS Medium Grain Scalability


MOS Mean Opinion Score

MP4 MPEG-4 Part 14 digital multimedia format

MPEG The Moving Picture Experts Group

MSE Mean Square Error

MS-SSIM Multi-Scale Structural Similarity Index

MVC Multi-view Video Coding

NAL Network Abstraction Layer

NC Network Coding

NQM Noise Quality Measure

OFDM Orthogonal Frequency Division Multiplexing

OFDMA Orthogonal Frequency Division Multiple Access

OPTICS Ordering Points To Identify the Clustering Structure

P2P Peer-to-Peer

PB Prediction Block

P-frame/picture Predicted frame/picture

PSNR Peak Signal to Noise Ratio

PU Prediction Unit

PW-UEP Packet Weighted Unequal Error Protection

QCIF Quarter Common Intermediate Format, 176 x 144 pixels

QP Quantization Parameter

RCPC Rate Compatible Convolutional Codes

RLNC Random Linear Network Coding

RS Reed-Solomon codes

SAO Sample Adaptive Offset

SD Standard Definition

SDTV Standard Definition TV


SHVC Scalable High Efficiency Video Coding, H.265/SHVC

SNR Signal to Noise Ratio

SS-SSIM Structural Similarity Index

SVC Scalable Video Coding

TB Transform Block

TCP Transmission Control Protocol

TDMA Time Division Multiple Access

TU Transform Unit

UDP User Datagram Protocol

UEP Unequal Error Protection

UHD Ultra High Definition

UQI Universal Quality Index

VCL Video Coding Layer

VIF Visual Information Fidelity

VLC Variable Length Coding

VSNR Visual Signal to Noise Ratio

WSNR Weighted Signal to Noise Ratio

XP Extended Profile (H.264/AVC)


Chapter 1

Introduction

The increasing deployment of high speed telecommunication technologies such as LTE/4G

and the growing computing power and display quality of modern smartphones along with

the access to more than four million mobile applications [122] have enormously increased the

mobile data usage. Cisco predicted that the global mobile data traffic will increase eight-fold

from 44 exabytes1 in 2015 to 367 exabytes in 2020 [36]. Among different services provided for

Internet-connected mobile devices, video streaming accounted for 55% of total mobile data

traffic in 2015, i.e., more than half of all mobile data traffic, and it is expected to surpass

75% of mobile data traffic in 2020 [36]. Such a massive amount of video traffic in mobile

networks where resources such as spectrum and data transmission infrastructure are scarce

and expensive inevitably causes many technical and operational problems. These problems

can be classified according to the life cycle of a video streaming episode as shown in Fig. 1.1.

Generally speaking, the life cycle of a video streaming service can be divided into three

phases:

• service preparation inside the cloud, where the video is processed and prepared to be

delivered to the end user devices per request;

• service delivery to the end nodes, where the video is packetized, shielded against noise,

and then transmitted over the noisy communication channels to the end user device;

• and finally, the reception and playback phase, where the video packets are received,

missing packets are reconstructed, and video is played on the end user device.

In the service preparation phase, the high quality non-compressed raw video (or lossless

compressed video) is sent from the camera or the editing desk to the online video encoder

1 Each exabyte is $10^{18}$ bytes, which is roughly equal to one billion gigabytes.



Figure 1.1: The life cycle of a video streaming episode.

software / hardware module, where a high quality and high resolution encoded version of

the video is created and stored on the media storage servers. Then the high quality video

is converted to different video formats, resolutions, qualities and bitrates. The process of

converting an encoded video stream to another encoded video stream with different properties

is usually referred to as video transcoding [50]. Video transcoding can be done in real-time,

aka online video transcoding, or the transcoding requests can be addressed by offline video

transcoders using a normal or priority queue. Unless the video is intended for live streaming,

offline video transcoding is preferred as the transcoding system does not need to satisfy strict

timing constraints. However, in recent years the requirement for online video transcoding

has significantly increased mostly due to the increasing live video streaming services for user

generated videos over the Internet [31]. The transcoding task can be performed on the cloud

using one or more video transcoder virtual machines [18].

In video encoding and transcoding, the target properties of the encoded video (such as

video resolution, frame rate, video quality or video bitrate) are called a video encoding /

transcoding profile. A single transcoding profile can be used to generate a set of transcoded

versions of the video with different properties to support different levels of network connec-

tivity or hardware capability at the end user device. No matter which video coding standard


is used, the to-be-transcoded video is normally segmented into multiple chunks and the video

chunks are distributed to virtual machines in the cloud to speed up the transcoding process.

Then the transcoded video segments are merged together to create the transcoded video

stream.
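To make this split-distribute-merge workflow concrete, the following Python sketch fans fixed-length chunks out to a pool of workers and reassembles the results in order. The profile dictionary, the chunk names and the transcode_chunk routine are placeholders introduced only for illustration; they do not correspond to any specific transcoder discussed later in this thesis.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical target profile: resolution, frame rate and bitrate of the output version.
PROFILE = {"width": 1280, "height": 720, "fps": 30, "bitrate_kbps": 2500}

def transcode_chunk(chunk, profile):
    """Placeholder for the work one transcoder VM performs on one video chunk."""
    return f"{chunk}@{profile['height']}p"  # a real transcoder would return encoded bytes

def transcode_video(chunks, profile, workers=4):
    """Distribute chunks to workers, then merge the results back in their original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        transcoded = list(pool.map(lambda c: transcode_chunk(c, profile), chunks))
    return transcoded  # merge step: concatenating in order rebuilds the stream

if __name__ == "__main__":
    chunks = [f"gop_{i:03d}" for i in range(8)]  # chunks of one or more whole GOPs
    print(transcode_video(chunks, PROFILE))
```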

Most video coding standards allow video transcoders to tune the quality of the transcoded

video, the required bandwidth (the video bitrate) and to some extent, the transcoding time.

Obviously, it is desirable to decrease the resource usage (i.e., transcoding time and video

bitrate, which is translated to network bandwidth for video transmission) while increasing

the visual quality of the transcoded video. This is a challenge since decreasing the video

bitrate or transcoding time normally results in lower quality transcoded video. Motivated

by the results of investigating the underlying mechanisms and properties of layered video

coding standards, in Chapter 4 a novel method for distributing video transcoding tasks

between cloud transcoders is proposed that decreases resource consumption on the cloud

while improving the video quality. We discuss this further in Chapter 4. The transcoded

versions of the video then are stored temporarily or permanently on the media storage server,

which can be on a private datacenter or on a public cloud [75].

When a streaming request is received by the media server, the media server selects a

proper version of the video in the appropriate quality level according to the properties of

the communication channel (e.g., connection bandwidth) and the specification of the end

user device (e.g., screen resolution). Then the selected video is packetized according to the

streaming protocol [50, 83]. Furthermore, if the end user device is using a mobile data net-

work, the transmission channel is exposed to the intrinsic characteristics of wireless networks

such as noisy communication channels, variable loss rate and bandwidth fluctuation. These

characteristics of mobile data networks pose challenges to resource efficient and high quality

video streaming. Video streaming, especially the live service, is sensitive to delay, jitter

and packet loss. When a packet is lost or partially distorted due to the presence of noise


in wireless channels, different bit-level, byte-level or segment-based forward error correc-

tion (FEC) methods might be used to reconstruct the distorted packets [40]. Clearly, extra

bandwidth is needed to transmit the FEC packets along with the original video packets. Fur-

thermore, extra processing resources are needed to apply the forward error correction codes

and retrieve the distorted packets. FEC codes strongly improve the visual quality of the

transmitted video stream [93], which has resulted in widespread use of these techniques in

video streaming over wireless networks by industrial manufacturers such as Qualcomm [41].
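As a toy illustration of segment-based FEC (not the specific codes studied later in this thesis), a single XOR parity packet sent alongside a block of equally sized source packets lets the receiver rebuild any one lost packet, at the cost of one packet of extra bandwidth:

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    """One redundant packet: the XOR of all source packets (all of equal length)."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return parity

def recover_single_loss(received, parity):
    """Rebuild the one missing packet from the survivors and the parity packet."""
    missing = parity
    for p in received:
        missing = xor_bytes(missing, p)
    return missing

packets = [bytes([i] * 4) for i in range(4)]       # four equally sized source packets
parity = make_parity(packets)                      # the extra FEC packet
lost, survivors = packets[2], packets[:2] + packets[3:]
assert recover_single_loss(survivors, parity) == lost
```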

If the conventional FEC methods cannot recover the distorted packets, to preserve the

video quality, the lost packets must be retransmitted. In a live or low latency video streaming

application, such a retransmission may lead to unacceptable delay. To address this issue to

some extent, the video decoder is equipped with tools to recover from the lost video packets

by estimating the missing parts of the video frame [145]. This solution inevitably introduces

error to the video signal and reduces the visual quality of the video. Furthermore, due to the

prediction dependencies between video packets, some video packets can be more important

than others for the quality of the decoded video. This property is exploited to improve the

performance of forward error correction techniques by better protecting the more important

video packets, a technique called unequal error protection [74]. Towards higher quality of

the transmitted video stream in wireless networks, one research direction in the unequal

error protection literature is to construct better error correction codes or adjust the existing

ones to the application of video streaming. Furthermore, more accurate determination of the

importance of video packets results in higher visual quality of the transmitted video stream.

Towards resource efficiency, the employed UEP technique must adapt to the fluctuating

conditions of the wireless networks such that the bandwidth is not wasted. We investigate

these challenges in Chapter 5.
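A minimal sketch of the unequal-protection idea follows, assuming per-layer importance weights and a fixed redundancy budget are already known; the dependency-aware way such weights are actually derived is the subject of Chapter 5, and the weight values below are purely hypothetical.

```python
def allocate_fec(importance, budget):
    """Split a fixed number of redundant packets across video layers in proportion
    to their importance weights (more important layers receive more protection)."""
    total = sum(importance.values())
    shares = {k: budget * w / total for k, w in importance.items()}
    alloc = {k: int(s) for k, s in shares.items()}
    # hand out packets lost to rounding, largest fractional remainder first
    leftover = budget - sum(alloc.values())
    for k in sorted(shares, key=lambda k: shares[k] - alloc[k], reverse=True)[:leftover]:
        alloc[k] += 1
    return alloc

weights = {"base": 6.0, "enh1": 3.0, "enh2": 1.0}   # hypothetical layer importance
print(allocate_fec(weights, budget=10))             # {'base': 6, 'enh1': 3, 'enh2': 1}
```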

Finally, when the video is received on the end user device and the missing network packets

are recovered as much as possible, the video stream is sent to the requesting application for


decoding and playback. Using an ad-hoc network of devices in close proximity to each

other, the end user device may also share the video packets with other devices as part of a

collaborative network [48]. Depending on the network configuration and the collaboration

algorithm in place, the ad-hoc network may be a P2P network [123] or use a usual client-

server configuration [73]. In such a collaborative system, the quality of the transmitted

video stream can be improved by offloading mobile data transmission to short-range wireless

networks. The main challenge is to efficiently use the pool of shared resources to reduce

resource usage on each node of the network and increase the visual quality of the received

video. We investigate these challenges in Chapter 6.
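As a simplified illustration of such collaboration (ignoring the heterogeneous link rates, battery levels and scheduling constraints addressed in Chapter 6), each node could fetch a distinct subset of segments over the cellular link and rebroadcast them to its peers over the short-range network. The node and segment names below are placeholders.

```python
def assign_segments(segments, nodes):
    """Round-robin assignment: each node downloads a distinct subset of segments
    over the mobile network and shares them with its peers over the ad-hoc link."""
    plan = {node: [] for node in nodes}
    for i, seg in enumerate(segments):
        plan[nodes[i % len(nodes)]].append(seg)
    return plan

nodes = ["phone_A", "phone_B", "phone_C"]
segments = [f"seg_{i}" for i in range(9)]
for node, subset in assign_segments(segments, nodes).items():
    print(node, "downloads", subset, "via cellular and rebroadcasts locally")
```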

1.1 Objectives

In this thesis, we study different phases of the lifecycle of video streaming service and in-

vestigate how to improve the resource efficiency and the visual quality of the transmitted

video stream. Towards these goals, we explore various challenges of video streaming in mo-

bile networks and propose innovative ideas, methods and algorithms to better address these

issues. In contrast to existing work, this thesis builds a bridge between video coding and

compression research and the networking components of the video preparation and delivery.

This approach is not very common in the related research since the researchers mostly belong

to either the video coding community or the networking community, and hence tend to address
the issues from a single perspective.

Recently, research works such as SoftCast [66] have tried to consider the networking prob-

lems directly in the design of the video coding and compression techniques. In comparison,

in this thesis we delve into the video coding and compression techniques to find out how

differently the aforementioned issues can be addressed if a deeper knowledge of video com-

pression techniques is utilized. As mentioned earlier, we envision that the video stream is

delivered to the end user devices in three phases; namely, service preparation on the cloud,


service delivery to the end user device over the mobile network, and video reception and

display along with collaboration among the end nodes. In each phase, we go beyond the

past research works by considering the underlying mechanisms and properties of the state-

of-the-art video coding and compression techniques. The main objectives in the research

presented in this thesis are to reduce the computational complexity of video preparation on

the cloud; to minimize the negative effects of intrinsic properties of wireless networks, such

as fluctuating packet loss rate and network bandwidth; to improve the quality of the video

perceived by the mobile end user; and to reduce the resource usage at the end user mobile

devices.

1.2 Contributions

The contributions of this thesis can be summarized as follows:

• Investigating the underlying mechanisms and properties of the state-of-the-art video

coding standards

– Reviewing the related literature reveals that many ideas proposed in research papers

suffer from an inadequate knowledge of the underlying mechanism and properties

of video coding standards2. To avoid such a mistake and to fabricate a solid base

for this research, a thorough and deep study of the state-of-the-art single-layer and

layered video coding standards, i.e. H.264/AVC and H.264/SVC, respectively, is

performed3. Along with reading the standard and the source code of the reference

2 For example, numerous publications addressing the unequal protection of video packets in layered video coding assume that the inter-layer prediction dependencies form a complete graph between consecutive spatial, temporal and quality layers. Not only is this wrong according to the internal mechanism of layered video coding standards, it is also neither practical nor possible due to the computational burden of the required prediction loops, hence rendering the ideas, observations and analysis unreliable.

3 Even though the H.265/HEVC video coding standard and its layered video coding extension H.265/SHVC were introduced recently, they are still not on the verge of common use in mobile video streaming applications due to the heavy computational burden of encoding tasks. However, in Chapter 7 we discuss how the results of the presented research can contribute to video streaming on mobile networks when H.264/AVC and H.264/SVC are replaced with their descendants.


software published by the standardization group, in this study the effect of modifying

different coding parameters on computational complexity and video quality is broadly

investigated. We discuss this further in Chapter 3.

• Video preparation on the Cloud

– Motivated by the results of investigating the underlying mechanisms and properties

of layered video coding standards, in Chapter 4 a novel method for distributing

video transcoding task between cloud transcoders is proposed. The proposed model

takes the properties of the to-be-transcoded video into account and suggests a new

transcoding paradigm that adaptively changes the length of the video segments. The

suggested technique improves resource efficiency by decreasing the video bitrate and

the computational resource consumption on the cloud. It also increases the visual

quality of the transcoded video for more complex video sequences. We discuss this

further in Chapter 4.

• Video delivery over the mobile networks

– Video coding and compression techniques are based on extracting similarities be-

tween portions of video frames and exploiting these similarities toward lossy com-

pression of the video signal. Hence, some video packets are more important than

other packets for video playback. That is, the negative effect of losing different video

packets on the quality of the reconstructed video can be significantly different. Con-

sidering the lossy nature of wireless and mobile communication networks, the more

important video packets must receive more protection no matter which forward er-

ror correction (FEC) technique is employed. In this thesis, we propose a novel video

unicast model that, for the first time, considers the internal design of the video cod-

ing standards and brings a significant video quality improvement over the previous

unequal protection proposals for video streaming. The proposed model considers

an independent and identically distributed random packet loss model. Furthermore,


the proposed model is extended to video multicast service and its application in a

common mobile communication network is studied. The model improves the quality

of the transmitted video and also reduces the energy consumption on mobile phones.

We discuss this further in Chapter 5.

• Cooperative streaming

– The adjacent end user devices may create a collaborative ad-hoc network to collec-

tively download the video packets over the mobile network. Such an arrangement

can significantly reduce the mobile data usage per end user device. Simultaneously,

the battery consumption can be decreased since data transmission over mobile net-

works is more battery consuming than that of ad-hoc wireless networks. To maximize

these benefits, the video segments must be wisely associated to the end nodes and

the segment downloads should be scheduled properly. Towards these goals, we pro-

pose an optimal rate allocation and scheduling algorithm that maximizes the benefit

of collaboration among end nodes.

– Furthermore, a novel two-level coding scheme is proposed to protect the video packets

against losses while transmitting over the mobile network and the ad-hoc network.

The coding scheme manages to keep the computational complexity of the coding

scheme on the server side, hence saving the battery of the end user devices. The

proposed model considers an independent and identically distributed random packet

loss model. We discuss these further in Chapter 6.

The results of the research reported in this thesis are presented in six reputable confer-

ences including ACM Multimedia [160], IEEE LCN (IEEE Conference on Local Computer

Networks) [157,159], IEEE MASCOTS (IEEE International Symposium on Modeling, Analy-

sis and Simulation of Computer and Telecommunication Systems) [156], IEEE/ACM IWQoS

(IEEE/ACM International Symposium on Quality of Service) [154], and IEEE WCNC (IEEE

Wireless Communications and Networking Conference) [158]. Furthermore, some primary


research results are presented as posters in IEEE ICDCS (IEEE International Conference on

Distributed Computing Systems) [162] and IEEE/ACM IWQoS [161] conferences.

In parallel with the research presented in this thesis, an efficient model was developed for

the update problem in network coding enabled cloud storage systems [153, 155, 163]. This

was the preliminary research topic of this PhD. It was left aside in favor of the current topic

before the candidacy exam. Results are not presented in this report as they were not strongly

related to the main topic of this thesis.

Along with modifying numerous available open source tools, more than ten thousand

lines of code were written over the course of this research to evaluate the presented ideas, run

the experiments, and analyze the results. The code was mostly written in C++, Python

and shell scripting language. Computing resources were obtained from Cybera Rapid Access

Cloud, Amazon EC2 and Microsoft Azure. Furthermore, a private server cluster of ten

powerful nodes was used for the experiments from early 2012 to late 2015.

1.3 Structure of the Thesis

The rest of this thesis is organized as follows. Chapter 2 provides background and related

works. Chapter 3 investigates the underlying mechanism and performance of the state-of-the-art
layered video coding standard, i.e. H.264/SVC. Chapter 4 presents a novel proposal
on how to decrease the resources needed to transcode videos on the cloud while achieving

better quality of the transcoded videos. Chapter 5 presents the proposed dependency aware

unequal error protection technique along with the novel adaptive FEC method for layered

video streaming. Chapter 6 studies the collaboration among end user devices and how such

a collaboration can be used to reduce the resource usage in the transmission network and at the

end nodes simultaneously. Finally, Chapter 7 concludes the thesis and proposes directions

for future research.


Chapter 2

Background and Related Works

This chapter provides a concise summary of the key-enabling technologies that are employed

in this thesis, including state-of-the-art single-layer and multi-layered video coding standards

used in mobile video streaming and the use of Cloud technology as an enabler for this

application. Furthermore, a review of the related research work is presented. The emphasis

is particularly on related work most relevant to the research proposed in this thesis.

2.1 Video Coding and Compression

In this section, we briefly review the basic concepts related to video coding and compression.

Next, we review the state-of-the-art video coding standards for single-layered videos. Finally,

multi-layered video coding is reviewed. Only the essential information related to this research

is reviewed in this section. Whenever needed to better understand the topic, proper references

are provided.

2.1.1 Preliminaries

Digital video compression standards first appeared in the early 1980s and have made enor-

mous progress since then. The first generation of video coding standards was published by

the CCITT (now the ITU-T) in 1984 and more than 10 new standards have been published

afterward. Expectedly, video compression techniques have deep roots in image compression

since the compression of still frames is an important part of all video coding standards.

However, in this section we keep the focus on the basic concepts that are more related to

video coding and compression.


Video Scene Capture and Representation

Before compressing the video, it is necessary to capture the video scene properly. In the cur-

rent methodology of video capturing, the video scene is captured as a sequence of temporal

samples (aka pictures or frames), where each temporal sample is often composed of a rect-

angular grid of spatial samples (aka points or pixels). Furthermore, a proper representation

of color information is essential for video capture and compression, since the human visual

system is more sensitive to luminance than to chrominance.

Among different color spaces that can be used to represent the color information of a

spatio-temporal sample (a pixel of a frame), YCbCr is often used in video compression.

In YCbCr, Y represents the luminance and Cb and Cr represent the chrominance (color)

components of the pixel. Along with the color space, the sampling format specifies how

many bits are required to represent a single pixel. To decrease the number of bits required

to represent the color information of a pixel, it is very customary to ignore most of the

chrominance components of adjacent pixels. The most common sampling format used in

video compression is 4:2:0. In 4:2:0, the luminance component of the color sample is captured
and stored for all the pixels. However, the chrominance samples are captured only for the
top-left pixel of each 2 × 2 rectangle of pixels. This halves the number of bits needed to

represent luminance and chrominance of each pixel before compressing the video scene and

without significant reduction in visual quality of the captured video scene. For example, if

the color depth is 8 bits in a target color space, 4:2:0 sampling requires 12 bits to represent

a pixel1. In contrast, in 4:4:4 sampling format the chrominance components have the same

resolution as the luminance component.
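The bits-per-pixel arithmetic can be checked with a small Python helper; the 4:2:2 entry is included only for comparison and is not discussed in the text, and the helper itself is an illustration rather than part of any system described later.

```python
def bits_per_frame(width, height, bit_depth=8, sampling="4:2:0"):
    """Storage for one raw YCbCr frame: full-resolution luma plus chroma whose
    resolution is determined by the sampling format."""
    luma = width * height * bit_depth
    chroma_fraction = {"4:4:4": 2.0, "4:2:2": 1.0, "4:2:0": 0.5}[sampling]  # Cb and Cr combined
    return int(luma * (1 + chroma_fraction))

# 4:2:0 averages 12 bits per pixel at 8-bit depth, versus 24 bits for 4:4:4
print(bits_per_frame(1920, 1080, sampling="4:2:0") / (1920 * 1080))  # -> 12.0
print(bits_per_frame(1920, 1080, sampling="4:4:4") / (1920 * 1080))  # -> 24.0
```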

Along with the color space and sampling format, the number of pixels captured for each

picture of a video scene is an important factor that affects the perceptual quality of the

captured scene and the performance of video compression techniques. The number of pixels

1 In contrast to still images, in usual video applications there is no notion of transparency or blending. Therefore, all the colors are assumed to be opaque and there is no need to store the transparency information for the pixels.


in each picture is determined by the height and width of the video frames, mostly referred

to as video frame format. Many different video frame formats are used in video coding

standards and technologies, starting from 144p (256× 144 pixels per frame) and increasing

to 8K (7680× 4320). Nowadays, 720p (HD, 1280× 720) and 1080p (Full-HD, 1920× 1080)

are the most common video frame formats on the web, while more content providers are

presenting 4K (UHD, 3840× 2160) content every day.

Finally, when the video is compressed by a chosen video encoder, which specifies the

video coding format, it should be encapsulated with audio, subtitles, etc., inside a multi-

media container format such as AVI, MP4, FLV or Matroska. As such, the user normally

doesn’t have an encoded or transcoded video file, but instead has a container file normally

containing H.264-encoded video alongside AAC-encoded audio. Multimedia container for-

mats can contain any one of a number of different video coding formats; for example the

MP4 container format can contain video in either the MPEG-2 Part 2 or the H.264 video

coding format, among others.

Video Quality Assessment

In order to evaluate the performance of video coding standards and video communication

systems, it is necessary to assess the quality of the encoded or transmitted video. Video

quality assessment can be either subjective or objective.

Subjective video quality assessment aims to measure the quality of the video as perceived

by the end user. However, this is not straightforward since a viewer’s opinion on quality of

a played video is influenced by many subjective factors such as the viewing environment,

the observer’s state of mind and the extent to which the observer interacts with the visual

scene [115]. Therefore, it’s very common to measure the video quality using mathematical

algorithms. Developers of video compression and video processing systems rely heavily on

so-called objective (algorithmic) quality measures. The most widely used measure is Peak

Signal to Noise Ratio (PSNR). PSNR is measured on a logarithmic scale and depends on the


mean squared error (MSE) between an original and an impaired video signal. In case of video

quality assessment, PSNR is mostly applied only on the luminance, hence called Y-PSNR.

Given the luminance component of a noise-free $m \times n$ video frame $I_{m \times n}$ (usually called the raw
video frame), and a noisy approximation $K_{m \times n}$ with defects due to lossy compression or
transmission, Y-PSNR for $K$ can be calculated as follows:

\[
\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \bigl[ I(i,j) - K(i,j) \bigr]^2
\]
\[
\mathrm{Y\text{-}PSNR}_{\mathrm{dB}} = 10 \cdot \log_{10} \left( \frac{\max_I^{2}}{\mathrm{MSE}} \right) \tag{2.1}
\]
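A minimal sketch of Eq. 2.1 in Python, assuming the luminance planes of the raw and impaired frames are available as equally sized NumPy arrays (with 255 as the maximum 8-bit sample value); the sequence-level score is then the average over all frames, as noted below.

```python
import numpy as np

def y_psnr(reference, impaired, max_i=255.0):
    """Y-PSNR (Eq. 2.1) between the luminance planes of a raw and an impaired frame."""
    ref = np.asarray(reference, dtype=np.float64)
    imp = np.asarray(impaired, dtype=np.float64)
    mse = np.mean((ref - imp) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_i ** 2 / mse)

# toy example: an 8-bit luminance plane distorted by small random noise
ref = np.random.randint(0, 256, size=(1080, 1920))
noisy = np.clip(ref + np.random.randint(-2, 3, size=ref.shape), 0, 255)
print(round(y_psnr(ref, noisy), 2))
```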

If the color information needs to be considered in calculating PSNR, the overall MSE must

be calculated as the average (or a weighted average) of MSE for each color component of the

color space (primary additive colors in RGB or luminance and chrominances in YCbCr). It

can be shown that the calculated video quality varies in different color spaces for the same

set of original and impaired video frames. This is not an issue as long as the same measure

has been used for all the compared video compression or transmission methods.

Y-PSNR can be calculated easily and quickly and is, therefore, a very popular quality

measure. Y-PSNR for the impaired video sequence is normally calculated as the average

of Y-PSNR for all the video frames2. Nevertheless, Y-PSNR does not correlate well with

subjective video quality measures such as DSCQS 3 [37]. For a given video frame or video

sequence, high Y-PSNR usually indicates high quality and low Y-PSNR usually indicates low

quality. However, a particular value of Y-PSNR does not necessarily equate to an absolute

2 If a frame cannot be decoded or gets lost during transmission, the lost frame is replaced by another frame depending on the video decoder or the loss recovery component of the transmission model, and Y-PSNR can be seamlessly calculated. If no frame replacement measure is in place, then the previous frame, the next frame, or their average can be used to calculate the respective Y-PSNR.

3 Double Stimulus Continuous Quality Scale (DSCQS) is the subjective quality assessment method suggested for video in ITU-R Recommendation BT.500-11 [37]. In DSCQS, the assessor is shown a series of pairs of video sequences. For each pair, the assessor is asked to give each video a quality score by marking on a continuous line with five intervals ranging from Excellent to Bad. Within each pair of sequences, one is an unimpaired reference sequence and the other is the same sequence, modified by a system or process under test. At the end of the session, the scores are converted to a normalized range. The final result is normally described as a mean opinion score (MOS). MOS indicates the relative quality of the impaired and reference sequences.


subjective quality. For example, equivalent distortion inside or outside a region of interest

in a video frame equally decreases Y-PSNR, but the effect on subjective quality is different.

The limitations of PSNR and Y-PSNR metrics have led to many efforts to develop more

sophisticated measures that approximate the subjective video quality. In a recent work [92],

the correlation of common objective quality measurement algorithms with the subjective

quality of a large set of videos is investigated. The videos are played on mobile devices,

which makes the results more useful for this research. In this work nine objective quality

measures are investigated, namely signal-to-noise ratio (SNR), peak signal-to-noise ratio

(PSNR), weighted signal-to-noise ratio (WSNR) [89], visual signal-to-noise ratio (VSNR)

[30], structural similarity index (SS-SSIM) [139], multi-scale structural similarity index (MS-

SSIM) [141], visual information fidelity (VIF) [116], universal quality index (UQI) [140], and

noise quality measure (NQM) [38]. Based on the correlation between the video quality

assessment algorithm scores and the subjective video quality, it has been shown that VIF

has the highest correlation with the subjective quality assessment results [92].

In this research, we use both Y-PSNR and VIF for video quality assessment to maintain

the comparability of the reported results with previous research work. Furthermore, we

look at other performance metrics wherever appropriate. Most importantly, we discuss the

coding efficiency of the video compression systems whenever needed. Coding efficiency can

be defined as the ratio of the bitrate of the uncoded video to that of the coded video, or the ratio of

the bitrate of the coded video when encoded using different video compression systems. The

computational complexity of the encoding and decoding operations is another metric of interest.
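For a rough sense of scale (using an assumed 5 Mbps coded bitrate, not a measured value): a raw Full-HD 4:2:0 sequence at 30 fps occupies roughly 746 Mbps, so coding efficiency in the first sense above is on the order of 150:1.

```python
def raw_bitrate_mbps(width, height, fps, bits_per_pixel=12):
    """Bitrate of the uncompressed video; 4:2:0 at 8-bit depth is 12 bits per pixel."""
    return width * height * bits_per_pixel * fps / 1e6

raw = raw_bitrate_mbps(1920, 1080, 30)   # ~746 Mbps of raw Full-HD video
coded_mbps = 5.0                         # assumed coded bitrate, for illustration only
print(f"raw: {raw:.0f} Mbps, coding efficiency ~ {raw / coded_mbps:.0f}:1")
```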

2.1.2 Single-layered Video Coding: H.264/AVC

H.264/AVC or Advanced Video Coding standard was introduced in 2003 for encoding and

decoding single-layer video streams [145]. Different annexes and extensions have been added

to the standard in subsequent years. Most importantly, the scalable extension of AVC, called


SVC, was introduced in 2007 and the multi-view extension, called MVC, was introduced in

2011. Later in 2012, a new video coding standard, named High Efficiency Video Coding

(H.265/HEVC) was introduced [125]. HEVC is designed to answer the challenge of efficient

encoding of ultra high resolution videos, i.e., 4K and 8K videos. It is reported that compared

to H.264/AVC, HEVC reduces the bitrate of the encoded video sequences by an average of

40% to 50% (i.e., roughly halving the bandwidth needed to stream the video at the same

objective quality) [98]. However, the saving comes with a steep price in terms of coding

complexity.

The main bottleneck of the new HEVC video coding standard (along with its competitors

such as Google VP9) is the significantly higher computational complexity, which is the

inevitable cost of the higher compression rate and more complex coding loop. This bottleneck

has significantly slowed down the widespread use of these video codecs since they need more

expensive encoding equipment in the media server and more expensive end user devices for

the playback. Currently, only high-end smartphones can decode HEVC videos, and they

consume significantly more energy to do that. In fact, it has been reported that decoding an

H.265/HEVC encoded video needs up to three times more CPU time compared to the same

video when encoded with H.264/AVC. The difference in energy usage gets more significant

when considering that most of the smartphones use hardware decoders for H.264/AVC but

even the high end smartphones mostly use software decoders to decode H.265/HEVC videos

[88].

Accordingly, in the remainder of this chapter, we focus on the coding and compression

in H.264/AVC and SVC. As mentioned earlier, throughout this thesis the research work and

results are presented for H.264/AVC and its layered extension, H.264/SVC. H.265/HEVC

is briefly introduced in Appendix A. In Chapter 7, we illustrate how the research results

reported in this thesis can be applied to H.265/HEVC and its multi-layered extension after

proper calibration.


Coding and Compression in H.264/AVC

Towards optimizing the rate distortion of the encoded video sequences (i.e., improving the

quality of the encoded video subject to video bitrate and the capacity of the communication

channel), H.264/AVC, like other video coding standards, exploits the redundancy of visual

information in time and scale domains. For example, in the case where a camera pans

slowly through the scene or the scenery is stationary, the video sequence may be highly

compressed without noticeable loss of quality due to high visual similarity in consecutive

frames. H.264/AVC is agnostic to the underlying characteristics of the video sequence, e.g.,

the progressive or interlaced nature of the captured video scene. Therefore, in an interlaced

video sequence, fields are merged into frames before arriving at the encoding unit.

A. Encoding a video frame

A video frame can be encoded using the redundant visual information of the frame

itself (Intra-frame or Intra-picture prediction), or using the redundancy of visual information

between consecutive frames (Inter-frame or Inter-picture prediction)4. The basic encoding

algorithm is a hybrid of inter-frame prediction to exploit temporal statistical dependencies

and transform coding of the prediction residual to exploit spatial statistical dependencies.

As illustrated in the block diagram of the AVC encoder/decoder (Fig. 2.1), for each specific

frame only one of these general prediction mechanisms can be used.

When inter-frame prediction is employed, a reference picture refers to the picture or frame

whose visual information is used to partially reconstruct past or future pictures. Similarly,

a dependent picture refers to the picture that stores differential information from reference

picture(s). A picture may be a reference picture for some pictures and also be a dependent

picture depending on some other pictures.

In the temporal domain, video frames are divided into three non-overlapping frame types,

i.e., I-frames, P-frames and B-frames. I-frames are intra-coded pictures and do not use any

4 In video coding terminology, the terms picture and frame are used interchangeably.


Figure 2.1: Block diagram of AVC encoder / decoder [145].

information from other encoded frames. Each video sequence in H.264/AVC starts with an

I-frame. This is frame 0 in Fig. 2.2. P-frames only use the visual information from past

I- or P-frames. Therefore, no information about future frames is needed to encode

or decode them. Furthermore, they specify the boundaries of the hierarchical temporal

structure used in H.264/AVC for temporal and inter-frame coding, called a group of pictures

or GOP. Finally, there are B-frames, which use visual information from past and future

frames. Fig. 2.2 describes the structure of a GOP. In H.264/AVC, a group of pictures is

a sequence of consecutive frames that starts with a key-picture (an I-frame or a P-frame)

and contains 2^x − 1 bi-directionally predicted pictures (B-frames), where the GOP size is n = 2^x. Thus, IB^(n−1)(PB^(n−1))* represents the sequence of frames in a video sequence. In the example depicted in Fig. 2.2,

the GOP size is 8.
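As a small illustration of the IB^(n−1)(PB^(n−1))* pattern described above, the sketch below (illustrative only, not part of the standard or the thesis) lists the frame types in display order for a given GOP size, with each GOP starting with a key picture followed by GOP size − 1 B-frames.

```python
def gop_frame_types(gop_size: int, num_gops: int) -> list[str]:
    """Frame types following the IB^(n-1)(PB^(n-1))* pattern:
    each GOP starts with a key picture followed by gop_size - 1 B-frames."""
    types = []
    for g in range(num_gops):
        key = "I" if g == 0 else "P"   # only the first GOP starts with an I-frame
        types += [key] + ["B"] * (gop_size - 1)
    return types

print("".join(gop_frame_types(8, 2)))  # IBBBBBBBPBBBBBBB
```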

B. Macroblocks, Slices and Slice Groups

As in all prior ITU-T video coding standards, H.264/AVC design follows the block-

based video coding approach (as depicted in Fig. 2.1), in which each coded picture is repre-


Figure 2.2: The temporal hierarchy of frames and the concept of group of pictures in H.264/AVC. The number on each frame specifies the encoding order.

sented in block-shaped units of associated luma and chroma samples called macroblocks. In

H.264/AVC, each picture is partitioned into fixed-size macroblocks that each cover a rect-

angular picture area of 16× 16 samples of the luma component and 8× 8 samples of each of

the two chroma components. Considering the greater sensitivity of the human visual system

towards the brightness of the image, H.264/AVC main profile uses YCbCr colour space and

4:2:0 luminance and chrominance sampling with 8 bits of precision per sample. Therefore,

the number of chrominance samples is one fourth of that of luminance samples, as discussed

in Chapter 1, and representing each uncoded sample in a raw video frame requires 12 bits.

Macroblocks are the basic building blocks of the standard for which the decoding process

is specified. The coding algorithm, however, may use submacroblocks of 16 × 8 or 8 × 8

samples.
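Using only the figures quoted above (16 × 16 macroblocks, 4:2:0 sampling at 8 bits per sample, i.e., 12 bits per pixel), the following small sketch computes the raw size and macroblock count of a full HD frame; the helper names are hypothetical.

```python
def raw_frame_bytes(width: int, height: int) -> int:
    """Bytes of an uncoded 4:2:0 frame at 8 bits per sample (12 bits per pixel)."""
    luma = width * height                        # one 8-bit luma sample per pixel
    chroma = 2 * (width // 2) * (height // 2)    # two chroma planes at quarter resolution
    return luma + chroma

def macroblock_count(width: int, height: int) -> int:
    """Number of 16x16 macroblocks covering the frame (padded up to multiples of 16)."""
    return ((width + 15) // 16) * ((height + 15) // 16)

print(raw_frame_bytes(1920, 1080))   # 3,110,400 bytes per full HD frame
print(macroblock_count(1920, 1080))  # 8,160 macroblocks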

In H.264/AVC, a picture is a collection of one or more slices. Slices are a sequence

of macroblocks that are processed in the order of a raster scan when not using flexible

macroblock ordering (FMO), which is described later in this paragraph. Slices are self-

contained in the sense that each slice can be correctly decoded standalone, with no data

from other slices needed5. Flexible macroblock ordering (FMO) modifies the way pictures

are partitioned into slices by utilizing the concept of slice groups. Each slice group is a

set of macroblocks. Each macroblock has exactly one slice group identification number

5Some information from other slices might be needed to apply the de-blocking filter across slice boundaries.


that specifies the slice group to which the macroblock belongs. Each slice group can be

partitioned into one or more slices, such that the macroblocks within the same slice are

processed in the order of a raster scan. Using FMO, a picture can be split into many

macroblock scanning patterns such as interleaved slices, a dispersed macroblock allocation,

one or more foreground slice groups and a leftover slice group, or a checker-board type of

mapping. For example, in Fig. 2.3 the left-hand side mapping can be used in region-of-

interest type of coding applications and the right-hand side can be used for concealment in

video conferencing applications where slice groups 0 and 1 are transmitted in separate packets

and one of them is lost.

Figure 2.3: Dividing a frame into slice groups using flexible macroblock ordering. (a) Three slice groups; (b) two slice groups.

Slices can also be categorized based on how the contained macroblocks are coded. The

most important categories, as in frames, are I-, P- and B-slices. In an I-slice, all macroblocks

are coded using intra-frame prediction. In a P-slice, in addition to the coding type of the

I-slice, some macroblocks can be coded using inter-frame prediction from the same slice in

past frames. Finally, in a B-slice, in addition to the coding types available in a P-slice,

some macroblocks can be coded using inter-frame prediction from the same slice in past and

future frames. More information on intra- and inter-frame prediction is provided later in this

section. For more information about slice types, including SI and SP switching slice types,


please refer to [145].

C. Encoding and Decoding of Macroblocks

In H.264/AVC, all luma and chroma samples of a macroblock are either spatially or tem-

porally predicted, and the resulting prediction residual is encoded using transform coding.

For transform coding, each color component of the prediction residual signal is subdivided

into smaller 4×4 blocks. Each block is transformed using an integer transform, and the trans-

form coefficients are quantized and encoded using entropy coding methods. As illustrated

in Fig. 2.1, the input video signal is split into macroblocks, the association of macroblocks

to slice groups and slices is selected, and then each macroblock of each slice is processed as

shown. An efficient parallel processing of macroblocks is possible when there are multiple

slices in the picture.

D. Intra-Frame Prediction

The coding of the macroblocks depends on the slice type. In all slice types, intra-coding

is supported. Intra-prediction in H.264/AVC is always conducted in the spatial domain. For

luma prediction, two intra-prediction modes are available: Intra 4× 4 and Intra 16× 16.

The Intra 4×4 mode is based on predicting each 4×4 luma block separately, and is well

suited for coding of parts of a picture with significant detail. The Intra 16 × 16 mode, on

the other hand, performs prediction on the whole 16 × 16 luma block, and is more suited for

coding very smooth areas of a picture. In addition to these two types of luma prediction,

a separate chroma prediction is conducted on 8 × 8 chroma samples. Furthermore, if a

specific portion of the picture requires lossless compression, H.264/AVC provides an I PCM

mode that allows the encoder to simply bypass the prediction and transform coding. Intra-

prediction in H.264/AVC always uses the neighbouring samples of previously-coded blocks

which are to the left and/or above the predicted block. As illustrated in Fig. 2.4, for each

4× 4 block one of nine prediction modes can be utilized. In addition to DC prediction mode


(mode 2, where the average of adjacent samples is used to predict the entire 4 × 4 block),

eight directional prediction modes can be used.
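As a simplified illustration of DC prediction (mode 2), the following sketch fills a 4 × 4 block with the average of the neighbouring samples above and to the left and forms the residual that would go to transform coding; it ignores the standard's rules for unavailable neighbours, and all names and sample values are made up.

```python
import numpy as np

def intra4x4_dc_predict(above: np.ndarray, left: np.ndarray) -> np.ndarray:
    """DC prediction (mode 2): fill the 4x4 block with the rounded mean of the
    four samples above and the four samples to the left of the block."""
    dc = int(round((above[:4].sum() + left[:4].sum()) / 8.0))
    return np.full((4, 4), dc, dtype=np.int32)

block = np.array([[52, 55, 61, 66],
                  [63, 59, 55, 90],
                  [62, 59, 68, 113],
                  [63, 58, 71, 122]])
pred = intra4x4_dc_predict(np.array([50, 52, 60, 64]), np.array([55, 60, 62, 65]))
residual = block - pred   # prediction residual passed on to transform coding
print(pred[0, 0])
print(residual)
```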

Figure 2.4: H.264/AVC prediction directions for Intra 4 × 4 prediction [145]. (a) Mode 2 - DC; (b) directional modes.

When the Intra 16 × 16 prediction mode is selected, the whole luma component of a

macroblock is predicted. Multiple prediction modes are supported, namely, prediction mode

0 (vertical prediction), mode 1 (horizontal prediction), and mode 2 (DC prediction). The

specification of these prediction modes is similar to that of the Intra 4 × 4 prediction modes.

The chroma samples of a macroblock are predicted using a similar prediction technique as

for the luma component in Intra 16× 16 macroblocks, since chroma is usually smooth over

large areas. In H.264/AVC, no prediction is used across slice boundaries to keep all slices

independent of each other.

E. Inter-Frame Prediction

Inter-frame prediction can be used only for macroblocks that belong to a P- or B-slice.

H.264/AVC specifies different motion-compensated prediction types for P-macroblocks. The

syntax supports motion-compensated prediction for partitions with luma block size of 16×16,

16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4 samples and corresponding chroma samples.


The reference partition for each motion-compensation predicted P-partition is specified by

a translational motion vector and a picture reference index. The motion vector compo-

nents are differentially coded using either median or directional prediction from neighbouring

blocks. No motion vector component prediction takes place across slice boundaries. The syn-

tax supports multipicture motion-compensated prediction, i.e., more than one prior coded

picture can be used as reference for motion-compensated prediction. Multi-frame motion-

compensated prediction requires both encoder and decoder to store the reference pictures

used for inter-prediction in a buffer.

B-macroblocks may use a weighted average of two distinct motion-compensated predic-

tion values for building the predicted signal. H.264/AVC specifies four different inter-frame

prediction modes: list 0, list 1, bi-predictive, and direct prediction. In list 0 and list 1 predic-

tion modes, the prediction signal uses macroblock(s) belonging to past and future frame(s),

respectively. For the bi-predictive mode, the prediction signal is formed by a weighted aver-

age of motion-compensated list 0 and list 1 prediction signals. The direct prediction mode

is inferred from previously transmitted syntax elements and can be either list 0 or list 1

prediction or bi-predictive.

F. Transform, Scaling, and Quantization

Similar to previous video coding standards, H.264/AVC utilizes transform coding of the

prediction residual. However, in H.264/AVC, the transformation is applied to smaller 4× 4

blocks, and instead of a 4× 4 discrete cosine transform (DCT), an integer transform is used.

The basic transform coding process is very similar to that of previous standards. At the

encoder, the process includes a forward transform, zig-zag scanning, scaling, and rounding

as the quantization process followed by entropy coding. At the decoder, the inverse of the

encoding process is performed except for the rounding. More details on the specific aspects

of the transform in H.264/AVC can be found in [145]. A quantization parameter is used

for determining the quantization of transform coefficients in H.264/AVC. The quantization


parameter can take 52 values. These values are arranged such that an increase of 1 in

quantization parameter means an increase of quantization step size by approximately 12%

(an increase of 6 means an increase of quantization step size by exactly a factor of 2). It can

be noticed that a change of step size by approximately 12% also means roughly a reduction

of bit rate by approximately 12% [145].
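The relation described above can be illustrated with a short sketch; the step size is normalized to 1 at QP 0 (the standard defines the actual step-size table), so only the relative growth of roughly 12% per increment and the doubling every 6 increments are shown.

```python
def relative_qstep(qp: int) -> float:
    """Quantization step size relative to QP = 0: doubles every 6 QP increments,
    i.e. grows by a factor of 2**(1/6) (about 12%) per increment."""
    return 2.0 ** (qp / 6.0)

for qp in (0, 1, 6, 12, 51):
    print(qp, round(relative_qstep(qp), 3))
# QP 1 -> 1.122 (about 12% larger), QP 6 -> exactly 2.0
```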

G. Entropy Coding

In H.264/AVC, two methods of entropy coding are supported. The simpler entropy

coding method uses a codeword table for all syntax elements except the quantized transform

coefficients. Thus, instead of designing a different variable length code (VLC) table for each

syntax element, only the mapping to the single codeword table is customized according to

the data statistics. The chosen single codeword table is an exponential Golomb code. For

transmitting the quantized transform coefficients, a more efficient method called Context-

Adaptive Variable Length Coding (CAVLC) is employed. In this scheme, VLC tables for

various syntax elements are switched depending on already transmitted syntax elements.

As expected, the performance of entropy coding is better than schemes using a single VLC

table. In the CAVLC entropy coding, the number of non-zero quantized coefficients (N) and

the actual size and position of the coefficients are coded separately. After zig-zag scanning

of transform coefficients, their statistical distribution typically shows large values for the low

frequency part, decreasing to small values later in the scan for the high-frequency part.
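For reference, a minimal sketch of the order-0 exponential Golomb code used for the single codeword table: the codeword for an unsigned value v is the binary representation of v + 1, prefixed with one fewer zeros than its length. This is a generic illustration, not code from the thesis or the reference software.

```python
def exp_golomb_ue(v: int) -> str:
    """Order-0 exponential Golomb codeword for an unsigned integer v."""
    code = bin(v + 1)[2:]                 # binary representation of v + 1
    return "0" * (len(code) - 1) + code   # prefix of len-1 zeros, then the value

for v in range(5):
    print(v, exp_golomb_ue(v))
# 0 -> 1, 1 -> 010, 2 -> 011, 3 -> 00100, 4 -> 00101
```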

H.264/AVC Profiles and Levels

Profiles and levels specify conformance points. They facilitate interoperability between var-

ious applications that have similar functional requirements. A profile defines a set of coding

tools or algorithms that can be used to generate a conforming bitstream, whereas a level

places constraints on certain key parameters of the bitstream. All decoders conforming to a

specific profile must support all features in that profile. Encoders are not required to make

use of any particular set of features supported in a profile but have to provide conforming


bitstreams, i.e., bitstreams that can be decoded by conforming decoders.

Originally, three profiles were specified in the H.264/AVC standard, namely the Base-

line, Main, and Extended Profile. However, currently more than 20 different profiles are

defined. Among the specified profiles, the following are mostly used in today’s video encod-

ing applications. Please note that multiple simpler or more advanced profiles are defined in

H.264/AVC:

• Constrained Baseline Profile (CBP): This profile is typically used in video-conferencing

and mobile applications. It corresponds to the subset of features that are common

between the Baseline, Main, and High Profiles.

• Extended Profile (XP): Intended as the streaming video profile, this profile has rela-

tively high compression capability and measures for robustness to data losses and server

stream switching.

• High Profile (HiP): The primary profile for broadcast and disc storage applications,

particularly for high-definition television applications. For example, this is the profile

adopted by the Blu-ray Disc storage format and the DVB HDTV broadcast service.

Furthermore, the current version of the standard lists 19 different levels, starting from level 1 and ending at level 5.2. Level 3 supports HD video coding. Full HD video coding is

supported in level 4, and 4K video coding is supported in level 5. The current version of the

H.264/AVC standard does not specify any level supporting 8K or higher video resolutions.

Currently, only the last level of the new HEVC standard, i.e., level 6, supports encoding 8K

videos.

Video Coding Format vs. Video Container

Before moving forward to discuss the state-of-the-art multi-layer video coding and compres-

sion standard, it is worth mentioning that video compression only addresses one aspect

of the multimedia content, i.e., the visual information. However, the audio stream must

also be transmitted to the end user device and played back synchronously with the video


stream. Similar to the video stream, the audio stream must also be compressed using an

audio coding and compression standard. The video stream and the audio stream must be

bundled inside a multimedia container format such as AVI, MP4, FLV or Matroska. As

such, the user normally does not have an H.264 file, but instead has a .mp4 video file, which is

an MP4 container containing H.264-encoded video, normally alongside AAC-encoded audio.

Multimedia container formats can contain any one of a number of different video coding

formats; for example the MP4 container format can contain video in either the MPEG-2

Part 2 or the H.264 video coding format, among others.

2.1.3 Layered Video Coding: H.264/SVC

In this section, we provide an overview of SVC, an annex of the H.264/AVC standard, which

offers a layered coding approach and provides a framework for scalable video coding. A SVC

compliant video stream is scalable in the sense that a valid video stream can be reconstructed

at a lower quality level, even in the absence of certain parts of the bitstream. This special

property of SVC allows multimedia streaming systems to support diverse devices using just

one video stream. In a nutshell, the streaming server encodes the video only once in SVC

format, and the devices (e.g., smartphones, tablets, laptops, desktops, TVs, etc.) on the

user end may decode the video to the best quality supported by their hardware/software

and network connectivities.

SVC supports three modes of scalability, i.e., temporal, spatial and quality scalability.

Every SVC compliant bitstream contains a H.264/AVC compliant base layer, which contains

the lowest temporal, spatial, and quality representation of the video, and several enhance-

ment layers, which provide the scalability in different modes. A block diagram of a SVC

encoder for a scalable video stream with two spatial layers is presented in Fig. 2.5. The base

layer is the essential layer needed to playback a video at the lowest possible quality. The

quality improves as more enhancement layers become available. The number of enhancement


Figure 2.5: Block diagram of a SVC encoder for two spatial layers [112].

layers available depends on the hardware/software specifications and the network connec-

tivity of the end-user devices. Such a layered design also allows the end-user device to

dynamically adjust the playback quality according to the availability of computational and

communication resources.

Since SVC is a member of the H.264 standard family, each video sequence in SVC starts with an Instantaneous Decoding Refresh (IDR) access unit. An IDR access unit is the union of one I-frame

(e.g., frame 0 of S0 in Fig. 2.6) with some critical data such as the set of coding parameters,

followed by a hierarchical temporal prediction structure. This hierarchical structure is de-

fined by the size of group of pictures (GOP size) and the distance between two intra-coded

pictures (Intra Period). GOP size specifies the distance between two key pictures, i.e., I- or

P-frames. In the example from Fig. 2.6, the GOP size is 8.

Spatial Scalability

Spatial scalability in SVC is provided by a layered approach. As illustrated in Fig. 2.6,

the base spatial layer S0 encodes lower resolution frames from only the first three temporal

layers (T0, T1, and T2). The enhancement layer S1 has enhancement frames for the same


Figure 2.6: Layered design of SVC. The numbers on each frame specify the coding order inside the spatial layer.

temporal layers as the preceding layer, if not more. SVC supports both dyadic and non-

dyadic spatial layering. The dyadic configuration enforces the spatial layers to conform to

a 2:1 resolution scale, i.e., lower resolution layers can be scaled up efficiently using bitwise

shift operations. Furthermore, with Extended Spatial Scalability (ESS), a class of more

complex algorithms for non-dyadic spatial scalability, SVC allows the neighbouring spatial

dependency layers to have arbitrary resolutions. However, the frame resolution of layers with

lower spatial dependency identifiers (e.g., S0) cannot be larger than that of the posterior

layers (e.g., S1) in height or width.
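A small sketch of a dyadic resolution ladder under the 2:1 constraint mentioned above, where each lower layer halves the width and height with bitwise shifts; the function name, layer count, and resolutions are illustrative.

```python
def dyadic_ladder(width: int, height: int, num_layers: int) -> list[tuple[int, int]]:
    """Resolutions from the base layer up to the full resolution, each layer
    related to the next by a 2:1 scale (implemented with bitwise shifts)."""
    layers = [(width >> i, height >> i) for i in range(num_layers)]
    return list(reversed(layers))  # S0 (lowest resolution) first

print(dyadic_ladder(1920, 1080, 3))  # [(480, 270), (960, 540), (1920, 1080)]
```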

As shown in Fig. 2.5, in SVC each spatial dependency layer requires its own prediction

module to perform motion-compensated prediction and intra-prediction within the layer.

Furthermore, for each dependency layer D (e.g., layer S1), a reference layer DR < D (e.g.,

layer S0) can be used for inter-layer prediction, where motion vectors, intra texture and

residual signals of the reference layer can be used to predict the same data for the predicted

layer.


Temporal Scalability

To support temporal scalability, SVC relies on a hierarchical temporal prediction mechanism

that is extended from H.264/AVC. While previous scalable standards such as H.263 and

MPEG-4 Visual basically provide dyadic temporal scalability by segmenting video layers

according to different frame types (i.e. I, P and B frame types), in SVC the basis of temporal

scalability is founded on a hierarchical temporal prediction structure. In the example shown

in Fig. 2.6, there are four temporal layers (T0, T1, T2, and T3), where T0 is the temporal

base layer. Within a spatial layer, frames in layer T0 are predicted only from frames in layer

T0. Frames in layer T1 are predicted from layers T0 and T1, whereas frames in layer T3,

contained in spatial layer S1, are predicted by adjacent frames from any preceding layers.

The hierarchical temporal prediction structure can be characterized by GOP size and

Intra Period parameters. Assume that the initial frame rate of a video sequence is 24 fps. With a GOP size of 8, layer T0 corresponds to 3 fps, layer T1 to 6 fps, layer T2 to 12 fps, and layer T3 to 24 fps. The Intra Period specifies how often a P-frame at the end of a GOP can be replaced by an I-frame. It must be a multiple of the

GOP size. For example, for an Intra Period of 16 and a GOP size of 8, the P-frame at the end of

every other GOP is replaced by an I-frame.
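The frame rates per temporal layer in the example above follow directly from the dyadic hierarchy; a short sketch, assuming a dyadic GOP and using hypothetical function names:

```python
def temporal_layer_fps(full_fps: float, gop_size: int) -> list[float]:
    """Frame rate available when decoding up to each temporal layer T0..Tk,
    assuming a dyadic hierarchy where each extra layer doubles the frame rate."""
    num_layers = gop_size.bit_length()        # GOP size 8 -> 4 temporal layers
    return [full_fps / 2 ** (num_layers - 1 - i) for i in range(num_layers)]

print(temporal_layer_fps(24, 8))  # [3.0, 6.0, 12.0, 24.0]
```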

Quality Scalability

H.264/SVC allows the encoder to use complementary data in different layers to generate

video streams that provide distinct quality levels for the reconstructed video. Quality layering

applies after spatial and temporal scalability, hence, dependent quality layers have the same

frame size and frame rate. As illustrated in Fig. 2.5, each layer has a SNR refinement module

that provides the necessary mechanisms for quality scalability. Three quality scalability

modes are supported, namely CGS, MGS and FGS.

CGS (Coarse Grain Scalability) can be considered as a special case of spatial scaling

where upsampling and inter-layer deblocking of the intra-coded macroblocks of the reference


layer are not required, since the predicted macroblocks and the reference ones are the same

size. In CGS, the enhancement layer typically contains the residual texture signal that is

quantized with a smaller quantization step compared to that of preceding layers [112], hence,

providing incremental visual information.

In MGS (Medium Grain Scalability), both the base and enhancement layers can be used

to predict a layer at the same time, hence improving the coding efficiency when a variety of

bitrates are required in a scalable bitstream. MGS uses periodic key pictures to immediately

resynchronize the prediction module. Furthermore, MGS allows switching between different

MGS layers at any access unit, hence increasing the flexibility of bitstream adaptation [112].

FGS (Fine Grain Scalability) provides a continuous adaptation of bitstream bitrate by

using an advanced bit-plane technique. In this technique, different quality layers contain dis-

tinct subsets of bits for each macroblock. By supporting progressive refinement of transform

coefficients, FGS allows the decoder to truncate the bitstream at arbitrary points [112].

Besides these three scalability modes, quantization parameter (QP) is another factor

that affects the quality of the encoded layers. The value of QP ranges from 0 to 51. At the

beginning of the encoding process, where DCT transformation of macroblocks is performed,

the result is quantized by dividing the matrices of luma and colour components into two

specific matrices of integers. QP directly affects this process since the denominator matrices

are multiplied by QP before the division. Therefore, a higher value of QP eliminates more coefficients, hence increasing coarseness and decreasing the bitrate and quality of the encoded

video. To optimize the rate-distortion ratio, SVC adjusts the QP of each frame according to

its location in the respective group of pictures.

As shown in Fig. 2.5, all the enhancement layers and the H.264/AVC-compliant base layer

are merged by a multiplexer, and different temporal, spatial and quality layers are integrated

into a single scalable bitstream. We use a triplet (D,T,Q) to specify the number of spatial,

temporal and quality layers of a SVC-compliant bitstream. Interested readers may refer to


[112] for a more detailed discussion.

2.2 Related Works

In this section, a summary of the related research work is presented. Since this

thesis covers cloud-assisted video streaming on mobile networks, the related work has been

categorized such that it conforms to the structure of the next chapters.

2.2.1 Analyzing the Performance of Scalable Video Coding

Scalable video coding (SVC) allows the media server to prepare and maintain a single copy

of a layered video and stream different numbers of layers to diverse devices with different

network connection qualities. Due to the diverse hardware/software capabilities and network

connectivities of end-user devices, the scalable approach to video streaming has received

much research attention. The SVC research work related to this study can be briefly

summarized into three directions: comparing SVC with other standards, addressing the

impact of different layering configurations in SVC from an objective quality perspective, and

the subjective video quality offered by SVC. This section provides a brief review for each

direction and relates this work to the research results presented in Chapter 3.

Comparing SVC with other standards: Researchers have compared the performance of

SVC with a diverse set of other video coding standards, mainly focusing on the rate-distortion

performance, i.e., the objective quality of the video (normally measured by Y-PSNR) as a

function of average bitrate. Wien et al. [146] analyzed the effect of quality scalability and

spatial scalability (both dyadic and non-dyadic modes) on the objective quality of eight video

sequences in two quality classes, common intermediate format (CIF) (352 × 288 pixels) and

quarter CIF (QCIF) (176 × 144 pixels). In this study, the rate-distortion performance of

SVC is compared with that of H.264/AVC, MPEG-4 Visual [5] and Simulcast. Simulcast is

a method for broadcasting video content in which a video is encoded in different settings and


the different versions of the video are sent together. Results show that SVC imposes overhead

compared to single layer H.264/AVC, and outperforms Simulcast [22, 146]. Furthermore, it

has been reported that SVC slightly outperforms Google’s VP8 [113] and MPEG-4 Part

2 [131] in terms of rate-distortion but exhibits higher values of rate variability [113, 131].

The rate variability is defined as the coefficient of variation of the frame sizes in bytes. From another

perspective, it has been reported that in terms of time required to extract a substream

from a scalable bitstream, SVC bitstream extractor slightly outperforms MPEG-21 DIA

[43]. MPEG-21 DIA is a standard that provides videos in the form of substreams to support

interoperable access to them [132]. Finally, in [35] Choi et al. have compared the performance

of H.264/SVC with H.265/SHVC for two data sets of 240p and 480p video sequences. It

has been reported that at the same video quality (PSNR), SHVC outperforms SVC by 10%

average reduction in video bitrate [35].

Impact of different layering configurations in SVC: The effect of temporal, spatial

and quality layering on the performance of SVC has been investigated in [52,82,106,120,130].

In [130], it has been reported that increasing the number of temporal layers can increase the

objective quality of the encoded video in a constant bitrate, while using spatial or quality

layers has a negative impact due to the bitrate overhead. In [82], it has been reported that

SVC inter-layer prediction provides higher subjective quality when SVC is used to encode

fast and complex video sequences compared to that of slow and simple scenarios. However,

the visual properties of the video sequences are identified according to the visual experience

of the authors, and no methodical analysis of the video properties is performed. Comparison

studies on the performance of different quality scalability modes of SVC (namely CGS and

MGS) over five long CIF videos revealed that MGS provides higher objective quality at the

cost of higher rate variability [52, 106]. Finally, Slanina et al. [120] studied the impact of

the number of temporal and quality layers on the rate distortion performance of SVC, using

two full HD video sequences encoded with constant frame rate.


Subjective video quality offered by SVC: As discussed earlier, in contrast to objective

video quality, subjective video quality is the visual quality of the video as perceived by the

human viewers. Several research works have investigated the subjective video quality of

SVC [44,79,92,97,105]. For instance, Politis et al. measured the effect of user mobility and

handover on the objective and subjective video quality of SVC and H.264/AVC codecs using

two CIF video sequences, and reported that SVC outperforms H.264/AVC in both objective

and subjective video qualities [105]. Contrarily, a comparison based on four full HD video

sequences concludes that in three out of four video sequences AVC slightly outperforms SVC

[97]. However, according to [97] with a bitrate overhead of 10% for SVC, the visual quality

should be indistinguishable from that of single layer AVC.

Compared to the previous research work, in Chapter 3 a systematic study is conducted on

using H.264/AVC and SVC for full HD video streaming. In this study the video properties

are methodically extracted and the performance of different parameters of SVC encoding is

analyzed according to the visual and statistical properties of the test video sequences.

2.2.2 Distributed Video Transcoding in the Cloud

Due to the increasing demand for video streaming and the massive computing power offered

by the cloud, video transcoding in the cloud has recently received great attention. The simplest and most straightforward use of the cloud is utilizing virtual instances to perform

conventional video transcoding upon request [136, 167]. The cloud is also utilized to assist

mobile devices for customized transcoding services [32], for cloud-assisted video transcoding

[33] [149], and for energy conservation on mobile devices [165].

To better utilize the computing resources in the cloud, there have been proposals on

consolidating under-utilized VMs for cost effectiveness [20], predicting transaction load for

precise resource provisioning [69], and scheduling video transcoding tasks [21, 57, 151]. Fur-

thermore, to better utilize storage space in the cloud, there have been proposals on caching


transcoded versions of requested videos [83]. The trade-off between computation and storage

is investigated in [68], and a cost-efficient virtual machine provisioning strategy is proposed

for cloud-assisted video transcoding. The computational cost, storage cost, and video pop-

ularity of individual transcoded videos are used in [68] to decide how long a video should be

stored or how frequently it should be re-transcoded from the source version.

Towards efficient video transcoding in the cloud, a new approach is suggested in [70] to

reduce the bitrate of the transcoded video by encoding the video using a higher quantization

parameter without reducing frame size or frame rate. Furthermore, there are proposals on

transcoding only part of a video to reduce the transcoding time [49, 77]. To implement

distributed video transcoding in the cloud, MapReduce (along with other components of

Hadoop, if needed) is used to distribute video content to virtual machines [55, 58, 76]. For

example, CloudStream [58] segments SVC video into chunks of unit GOP size and uses

MapReduce [39] to parallelize SVC video transcoding among virtual machines in the cloud.

Furthermore, two approximate solutions are proposed to minimize the transcoding delay and

reduce the transcoding jitter.

Compared to the previous research work, in Chapter 4 a novel scheme towards distributed

video transcoding is proposed. In summary, this scheme takes the visual similarity of the

video frames into account to reduce bitrate and transcoding time. More detail is provided

in Chapter 4.

2.2.3 Unequal Error Protection for Streaming Layered Videos

Unequal error protection (UEP) is a special form of forward error correction (FEC) where a

stronger forward error correction code is used to protect more important data. The applica-

tion of UEP in data transmission over noisy channels was first proposed in [90]. Since then,

the use of UEP for fault tolerant streaming of layered videos has been widely investigated.

The existing research work on UEP for streaming layered videos can be summarized into


three focuses: coding paradigms, importance measures, and unequal error protection across

video layers.

Coding paradigms: Any coding technique that can generate code at an arbitrary code rate

can be used to provide unequal protection by applying stronger codes with more redundant

data to more important information. In [135], the performance of different random linear

coding strategies for unequal protection of data over lossy packet erasure links is analyzed.

Common coding techniques for UEP, among others, include rate-compatible convolutional

codes (RCPC) [100], low-density parity-check (LDPC) codes [51], growth codes [108], ex-

panding window fountain codes (EWFC) [95], and Raptor codes [96]. While the aforemen-

tioned works are mostly on packet erasure channels, i.e., they set up the UEP mechanism in

the application layer, there is a large body of research on UEP in the physical and transport

layers using bit-level UEP schemes [61]. Since the focus of the research work presented in

Chapter 5 is on UEP for layered videos in the application layer, different coding techniques

for UEP are not investigated and general random linear codes are used as the FEC code in

the proposed UEP model.

Importance metrics: Whatever coding paradigm is selected, an unequal protection technique requires that the data to be transmitted be divided into different categories of importance. In

the case of layered videos, various information can be used as the input to the video packet

importance metric:

• Layer information: A basic importance measure of a video packet is the posi-

tion of its corresponding video layer inside the layer dependency structure. For

example, a video packet may contain information of a reference or a dependent

picture, which can be used as an importance measure [166].

• Data partitioning: AVC generates three separate data partitions: header data,

intra-predicted macroblocks (macroblocks that are predicted from macroblocks

from the same picture), and inter-picture predicted macroblocks (macroblocks


that are predicted from other pictures). In [96] and [124], the data partition

is used as the importance metric for corresponding video packets.

• Bitrate of the video frame: In [166], the bitrate of the encoded video frames is

considered as an importance metric for the corresponding video packets. The

rationale behind this design is that larger frames contain more information

due to less compression. Similarly, in [95] and [96], the bitrate of video slices

is used for the same purpose.

• Slice order: An experimental study showed that early slices are more important

than the later ones [94]. Hence, in [94] the order of slices in video frames

is used as an importance metric. In [126], the flexible macroblock ordering

(FMO) capability of the H.264 family of video coding standards is used to

dynamically associate the macroblocks to the slices. This mechanism can be

used to improve the performance of UEP by keeping the macroblocks of the

same importance level in the same slice.

• Error propagation zone: Error propagation zone is the set of processing units

(frames, slices, or macroblocks) that depend on a specific reference unit. In

[53], error propagation zone of each frame is employed to create a hierarchy of

importance levels.

• Motion information: In [104], the size of motion information of each slice is

used as an importance measure at the slice level. Similarly, in [42] motion

energy, defined as macroblock size times the motion vector size, is used as the

importance measure at the frame level.

We note that none of the proposed UEP mechanisms for layer-coded video streaming

considers the internal design of the video codec standard in use. In fact, most of the proposed

mechanisms can be used for any data with different importance levels, such as data partitions


in H.264/AVC [145], multiple descriptions in MDC [23], and video layers in SVC [112].

While the generality of the proposed methods can be considered as an advantage, ignoring

the internal design of video codec standards leads to less effective UEP due to inaccurate

estimation of the importance of visual information encapsulated in each video packet. In

Chapter 5, we propose a novel UEP mechanism that considers the internal design of SVC,

the state-of-the-art layered video coding standard, to calculate the relative importance of

different video packets.

Unequal protection across video layers: Regardless of the selected coding technique and

importance measure, it is common to apply the unequal protection algorithm separately to

the video packets that belong to the same layer. Next, the video layers can be encoded

together according to the layer dependencies [40, 84, 111, 127, 137, 160]. The algorithms

proposed to encode video layers together can be divided into two main categories: those that

use a block diagonal coefficient matrix [107,127], and those that use a ladder shaped coefficient

matrix [138].

To explain, assume a simple coding technique such as Reed-Solomon code [144] is used for

unequal protection of video layers [127,137]. Due to the layering design of SVC, coding can

be performed independently in each layer, where packets belonging to more important layers

are encoded into more coded blocks. Each segment within a layer l is divided into k fixed-size blocks B_l = [b_l,1, b_l,2, . . . , b_l,k]. The blocks are then linearly combined into n > k encoded blocks C_l = [c_l,1, c_l,2, . . . , c_l,n] using n sets of coefficients, where c_l,i = ε_l,i × B_l. In this equation, ε_l,i is a set of randomly chosen coding coefficients for layer l in a finite field, normally of size 256. All operations are performed in the finite field so that the size of the coded blocks is the same as that of the original blocks.
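A minimal sketch of producing a single coded block as a random linear combination of k data blocks over GF(256); the choice of the irreducible polynomial 0x11D and all names are assumptions made for illustration, and decoding (Gaussian elimination over the field) is omitted.

```python
import os

def gf256_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8); 0x11D is assumed as the irreducible polynomial."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D       # reduce modulo the field polynomial
        b >>= 1
    return p

def encode_block(data_blocks: list[bytes], coeffs: bytes) -> bytes:
    """One coded block: the coefficient-weighted XOR-sum of the k data blocks."""
    coded = bytearray(len(data_blocks[0]))
    for block, c in zip(data_blocks, coeffs):
        for i, byte in enumerate(block):
            coded[i] ^= gf256_mul(byte, c)
    return bytes(coded)

k, block_size = 4, 16
blocks = [os.urandom(block_size) for _ in range(k)]
coefficients = os.urandom(k)              # one random coefficient set (epsilon) for this coded block
print(encode_block(blocks, coefficients).hex())
```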

The key advantage of such a coding technique is that the original blocks can be recovered

from any k + ε, ε ≥ 0, out of the n encoded blocks. If the n sets of coefficients that form

the coefficient matrix are properly chosen, with high probability any k coded blocks are


sufficient to recover the original blocks, i.e., the transmission overhead ε is minimal. To this

end, block diagonal [107, 127] and ladder shaped [138] coefficient matrices, as illustrated in

Fig. 2.7, are used in most of the proposed coding schemes for unequal protection of video

layers. The low triangular design of the coefficient matrix is expected to provide progressive

decoding when receiving the first k coded blocks.

Figure 2.7: Block diagonal and ladder shaped coefficient matrices for two video layers L1 and L2, in which each video segment is divided into k1 and k2 data blocks, respectively. These matrices are multiplied into k1 + k2 data blocks to create k1 + k2 reconstruction blocks and d1 + d2 redundant coded blocks for forward error correction.

For each video segment of each layer, the block diagonal coefficient matrix is a combi-

nation of a lower triangular and a general coefficient matrix, as illustrated in Fig. 2.7(a),

where each coded block i of layer l is a linear combination of blocks 1 to i of the same video

segment. There will be exactly k such coded blocks for each segment that consists of k orig-

inal blocks. Redundant coded blocks are produced and sent to recover any loss during the

transmission of the first k coded blocks. However, if any of the first k coded blocks are lost,

the decoding must wait for the redundant coded blocks to recover the loss, which introduces

an extra delay.

The ladder shaped coefficient matrix is an extension over the block diagonal coefficient

matrix, as illustrated in Fig. 2.7(b). It trades computation and bandwidth for increasing

redundancy in higher priority layers by including the higher priority layers in the coded

blocks from lower priority layers [138]. In this approach, each coded block i of layer l is a


linear combination of blocks from the same video segment in layers 1 to l − 1 and blocks 1

to i in layer l. Hence, the base layer of SVC is decodable with higher probability since it is

included in every coded block.
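To make the two matrix shapes concrete, the following sketch builds boolean masks (rows are coded blocks, columns are data blocks) showing which data blocks each coded block combines under the block diagonal and ladder shaped designs for two layers; an actual scheme would fill the non-zero positions with random finite-field coefficients, and k1, d1, k2, d2 follow Fig. 2.7. The function names are hypothetical.

```python
import numpy as np

def block_diagonal_mask(k1: int, d1: int, k2: int, d2: int) -> np.ndarray:
    """Each of the first k coded blocks of a layer combines blocks 1..i of that layer;
    the d redundant rows of a layer combine all blocks of that layer."""
    rows, cols = k1 + d1 + k2 + d2, k1 + k2
    m = np.zeros((rows, cols), dtype=bool)
    m[:k1, :k1] = np.tril(np.ones((k1, k1), dtype=bool))          # lower triangular, layer 1
    m[k1:k1 + d1, :k1] = True                                      # redundancy for layer 1
    m[k1 + d1:k1 + d1 + k2, k1:] = np.tril(np.ones((k2, k2), dtype=bool))
    m[k1 + d1 + k2:, k1:] = True                                   # redundancy for layer 2
    return m

def ladder_mask(k1: int, d1: int, k2: int, d2: int) -> np.ndarray:
    """Same as block diagonal, but every layer-2 row also includes all layer-1 blocks."""
    m = block_diagonal_mask(k1, d1, k2, d2)
    m[k1 + d1:, :k1] = True
    return m

print(block_diagonal_mask(2, 1, 2, 1).astype(int))
print(ladder_mask(2, 1, 2, 1).astype(int))
```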

Compared to block-diagonal and ladder-shaped coefficient matrices, in Chapter 5 a novel

coding scheme is proposed for delivering layered video in lossy packet networks. The pro-

posed coding scheme is based on a coefficient matrix that combines the diagonal coefficient

matrix with the ladder-shaped coefficient matrix. Besides eliminating unnecessary coding

operations, the proposed scheme also reduces the delays due to packet loss and out-of-order

delivery. Similar to existing proposals for using erasure codes in wireless communication

[13, 19], erasure coding is employed to cope with channel loss. However, it does not du-

plicate the source bits (as in [19]), which would impose extra multiplexing/demultiplexing, nor does it assume single-channel transmissions (as in [13]). Furthermore, any coding paradigm or

importance metric can be used along with the proposed scheme for UEP across video layers,

since we assume that such schemes are in place to provide erasure transmission channels.

More details are covered in Chapter 5.

2.2.4 Cooperative Ad-Hoc Networks and WiFi Offloading

The idea of utilizing WiFi communication among cooperative smartphones was originally

proposed in [15], in which unicast connections over WiFi are used to locally distribute data.

The main motivation for utilizing opportunistic communication over local short-range links,

e.g., Bluetooth and WiFi, in mobile devices is to offload the cellular traffic, while maintaining

short delays [54, 62, 143]. For example, a system may stream video segments over WiFi

and share error recovery code over Bluetooth [81], or schedule transmission over multiple

WiFi hotspots [121] to receive one flow of data. In [26, 59], mobile phone users use social

relationships along with the geographical proximity to disseminate information stored on

the devices through WiFi instead of cellular links. In [34, 109], the same idea has been


extended to simultaneous Internet connections over cellular and WiFi channels. Similarly,

by exploiting multiple cellular and WiFi connections, digital content may be delivered to

mobile devices over multiple paths [128].

In [123], a collaboration model is proposed that aggregates cellular and local WiFi band-

width to provide higher downlink bitrate for participant nodes. To disseminate data, multi-

hop peer-to-peer unicast communication is used, which introduces additional delay to the

streaming session and limits the potential saving in terms of energy consumption. In contrast,

in [72,114] it has been proposed to use the overhearing nature of wireless communication and

WiFi Direct [9] to obtain single-hop unicast transmissions between collaborating nodes. WiFi

Direct is a standard that allows smartphones to create peer-to-peer connections without the

need for a wireless access point. The connected devices can transfer data between each other

while maintaining their connection to the Internet. In WiFi Direct, one of the phones plays

the role of a software access point, centrally managing the WiFi network.

In Chapter 6, we use the same idea and utilize both the cellular networks and WiFi

communication opportunities to improve the quality of multimedia streaming on smartphone

devices while conserving energy. The goal is to allow each user to enjoy the aggregated

downlink for an energy conserving, continuous, high bitrate, and low delay video streaming

experience. Toward this goal, we propose a light-weight distributed scheduling algorithm

along with simple, yet effective, collaboration policies for the cooperating nodes. Nonetheless,

both wireless networks that are engaged in the proposed collaborative streaming system are

subject to noise and can significantly benefit from protection against packet loss. In the

proposed system, we use network coding.

Network coding offers a simple yet effective loss recovery mechanism, with minimum

communication overhead. Network coding (NC) was originally proposed in the field of in-

formation theory to achieve optimal communication throughput [12]. After the introduction

of random linear NC (RLNC) [56], the concept of NC has been widely applied in practical


content distribution systems [85]. Instead of encoding and decoding the data only at the

source and destination nodes as in usual applications of erasure codes, network coding allows

any intermediate node to decode and/or re-encode disseminated data. In RLNC, the new

coded blocks are encoded by using randomly generated coding coefficients in GF(2^x). If

a sufficiently large field size is used, with high probability any k coded blocks are linearly

independent.

One of the challenges of applying NC is the computational complexity of the encoding

and decoding operations. This challenge has prevented the application of NC on mobile

devices due to limited computing and battery power until recent advances in processing

power on mobile devices. The first implementation of NC on mobile devices was presented

in [103], followed by an implementation of NC on iPhone in [117]. These studies have

shown that it is now feasible to perform coding operations on mobile devices, which leads to

investigations of various applications of NC in mobile networks. For instance, Pedersen et

al. proposed Pictureviewer, a mobile application that utilizes NC to transfer pictures among

mobile devices over WiFi links [102]. Furthermore, network coding is utilized as an effective

mechanism to recover losses and errors with minimum communication overhead in wireless

networks [101]. Towards multimedia streaming, Vingelmann et al. applied NC to stream

video content among a group of iPhone devices [133].

Recently, network coding has been applied in cooperative streaming systems to reduce

traffic in the cellular network and to simplify the cooperation mechanism in the WiFi network

[72]. Microcast performs RLNC in GF(256) to encode the video content. The coded content

is transmitted from the source to smartphones over a cellular network and is then shared

among smartphones in local WiFi network [72]. However, this design trades higher power

consumption on mobile devices for less traffic in the cellular network and better system

throughput. Although it has been shown that modern smartphones can perform coding

operations in GF(256) at a decent rate [103, 117], the operations still consume a noticeable


amount of energy.

To address these issues, in Chapter 6 a two-level coding scheme is utilized that reduces the

computational complexity of NC on mobile devices [154]. Furthermore, for the first time, the

power consumption minimization problem in the NC-based cooperative streaming systems

is formulated and an optimal rate allocation and scheduling (RAS) algorithm is proposed

[157]. The proposed cooperative streaming system minimizes both the streaming traffic in

the cellular network and the energy consumed by streaming applications on mobile devices.

The system carefully employs NC only when necessary to minimize the communication

and computational overhead introduced by coding operations. Finally, the system enforces

fairness in battery drainage among mobile devices so that the system can support longer

streaming sessions.


Chapter 3

Detailed Analysis of Layered Video Coding

The growing number of makes and models of modern smartphones, tablets and smart TVs

has led to an increasing diversity of devices used for streaming multimedia content from the

Internet. Despite the diversity in hardware/software capabilities and variations in network

connectivity, end users expect the best possible quality in the video streaming experience.

Moreover, increasing traffic from cellular networks poses another challenge in maintaining

the playback quality throughout a streaming session since the network characteristics may

fluctuate significantly.

One solution to this problem is to prepare several versions of a video for a predefined

set of resolutions. For example, YouTube [152] encodes videos in H.264/AVC [145] and VP9

[142] formats and supports a variety of video resolutions. Currently, YouTube offers nine

recommended video resolutions, starting from 144p (256× 144 pixels) to 8K (7680× 4320).

This range of video resolutions can offer various video bitrates from less than 256 Kbps to

more than 10 Mbps. A user may manually select the video resolution, and may also leave it to

YouTube to send the video in the quality that best suits the device and network conditions.

On the one hand, extra storage is needed on the server side to store different versions

of the video. On the other hand, users are limited to the bitrates and video resolutions

offered by YouTube. Scalable video coding (SVC), an extension of H.264/AVC, has emerged

to support ultra and full high definition video streaming to diverse devices with different

network connection qualities. With SVC, the server maintains a single version of each video,

but the video content is delivered to end-user devices at different quality levels according to

the device capabilities and network conditions.

Moreover, SVC can address another deficiency in the YouTube-like streaming systems.


In these systems, if a user views the same video on different devices at different quality

levels, a copy at each quality level must be downloaded, which leads to increasing network

traffic and high workload demand on the video server. With SVC, the video server may

serve one copy of the video in a properly chosen format to the router or the edge media

server that is adjacent to the end-user devices. Then the router or the edge media server

may serve video streams that best match the characteristics of each device by sending a

proper set of layers. This not only alleviates the heavy burden on the streaming server and

the Internet routers, but also delivers the video in better quality to each end-user device,

thanks to the extra available bandwidth and the fine-grained bitrate adaptation provided by

the multi-dimensional scalability of SVC [112].

As opposed to conventional single-stream videos, a scalable video consists of multiple

video substreams at different quality levels. The substreams are normally referred to as

layers [112]. In a nutshell, SVC encodes a video with spatial scalability (layers with different

resolutions), temporal scalability (layers with different frame rates), quality scalability (layers

with different qualities), or any arbitrary combination of them. Furthermore, SVC can

tolerate frame losses, i.e., even if frames are dropped during transmission, the original video

can still be rendered with little distortion. For this reason, SVC has received great research

attention and has been used in many proposals for improving multimedia streaming systems

[148]. Nevertheless, in contrast to H.264/AVC [145], which is the de facto standard for single

layer video coding, just a few commercial streaming systems utilize SVC as the video codec.

The main reason is the bitrate and computational overhead that is introduced by the multi-

stream representation of the video. In this chapter, we conduct a systematic study on the

use of SVC for full HD video streaming. Our goal is to identify the good and bad uses of

SVC, to quantify the coding overhead, and to benchmark the video quality under different

spatial, temporal, and quality settings. Using a set of carefully selected and diverse video

sequences, we also identify the types of video that can benefit from SVC. Our study reveals


that the efficiency and computational gap between SVC and H.264/AVC is much less when

encoding high quality videos, e.g., in full HD resolution. There are three more interesting

observations: (1) Replacing P-frames with I-frames in complex video sequences can decrease

the encoding complexity with a very mild increase in the bitrate of the encoded video; (2)

When the video is complex and the consecutive frames are considerably different, adding

spatial layers may decrease the bitrate of the encoded video; and (3) Increasing the frame

size decreases the computational and bitrate overhead of non-dyadic spatial layers, which

can be helpful as the diverse screen resolutions of end user devices limit the application of

SVC if only dyadic spatial resolution is employed.

Towards understanding the underlying properties of SVC, we need a thorough analysis of

the performance of the coding standard in different scenarios. The current body of research

in this field is reported in Section 2.2. We note that the existing studies have one or several

of the following problems: (1) the number of test video sequences is not sufficiently large

to represent different types of videos, therefore the conclusions may be biased towards the

particular video sequences used; (2) the criteria used to select the test video sequences are not

clear; (3) the video resolutions are too small to be relevant in today’s applications; and (4)

the performance of SVC is limited to only a few scalability modes. To address these issues,

we conduct a systematic analysis on the performance of SVC in full HD video streaming.

We carefully select a set of video sequences from 29 full HD video sequences. The video

sequences represent a variety of content properties, which also allows us to identify any

linkage between performance and video content. We also examine the complete range of

video resolution, from 288p to 2160p. Furthermore, we conduct the analysis on different

aspects of SVC, including the decoding complexity and considering the effect of frame size

and all the scalability modes provided by the SVC standard.


3.1 Experiment Setup

To conduct a systematic study on H.264/AVC and SVC for high resolution video streaming,

a careful design of the experiments is necessary. In this section, we describe the experiment

setup and performance metrics.

3.1.1 Experiment Testbed

We conduct all the experiments on a server cluster of 10 nodes. Each server node is equipped

with four Intel Xeon E5640 CPUs and 16 GB of 1066 MHz memory. Each Xeon E5640

CPU has four CPU cores at 2.67 GHz and 12 MB of shared CPU cache. To avoid the effect of

multi-core operation on core performance and CPU cache hit ratio, we utilize only one core

on each CPU and at most three CPUs on each machine. To ensure the full compliance of

the used encoder/decoder software with the standard specification, we decided to use open-

source reference encoder and decoder software published by the Joint Video Team (JVT)

of ITU-T and MPEG, i.e., JM-18.6 [65] for H.264/AVC and JSVM-9.19.15 [1] for SVC.

Since the emphasis of the reference software is on compliance with specifications and not

the optimization of the coding process, this software is much slower than third-party

codecs such as x264 [91]. Both packages were compiled on Red Hat® Linux with kernel v2.6.18

and gcc v4.1.2. In addition, the MPEG-7 Visual Description Tools [118] were used to extract

some of the visual features of the video sequences, and the EPFL Video Quality Measurement Tool

v1.1 [2] was used to calculate different objective video quality metrics. Tools to extract

the motion vectors and calculate descriptive video features such as Detail [99] and Motion

Activity [67], as will be described later, were developed from scratch.

3.1.2 Selecting the Raw Video Dataset

To perform the aforementioned experiments, a proper set of high resolution raw video se-

quences is needed. By raw video sequence, we mean a video sequence in which the video


frames are not compressed even by a lossless encoder. For example, a raw video frame that

uses 4:4:4 color presentation can be considered as a bitmap image. In conformance with the

main profile of H.264/AVC and SVC, we used raw video sequences with 4:2:0 color presen-

tation. We collected 29 raw video sequences from Xiph.org Test Media collection [3], with

frame size of 1920 × 1080 pixels and frame rate of 24 frames per second (i.e. 1080p24).

This is the minimum frame size and frame rate for full HD in ATSC standards [4]. After

investigating the video sequences, we decided to use a carefully selected subset of the collected

raw video sequences, since the video sequences were not well spread across different genres

and visual features. Furthermore, considering the limited available computational resources,

using all the video sequences would severely affect the number and quality of the potential

experiments, let alone the time and effort needed to accumulate and analyze the results.

A. Selection of Video Features

A proper sampling method must be utilized to select a proper set of reference video

sequences such that the selected samples represent a variety of content types that a video

streaming server might need to encode and stream. Toward finding proper and descriptive

features, we reviewed different visual features suggested in the literature, including the fea-

tures suggested in MPEG-7 Visual standard [67, 78, 119, 147]. Consequently, we selected

three feature categories, representing color, texture and motion, as described below.

• Color: Video compression algorithms are neutral to the exact value of the color but are

sensitive to the diversity and spatial dispersion of the colors in a video frame. Therefore,

to represent the color properties of each video frame, the Dominant Color descriptor as

defined in MPEG-7 Visual standard [119] was used. This descriptor characterizes an

image by a small number of representative colors. The colors are selected by quantizing

pixel colors into up to eight principal clusters. The description then consists of the

fraction of the image or region represented by each color cluster and the variance of each

one. A measure of overall spatial coherency of the clusters is also defined. Altogether,


this descriptor provides a very compact description of the representative colors in an

image. To represent the color properties of a video sequence, we extracted the number of

dominant colors (with maximum of 8) and the spatial coherency of the color clusters for

each frame of the video sequences. Then we used the average of the calculated feature

values of the frames as the value of that feature for the respective video sequence.

• Texture: To represent the texture property of the video frames in each raw video se-

quence, we extracted the MPEG-7 edge histogram descriptor [118] for each frame. Next,

we quantized the average of edge histogram values over all frames of each video sequence

as suggested in [99]. This descriptor is known as Detail [67].

• Motion: The motion features of a video sequence are best represented by the motion

vectors and Motion Activity. Motion vectors provide the gross motion characteristics

of a video segment. However motion vectors cannot be extracted from a raw video

sequence, since there is no motion data in a raw video. To extract the motion vectors,

we first encoded each video sequence using JSVM-9.19.15 in single layer mode1, i.e.,

without any spatial or quality layers. Then the JSVM-9.19.15 decoder tool was modified

to report the motion vectors for each inter-frame motion compensated macroblock and

sub-macroblock prediction. Next, the extracted motion vectors were used to calculate

the Motion Activity [67]. Motion Activity considers the intensity, direction, spatial

distribution and temporal distribution of activity in a video sequence. To calculate

the motion activity, according to [67], the standard deviation of the magnitudes of all

motion vectors of each frame was quantized between 1 and 5, and the average of the

quantized motion activity values over all the frames was used as the Motion Activity

of each video sequence; a minimal sketch of this computation is given below.
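The sketch below illustrates the per-sequence Motion Activity computation just described. It is only an illustration and not the actual tool developed for this study; in particular, the quantization thresholds are illustrative placeholders rather than the normative MPEG-7 values.

    import numpy as np

    # Illustrative thresholds for quantizing the per-frame standard deviation of
    # motion-vector magnitudes into activity levels 1..5 (placeholders, not the
    # normative MPEG-7 values).
    THRESHOLDS = (4.0, 11.0, 17.0, 32.0)

    def frame_motion_activity(motion_vectors):
        """motion_vectors: array of shape (N, 2) holding (dx, dy) per macroblock."""
        magnitudes = np.linalg.norm(np.asarray(motion_vectors, dtype=float), axis=1)
        sigma = magnitudes.std()
        return 1 + int(np.searchsorted(THRESHOLDS, sigma))  # quantize to 1..5

    def sequence_motion_activity(per_frame_motion_vectors):
        """Average of the quantized per-frame values over the whole sequence."""
        return float(np.mean([frame_motion_activity(mv) for mv in per_frame_motion_vectors]))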

To ensure that calculating the features and performing the experiments are feasible on

1 The most important encoder parameters were GOP size of 8, base quantization parameter of 32, motion prediction buffer size of 16 frames, and fast bi-directional motion search. However, in this special case, the selection of coding parameters does not affect the results as long as the same set of parameters is used for all the video sequences.


the server cluster, the video sequences were cropped to 240 frames, i.e. 10 seconds of raw

video with frame rate of 24 fps2. If a video sequence had movie titles at the beginning, the

first 1000 frames were skipped. Otherwise, the frames were selected from the beginning of

the video sequence.

B. Selection of Video Sequences

To select the proper video sequences from the population of 29 samples, we used strati-

fied sampling without replacement. First, the video sequences were divided into four clusters

according to the video genre, i.e., animation, scene, nature, and one cluster for video se-

quences that belong to scene and nature genres together (scene/nature). In each cluster all

the feature values were normalized to [0, 1]. Next, two video sequences were selected from

each of the first three video genres, the ones that have the smallest and largest Euclidean

distance to the sample mean. The closest video sequence to the sample mean was selected

for the scene/nature genre. Afterwards, we manually reviewed the selected video sequences

for each genre to make sure that they exhibit diverse values for the aforementioned features

among the cluster samples. The selected video sequences and their properties are reported

in Table 3.1 and a sample frame from each selected video sequence is shown in Fig. 3.1. The

list below further describes each of the seven video sequences used in this study.

• Big Buck Bunny (BB): An animation clip that shows a big rabbit waking up in the

morning. The animation has low and high motion activities, and features detailed

shading as well as hair and fur rendering.

• Elephants Dream (ED): An animation clip that displays a surreal scene wherein two

characters are talking, and features a foggy environment.

• Pedestrian Area (PA): The camera is fixed towards a pedestrian area. Pedestrians show

diverse contrasts and colours and complex motions.

2 The length of the video sequences is shorter than that of typical video sequences on the web. However, this is long enough to expose video codec features. In fact, studying video codecs with video sequences of 10 or more GOPs is common in the video coding research community.


• Rush Hour (RH): This video shows a street in rush hour. The camera is fixed towards

the cars passing by.

• Park Joy (PJ): This video is set along the side of a river. The camera pans from left

to right and follows a group of people running in front of trees.

• Riverbed (RB): The camera is fixed towards a riverbed and records frequent small waves

on the edge of the river. Due to the high frequency of the small waves and the reflection

of light on the surface of the water, this scene exhibits high values of motion activity

and details.

• Sunflower (SF): The camera pans horizontally and follows a bee on a sunflower.
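Returning to the selection procedure described at the beginning of this subsection, the following sketch illustrates, for one genre cluster and hypothetical pre-computed feature vectors, how the feature values can be normalized to [0, 1] and the sequences closest to and farthest from the cluster mean selected. It is an illustration of the sampling step, not the exact script used in this study.

    import numpy as np

    def select_from_cluster(features):
        """features: array of shape (num_sequences, num_features) for one genre cluster.
        Returns the indices of the sequences with the smallest and largest Euclidean
        distance to the (normalized) cluster mean."""
        f = np.asarray(features, dtype=float)
        span = f.max(axis=0) - f.min(axis=0)
        span[span == 0] = 1.0                       # guard against constant features
        normalized = (f - f.min(axis=0)) / span     # per-feature normalization to [0, 1]
        distances = np.linalg.norm(normalized - normalized.mean(axis=0), axis=1)
        return int(np.argmin(distances)), int(np.argmax(distances))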

Table 3.1: Selected video sequences and their properties.

Content  Genre         Total    Selected     Avg. Num. of     Spatial    Detail  Avg. Num. of    Motion
                       Frames   Frames       Dominant Colors  Coherency          Motion Vectors  Activity
BB       Animation     14,315   1,001-1,240  7.49             8.85       3.52    10223           1.63
ED       Animation     15,691   1,001-1,240  3.73             23.97      3.73    10147           2.39
PA       Scene         375      1-240        4.41             22.88      3.15    10116           4.42
RH       Scene         500      1-240        4.29             25.16      3.17    11041           3.12
PJ       Scene/Nature  500      1-240        6.46             3.49       4.24    17373           3.73
RB       Nature        250      1-240        3.94             24.33      4.72    8612            4.13
SF       Nature        500      1-240        6.48             19.42      4.04    13043           2.57

3.1.3 Performance Metrics

To evaluate the performance of H.264/AVC and SVC from different aspects, various per-

formance metrics were used. Most importantly, we are interested in investigating the coding

efficiency, encoding and decoding complexity, and objective quality of the encoded video.

A. Coding Efficiency: The bitrate of the encoded video stream is the main metric for mea-

suring the coding efficiency of a codec. Coding efficiency can be calculated based on the

bandwidth required to stream the raw video over the channel. However, to keep the results

more sensible, we always compare the bitrate of the compressed video encoded by the configuration of interest with the bitrate of the same video encoded by the reference coding configuration. Along with the bitrate of the encoded video, the MPEG-7 motion activity of the encoded video is also used wherever it helps the discussion.

Figure 3.1: Sample frames from the selected video sequences: (a) Big Buck Bunny (BB), (b) Elephants Dream (ED), (c) Pedestrian Area (PA), (d) Rush Hour (RH), (e) Park Joy (PJ), (f) Riverbed (RB), (g) Sunflower (SF).

B. Encoding and Decoding Complexity: We measure the CPU time required to encode and

decode each video sequence. We compare the performance of SVC with the single layer

H.264/AVC and Simulcast, all performed using JM-18.6 and JSVM-9.19.15 software. The

performance difference in terms of encoding and decoding time reflects the difference be-

tween SVC and other coding standards. Both JM and JSVM software packages are strict

implementations of the standard. Therefore, no optimization is performed, and no part of

specification is sacrificed for better encoding or decoding performance.

C. Objective Quality: In this chapter (and in the future ones), we focus on the objective video

quality instead of subjective video quality. As a reminder, for any subjective measure to be

representative, we need a large pool of participants and a large collection of videos. Such

a user-based study is orthogonal to this work. To quantify the objective video quality, we

measured Y-PSNR (the PSNR value of the luma component of the video sequence), SSIM

(structural similarity index) [139], MS-SSIM (multi-scale structural similarity index) [141]

and the pixel domain version of VIF (visual information fidelity) [116] for each encoded

video sequence. All these metrics are calculated as the average value of a full reference

image quality assessment (FR-IQA) metric over raw and decoded video frames. To keep the

discussion concise, only the results for Y-PSNR and VIF metrics are included in Sec. 3.2.

Compared to other objective video quality metrics, VIF is known for its high correlation

with the subjective video quality [92].

Throughout the performance analysis, we use a combination of the aforementioned per-

formance metrics and overhead of the performance metrics wherever appropriate. We define

the overhead of a performance metric as the ratio of the performance of the configuration


being measured to that of the reference coding configuration.
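As a concrete illustration of these metrics, the sketch below computes the per-sequence Y-PSNR as the average of the per-frame luma PSNR values, along with the overhead of a metric relative to the reference configuration. It assumes 8-bit luma planes stored as NumPy arrays; the percentage form of the overhead (negative when the measured configuration outperforms the reference) is our reading of how values such as those in Table 3.2 are reported, not a definition quoted verbatim from the text.

    import numpy as np

    def y_psnr(raw_luma_frames, decoded_luma_frames, peak=255.0):
        """Average luma PSNR (dB) over all frames of a sequence (FR-IQA style)."""
        psnrs = []
        for raw, dec in zip(raw_luma_frames, decoded_luma_frames):
            mse = np.mean((raw.astype(np.float64) - dec.astype(np.float64)) ** 2)
            psnrs.append(10.0 * np.log10(peak ** 2 / max(mse, 1e-12)))
        return float(np.mean(psnrs))

    def overhead_percent(measured, reference):
        """Overhead of a metric with respect to the reference configuration, in percent."""
        return 100.0 * (measured / reference - 1.0)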

3.2 Performance Analysis

In this section, we present the results of our analysis of different video properties in layered

video coding. We begin with an analysis of the effect of frame size, followed by studies on

each of the three scalability modes (spatial, temporal, and quality) in layered video coding.

By default, we configure the AVC encoder such that the number of pictures in each

GOP (GOP size) is 16; the base quantization parameter is 28; the size of motion prediction

buffer, in which the reconstructed frames are kept for motion estimation and decoding, is

16 frames; and fast motion search algorithm has been used, which reduces the encoding

time without considerably decreasing the objective quality of the encoded video. The same

setting is used for the SVC encoder. Additionally, we configure the SVC encoder to generate

a scalable video stream with two dyadic spatial layers (1920 × 1080 pixels and 960 × 540

pixels), five temporal layers (GOP = 16) and two quality layers (using MGS scalability

mode). The minimum quantization parameter (QP) is set to 28 with a default delta QP of

−2 for quality layers to ensure enough spatial detail is preserved. Automatic QP cascading is

used for temporal layers. To evaluate the performance of SVC codec over extended layering

scenarios, the memory management and encoding/decoding modules of the JSVM-9.19.15

were modified such that the number of permissible layer configurations was increased from

8 to 32.
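For reference, the default configuration described above can be summarized as follows. This is only an illustrative summary (not a JSVM configuration file), and the DTQ triple is read here as the number of spatial (D), temporal (T), and quality (Q) enhancement layers, which is consistent with the layer settings used throughout this chapter.

    # Illustrative summary of the default SVC encoding configuration used in Sec. 3.2.
    default_svc_config = {
        "gop_size": 16,                                   # five temporal layers
        "base_qp": 28,                                    # minimum quantization parameter
        "delta_qp_quality_layers": -2,
        "motion_prediction_buffer_frames": 16,
        "fast_motion_search": True,
        "spatial_layers": [(960, 540), (1920, 1080)],     # two dyadic spatial layers
        "quality_layers": 2,                              # MGS scalability
        "dtq": (1, 4, 1),                                 # enhancement layers per dimension
    }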

In this analysis, H.264/AVC refers to the single layer encoded version of the video se-

quence in full HD, and Simulcast refers to the single layer encoded version of the video

sequence in multiple resolutions. We note that the aforementioned SVC encoding configu-

ration supports 18 different bitrate points, spanning from 119.6 Kbps to 1.3 Mbps for BB

video sequence as an example, whereas the H.264/AVC and Simulcast support only one and

two bitrates, respectively.


3.2.1 The Effect of Frame Size

Before examining the scaling factors of SVC, we wish to first understand the effect of frame

size (resolution), which is the quality feature most readily observed by end users. In this

experiment, we downsample the video sequences from full HD resolution (1920 × 1080) to

smaller frame sizes using non-normative downsampling [110] with frame heights of 288, 360,

480, 576, 720 and 900 pixels while preserving the 16:9 ratio for the frame width. We observe

a similar performance trend across all seven video sequences. To facilitate the discussion,

we present the results from video sequence BB in Fig. 3.2.

Figure 3.2: Comparing the performance of H.264/AVC, SVC and Simulcast over the video sequence Big Buck Bunny (BB) when the frame size is varied from 512 × 288 pixels to 1920 × 1080 pixels. (a) The effect of frame size on coding efficiency (highest quality layer bitrate, Mbps); (b) the effect of frame size on encoding time; (c) the effect of frame size on decoding time; (d) the effect of frame size on video quality (Y-PSNR, dB).


In general, as the frame size increases, so do the bitrate, coding complexity, and the

objective quality for both coding standards. However, there are subtle differences among the

two standards. According to the bitrates reported in Fig. 3.2(a), the bitrate overhead of SVC

compared to AVC decreases from 80.3% to 17.8% when increasing the video frame size from

512× 288 to full HD. The bitrate overhead in SVC is the tradeoff for supporting 18 different

bitrate points spanning from 119.6 kbps to 1.3 Mbps, while H.264/AVC supports only one

bitrate. If fewer bitrates are required, this overhead can be decreased by lowering the number

of spatial or quality layers. In Simulcast, the full HD version of BB video sequence and its

downsampled version with 960× 540 pixels are encoded separately with H.264/AVC codec.

Compared to the two-resolution Simulcast, SVC’s bitrate is significantly less, especially

for frame sizes larger than 576p. To compare, as depicted in Fig. 3.2(a), two single layer

H.264/AVC streams are required to offer the flexibility provided by the spatial scalability

of the SVC bitstream, and the bandwidth requirement surpasses 1.5 Mbps. Furthermore,

from Table 3.2, it can be seen that this overhead is the highest overhead observed in the

test video sequences. To investigate the presence of this effect in higher resolutions, the

same experiment has been repeated over BB video sequence in Quad-Full-HD resolution

(3840 × 2160 pixels), and as expected, the bitrate overhead of SVC decreased further, to 11.2%.

In terms of coding complexity, as shown in Fig. 3.2(b), SVC and AVC have similar

encoding time for frame sizes less than 720p. For larger frames, it takes SVC less time to

encode than AVC does. When increasing the frame size from 512×288 to full HD, the number

of 16×16 macroblocks in each frame increases from 576 to 8100. Thereby, the probability of

finding similar macroblocks for motion compensation increases, which consequently reduces

the motion compensated residual error generated for each motion vector, hence resulting in

fewer bits required for each motion compensated macroblock. This observation is reinforced

by our measurement for the video sequence BB, which shows that when increasing the

frame size from 512× 288 to full HD, the average number of bits required to represent each


motion compensated macroblock by H.264/AVC and SVC is reduced by 34.8% and 48.7%,

respectively. The decoding complexity is different from encoding complexity. Fig. 3.2(c)

shows that SVC has higher decoding complexity compared to AVC, which is expected due to

the growing number of motion vectors needed for layered prediction of motion compensated

macroblocks during the decoding process. More interestingly, Fig. 3.2(b) shows that enlarging

the video frame size to HD (1280 × 720) closes the computational gap between H.264/AVC and

SVC in this specific encoding setting; and, according to Table 3.2, when encoding the BB video

sequence in full HD and Quad-Full-HD, the encoding complexity of the SVC codec is lower than

that of the AVC codec by 20.3% and 28.7%, respectively. Since SVC uses motion prediction

in each enhancement layer, it benefits more from the increased number of similar

macroblocks.

Increasing the frame size also improves the objective quality of the video, since more

similar macroblocks allows the rate distortion optimization module of the encoder to select

higher quality points when encoding the video sequence. Fig. 3.2(d) shows that both SVC

and H.264/AVC codecs experience a similar growth in objective video quality measured in

Y-PSNR. SVC consistently has higher Y-PSNR values than AVC does.

The first two observations of this measurement study are important since the main reason

for not utilizing scalable coding in the video content distribution industry is the bitrate and

complexity overhead of SVC. To confirm these results, the same set of experiments have

been repeated for other source video sequences. The results are presented for the full HD

video sequences in Table 3.2.

Table 3.2: Comparing the performance of SVC (DTQ = (1, 4, 1)) and H.264/AVC for full HD video coding.

                             ED        PA        RH        PJ        RB        SF
Bitrate Overhead             18.1%     3.3%      −3.4%     12.5%     17.7%     5.4%
Encoding Overhead            −18.9%    −34.5%    −30.6%    −38.1%    −43.2%    −44.2%
Decoding Overhead            114.3%    107.4%    106.8%    91.8%     106.2%    104.5%
Quality Improvement (dB)     1.3       0.7       1.1       0.5       1.2       0.4


3.2.2 The Effect of Temporal Scalability

As described in Sec. 2.1.3, SVC temporal scalability is provided by a hierarchical temporal

prediction structure among I-, B- and P-frames. The structure can be characterized by GOP

size and Intra Period parameters. We vary these parameters to study the effect of temporal

scalability.

First, we increase the GOP size from 2 to 16, which adjusts the number of temporal layers

from 2 to 5 accordingly. Fig. 3.3(a) shows that increasing the GOP size from 4 to 8 decreases

the bitrate of the encoded video sequences by an average of 3.6%, but increasing the GOP

size from 8 to 16 increases the bitrate by an average of 3.9%. This is rather counterintuitive.

The bitrate is expected to drop, since growing the GOP size requires replacing P-frames

with B-frames. However, we note that this replacement may increase the residual error

of macroblocks that use the P-frame as a reference frame. Thus, more bits are needed to

represent the residual error, which yields higher bitrate overall. We also observe that the

bitrates of video sequences PJ and RB are noticeably higher than the other video sequences.

This is because both videos have high values for Detail and Motion Activity in Table 3.1.
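The coupling between GOP size and the number of temporal layers used above can be captured by the small helper below. It reflects the dyadic hierarchical-B settings of this study (GOP sizes 2 to 16 yielding 2 to 5 temporal layers) and is only an illustration.

    import math

    def temporal_layers(gop_size):
        """Number of temporal layers for a dyadic hierarchical-B GOP of the given size."""
        assert gop_size > 0 and gop_size & (gop_size - 1) == 0, "GOP size must be a power of two"
        return int(math.log2(gop_size)) + 1

    print([temporal_layers(g) for g in (2, 4, 8, 16)])    # [2, 3, 4, 5]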

In terms of coding complexity, all video sequences exhibit an increasing trend when the

GOP size grows from 2 to 16, as shown in Fig. 3.3(b) and Fig. 3.3(c). This observation on

full HD videos contradicts the observation on QCIF and CIF video sequences as reported in

[130]. The increase is mostly due to the extended search domain for motion-compensated

predictions when more B-frames are used, as reported in Fig. 3.3(d). Among the growing

coding complexities, there is a slight decrease in decoding complexity for video sequences

BB, SF, and ED when the GOP size goes from 2 to 4. According to Table 3.1, these three

video sequences have the lowest Motion Activity, which suggests that replacing P-pictures

with B-pictures has little impact on the number of motion compensated predictions.

We also observe quality improvement in terms of Y-PSNR when increasing the GOP size,

except for the video sequence RB, which has the highest Detail descriptor. As depicted in


Figure 3.3: The effect of increasing the GOP size from 2 to 16 on the performance of H.264/SVC for encoding the test video sequences. (a) Coding efficiency (highest quality layer bitrate, Mbps); (b) encoding time; (c) decoding time; (d) video quality (Y-PSNR, dB); (e) video quality (VIF); (f) average number of motion compensated predictions.


Fig. 3.3(e), the increase among video sequences BB, ED and SF is more noticeable. With

larger GOP size, the distance between predicted frames is also larger, which is not a desirable

property for the video sequences with high motion activity values. In contrast, for the video

sequences with low motion activity values, replacing P-frames with B-frames conserves the

bitrate and allows the rate distortion optimizer module to select higher quality points, leading

to increased video quality.

Next, we investigate the effect of Intra Period parameter on the performance of SVC

codec. In this experiment, we encode all video sequences using three different SVC encoding

configurations (0, 2, 2), (1, 3, 1) and (2, 4, 3). The Intra Period parameter is varied from 0

GOP to 4 GOPs, i.e., substituting one motion predicted P-frame with an I-frame for every

0 – 4 GOPs. As Intra Period increases, fewer I-frames appear in the video sequence. For

Intra Period of 0 GOP, no substitution will take place. The results from video sequence PA

are shown in Fig. 3.4. As shown in Fig. 3.4(a), there is a slight increasing trend for bitrate

when more I-frames are inserted. Fig. 3.4(b) and 3.4(c) show that adding intra-coded frames

slightly reduces encoding and decoding complexities, since some motion predicted P-frames

are replaced by I-frames that are less complex. In terms of objective quality, very little

improvement is observed in Fig. 3.4(d). Although using I-frames improves the quality of the

encoded video, it causes the rate distortion optimizer module to select lower quality points

due to the increased bitrate.

To compare the effect of Intra Period parameter among all seven video sequences, we

present the results when Intra Period parameter is 1 GOP in Table 3.3. We observe that the

bitrate overhead is less significant for video sequences with high detail and motion activities,

because an additional intra-coded frame can be helpful in providing higher quality reference

macroblocks and also resetting the error propagation chain among the predicted macroblocks.

This ultimately decreases the residual errors and the number of bits required to represent

them. In summary, when using SVC for full HD video streaming, additional intra-coded


Figure 3.4: The effect of varying the Intra Period parameter on the performance of H.264/SVC when encoding different layered representations of the Pedestrian Area (PA) video sequence. (a) Coding efficiency (highest quality layer bitrate); (b) encoding time; (c) decoding time; (d) video quality (Y-PSNR); (e) video quality (VIF); (f) motion activity of the encoded video.


frames are beneficial to videos with high detail and motion activity values.

Table 3.3: The effect of additional I-pictures on the performance of SVC when DTQ = (2, 4, 3) and Intra Period = GOP size.

Video   Bitrate Overhead   Encoding Time Gain   Decoding Time Gain   Y-PSNR Gain
BB      36.0%              1.8%                 0.7%                 0.7%
ED      17.3%              1.0%                 0.6%                 0.4%
PA      8.6%               1.6%                 0.4%                 0.2%
RH      8.1%               4.0%                 0.6%                 0.1%
PJ      0.5%               7.8%                 0.1%                 0.0%
RB      0.1%               7.1%                 0.2%                 0.0%
SF      15.1%              2.6%                 0.6%                 0.4%

3.2.3 The Effect of Spatial Layering

As described in Sec. 2.1.3, SVC supports spatial scalability in both dyadic and non-dyadic

modes. To investigate the effect of spatial layering on the performance of SVC codec for

full HD video streaming, two separate experiments were performed, one for the dyadic mode

and one for the non-dyadic mode. Since we adjust the GOP in this analysis, we use a

smaller GOP size of 4 for the reference encoding. Finally, the source video sequences have

been encoded such that spatial support for two common standard video resolutions, i.e., HD

(1280× 720) and Wide 480p (848× 480) are provided besides full HD.

Dyadic Spatial Layering vs. Single Layer Coding

In this experiment, dyadic spatial layering is applied to the full HD version of each video

sequence to create two layered video sequences with two and three dyadic spatial layers, called

DY1 and DY2, respectively. For comparison purposes, the same experiment is repeated with

single layer H.264/AVC, where the encoder encodes the video in full HD and two dyadic

spatial resolutions in parallel. We use SIMC1 to refer to the combination of full HD and one

dyadic spatial resolution, and SIMC2 to refer to the combination of full HD and two dyadic

spatial resolutions.


As shown in Fig. 3.5(a), sending two and three different resolutions of the videos in

parallel imposes an average bitrate overhead of 28.6% (for SIMC1) and 40.3% (for SIMC2).

In contrast, SVC spatial scalability significantly decreases the bitrate overhead to 8.8% (for

DY1) and 14.1% (for DY2). The performance gain is due to the intra-texture, motion and

residual signals of the lower resolution layers used to predict the higher resolution layers.

Interestingly, the video sequence RB in SVC format decreases the required bandwidth by

4.4% and 2.0% compared to the single-layer reference AVC encoding. This is related to

the high detail and motion activity value of the video as well as the low number of motion

vectors from Table 3.1. These properties indicate that the temporal prediction does not

provide enough motion compensated predictions. Compared to the single-layer coding, two

additional spatial layers in SVC increase the average number of motion vectors among all

test video sequences by 16.4%. RB experiences a 29.7% increase, which confirms the role of

spatially predicted macroblocks in the bitrate decrease that is observed for RB. For coding

complexity, the use of one dyadic and two dyadic spatial layers in SVC increases the encoding

time by an average of 50.6% and 68.4%, respectively. The same trend is observed for all

videos; thus, the detailed results for each are not shown.

Figure 3.5: The effect of SVC spatial layering on (a) the streaming server side (bitrate overhead of SVC with two and three spatial layers vs. Simulcast with two and three resolutions) and (b) the receiver side (bitrate overhead of AVC and SVC streams at the 480, 960 and 1920 resolutions).


Fig. 3.5(b) compares the bitrate required on the client side to receive the video in either

AVC or SVC format with the specified dyadic resolutions (480p, 960p, and 1920p). The

average bitrate required to receive the base layer of SVC bitstream (SVC-480), which is also

AVC compatible, is 70.4% that of single-layer AVC (AVC-480). This inevitably leads to a

lower objective video quality. Our measurements show that using SVC with the specified

settings decreases the objective video quality of 480×270, 960×540 and full HD reconstructed

videos by an average of 0.98, 1.17 and 1.05 dB, respectively. Furthermore, the average

bandwidth required to receive the videos in 960× 540 and full HD resolutions in SVC mode

are 10.1% and 15.0% more than that of single-layer AVC mode, respectively. For coding

complexity, there is no significant difference between SVC-480 and AVC-480, since they are

both AVC compatible. The added spatial layers in SVC require 27.1% (for SVC-960) and

84.6% (for SVC-1920, full HD) more decoding time. Hence, full-HD SVC is not recommended

on battery-operated devices or devices with limited CPU power. However, the decoding time

of the SVC bitstream can be dramatically decreased at the expense of minor limitations in

spatial layering capabilities [25]. Again, PJ and RB exhibit a different behaviour, since they

need less bandwidth for their 960× 540 and full HD resolutions, respectively, which can be

due to their high values of detail and motion activity.

Dyadic vs. Non-Dyadic Spatial Layering

To compare dyadic and non-dyadic spatial layering, we modify the resolution and the frame

ratio of the spatial layers so that the number of macroblocks in each layer remains unchanged

and the layer resolutions are non-dyadic. This version of coding is referred to as NDY1

and NDY2 for one and two non-dyadic spatial layers, respectively. We repeat the same

spatial layering experiment as in Sec. 3.2.3 with the new non-dyadic layers. Furthermore, to

investigate the effect of frame size, the same experiments are repeated with 480× 270 pixels

frames in the highest resolution layer. The average overheads from all video sequences are

reported in Table 3.4. We observe that increasing the frame size to full HD in non-dyadic


spatial layering significantly reduces the average overhead, and all overheads are less than

7.5%.

Table 3.4: Dyadic vs. non-dyadic spatial layering results. Subcolumns show the respective overhead for one and two spatial layers (NDY1 vs. DY1 and NDY2 vs. DY2), respectively.

Video Resolution   Bitrate Overhead   Encoding Overhead   Decoding Overhead
480 × 270          11.2%   16.6%      13.7%   18.1%       6.2%   11.7%
Full HD            4.4%    6.5%       6.2%    7.5%        2.2%   4.2%

3.2.4 The Effect of Quality Layering

Next, we study the quality scalability in SVC. CGS does not provide the required flexibility

for most real world situations, and JSVM-9.19.15 does not allow the configuration of relevant

parameters of FGS separately. For these reasons, we study only the MGS mode in this sec-

tion. Besides these three quality scalability modes, quantization parameter (QP) is another

factor that directly affects the quality of the encoded layers and the overall bitstream. To

investigate the effect of quality layers and QP in full HD video streaming with SVC, two

separate experiments were performed, one for quality layers and one for QP.

In JSVM-9.19.15, the number of quality layers and their properties can be specified using

the MGSVectorMode parameter with the MGSVector defining up to 16 layers. Each element

i in MGSVector specifies the quality level of the ith SNR layer, and the sum of the elements

in MGSVector must equal 16. In this experiment, we vary the number of quality layers from

zero to four using a GOP size of 4. Table 3.5 presents the MGS configuration for all layer

configurations.
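The constraint on the MGS vector described above can be made concrete with a small check. The helper below is an illustrative validation of the configurations listed in Table 3.5; it is not part of the JSVM software.

    def validate_mgs_vector(mgs_vector):
        """Each element is the share (out of 16) assigned to one SNR layer."""
        if len(mgs_vector) > 16:
            raise ValueError("at most 16 MGS layers can be defined")
        if sum(mgs_vector) != 16:
            raise ValueError("the elements of the MGS vector must sum to 16")
        return True

    # The layer configurations used in this study (see Table 3.5).
    for vector in ([16], [8, 8], [8, 4, 4], [4, 4, 4, 4]):
        validate_mgs_vector(vector)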

As shown in Fig. 3.6(a), by adding the first quality layer, the SNR refinement module

is loaded into the prediction module, resulting in 42.7% – 79.0% increase in bitrate. More-

over, adding the first quality layer introduces more bitrate overhead for less complex video

sequences (e.g., BB and ED). Except for the first quality layer, any additional quality layer

has almost negligible impact on the bitrate, since the same quality information is divided


Table 3.5: Encoding configuration for quality (SNR) layering study.

SNR Layers   MGSV0   MGSV1   MGSV2   MGSV3
0            -       -       -       -
1            16      -       -       -
2            8       8       -       -
3            8       4       4       -
4            4       4       4       4

and placed in different layers. For the same reason, the Y-PSNR is increased by 1.6% on

average when adding the first quality layer, and additional quality layers do not improve

the quality, as shown in Fig. 3.6(d). Recall that SVC is used to provide adaptive streaming

to allow end-user devices to receive a subset of these quality layers and still be able to render

the video.

According to Fig. 3.6(b) and Fig. 3.6(c), additional quality layers introduce little com-

plexity in the encoding process, but do require more decoding time on the receiver side. On

the server side, complex video sequences (e.g., RB and PJ) exhibit a slight decrease in encoding time as more quality layers are added, while less complex video sequences (e.g., BB) require more encoding time; overall, the encoding overhead only ranges from −2% for the more complex video sequences (i.e., RB and PJ, which have high values of the Detail and Motion Activity descriptors) to 4% for the less complex ones (i.e., BB). On the receiver side, Fig. 3.6(c) shows that the decoding complexity increases almost linearly, with each quality enhancement layer adding roughly 23% to 29% to the decoding time; adding four quality layers increases the decoding time by 92.5% to 111.5%. This can be due to the internal structure of the JSVM-9.19.15 decoder module, where enhancement layers are decoded and applied to the decoded


Figure 3.6: The effect of varying the number of quality layers from zero to four on the performance of H.264/SVC for different video sequences. (a) Coding efficiency (highest quality layer bitrate); (b) encoding time; (c) decoding time; (d) video quality (Y-PSNR).


picture buffer consecutively.

3.2.5 The Effect of Quantization Parameter

To investigate the effect of QP, we use the three different DTQ configurations used in

Sec. 3.2.2: (0, 2, 2), (1, 3, 1) and (2, 4, 3). For each configuration, the highest value of QP,

the QP for the base layer (QPb), is varied from 32 to 42, where delta QP is −2 for quality

layers. Automatic QP cascading was employed for temporal and spatial layers. The results

of experiments for video sequence PA are reported in Fig. 3.7. All other video sequences

share the same performance trend.

Fig. 3.7 shows that the impact of QPb is very small for videos with DTQ configuration

(0, 2, 2), where no spatial layering is used. The impact of QPb is stronger for DTQ=(1, 3, 1)

and DTQ=(2, 4, 3). According to Fig. 2.5, when spatial layering is used, the inter-layer pre-

diction is utilized to use intra, motion and residual signal information of the lower spatial

layers to predict the macroblocks in the upper layers. The reduced values of the residual

signals of the upper layers makes QP a more determining factor in the performance anal-

ysis. Therefore, when spatial layering is utilized, increasing the value of QP significantly

decreases the bitrate, slightly decreases the encoding time, and negligibly decreases the de-

coding time. The objective quality of the video stream also considerably decreases. However,

we noted that even when Y-PSNR is close to 36 dB in Fig. 3.7(d), the visual quality of the

reconstructed video is very good from a human viewer perspective.

Furthermore, Table 3.6 shows that while the effect of varying QP is similar for different

videos, it strongly affects the objective quality of more complex video sequences, i.e., PJ

and RB. This is an expected behavior since the presence of more details in more complex

video sequences leads to having more high power AC components after DCT transform,

hence increasing the effect of QP variations on the number of components remaining after

quantization.


Figure 3.7: The effect of varying the quantization parameter on the performance of H.264/SVC when different layering structures are used to encode the Pedestrian Area (PA) video sequence. The horizontal axis is the value of the highest quantization parameter used in the layered structure. (a) Coding efficiency (highest quality layer bitrate); (b) encoding time; (c) decoding time; (d) video quality (Y-PSNR).

Table 3.6: The effect of varying QP on the performance of SVC when DTQ = (2, 4, 3) and QPb is changed from 32 to 42.

Content   Bitrate Ratio   Enc. Time Ratio   Dec. Time Ratio   Y-PSNR (QPb=32)   Y-PSNR (QPb=42)
BB        40.7%           95.9%             97.0%             41.9              37.3
ED        34.5%           93.2%             97.2%             42.4              37.2
PA        32.1%           90.7%             98.6%             41.7              37.4
RH        27.4%           91.1%             97.8%             41.7              37.9
PJ        29.5%           88.5%             91.3%             37.5              30.5
RB        29.4%           83.0%             96.0%             39.6              32.1
SF        33.7%           96.1%             98.3%             43.1              38.8


3.3 Summary and Discussion

In this chapter, we conducted a systematic study on the use of H.264/AVC and SVC for full

HD video streaming. Compared to the previous research work listed in Sec. 2.2.1, this research

is different from several perspectives. First, the number of test video sequences is sufficiently

large to represent different types of videos, therefore the conclusions are not biased to the

selected video sequences. Second, the criteria used to select the test video sequences are clear

and ensures diversity in their visual properties. Third, the video resolutions are large enough

to be relevant in today’s applications. Finally, in this research, all the scalability modes of

layered video coding and their different modes, wherever applicable, are investigated.

We learned that, in contrast to the results reported in previous research on using SVC

for low resolution videos (e.g., CIF and 4CIF), SVC requires fewer computational resources

in the encoding phase and also achieves higher video quality at higher resolutions (by as

much as several dB in terms of Y-PSNR). At the same time, SVC requires a higher bitrate

due to the higher quality of the encoded video and also the presence of multiple video layers.

According to Fig. 3.7(a), an increase of 2 in the quantization parameter can result in as

much as a 20% reduction in the layered video bitrate. This would offset the higher bitrate of

the layered coded video at the expense of erasing its quality advantage over single-layer video.

SVC suffers from higher decoding complexity. However, it must be noted that due to

the presence of hardware decoders in mobile devices, decoding the video does not consume

significant energy compared to that of receiving the video over the wireless links and playing

the video on the screen. We will come back to this issue in Chapter 6. In return, SVC is

efficient in serving streaming sessions at multiple quality levels.

We identify that SVC is more advantageous in full HD streaming, since the efficiency and

computational gap between SVC and H.264/AVC is much less when encoding high quality

videos. When using SVC for full HD video streaming, additional intra-coded frames are

beneficial to videos with high detail and motion activity values. Using a set of carefully


selected and diverse video contents, we also discovered that certain SVC configurations have

advantages over H.264/AVC for complex video sequences with high detail and motion activity

values. For example, replacing P-frames with I-frames in such video sequences can decrease

the encoding complexity without increasing the bitrate of the encoded video, and additional

spatial layers may decrease the bitrate of the encoded video.

In addition to investigating the performance of layered video coding in higher video

resolutions, understanding the internal mechanism of video codecs was another outcome

of this study. In the following chapter, this knowledge is applied to the distributed video

transcoding problem as a part of preparing the multimedia streaming service in the cloud.

More specifically, the properties of the to-be-transcoded video are extracted from the high

quality encoded video stream and used to adaptively change the length of the video segments.

We discuss this further in the following chapter.


Chapter 4

Preparing Video in the Cloud

When the raw video is acquired from a camera feed, it requires significant bandwidth for

transmission and storage. For example, using 4:2:0 color space, a full HD raw video with

24 frames per second has a bitrate of 570 Mbps. The storage requirement for such a video

bitrate is roughly 268 GB per hour of video. These numbers would double if 4:4:4 color space is used

and quadruple if the video is recorded in 4K resolution. Therefore, the raw video must be

encoded into a high quality lossless or lossy compressed video prior to further processing.

Such an encoding procedure usually uses the same video resolution and frame rate as those

of the raw video input to create and store a reference version of the video in the system.

Furthermore, if lossy compression is used, the encoding parameters are set such that the

video quality does not significantly degrade. This version of the video may still require

lots of bandwidth due to its high bitrate, especially for streaming over wireless network

connections. Instead, it is used as a high quality input for the video transcoders to generate

lower-quality and lower-bitrate copies of the video for streaming.
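The raw-video figures quoted above can be verified with a back-of-the-envelope calculation, sketched below under the assumptions of 8-bit samples and binary megabits for the bitrate figure.

    # Rough check of the raw full HD 4:2:0 figures quoted above (8-bit samples assumed).
    width, height, fps = 1920, 1080, 24
    bytes_per_pixel_420 = 1.5                       # Y plane plus quarter-size U and V planes
    bytes_per_second = width * height * bytes_per_pixel_420 * fps

    print(bytes_per_second * 8 / 2**20)             # ~569.5 Mbps of raw video
    print(bytes_per_second * 3600 / 1e9)            # ~268.7 GB per hour of raw video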

Transcoding is the process of encoding a previously encoded video using new encoding

parameters. These new parameters normally decrease frame rate, frame resolution, objective

video quality, or all of them together. It is possible to design a special transcoder to increase

frame rate (by frame interpolation), frame resolution (by upsampling) or subjective video

quality (by using heuristics extracted from the characteristics of the human visual system,

e.g., by increasing the picture contrast). However, these topics are orthogonal to our dis-

cussion here. To summarize, in a usual video streaming scenario, transcoding is utilized to

convert a high frame rate, high resolution, high quality reference video into a version that

can be served over network toward end user devices.


The reference video must be transcoded to different resolutions, frame rates and quality

levels. The various versions of the video are served subject to network conditions and hard-

ware specifications of the end user device. If the target video profile uses the same video

codec as that of the reference video, depending on the video codec and the transcoding

profile, it might be possible to transcode the reference video directly to the target video

sequence. For example, if a video transcoding profile only enforces reducing the frame rate,

it can be done by dropping some frames. However, generally speaking, video transcoding

requires first decoding the high-quality encoded reference video and then re-encoding it to

the target quality level. Such a transcoding process may lead to significant delay in video

preparation phase due to the computational complexity of the encoding task. To address

this delay, distributed video transcoding is used to speed up the transcoding process. A

cloud-assisted video transcoding system segments a video into chunks and distributes video

chunks to virtual transcoder instances in the cloud for parallel transcoding. This paradigm

greatly reduces the video access delay [49, 58, 77]. In addition, layered video encoding can

be used along with cloud-assisted video transcoding to allow the media service provider to

transcode a video once and use it for several target bitrates and resolutions [32,58,136].

We summarized the previous related research work in Section 2.2. We note that existing

proposals for cloud-assisted video transcoding treat the encoded video no different from a

raw video. A fixed number of consecutive frames or group of pictures (GOP) are grouped into

a video chunk [58]. The chunks are assigned to virtual machines using a scalable technique

such as MapReduce [39]. The transcoded video chunks then can be merged into a single

video sequence to be delivered to the end user or transmitted as is. By inspecting video

codec standards, we learn that due to the high similarity of consecutive frames in video

sequences, certain important inter-frame dependencies exist among them. These dependencies

may be broken when segmenting a video into fixed-size chunks. For example, two GOPs

with very dissimilar pictures (e.g., due to change of scenery) may be grouped into one video


chunk, and two consecutive GOPs with high degree of similarity may be separated into two

video chunks. Since video encoding techniques, like other compression techniques, are based

on utilizing the similarity between the to-be-encoded pictures, encoding a dissimilar group of

pictures as a video segment can increase the video bitrate and the transcoding time and

decrease the video quality. Intuitively, this problem can be addressed by proper adjustments

of video chunks according to the visual similarity of the consecutive video frames or group

of pictures. We investigate the correctness of this statement in Sec. 4.1.1.

Exploiting this statement, in this chapter we propose a distributed video transcoding

scheme that improves resource efficiency by decreasing the video bitrate and the compu-

tational resource consumption on the cloud. It also increases the visual quality of the

transcoded video for more complex video sequences. The proposed model exploits depen-

dency among GOPs of the encoded reference video and creates video chunks of variable size.

Inter-frame dependencies reflect the visual similarity between the consecutive frames and

GOPs. A high degree of inter-frame dependency means that the encoder was able to find

many similarities between consecutive frames, hence higher visual similarity is expected.

Similarly, lack of inter-frame dependencies among consecutive frames can be interpreted as

lack of visual similarity. The goal is to reduce the bitrate and transcoding time for fast

delivery of transcoded video to end users. The key to achieve this goal is the variable-size

chunk. In the proposed scheme, the chunk size is determined according to the prediction de-

pendency among GOPs in an encoded video. Highly dependent GOPs are encoded together

to take advantage of visual similarity among enclosed video frames. We utilize layered video

coding along with video transcoding to produce transcoded videos that can satisfy certain

range of quality requirements [32, 58, 136]. The experimental results on a set of real video

sequences with diverse visual features show that the proposed transcoding scheme reduces

bitrate and transcoding time compared to conventional video transcoding schemes that use

fixed-size video chunks.
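The idea can be sketched as follows. This is only an illustration of variable-size chunking driven by a GOP-level dependency score; the score, the threshold, and the size cap are hypothetical placeholders, not the actual algorithm of the proposed scheme, which is presented in Sec. 4.2.

    def make_chunks(gop_dependency, max_gops_per_chunk=64, threshold=0.5):
        """gop_dependency[i]: dependency score in [0, 1] between GOP i and GOP i + 1."""
        num_gops = len(gop_dependency) + 1
        chunks, start = [], 0
        for i in range(num_gops - 1):
            boundary = (gop_dependency[i] < threshold or
                        i + 1 - start >= max_gops_per_chunk)
            if boundary:
                chunks.append((start, i + 1))   # chunk covers GOPs [start, i]
                start = i + 1
        chunks.append((start, num_gops))
        return chunks

    # Example: a scene change between GOP 2 and GOP 3 forces a chunk boundary there.
    print(make_chunks([0.9, 0.8, 0.1, 0.7, 0.9]))   # [(0, 3), (3, 6)]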


In general, the cloud provides a scalable, responsive, and cost-effective solution for video

transcoding services. We note that existing proposals on video transcoding in the cloud are

all performing conventional video transcoding of a video on either a single virtual machine

or multiple virtual machines. The performance gains are mostly due to efficient use of cloud

resources or parallelism managed by a Map-Reduce model. None of these proposals considers

information that can be extracted from the video. In fact, an encoded video encapsulates

useful dependency information among GOPs, frames, slices or even macroblocks. In this

chapter, we propose a dependency-aware distributed video transcoding scheme.

4.1 Distributed Video Transcoding in the Cloud

Our proposed distributed video transcoding scheme is a cloud-based solution that exploits

the coding and prediction dependency in layered video coding to transcode a video satisfying

certain requirements. Fig. 4.1 illustrates the workflow of distributed video transcoding in

the cloud. Upon receiving a video streaming request, the streaming server instructs the

transcoding controller to load the requested source video from the video repository. The

controller segments the video into chunks and distributes video chunks to virtual instances in

the cloud using a parallel computing model such as Map-Reduce [39]. At last, the transcoded

video chunks are merged into a video sequence to be delivered to end users, and if desired,

stored in the video repository for future requests.
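As a simple illustration of this workflow (not the actual Map-Reduce deployment used in this thesis), the chunks produced by the controller can be transcoded in parallel and merged back in their original order:

    from concurrent.futures import ProcessPoolExecutor

    def transcode_chunk(chunk):
        # Placeholder for invoking the actual transcoder on one video chunk.
        return f"transcoded({chunk})"

    def transcode_video(chunks, workers=8):
        """Transcode chunks in parallel and return them in their original order."""
        with ProcessPoolExecutor(max_workers=workers) as pool:
            transcoded_chunks = list(pool.map(transcode_chunk, chunks))
        return transcoded_chunks     # handed to the video merger to form one sequence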

Towards improving the distributed video transcoding process, numerous models and algo-

rithms have been proposed to minimize transcoding delay, number of transcoding virtual ma-

chines needed, energy consumption, and transcoding cost in the cloud [32,136,149,165,167].

By intuition, encoding each GOP separately reduces encoding time, since the encoder does

not need to consider the information from other GOPs. But counter-intuitively, this ap-

proach leads to poor coding efficiency, i.e., more bits are needed to present the video at

a specific level of quality [161]. On the one hand, the larger the chunk size is, the more


Figure 4.1: Workflow of distributed video transcoding in the cloud. The streaming server forwards streaming requests to the transcoding controller, which loads the source video from the video repository, segments it into video chunks, and distributes the chunks to the transcoding servers; the video merger combines the transcoded chunks into the output video stream.

visual similarity is captured in a chunk. Hence, we expect more coding efficiency gain (i.e.

lower video bitrate) as we increase the chunk size. On the other hand, increasing the chunk

size normally increases the transcoding time. However, the trade-off between coding effi-

ciency and transcoding time depends on the visual properties of the to-be-transcoded video.

Therefore, an adaptive algorithm is required to determine the proper size of video chunks.

To select the proper chunk size, we must consider video properties such as similarity

among frames in consecutive GOPs rather than grouping a fixed number of GOPs into a

chunk. Visual features such as detail and motion activity can provide hints on the appropriate

chunk size for transcoding. However, it is computationally expensive to extract these features

and they require access to the raw video. In this chapter, we present a novel linear-time

approach for determining the appropriate chunk size. Based on this approach, we propose

a distributed transcoding scheme that segments a video into variable-size chunks according

to prediction dependencies mined during the encoding process. The idea of grouping related

frames was first suggested in [24] to eliminate network redundancy in video caching and to

avoid caching the same video multiple times. The authors defined “sample-based” chunking

as grouping all video samples between two consecutive IDR frames. This approach results in

video chunks with the same number of frames but varying sizes in bytes, which is essentially


different from this proposal.

4.1.1 The Necessity of Considering GOP Dependencies

We mentioned that the size of video chunks might significantly affect the performance of

the video transcoding system in terms of coding efficiency and transcoding time. Hence, we

need to wisely select the size of video chunks according to the visual properties of the source

encoded video1. To support this claim, we performed a set of video transcoding experiments

over the full-HD source video sequences that were discussed in Sec. 3.1. To better evaluate

the effect of the length of video segments, 1600 frames of each video were selected. If the length

of the original video sequence was less than 1600 frames, multiple copies of the video were

concatenated until a sufficient number of frames was available. We transcoded the source video

sequence from full-HD (1080p) to HD (720p) video frame format and used the layer settings

of two spatial, four temporal and two quality layers, i.e., DTQ=(1, 3, 1). To investigate

our claim, we divided each reference video sequence into video chunks that contain a fixed

number of GOPs, starting from one GOP per video chunk and multiplying by two up to 64

GOPs per video chunk. Results are presented in Fig. 4.2.

Due to the propagation of prediction information from past GOPs to future GOPs, we

essentially expect coding efficiency to increase with the number of GOPs in each video chunk.

From Fig. 4.2.a, it can be seen that while this general assumption is true, this observation

is much stronger for video sequences with less detail and motion activity, such as BB and

SF, and is less significant or negligible for video sequences with highly detailed and changing

scenery, such as PJ and RB. On the one hand, compared to long video chunks, say 64

GOPs, if we set the size of video chunks to a small number of GOPs, say 1 GOP, the size
of the transcoded video will be up to three times larger for video sequences such as BB, while the
computation time is decreased only by 25%. On the other hand, compared to short video

1 Our experiments revealed no significant change in the quality of the encoded videos. In fact, increasing the size of video chunks from 1 GOP to 64 GOPs increased the average Y-PSNR of the transcoded videos by 0.3 dB, where the average Y-PSNR for one GOP per video chunk was 37.4 dB.


[Figure 4.2 plots: (a) The effect of the size of video chunks on video bitrate; (b) the effect of the size of video chunks on transcoding time. X-axis: number of GOPs in video chunks (1 to 64); y-axes: adjusted video bitrate (0%–100%) and adjusted transcoding time (100%–160%) for the reference videos BB, SF, ED, PA, PJ, RH, and RB.]

Figure 4.2: The effect of increasing the size of video chunks from 1 GOP to 64 GOPs on the video bitrate and transcoding time. The numbers are adjusted according to the video chunks with a size of one GOP.

chunks, say 1 GOP, if we set the size of video chunks to a large number of GOPs, say 64

GOPs, then the transcoder transcodes videos such as RB up to 45% slower without any

bandwidth saving. Therefore, an adaptive algorithm is required to use large video chunks

for BB and small video chunks for RB.

While this observation can be explained by different amounts of detail and motion activity

from Table 3.1, according to our experiments such visual features can only be used as a

general rule of thumb. Furthermore, they are expensive to compute and require access to

the raw video sequence to be more precise. In contrast, the computationally light algorithm

that is suggested in Sec. 4.2 allows the transcoding controller to select the size of video

chunks with very small overhead and adaptively as the video transcoding task proceeds.


4.2 Dependency-Aware Distributed Video Transcoding

As discussed in Sec. 4.1, transcoding fixed-size video chunks leads to coding inefficiency.

We also observed that a group of n GOPs sharing great visual similarity can be encoded

significantly faster than a group of n relatively independent GOPs. The visual similarity

among consecutive GOPs in a raw video cannot be measured easily. Nonetheless, since the

visual similarity drives the prediction decision when encoding a raw video, the prediction

dependency among GOPs found in a coded video reflects the visual similarity and greatly

determines the coding complexity. The GOP dependency may be calculated when producing

a coded version of a video from a raw video. In a cloud-based distributed video transcoding

system, as illustrated in Fig. 4.1, the transcoding controller can segment the to-be-transcoded

video into video chunks according to dependency among GOPs and then distribute the

variable-size video chunks to virtual instances in the cloud for fast transcoding. In this

section, we propose a GOP-dependency model that exploits the visual similarities (also

referred to as the coding dependency) among GOPs in Sec. 4.2.1. Based on this model, we

propose a dependency-aware video transcoding scheme that clusters GOPs into video chunks

according to their inter-dependency and distributes the chunks in the cloud for transcoding

in Sec. 4.2.2.

4.2.1 GOP-Dependency Graph

From a deep inspection of the SVC encoder, we note that there is a correlation between the

prediction decisions made by the encoder and the visual similarity of the encoded pictures.

Hence, the GOP-dependency model may be derived based on the layered structure in an SVC

video. However, recently it has been shown that the layering information is not sufficient

to characterize dependency in a video sequence [161]. For example, a pair of frames from


two different spatial layers may have stronger dependency than a pair of frames within the

same spatial layer, or vice versa. Thus, segmenting and transcoding video chunks based

on dependency among layers may still lead to coding inefficiency. For this reason, it has

been suggested to utilize dependency among macroblocks and sub-macroblocks (the basic

encoding units in the H.264 standard family) to accurately model the dependency in a video

[161]. Inspired by a deep inspection of the SVC encoder and the observations reported in [161], we

build a GOP-dependency graph derived from the macroblock-level prediction dependency

among consecutive GOPs in two steps.

Step 1: Generating the macroblock dependency graph

To generate the macroblock dependency graph for two consecutive GOPs, we propose a

dependency graph Gm, where Gm is a weighted directed acyclic graph (DAG) Gm = (Vm, Am).

Each node mi ∈ Vm represents a macroblock belonging to the key frames (frames 0 and 1

in Fig. 4.3) or a non-key frame that depends on the key frame in the second GOP (frames

2 and 4 in Fig. 4.3). Hereafter, we refer to this set of macroblocks as M. Since a GOP is

the smallest transcoding unit assigned to a transcoding server, there is no need to capture

intra-GOP dependency in Gm.

Each directed arc ai,j ∈ Am indicates a prediction dependency between macroblocks mi

and mj, where the direction is from the reference macroblock towards the dependent mac-

roblock. Next, to generate the macroblock-dependency graph Gm, we extract the dependency

among all pairs of macroblocks mi in frame fy and mj in frame fz. This can be done when

encoding a raw video sequence for the video repository. When the SVC encoder visits a new

macroblock that belongs to M, a new node is added to the dependency graph. For each

prediction decision, if the reference macroblock is a member of M, an arc is added to the

graph from the reference macroblock to the dependent macroblock. The resulting graph Gm

is a DAG, as shown in Fig. 4.3, since no two macroblocks can either directly or indirectly

mutually depend on each other.


Figure 4.3: Top: Prediction dependency links between two consecutive GOPs in the base layer (layer S0) of the SVC video from Fig. 2.6. Bottom: Macroblock dependency graph modelling inter-GOP prediction.

Since the degree of dependency between a pair of macroblocks may vary depending on

the prediction method used, we associate a weight with each dependency arc using the error
introduced by the prediction decision, also referred to as distortion. The prediction distortion

is calculated by the encoder when making each prediction decision. We then normalize the

distortions to be in the range of [0, 1] for each predicted frame as follows:

‖d_{i,j}‖ = (d_{i,j} − min_{m_k ∈ f_z} d_{k,j}) / (max_{m_k ∈ f_z} d_{k,j} − min_{m_k ∈ f_z} d_{k,j})        (4.1)

where di,j is the distortion introduced when predicting mj ∈ fz from mi ∈ fy. Next, we

calculate the weight of each link as follows:

w_{a^m_{i,j}} = 1 − ‖d_{i,j}‖        (4.2)

where w_{a^m_{i,j}} is the weight of the dependency link between mi ∈ fy and mj ∈ fz, and ‖d_{i,j}‖ is the

normalized distortion from Eq. 4.1. The weight of a link is large if the prediction distortion

is small, i.e., a strong dependency exists between two macroblocks that are visually very


similar. The weight of a link is small if the prediction distortion is large, i.e., a weak

dependency exists between two macroblocks that are visually very different.
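As a small illustration of Eqs. 4.1 and 4.2, the following Python sketch (function and variable names are illustrative and not part of the JSVM implementation) normalizes the distortions recorded for one predicted frame and converts each into a link weight:

    def link_weights(distortions):
        """Min-max normalize the prediction distortions recorded for one predicted
        frame (Eq. 4.1) and convert each into an arc weight (Eq. 4.2), so that a
        low distortion (strong visual similarity) yields a weight close to 1."""
        lo, hi = min(distortions), max(distortions)
        span = (hi - lo) or 1.0          # guard against all-equal distortions
        return [1.0 - (d - lo) / span for d in distortions]

    # Example: distortions 10, 40, 70 map to weights 1.0, 0.5, 0.0.
    print(link_weights([10, 40, 70]))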

To achieve a high compression rate and high video quality at the same time, the SVC encoder

is not limited to the boundaries of the reference macroblocks. The dependency among

macroblocks may be categorized into four cases, as illustrated in Fig. 4.4. The weight of

each dependency relation in each case may be calculated as follows:

• Using a full macroblock as a reference (Fig. 4.4(a)): In the simplest form, the prediction of

a macroblock is based on another macroblock. In this case, one dependency arc is added

to the graph and the weight of the arc is calculated using Eqn. 4.2.

• Using a macroblock created from portions of 2 or 4 macroblocks as a reference (Fig. 4.4(b)):

The prediction modules may use a 16 × 16 area located on the borders of two or four

macroblocks as a reference macroblock. In this case, we add a dependency arc from the

predicted macroblock to each of the macroblocks serving as a partial reference. The weight

of each dependency arc is the prorated weight of the reference macroblock:

w^p_{a^m_{i,j}} = (1 − ‖d_{i,j}‖) × s_{m_{i,j}} / 256        (4.3)

where ‖ di,j ‖ is the normalized distortion introduced by the respective prediction, and smi,j

is the number of pixels (out of the 256 pixels) in the reference macroblock that is used to

predict the dependent macroblock.

• Using a submacroblock as reference (after proper upsampling) (Fig. 4.4(c)): A submac-

roblock may be upsampled to serve as a whole reference macroblock. If the reference sub-

macroblock belongs to a macroblock, there is only one arc from the predicted macroblock

to the macroblock containing the reference submacroblock, as illustrated in Fig. 4.4(c).

In this case, the weight of the arc is the same as the case that a full macroblock is used


as a reference. Thus, the weight of the arc is calculated as in Eqn. 4.2. If the reference

submacroblock is located across boundaries of 2 or 4 macroblocks, similar to the case

illustrated in Fig. 4.4(b), we add a dependency arc from the predicted macroblock to each

of the macroblocks serving as a partial reference. The weight of each dependency arc is

calculated as in Eqn. 4.3, except that the constant in the denominator is 64 (representing

the smaller size of 8× 8 co-located submacroblock).

• Using multiple macroblocks as reference (Fig. 4.4(d)): A predicted macroblock may use

multiple reference macroblocks and combine the result by, for example, taking an average

over the predicted samples. The importance of each reference macroblock depends on the

availability of the reference macroblocks. The quality of the reconstructed macroblock

improves as more reference macroblocks become available. In this case, we add one de-

pendency arc from the predicted macroblock to each of the reference macroblocks. The

weight of each dependency arc is calculated as follows:

w^p_{a^m_{i,j}} = (1 − ‖d_{i,j}‖) / N_{ref}        (4.4)

where Nref is the number of reference macroblocks. Since the availability of each reference

macroblock is not known before delivering the video to end users, all reference blocks are

equally important. Thus, each arc has an equal share of the full weight.
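The four cases can be summarized in a short sketch (Python; the function and its argument names are illustrative, and the normalized distortion is assumed to come from Eq. 4.1):

    def arc_weight(norm_dist, case, overlap_pixels=0, n_refs=1):
        """Weight of one dependency arc between a predicted macroblock and a
        reference macroblock, following Eqs. 4.2-4.4.

        norm_dist      : normalized distortion in [0, 1] from Eq. 4.1
        case           : 'full'        - a whole macroblock (or an upsampled
                                         submacroblock owned by one macroblock)
                         'partial_mb'  - a 16x16 reference area spanning 2 or 4 MBs
                         'partial_sub' - an 8x8 submacroblock spanning MB borders
                         'multi_ref'   - several reference macroblocks are combined
        overlap_pixels : pixels of the reference area lying inside this macroblock
        n_refs         : number of reference macroblocks in the 'multi_ref' case
        """
        base = 1.0 - norm_dist                     # Eq. 4.2
        if case == 'full':
            return base
        if case == 'partial_mb':
            return base * overlap_pixels / 256.0   # Eq. 4.3 (16x16 = 256 pixels)
        if case == 'partial_sub':
            return base * overlap_pixels / 64.0    # Eq. 4.3 with an 8x8 denominator
        if case == 'multi_ref':
            return base / n_refs                   # Eq. 4.4: equal share per reference
        raise ValueError(f"unknown case: {case}")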

Step 2: Creating the GOP-dependency graph

Fig. 4.5 illustrates the creation of the GOP-dependency graph. First, we begin with

the Gm that is prepared in the previous step to model inter-GOP prediction dependency,

as exemplified by Fig. 4.5(a). We then convert the macroblock-dependency graph Gm to a

frame-dependency graph Gf = (Vf , Af ), as shown in Fig. 4.5(b). To do so, we merge nodes in

Gm representing macroblocks from the same frame into a single node to simplify the graph.

Correspondingly, we merge the dependency arcs in Am that have a common start and end


Figure 4.4: Different types of dependencies between macroblocks in SVC. (a) Using a full macroblock as a reference, (b) Using a macroblock created from portions of 2 or 4 macroblocks as a reference, (c) Using a submacroblock as a reference (after proper upsampling), and (d) Using multiple macroblocks as references.

frame into one dependency arc in Af , where the weight of each combined arc is the weighted

average of the weights of all individual arcs being merged, i.e.,

w_{a^f_{y,z}} = ∑_{∀ a^m_{i,j} ∈ C} w_{a^m_{i,j}} / N_{gop}        (4.5)

where wafy,z is the weight of an arc in Af from frame y to frame z, wami,j is the weight of an

arc from the macroblock-dependency graph Gm, C is the set of arcs in Am just being merged,

and Ngop is the total number of 16× 16 macroblocks in each frame.

Next, we convert Gf to a directed GOP-dependency graph ~Gg = (Vg, ~Ag), as shown in

Fig. 4.5(c), by merging nodes representing frames belonging to the same GOP into one node.

In the H.264 standard family, each key frame in GOPk+1 depends on the key frame in the

previous GOPk, and some non-key frames in GOPk depend on the key frame of GOPk+1, as

shown in Fig. 4.3. Thus, in Gf , there is always one dependency arc from the key frame in

GOPk to the key frame in GOPk+1, and a number of dependency arcs from the key frame

of GOPk+1 to some non-key frames in GOPk. The weight of the arc from GOPk to GOPk+1


Figure 4.5: Converting a macroblock-dependency graph Gm (a) to a frame-dependency graph Gf (b), to a directed GOP-dependency graph ~Gg (c), to an undirected GOP-dependency graph Gg (d), and finally to a GOP-distance graph (e).

in the GOP-dependency graph Gg is the weight of the dependency arc from the key frame

in GOPk to the key frame in GOPk+1 in the frame-dependency graph Gf, which is 0.6 in the

example illustrated in Fig. 4.5. The weight of the arc from GOPk+1 to GOPk is calculated as the

weighted average of the weights of all arcs from the key frame in GOPk+1 to non-key frames

in GOPk as in Eq. 4.6:

w_{a^{~g}_{k+1,k}} = ∑ ((S − I(f_j)) / (S − 1)) × w_{a^f_{i,j}}        (4.6)

where w_{a^{~g}_{k+1,k}} is the weight of the backward dependency arc from GOPk+1 to GOPk, w_{a^f_{i,j}} is the weight of a dependency arc from fi ∈ GOPk+1 to fj ∈ GOPk in the frame-dependency graph Gf, S is the number of frames in each GOP (4 in the example illustrated in Fig. 4.5), and I(fj) is a function that returns the index of fj inside GOPk, starting from 0 for the key frame.

In Fig. 4.5(c), the weight of the arc from GOP2 to GOP1 is (2/3) × 0.3 + (1/3) × 0.4 = 0.33. The weight is

inversely proportional to the distance from the reference key frame, and proportional to the


number of frames that will be affected by the quality of reference key frame due to temporal

prediction. In general, frames that appear earlier in a GOP (in coding order as shown in

Fig. 4.3) are used as reference frames by more frames than later frames are. For example,

we gave more weight to the dependency link from frame 1 to frame 2 because frames 3 and

4 both use frame 2 as their reference frame, as shown in Fig. 4.3.

Next, the directed GOP-dependency graph ~Gg is further simplified to an undirected GOP-

dependency graph Gg, as shown in Fig. 4.5(d), by merging the two directed arcs into one

undirected arc. The weight of the undirected arc is calculated as follows:

w_{a^g_{k,k+1}} = w_{a^{~g}_{k,k+1}} + (1 − w_{a^{~g}_{k,k+1}}) × w_{a^{~g}_{k+1,k}}        (4.7)

The rationale behind using one minus the weight of the forward link as a coefficient

for the weight of the backward link is that if the forward link is very strong, then the

information spread back from the key picture in GOPk+1 is very similar to that of the key

frame in GOPk; hence, GOPk+1 does not provide much new information. Since w_{a^{~g}_{k,k+1}} and w_{a^{~g}_{k+1,k}} are normalized, the result of this function is always between 0 and 1 and no further

normalization is required. In Fig. 4.5(d), the weight of the undirected arc between GOP1
and GOP2 is 0.6 + 0.4 × 0.33 = 0.73. Finally, using Eq. 4.8, the dependency between GOPs

can be converted to a distance measure for the GOP clustering algorithm. This step will be

detailed in Sec. 4.2.2.
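To make the conversion concrete, the short sketch below reproduces the numbers of the example in Fig. 4.5 (Python; S = 4 frames per GOP, and the distance formula of Eq. 4.8 is anticipated here and introduced formally in Sec. 4.2.2):

    S = 4  # frames per GOP in the example of Fig. 4.5

    def backward_weight(arcs, s=S):
        """Eq. 4.6: arcs go from the key frame of GOP_{k+1} to non-key frames of
        GOP_k; each arc is given as (frame index inside GOP_k, frame-arc weight)."""
        return sum((s - idx) / (s - 1) * w for idx, w in arcs)

    w_fwd = 0.6                                    # key frame of GOP1 -> key frame of GOP2
    w_bwd = backward_weight([(2, 0.3), (3, 0.4)])  # = 2/3 * 0.3 + 1/3 * 0.4 = 0.33
    w_und = w_fwd + (1 - w_fwd) * w_bwd            # Eq. 4.7 -> 0.73
    dist = 1 / w_und - 1                           # Eq. 4.8 -> about 0.36, as in Fig. 4.5(e)
    print(round(w_bwd, 2), round(w_und, 2), round(dist, 2))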

4.2.2 Dependency-Aware Distributed Video Transcoding in the Cloud

As described in Sec. 4.1, the transcoding controller segments the to-be-transcoded video into

chunks and distributes chunks to virtual machines in the cloud to speed up the transcod-

ing process. In this section, we propose a new cloud-based distributed video transcoding

scheme that uses the GOP-dependency graph Gg to perform transcoding on variable-size


video chunks. In other words, the new scheme assigns GOPs sharing great visual similarity

to the same machine for better coding efficiency and faster transcoding. We first present the

clustering algorithm for grouping GOPs to variable-size video chunks based on the GOP-

dependency graph Gg, and then present the algorithm that dispatches video chunks to virtual

machines for transcoding.

Preparing variable-size video chunks

In order to benefit from the visual similarity among pictures in a video sequence, we

propose to segment the video into variable-size video chunks so that GOPs are clustered ac-

cording to prediction dependency (hence, the visual similarity). Many clustering algorithms

have been proposed to group data into a fixed number of clusters [150] or any number of

clusters as needed [17, 45, 164]. Since the number of desired video chunks in the proposed

adaptive model is not known a priori when transcoding a video in real time, we adopt OP-

TICS (Ordering Points To Identify the Clustering Structure) [17] to cluster nodes in the

GOP-dependency graph Gg into as many video chunks as necessary.

Since OPTICS clusters a stream of data points according to distances between each

pair of points, we must convert the GOP-dependency graph Gg to a GOP-distance graph

Gd by converting the dependency weight of each arc to a distance measure between two

consecutive GOPs. Since highly dependent GOPs should be transcoded together, they should

be clustered into one video chunk. Thus, the distance between two consecutive GOPs dk,k+1

should be inversely proportional to the degree of dependency (the arc weight wagk,k+1 in the

GOP-dependency graph Gg), as calculated in Eqn. 4.8.

d_{k,k+1} = 1 / w_{a^g_{k,k+1}} − 1        (4.8)

We subtract one from the inverse of the weight of a GOP-dependency arc to make the

distance greater than or equal to zero. According to Eqn. 4.2–4.7, if two consecutive GOPs

are very similar, the weight of the corresponding dependency arc in Gg is close to 1 (due


to the low prediction distortion), which makes the distance between these two GOPs in Gd

approach zero according to Eqn. 4.8.

OPTICS has two parameters: ε – the maximum distance among nodes in a cluster, and

MinPts – the minimum number of nodes in a cluster. We set MinPts to one by default,

meaning that if a GOP has no strong visual similarity with any other GOP, then that GOP

can be processed alone as a video chunk. The value of ε is set to 5 experimentally. The

computational complexity of OPTICS depends on the complexity of the ε-neighborhood

query function which is invoked exactly once for each GOP. Since the GOP-dependency

graph Gg is a chain of GOPs, the query function is invoked at most n times, where n is

the number of GOPs in the video sequence and the query function adds the distances of

the new GOPs together until the accumulated distance from the first GOP of the current

cluster exceeds the threshold ε. Since the computational complexity of the query function

(one addition and one comparison) is constant, the complexity of this algorithm is O(n).
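A minimal sketch of this linear-time clustering is given below (Python; the actual system runs OPTICS with MinPts = 1 and the experimentally chosen ε = 5, but on a chain of GOPs the query behaviour reduces to the accumulation rule described above, so this is a simplified illustration rather than an OPTICS implementation):

    def cluster_gop_chain(distances, eps=5.0):
        """Group a chain of GOPs into variable-size video chunks.

        distances[k] is the GOP distance d_{k,k+1} from Eq. 4.8. A new chunk is
        started once the distance accumulated from the first GOP of the current
        chunk would exceed eps; each GOP is visited once, so the cost is O(n)."""
        chunk_sizes, size, acc = [], 1, 0.0
        for d in distances:
            if acc + d > eps:        # the next GOP is too dissimilar: close the chunk
                chunk_sizes.append(size)
                size, acc = 1, 0.0
            else:
                size += 1
                acc += d
        chunk_sizes.append(size)
        return chunk_sizes

    # Example: six GOPs with one large gap are split into two chunks of three GOPs.
    print(cluster_gop_chain([0.4, 0.4, 6.0, 0.2, 0.1]))   # -> [3, 3]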

Dispatching video chunks for distributed transcoding in the cloud

After segmenting the video into variable-size video chunks according to dependency

among GOPs, the transcoding controller dispatches video chunks to virtual machines in

the cloud for transcoding. Though it is an NP-hard problem to optimize the dispatching

algorithm for transcoding delay, number of virtual machines, or the energy consumed in the

cloud, heuristic solutions have been proposed [21]. For real-time streaming, video chunks

must be transcoded prior to their playback deadline. Thus, a simple FIFO dispatching algo-

rithm is suitable for our transcoding scheme since it preserves the time order of video chunks.

In other words, the transcoding controller dispatches the first job in the FIFO queue as soon

as a virtual machine becomes available.


4.3 Performance Evaluation

In order to evaluate the proposed dependency-aware distributed transcoding scheme, we

implement a prototype of the transcoding system in a private cloud with 10 computing

units. Each computing unit is equipped with 16 Intel Xeon E5640 CPU cores at 2.67

GHz. One machine is dedicated to serve as the transcoding controller, and the remaining

machines serve as transcoding servers. We used the most recent release of the reference

software package for scalable video coding, i.e. JSVM 9.19.15 [1]. As shown in Fig. 4.6

we modified the SVC encoder in JSVM by wiretapping a new module into the main video

coding modules of the encoder to generate the macroblock-dependency graph when encoding

a video, as described in Step 1 of generating the GOP-dependency graph in Sec. 4.2.1. The

macroblock-dependency graph is then converted to the GOP-dependency graph, as described

in Step 2 of generating the GOP-dependency graph in Sec. 4.2.1. The conversion can also

be performed in parallel to the encoding process on a different processor, since the encoder

produces consecutive GOPs. The GOP distances are calculated as described in Sec. 4.2.1,

and the results are stored as a list of n−1 distance measures in the video repository, where n

is the number of GOPs in the video sequence. Finally, the transcoding controller clusters the

GOPs into variable-size video chunks using the OPTICS algorithm according to the GOP

distances. The proposed algorithm needs to run only once for each raw video sequence prior

to being encoded and stored on the video repository. Once the distances are calculated and

stored, they can be used to serve any transcoding request received by the cloud transcoding

system.

We used the same set of seven full-HD raw video sequences as input to the transcoding

system as discussed in Sec. 3.1. As reported in Table 3.1, these videos are selected from

different genres. They exhibit diverse values of detail [99] and motion activity [67] visual

features. We encoded each video sequence using the modified SVC encoder with layering

configuration DTQ=(1, 3, 1), i.e., the SVC encoded video contains two dyadic spatial layers


[Figure 4.6 diagram: the JSVM SVC encoder pipeline (GOP, frame, layer, slice, MB, and NAL unit encoders) converting a raw video into an encoded video sequence, with the added Dependency Graph Generator & GOP Distance Calculator writing the GOP distances to the video repository.]

Figure 4.6: The modified JSVM encoder software. Components in gray are modified JSVM components. Components in white are added to JSVM.

(1920× 1088 and 960× 544 pixels), four temporal layers (GOP = 8) and two quality layers

(QP = 36 and QP = 30). The encoded SVC videos are stored in the video repository along

with the respective GOP distances. By default, the transcoding request requests a video

with the same layer configuration but in 720p frame resolution. The visual properties of the

test video sequences are repeated from Table 3.1 in Table 4.1.

Table 4.1: Reference videos and their visual properties.

Content                  Genre          Detail   Motion Activity
Big Buck Bunny (BB)      Animation      3.52     1.63
Elephants Dream (ED)     Animation      3.73     2.39
Pedestrian Area (PA)     Scene          3.15     4.42
Rush Hour (RH)           Scene          3.17     3.12
Park Joy (PJ)            Scene/Nature   4.24     3.73
Riverbed (RB)            Nature         4.72     4.13
Sunflower (SF)           Nature         4.04     2.57

4.3.1 The Overhead of the Transcoding Scheme

The proposed cloud-based distributed transcoding scheme introduces computational and

storage overhead in different stages of the process. We present the results for two video

sequences with the highest and lowest computational and storage overhead, i.e., BB and

RB. At first, the macroblock-dependency graph Gm is created by capturing macroblock


dependency when encoding a raw video. Then Gm is converted to the GOP-distance graph

Gd. This overhead is a one-time overhead since the GOP-distance graph once created can be

used for any transcoding request. As shown in Table 4.2, the highest computational overhead

(the CPU time) is less than 2% of the encoding time for reference video sequences. Next,

the GOP-distance graph is stored in the video repository along with the GOPs to serve any

transcoding request. Since the GOP-distance graph is simply a chain of n nodes representing

a sequence of GOPs, we only need to store the distance measures of the n − 1 edges in

the graph. To store the distance measures with double precision, the storage overhead is

(n− 1)× 8 bytes, which is very small compared to the size of the encoded videos. According

to Table 4.2, this storage overhead is less than 0.04% of the space required to store a reference

encoded video. Finally, a delay is introduced by the OPTICS algorithm when clustering the

GOPs into variable-size video chunks in the transcoding controller. Compared to the time

required by the transcoding controller to retrieve the video from video repository and decode

the video prior to dispatching transcoding jobs to the virtual machines, this overhead was less

than 0.02% for all video sequences. Compared to the computation and storage required by

the encoding and transcoding processes, the overhead introduced by the proposed distributed

transcoding scheme is almost negligible.

Table 4.2: Overhead of the proposed algorithm.

                                         BB       RB
Computational overhead - preparing Gd    1.64%    0.19%
Storage overhead - storing Gd            0.035%   0.003%
Delay in transcoding controller          0.016%   0.008%

4.3.2 Bitrate and Transcoding Time

To evaluate the performance of the proposed dependency-aware distributed video transcoding

scheme, we compare the proposed scheme using variable-size video chunks with a conven-

tional video transcoding scheme using fixed-size video chunks. For the conventional video


transcoding scheme, we vary the chunk size from 1 GOP to 64 GOPs. We measure the bitrate

(Kbps) and the average transcoding time (second) for each reference video. Furthermore,

we also compare the proposed scheme (results are labeled with keyword ‘Variable’) with a

conventional scheme whose chunk size is the average size (s) of the variable-size video chunks

produced by the proposed scheme (results are labeled with keyword ‘Average’). Since video

chunk size must be a multiple of GOPs, we set the size of chunk i to ⌊s × (i + 1)⌋ − ⌊s × i⌋ so that
the overall average is still s and no GOP is broken into two chunks. For brevity, we
only present the results for fixed video chunk sizes of 1, 8 and 64 GOPs here.
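For illustration, the chunk sizes of this 'Average' baseline can be generated as in the following sketch (Python; the function name is hypothetical, and s is the per-video average chunk size later reported in Table 4.3):

    import math

    def average_baseline_sizes(s, num_chunks):
        """Chunk sizes for the 'Average' baseline: the i-th chunk holds
        floor(s*(i+1)) - floor(s*i) GOPs, so every size is a whole number of GOPs,
        no GOP is split across chunks, and the overall average stays s."""
        return [math.floor(s * (i + 1)) - math.floor(s * i) for i in range(num_chunks)]

    # Example: s = 2.8 over five chunks -> [2, 3, 3, 3, 3], i.e. 14 GOPs averaging 2.8.
    print(average_baseline_sizes(2.8, 5))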

Average video chunk size and bitrate: According to Table 4.3, the average size of

video chunk prepared by the proposed scheme varies significantly from one video to the

next. This implies that the proposed scheme chooses different video chunk sizes according

to the video context. For example, for the BB video sequence, with great visual similarity

among consecutive GOPs, the proposed scheme produces video chunks enclosing more GOPs

(19.7 GOPs on average). In contrast, for RB video sequence with more details and changing

scenery, the average chunk size is 1.6 GOPs.

Table 4.3: Comparing bitrate and average chunk size.

                              BB     ED     PA     PJ     RB     RH     SF
Average chunk size (GOPs) using the proposed scheme
                              19.7   11.9   5.1    2.8    1.6    5.4    8.3
Video bitrate (Kbps) using the proposed scheme
  Variable                    564    1168   1366   8489   6865   1081   776
  Average                     599    1207   1443   9041   6881   1135   857
Video bitrate (Kbps) using fixed-size video chunks
  1                           1720   1801   1824   9977   6883   1346   1808
  8                           678    1222   1394   8588   6857   1104   862
  64                          546    1156   1339   8439   6854   1075   747

The proposed scheme effectively reduces the video bitrate. From Table 4.3, we observe

that the bitrate of the proposed scheme closely approximates the bitrate of the conventional

scheme with chunk size of 64 GOPs (the best-case scenario). We also observe that the

bitrate of the proposed scheme is always less than the bitrate of the conventional scheme


with the average chunk size (e.g., 10.8% reduction in bitrate for the SF video). Hence, in

order to match the bitrate produced by the proposed scheme, the conventional scheme must

work with chunks much larger than the average chunk size found in the proposed scheme.

Furthermore, our analysis on the quality of the transcoded videos (YPSNR in dB) indicates

that the proposed scheme not only maintains a good video quality, but also outperforms all

fixed size video chunks for videos with high detail and motion activity, such as RB.

Transcoding time: Table 4.4 compares the transcoding time needed by the proposed

scheme and the conventional scheme with different chunk sizes. For all reference videos, the

transcoding times needed by the proposed scheme are always between the time needed by the

conventional scheme with chunk size of 1 and 8. The proposed scheme also transcodes much

faster than the case with average chunk size (e.g., 24.4% faster for BB video sequence). This

confirms that the proposed transcoding scheme provides high coding efficiency with reduced

transcoding time. For example, for video BB, setting the video chunk size to 1 GOP leads

to a 1720 Kbps video bitrate and a 5.55-second video transcoding time. If the video chunk size is

set to 64 GOPs, the video bitrate decreases to 546 Kbps but the transcoding time increases

by 53%. Nevertheless, using the proposed adaptive scheme leads to a 564 Kbps video bitrate,

which is very close to that of 1 GOP video chunks, while the transcoding time is increased

only by 21% compared to 53% for video chunks of 64 GOPs.

Table 4.4: Comparing transcoding time.

                              BB     ED     PA     PJ      RB      RH     SF
Transcoding time (s) using the proposed scheme
  Variable                    6.75   7.80   7.80   8.73    10.24   7.00   7.62
  Average                     8.39   8.86   9.21   9.39    11.46   8.31   8.78
Transcoding time (s) using fixed-size video chunks
  1                           5.55   5.84   6.42   6.98    9.82    5.87   5.86
  8                           8.20   8.82   9.63   10.53   14.46   8.85   8.79
  64                          8.47   9.37   9.93   11.00   14.85   9.09   8.98


4.4 Summary

In this chapter, we proposed a novel distributed video transcoding scheme. We note that

the existing proposals for cloud-assisted video transcoding, as listed in Sec. 2.2.2, treat the

encoded video no different from a raw video. A fixed number of consecutive frames or group

of pictures are grouped into a video chunk, the chunks are transcoded in parallel, and the

transcoded video chunks are then merged into a single video sequence to be delivered to

end users. In contrast, in this research we proposed a dependency-aware distributed

video transcoding scheme that uses variable-size video chunks. The proposed model takes

advantage of visual similarity among macroblocks and GOPs in a video sequence.

As pioneering work in this research direction, we proposed an algorithm to extract the

dependency among macroblocks in an encoded video, based on which we determined the de-

pendency between successive GOPs. GOPs are then clustered according to their dependency

to create variable-size video chunks so that visually similar GOPs are put in one chunk. Our

experiments show that the proposed scheme improves resource efficiency by decreasing the

video bitrate and the computational resource consumption on the cloud. It also increases

the visual quality of the transcoded video for more complex video sequences. While the

proposed model was evaluated using layered video coding standard SVC, the same principle

applies to other video coding standards such as H.264/AVC or H.265/HEVC.

The suggested prediction dependency model between GOPs in an encoded video sequence

is a useful concept that can also be used in other multimedia streaming problems. In the

following chapter, this concept is re-engineered and a dependency model that considers the

fine grain dependencies between macroblocks and submacroblocks is presented. Furthermore,

a novel video packet importance model is built upon the proposed dependency model, and

the result is applied to the unequal error protection problem. This problem is part of the

service delivery phase in the life cycle of a video streaming episode, as illustrated in Fig. 1.1.

We discuss this further in the following chapter.


Chapter 5

Video Transmission in Wireless Networks

In Chapter 1, the life cycle of a video transmission session is illustrated. Omitting the

details, such a life cycle starts from the raw video capture by the IP camera and concludes

with the video playback at the end user device. After the video is prepared by the video

streaming server, usually it must traverse through multiple transmission media to reach the

end user device. The current infrastructure of large IP networks, such as the Internet or
legacy IPTV networks, increasingly uses high-capacity cable connections for the entire

backhaul portion of the network. Nowadays, fiber-optic cables are used as the transmission

medium for the intermediate links between the core network and the small subnetworks at

the edge of the network. Furthermore, due to the increasing usage of FTTx architecture

in IP networks, the fiber-optic cables are also used as the transmission medium for the last

mile telecommunications. The most important benefits of such cable connections are the

high data transmission rate and the very low amount of physical noise leading to negligible

packet loss rate (unless the network is partially or fully congested). However, the lack of

support for mobility makes cable connections ineffectual for mobile communications.

For mobile communication, including mobile video streaming, the last mile of the telecom-

munication network uses wireless communication channels. The wireless connection can be

maintained using a mobile telecommunication technology (such as 3G or 4G networks) or

offloaded to a wireless LAN (such as WiFi networks)1. In contrast to cable transmission
media, wireless channels are intrinsically prone to interference and noise, resulting
in fluctuations in channel capacity. These characteristics of wireless channels impose substantial

challenges for video streaming.

1Depending on the architecture, a wireless telecommunication network may use wireless data connections in part of the backhaul portion of the network besides the last mile.


In this chapter, we first look deeper into the coding and prediction mechanism of the state-

of-the-art layered video coding standard, i.e., SVC. Next, toward smarter protection of video

packets over noisy communication channels and better quality of the transmitted video, we

exploit the prediction structure to propose a novel coding and dependency-aware unequal

error protection algorithm. The proposed algorithm calculates the importance of different

video packets and associates protection to each video packet, respectively. Experimental

results show that the proposed algorithm outperforms the state-of-the-art unequal error pro-

tection algorithms in terms of the quality of the transmitted video. Finally, we complete

the proposed UEP model by extending the UEP problem from unicast scenario to multi-

cast scenario, in which the full potential of layered video coding is utilized by allowing the

transmission network to multicast one copy of the layered video for groups of heterogeneous

mobile devices. We specifically propose a new technique to dynamically adjust and combine

the protection FEC packets for reference and dependent video layers for video multicast in

mobile communication networks.

5.1 Coding and Prediction in SVC

Before proposing the coding and dependency-aware UEP for layered video coding, we need

to have a deep understanding of layered video coding by inspecting a real layered video codec

standard. Moreover, while the concept of coding and dependency-aware UEP remains the

same, as will be shown in Sec. 5.2.1, modeling coding dependency and calculating importance

of video packets must be tailored for individual codec standards. In this section, we use SVC

[112] – the state-of-the-art layered video coding standard. The general design of SVC was

covered in Chapter 3. Therefore, we continue with the coding and prediction in SVC.

Fig. 5.1 schematically illustrates the prediction in SVC in a tree structure. In SVC,

enhancement macroblocks are created based on either temporal or spatial prediction. The

quality enhancement information may be included if it is available. Temporal prediction is


inherited from the AVC standard, in which the type of the predicted picture determines the

possible prediction methods. For example, if the predicted picture is an I-frame, then the

slices in this picture are I-slices and the macroblocks within each slice will be predicted from

other macroblocks of the same slice. If the predicted picture is a P-frame, the slices in this

picture might be I- or P-slices, at the discretion of the encoder. Similarly if the predicted

picture is a B-frame, the slices might be I-, P- or B-slices, again at the discretion of the

encoder. When encoding P- and B-slices, the intra-picture prediction is augmented with

unidirectional and bidirectional inter-picture motion compensated predictions. Moreover, the

encoder can also apply special predictions (modules enclosed in boxes with dashed borders

in Fig. 5.1) in the order presented in Fig. 5.1 to derive motion vectors and reference picture

lists from the reference macroblocks.

[Figure 5.1 diagram: prediction tree for an enhancement macroblock with temporal, spatial, and quality prediction branches; the individual modules (intra-picture prediction, inter-picture motion-compensated prediction, inter-picture motion vector and reference index prediction, inter-layer intra-prediction, inter-layer motion-compensated prediction, inter-layer motion vector, reference index, and residual signal prediction, and residual texture prediction) for I-, P-, and B-slices are described in the text.]

Figure 5.1: Prediction tree of the scalable video coding standard. The blocks with dashed lines may or may not exist at the discretion of the encoder.


The general prediction modules available for spatial layering are similar to those for

temporal layering, as illustrated in Fig. 5.1. Inter-layer prediction is used to predict the

higher resolution spatial layer from the lower resolution layers. If a slice is an I-slice, inter-

layer intra-prediction is used to create enhancement macroblocks. If the slice is a P- or

B-slice, inter-layer motion-compensated prediction can be used too. SVC provides three

additional inter-layer prediction modules to take advantage of similarities among the spatial

layers, namely, motion vector prediction, reference index prediction, and residual signal

prediction.

As shown in Fig. 5.1, in a dyadic setting of spatial layering, each 16 × 16 macroblock

of a dependent layer can be estimated from an 8 × 8 co-located submacroblock from the

reference layer using upsampling operations. If the co-located submacroblock in the refer-

ence layer is part of an intra-coded macroblock, the enhancement macroblock is obtained by

inter-layer intra-prediction. If the co-located submacroblock is part of an inter-picture coded

macroblock, then inter-layer motion-prediction can be used. In this case, the enhancement

layer contains only a residual signal, and the enhancement macroblock is constructed by

upsampling the reference submacroblock and scaling its motion vector components by 2.

Furthermore, the upsampled macroblock in a dependent layer may extract the reference pic-

ture list and motion vectors from the reference submacroblock. Finally, as shown in Fig. 5.1,

the inter-layer residual prediction allows the enhancement block to use the upsampled resid-

ual signal of the reference macroblock as a predictor for its own residual signal. Thus, only

the difference between the predicted and real residual signal of enhancement macroblocks

need to be coded in the enhancement layer.

While spatial prediction can use a downsampled version of the same macroblock from

a lower spatial layer as a reference, this may not always be the best prediction strategy.

As a matter of fact, the inter-layer predictor usually has to compete with the temporal

predictor. Especially for sequences with slow motion and high spatial detail, the temporal


prediction signal likely represents a better approximation of the original signal than the

inter-layer predictor does through upsampling a reference macroblock. Hence, the intra-

layer dependency imposed by the temporal prediction is more important than the inter-layer

dependency generated by the inter-layer predictor. This particular observation is the key

motivation for considering macroblock level coding dependency instead of layer dependency

when applying UEP to video packets.


Figure 5.2: Spatial prediction with dyadic settings in SVC. Each 16×16 rectangle represents a single macroblock.

For quality prediction, SVC quantizes the residual texture signal of enhancement layers

with a relatively smaller quantization parameter (QP) than the one used in the lower quality

layers so that more detail is retained in the enhancement layer. SVC supports three quality

scalable coding modes: CGS (coarse-grain scalable), MGS (medium-grain scalable) and FGS

(fine-grain scalable). Since CGS does not provide a proper bitrate adaptability and FGS is

computationally expensive, MGS is the better choice for most scenarios. MGS allows the

transform coefficients of the enhancement layer to be divided into multiple subsets, thus

allowing the encoder to create multiple quality enhancement layers from the residual signal

and allowing the decoder to partially receive the quality information by dropping some

packets.

This brief review of coding and prediction in SVC reveals that strong depen-


dency exists among pictures within a layer and across layers. We argue that ignoring the

internal design of video codec standards leads to less effective UEP for three reasons. First,

temporal predictor and spatial predictor modules of SVC compete for the prediction of mac-

roblocks. Hence, not all macroblocks of a frame in the higher spatial layer depend on the

co-located submacroblocks in the reference spatial layer. The same dependency property

also holds across temporal layers. Second, while P- and B- frames are allowed to use past

and future pictures (in playback order) as references, the final coding decision is at the

discretion of the encoder. For example, to encode a video with very high motion activity

among consecutive frames, the encoder may use I-slices inside a B-frame since the visual
information from past and future pictures is not as useful for predicting the current picture.

Finally, when using inter-picture and inter-layer motion compensated prediction in temporal

and spatial prediction, the encoder may decide to use extra information (such as motion

vectors, reference picture lists and the residual signal) from the reference macroblock or sub-

macroblock. Though these observations are drawn from inspecting the SVC codec standard,

similar observations can be drawn for other codec standards.

The deep inspection of coding and prediction structure of SVC indicates that the large

scale dependency among video layers is not sufficient to determine the importance of video

packets for UEP. We will examine the correctness of this statement through experiments in

Sec. 5.2.2. By considering dependency at the submacroblock/macroblock level (the finest

processing unit of the H.264 video coding standard family), we propose a more effective UEP

model that provides better protection for more important video packets.

5.2 Coding-Aware UEP for Layered Video Streaming

As discussed in Chapter 3, video encoders exploit the redundancy of visual information in

time and scale domains. For example, when the camera pans slowly through the scene or

the scenery is stationary, the captured video sequence can be highly compressed without


noticeable loss of quality due to high visual similarity of consecutive frames. Encoded video

frames can be divided into reference and dependent pictures, where dependent pictures are

reconstructed using the reference pictures. However, in modern video coding standards, a

picture may be a reference picture for some pictures and may also be a dependent picture

depending on some other pictures at the same time. Furthermore, in a layered video coding

standard reference pictures and dependent pictures can be organized into reference layer(s)

and dependent layer(s), respectively. Video layers are separate substreams inside a video

bitstream for independent transmission. Similar to that of dependent pictures, the video

bitrate is reduced by storing the residual error of the predicted video signal in dependent

layers.

Dependent layers usually provide higher quality but rely on their respective reference

layers for successful reconstruction of the transmitted video frames. Since dependent layers

consist of pictures depending on those of the reference layers, any noise (lost or corrupted

video packets) in a reference layer may hinder the decodability of one or more dependent

layers, even if the dependent layers were received correctly. In this case, the resources

consumed to transmit and decode the dependent layers are wasted. Such a decodability

dependency justifies the need for using stronger protection mechanism for reference layers,

i.e., unequal error protection (UEP). As discussed in Sec. 2.2, numerous UEP methods are

proposed to provide different levels of protection according to the importance of the reference

and dependent video layers [14, 28, 53, 86, 134]. However, these proposals do not consider

the internal design of the video coding standards. In fact, they confront the problem as

a general unequal protection problem between reference and dependent layers where, for

example, data partitions in H.264/AVC [124] or multiple descriptions in MDC [23] can be

substituted by the concept of layers in scalable video coding (SVC) [112].

In fact, the importance of a piece of video content is determined not only by the lay-
ering structure, but also by visual features and encoding decisions [162]. To accurately model


the importance of visual information in a video sequence, we look deeper into the coding

and prediction structure of the state-of-the-art layered video coding standard, i.e., Scalable

Video Coding (SVC), and model the dependency among macroblocks and submacroblocks

using a weighted dependency graph. Based on the fine granular dependency presented in

the proposed dependency graph, we propose a dependency-aware UEP model that protects

macroblocks and submacroblocks according to their importance. The experimental results

show that the proposed UEP model significantly outperforms the conventional layer-weighted

UEP models.

5.2.1 Coding and Dependency-Aware Unequal Error Protection

In this section, we present the design of the proposed coding and dependency-aware UEP

algorithm for layered video coding. The design is based on a precise model of the dependency

structure in the to-be-protected encoded video sequence. First, we describe how the coding

dependency is modeled as a weighted acyclic dependency graph. Next, we illustrate how to

calculate the importance of macroblocks and the video packets containing them, and how

to use the calculated importance to provide appropriate protection for each video packet.

Finally, we present a practical implementation of the coding and dependency-aware UEP.

Please note that although the proposed coding and dependency-aware UEP model is ex-

plained in the context of SVC, it can be applied to other single-layer or layered video coding

standards with minor changes in the calculation of the dependency weight.

Modeling Coding Dependencies

To model coding dependency in SVC, we introduce a dependency graph G = (V,A), which

is a weighted directed acyclic graph (DAG). Each node mi ∈ V represents a macroblock and

each directed arc ai,j ∈ A from mi (dependent macroblock) to mj (reference macroblock)

indicates that mi depends on mj2. Indirect dependency exists between two macroblocks mi

2 Note that this notation differs from that used in Fig. 2.6, where the dependency arcs are from reference frames to dependent ones.


and mk if there is a path from mi to mk in graph G.

Generating the dependency graph: To generate the dependency graph G, we extract coding

dependencies among macroblocks while encoding a video sequence. More specifically, when

encoding a new macroblock using SVC encoder, we add a new node to the dependency

graph. For each prediction decision made by the encoder, we add an arc to the graph from

the dependent macroblock to the reference macroblock. Fig. 5.3 shows a sample dependency

graph with 6 nodes (macroblocks). Due to different prediction mechanisms employed in SVC,

the effect of losing a reference macroblock on the dependent macroblocks may be severe or

negligible. Therefore, we need to assign a proper weight to each dependency arc to indicate

the level of dependency between two macroblocks. The resulting graph is a DAG, i.e., no

cycles exist in G, since no two macroblocks can directly or indirectly depend on each other.


Figure 5.3: An example of a dependency graph with 6 nodes, where m1 serves as an absolute reference macroblock and m6 is not used by any other macroblock.

Setting weights of dependency arcs: The SVC encoder is equipped with a rate distortion

module that calculates the rate distortion cost of each permissible prediction mode as follows:

C_{i,j} = D_{i,j} + 0.85 × 2^{(QP−12)/3} × R_{i,j}        (5.1)

where i is the index of reference macroblock mi, j is the index of dependent macroblock mj,

Di,j denotes the sum of absolute difference between pixels in the reference and dependent


macroblocks, QP is the quantization parameter, and Ri,j is the actual number of bits required

to represent the residual signal of mj. The prediction cost Ci,j is inversely proportional to

the prediction dependency, i.e., lower prediction cost implies stronger dependency between

two macroblocks.
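Restated in code form (a minimal sketch with illustrative names; in practice these values come from the JSVM rate-distortion module), the cost of Eq. 5.1 and the full-reference arc weight of Eq. 5.2 are:

    def prediction_cost(sad, bits, qp):
        """Rate-distortion cost of one prediction mode (Eq. 5.1).

        sad  : D_ij, sum of absolute pixel differences between the two macroblocks
        bits : R_ij, bits needed to code the residual signal of the dependent block
        qp   : quantization parameter
        """
        return sad + 0.85 * 2 ** ((qp - 12) / 3.0) * bits

    def full_reference_weight(sad, bits, qp):
        """Arc weight for a full-macroblock reference (Eq. 5.2): the inverse of the
        cost, so a cheaper (better-matching) prediction yields a stronger arc."""
        return 1.0 / prediction_cost(sad, bits, qp)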

Due to the delicate internal structure of SVC, sometimes the proper weight of a de-

pendency arc in G cannot be calculated as easily as in Eq. 5.2. Both temporal and spatial

prediction modules may use only a portion of a reference macroblock to predict a dependent

macroblock. Moreover, the motion compensated prediction is not limited to the boundaries

of the reference macroblocks. The dependency among macroblocks may be categorized into

four cases, as illustrated in Fig. 5.4. The weight of each dependency relation in each case

may be calculated as follows:


Figure 5.4: Different types of dependencies among macroblocks in SVC. (a) Using a full macroblock as a reference, (b) Using a macroblock created from portions of 2 or 4 macroblocks as a reference, (c) Using a submacroblock as a reference (after proper upsampling), and (d) Using multiple macroblocks as references.

• Using a full macroblock as a reference (Fig. 5.4(a)): The simplest prediction form

is using a macroblock to predict another macroblock. In this case, we calculate the

weight of each dependency relation as the inverse of the prediction cost, i.e.:

w_{i,j} = 1 / C_{i,j}        (5.2)


• Using a macroblock created from portions of 2 or 4 macroblocks as a reference

(Fig. 5.4(b)): The prediction modules may use a 16 × 16 area located on the bor-

ders of two or four macroblocks as a reference macroblock. In this case, we add a

dependency arc from the dependent macroblock to each of the macroblocks serving

as a partial reference. The weight of each dependency arc is calculated as follows:

w_{i,j} = (1 / C_{i,j}) × (S_{i,j} / 256)        (5.3)

where Si,j is the number of pixels of the reference macroblock that belongs to mi.

Compared to Eqn. 5.2, the weight of each arc representing a partial dependency is

scaled according to the amount of visual information in the dependent macroblock

predicted from the partial reference.

• Using a submacroblock as reference (after proper upsampling) (Fig. 5.4(c)): A sub-

macroblock may be upsampled to serve as a whole reference macroblock. If the

reference submacroblock belongs to a macroblock, there is only one arc from the

dependent macroblock to the macroblock containing the reference submacroblock,

as illustrated in Fig. 5.4(c). In this case, the weight of the arc is calculated the same

way as in the case that a full macroblock is used as a reference. Thus, the weight of

the arc is calculated as in Eqn. 5.2. If the reference submacroblock is located across

boundaries of 2 or 4 macroblocks, similar to the case illustrated in Fig. 5.4(b), we

add a dependency arc from the dependent macroblock to each of the macroblocks

serving as a partial reference. The weight of each dependency arc is calculated as

in Eqn. 5.3, except that the constant in the denominator is 64 (representing the

smaller size of 8× 8 co-located submacroblock).

• Using multiple macroblocks as reference (Fig. 5.4(d)): A dependent macroblock

may use multiple reference macroblocks and combine the result by, for example,


taking an average over the predicted samples. In this case, the importance of each

reference macroblock depends on the successful receipt and decoding of the other

reference macroblocks. The quality of the reconstructed macroblock improves as

more reference macroblocks become available. In this case, we add one dependency

arc from the dependent macroblock to each of the reference macroblocks. The weight

of each dependency arc is calculated as follows:

w_{i,j} = (1 / C_{i,j}) × (1 / N)        (5.4)

where N is the number of reference macroblocks. Since the availability of each ref-

erence macroblock is not known before transmitting the video packets, all reference

blocks are equally important. Thus, each arc has an equal share of the full weight.

Since Ci,j is explicitly calculated as part of the encoding procedure, the calculation of

wi,j is very quick in all four scenarios.

Dependency-aware Unequal Error Protection

Similar to AVC, in SVC, macroblocks of each video frame are grouped into one or more slices,

where each slice is encapsulated into a video coding layer (VCL) network abstraction layer

(NAL) unit and is sent over the network as a video packet. In this section, we propose a

new dependency-aware UEP model. The proposed UEP model determines the importance

of video packets according to the dependency captured in the dependency graph G. The

algorithm involves three steps. First, we calculate the importance of macroblocks according

to the dependency graph G. Next, we calculate the importance of video packets based on

the importance of the encapsulated macroblocks. Finally, we apply UEP to video packets

according to their importance. The details of each step are provided below.

Calculating the importance of macroblocks: To calculate the importance of macroblocks, we

create a topological sort T , in which mi is listed before mj if mi depends on mj (i.e., there


is an arc from mi to mj in G). Since G is a DAG, we can create T in linear time with

complexity of O(|V | + |A|), using algorithms like Kahn’s topological sort algorithm [71].

Visiting nodes in T is the same as traversing G from macroblocks that are not used as a

reference for any other macroblock (nodes with output degree zero) to macroblocks that do

not use any reference macroblock (nodes with input degree zero).

We note that determining the importance of macroblocks based on the dependency graph

resembles the calculation of page scores in information retrieval, where each page on the web

is represented by a node and each link in the page is represented by an edge in the graph

of web pages. For this reason, we adapt a variation of the Page Rank [27] algorithm to

determine the importance of all macroblocks.


Figure 5.5: An example of a 10-node weighted dependency graph G. Nodes represent macroblocks and arcs represent the dependencies. (a) Before propagating the weights. (b) After propagating the weights by traversing the nodes in topological order and updating the weight of reference nodes according to Eq. 5.5.

Given the dependency graph G = {V, A} and the topological sort T, we set the initial weights of all nodes to w_{m_i} = 1/|V|. For the example illustrated in Fig. 5.5, the initial weight

of all 10 nodes is 0.1. We visit nodes in the dependency graph G according to their order in


T . For the 10-node sample dependency graph G in Fig. 5.5, one possible topological order is

{9, 10, 7, 6, 5, 8, 3, 4, 1, 2}. The topological sorting T guarantees that a reference macroblock

is visited after all of its dependent macroblocks. This allows us to calculate the importance

of a macroblock reflecting all direct and indirect dependencies associated with it. Hence,

the computation starts from the edge of the graph (the macroblocks on which no other macroblock depends) and moves towards the core of the graph (the macroblocks that do not depend on any macroblock). When visiting a node m_i ∈ V, if there exists an arc a_{i,j} ∈ A, we update the importance w_{m_j} of macroblock j as follows:

w_{m_j} = w_{m_j} + \sum_i w_{m_i} \times \frac{w_{i,j}}{\sum_k w_{i,k}} \qquad (5.5)

The importance of a macroblock j is the initial weight of macroblock j plus the weighted

average of the importance of all of its dependent macroblocks, where the weight of the

importance of each dependent macroblock i is the normalized weight of arc a_{i,j} among all outgoing arcs of node i. According to this algorithm, the weight of node 9 in Fig. 5.5 is 0.1 since it has input degree zero. The weight of node 10 is 0.1 + (0.4 × 0.1/1.1) = 0.136 ≈ 0.14.

The computational complexity of this algorithm is O(|A|). Once all nodes are visited, the

importance values of all macroblocks are calculated.
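
To make this step concrete, the following minimal Python sketch (illustrative names, not the JSVM implementation used in this thesis) performs Kahn's topological sort and applies the update of Eq. 5.5, assuming each arc (i, j) goes from a dependent macroblock i to its reference j:

    from collections import defaultdict, deque

    def macroblock_importance(num_nodes, arc_weights):
        """arc_weights: dict mapping (i, j) -> w_ij for an arc from dependent i to reference j."""
        out_arcs = defaultdict(list)            # i -> [(j, w_ij), ...]
        in_degree = [0] * num_nodes
        for (i, j), w in arc_weights.items():
            out_arcs[i].append((j, w))
            in_degree[j] += 1

        importance = [1.0 / num_nodes] * num_nodes      # initial weight of every node
        queue = deque(v for v in range(num_nodes) if in_degree[v] == 0)
        while queue:                                    # Kahn's topological sort
            i = queue.popleft()
            total = sum(w for _, w in out_arcs[i])      # sum_k w_{i,k}
            for j, w in out_arcs[i]:
                importance[j] += importance[i] * w / total   # Eq. 5.5
                in_degree[j] -= 1
                if in_degree[j] == 0:
                    queue.append(j)
        return importance

Applied to a graph built from the arc weights of Fig. 5.5(a), this propagation reproduces the node weights of Fig. 5.5(b).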

Calculating the importance of video packets: The SVC encoder puts each video slice in one

or sometimes more video packets. Hence, the importance of video packets is the same as

the importance of the contained video slice. Since a slice s_i is a collection of macroblocks (s_i = {m_j}), the importance of s_i is determined by the importance of all macroblocks that form s_i. The average importance of the contained macroblocks is a trivial candidate for calculating the importance of a video slice. However, due to different prediction types, importance measures of video slices obtained by averaging the importance of the contained macroblocks are sparsely distributed over a large range. This misleads the UEP model into applying substantially different levels of protection to slices. Therefore, we moderate the importance measure by


applying the natural logarithm operator to the average importance of the contained mac-

roblocks as follows:

w_{s_i} = \ln\left(1 + \frac{\sum_j w_{m_j}}{n_{s_i}}\right) \qquad (5.6)

where m_j ∈ s_i, and n_{s_i} is the number of macroblocks contained in s_i. Given the importance of slice s_i, the importance of the container video packet p_i is w_{p_i} = w_{s_i}.
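
As a small illustration of Eq. 5.6 (a sketch with hypothetical names):

    import math

    def slice_importance(macroblock_weights):
        """Eq. 5.6: moderated slice importance from the weights of its macroblocks."""
        return math.log(1.0 + sum(macroblock_weights) / len(macroblock_weights))

    # The containing video packet simply inherits this value: w_p = slice_importance(...)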

There are two other types of video packets that do not contain slice sample data. A

video packet may contain important header data for a series of consecutive coded pictures

(a coded video sequence) or one or more individual pictures within the series. In this case,

the importance of these video packets is set to the maximum weight among the slices that depend on them. A video

packet may also contain supplemental enhancement information that can be ignored by the

decoder. In this case, the importance of such video packets is set to the minimum weight of

all slices.

Unequal protection of video packets: The number of protection bytes (N_{p_i}) associated with each video packet p_i should be proportional to both the importance of the packet and the packet's length. Thus, we calculate N_{p_i} as follows:

N_{p_i} = \frac{w_{p_i} \times l_{p_i}}{\sum_j w_{p_j} \times l_{p_j}} \times N \qquad (5.7)

where N is the total number of available protection bytes, and w_{p_i} and l_{p_i} are the importance and length of packet p_i. The iterator j in the denominator iterates over all the video packets, including p_i.
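
For illustration only, a sketch of the allocation in Eq. 5.7 (names are hypothetical; a real implementation would also round each allocation to whole Reed-Solomon symbols):

    def allocate_protection_bytes(weights, lengths, total_protection_bytes):
        """Eq. 5.7: share the available protection bytes by importance x packet length."""
        denom = sum(w * l for w, l in zip(weights, lengths))
        return [w * l / denom * total_protection_bytes for w, l in zip(weights, lengths)]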

Practical Implementation of the Proposed UEP Model

Although building the dependency graph and calculating importance measure takes O(|V |+

|A|), the computation time may still be considerably long since both |V | (number of mac-

roblocks in a video) and |A| (number of dependency links) can be very large. Moreover, the


process requires a large amount of memory space to store the dependency graph and to carry

out the calculations. In this section, we address these implementation challenges.

As illustrated in Fig. 5.6, we note that due to the GOP structure of H.264 standard

family, the proposed dependency graph exhibits a special form of clustering. There is a

dependency path among the key pictures in a video sequence, and all other dependency arcs

are confined between two consecutive key pictures. This clustering property inspires us to

implement the algorithm in a divide-and-conquer manner. We may consider dependency

among macroblocks from key pictures and dependency of macroblocks within each GOP as

a smaller instance of the dependency-aware UEP model. We may solve each instance quickly while consuming a reasonable amount of memory.


Figure 5.6: The prediction dependencies in a SVC video sequence with two spatial and three temporal layers. Dependency links between key pictures are shown in black. The grey links represent dependency among pictures between two consecutive key pictures.

Assume that we have the computing power to launch m computing instances of the pro-

posed algorithm in parallel with the video encoder. We dedicate one computing instance to

calculate importance of macroblocks from key pictures and the importance of corresponding

video packets. For clarity, we refer to macroblocks from key pictures as key macroblocks from

here on. We refer to this computing instance as the main instance since it will run until the

importance of all video packets is calculated. The calculation of the importance of a key mac-

roblock requires the importance of its dependent macroblocks – key macroblocks in future


GOPs and macroblocks within the same GOP. Recursively, the calculation of the importance

of a key macroblock in a future GOP also requires the importance of key macroblocks in

its future GOPs and macroblocks within the same GOP. The recursive relation still requires

the main instance to consider all macroblocks in a video sequence. Thus, this instance re-

quires large computing and memory resources. We note that the indirect dependencies decay

rapidly when the playback times of two key pictures are further apart. Therefore, when cal-

culating importance of macroblocks, the main instance can consider key macroblocks from n

future GOPs and ignore the subsequent ones. We refer to n as the key picture decay factor.

The value of n may be tuned according to visual features presented in the video sequence.

For example, when running the algorithm for high motion video sequences, n can be as small

as 4, because the visual similarity between consecutive video frames decays very fast.

When the video encoder proceeds with encoding the GOPs from the beginning of the

video, in parallel to the main instance of the proposed algorithm, each of the other m − 1

instances determines the importance of macroblocks within one GOP and the importance of

corresponding video packets. As soon as a computing instance has the importance of mac-

roblocks within one GOP, it passes the results to the main instance so that the importance

of key macroblocks of the GOP may be updated. Once the importance of key macroblocks

of GOPs 1 . . . n is updated, the main instance runs the algorithm on the key macroblocks of

GOPs 1 . . . n and finalizes the importance of key macroblocks and respective video packets

of GOP 1. GOP 1 is then removed from the graph. If n ≥ m − 1, the instance responsible

for GOP 1 is re-assigned to the next unprocessed GOP. The same process is repeated for

GOPs 2 . . . n + 1 and so on until all GOPs are removed from the graph. In this way, the

main instance always keeps track of macroblocks of at most n+1 key pictures, and the other

m− 1 instances always keep track of macroblocks of GOP size plus one pictures.

If n < m, when the importance of key macroblocks of GOP n is updated, the main

instance runs the algorithm on the key macroblocks of GOPs 1 . . . n and finalizes the impor-


tance of key macroblocks and respective video packets of GOP 1. Then it removes GOP 1

from the graph, frees the instance responsible for macroblocks of GOP 1, and associates the

next available GOP to this instance. In this way, the main instance always keeps track of

macroblocks of at most n + 1 key pictures and other m − 1 instances always keep track of

macroblocks of GOP size plus one pictures. When all GOPs are processed, the main instance

will also terminate.
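
A compact way to see the resulting schedule is the following sketch (illustrative only, assuming m ≥ 2 computing instances and that workers process GOPs strictly in order):

    def importance_schedule(num_gops, n, m):
        """Simulate when GOPs are processed by the m-1 workers and finalized by the
        main instance, which finalizes GOP f once GOPs f .. f+n (or the end of the
        video) have reported their intra-GOP importance."""
        events = []
        processed = 0      # GOPs whose intra-GOP importance has been reported
        finalized = 0      # GOPs finalized and removed from the graph
        while finalized < num_gops:
            batch_end = min(processed + (m - 1), num_gops)
            for g in range(processed + 1, batch_end + 1):
                events.append("worker processed GOP %d" % g)
            processed = batch_end
            while finalized < num_gops and min(finalized + 1 + n, num_gops) <= processed:
                finalized += 1
                events.append("main finalized GOP %d" % finalized)
        return events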

Since the key picture decay factor truncates the continuity of the dependency in the

video sequence, the result is sub-optimal. In other words, we trade optimality for less com-

putational complexity and memory usage. For example, if we assume that each macroblock

has two outgoing links on average, the proposed algorithm needs 16 GB of memory to store

the dependency graph G for two hours of 24 fps HD video sequence. In comparison, the

divide-and-conquer implementation requires less than 7 MB of memory when the GOP size

and the parameter n are both equal to 8.
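
As a rough back-of-envelope check of the 16 GB figure (our own accounting, assuming 16 × 16 macroblocks at 1280 × 720 for the base layer and roughly 8–10 bytes per node or arc; the thesis does not state the exact bookkeeping):

\begin{align*}
\text{frames} &= 2 \times 3600 \times 24 = 172{,}800,\\
\text{macroblocks per frame} &= \tfrac{1280}{16} \times \tfrac{720}{16} = 3{,}600,\\
|V| &\approx 6.2 \times 10^{8}, \qquad |A| \approx 2|V| \approx 1.2 \times 10^{9},\\
\text{memory} &\approx (|V| + |A|) \times (8\text{--}10)\ \text{bytes} \approx 15\text{--}19\ \text{GB},
\end{align*}

which is on the order of the figure quoted above.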

5.2.2 Performance Evaluation

To evaluate the performance of the proposed dependency-aware UEP model, we conduct a

series of experiments using the JSVM 9.19.15 software package, which is the reference software

package for scalable video coding. As mentioned in Chapter 3, JSVM provides encoding,

decoding, and video analysis tools. As illustrated in Fig. 5.7, we added two modules (the

dependency logger module and the dependency graph module) to the SVC encoder provided

by JSVM. The dependency logger module is tapped into the main video coding modules of the

encoder, including GOP encoder, frame encoder, layer representation encoder, slice encoder,

and macroblock encoder. This module extracts all dependency information by monitoring

prediction and coding decisions during the encoding process. The dependency graph module

uses the macroblock dependency information compiled by the dependency logger module to

generate the dependency graph as described in Sec. 5.2.1. This module also calculates the

importance of video packets using the algorithm we presented in the last section. We also


implement the algorithm according to the practical suggestions described in the previous section to reduce computation overhead and memory demand. In these experiments, the key

picture decay factor is manually tuned.


Figure 5.7: The architecture of the performance evaluation system. Components in gray are modified JSVM components and those in white are developed from scratch.

We also developed a UEP-enabled video transmission simulator, as illustrated in Fig. 5.7.

The simulator consists of four modules: UEP module, UDP packetization module, noisy

channel simulator, and video packet extractor. It simulates the transmission of unequally error-protected video packets over a noisy network channel. The encoded video packets with their respective importance factors are sent to the UEP module. This module associates the proper amount of protection with each video packet using systematic Reed-Solomon codes. Since this work focuses on determining the importance of a video packet rather than redesigning the coding scheme, an optimal coding scheme (RS) is used with a recovery efficiency of 1. The coding scheme can be replaced with any other coding scheme, such as LDPC or Raptor codes. The protected video packets are then sent to the UDP packetization


module. This module disperses each video packet into 512-byte UDP datagrams at a rate such

that the delay to the decoder is no more than 400 milliseconds [64]. The simulation can

be done in TCP as well. With TCP, although there is no end-to-end packet loss, a video

packet that arrives after its playback deadline (due to TCP flow or congestion control)

is considered lost by the decoder. Simulating the noisy channel in UDP gives us better

control on the packet loss rate. Next, UDP datagrams are virtually transmitted over a

packet erasure channel, where some packets are randomly dropped according to the channel’s

packet loss rate. We model packet losses as an independent and identically distributed random process. On the receiver side, the video packets are extracted from the received

UDP datagrams, and the video stream is reconstructed using a SVC decoder. The decoder

is modified to calculate and report the peak signal-to-noise ratio of the reconstructed video.

The seven raw video sequences, as listed in Table 5.1, are selected from different genres.

They also exhibit diverse values of the detail and motion activity visual features [67]. As

described in Chapter 2, the detail feature provides a summary of the histogram descriptors in

the video sequence, and the motion activity feature primarily captures the degree or intensity

of scene changes. The reference raw video sequences are in YUV 4:2:0 format with frame

size of 1280×720 pixels and frame rate of 24 fps [3], the standard sampling scheme for H.26x

video coding standards. Since it takes hours to analyze a complete video and apply UEP,

we select 10 seconds (240 frames) of each video in order to conduct all experiments in a

reasonable amount of time. For video sequences that have movie titles at the beginning, i.e.,

BB and ED, the first 1000 frames are skipped. Otherwise, the frames are selected from the

beginning.

The modified JSVM package and the transmission simulator enabled us to conduct a

series of experiments to determine the effectiveness of the dependency-aware UEP model for

the seven reference videos. All experiments are conducted on a server cluster of 10 nodes.

Each server node is equipped with four Intel® Xeon® E5640 CPUs with four cores at 2.67


Table 5.1: Reference video sequences and their properties.

Content                   Genre          Detail   Motion Activity
Big Buck Bunny (BB)       Animation      3.52     1.63
Elephants Dream (ED)      Animation      3.73     2.39
Pedestrian Area (PA)      Scene          3.15     4.42
Rush Hour (RH)            Scene          3.17     3.12
Park Joy (PJ)             Scene/Nature   4.24     3.73
Riverbed (RB)             Nature         4.72     4.13
Sunflower (SF)            Nature         4.04     2.57

GHz and 16 GB of 1066 MHz memory. To avoid the impact of multi-core operations on core

performance and cache hit rate, we utilize only one core on each CPU and at most three CPUs

on each machine.

Revisiting our claim: Why consider macroblock dependencies?

At the beginning of this section, we claimed that the large scale dependencies between

different layers of a SVC video do not reflect the relative importance of each video packet

properly. To verify this claim, we encoded all reference video sequences with a fixed layering

configuration of DTQ = (2,4,0)³. Then we measured the amount of different dependency

types in each encoded video sequence. If the layer dependencies were sufficient to create an accurate enough picture of the dependency relations between video packets, as assumed by previous works, we would expect to see a similar trend across different videos, with insignificant differences.

The results are reported in Table 5.2. Even though the layered structure of the video

sequences is exactly the same, there is a significant difference between the type of dependen-

cies that exist in encoded videos. For example, for video sequences with fewer details, like

BB, in most situations the encoder selects spatial prediction over the temporal prediction.

However, for more complicated video sequences, like PJ and RB, the situation is reversed.

Furthermore, the behavior of temporal and spatial predictors is significantly different. In

³ We did not use quality layers since each quality layer creates a fixed number of quality dependencies between dependent and reference layers.


BB, 80.7% of the predictions are based on motion prediction, which suggests that the en-

coder was able to find similar reference macroblocks in past or future frames. In contrast,

in RB 83.7% of the predictions are intra-prediction, which suggests that the high amount of

motion activity and details of the video sequence prohibits the encoder from using motion

prediction. As expected, this leads to a lower encoding efficiency and higher bitrate of the

encoded video. These results indicate that in a fixed layering configuration, the importance

of a specific video slice, which will be carried by a specific video packet, might be signifi-

cantly different for different video sequences. Therefore, we claim that we need to consider

macroblock dependencies to calculate the importance of video packets more precisely.

Table 5.2: Dependency statistics for different video sequences using a fixed layering configuration of DTQ = (2,4,0).

Dependency    BB      ED      PA      RH      PJ      RB      SF
Temporal      30.4%   46.5%   62.0%   39.7%   71.8%   82.2%   34.2%
  IPP¹        7.7%    30.3%   53.1%   21.2%   66.5%   70.6%   13.1%
  IPMP²       22.7%   16.2%   8.9%    18.5%   25.3%   11.6%   21.1%
  +MV-RIP³    18.6%   14.7%   5.6%    13.8%   17.4%   11.1%   16.5%
Spatial       69.6%   53.5%   38.0%   60.3%   28.2%   17.8%   65.8%
  ILIP⁴       1.6%    5.0%    11.5%   4.8%    12.1%   13.1%   2.4%
  IPMP        68.0%   48.5%   26.5%   55.5%   16.1%   4.7%    63.4%
  +MV-RIP     57.4%   41.9%   23.7%   49.0%   13.9%   4.3%    55.3%
  +RSP⁵       53.1%   36.5%   22.1%   44.8%   12.7%   4.3%    48.9%

¹ Intra picture prediction. ² Inter-picture motion prediction. ³ Motion vector and reference index prediction. ⁴ Inter-layer intra prediction. ⁵ Residual signal prediction.

Performance of Dependency-aware UEP

To evaluate the performance of the proposed dependency-aware UEP model, we encoded

the reference video sequences listed in Table 5.1 using the fixed layering configuration of

DTQ = (1, 3, 1). We set the key picture decay factor to 16, i.e., we assumed the importance

of each key picture is negligible for the frames that are more than 16 GOPs away, which is

equal to 128 frames or 5.3 seconds in playback. In this experiment, we vary the packet loss

rate in the packet erasure channel from 0% to 28% with step size of 4%. In all experiments,


the channel capacity is set to the video bitrate plus half of the expected packet loss rate

so that there is sufficient bandwidth to carry the video streaming and the corresponding

UEP data. To minimize variability in the results due to random packet drops, we virtually

transmitted each video sequence over the channel 10 times and report the average peak

signal-to-noise ratio (Y-PSNR) of the received video. In theory, due to better protection

of important video packets, the decoded videos should be in higher quality when using the

dependency-aware UEP model. For comparison purpose, we implement four other error

protection models:

• Equal error protection (EEP): All video packets are associated with the same im-

portance measure for equal protection.

• Packet-weighted UEP (PW-UEP): The importance of each video packet is propor-

tional to the number of bytes used to store the video packet [95, 96]. This model

also resembles similar models that use the number of bytes needed to represent a

slice or a frame as the importance measure of video packets.

• Layer-weighted UEP (LW-UEP): The importance of each layer is determined by

considering the layers that depend on it [53, 166]. This model also resembles the

UEP models based on error propagation zone as described in Sec. 2.2.

• Optimal model (Optimal UEP): The optimal UEP ensures that the most important

packets are always protected so that the distortion in playback is minimal. Hence,

only the least important packets are affected if losses occur during transmission. To

simulate the optimal model, we select and drop the video packets that cause the

least rate distortion. The number of packets to be dropped is determined by the

channel loss rate.

Implementation of EEP and PW-UEP is straightforward. For LW-UEP, we use the

algorithm proposed in [53]. This algorithm involves solving the distortion optimization


problem over all video packets of a video sequence using all possible FEC (forward error

correction) packets. Since the distortion decrease cannot be determined prior to transmission,

we determine the parameter setting for this UEP model by running the algorithm for the

RH video sequence with DTQ = (1, 3, 1). Since the algorithm depends on only the layering

information and the number of FEC packets, the same parameter setting shall provide the

least distortion for other videos that share the same layering configuration. We select video

RH because its visual properties and dependency statistics are close to the mean value of

the other videos according to Table 5.1.

Fig. 5.8 provides the Y-PSNR measurements of videos reconstructed from video packets

transmitted over channels with different loss rates. As expected, EEP shows the worst per-

formance since it protects all video packets equally. PW-UEP also provides low video quality.

This is caused by large video packets from the quality layer, which outweigh the importance

of smaller video packets of other layers. Compared to EEP and PW-UEP, LW-UEP leads to

much better video quality, since it considers layer dependency and provides more protection

to video packets from reference layers. This is the improvement observed in existing UEP

proposals for layered video coding. For all the videos, the proposed dependency-aware UEP

model outperforms EEP, PW-UEP, and LW-UEP, and closely approximates the performance

of optimal UEP. More specifically, the average quality of the video transmitted using the

proposed model is 3.76 dB better than that of LW-UEP when the packet loss rate is 28%.

The results in Fig. 5.8 suggest that modeling coding dependencies provides a more accurate

importance measure for UEP, which leads to better video quality.

Next, we examine the computational overhead of the proposed dependency-aware UEP

model. According to Table 5.3, while the average computational overhead is less than 2.2%,

the overhead noticeably varies among different videos. Consulting the visual properties of

the videos in Table 5.1, we note that there is a strong correlation between the computational

complexity and the amount of detail and motion activity of the videos. Hence, there are

[Figure 5.8 plots omitted: Video Quality (Y-PSNR) versus Packet Loss Rate (%) under EEP, PW-UEP, LW-UEP, the proposed model, and optimal UEP for the (a) BB, (b) ED, (c) PA, (d) PJ, (e) RB, and (f) SF video sequences.]

Figure 5.8: Performance of different UEP models over a packet erasure channel with varying packet loss rate and fixed layering configuration of DTQ = (1,3,1).


more dependencies among macroblocks in videos with higher detail and motion activity (i.e., a denser dependency graph). For these videos, more dependency arcs must be processed, which implies higher computation overhead.

Table 5.3: Computational overhead of the proposed UEP model compared to the video encoding time

Video Sequence         BB     ED     PA     PJ     RB     SF
Computation Overhead   1.8%   1.6%   2.1%   3.2%   3.4%   2.4%

Generality of the proposed UEP model

In this section, we investigate the sensitivity of the proposed UEP model to the layering

configuration and frame size. We compare the Y-PSNR of the received videos under the

setting used in the previous section with those under two other settings. In one setting,

we change only the frame size from HD to 480p, i.e. from 1280 × 720 pixels to 854 × 480

pixels. In the other setting, we change only the layering configuration from DTQ = (1, 3, 1)

to DTQ = (2, 4, 3). We present the Y-PSNR readings from LW-UEP and the proposed

UEP model with 28% packet loss rate in Table 5.4. Since LW-UEP protects video packets

according to the layering dependency, by comparing with LW-UEP, we can determine the

sensitivity of the proposed UEP model to layering configuration.

Table 5.4: Y-PSNR of the transmitted videos when varying the video specification.

Parameters                  UEP Model   BB     ED     PA     PJ     RB     SF
HD videos, DTQ = (1,3,1)    LW-UEP      24.1   19.2   21.4   16.6   17.9   28.4
                            Proposed    26.4   21.5   26.6   21.8   23.9   30.1
480p videos, DTQ = (1,3,1)  LW-UEP      23.7   16.3   18.9   14.7   17.1   26.0
                            Proposed    25.9   18.2   23.1   18.3   22.4   27.6
HD videos, DTQ = (2,4,3)    LW-UEP      27.3   19.9   22.9   17.2   19.7   30.4
                            Proposed    29.9   22.6   28.3   23.8   27.2   33.1

From Table 5.4, we observe that reducing the frame size leads to lower Y-PSNR in the

reconstructed videos for both LW-UEP and the proposed UEP model. We also observe

that Y-PSNR decreases slightly more when using the proposed UEP model. Hence, the


proposed UEP model is more sensitive to changes in frame size. When changing the layering

configuration from DTQ = (1, 3, 1) to DTQ = (2, 4, 3), the video bitrate increases, resulting

in higher video quality. According to Table 5.4, increasing the number of layers improves

the Y-PSNR by 1.61 dB and 2.43 dB on average for LW-UEP model and the proposed

model, respectively. The change also increases the gap between the average quality from

these two models, from 3.76 dB to 4.58 dB. This is because increasing the number of layers

introduces more macroblocks and more dependency relationships. In this case, the proposed

dependency-aware UEP model provides more precise UEP for video packets than LW-UEP

does.

5.3 Adaptive FEC for Layered Video Multicast

In Sec. 5.2, we proposed a novel method to calculate the importance of video packets of

different video layers more precisely. Through the experimental results, we also demonstrated

the better performance of the proposed model compared to other state-of-the-art UEP models

suggested for layered video coding. Layered video coding can be used in a video unicast

scenario (where the video is prepared and sent by the media server toward a specific end

user device) to address the fluctuation in channel bandwidth and packet loss rate. However,

the full capabilities of layered video coding can be used in a video multicast scenario, when a

single layered video is prepared by the media streamer and different groups of mobile devices

with similar connectivity, usually referred to as multicast groups, are receiving different

number of video layers according to their conditions. Again, the number of layers received

by each multicast group might change according to the connection quality.

In a mobile communication network, the last mile antenna uses heterogeneous wireless

channels with fluctuating capacity and packet loss rate for video multicast. Furthermore, the

mobile devices have different antenna, computation, display and battery capabilities. There-

fore, a proper video multicast solution must consider efficient use of the wireless channels


along with providing good quality of experience for the devices with heterogeneous capabili-

ties. Last but not least, the energy consumption of the devices connected to these networks

must be considered as a critical resource. In this section, we propose an adaptive video

multicast system that delivers layered video content to mobile devices through a mobile

communication network. The proposed system serves as an adaptive error protection layer

over the unequal error protection model suggested for video streaming unicast in Sec. 5.2.

To this end, a novel design is proposed to use the calculated importance of video packets

and prepare the redundant coded blocks needed for forward error correction in a multicast

video streaming scenario. Furthermore, this design makes the system flexible to dynamic

changes in channel loss rate with minimum coding overhead. Specifically, the employed

coding scheme is empowered by a novel structure for the coefficient matrix that decreases

the delay and the computational complexity of coding operations. The experimental results

show that the new system offers a flexible streaming service, decreases the computational

cost of preparing the FEC codes, significantly decreases the video transmission delay, and

conserves energy on mobile devices.

5.3.1 Adaptive FEC for Video Multicast

The proposed adaptive video streaming system utilizes three key enabling techniques: layered

video coding, erasure codes for forward error correction, and a novel mechanism to generate

the required coded blocks for forward error correction. Without loss of generality, we assume

that the multicast video is prepared in a number of layers, where the base layer is needed

to decode the video, and enhancement layers, if received successfully, improve the video

quality. Hence, the more layers a device receives, the higher the quality that the video is

played at. Mobile devices may determine the number of desired layers based on the network

connectivity, battery lifetime, availability of other resources, or user preference. Since such an

algorithm is orthogonal to the proposed model, we assume that the desired number of layers


is known a priori in our system. Within each layer, a segment represents a time slice of the

video playback (say 1 second), which encompasses one or more video packets. Furthermore,

we assume that the proportional importance of each video packet is determined by the model

proposed in Sec. 5.2. Hence, the importance of the video segment can be calculated from

the contained video packets as shown in Eq. 5.6.

Similar to the existing proposals for using erasure codes in wireless communication [13,19],

we employ systematic erasure codes to cope with channel losses. This in turn also reduces

energy consumption and delay due to fewer decoding operations. However, the channel loss

may significantly vary over time. Unlike existing proposals, in order to dynamically ac-

commodate different loss rates, we adjust the level of coding according to the video stream

building blocks and the channel conditions. Towards this goal, an obvious solution is to

prepare coded packets for each loss rate at the source. The coded packets are then trans-

mitted according to the current loss rate in each channel. Such a solution is costly in terms

of computation needed to prepare the coded packets for each and every packet loss rate, let alone the additional problem of finding a proper step size for the packet loss rate. To address this

problem, our system generates the coded packets for each GOP in each layer in a progressive

manner that allows the streaming node to combine as few as possible video packets together

and serve the coded blocks according to the channel loss rate.

Adaptive Transmission of Protection Blocks

Given the sporadic losses in wireless channels, the media streaming server (or the edge media

server located at the edge of the network) is supposed to be capable of sending each layer of

the video with an arbitrary level of redundancy. For this reason, we propose a fine granular

coding scheme that adaptively produces data blocks and coded blocks for any loss rate. As

depicted in Fig. 5.9, each layer is divided into a sequence of segments, each consisting of

one or more GOPs. Each GOP is further divided into m blocks, denoted by white blocks


with letter labels in Fig. 5.9⁴. The coefficient matrix used by the new coding scheme combines rows for the original blocks with ladder-shaped coding rows. In fact, the original blocks are

interleaved with coded blocks generated using a ladder-shaped coefficient matrix. Hence,

one coded block is generated for each original block. In each layer, each coded block i of

GOP g is a combination of all i original blocks of GOPs 1 – g from the same layer, i.e.

c^g_i = \sum_{j=1}^{g-1} \varepsilon^j_i B^j + \varepsilon^g_i B^g_i \qquad (5.8)

where ε^j_i is a set of randomly chosen coding coefficients for GOP j in a finite field such as GF(2), GF(64) or GF(256), B^j denotes the original blocks of GOP j, and B^g_i is the first i blocks in GOP g. The coded blocks can be

used to recover any lost block among GOPs 1 through g.
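
For the GF(2) case used later in this section, a hypothetical sketch of generating one such coded block per Eq. 5.8 (block contents are byte strings; a real encoder would avoid an all-zero coefficient vector and ship the chosen coefficients to the receiver):

    import random

    def coded_block(gops, g, i, block_size):
        """One coded block i for GOP g (1-based): XOR of randomly selected blocks from
        GOPs 1..g-1 plus the first i blocks of GOP g, i.e., Eq. 5.8 over GF(2)."""
        candidates = []
        for j in range(g - 1):                  # all original blocks of GOPs 1 .. g-1
            candidates.extend(gops[j])
        candidates.extend(gops[g - 1][:i])      # first i original blocks of GOP g
        acc = bytearray(block_size)
        for block in candidates:
            if random.getrandbits(1):           # coefficient epsilon drawn from GF(2)
                for k in range(block_size):
                    acc[k] ^= block[k]
        return bytes(acc)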

In existing erasure coding schemes for layered videos [127,137,138], coding is performed

on fixed-size segments encompassing k GOPs, where k is a positive integer. Such schemes

introduce extra delay. For instance, if the segment size is not a multiple of the GOP size, a portion of

a GOP may appear in segment i, and the remaining portion of the GOP appears in segment

i + 1. Now two consecutive segments must both be received and decoded to recover this

GOP. Furthermore, for segments enclosing multiple GOPs, the GOPs can only be decoded

after the segments are completely received. Moreover, they cannot adapt the amount of protection associated with each segment according to the importance of the video packets contained in

each segment. To address these issues, our coding scheme operates on individual GOPs to

minimize the delay overhead.

To deliver a GOP over a lossy channel with loss rate r, after sending ⌊r⁻¹⌋ original blocks

in the order appearing in the coefficient matrix, the source sends a coded block that is a

linear combination of the previously sent blocks. As shown in Fig. 5.9, the rate at which the

coded blocks are transmitted fluctuates according to the loss rate. As discussed earlier, the

coding scheme assumes that the mobile devices provide feedback on the connection quality.

⁴ Please note that while the proposed model is described over the GOPs, the model can be used on any other unit of multiple video packets, such as multiple GOPs, unit video segments, or even multiple video segments.


Figure 5.9: The proposed coding scheme for FEC blocks in layered video streaming.

Integrating Adaptive FEC with UEP

If an algorithm such as the one proposed in Sec. 5.2 is employed to determine the importance of video packets, the importance of each GOP can be determined as the average of the importance of the video packets used to represent the GOP. In this case, instead of sending one forward error correction coded block per ⌊r⁻¹⌋ original blocks, we determine the number of protection blocks that can be sent for each GOP similar to Eq. 5.7:

N^{P}_{gop_i} = \frac{w_{gop_i} \times l_{gop_i}}{\sum_j w_{gop_j} \times l_{gop_j}} \times N^{P} \qquad (5.9)

N^{B_P}_{gop_i} = \frac{N^{P}_{gop_i}}{l_B}

where N^{P} is the total number of available protection bytes, N^{P}_{gop_i} is the number of protection bytes that can be sent for GOP i according to its importance, and N^{B_P}_{gop_i} is the number of FEC blocks that can be sent for this GOP considering the size of each data block, l_B. w_{gop_i} and l_{gop_i} are the importance and length of GOP i in bytes. The iterator j in the denominator iterates over all the GOPs, including gop_i. Next, we can determine the period of sending a

FEC coded block for each GOP by dividing the coded blocks that can be sent for each GOP


among its original blocks evenly, and sending the coded blocks interleaved with the original

blocks as described earlier.
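
A rough sketch of this step (hypothetical helper names; real code would also track layers and GOP boundaries explicitly):

    def fec_blocks_per_gop(gop_weights, gop_lengths, total_protection_bytes, block_size):
        """Eq. 5.9: number of FEC blocks granted to each GOP."""
        denom = sum(w * l for w, l in zip(gop_weights, gop_lengths))
        return [int((w * l / denom) * total_protection_bytes // block_size)
                for w, l in zip(gop_weights, gop_lengths)]

    def interleave(num_original, num_coded):
        """Spread the coded blocks of a GOP evenly among its original blocks:
        'D' marks an original (data) block, 'C' a coded (FEC) block."""
        schedule, sent_coded = [], 0
        period = max(1, num_original // num_coded) if num_coded else 0
        for i in range(1, num_original + 1):
            schedule.append('D')
            if period and i % period == 0 and sent_coded < num_coded:
                schedule.append('C')
                sent_coded += 1
        return schedule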

The Choice of Coding Scheme

Reed-Solomon erasure codes perform coding operations in GF(256). However, due to the

field size, the encoding and decoding require matrix multiplication and inversion, which are

computationally expensive. To conserve energy on mobile devices while taking advantages

of erasure codes, we employ systematic Fountain codes in GF(2) in our coding scheme. The

encoding and decoding are simply XORing original blocks or coded blocks, respectively. The

coding coefficients are just 0’s and 1’s. For example, in Fig. 5.9, the coded block sent after

block A and block B for layer 1 is the XOR of the two blocks. If block B is lost during

transmission, it can be recovered by XORing block A and the coded block. Nonetheless, the

small field size increases the chance that the coded blocks are linearly dependent, making

the received coded blocks not decodable. For this reason, within each GOP, additional coded

blocks are produced by XORing a subset of blocks. It has been shown that for any number

of original data blocks k > 10, with α additional coded blocks the probability of decodability

of received k + α blocks is 1 − 2^{−α} [87].
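
For instance, a minimal GF(2) illustration of the recovery described above:

    def xor_blocks(a, b):
        """GF(2) addition of two equally sized blocks."""
        return bytes(x ^ y for x, y in zip(a, b))

    # Block B is lost; it is recovered from block A and the coded block A XOR B.
    block_a = bytes([0x10, 0x20, 0x30])
    block_b = bytes([0x0f, 0x0e, 0x0d])
    coded = xor_blocks(block_a, block_b)         # the systematic FEC block
    assert xor_blocks(block_a, coded) == block_b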

In addition to the flexibility in adapting to loss rates over different wireless channels, the

proposed coding scheme has two other advantages over existing proposals. First, regardless

of the size of the finite field, the computational complexity is greatly reduced since the

main video content is delivered in its raw form. This means that decoding is only needed

when an original block is lost. More precisely, one decoding operation is needed for each

lost block. Second, the fine granular coding scheme is designed to minimize the delay due

to coding operations. Since a GOP is the smallest decodable unit that a media player

operates on, aligning the size of the decodable unit of the erasure codes to the same size

allows the GOPs to be delivered to the application as soon as possible. Hence, our proposed

coding scheme encourages progressive decoding by providing many more recovery points in


each video layer. This feature not only reduces the delay in decoding the video, but also

reduces energy consumed due to transmissions and computations, especially on resource-

scarce mobile devices.

5.3.2 Case Study: Application in a Mobile Network

In this section we tailor our streaming system to a state-of-the-art mobile communication

network. Specifically, the recent generations of mobile communication networks such as 4G

cellular networks use Orthogonal Frequency Division Multiple Access (OFDMA)-based multi-

carrier modulation in their downlinks. Hence, we apply our model to OFDMA to compare

its performance with other proposed FEC models for layered video multicast.

With OFDMA, the communication channel is divided into sub-channels. Sub-channels

with the same characteristics, including signal strength, loss rate, and coverage, are grouped

into a frequency range. Data packets are transmitted in sub-channels according to their

importance and the location of their destination [11]. In a live streaming session, the video

content is broadcast to all participating devices from the source, much like TV broadcasting.

Once the video content becomes available at the source, the coded blocks are computed and

stored. The original blocks and coded blocks are then pushed into the OFDMA network that

is responsible for delivering the blocks to mobile devices. Here, we explain how the blocks

are assigned to sub-channels in an OFDMA network for transmission.

In reality, an OFDMA network consists of thousands of sub-channels and many frequency

ranges. Without loss of generality and to keep the example illustrative, we assume that

there are eight 512 kbps sub-channels who are grouped into four frequency ranges. The area

covered by the OFDMA network is divided into four multicast groups with group 1 covered

by all four frequency ranges, group 2 covered by the first three frequency ranges, group 3

covered by the first two ranges, and group 4 covered by only the first range. The loss rate of

each range also varies from group to group. The group division and the corresponding loss


rates are given in Table 5.3.2. For the sake of simplicity, we assume that the video packets

have the same importance. We will cover the case of video packets with variable importance

in Sec. 5.3.3.

Table 5.5: Packet loss rate of the multicast groups

Multicast Groups   Frequency Range 1   Frequency Range 2   Frequency Range 3   Frequency Range 4
Group 1            0.1                 0.1                 0.1                 0.2
Group 2            0.1                 0.1                 0.2                 —
Group 3            0.1                 0.2                 —                   —
Group 4            0.2                 —                   —                   —

To keep the discussion close to reality, we consider the full-HD Pedestrian Area (PA)

video sequence from Sec. 3.1. We encode the PA video sequence with a layered setting

of two spatial layers, four temporal layers, and four quality layers, i.e., DTQ = (1, 3, 3).

Furthermore, we set the quantization parameter of each quality layer such that the bitrate

of the layered video containing that quality layer and all spatial and temporal layers is less

than the maximum bandwidth that can be received by each multicast group. Table 5.6

provides the bitrate and objective video quality of each subset of PA video layers when there

is no packet loss. In this table, PA-N stands for the video stream with the quality layers

up to and including quality layer N and QP is the quantization parameter of the respective

quality layer. All video subsets are full-HD (1920× 1080) and 24 frames per second.

Table 5.6: The specification of PA layered video substreams (full-HD, 24 fps)

Video Substream   DTQ         QP   Bitrate (Kbps)   Y-PSNR (dB)
PA-0              (1, 3, 0)   44   817.2            30.79
PA-1              (1, 3, 1)   37   1771.8           34.68
PA-2              (1, 3, 2)   33   2594.2           38.19
PA-3              (1, 3, 3)   28   3464.0           43.25

In GF(2), our experimental results indicate that the best block size is 256 bytes (or 2

Kilobits of video data). Table 5.7 provides the actual bit rates and the number of original

blocks and coded blocks needed for different video substreams when transmitted on a channel


with different loss rates. Furthermore, it reports the amount of bandwidth that is needed to

receive and decode each video substream successfully with a probability of 0.99999.

Table 5.7: PA substream specification for different quality layers

Quality Layer                          0        1        2        3
Bitrate (Kbps)                         817.2    954.6    822.4    969.8
# of original blocks per GOP           137      160      138      145
# of coded blocks per GOP (r = 0.1)    14       16       14       15
# of coded blocks per GOP (r = 0.2)    28       32       28       29
α for 0.99999 decodability             6        7        7        6
Bandwidth needed (Kbps) (r = 0.1)      950.0    1108.8   956.1    1127.4
Bandwidth needed (Kbps) (r = 0.2)      1039.6   1212.3   1045.7   1231.0

We assign layers to sub-channels for transmission according to the importance of the

layer and the quality of the sub-channels. In order to have a good coverage and overall

viewing experience, the video substream with the first quality layer (here, PA-0) is assigned

to the sub-channels with the best coverage and quality, i.e., sub-channels in frequency range

1, as shown in Fig. 5.10. The idle capacity of frequency range one, if there is any, is

used to address the packet loss rate by sending the redundant coded blocks interleaved

with the original blocks as illustrated in the previous section. If there is any bandwidth

remaining after proper number of redundant coded blocks are sent, such that the highest

packet loss rate among the multicast groups is addressed, the remaining bandwidth is used

to send video packets from the next quality layer. If there is not enough bandwidth in the

channels associated to a multicast group to send a layer with proper amount of protection,

the protection data uses the next available channel as shown in Fig. 5.3.2. As a rule of thumb,

proper reception of a reference layer has priority over reception of the next dependent layer.

In essence, this algorithm is greedy, trying to support the multicast groups with lower quality

connections. Therefore, it starts from the most important layer and assigns it to the best

sub-channel(s) that is still available. According to Table 5.5, the highest loss rate for all

multicast groups is 0.2, so the number of coded blocks served is for this rate. The loss rate


can change over time. In Fig. 5.10, the loss rate for all multicast groups is reduced to 0.1

after 1 second. Our system dynamically adjusts the number of coded blocks to the new loss

rate, resulting in less network traffic for segment 4.

Figure 5.10: Assigning layers in OFDMA.
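
A simplified sketch of this greedy assignment (illustrative only; layer demands would include both the layer bitrate and the FEC bandwidth for the worst loss rate it must survive, e.g., the figures in Table 5.7):

    def assign_layers(layers, subchannels):
        """layers: list of (name, demand_kbps) ordered from most to least important.
        subchannels: list of remaining capacities in Kbps, ordered best coverage first.
        Returns layer name -> list of sub-channel indices it occupies."""
        assignment = {}
        for name, demand in layers:                   # most important layer first
            used = []
            for idx, capacity in enumerate(subchannels):
                if demand <= 0:
                    break
                if capacity <= 0:
                    continue
                take = min(demand, capacity)
                subchannels[idx] -= take              # spill into the next channel if needed
                demand -= take
                used.append(idx)
            assignment[name] = used
        return assignment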

5.3.3 Performance Evaluation

We evaluate the performance of the proposed model in a trace-driven simulated LTE network

according to the OFDMA model. The total channel capacity is 4 Mbps, and the channel

is divided into eight 512 Kbps frequency ranges. The capacity of each sub-channel is 256

bytes, which is small enough to ensure all bandwidth will be used to carry the video stream.

The reference video stream is the Pedestrian Area (PA) video sequence with four quality

layers as presented in Table 5.6. For comparison purposes, we also implement the case of


no source coding, as well as the block-diagonal [127] and ladder-shaped [138] coding schemes

for layered video coding. Block-diagonal [127] and ladder-shaped [138] coding schemes are

illustrated in Fig. 5.11.


Figure 5.11: Block diagonal and ladder shaped coefficient matrices for two video layers L1 and L2, in which each video segment is divided into k1 and k2 data blocks, respectively. These matrices are multiplied with the k1 + k2 data blocks to create k1 + k2 reconstruction blocks and d1 + d2 redundant coded blocks for forward error correction.

For each video segment of each layer, the block diagonal coefficient matrix is a combi-

nation of a lower triangular and a general matrix, as illustrated in Fig. 5.11(a), where each

coded block i of layer l is a linear combination of blocks 1 to i of the same video segment.

There will be exactly k such coded blocks for each segment that consists of k original blocks.

Additional redundant coded blocks are produced using Reed-Solomon coding over GF(256)

and are sent to recover any loss during the transmission of the first k coded blocks. The

lower triangular design of this approach is expected to provide progressive decoding when

receiving the first k coded blocks. However, if any of the first k coded blocks are lost, the

decoding must wait for the redundant coded blocks to recover the loss, which introduces an

extra delay.

The ladder shaped coefficient matrix is an extension over the block diagonal coefficient

matrix, as illustrated in Fig. 5.11(b). It trades computation and bandwidth for increasing

redundancy in higher priority layers by including the higher priority layers in the coded

blocks for lower priority layers [138], i.e., each coded block i of layer l is a linear combination


of blocks from the same video segment in layers 1 to l−1 and blocks 1 to i in layer l. Hence,

the base layer of SVC is decodable with higher probability since it is included in every coded

block.

As can be seen in Fig. 5.11, the block-diagonal coding is applied across GOPs, and the

ladder-shaped coding is applied across GOPs and layers. To keep the comparison fair, we

use systematic random linear Fountain coding [87] over GF(2) for all the methods. The

block size is set to 256 bytes and α for the Fountain code is set to 16 for every 256 blocks for 99.999% decodability, i.e., one coded block is added for every 16 blocks to compensate for possibly linearly dependent coded blocks. Each video segment represents 1 second of the video playback and contains three GOPs, where according to Table 5.7 each GOP requires 137 – 160 original blocks for different quality layers.

In order to keep the simulation close to reality, we design the simulator to use traces of

packet loss rate from a real mobile network and the energy profile from a Samsung Galaxy

Nexus I9250 smartphone. We recorded the loss rate every second while driving around in a

car. The recorded packet loss rate trace is provided in Fig. 5.12. We use the same channel division given in Table 5.5, in which the loss rate in the low-quality sub-channels

is twice that in the high-quality sub-channels. The reference smartphone was equipped with

a dual-core 1.2 GHz processor, 1 GB of memory, a 1750 mAh Li-Ion battery, and Android 4.2

(Jelly Bean) as the operating system. The measured energy profile is reported in Table 5.8.

Table 5.8: Energy profile of the reference mobile device.

Operation          Energy Consumption (per GB)
4G Download        11.3297%
GF(2) Decoding     0.0613%
GF(256) Decoding   3.6193%

Video Quality Performance

We begin the evaluation with the objective quality of the transmitted video offered by each

coding scheme when varying the loss rate from 0 to 0.2 in all frequency ranges. We observe


Figure 5.12: A three-minute trace of the packet loss rate.

that without any erasure coding, the Y-PSNR level drops to 15 dB when the average packet

loss rate is 0.2. Fig. 5.13 shows that source erasure codes can preserve the video quality

over lossy channels. The proposed scheme outperforms the block-diagonal scheme by 0.9 dB

when transmitting the PA-0 video sequence, and by 1.25 dB when transmitting the PA-3

video sequence. This is due to the fine-granular coding scheme used in the proposed scheme

at the GOP level. The video quality obtained by the ladder shaped coding scheme is on

average 0.3 dB better than the proposed scheme when the packet loss rate is as high as 0.2.

This can be justified by considering that the higher priority layers can be recovered not only

by extra coded blocks from the same layer, but also by coded blocks from all subsequent

layers. The proposed scheme trades 0.3 dB Y-PSNR (less than 1%) for a significant decrease

in overhead, delay, and battery consumption of mobile devices, as shown next.

Transmission Overhead

In this experiment, we compare the transmission overhead, i.e., the extra coding information sent along with the coded blocks, of the proposed scheme with that of the block diagonal and ladder shaped coding schemes. As shown in Fig. 5.14, the proposed scheme introduces the least overhead. The savings are due to the use of channel state information (packet loss rate, as presented in Fig. 5.12) for each video layer. The amount of coded content served

[Figure 5.13 plots omitted: Average PSNR (dB) versus Average Packet Loss Rate (%) under the ladder shaped, block diagonal, and proposed schemes; panels (a)–(d) show the PA video sequence with one, two, three, and four quality layers, respectively.]

Figure 5.13: The objective video quality when using different layered protection mechanisms and varying the loss rate from 0% to 20%.


Figure 5.14: Transmission overhead of different multicast groups using different layered protection mechanisms.

is dynamically adjusted for each GOP according to the current loss rate. This effectively

avoids sending unnecessary coded content. In summary, the proposed scheme reduces the

transmission overhead by at least 15% compared to block diagonal and ladder shaped coding

schemes.

Delay

A live streaming system is also sensitive to delay. We consider the total delay as the sum-

mation of the transmission delay, the waiting delay, and the decoding delay. Transmission

delay is the transmission time of encoded blocks and the corresponding coding information.

Waiting delay is the time that a mobile device spends waiting for coded blocks if some of

the k original blocks of a GOP are lost during transmission. The decoding delay is the time

taken by a mobile device to reconstruct the missing blocks before sending the GOP to the

video player. Please note that in this evaluation we do not consider the encoding delay, i.e.,

we assume that the stream source has the proper encoded blocks ready to be transmitted


Figure 5.15: Transmission delay and waiting delay of different multicast groups using different layered protection mechanisms.

when the channel state information arrives from the mobile device.

Fig. 5.15 compares decoding delay, transmission delay, and waiting delay for one video

segment. As expected, the dynamic adjustment with the channel noise level and also the

interleaved transmission of coded packets nearly eliminates the waiting delay for the proposed

scheme. Furthermore, transmission delay is significantly lower than that of other schemes,

due to the lower transmission overhead and GOP-level block placement algorithm. Finally,

the decoding delay of the proposed scheme is significantly less than that of block diagonal and

ladder shaped schemes, since these schemes encode more packets together, hence increasing

the computational complexity of encoding and decoding tasks. In summary, the proposed

scheme reduces the delays by at least 66% compared to the block diagonal coding scheme, and

88% compared to the ladder shaped coding scheme.


Energy Efficiency on Mobile Devices

The energy efficiency of the proposed scheme follows from its low transmission and computation overheads. We note that due to the use of systematic coding over GF(2), the energy consumed by the decoding process is negligible for the proposed coding scheme.

The system primarily serves the SVC video in its original form to reduce the need for decoding

on mobile devices. The amount of coded content served is dynamically adjusted over time

according to the current loss rate, and a mobile device performs decoding only if any of the

original blocks were lost. Overall, the energy usage of the proposed scheme is close to the

case with no coding. The energy savings compared to the block-diagonal and ladder shaped schemes are up to 5% and 11%, respectively, assuming that both coding schemes are

modified to use systematic coding over GF(2).


Figure 5.16: Energy consumed by the reference mobile device per hour of streaming session.

Computational Complexity

We conclude the performance evaluation with a study on the computational complexity of

the proposed adaptive streaming model. In this experiment, we compare the computational


cost of block diagonal coding, ladder shaped coding, and our proposed coding scheme, all

using systematic random linear Fountain codes. We set up the source on a medium Amazon EC2 instance to stream the video content in the appropriate form according to the loss rate model characterized by our trace (Fig. 5.12). We scale the mobile network from one device to

300 devices. The simulated mobile devices are randomly scattered in four multicast groups,

and initiate the streaming session for the reference video with the source at arbitrary times.

As depicted in Fig. 5.17, the encoding time is constant in the proposed scheme, regard-

less of the network size. This can be justified according to Fig. 5.9, which depicts that the

proposed scheme generates the coded blocks required to support all the possible loss rates

and the respective placements of coded blocks at once. Nonetheless, when the network con-

sists of less than 45 mobile devices, the block diagonal or the ladder shaped coding schemes

are more cost effective, and there is an upfront cost for the fine-granular coding scheme.

However, once the coded blocks are generated, they are used throughout the entire session, i.e.,

no more coding is required on the server side. In contrast, in the other two coding schemes,

coded blocks are generated on demand, which introduces delay and unnecessary computation.

In summary, the proposed scheme scales well with a slight trade-off in small networks.

5.4 Summary

In this chapter, we turned our attention to the second phase of the life cycle of a video

streaming episode, i.e., service transmission from the edge of the cloud to the end user

device. We specifically investigated the problem of high quality video delivery over noisy

wireless communication channels. Compared to the related research works listed in Sec.

2.2.3, this research is the first work that deeply investigates the internal design of the video

codec in use. As demonstrated in the experimental results, ignoring the internal design

of video codec standards leads to less effective UEP due to inaccurate estimation of the

importance of visual information encapsulated in each video packet. In this chapter, we

tried to address this inefficiency.

Figure 5.17: Time needed to prepare all the redundant coded blocks (encoding time over GF(2) in seconds vs. number of mobile devices for the ladder shaped, block diagonal, and proposed schemes).

Toward this goal, we first looked deeper into the coding and prediction mechanism of

the state-of-the-art layered video coding standard, i.e., SVC. Next, towards high-quality video

streaming over wireless networks and smarter protection of video packets over noisy commu-

nication channels, we proposed a novel coding and dependency aware unequal error protec-

tion algorithm. The proposed algorithm calculates the importance of different video packets

and assigns proper protection to each video packet. Experimental results

show that the proposed algorithm outperforms the state-of-the-art unequal error protection

algorithms in terms of the visual quality of the transmitted video. Finally, we completed

the proposed UEP model by extending the UEP problem from the unicast scenario to the multicast scenario, in which the full potential of layered video coding is utilized by allowing the

transmission network to multicast one copy of the layered video for groups of heterogeneous

mobile devices. To this end, we proposed a new technique to dynamically adjust and com-

bine the protection FEC packets for reference and dependent video layers for video multicast


in mobile communication networks. The experimental results show that the proposed model

improves the quality of the transmitted video and also reduces the energy consumption on

mobile phones.

In the following chapter, we investigate cooperative video streaming among adjacent

mobile phones. This problem is part of the service consumption phase in the life cycle of a

video streaming episode, as illustrated in Fig. 1.1.


Chapter 6

Video Reception in Smartphones

The final part of a video transmission session is reception and playback of the video on

the end user device. When it comes to mobile video streaming, the variable loss rate and

bandwidth fluctuation of cellular networks pose challenges to video streaming, since such

systems are very sensitive to delays and losses. On the one hand, an extensive body of

research is focused on improving the signal quality in cellular networks. On the other hand,

another domain of research is concentrated on utilizing short-range links such as WiFi and

Bluetooth to increase the overall link capacity around individual smartphones.

In this chapter, we are interested in the latter approach. The utilization of short-range

links leads to higher data rate and shorter delays, essential performance metrics for mul-

timedia streaming. Furthermore, such an arrangement improves the energy efficiency on

smartphones [48], since the energy per bit ratio of short-range links is less than that of cel-

lular links. Two main approaches are suggested to use short-range links: WiFi offloading

through Internet-connected WiFi routers, and using ad-hoc WiFi networks among collabo-

rating adjacent smartphones.

In this chapter, we investigate the utilization of ad-hoc WiFi networks of collaborating

smartphones. Such an ad-hoc network can be created when a group of smartphones in

proximity of each other are interested in playing the same video stream at the same time,

e.g., a sports match. This system utilizes cellular links to carry streaming content from

the Cloud to smartphones, and WiFi links to enable cooperation among smartphones. As

discussed in Sec. 2.2, this idea has been investigated in several research works to provide

short delays [16, 63], to share error recovery codes [80], to receive the content over multiple

paths [129], and to pre-fetch the content based on social ties [60]. Compared to past research

works, the research proposed in this chapter copes with the variable loss rate and bandwidth fluctuation by taking advantage of a novel two-level coding scheme. Furthermore, it reduces the energy consumption by offloading coding operations to the Cloud and by using a lightweight distributed scheduling algorithm to manage collaboration and content sharing among the nodes.

Figure 6.1: An overview of a collaborative streaming system for smartphones (cellular downlinks from the video source, and WiFi/Bluetooth cooperation links among the smartphones).

Fig. 6.1 depicts an overview of a collaborative multimedia streaming system for smartphones. The multimedia content is made available by the media streaming server in the Cloud,

and is streamed to smartphones through cellular networks. Devices that are within each

other’s WiFi/Bluetooth signal range may be considered as a group. Assuming all devices

are cooperative, they can form a data swarming session to share their received content over

WiFi/Bluetooth connections. Such a way of combining different wireless communication

channels was initially proposed in [15,62]. The benefit of doing so is to increase the receiving

throughput of each device, while reducing the demand on the cellular network.

Towards resource efficient video streaming, in this chapter we seek to reduce the energy

consumed by video streaming on smartphones by utilizing multiple wireless communication

channels as described above. Without loss of generality, we resort to 3G as our cellular

network technology and WiFi as our short-range communication. We will focus on the

interactions among smartphones within a single WiFi network, as the design will be the same

for each WiFi network.


There are two main challenges in designing an energy-efficient collaborative video stream-

ing system. First, all wireless connections are subject to channel fading and have considerably

higher loss rates than wired connections. For each lost packet, a retransmission has to be

performed, if the conventional forward error correction (FEC) cannot recover the distorted

packet. Many coding techniques have been proposed to cope with such losses, without im-

posing additional delays or retransmissions. In particular, network coding (NC) is known

for its advantage in maximizing the gain of each retransmission [101, 103] in wireless com-

munication, since it allows any retransmitted packet to be used to recover any lost packet.

Second, the effectiveness of bandwidth utilization, in such a collaborative network of

smartphones, is subject to segment scheduling among various links. Towards efficient band-

width utilization, Keller [72] proposed the MicroCast system, in which one specific phone

manages the scheduling of segment transmissions on all other phones in the network. The

design principle is to avoid using cellular links as much as possible, and to encourage data

swarming within the WiFi network.

Although MicroCast achieves good streaming rates by utilizing both cellular and WiFi

channels, as well as the overhearing feature in wireless communication, it suffers from two

main deficiencies. First, the scheduling algorithm is a centralized algorithm that introduces

unnecessary messaging and management overhead. Second, it is not energy efficient since it

imposes unnecessary computational burdens on the phones due to the large field size used

in NC operations.

In the proposed system, the energy saving is achieved through a two-level systematic NC

and transmission scheme. At the top level, the multimedia content is streamed from the

Cloud hosting the video source to a WiFi group formed by smartphones. At the bottom

level, received content is shared among smartphones within a WiFi network. The content is

transmitted in both verbatim form and coded form in Galois Field GF(2). Furthermore, to

manage collaboration and content sharing in the WiFi network, we propose a light-weight


distributed scheduling algorithm.

Finally, the proposed system is mathematically modeled as an optimization problem

for optimal resource allocation and scheduling. The optimal rate allocation and scheduling

(RAS) algorithm determines the amount of data and the data to be transmitted on each

link. Overall, the system minimizes both the streaming traffic in the cellular network and

the energy consumed by streaming applications on mobile devices. The experimental re-

sults show that significant energy saving is achieved in the proposed system. Moreover, the

proposed RAS algorithm prolongs the streaming session for the entire cooperative group.

6.1 Energy-Efficient Collaborative Streaming

In this section, we propose our energy-efficient streaming system for smartphones, hereafter

referred to as nodes. In this system, the energy saving is achieved through utilizing a two-

level systematic NC scheme and the computing power of the Cloud. Without making any

assumption regarding the video codec used to generate the video stream, our system sched-

ules the transmission at the segment level. Each video segment represents a small duration

(say 1 second) of the video playback. In a live streaming session, all nodes share the same

window of interest, i.e., the same set of segments. The remainder of this section explains the

two levels of NC, the transmission scheme, and the heuristic scheduling algorithm in detail.

The optimized resource allocation and scheduling algorithm is discussed in the next section.

6.1.1 General transmission scheme

In this section, for the sake of simplicity, we assume a simple pull-based streaming scheme,

as shown in Fig. 6.2. Based on a scheduling algorithm, nodes explicitly request segments

that are due for playback soon from the video source, which is located in the Cloud. Since

transmissions over cellular links consume more energy than transmissions over WiFi links,

to conserve battery lifetime, nodes ideally should collectively download a single copy of


the video and exchange segments with each other over WiFi links. A heuristic scheduling

algorithm is detailed in Sec. 6.1.3. The optimized scheduling algorithm is covered in the next

section.

Upon receiving a request, the video source passes the segment to a network coding engine

in the Cloud, hereafter called the NC coder. The NC coder serves the segment using systematic code in Galois Field GF(256) to minimize the transmission overhead and also the need for encoding and decoding at both the sender side (the NC coder) and the receiver side (smartphones). We assume that the link between the video source and the NC coders is a high

throughput and low delay link. More detail is provided in Sec. 6.1.2.

On the receiver side, the node can reconstruct the video segment as soon as k linearly

independent blocks are received over the cellular network. In a live streaming session, all

nodes play roughly the same segment and are interested in receiving the same set of segments

that are due for playback in the immediate future. Hence, nodes form a swarming session, in

which they exchange received segments with each other over the WiFi network. To cope with

losses in this network, each segment is served in both verbatim form and coded form. Similar

to the NC coder, a node divides a segment into k blocks of the same size and serves the

segment using systematic code. However, to minimize the encoding and decoding cost, the

coding operations are performed in GF(2). Receivers of the segment only need to perform

the decoding process if the first k blocks received are not all verbatim blocks. More detail is

provided in Sec. 6.1.2.
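For illustration only, the following Python sketch captures this receive-and-decode decision on the receiver side; the function and variable names are assumptions made for the sketch, and the rank test and actual decoding are left to a coding library.

def receive_segment(block_stream, k):
    """Collect blocks of one segment from an iterable of (is_verbatim, data)
    pairs and report whether any decoding is needed (illustrative sketch).
    A real receiver would also verify linear independence of coded blocks."""
    kept = []
    for is_verbatim, data in block_stream:
        kept.append((is_verbatim, data))
        if len(kept) == k:          # assume the k kept blocks are independent
            break
    needs_decoding = not all(v for v, _ in kept)
    return kept, needs_decoding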

6.1.2 Two-Level Coding Scheme

Overall, nodes collaboratively request different segments from the Cloud and exchange with

each other in the WiFi network. A video segment is delivered in two phases. In the first

phase, blocks of the segment are transmitted in both verbatim form and coded form in

GF(256) over the cellular network. In the second phase, the node which receives sufficient

blocks of this segment decodes the segment and then serves blocks of this segment in both verbatim form and coded form in GF(2) over the WiFi network.

Figure 6.2: Streaming segment i from the Cloud to all collaborative nodes in the proposed transmission scheme (a node requests segment si from the video source; the NC coder codes εc blocks in GF(256), and k blocks plus the εc coded blocks are sent over the cellular downlink; the seed reconstructs si, codes εw blocks in GF(2), and shares k blocks plus the εw coded blocks with the other nodes over WiFi).

Top level: coding in the Cloud

We assume that the video stream is generated by the video source at rate v bps, for a

group of m cooperating smartphones N = {N1, N2, · · · , Nm}. Each node Ni ∈ N has a

cellular connection with download capacity Ci bps and packet loss rate li from the cellular

towers. We assume that these connections have the required error correction mechanism in

place, i.e., the cellular wireless channels are erasure channels. Furthermore, we assume that

the cellular towers manage the wireless interference to provide nodes with interference-free

connections. Consequently, we can assume that there are m parallel connections from the

cooperating nodes to the NC coders in the Cloud.

To ensure a complete data swarming in the WiFi network, a complete copy of the video

must be streamed from the Cloud through the cellular network, i.e.,

∑_{i=1}^{|E|} Pi ≥ (1 + li) F    (6.1)

where E is the set of cellular links carrying the video stream, Pi is the total number of bytes

transmitted on each link, F is the video file size, and li is the loss rate of each link.


Upon receiving a request for segment sj from node Ni ∈ N , the NC coder divides segment

sj into kj original blocks of the same size and serves nj = kj + εc systematic coded blocks

over the wireless downlink to the node. The value of εc is determined by the loss rate li. In

general, the expected number of blocks sent over a cellular link for a segment sj is:

E[nj] = kj + εc = kj + (E[li] / (1 − E[li])) · kj    (6.2)

where E[li] is the expected loss rate from the NC coder to a node Ni. For example, if the

loss rate is 0.2, there should be 1.25 ∗ k blocks served by the source in order to ensure at

least k blocks are received. The first k blocks are served in their original form, and the

additional 0.25 ∗ k blocks will be served in coded form in order to recover any lost during

the transmission of the 1.25 ∗ k blocks. To generate coded blocks, the NC coder produces

each coded block as a linear combination of the k original blocks using random coefficients

from a finite field GF(2x).
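As a quick sanity check of Eqn. 6.2, the redundancy budget for a given expected loss rate can be computed as in the sketch below; this is illustrative only, and the function name and the rounding up to whole blocks are assumptions, not part of the evaluated system.

import math

def redundant_blocks(k, expected_loss):
    """eps_c of Eqn. 6.2: extra coded blocks so that, in expectation,
    k blocks of a segment survive a link with the given loss rate."""
    return math.ceil(expected_loss / (1.0 - expected_loss) * k)

# Example from the text: k = 100 and a loss rate of 0.2 give 25 extra
# coded blocks, i.e., 1.25 * k blocks served in total.
print(redundant_blocks(100, 0.2))   # -> 25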

Since each link i has a different loss rate 0 ≤ li ≤ 1 and the link quality varies over

time, the NC coder may have to generate and send different number of coded blocks for

each segment. To better utilize the bandwidth in cellular network, the coding is performed

in a larger field of GF(256) to ensure that with high probability the redundant coded blocks

received by a node are linearly independent. A node stops receiving and reconstructs the

video segment as soon as kj = k′j+ε′c linearly independent blocks are received over the cellular

network, where 0 ≤ k′j ≤ kj is the number of original blocks received and 0 ≤ ε′c ≤ εc is the

number of coded blocks received. Since RLNC is based on Reed-Solomon code, a maximum

distance separable code, the need for explicit retransmission of a lost packet is eliminated as

all transmitted blocks are equally important and useful in decoding the original segment.

Bottom level: coding in WiFi

Now, segment i is considered in the WiFi network. The node that receives this segment,

hereafter referred to as the seed for the segment, is now responsible for disseminating the


segments to other nodes in the WiFi network. In MicroCast [72], the seed broadcasts coded

blocks produced in GF(256) to all nodes in the WiFi network. If a node does not receive

enough coded blocks for decoding a segment, it explicitly requests the seed to serve more

coded blocks of the segment. We note that this system requires all nodes to perform the en-

coding and decoding operations in GF (256). However, since the system is taking advantage

of the overhearing feature in the WiFi network, and the network is formed by nearby nodes,

the loss rate is expected to be much lower than the cellular network. Our experiment on the

Galaxy Nexus phone indicates that encoding and decoding in GF(256) for every GB of data

consume 5.06% and 3.6% of the battery, respectively. Given the size of the WiFi network

and the lower loss rate, we believe that NC in GF(2) is sufficient. For this reason, we argue

that it is not necessary to perform NC in GF(256) in the WiFi network.

In our system, a seed of a segment sj first reconstructs the original segment. The seed then

sends the kj blocks from the segment over the WiFi network without encoding them. Nodes

that successfully receive these kj blocks can reconstruct the segment without performing any

decoding. Due to channel losses, some nodes may not receive the kj original blocks, and they

may miss different blocks. Reconciling the missing blocks for each node will be costly and

requires a direct communication between the node and the seed. Instead, the seed produces

and sends εw coded blocks in addition to the kj original blocks. The coded blocks at this

level are produced by simply XORing a random subset of the kj blocks, i.e., encoding in

GF(2). The value of εw is determined by the loss rate lw in the WiFi network. Hence, the

expected number of blocks sent in a WiFi network for a segment sj is:

E[mj] = kj + εw = kj + (E[lw] / (1 − E[lw])) · kj    (6.3)

where E[lw] is the expected loss rate in the WiFi network. Here, we assume that an algorithm

for estimating the loss rate is in place. After a timeout period, if a node still does not have

sufficient blocks to reconstruct the original segment, it will ask the seed to send more coded


blocks.
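A minimal sketch of this GF(2) encoding step is shown below, assuming equal-size byte blocks; the random subset-selection policy and function name are illustrative assumptions, not prescribed by the scheme.

import random
from functools import reduce

def gf2_coded_block(original_blocks):
    """One coded block in GF(2): the XOR of a random non-empty subset of
    the k original blocks (each block is a bytes object of equal length)."""
    subset = [b for b in original_blocks if random.random() < 0.5]
    if not subset:                      # avoid the useless all-zero combination
        subset = [random.choice(original_blocks)]
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*subset))

# Example: a seed holding k = 4 reconstructed blocks prepares eps_w = 2 extras.
blocks = [bytes([i] * 8) for i in range(4)]
extras = [gf2_coded_block(blocks) for _ in range(2)]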

Overall, with the two-level NC and transmission scheme, the encoding process is offloaded

to the Cloud, and the decoding process will take place just for the lost blocks of each segment

once in GF(256) at the seed and once in GF(2) at the other nodes. From our experiment

on the Galaxy Nexus phone, both encoding and decoding in GF(2) for every GB of data

consume approximately 0.06% of the battery, which is much more power efficient than that in

GF(256) on each node. In this system, to disseminate a segment sj to all smartphones, there

are E[nj] = kj/(1 − li) blocks traveling from the Cloud to the seed, and E[mj] = kj/(1 − lw) blocks shared in the WiFi network. This is very close to the minimum transmission

required to deliver a segment to all nodes, i.e., nj + mj, where nj is the number of blocks

sent over a cellular link with the minimum loss rate, and mj is the number of blocks shared

in a WiFi network with the minimum loss rate.

6.1.3 Distributed Scheduling Algorithm

In MicroCast [72], each node requests video segments from the Cloud according to the

segment assignment that is centrally managed by a specific node. If any node fails to retrieve

a particular segment, the managing node is informed and re-assigns the download task to another node.

However, given the overhearing feature in wireless communication, the segment download

scheduling can be managed in a distributed way among the nodes in the WiFi network.

Here, we propose such a distributed algorithm to coordinate the segment downloads.

Each node randomly picks a missing segment sj in the shared window of interest, with

preference toward segments that are due for playback immediately. Before pulling segment

sj from the Cloud, the node either broadcasts a message or unicasts the message to a random

node in the WiFi network. Upon receiving or overhearing the message, all nodes will avoid

scheduling segment sj for transmission. As soon as the node that initiated the transmission has completely received the segment, it informs other nodes with another message. If such a


message is not received or overheard within a timeout period, all nodes will treat segment

sj as a regular missing segment and proceed with regular segment transmission scheduling.

This distributed algorithm not only coordinates all nodes to collectively download a copy

of the video over time, but also allows load balancing among the nodes. The nodes with

longer battery lifetime or better cellular connections can be more active in pulling segments

from the Cloud, while nodes with low battery or poor cellular connections can benefit from

the data swarming in the WiFi network. However, this is based on the assumption that all

nodes are cooperative, and are not selfish or malicious in any form. The proposed distributed

scheduling algorithm is presented in Alg. 1.

In this algorithm, the Schedule() function decides whether a node should retrieve a

segment from the Cloud. The decision may be based on different heuristic criteria, or based

on a mathematical optimization model. In what follows, we propose the optimization model

and solve the optimization problem for resource allocation and scheduling as a centralized

algorithm. Later in Sec. 6.3 we compare the performance of the proposed optimal model

with three heuristic scheduling algorithms.

6.2 Optimal Resource Allocation and Scheduling

In this section, we present a mathematical model for the collaborative streaming system pro-

posed in the previous section. The proposed system is modeled as an optimization problem

for optimal resource allocation and scheduling. The optimal rate allocation and scheduling

(RAS) algorithm determines the amount of data and the data to be transmitted on each link.

Overall, using optimal RAS the proposed system minimizes both the streaming traffic in the

cellular network and the energy consumed by streaming applications on mobile devices.

Algorithm 1 Distributed Scheduling Algorithm on node Ni

Require: S: segment list from the window of interest
Require: SM: list of missing segments, initially equal to S
Require: SQ: list of segments requested by nodes in WiFi
Require: Schedule(): schedules a segment transmission from the Cloud

 1: while streaming do
 2:   if Schedule() == true then
 3:     Select a segment sj from SM
 4:     Announce "download sj"
 5:     Move sj from SM to SQ
 6:     Download sj from the Cloud
 7:   end if
 8:   if sj received then
 9:     Remove sj from SQ
10:     if received from cellular downlink then
11:       Send sj to WiFi
12:     end if
13:   end if
14:   if timeout sj then
15:     Announce "missing sj"
16:     Move sj from SQ to SM
17:   end if
18:   m: an announcement received or overheard
19:   if m == "downloaded sj" then
20:     Move sj from SM to SQ
21:   else if m == "missing sj" then
22:     Move sj from SQ to SM
23:   end if
24: end while

6.2.1 Modeling the Cooperative Streaming System

A video source S, hosted in a cloud, provides the streaming service at the rate of r bps

to a set of cooperating mobile devices N . The video is divided into segments, representing

a short duration of the playback. Each mobile device maintains a playback buffer storing

segments that are due for playback in the immediate future. The size of this buffer also

marks the window of interest of each mobile device. To achieve smooth playback on the

mobile devices, each segment must be received and decoded prior to its playback deadline.

We now present the two layers of the cooperative streaming system: the cellular network

and the cooperative network.

In the Cellular Network

The connection between mobile device i, referred to as node ni hereon, and the video source

is a cellular link gi with capacity ci and loss rate pi. We further categorize nodes into

two groups: active nodes and passive nodes. An active node communicates with the video

source to download video segments with link capacity ci > 0. A passive node relies on active

nodes in the cooperative network for receiving the content, maybe due to limited battery

power, lack of data plan, or weak cellular signals. To keep the network model simple, we

assume that a link with ci = 0 still exist between the video source and the passive nodes.

Since interference management among cellular links is not the focus of this work, we assume

that the base station properly manages the wireless channels and assigns sub-channels to

connections. This allows us to work with |N | parallel, yet independent, connections (within

a sub-channel) between the source and the cooperative network.

To simplify the problem setup, we assume the links between video source and base stations

to be high-capacity, low-delay links. Different cellular standards suggest different channel

access control methods, including TDMA, CDMA, and OFDMA, and how to send the data

over the channel may affect the energy consumption of mobile terminals. However, minimizing energy consumption of mobile terminals through proper utilization of the wireless channel by


the base station is orthogonal to our discussion. Thus we assume that the base stations use

the channel efficiently.

Figure 6.3: Network model (active 3G links with ci > 0 and pi > 0; passive 3G links with ci = 0 and pi = 1; WiFi broadcast channels with ci,j > 0 and pi,j > 0).

In this system, the source serves only one copy of the video to active nodes, and the

nodes cooperatively exchange received segments in the network to deliver missing segments

on all devices. Hence, the source schedules the transmission of segments according to their

playback deadline and the rate allocation on each cellular link. As shown in Fig. 6.4, the

video source collects channel state and energy usage information from all active nodes, based

on which the RAS algorithm determines to which nodes the video segment will be pushed.

Hence each cellular link transmits different complete or partial segments within the current

window of interest. For clarity, we assume a segment represents one time unit of the playback.

Over time, exactly one copy of the video is streamed over the cellular network. Here, we

assume that an algorithm for estimating the next state of channel, based on the channel

state history, is in place.

Before pushing video segments, the source first divides a segment into k original blocks

of the same size and serves the segment using systematic code in Galois Field GF(256).

The coded blocks, a random linear combination of the original blocks, are sent along with

the original blocks to node ni that is selected by the source. Please note that there is an

extensive body of research on how to divide the video segment into blocks or how to encode them together to minimize signal distortion caused by packet losses. We assume such an algorithm is in place and focus on determining the number of blocks to be transmitted.

Figure 6.4: Modeling the cooperative streaming system (the video source runs the RAS algorithm on CSI and energy usage information collected from the nodes, unicasts coded segments in GF(256) over cellular links, and the nodes broadcast coded segments in GF(2) over WiFi).

The expected number of coded blocks required to ensure the segment can be successfully

recovered by node ni is given in Eqn. 6.4:

εi = (E[pi] / (1 − E[pi])) · k    (6.4)

where E[pi] is the expected packet loss rate of the cellular link gi. In other words, the source

sends k original blocks and εi coded blocks, and node ni receives 0 ≤ k1 ≤ k coded blocks

and 0 ≤ k2 ≤ k original blocks over link gi, and k1 + k2 = k. If k1 ≤ k, i.e., not all

of the k received blocks are original blocks, the missing original blocks can be recovered by

solving the linear system formed by the k1 original blocks and the k2 coded blocks. To better

characterize the effectiveness of the coding scheme, we define delivery rate di as the ratio of

the streaming data among all received data, including control messages, sequence numbers,

coding information, etc. Hence, in the cellular network, node ni receives segments at rate si

over gi, and the received data is recovered at rate di ∗ si.


In the Cooperative Network

In the cooperative network, each pair of nodes (ni, nj), ni, nj ∈ N , are reachable over a

WiFi link {wi,j}, with capacity ci,j > 0 and packet loss rate pi,j ≥ 0. Hence, the cooperative

network is a fully connected network. We assume that each WiFi link is bi-directional,

i.e., wi,j ≃ wj,i, with ci,j = cj,i and pi,j = pj,i. In the WiFi network, we consider the

time division multiple access (TDMA) model for the following reason. While IEEE 802.11

networks are based on orthogonal frequency-division multiplexing (OFDM), and can transmit

the information on multiple carrier frequencies, typical smartphones are equipped with

just one cellular and one WiFi antenna. This means that when broadcasting in the WiFi

network, each smartphone is in either the sending mode or the receiving mode, but not both.

To avoid collision in the broadcast session, only one node can send at a time, while all other

nodes are in the receiving mode. Hence, the broadcasting node can use all the available

frequency range of the WiFi channel. Nodes in our system take advantage of this property

and employ a broadcast mechanism to share received segments.

Upon receiving a complete or partial segment, node ni becomes the seed of this segment in

the cooperative network, and is responsible for disseminating it to all other nodes. Before doing

so, node ni first reconstructs the k original blocks and serves the segment using systematic

code in Galois Field GF(2). The coded blocks, XOR of a random subset of the original

blocks, are broadcasted in the WiFi network. Our experiments on the Galaxy Nexus phone

indicate that encoding and decoding in GF(2) is almost 100 times more energy efficient than

that in GF(256). Since the loss rate in the WiFi network is expected to be much lower than

that in the cellular network, the field size of 2 is sufficient. The expected number of coded

blocks required to ensure the segment can be successfully recovered by all other nodes is

given in Eqn. 6.5.

εi = (E[pwi] / (1 − E[pwi])) · k    (6.5)

where E[pwi ] is the expected packet loss rate when node ni broadcasts. The loss rate pwi is

the maximum of all WiFi links incident to node ni, i.e., pwi = max_j pi,j, ∀nj ∈ N−ni. In other words, node ni broadcasts k original blocks and εi coded blocks to ensure that all nodes in the WiFi network receive at least k (coded or original) blocks.

Table 6.1: Summary of notations

Notation     Description
N            a set of cooperating mobile devices
gi           cellular downlink for node ni ∈ N
wi,j         WiFi link between nodes ni and nj
ci, ci,j     link capacity (in bps) of gi and wi,j
pi, pi,j     packet loss rate of links gi and wi,j
di, di,j     delivery rate of coded data on links gi and wi,j
r            the streaming rate (in bps)
Pi(r)        power consumption (in Watt or W) of node ni
αi, βi, γi   energy efficiency factors (in J/bit) of node ni
si           cellular download rate (in bps) of node ni
bi           broadcast rate (in bps) of node ni in the WiFi network
τi           timeshare (in sec) of node ni in the WiFi network
ψ            shared session elongation coefficient
li           battery level of node ni (in J)
λ, η, µ, ξ   Lagrange multipliers

To be energy efficient, a node

may stop receiving once it has k linearly independent blocks. Up to now, every node in the

WiFi should have a copy of this segment, and its streaming is completed. In fact, node ni

broadcasts a segment at rate bi in the cooperative network, and node nj ∈ N−ni receives the

segment at rate bi(1− pi,j). The segment is recovered at rate di,jbi(1− pi,j), where di,j is the

delivery rate for transmitting coded blocks over wi,j.

Due to the use of network coding over cellular and WiFi links, all blocks of the same

segment are equally useful in recovering the segment. This feature simplifies the application

of the RAS algorithm, to be proposed in Sec. 6.2.3, since no data reconciliation is needed

to reconstruct a video segment. For clarity, we summarize the notations used in this section

in Table 6.2.1, in which some are already introduced and some will be defined in the energy

usage minimization problem (Sec. 6.2.2).


6.2.2 The Power Consumption Minimization Problem

To minimize the energy consumption in the cooperative network described in Sec. 6.2.1, we

need an optimal rate allocation and segment scheduling algorithm in the cellular network

and the WiFi network. To do so, we first formulate the power consumption minimization

problem. We define Pi(r) (in Watt or W) as the power consumed by a cooperative node ni ∈ N to receive the video stream at rate r, and P(r) = ∑_{ni∈N} Pi(r) as the objective power function. According to [29], we consider the energy consumption of data transmission and

coding operations as a linear function of transmission and coding rate. The objective function

is defined as follows:

Pi(r) = αi·si + βi·bi + γi · ∑_j dj,i·bj·(1 − pj,i),  ∀ni ∈ N, nj ∈ N−ni    (6.6)

where αi, βi, and γi are energy efficiency factors (in J/bit) of node ni when receiving and

decoding each bit of data over a cellular link, encoding and broadcasting each bit over

the WiFi network, and receiving and decoding each bit in the WiFi network. Note that

the idle power is not included in this formula, since the objective here is to minimize the

power consumption due to data transmission. However, the idle power may affect the lifetime of mobile devices, and we will discuss this in Sec. 6.2.4. According to the setup in Sec. 6.2.1,


the optimization problem can be formulated as follows:

min_{s,b} ∑_i Pi(r), ∀ni ∈ N    (6.7)
s.t. (1) r ≤ ∑_i si, ∀ni ∈ N
     (2) r ≤ si + ∑_j dj,i·bj·(1 − pj,i), ∀ni ∈ N, nj ∈ N−ni
     (3) si ≤ bi · min_j(di,j(1 − pi,j)), ∀ni ∈ N, nj ∈ N−ni
     (4) si ≤ di·ci·(1 − pi), ∀ni ∈ N
     (5) bi ≤ τi · min_j ci,j, ∀ni ∈ N, nj ∈ N−ni
     (6) ∑_i τi ≤ 1, ∀ni ∈ N

The first constraint requires the cumulative download rate over cellular links to be larger

than the streaming rate r. This is inevitable since the source must send at least one copy

of the video into the cooperative network. The second constraint implies that for smooth

playback at each node, the cumulative receiving rate from the cellular link and the WiFi

broadcast must be larger than the streaming rate r. The third constraint enforces the flow

conservation in the WiFi network, and requires each node to broadcast any packet received

over the cellular link in the WiFi network. The fourth constraint is the capacity constraint

in the cellular network, and specifies an upper bound for the receiving rate of node ni.

The last two constraints define the broadcast rate of each node in the WiFi network. The

fifth constraint indicates that a node cannot broadcast at a rate higher than the lowest

capacity among its outgoing WiFi links, to ensure that all nodes can receive the broadcast

data during the allocated time slot. As discussed in Sec. 6.2.1, broadcasting in the WiFi

network is essentially a TDMA model. According to this model, τi ≥ 0 in the last constraint

is the time share that node ni may use the WiFi channel. This formulation minimizes

power consumption (in Watt) instead of total energy consumption (in Joule). By definition,

J = W · s, i.e., energy consumption grows proportionally with the length of a streaming

session. Therefore, for any streaming session, minimizing power consumption will lead to


minimized energy consumption.
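To make the formulation concrete, the sketch below evaluates the per-node objective of Eqn. 6.6 and checks constraints (1)-(6) of Eqn. 6.7 for candidate rate vectors; all function and parameter names are assumptions introduced for this illustration and are not part of the RAS implementation evaluated later.

def node_power(i, s, b, alpha, beta, gamma, d_w, p_w):
    """P_i(r) of Eqn. 6.6: cellular receive, WiFi broadcast, and WiFi receive terms."""
    n = len(s)
    wifi_rx = sum(d_w[j][i] * b[j] * (1 - p_w[j][i]) for j in range(n) if j != i)
    return alpha[i] * s[i] + beta[i] * b[i] + gamma[i] * wifi_rx

def feasible(s, b, tau, r, c, p, d, c_w, p_w, d_w):
    """Check constraints (1)-(6) of Eqn. 6.7 for candidate rates s, b, tau.
    c, p, d: per-node cellular capacity, loss, and delivery rate;
    c_w, p_w, d_w: WiFi capacity, loss, and delivery rate matrices."""
    n = len(s)
    def others(i):
        return [j for j in range(n) if j != i]
    if sum(s) < r:                                                          # (1)
        return False
    for i in range(n):
        wifi_rx = sum(d_w[j][i] * b[j] * (1 - p_w[j][i]) for j in others(i))
        if s[i] + wifi_rx < r:                                              # (2)
            return False
        if s[i] > b[i] * min(d_w[i][j] * (1 - p_w[i][j]) for j in others(i)):  # (3)
            return False
        if s[i] > d[i] * c[i] * (1 - p[i]):                                 # (4)
            return False
        if b[i] > tau[i] * min(c_w[i][j] for j in others(i)):               # (5)
            return False
    return sum(tau) <= 1                                                    # (6)

The total objective is then simply the sum of node_power over all nodes.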

Although Eqn. 6.7 minimizes power consumption in the streaming session described in

Sec. 6.2.1, it suffers from a short lifetime of the shared streaming session. The solution of Eqn. 6.7 will favour high-capacity nodes, and will quickly drain the battery of these

nodes, leading to lower capacity in the WiFi network. Once these nodes consume all of

their energy and leave the cooperative network, the energy consumption will increase, as the

system now consists of only low-capacity nodes. Our experimental results in Sec. 6.3 also

confirm this phenomenon. Therefore, we introduce a new constraint to prolong the shared

time in the cooperative streaming session as follows:

li ≥ ψl,∀ni ∈ N (6.8)

where li is the remaining battery level of node ni, ψ ≥ 0 is a global shared session elongation constant, and l is the average battery level of the cooperating nodes. Clearly, ψ = 0 turns off

the shared session elongation control, while ψ = 1 invites low-capacity nodes to contribute.

We then complete the formulation by adding Eqn. 6.8 to Eqn. 6.7 and replacing Pi(r) with

Eqn. 6.6. Moreover, the symmetric property on each link (wi,j ≃ wj,i) allows us to further

simplify the model by replacing cj,i, pj,i, and dj,i with ci,j, pi,j, and di,j, respectively. The

standard form of the final problem formulation is presented in Table 6.2.

Since the second-order partial derivatives of the objective function and all the constraints in Table 6.2 equal zero, this problem is a convex optimization problem. Nevertheless, exhaustive search algorithms would be too complex due to the size of the search space. Thus, we solve the problem through its Lagrange dual function. We note that the fourth constraint in Table 6.2 specifies the upper bound for rate allocation in the cellular network, and the fifth and sixth constraints specify the upper bound for rate control in the WiFi network. Since these constraints are defined by the network capacity, we keep these three and relax the remaining constraints using a Lagrange multiplier λ for constraint (1) and three Lagrange multiplier vectors η, µ, and ξ for constraints (2), (3), and (7), respectively.

Table 6.2: Energy consumption optimization problem for video streaming in a cooperative network

min_{si,bi} ∑_i (αi·si + βi·bi + γi · ∑_j di,j·bj·(1 − pi,j)),  s.t.  ∀ni ∈ N, nj ∈ N−ni
(1) r − ∑_i si ≤ 0, ∀ni ∈ N
(2) r − (si + ∑_j di,j·bj·(1 − pi,j)) ≤ 0, ∀ni ∈ N, nj ∈ N−ni
(3) si − bi · min_j(di,j(1 − pi,j)) ≤ 0, ∀ni ∈ N, nj ∈ N−ni
(4) si − di·ci·(1 − pi) ≤ 0, ∀ni ∈ N
(5) bi − τi · min_j ci,j ≤ 0, ∀ni ∈ N, nj ∈ N−ni
(6) ∑_i τi − 1 ≤ 0, ∀ni ∈ N
(7) ψl − li ≤ 0, ∀ni ∈ N, nj ∈ N−ni

Then the Lagrangian L for this problem can be written as:

L(r, λ, η, µ, ξ) = ∑_i (αi·si + βi·bi + γi · ∑_j di,j·bj·(1 − pi,j))    (6.9)
    + λ(r − ∑_i si)
    + ∑_i ηi(r − si − ∑_j di,j·bj·(1 − pi,j))
    + ∑_i µi(si − bi · min_j(di,j(1 − pi,j)))
    + ∑_i ξi(ψl − li),  ∀ni ∈ N, nj ∈ N−ni

Then the Lagrangian of the problem can be reformulated by expanding and reordering the


terms towards rate allocation variables, i.e., si and bi. Thus, we have:

L(r, λ, η, µ, ξ) = ∑_i si(αi − λ − ηi + µi)    (6.10)
    + ∑_i bi(βi − µi · min_j(di,j(1 − pi,j)))
    + ∑_i (γi − ηi) · ∑_j di,j·bj·(1 − pi,j)
    + r(λ + ∑_i ηi)
    + ∑_i ξi(ψl − li),  ∀ni ∈ N, nj ∈ N−ni

Now, we can define the Lagrange dual function of the energy usage minimization problem

as:

D(λ, η, µ, ξ) = min_{si,bi} L(r, λ, η, µ, ξ)    (6.11)
s.t. (1) si ≤ di·ci·(1 − pi)
     (2) bi ≤ τi · min_j ci,j
     (3) ∑_i τi ≤ 1,  ∀ni ∈ N, nj ∈ N−ni

To minimize the energy consumption on mobile devices, each node ni ∈ N should broad-

cast data at the highest possible rate so that it can keep the antenna of other nodes in sleep

mode as long as possible, i.e., reducing the idle power. This allows us to remove the second

constraint in Eqn. 6.11 and replace bi with τi minj ci,j in the Lagrangian L. Now, we have

only two variables: s and τ , and the Lagrange dual problem can be decomposed into two

simpler problems to find the optimal rate allocation in the cellular network and the WiFi

network. Hence, the optimization problem for the cellular links and WiFi links rate control


can be respectively written as:

min_{si} ∑_i si(αi − λ − ηi + µi) + r(λ + ∑_i ηi) + ∑_i ξi(ψl − li)    (6.12)
s.t. si ≤ di·ci·(1 − pi), ∀ni ∈ N

min_{τ} ∑_i τi · min_j ci,j · (βi − µi · min_j(di,j(1 − pi,j)))    (6.13)
    + ∑_j τj ∑_i (γi − ηi) · (min_i ci,j) · di,j·(1 − pi,j)
    + ∑_i ξi(ψl − li)
s.t. ∑_i τi ≤ 1, ∀ni ∈ N, nj ∈ N−ni

Because the proposed Lagrange dual function does not provide strong duality, regardless of the values of λ, η, µ, and ξ, there is a gap between the optima of the primal problem (Table 6.2) and its Lagrange dual function (Eqn. 6.11). We use a two-level iterative optimization

method to set the values of the Lagrange multipliers such that this gap is minimized, as will

be described in the RAS algorithm (Sec. 6.2.3).

6.2.3 The Rate Allocation and Scheduling (RAS) Algorithm

Based on the solution for the power consumption minimization problem, we propose the

optimal rate allocation and scheduling (RAS) algorithm for our cooperative streaming sys-

tem. Before pushing each video segment, the video source checks and updates (if necessary)

information about each node ni ∈ N , including channel state information (CSI), partial

WiFi links state information (i.e., ci,j and pi,j, if i < j), energy consumption coefficients

(i.e., αi, βi, and γi), and the remaining battery power (i.e., li). It then formulates the power

consumption minimization problem using the information collected from all nodes. To solve

the energy consumption minimization problem, we start from a chosen set of values for λ,

η, µ, and ξ (e.g., by initializing all multipliers to one), and solve Eqn. 6.12 and Eqn. 6.13


using a linear solver. We then use the subgradient optimization method to update the Lagrange multipliers as follows:

λ^{t+1} = max{0, λ^t + ρ^t_λ (r − ∑_i s^t_i)}    (6.14)
η^{t+1}_i = max{0, η^t_i + ρ^t_{η,i} (r − s^t_i − ∑_j di,j·b^t_j·(1 − pi,j))}
µ^{t+1}_i = max{0, µ^t_i + ρ^t_{µ,i} (s^t_i − b^t_i · min_j(di,j(1 − pi,j)))}
ξ^{t+1}_i = max{0, ξ^t_i + ρ^t_{ξ,i} (ψl − li)}

where ni ∈ N and nj ∈ N−ni, and ρt∗ is the step size series. According to Eqn. 6.14, the

Lagrange multiplier λ is updated according to the difference between the streaming rate and

the cumulative receive rate over cellular links. So λ is the size of the source queue that

buffers segments to be sent by the source in the cellular network. The Lagrange multiplier ηi

specifies the number of missing blocks at node ni as it is updated according to the number

of blocks needed to reconstruct a segment. The Lagrange multiplier µi is updated according

to the difference between the receiving rate on cellular link gi and the broadcast rate at node

ni, so it can be used as the output queue size at node ni. Finally, the Lagrange multiplier ξi

has an inverse relation with the battery conservation of node ni, i.e., smaller ξi encourages

node ni to contribute to the download process under lower battery level conditions.

After each round of update on λ, η, µ, ξ for all nodes, we update Eqn. 6.12 and Eqn. 6.13

with the new values and feed them to the linear solver again. This iterative process repeatedly

improves the values of si and bi. To make the algorithm converge, we must have ∑_{t=0}^{∞} ρ^t_* = ∞ and lim_{t→∞} ρ^t_* = 0. One example of such a sequence is ρ^t_* = 1/(t + 1). For faster convergence,

we utilize the step size adaptation formula proposed by Held and Karp [47]. We believe

that such a centralized solution is feasible in a Cloud-assisted streaming service due to the

computing power of the Cloud.
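The multiplier updates of Eqn. 6.14 can be sketched as follows; this is an illustrative sketch under the simplifying assumptions of symmetric WiFi links and a common diminishing step size ρ = 1/(t + 1), and the function and argument names are not taken from the thesis implementation.

def update_multipliers(lam, eta, mu, xi, s, b, r, d_w, p_w, psi, l_avg, l, t):
    """One round of the projected subgradient updates in Eqn. 6.14."""
    rho = 1.0 / (t + 1)                      # diminishing step size
    n = len(s)
    lam = max(0.0, lam + rho * (r - sum(s)))
    for i in range(n):
        others = [j for j in range(n) if j != i]
        wifi_rx = sum(d_w[i][j] * b[j] * (1 - p_w[i][j]) for j in others)
        eta[i] = max(0.0, eta[i] + rho * (r - s[i] - wifi_rx))
        mu[i] = max(0.0, mu[i] + rho * (s[i] - b[i] * min(
            d_w[i][j] * (1 - p_w[i][j]) for j in others)))
        xi[i] = max(0.0, xi[i] + rho * (psi * l_avg - l[i]))
    return lam, eta, mu, xi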

After finding the optimal rate allocation, the source now prepares the video segment by

dividing it into k blocks. It then produces coded blocks in GF(256) according to the loss

rate on the cellular link connecting the chosen node ni with download rate si. The coded


blocks are sent along with the original blocks over this cellular link. The rate allocation and

scheduling information for the WiFi network is also sent to the selected node. Upon receiving

k linearly independent blocks of a video segment, the node decodes the original blocks and

encodes them again in GF(2) according to its broadcast loss rate. The coded blocks are

broadcast along with the original blocks over this WiFi network at rate bi provided by the

RAS algorithm. Upon receiving or overhearing the broadcasted segment, nodes decode the

segment and buffer it for playback. The proposed algorithm for optimal rate allocation and

scheduling is presented in Alg. 2.

Algorithm 2 Rate Allocation Algorithm on Video Source

Require: cg, pg, α, β, γ, l, g, τ: Arrays of size |N|
Require: cw, pw: Arrays of size |N| × |N|
Require: Alloc<int, int>: Key-value vector
Require: packets: Array of data packets
Require: OptimalRAS(cg, pg, α, β, γ, l): runs the optimal rate allocation and scheduling algorithm for the next video segment and returns g and τ
Require: Encode(vs, g, p): encodes the video segment vs according to g and p and returns Alloc and packets

 1: while streaming do
 2:   for each node ni ∈ N do
 3:     Receive data structure info from ni
 4:     cgi, pgi, αi, βi, γi, li ← info.{cg, pg, α, β, γ, l}
 5:     for j = 1 → i − 1 do
 6:       cwi,j, cwj,i ← info.cwi,j
 7:       pwi,j, pwj,i ← info.pwi,j
 8:     end for
 9:   end for
10:   g, τ ← OptimalRAS(cg, pg, α, β, γ, l)
11:   Alloc, packets ← Encode(vs, g, p)
12:   indx ← 0
13:   for each key i in Alloc do
14:     cnt ← value of key i in Alloc
15:     Push packets[indx, indx + cnt] to node ni
16:     Send τi to node ni
17:     indx ← indx + cnt
18:   end for
19: end while


6.2.4 Overhead Analysis

The proposed RAS algorithm requires each active node to send channel state and battery information to the streaming source. This may be done periodically or just once. If this

information is provided to the source only once, the rate allocation on cellular links will not

consider power drained due to video decoding, rendering, and playback on the screen. If

this information is sent to the source periodically, the rate allocation will be dynamically

updated according to the current status of mobile devices. However, if bandwidth is scarce in

the network or signals are weak, the one-time update is preferred to reduce communication

overhead in the cellular network.

When providing an update to the source, a node ni, in a cooperative network consisting

of N nodes, collects information (wi,j, ci,j, pi,j, and di,j) of WiFi links between itself and

each of the N − 1 nodes, the information (ci, pi, di, si) on the cellular link gi from the

source to itself, as well as its broadcast rate bi in the WiFi network and its battery level

li. Each of these values can be stored in 4 bytes. Hence, the size of an update message is

4 × (4(N − 1) + 6) bytes. For instance, in a cooperative network that consists of 10 nodes

and all nodes are active nodes, an update from a node is 168 bytes in size. Overall, there

are 1680 bytes flowing from the cooperative network to the source for each update period.

Assume that the update period is 10 seconds, which is frequent enough. The updates then consume 168 Bps (about 1.3 Kbps) of bandwidth. If we assume the bit rate of a typical streaming session is 1 Mbps,

the communication overhead is 0.13% if the source serves only one copy of the video to the

cooperative network. Bear in mind that without the RAS algorithm, the source will stream

more than one copy of the video to the active nodes over the cellular network. Therefore, we

argue that this overhead is negligible. Furthermore, we note that the periodic update will

not lead to extra transmission delays, since the streaming source can use the previous RAS

results before receiving any new update.
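The size and overhead figures above can be reproduced with the following back-of-envelope sketch; the function name is hypothetical and simply encodes the byte count described in the text.

def update_message_bytes(n_nodes):
    """Per-node update: 4 WiFi-link values for each of the other N-1 nodes
    plus 6 per-node values, each stored in 4 bytes."""
    return 4 * (4 * (n_nodes - 1) + 6)

n = 10
per_node = update_message_bytes(n)       # 168 bytes per node
per_period = n * per_node                # 1680 bytes every update period (10 s)
overhead = (per_period / 10) * 8 / 1e6   # fraction of a 1 Mbps stream
print(per_node, per_period, f"{overhead:.2%}")   # 168 1680 0.13%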

Finally, we consider mobility of the mobile devices. Over time, the cooperative group


may move from one cell to another cell in the cellular network. For quick handover, the old

base station sends the new base station link information (wi,j, ci,j, pi,j, and di,j) for each of

the N² WiFi links, the Lagrange multipliers (ηi, µi, ξi) for each of the N nodes and λ, as well as the broadcast rate bi, the share time τi, and the battery level li of each node in the WiFi network. Each of these values needs 4 bytes. Hence, the size of a handover message is 4(4N² + 6N + 1) bytes. For instance, in a cooperative network that consists of 10 nodes and

all nodes are active nodes, the old base station sends the new base station 1844 bytes during

the handover. Again, this overhead is considered negligible compared to the large volume of

the streaming traffic.

6.3 Performance Evaluation

In this section, we evaluate the performance of the proposed system, especially the effec-

tiveness of the RAS algorithm. In all experiments, unless specified otherwise, we assume

the packet loss rate of each cellular link and the WiFi broadcast channels are pi = 0.2 and

pw = 0.2, respectively. The average round trip time in the cellular network and the WiFi

network are 237 ms and 89 ms, respectively.

To evaluate our system in a close-to-reality setting, we use the energy profile from a real

smartphone. This phone is equipped with a dual-core 1.2 GHz processor, 1 GB of memory,

and a 3.7 V 1750 mAh Li-Ion battery that provides 23.31 KJ of energy. The operating

system is Android 4.2 (Jelly Bean). To collect the energy profile, we transmit (upload and

download) a large file using this phone in different network settings and monitor the energy

use. The energy consumption is measured three times using a fully charged phone, and

the average is reported in Table 6.3. For transmissions in the cellular network, we control the download rate of the phone using a software tool. We use the energy profile of the phone with

different download rates to simulate different types of nodes in the cooperative network.

Table 6.3: Throughput and energy efficiency of wireless transmissions and coding operations

                        Throughput (Mbps)    Energy Consumption (KJ/GB)
3G, Download (I)        4.72                  2.12
3G, Download (II)       3.8                   2.64
3G, Download (III)      1.5                   6.43
3G, Download (IV)       1.0                   9.65
3G, Download (V)        0                     —
WiFi, Download          18.25                 0.48
WiFi, Upload            12.29                 0.67
GF(2), Encoding         393.58                0.014
GF(2), Decoding         408.16                0.014
GF(256), Encoding       5.03                  1.03
GF(256), Decoding       7.04                  0.84

According to Table 6.3, the type I node achieves the maximum throughput, and the throughput decreases from type II to type IV. We note that nodes with lower throughput tend to be less

energy efficient, as it will take longer to transmit the same file. Type V node represents a

passive node that does not have any cellular connection, and relies on other types of nodes

for the streaming service. From Table 6.3, we also observe that data transmission over WiFi

has higher throughput and consumes less energy, which justifies the benefits of offloading

the cellular transmission to the WiFi cooperative network formed by mobile devices.

To measure the energy consumption of coding operations over GF(2) and GF(256), we use

NCUtils [7], a network coding library written in Java. When profiling the coding throughput

and the energy efficiency on the phone, file size and block size are chosen such that the coding

throughput and the linear independence among the coded blocks are maximized. Table 6.3

shows that coding operations in GF(2) are faster and more energy efficient due to their

linear computational complexity, whereas the coding operations in GF(256) have quadratic

complexity. The communication overheads (coding coefficients and extra coded blocks) of

network coding in GF(2) and GF(256) are 7.03% and 3.125%, respectively. The higher

overhead in GF(2) is the result of the higher linear dependence among coded blocks, i.e., extra coded blocks must be generated and sent to guarantee 99.999% decodability. Based on the energy profile

in Table 6.3, even with the extra overhead, sharing coded blocks in GF(2) is still a green


choice as coding operations are significantly more energy efficient than that of GF(256).
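A rough back-of-envelope comparison supports this claim, using the numbers in Table 6.3 and assuming one encode at the seed plus one decode at a receiver per GB shared, together with the extra coded data sent and received over WiFi; the helper function below is purely illustrative.

def extra_energy_per_gb(encode_kj, decode_kj, overhead, wifi_up=0.67, wifi_dn=0.48):
    """KJ spent on coding plus on the redundant WiFi traffic, per GB of video
    (WiFi upload/download costs taken from Table 6.3)."""
    return encode_kj + decode_kj + overhead * (wifi_up + wifi_dn)

gf2 = extra_energy_per_gb(0.014, 0.014, 0.0703)    # ~0.11 KJ/GB
gf256 = extra_energy_per_gb(1.03, 0.84, 0.03125)   # ~1.91 KJ/GB
print(round(gf2, 2), round(gf256, 2))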

In our simulation, we measure the energy consumption (in kJ) instead of power consump-

tion (in kW), since it represents the total energy consumed throughout a streaming session. To provide a clear view of the energy saving on mobile devices, we take the average of the energy consumed by each mobile device in the system. We also measure the

streaming delay (in ms), defined as the time it takes to deliver a specific segment to all the

collaborating nodes after it is scheduled by the RAS algorithm for transmission.

6.3.1 Cooperative Streaming using Different Coding Strategies

We begin the study with an investigation on the effect of different coding strategies on the

power consumption in a cooperative network of homogeneous devices, i.e., the cooperating

nodes are all from type I, as defined in Table 6.3. We turn off the shared session elongation

constraint (i.e., ψ = 0) in order to focus on the impact of different coding strategies and

cooperative arrangements. We feed the simulated source a high-definition video of length

6077 seconds with bitrate 4.36 Mbps. For comparison purpose, we implement four different

cooperation schemes listed below.

• Streaming over 3G: In this scheme, all mobile devices download the video directly over

cellular downlinks. There is no cooperation among mobile devices.

• Cooperation without network coding: In this scheme, mobile devices cooperate over

WiFi, but no coding is employed, i.e. di = di,j = 1,∀i ∈ N , j ∈ N−i . Without network

coding, all nodes must perform data reconciliation by contacting individual nodes for

particular missing segments.

• Cooperation using RLNC: In this scheme, random linear network coding is employed

over both cellular and WiFi links and coded blocks are always coded in GF(256). This

scheme is proposed in [72].

• Cooperation using two-level NC: In this scheme, as illustrated in Sec. 6.1, systematic


Reed-Solomon codes and systematic Fountain codes are used in the cellular and WiFi

networks, respectively.

In this experiment, we first vary the size of the cooperative network from 1 node to 20

nodes. Fig. 6.5 shows that when streaming over cellular links, the energy used by each node is constant. Compared to cooperation using RLNC, noticeable energy saving is offered by our system, primarily due to the smaller field size and the systematic network coding utilized in the two-level NC scheme. The cooperation with RLNC employed in MicroCast [72] consumes much more energy due to its coding complexity, while our system consumes almost no extra energy compared to the no-coding scheme with 10 or fewer nodes. If the system consists of more than 10 nodes, in the absence of network coding, the limited WiFi capacity forces the nodes to use cellular downlinks to receive missing packets, leading to increased battery consumption for the no-coding scheme.

Figure 6.5: Impact of cooperation arrangements and coding strategies on average energy consumption (average energy consumption in KJ vs. number of cooperating nodes for streaming over 3G, cooperation without coding, cooperation with RLNC, and cooperation with two-level NC).

Fig. 6.6 shows a break-down of the average energy consumption. This confirms that the

cooperation among mobile devices greatly reduces the energy usage due to cellular transmis-

sions. Both RLNC and two-level NC minimize the traffic and the respective energy usage in

the cellular and WiFi networks. The simplified coding operations in two-level NC consume a very small amount of energy, and the slightly increased WiFi transmissions add no considerable extra energy.

Figure 6.6: A break-down of the average energy consumption (energy spent on 3G transmission, WiFi transmission, and network coding for streaming over 3G, cooperation with RLNC, cooperation with two-level NC, and cooperation without NC, at n = 10 and n = 20).

Figure 6.7: Effectiveness of the RAS algorithm (average energy consumption vs. number of cooperating nodes for the aggressive, equal, and battery-centric collaboration heuristics and the optimal scheduling).

6.3.2 Centralized Optimal RAS vs. Distributed Heuristic Algo-

rithms

In this experiment, we compare the optimal RAS algorithm, which benefits from centralized resource allocation and scheduling, with three heuristic algorithms for the distributed Schedule() function in Alg. 1, i.e., aggressive collaboration, equal collaboration, and battery-centric collaboration strategies. In aggressive collaboration, nodes aggressively collaborate

with each other to download video segments, i.e., nodes download from the cellular link

whenever possible. The bandwidth of the downlink at each node determines how many seg-

ments a node will download from the Cloud. On average, the download share of each node

can be characterized as in Eqn. 6.15.

\frac{r_{i,d}}{r_{i,d} + r_{i,w}} \simeq \frac{C_i}{\sum_{j \in S} C_j} \qquad (6.15)

where ri,d is the number of segments retrieved using a cellular downlink, ri,w is the number of

segments received or overheard over the WiFi network, Ci is the cellular downlink capacity

of node Ni, and S ⊆ N is the set of collaborating nodes that have Ci > 0.

In equal collaboration, fairness is enforced among nodes, i.e., each node downloads the

same amount of video content from the Cloud. The contribution of a node is determined by the number of nodes with cellular connections to the Cloud. On average, the download share of each node is then approximately 1/NC. For each node Ni ∈ N the Schedule()

function returns true if the following condition holds:

\frac{r_{i,d}}{r_{i,d} + r_{i,w}} \leq \frac{1}{N_C} \qquad (6.16)

where NC is the number of collaborating nodes that have Ci > 0. Assuming that nodes

can overhear messages in the WiFi network, every node can obtain a close estimate of the

number of nodes in the network.


Finally, in battery-centric collaboration, the objective is to maximize the overall battery

lifetime in the system. In other words, nodes with relatively low battery level will avoid

scheduling transmissions on their cellular links, while nodes with relatively longer remaining

battery lifetime will stream from the Cloud on behalf of the WiFi network. For each node

Ni ∈ N the Schedule() function returns true if the following condition holds:

B_i \geq B_C \qquad (6.17)

where Bi is the battery level of node i and BC is the approximate average of the battery

levels among nodes with Ci > 0. This scheduling algorithm requires each node to announce

its battery level, which can be piggybacked in any of the announcement messages.
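To make the three distributed heuristics concrete, the following minimal Python sketch expresses each Schedule() predicate directly from Eqns. 6.15–6.17. The Node structure, its field names, and the way a node estimates the number of cellular-capable peers and their average battery level from overheard announcements are illustrative assumptions, not the actual simulator interface.

from dataclasses import dataclass

@dataclass
class Node:
    # Illustrative per-node state (hypothetical field names).
    cellular_capacity: float   # C_i: cellular downlink capacity, 0 if no cellular link
    battery_level: float       # B_i: remaining battery level (%)
    r_cellular: int = 0        # r_{i,d}: segments retrieved over the cellular downlink
    r_wifi: int = 0            # r_{i,w}: segments received or overheard over WiFi

def schedule_aggressive(node):
    # Aggressive collaboration: download over cellular whenever possible; on
    # average the per-node shares then converge to Eqn. 6.15.
    return node.cellular_capacity > 0

def schedule_equal(node, peers):
    # Equal collaboration (Eqn. 6.16): keep this node's cellular share at or
    # below 1/N_C, where N_C counts collaborating nodes with C_i > 0.
    n_c = sum(1 for p in peers if p.cellular_capacity > 0)
    total = node.r_cellular + node.r_wifi
    share = node.r_cellular / total if total else 0.0
    return node.cellular_capacity > 0 and share <= 1.0 / max(n_c, 1)

def schedule_battery_centric(node, peers):
    # Battery-centric collaboration (Eqn. 6.17): use the cellular downlink only
    # if this node's battery level is at least the average B_C of cellular-capable peers.
    cellular_peers = [p for p in peers if p.cellular_capacity > 0]
    if node.cellular_capacity == 0 or not cellular_peers:
        return False
    b_c = sum(p.battery_level for p in cellular_peers) / len(cellular_peers)
    return node.battery_level >= b_c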

The experiments are conducted in various heterogeneous cooperative networks consisting

of 20 nodes of types II–V defined in Table 6.3 with equal probability. To imitate the real-world model of battery charges, we use a normal distribution with µ = 67 and σ² = 10 [46]

to model the remaining battery lifetime. Again, we turn off the shared session elongation

constraint (i.e., ψ = 0) in order to focus on the impact of different rate allocation algorithms.

In order to ensure that almost all nodes have sufficient bandwidth on their cellular links to

sustain the streaming rate, we feed the simulated source a high-definition video of length

5482 seconds with a lower bitrate of 1.61 Mbps.
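As a small illustration of this setup, the snippet below draws per-node battery levels from the stated N(µ = 67, σ² = 10) model; the fixed seed and the clamp to the valid 0–100% range are assumptions of the sketch rather than part of the simulator.

import numpy as np

rng = np.random.default_rng(42)                    # fixed seed only for reproducibility
mu, sigma2 = 67.0, 10.0                            # battery model reported in [46]
levels = rng.normal(mu, np.sqrt(sigma2), size=20)  # one sample per node; scale = sqrt(variance)
levels = np.clip(levels, 0.0, 100.0)               # assumed clamp to a valid percentage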

The specifications of the nodes used in this experiment are summarized in Table 6.4.

Furthermore, we assume that the nodes join the system according to the order of node

identifiers, i.e., if the cooperative network contains 10 nodes, these nodes are nodes 1 to 10

from Table 6.4.

Fig. 6.7 shows that, compared to the aggressive, equal, and battery-centric collaboration

algorithms, our proposed RAS algorithm saves up to 15.0%, 22.4%, and 39.7% of energy,

respectively. It is interesting that the battery-centric collaboration algorithm, the most

intuitive heuristic approach to conserve battery, is the least energy-efficient solution.

Next, we compare the streaming delay offered by each scheduling algorithm. Fig. 6.8


Table 6.4: Specification of heterogeneous nodes for experiment II

Node ID               1     2     3     4     5     6     7     8     9     10
Node Type             III   V     IV    II    V     III   V     II    III   V
Cellular Throughput   1.5   0     1     3.8   0     1.5   0     3.8   1.5   0
Battery Level (%)     42.3  53.6  77.9  32.8  32.2  54.3  39.0  56.2  45.7  43.5

Node ID               11    12    13    14    15    16    17    18    19    20
Node Type             IV    II    IV    V     III   II    III   IV    V     III
Cellular Throughput   1     3.8   1     0     1.5   3.8   1.5   1     0     1.5
Battery Level (%)     50.1  60.0  71.3  47.5  51.4  29.2  52.9  49.0  52.4  41.5

shows that the aggressive scheduling offers the least delay since each node downloads as

fast as possible from the source. However, the tradeoff is a shorter remaining battery lifetime.

The battery-centric scheduling algorithm incurs the longest delay among all four algorithms, which also

explains why this algorithm leads to higher energy consumption. For example, according to

Table 6.3, it costs type II nodes much less power to download segments over cellular links

than type IV nodes do. Hence, the longer it takes to download a segment, the more energy is

consumed. Our optimal RAS algorithm approximates the best case very well, as it schedules

nodes to download segments over cellular links according to their cellular download rate si

and energy efficiency (KJ/GB), i.e., it prefers type II nodes over type III nodes and type

III nodes over type IV nodes. In summary, our algorithm outperforms the equal collaboration and battery-centric algorithms with 4.0% and 27.1% lower transmission delay, respectively.

However, it incurs 13.9% extra average delay compared to aggressive scheduling, due to its

tendency to conserve battery power on mobile devices.

6.3.3 Impact of the Session Elongation Constraint

Finally, we turn our attention to the impact of the shared session elongation constraint on

average energy usage, video segment transmission delay, and the uptime of mobile devices.

We resort to the same setting as in Sec. 6.3.2, and reduce the network size to 7 nodes to allow

close examination of individual nodes. The type and the default battery level of each node are listed in Table 6.5. In this experiment, we vary the value of the shared session elongation


[Figure 6.8: Average transmission delay (ms) of video segments offered by the different scheduling algorithms (aggressive collaboration, equal collaboration, battery-centric collaboration, and optimal scheduling) versus the number of cooperating nodes (1–20).]

coefficient (i.e., ψ) from 0 to 1.0. As discussed in Sec. 6.2, ψ = 0 removes the constraint and ψ = 1 forces the mobile devices with higher battery levels to use the more expensive cellular downlink to receive the video segments and serve them in the WiFi network.

Table 6.5: Specification of heterogeneous nodes

Node ID            1     2     3     4     5     6     7
Node Type          I     II    II    III   III   IV    IV
Battery Level (%)  71.4  66.3  37.2  60.5  89.4  55.1  77.9

Fig. 6.9 shows the decrease in energy level (computed based on the battery level) through-

out a streaming session for four different values of the shared session elongation coefficient ψ.

Fig. 6.9(a) shows that without this constraint, nodes with low energy efficiency (e.g., nodes

of types III and IV) live longer, as the optimal RAS algorithm relies on the highly energy-efficient

nodes to deliver the streaming content in the WiFi network. Although there are three nodes

still alive beyond 16 hours, the other four die as early as 6 hours. Hence, the cooperative

session for the entire network ends after 6 hours, although some devices live much longer.

The average streaming duration, denoted by D, is about 14 hours. According to Fig. 6.9(b),

6.9(c) and 6.9(d), as we increase the session elongation coefficient ψ, the lifetime of all nodes

converges to approximately 16 hours. The longest streaming session can be achieved when

ψ = 1. The only tradeoff is that nodes 5, 6, and 7 have shorter lifetimes as they are invited to con-


[Figure 6.9: Length of the streaming session when varying the shared session elongation coefficient ψ. Each panel plots the energy level (KJ) of nodes 1–7 over the streaming duration (hours): (a) ψ = 0, D = 14:12′; (b) ψ = 0.33, D = 15:11′; (c) ψ = 0.66, D = 15:27′; (d) ψ = 1.0, D = 15:55′.]


tribute their battery to assist the streaming session. Nonetheless, the entire group can now

have a longer streaming session. Note that in Fig. 6.9(a) and Fig. 6.9(b) node 7

cannot stream the reference video alone since its downlink capacity is less than the video bit

rate.

Fig. 6.10 depicts the average energy consumption throughout the streaming sessions

simulated using different values of the shared session elongation coefficient. When ψ = 0, the average energy consumption is low, as the cooperative streaming in the WiFi network is empowered by the few energy-efficient nodes. Once these nodes (nodes 1–4) consume

their battery power and leave the system, less energy-efficient nodes become involved to

keep the session going, resulting in higher average energy consumption. With ψ > 0, the

average energy consumption is higher at the beginning of the streaming session, compared

to the case with ψ = 0. Since the optimal RAS algorithm involves all nodes according to

their battery level and energy efficiency, the workload is proportionally distributed among

nodes. Consequently, the average energy consumption remains almost constant when ψ = 1.

[Figure 6.10: Average energy consumption for different values of the shared session elongation coefficient. Hourly average energy consumption (KJ) over the streaming duration (hours) for ψ = 0, 0.33, 0.66, and 1.0.]


[Figure 6.11: Average transmission delay (ms) of video segments over the streaming duration (hours) when varying the shared session elongation coefficient (ψ = 0, 0.33, 0.66, and 1.0).]

Fig. 6.11 compares the average transmission delay of video segments throughout the streaming sessions simulated using different values of the shared session elongation coefficient.

We observe that a higher ψ value leads to longer transmission delays, since the optimal RAS

algorithm utilizes slower nodes to avoid fast battery drainage on the powerful nodes. Among

all nodes, the nodes with low throughput values will be slow in delivering their segments.

However, when ψ = 1, the delay decreases after a while, as the nodes approach the same battery level and more nodes are involved in the segment download process. Conversely, when ψ = 0, the delay increases after a while as the powerful nodes deplete their batteries and leave the system. The

overall average transmission delays for ψ = 0, ψ = 0.33, ψ = 0.66, and ψ = 1.0 are measured as 683 ms, 732 ms, 746 ms, and 809 ms, respectively. The increase in average delay when ψ changes from 0 to 1 is only 126 ms, which does not noticeably degrade the quality of the streaming

session.

6.4 Summary

In this chapter, we first proposed a two-level coding and transmission scheme to stream

multimedia content to smartphones. At the top level, the media server produces coded

blocks of video segments (in the cloud) using systematic network coding in GF(256) and


serves coded blocks to the smartphones. At the bottom level, received content is shared

among smartphones within a WiFi network. The content is transmitted in both verbatim

form and coded form using systematic network coding in GF(2). Furthermore, we minimize

the network coding operations on smartphones by offloading the encoding process to the

server side (i.e., the cloud) and using XOR-only network coding in the WiFi network only

when necessary.
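As a rough illustration of the bottom level, the sketch below builds one XOR-only (GF(2)) repair block over the verbatim blocks of a segment and shows how a peer recovers a single missing block from it. The byte-oriented block representation is an assumption of the sketch, not the actual packet format of the system.

def xor_repair_block(blocks):
    # Systematic GF(2) coding: the repair block is the bitwise XOR of all
    # verbatim blocks of a segment (blocks are assumed to have equal length).
    repair = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            repair[i] ^= b
    return bytes(repair)

def recover_missing_block(received, repair):
    # XOR-ing the repair block with every received block cancels them out and
    # leaves exactly the one block that was lost.
    return xor_repair_block(received + [repair])

For example, with blocks b1, b2, b3 and repair block r = b1 ⊕ b2 ⊕ b3, a node that received only b1 and b3 recovers b2 = b1 ⊕ b3 ⊕ r, which is why a single XOR packet suffices for one lost block per segment.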

Furthermore, we proposed the optimal rate allocation and segment scheduling for mini-

mizing both the streaming traffic in the cellular network and the energy consumed by stream-

ing applications on mobile devices. More specifically, the proposed algorithm determines the

number of segments and the actual segments to be transmitted on each link. The actual

delivery of segments is achieved through the two-level coding scheme proposed in the first

section, ensuring an energy-efficient recovery of the streaming content on mobile devices. We

evaluate the system using a simulator driven by energy profiles from real devices and com-

pare the performance of the proposed centralized optimal models with different distributed

heuristic scheduling models.

Our experimental results show that the proposed two-level coding system saves up to

73% of battery usage on each phone compared to the current streaming mechanism over the

cellular network, and up to 52% compared to previous research work [72]. Due to the efficient

use of CPU time, the maximum achievable throughput in our system approaches the maximum capacity of the WiFi network, i.e., the system is capable of supporting higher quality

streaming. Furthermore, the proposed system provides short delay and long-lasting video

streaming by balancing the remaining battery lifetime among a group of cooperating nodes.

Moreover, compared to the proposed heuristic rate allocation and scheduling algorithms,

the proposed optimal RAS algorithm leads to significant energy saving on mobile devices.

Finally, our study on the impact of the session elongation constraint shows that enforcing

proportional contribution among all nodes effectively prolongs the streaming session for the


entire cooperative group.

While the proposed model significantly improves the resource efficiency of video streaming

for collaborating smartphones, the provided bandwidth gain can also be traded for higher

video quality.


Chapter 7

Concluding Remarks and Future Works

Towards a high-quality and resource-efficient video streaming service in mobile networks, in this thesis we emphasized what video coding and compression can bring to different parts of the video streaming life cycle. This objective requires investigating the underlying

mechanisms and properties of the advanced video coding standards. Toward this goal, in

Chapter 3 we first introduced our carefully selected dataset of full-HD raw video sequences.

The dataset was assembled such that the selected video sequences represent a variety of con-

tent types that a video streaming server might need to encode and stream. Next, we looked

deeper into layered video coding and performed a systematic study on the use of advanced

video coding (H.264/AVC) and scalable video coding (SVC) for full HD video streaming.

We learned that, in spite of the results reported in previous research works using SVC for low-resolution videos (e.g., CIF and 4CIF), SVC requires fewer computational resources than AVC in the encoding phase and also achieves higher video quality at higher resolutions.

The main barriers to the broad use of layered video coding in general, and SVC in particular, are the up to two times higher decoding time due to the more complex prediction loops, and the lack of embedded hardware decoders.

We notice that the results and observations reported in Chapter 3 can be generalized

to higher resolutions (such as 4K and 8K) for H.264/AVC and SVC. Even though these

standards do not support these resolutions, we tweaked the reference software to confirm

the findings at higher resolutions. However, it is not easy to generalize the reported ob-

servations to the new generation of video coding standards, i.e., H.265/HEVC and SHVC.

As discussed in Appendix A, many details have been modified in H.265/HEVC and SHVC.

Most importantly, the basic coding unit of video frames has changed from macroblocks and


sub-macroblocks to coding tree units (CTUs). CTUs can use larger block structures of up to

64 × 64 samples and can better sub-partition the picture into variable-sized structures. Such

a major change in the basic coding unit, along with other improvements and modifications,

may change the effect of layering and different video compression parameters and settings.

Looking into the application of the internal dynamics of video coding standards in the life cycle of a video streaming episode, in Chapter 4 we proposed a novel distributed video transcoding scheme. The proposed scheme takes advantage of the visual similarity among macroblocks in a video sequence to reduce the bitrate and transcoding time of the transcoded video.

As a pioneering work in this research direction, we proposed an algorithm to extract the dependency among macroblocks in an encoded video, based on which we determined the dependency between successive GOPs. GOPs were then clustered according to their dependencies to create variable-size video chunks so that visually similar GOPs were put in one

chunk. Through experiments, we demonstrated that the proposed scheme reduces the video

bitrate and transcoding time. While the proposed model was evaluated using the layered video coding standard SVC, the same principle can be applied to other video coding standards such as H.264/AVC or H.265/HEVC. The details of the model, however, need to be adjusted to the prediction mechanisms embedded in each video coding standard such that the similarity between frames and GOPs is correctly calculated.
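A minimal sketch of the chunking idea is shown below: consecutive GOPs are merged into one variable-size chunk as long as their pairwise dependency score stays above a threshold. The dependency_score callable, the threshold, and the chunk-size cap are placeholders; the actual measure in Chapter 4 is derived from the macroblock-level dependencies of the encoded video.

def chunk_gops(gops, dependency_score, threshold=0.5, max_chunk_len=16):
    # Group consecutive, visually similar GOPs into variable-size chunks.
    # gops: ordered, non-empty list of GOP identifiers
    # dependency_score(a, b): placeholder similarity/dependency measure in [0, 1]
    chunks, current = [], [gops[0]]
    for prev, cur in zip(gops, gops[1:]):
        if dependency_score(prev, cur) >= threshold and len(current) < max_chunk_len:
            current.append(cur)       # strong dependency: extend the current chunk
        else:
            chunks.append(current)    # dependency broken: start a new chunk
            current = [cur]
    chunks.append(current)
    return chunks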

One related direction of future research is to determine the proper balance between video

transcoding and layered video coding. As discussed in details in Chapters 2 and 3, layered

video coding allows the edge media streamer to select a proper set of layers to match the

connection bandwidth and the hardware capabilities of the end user device. However, this

flexibility comes with a cost in terms of bitrate overhead for higher layers and computational

complexity of the coding tasks. An alternative solution to this problem is video transcoding, which

transcodes the video either on the fly or using a set of predefined compression settings. Video

transcoding also comes with a high computational cost, along with a larger storage requirement


if it is performed offline. We need a model to wisely decide between adding or modifying a video layer,

adding a new encoding setting to the offline transcoding engine, or transcoding the video on

the fly to serve the specific end user. A straightforward solution toward this problem is using

a set of predefined parameters to create a constrained optimization problem and solve the

problem every time a decision needs to be made. As demonstrated in Chapter 4, considering

the video properties and the internal dynamics of the selected video coding standard might

strongly affect the cost function of the problem. The same statement can be made for a

heuristic solution.
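Purely as an illustration of such a decision procedure, the sketch below picks the cheapest of the three options for a single request from externally supplied cost estimates; the cost terms, their units, and the single storage constraint are hypothetical placeholders rather than a proposed model.

def choose_delivery_option(costs, rendition_size_gb, free_storage_gb):
    # costs: hypothetical per-request cost estimates, e.g. compute cost plus a
    # bitrate-overhead penalty, one entry per option.
    options = {
        "add_or_modify_layer":  costs["layer"],        # layered-coding overhead
        "offline_transcode":    costs["offline"],      # transcode once, store the result
        "on_the_fly_transcode": costs["on_the_fly"],   # transcode on every request
    }
    # Example constraint: offline transcoding is infeasible when the extra
    # rendition does not fit in the remaining storage budget.
    if free_storage_gb < rendition_size_gb:
        options.pop("offline_transcode")
    return min(options, key=options.get)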

In Chapter 5, we turned our attention to the video transmission phase of a video streaming

session. First, we closely studied the coding and prediction mechanism of the state-of-the-art

layered video coding standard, i.e., SVC. Next, toward smarter protection of video packets

over noisy communication channels and better quality of the transmitted video, we proposed

a novel coding and dependency aware unequal error protection algorithm. The proposed

algorithm calculates the importance of different video packets and assigns a corresponding amount of protection to each video packet. Experimental results show that the proposed algorithm

outperforms the state-of-the-art unequal error protection algorithms in terms of the quality

of the transmitted video. Finally, we completed the proposed UEP model by extending

the UEP problem from the unicast scenario to the multicast scenario, in which the full potential of

layered video coding is utilized by allowing the transmission network to multicast one copy

of the layered video for groups of heterogeneous mobile devices. To this end, we proposed a

new technique to dynamically adjust and combine the protection FEC packets for reference

and dependent video layers for video multicast in mobile communication networks.
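As a simplified illustration of the idea, the sketch below spreads a fixed budget of FEC packets across video packets in proportion to their (already computed) importance scores; this proportional rule is a stand-in for, not a restatement of, the dependency-aware allocation derived in Chapter 5.

def allocate_fec(importance, fec_budget):
    # importance: non-negative importance score per video packet
    # fec_budget: total number of FEC packets available for this group of packets
    total = sum(importance)
    if total == 0:
        return [0] * len(importance)
    raw = [fec_budget * w / total for w in importance]
    alloc = [int(x) for x in raw]
    # Largest-remainder rounding so the allocations sum exactly to the budget.
    remainders = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in remainders[: fec_budget - sum(alloc)]:
        alloc[i] += 1
    return alloc

For example, with importance scores [3, 1, 1] and a budget of 10 FEC packets, the most important packet receives 6 FEC packets and each of the other two receives 2.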

The main idea proposed in Chapter 5 is utilizing the internal dependencies of compressed

video sequences to determine the importance of video packets. This idea can be applied

to any modern video coding standard, including H.265/HEVC and SHVC, as this idea is

extracted from video compression techniques. As expected, the new coding tree structure


and prediction mechanisms require the details of this method to be tailored again. Therefore,

a potential direction for future research can be the application of the same principle to the

new video coding standards. Furthermore, on the multicast problem, a deeper look into the

modulation schemes used in mobile wireless networks may offer further room to improve the

proposed solution.

Finally, in Chapter 6 we looked into the last part of the video streaming life cycle, i.e.,

video reception in smartphones. In this chapter, we first proposed a two-level coding and

transmission scheme to stream multimedia content to smartphones. At the top level, we used

systematic network coding in a Galois Field, i.e., GF(256). At the bottom level, the received

content is shared among smartphones within a WiFi network and systematic network coding

over GF(2) is utilized to augment the transmission process. Furthermore, we proposed the

optimal rate allocation and segment scheduling to minimize both the streaming traffic in

the cellular network and the energy consumed by streaming applications on mobile devices.

We also demonstrated that the proposed method outperforms similar works by reducing

the mobile traffic in the cellular network, reducing the battery usage on the smartphones,

reducing the transmission delay, and increasing the video quality.


Bibliography

[1] Joint Scalable Video Model (JSVM) software, version 9.19.15, Fraunhofer Heinrich-

Hertz-Institut, available online.

[2] Video Quality Measurement Tool, version 1.1, Multimedia Signal Processing Group,

Ecole Polytechnique Federale de Lausanne (EPFL), http://mmspg.epfl.ch.

[3] Xiph.org Test Media collection, http://media.xiph.org.

[4] Video System Characteristics of AVC, ATSC Standard A/72, Part 1:2008.

[5] Coding of audio-visual objects - part 2. ISO/IEC 14492-2 (MPEG-4 Visual), ISO/IEC

JTC 1, Version 3: May 2004.

[6] H.265/HEVC Reference Software (HM). https://hevc.hhi.fraunhofer.de/svn/svn-

HEVCSoftware/.

[7] NCUtils, Network Coding Utilities. http://code.google.com/p/ncutils.

[8] OpenHEVC, H.265/HEVC Encoder and Decoder Software.

https://github.com/OpenHEVC/ffmpeg.

[9] WiFi Direct, http://www.wi-fi.org/discover-and-learn/wi-fi-direct.

[10] x265, H.265/HEVC Encoder and Decoder Software. http://x265.org/.

[11] TS 22.146 V9.0.0 Technical Specification Group Services and System Aspects; Multi-

media Broadcast/Multicast Service; Stage 1, 2008.

[12] R. Ahlswede, N. Cai, S. Y. R. Li, and R. W. Yeung. Network Information Flow. IEEE

Transactions on Information Theory, 46(4):1204–1216, July 2000.


[13] S. Ahmad, R. Hamzaoui, and M. Al-Akaidi. Unequal Error Protection Using Foun-

tain Codes with Applications to Video Communication. in IEEE Transactions on

Multimedia, 13(1):92–101, February 2011.

[14] S. Ahmad, R. Hamzaoui, and M. Al-Akaidi. Unequal Error Protection Using Fountain

Codes with Applications to Video Communication. IEEE Transactions on Multimedia,

13(1):92–101, 2011.

[15] G. Ananthanarayanan, V. N. Padmanabhan, L. Ravindranath, and C. A. Thekkath.

Combine: Leveraging the Power of Wireless Peers Through Collaborative Download-

ing. In Proc. of 5th International Conference on Mobile Systems, Applications, and

Services (MobiSys), pages 286–298, San Juan, Puerto Rico, June 11-14, 2007.

[16] G. Ananthanarayanan, V. N. Padmanabhan, L. Ravindranath, and C. A. Thekkath.

Combine: Leveraging the Power of Wireless Peers Through Collaborative Download-

ing. In Proc. of 5th International Conference on Mobile Systems, Applications, and

Services (MobiSys), pages 286–298, San Juan, Puerto Rico, June 11-14 2007.

[17] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering Points

to Identify the Clustering Structure. ACM Sigmod Record, 28(2):49–60, June 1999.

[18] R. Aparicio-Pardo, K. Pires, A. Blanc, and G. Simon. Transcoding Live Adaptive

Video Streams at a Massive Scale in the Cloud. In Proc. of the 6th ACM Multimedia

Systems Conference (MMSys 2015), pages 49–60. ACM, 2015.

[19] S. Arslan, P. Cosman, and L. Milstein. Generalized Unequal Error Protection LT

Codes for Progressive Data Transmission. in IEEE Transactions on Image Processing,

21(8):3586–3597, August 2012.

[20] A. Ashraf. Cost-Efficient Virtual Machine Provisioning for Multi-tier Web Applications

and Video Transcoding. In Proc. of 13th IEEE/ACM International Symposium on


Cluster, Cloud and Grid Computing (CCGrid), pages 66–69, Delft, Netherlands, May

13-16 2013.

[21] A. Ashraf, F. Jokhio, T. Deneke, S. Lafond, I. Porres, and J. Lilius. Stream-Based

Admission Control and Scheduling for Video Transcoding in Cloud Computing. In

Proc. of the 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid

Computing (CCGrid), pages 482–489, Belfast, Northern Ireland, May 13-16 2013.

[22] Z. Avramova, D. De Vleeschauwer, P. Debevere, S. Wittevrongel, P. Lambert, R. Van

De Walle, and H. Bruneel. Performance of Scalable Video Coding for a TV broad-

cast network with constant video quality and heterogeneous receivers. In Proc. of

the10th International Conference on Telecommunications (ConTEL 2009), pages 435–

441, June 2009.

[23] E. Baccaglini, T. Tillo, and G. Olmo. Image and Video Transmission: A Comparison

Study of Using Unequal Loss Protection And Multiple Description Coding. Multimedia

Tools and Applications, 55(2):247–259, 2011.

[24] S. Bae, G. Nam, and K. Park. Effective Content-Based Video Caching with Cache-

Friendly Encoding and Media-Aware Chunking. In Proc. of the 5th ACM Multimedia

Systems Conference, pages 203–212, Singapore, Singapore, March 19-21 2014.

[25] M. Blestel and M. Raulet. Open SVC Decoder: A Flexible SVC Library. In Proc. of

the International Conference on Multimedia (MM 2010), pages 1463–1466, 2010.

[26] C. Boldrini, M. Conti, and A. Passarella. Exploiting Users’ Social Relations to For-

ward Data in Opportunistic Networks: The HiBOp Solution. Pervasive and Mobile

Computing, 4(15):633–657, October 2008.

[27] S. Brin and L. Page. The Anatomy of a Large-scale Hypertextual Web Search Engine.

Computer Networks and ISDN Systems, 30(1):107–117, 1998.


[28] H. Cai, B. Zeng, G. Shen, Z. Xiong, and S. Li. Error-Resilient Unequal Error Protection

of Fine Granularity Scalable Video Bitstreams. EURASIP Journal on Advances in

Signal Processing, 2006.

[29] A. Carroll and G. Heiser. An Analysis of Power Consumption in a Smartphone. In

Proc. of USENIX Annual Technical Conf. (ATC), pages 21–34, Boston, MA, June

23-25 2010.

[30] D. M. Chandler and S. S. Hemami. VSNR: A Wavelet-based Visual Signal-to-noise

Ratio for Natural Images. IEEE Transactions on Image Processing, 16(9):2284–2298,

2007.

[31] Z. H. Chang, B. F. Jong, W. J. Wong, and M. D. Wong. Distributed Video Transcoding

on a Heterogeneous Computing Platform. In Proc. of IEEE Asia Pacific Conference

on Circuits and Systems (APCCAS), pages 444–447. IEEE, 2016.

[32] M. Chen. AMVSC: A Framework of Adaptive Mobile Video Streaming in the Cloud. In

Proc. of IEEE Global Communications Conference (GLOBECOM), pages 2042–2047,

Anaheim, California, December 3-7 2012.

[33] R. Cheng, W. Wu, Y. Lou, and Y. Chen. A Cloud-Based Transcoding Framework for

Real-Time Mobile Video Conferencing System. In Proc. of 2nd IEEE International

Conference on Mobile Cloud Computing, Services, and Engineering (MobileCloud),

pages 236–245, London, UK, April 7-10 2014.

[34] J. Chesterfield, R. Chakravorty, I. Pratt, S. Banerjee, and P. Rodriguez. Exploiting

Diversity to Enhance Multimedia Streaming Over Cellular Links. In Proc. of 24th IEEE

Conference on Computer Communications (INFOCOM), pages 2020–2031, Miami, FL,

March 13-17, 2005.


[35] H. Choi, J. Nam, D. Sim, and I. V. Bajic. Scalable Video Coding Based on High Effi-

ciency Video Coding (HEVC). In IEEE Pacific Rim Conference on Communications,

Computers and Signal Processing (PacRim), pages 346–351, Aug 2011.

[36] Cisco. Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update,

2015—2020, February 2016.

[37] I. R. communication Standardization Sector (ITU-R). Recommendation ITU-R

BT.500-11: Methodology for the Subjective Assessment of the Quality of Television

Pictures, June 2002.

[38] N. Damera-Venkata, T. D. Kite, W. S. Geisler, B. L. Evans, and A. C. Bovik. Image

Quality Assessment Based on a Degradation Model. IEEE Transactions on Image

Processing, 9(4):636–650, 2000.

[39] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.

Communications of the ACM, 52(1):107–113, January 2008.

[40] C. Diaz, J. Cabrera, F. Jaureguizar, and N. Garcia. Adaptive Protection Scheme for

MVC-Encoded Stereoscopic Video Streaming in IP-Based Networks. In Proc. of IEEE

Visual Communications and Image Processing (VCIP), pages 1–6, San Diego, CA,

November 27-30 2012.

[41] Digital Fountain Incorporated (a Qualcomm Company). Application Layer Forward Error Correction

for Mobile Multimedia Broadcasting Case Study, 2009.

[42] H. Dung and S. Vafi. An Adaptive Unequal Error Protection Based on Motion Energy

of H.264/AVC Video Frames. In IEEE Conference on Wireless Communications and

Networking Conference (WCNC’13), pages 4594–4599, 2013.

[43] M. Eberhard, L. Celetto, C. Timmerer, E. Quacchio, and H. Hellwagner. Performance

Analysis of Scalable Video Adaptation: Generic versus Specific Approach. In Proc. of


the 9th Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS

2008), pages 50–53, May 2008.

[44] A. Eichhorn and P. Ni. Pick your layers wisely-a quality assessment of h. 264 scal-

able video coding for mobile devices. In Proc. of IEEE International Conference on

Communications (ICC 2009), pages 1–6, Dresden, Germany, June 2009.

[45] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discov-

ering Clusters in Large Spatial Databases with Noise. In Proc. of the 2nd International

Conference on Knowledge Discovery and Data Mining (KDD-96), number 34, pages

226–231, Portland, Oregon, August 2-4 1996.

[46] D. Ferreira, A. K. Dey, and V. Kostakos. Understanding Human-Smartphone Con-

cerns: A Study of Battery Life. Lecture Notes in Computer Science, Pervasive Com-

puting, 6696:19–33, 2011.

[47] M. L. Fisher. An Applications Oriented Guide to Lagrangian Relaxation. in Interfaces,

15(2):10–21, March-April 1985.

[48] F. H. P. Fitzek, P. Kyritsi, and M. D. Katz. Cooperation in Wireless Networks: Prin-

ciples and Applications, chapter Power Consumption and Spectrum Usage Paradigms

in Cooperative Wireless Networks, pages 365–386. Springer, 2006.

[49] G. Gao, W. Zhang, Y. Wen, Z. Wang, W. Zhu, and Y. P. Tan. Cost Optimal Video

Transcoding in Media Cloud: Insights from User Viewing Pattern. In Proc. of IEEE In-

ternational Conference on Multimedia and Expo (ICME), pages 1–6, Chengdu, China,

July 14-18 2014.

[50] A. Garcia, H. Kalva, and B. Furht. A Study of Transcoding on Cloud Environments

for Video Content Delivery. In Proc. of 2010 ACM Multimedia Workshop on Mobile

Cloud Media Computing, pages 13–18. ACM, 2010.


[51] C. Gong, G. Yue, and X. Wang. Message-wise Unequal Error Protection Based on Low-

Density Parity-Check Codes. IEEE Transactions on Communications, 59(4):1019–

1030, 2011.

[52] R. Gupta, A. Pulipaka, P. Seeling, L. Karam, and M. Reisslein. H.264 Coarse Grain

Scalable (CGS) and Medium Grain Scalable (MGS) Encoded Video: A Trace Based

Traffic and Quality Evaluation. IEEE Transactions on Broadcasting, 58(3):428–439,

September 2012.

[53] H. Ha and C. Yim. Layer-weighted Unequal Error Protection for Scalable Video Coding

Extension of H. 264/AVC. IEEE Transactions on Consumer Electronics, 54(2):736–

744, 2008.

[54] B. Han, P. Hui, V. A. Kumar, M. V. Marathe, G. Pei, and A. Srinivasan. Cellular

Traffic Offloading through Opportunistic Communications: A Case Study. In Proc. of

5th ACM Workshop on Challenged Networks, pages 31–38, Chicago, IL, September

20-24, 2010.

[55] A. Heikkinen, J. Sarvanko, M. Rautiainen, and M. Ylianttila. Distributed Multi-

media Content Analysis with MapReduce. In Proc. of the 24th IEEE International

Symposium on Personal Indoor and Mobile Radio Communications (PIMRC), pages

3497–3501, London, UK, September 8-11 2013.

[56] T. Ho, R. Koetter, M. Medard, D. R. Karger, and M. Effros. The Benefits of Cod-

ing over Routing in a Randomized Setting. In Proc. of International Symposium on

Information Theory (ISIT 2003), page 442, Yokohama, Japan, June 29 - July 4, 2003.

[57] J.-C. Huang, C.-Y. Wu, and J.-J. Chen. On High Efficient Cloud Video Transcod-

ing. In Proc. of 2015 International Symposium on Intelligent Signal Processing and

Communication Systems (ISPACS), pages 170–173. IEEE, 2015.


[58] Z. Huang, C. Mei, L. Li, and T. Woo. CloudStream: Delivering High-Quality Stream-

ing Videos Through a Cloud-based SVC Proxy. In Proc. of IEEE Conference on

Computer Communications (INFOCOM), pages 201–205, Shanghai, China, April 10-

15 2011.

[59] P. Hui, J. Crowcroft, and E. Yoneki. Bubble Rap: Social-Based Forwarding in

Delay-Tolerant Networks. IEEE Transactions on Mobile Computing, 10(11):1576–

1589, November 2011.

[60] P. Hui, J. Crowcroft, and E. Yoneki. Bubble Rap: Social-Based Forwarding in

Delay-Tolerant Networks. IEEE Transactions on Mobile Computing, 10(11):1576–

1589, November 2011.

[61] Y. Huo, M. El-Hajjar, and L. Hanzo. Inter-Layer FEC Aided Unequal Error Protection

for Multilayer Video Transmission in Mobile TV. IEEE Transactions on Circuits and

Systems for Video Technology, 23(9):1622–1634, 2013.

[62] S. Ioannidis, A. Chaintreau, and L. Massoulie. Optimal and Scalable Distribution of

Content Updates Over a Mobile Social Network. In Proc. of 28th IEEE Conference

on Computer Communications (INFOCOM), pages 1422–1430, Rio de Janeiro, Brazil,

April 19-25, 2009.

[63] S. Ioannidis, A. Chaintreau, and L. Massoulie. Optimal and Scalable Distribution of

Content Updates Over a Mobile Social Network. In Proc. of 28th IEEE Conference

on Computer Communications (INFOCOM), pages 1422–1430, Rio de Janeiro, Brazil,

April 19-25 2009.

[64] ITU-T. Recommendation Y.1541 : Network Performance Objectives for IP-based

Services, 2011.

[65] ITU-T. H.264/AVC Reference Software, 2016.


[66] S. Jakubczak and D. Katabi. A Cross-layer Design for Scalable Mobile Video. In

Proc. of ACM Multimedia 2011 (MM 2011), pages 289–300, November 28-December

1 2011.

[67] S. Jeannin and A. Divakaran. MPEG-7 Visual Motion Descriptors. IEEE Transactions

on Circuits and Systems for Video Technology, 11(6):720–724, June 2001.

[68] F. Jokhio, A. Ashraf, S. Lafond, and J. Lilius. A Computation and Storage Trade-off

Strategy for Cost-Efficient Video Transcoding in the Cloud. In Proc. of 39th Con-

ference on Software Engineering and Advanced Applications (SEAA), pages 365–372,

Santander, Spain, September 4-6 2013.

[69] F. Jokhio, A. Ashraf, S. Lafond, I. Porres, and J. Lilius. Prediction-Based Dynamic Re-

source Allocation for Video Transcoding in Cloud Computing. In Proc. of the 21st In-

ternational Conference on Parallel, Distributed and Network-Based Processing (PDP),

pages 254–261, Belfast, Northern Ireland, February 27-March 1 2013.

[70] F. Jokhio, T. Deneke, S. Lafond, and J. Lilius. Bit Rate Reduction Video Transcoding

with Distributed Computing. In Proc. of the 20th International Conference on Parallel,

Distributed and Network-Based Processing (PDP), pages 206–212, Garching, Germany,

February 15-17 2012.

[71] A. B. Kahn. Topological Sorting of Large Networks. Communications of the ACM,

5(11):558–562, 1962.

[72] L. Keller, A. Le, B. Cici, H. Seferoglu, C. Fragouli, and A. Markopoulou. MicroCast:

Cooperative Video Streaming on Smartphones. In Proc. of 10th International Con-

ference on Mobile Systems, Applications, and Services (MobiSys), pages 57–70, Low

Wood Bay, United Kingdom, June 25-29, 2012.

[73] L. Keller, A. Le, B. Cici, H. Seferoglu, C. Fragouli, and A. Markopoulou. MicroCast:


Cooperative Video Streaming on Smartphones. In Proc. of 10th International Con-

ference on Mobile Systems, Applications, and Services (MobiSys), pages 57–70, Low

Wood Bay, United Kingdom, June 25-29 2012.

[74] J. Kim, R. M. Mersereau, and Y. Altunbasak. Error-Resilient Image and Video Trans-

mission over the Internet using Unequal Error Protection. IEEE Transactions on

Image Processing, 12(2):121–131, 2003.

[75] M. Kim, Y. Cui, S. Han, and H. Lee. Towards Efficient Design and Implementation of

a Hadoop-Based Distributed Video Transcoding System in Cloud Computing Environ-

ment. International Journal of Multimedia and Ubiquitous Engineering, 8(2):213–224,

2013.

[76] M. Kim, S. Han, Y. Cui, H. Lee, H. Cho, and S. Hwang. CloudDMSS: Robust Hadoop-

Based Multimedia Streaming Service Architecture for a Cloud Computing Environ-

ment. Cluster Computing, 17(3):605–628, September 2014.

[77] F. Lao, X. Zhang, and Z. Guo. Parallelizing Video Transcoding Using Map-Reduce-

Based Cloud Computing. In Proc. of IEEE International Symposium on Circuits and

Systems (ISCAS), pages 2905–2908, Seoul, Korea, May 20-23 2012.

[78] H. Y. Lee, H. K. Lee, and Y. H. Ha. Spatial Color Descriptor for Image Retrieval And

Video Segmentation. IEEE Transactions on Multimedia, 5(3):358–367, 2003.

[79] J.-S. Lee, F. De Simone, and T. Ebrahimi. Subjective Quality Assessment of Scalable

Video Coding: A Survey. In Proc. of the 3rd International Workshop on Quality of

Multimedia Experience (QoMEX 2011), pages 25–30, Mechelen, Belgium, September

2011.

[80] S. Li and S. G. Chan. BOPPER: Wireless Video Broadcasting With Peer-to-Peer

Error Recovery. In Proc. of IEEE International Conference on Multimedia and Expo


(ICME), pages 392–395, Beijing, China, July 2-5 2007.

[81] S. Li and S. H. G. Chan. BOPPER: Wireless Video Broadcasting With Peer-to-Peer

Error Recovery. In Proc. of IEEE International Conference on Multimedia and Expo

(ICME), pages 392–395, Beijing, China, July 2-5, 2007.

[82] X. Li, P. Amon, A. Hutter, and A. Kaup. Performance Analysis of Inter-Layer Pre-

diction in Scalable Video Coding Extension of H.264/AVC. IEEE Transactions on

Broadcasting, 57(1):66–74, March 2011.

[83] Z. Li, Y. Huang, G. Liu, F. Wang, Z.-L. Zhang, and Y. Dai. Cloud Transcoder:

Bridging the Format and Resolution Gap Between Internet Videos and Mobile Devices.

In Proc. of the 22nd International Workshop on Network and Operating System Support

for Digital Audio and Video (NOSSDAV), pages 33–38, June 7-8 2012.

[84] A. Limmanee and W. Henkel. UEP Network Coding for Scalable Data. In Proc. of

5th International Symposium on Turbo Codes and Related Topics, pages 333–337, Lau-

sanne, Switzerland, September 1-5 2008.

[85] Z. Liu, C. Wu, B. Li, and S. Zhao. UUSee: Large Scale Operational On Demand

Streaming with Random Network Coding. In Proc. of 29th Annual IEEE International

Conference on Computer Communications (INFOCOM), pages 1–9, San Diego, CA,

March 15-19 2010.

[86] E. Maani and A. K. Katsaggelos. Unequal Error Protection for Robust Streaming

of Scalable Video over Packet Lossy Networks. IEEE Transactions on Circuits and

Systems for Video Technology, 20(3):407–416, 2010.

[87] D. MacKay. Fountain Codes. in IEEE Proceedings of Communications, 152(6):1062–

1068, December 2005.


[88] V. Magoulianitis and I. Katsavounidis. HEVC Decoder Optimization in Low Power

Configurable Architecture for Wireless Devices. In IEEE 16th International Sympo-

sium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), pages

1–6. IEEE, 2015.

[89] J. L. Mannos and D. J. Sakrison. The Effects of a Visual Fidelity Criterion of the

Encoding of Images. IEEE Transactions on Information Theory, 20(4):525–536, 1974.

[90] B. Masnick and J. Wolf. On Linear Unequal Error Protection Codes. IEEE Transac-

tions on Information Theory, 13(4):600–607, 1967.

[91] L. Merritt and R. Vanam. x264: A High Performance H.264/AVC Encoder.

http://neuron.net/library/avc/overview, 2006.

[92] A. Moorthy, L. K. Choi, A. Bovik, and G. de Veciana. Video Quality Assessment

on Mobile Devices: Subjective, Behavioral and Objective Studies. IEEE Journal of

Selected Topics in Signal Processing, 6(6):652–671, Oct 2012.

[93] A. Nafaa, T. Taleb, and L. Murphy. Forward Error Correction Strategies for Media

Streaming over Wireless Networks. IEEE Communications Magazine, 46(1), 2008.

[94] S. Nazir, V. Stankovic, I. Andonovic, and D. Vukobratovic. Application Layer System-

atic Network Coding for Sliced H.264/AVC Video Streaming. Advances in Multimedia,

2012:1–9, 2012.

[95] S. Nazir, V. Stankovic, and D. Vukobratovic. Expanding Window Random Linear

Codes for Data Partitioned H.264 Video Transmission over DVB-H Network. In

18th IEEE International Conference on Image Processing (ICIP’11), pages 2205–2208,

2011.

[96] S. Nazir, V. Stankovic, and D. Vukobratovic. Unequal Error Protection for Data Parti-

tioned H.264/AVC Video Streaming with Raptor and Random Linear Codes for DVB-


H Networks. In IEEE International Conference on Multimedia and Expo (ICME),

pages 1–6, 2011.

[97] T. Oelbaum, H. Schwarz, M. Wien, and T. Wiegand. Subjective Performance Evalu-

ation of the SVC Extension of H.264/AVC. In Proc. of the 15th IEEE International

Conference on Image Processing (ICIP 2008), pages 2772–2775, October 2008.

[98] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand. Comparison of

the Coding Efficiency of Video Coding Standards — Including High Efficiency Video

Coding (HEVC). IEEE Transactions on Circuits and Systems for Video Technology,

22(12):1669–1684, 2012.

[99] D. K. Park, Y. S. Jeon, and C. S. Won. Efficient Use of Local Edge Histogram

Descriptor. In Proc. of the 2000 ACM Workshops on Multimedia, pages 51–54, 2000.

[100] V. Pavlushkov, R. Johannesson, and V. Zyablov. Unequal Error Protection for Con-

volutional Codes. IEEE Transactions on Information Theory, 52(2):700–708, 2006.

[101] M. V. Pedersen, F. H. P. Fitzek, and J. Heide. On-the-Fly Packet Error Recovery in a

Cooperative Cluster of Mobile Devices. In Proc. of IEEE Global Telecommunications

Conference (GLOBECOM), pages 1–6, Houston, TX, December 5-9, 2011.

[102] M. V. Pedersen, J. Heide, F. H. Fitzek, and T. Larsen. Pictureviewer: A Mobile

Application Using Network Coding. In Proc. of European Wireless Conference, pages

151–156, Aalborg, Denmark, May 17-20, 2009.

[103] V. Pedersen and F. H. P. Fitzek. Implementation and Performance Evaluation of

Network Coding for Cooperative Mobile Devices. In Proc. of IEEE International

Conference on Communications (ICC), pages 91–96, Beijing, China, May 19-23, 2008.

[104] H.-D. Pham and V. Sina. Unequal Error Protection of H. 264/AVC Video Bitstreams

Based on Data Partitioning and Motion Information of Slices. In IEEE International


Conference on Signal Processing, Communication and Computing (ICSPCC 2012),

pages 634–639, Hong Kong, China, August 2012.

[105] I. Politis, L. Dounis, and T. Dagiuklas. H.264/SVC vs. H.264/AVC Video Quality

Comparison under QoE-driven Seamless Handoff. Signal Processing: Image Commu-

nication, 27(8):814 – 826, 2012.

[106] A. Pulipaka, P. Seeling, M. Reisslein, and L. Karam. Overview and Traffic Characteri-

zation of Coarse-Grain Quality Scalable (CGS) H.264 SVC Encoded Video. In Proc. of

the 7th IEEE Consumer Communications and Networking Conference (CCNC 2010),

pages 1–5, Las Vegas, NV, January 2010.

[107] A. Ramasubramonian and J. Woods. Video Multicast Using Network Coding. In

Proc. of SPIE Conference on Visual Communications and Image Processing (VCIP),

pages 1–11, San Jose, CA, January 2009.

[108] R. Razavi, M. Fleury, M. Altaf, H. Sammak, and M. Ghanbari. H.264 Video Streaming

with Data-Partitioning and Growth Codes. In IEEE 16th International Conference on

Image Processing, pages 90–912, 2009.

[109] P. Rodriguez, R. Chakravorty, J. Chesterfield, I. Pratt, and S. Banerjee. MAR: A

Commuter Router Infrastructure for the Mobile Internet. In Proc. of 2nd International

Conference on Mobile Systems, Applications, and Services (MobiSys), pages 217–230,

Boston, MA, June 6-9, 2004.

[110] S. S. and R. J. AHG Report on Spatial Scalability Resampling. Technical report,

Report JVT-R006 of the 18th Meeting of the Joint Video Team, 2006.

[111] O. Salim and W. Xiang. A Novel Unequal Error Protection Scheme for 3D Video Trans-

mission over Cooperative MIMO-OFDM Systems. in EURASIP Journal on Wireless

Communications and Networking, (1):269–283, 2012.


[112] H. Schwarz, D. Marpe, and T. Wiegand. Overview of the Scalable Video Coding

Extension of the H. 264/AVC Standard. IEEE Transactions on Circuits and Systems

for Video Technology, 17(9):1103–1120, 2007.

[113] P. Seeling, F. H. P. Fitzek, G. Ertli, A. Pulipaka, and M. Reisslein. Video Network

Traffic and Quality Comparison of VP8 and H.264 SVC. In Proc. of the 3rd Workshop

on Mobile Video Delivery (MoViD 2010), pages 33–38, 2010.

[114] H. Seferoglu, L. Keller, B. Cici, A. Le, and A. Markopoulou. Cooperative Video

Streaming on Smartphones. In Proc. of 49th Annual Allerton Conference on Commu-

nication, Control, and Computing, pages 220–227, Urbana Champaign, IL, September

28-30, 2011.

[115] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack. Study of

Subjective and Objective Quality Assessment of Video. IEEE Transactions on Image

Processing, 19(6):1427–1441, 2010.

[116] H. Sheikh and A. Bovik. Image Information and Visual Quality. IEEE Transactions

on Image Processing, 15(2):430–444, February 2006.

[117] H. Shojania and B. Li. Random Network Coding on the iPhone: Fact or Fiction? In

Proc. of 18th International Workshop on Network and Operating Systems Support for

Digital Audio and Video (NOSSDAV), pages 37–42, Williamsburg, VA, June 3-5, 2009.

[118] T. Sikora. The MPEG-7 Visual Standard for Content Description - An Overview.

IEEE Transactions on Circuits and Systems for Video Technology, 11(6):696–702, June

2001.

[119] T. Sikora. The MPEG-7 Visual Standard for Content Description - An Overview.

IEEE Transactions on Circuits and Systems for Video Technology, 11(6):696–702,

2001.


[120] M. Slanina, M. Ries, and J. Vehkapera. Rate Distortion Performance of H.264/SVC

in Full HD with Constant Frame Rate and High Granularity. In Proc. of the 8th

International Conference on Digital Telecommunications (ICDT 2013), pages 7–13,

Venice, Italy, April 2013.

[121] H. Soroush, P. Gilbert, N. Banerjee, M. D. Corner, B. N. Levine, and L. Cox. Spider:

Improving Mobile Networking with Concurrent Wi-Fi Connections. ACM SIGCOMM

Computer Communication Review, 41(4):402–403, August 2011.

[122] Statista. Number of Mobile Applications Available in Leading App Stores as of July

2015, July 2015.

[123] M. Stiemerling and S. Kiesel. A System for Peer-to-Peer Video Streaming in Resource

Constrained Mobile Environments. In Proc. of 1st ACM Workshop on User-Provided

Networking (U-NET), pages 25–30, Rome, Italy, December 1-4, 2009.

[124] T. Stockhammer and M. Bystrom. H.264/AVC Data Partitioning for Mobile Video

Communication. In International Conference on Image Processing (ICIP’04), vol-

ume 1, pages 545–548, 2004.

[125] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the High Efficiency

Video Coding (HEVC) Standard. IEEE Transactions on Circuits and Systems for

Video Technology, 22(12):1649–1668, 2012.

[126] N. Thomos, S. Argyropoulos, N. V. Boulgouris, and M. G. Strintzis. Robust Trans-

mission of H.264/AVC Streams using Adaptive Group Slicing and Unequal Error Pro-

tection. EURASIP Journal on Applied Signal Processing, 2006:1–13, 2006.

[127] N. Thomos, J. Chakareski, and P. Frossard. Prioritized Distributed Video Delivery

With Randomized Network Coding. in IEEE Transactions on Multimedia, 13(4):776–

787, August 2011.


[128] C. L. Tsao and R. Sivakumar. On Effectively Exploiting Multiple Wireless Interfaces

in Mobile Hosts. In Proc. of 5th Int. Conf. on Emerging Networking Experiments and

Technologies (CoNEXT), pages 337–348, Rome, Italy, December 1-4 2009.

[129] C. L. Tsao and R. Sivakumar. On Effectively Exploiting Multiple Wireless Interfaces

in Mobile Hosts. In Proc. of 5th International Conference on Emerging Networking

Experiments and Technologies (CoNEXT), pages 337–348, Rome, Italy, December 1-4

2009.

[130] I. Unanue, I. Urteaga, R. Husemann, J. D. Ser, V. Roesler, A. Rodriguez, and

P. Sanchez. A Tutorial on H.264/SVC Scalable Video Coding and its Tradeoff between

Quality, Coding Efficiency and Performance. Recent Advances on Video Coding. In-

tech Open Access Publisher, 2011.

[131] G. Van der Auwera, P. David, and M. Reisslein. Traffic and Quality Characterization of

Single-Layer Video Streams Encoded with the H.264/MPEG-4 Advanced Video Coding

Standard and Scalable Video Coding Extension. IEEE Transactions on Broadcasting,

54(3):698–718, Sept 2008.

[132] A. Vetro. MPEG-21 Digital Item Adaptation: Enabling Universal Multimedia Access.

IEEE MultiMedia, 11(1):84–87, January 2004.

[133] P. Vingelmann, M. V. Pedersen, J. Heide, F. H. P. Fitzek, and H. Charaf. Synchronized

Multimedia Streaming on the iPhone Platform with Network Coding. in IEEE Com-

munications Magazine, 49(6):126–132, June 2011.

[134] D. Vukobratovic and V. Stankovic. Unequal Error Protection Random Linear Coding

for Multimedia Communications. In IEEE International Workshop on Multimedia

Signal Processing, pages 280–285, 2010.


[135] D. Vukobratovic et al. Unequal Error Protection Random Linear Coding Strategies for

Erasure Channels. IEEE Transactions on Communications, 60(5):1243–1252, 2012.

[136] F. Wang, J. Liu, and M. Chen. CALMS: Cloud-Assisted Live Media Streaming for

Globalized Demands with Time/Region Diversities. In Proc. of IEEE Conference on

Computer Communications (INFOCOM), pages 199–207, Orlando, Florida, March 25-

30 2012.

[137] H. Wang and C. Kuo. Robust Video Multicast With Joint Network Coding and AL-

FEC. In Proc. of IEEE International Symposium on Circuits and Systems (ISCAS

2008), pages 2062–2065, Seattle, Washington, May 18-21 2008.

[138] H. Wang, S. Xiao, and C. Kuo. Random Linear Network Coding With Ladder-Shaped

Global Coding Matrix for Robust Video Transmission. in Elsevier Journal of Visual

Communication and Image Representation, 22(3):203–212, April 2011.

[139] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image Quality Assessment: from

Error Visibility to Structural Similarity. IEEE Transactions on Image Processing,

13(4):600–612, April 2004.

[140] Z. Wang and A. C. Bovik. A Universal Image Quality Index. IEEE Signal Processing

Letters, 9(3):81–84, 2002.

[141] Z. Wang, E. Simoncelli, and A. Bovik. Multiscale Structural Similarity for Image

Quality Assessment. In Proc. of the 37th Asilomar Conference on Signals, Systems

and Computers, volume 2, pages 1398–1402, November 2003.

[142] WebM. VP9 Video Codec, June 2013.

[143] J. Whitbeck, M. Amorim, Y. Lopez, J. Leguay, and V. Conan. Relieving the Wireless

Infrastructure: When Opportunistic Networks Meet Guaranteed Delays. In Proc. of


IEEE International Symposium on a World of Wireless, Mobile and Multimedia Net-

works (WoWMoM), pages 1–10, Paris, France, June 20-24, 2011.

[144] S. Wicker and V. Bhargava. Reed-Solomon Codes and Their Applications. Wiley, 1999.

[145] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the

H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems

for Video Technology, 13(7):560–576, 2003.

[146] M. Wien, H. Schwarz, and T. Oelbaum. Performance Analysis of SVC. IEEE Trans-

actions on Circuits and Systems for Video Technology, 17(9):1194–1203, September

2007.

[147] C. S. Won, D. K. Park, and S.-J. Park. Efficient Use of MPEG-7 Edge Histogram

Descriptor. ETRI journal, 24(1):23–30, 2002.

[148] J. W. Woods. Multidimensional Signal, Image, and Video Processing and Coding.

Academic Press, 2 edition, July 2011.

[149] Y. Wu, C. Wu, B. Li, and F. C. Lau. vSkyConf: Cloud-assisted Multi-party Mobile

Video Conferencing. In Proc. of the 2nd ACM SIGCOMM Workshop on Mobile Cloud

Computing, pages 33–38, Hong Kong, China, August 12 2013.

[150] R. Xu, D. Wunsch, et al. Survey of Clustering Algorithms. IEEE Transactions on

Neural Networks, 16(3):645–678, May 2005.

[151] M. Yang, J. Cai, W. Zhang, Y. Wen, and C. H. Foh. Adaptive Configuration of Cloud

Video Transcoding. In Proc. of 2015 IEEE International Symposium on Circuits and

Systems (ISCAS), pages 1658–1661. IEEE, 2015.

[152] YouTube. YouTube Live Encoder Settings, Bitrates and Resolutions, March 2016.


[153] M. R. Zakerinasab and M. Wang. An Update Model for Network Coding in Cloud

Storage Systems. In Proc. of 50th Annual Allerton Conference on Communication,

Control, and Computing (Allerton), pages 1–8, Monticello, IL, October 1-5 2012.

[154] M. R. Zakerinasab and M. Wang. A Cloud-Assisted Energy Efficient Video Streaming

System for Smartphones. In Proc. of IEEE/ACM International Symposium on Quality

of Service (IWQoS 2013), pages 1–10, Montreal, Canada, June 3-4 2013.

[155] M. R. Zakerinasab and M. Wang. DeltaNC: Efficient File Updates for Network-

Coding-Based Cloud Storage Systems. In Proc. of IEEE 21st International Symposium

on Modeling, Analysis and Simulation of Computer and Telecommunication Systems

(MASCOTS), pages 1–5, San Francisco, CA, August 14-16 2013.

[156] M. R. Zakerinasab and M. Wang. An Anatomy of H.264/SVC for Full HD Video

Streaming. In Proc. of IEEE 22nd International Symposium on Modeling, Analysis

and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 1–

10, Paris, France, September 9-11 2014.

[157] M. R. Zakerinasab and M. Wang. Optimal Rate Allocation and Scheduling in Coop-

erative Streaming. In Proc. of IEEE 39th Conference on Local Computer Networks

(LCN 2014), pages 1–4, Edmonton, Canada, September 8-11 2014.

[158] M. R. Zakerinasab and M. Wang. Adaptive Video Streaming in Heterogeneous Mobile

Networks. In Proc. of IEEE Wireless Communications and Networking Conference

(WCNC 2015), pages 1–6, New Orleans, LA, March 9-12 2015.

[159] M. R. Zakerinasab and M. Wang. Dependency-Aware Distributed Video Transcoding

in the Cloud. In Proc. of IEEE 40th Conference on Local Computer Networks (LCN

2015), pages 1–8, Clearwater Beach, FL, October 26-29 2015.

[160] M. R. Zakerinasab and M. Wang. Dependency-Aware Unequal Error Protection for


Layered Video Coding. In Proc. of ACM Multimedia 2015 (MM 2015), pages 1–10,

Brisbane, Australia, October 26-30 2015.

[161] M. R. Zakerinasab and M. Wang. Does Chunk Size Matter in Distributed Video

Transcoding? In Proc. of IEEE/ACM International Symposium on Quality of Service

(IWQoS 2015), pages 1–2, Portland, OR, June 15-16 2015.

[162] M. R. Zakerinasab and M. Wang. Inspecting Coding Dependency in Layered Video

Coding for Efficient Unequal Error Protection. In Proc. of 35th IEEE International

Conference on Distributed Computing Systems (ICDCS 2015), pages 1–2, Columbus,

OH, June 29 - July 2 2015.

[163] M. R. Zakerinasab and M. Wang. Practical Network Coding for the Update Problem

in Cloud Storage Systems. In To appear in IEEE Transactions on Network and Service

Management (IEEE TNSM), pages 1–14, March 2017.

[164] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering

Method for Very Large Databases. ACM SIGMOD Record, 25(2):103–114, June 1996.

[165] W. Zhang, Y. Wen, and H.-H. Chen. Toward Transcoding as a Service: Energy-

Efficient Offloading Policy for Green Mobile Cloud. IEEE Network, 28(6):67–73,

November 2014.

[166] X. Zhang, X. Peng, D. Wu, T. Porter, and R. Haywood. A Hierarchical Unequal Packet

Loss Protection Scheme for Robust H.264/AVC Transmission. In 6th IEEE Consumer

Communications and Networking Conference (CCNC 2009), pages 1–5. IEEE, January

10-13 2009.

[167] Y. Zhao, H. Jiang, K. Zhou, Z. Huang, and P. Huang. Meeting Service Level Agreement

Cost-Effectively for Video-on-Demand Applications in the Cloud. In Proc. of IEEE


Conference on Computer Communications (INFOCOM), pages 298–306, Toronto,

Canada, April 27 - May 2 2014.

Appendix A

High Efficiency Video Coding

In this appendix we provide a brief review of the H.265/HEVC video coding standard [125]. High Efficiency Video Coding (HEVC) is the latest video coding standard developed by the Joint Collaborative Team on Video Coding (JCT-VC). The same standard is published by ITU-T

Study Group 16 Video Coding Experts Group (VCEG) as ITU-T H.265. Therefore, it is

common to refer to this standard as H.265/HEVC. The initial version of the H.265/HEVC

standard was ratified in 2013.

Similar to H.264/AVC, H.265/HEVC was developed with the goal of providing twice

the coding and compression efficiency of its predecessor. In other words, compared to H.264/AVC, H.265/HEVC is expected to provide similar video quality at roughly half the bitrate. Although performance results vary depending on the type of content and the encoder settings, HEVC is reported to reduce the bitrate of the encoded video by 40% to 60%. From another perspective, at a constant bitrate, HEVC is expected to provide significantly better video quality. The higher coding efficiency is accompanied by support for higher video resolutions and frame rates. The highest H.264/AVC level (level 5.2) supports 4K video at a frame rate of 60 fps with a maximum bitrate of 720 Mbps. In comparison, the current highest level of H.265/HEVC (level 6.2) supports 4K video at 300 fps and 8K video at 120 fps with a maximum bitrate of 800 Mbps.

The cost of such a substantial performance improvement is the considerable increase in the

computational cost of encoding and decoding tasks.
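
To put these figures in perspective, the following is a minimal Python sketch (an illustration, not a measurement from this thesis) that estimates the equivalent-quality HEVC bitrate range from a given H.264/AVC bitrate under the reported 40% to 60% reduction; the 12 Mbps Full HD example bitrate is an assumed value.

    def hevc_bitrate_range(avc_bitrate_mbps, low=0.40, high=0.60):
        """Estimate the HEVC bitrate range for equivalent quality, assuming
        the reported 40%-60% bitrate reduction relative to H.264/AVC."""
        return avc_bitrate_mbps * (1 - high), avc_bitrate_mbps * (1 - low)

    # Assumed example: a 12 Mbps Full HD H.264/AVC stream.
    lo, hi = hevc_bitrate_range(12.0)
    print("Equivalent-quality HEVC bitrate: %.1f to %.1f Mbps" % (lo, hi))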

Our measurements using the HEVC reference software, HM [6], show a 10 to 15 times increase in both encoding and decoding time compared to H.264/AVC. Optimized HEVC encoder and decoder libraries such as x265 [10] and openhevc [8] reduce this gap to as low as two to three times, which is still considerable given that decoding mostly happens on weaker devices such as smartphones. The problem is aggravated by the fact that, as of the end of 2016, only a few high-end smartphones were equipped with H.265/HEVC hardware decoders, while almost all current smartphones have hardware decoders for H.264/AVC. The situation is not much better for H.265/HEVC rivals such as Google’s VP9 [142].
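
One rough way to reproduce this kind of comparison, assuming ffmpeg is installed and given two clips with assumed file names (the same source encoded once with H.264/AVC and once with HEVC), is to time software decoding to a null output, as in the Python sketch below. This is only a coarse proxy for the HM-based measurements above, since wall-clock time also includes demuxing and I/O.

    import subprocess
    import time

    def decode_time(video_path):
        """Decode the clip with ffmpeg, discard the output, and return the
        wall-clock decoding time in seconds."""
        start = time.perf_counter()
        subprocess.run(["ffmpeg", "-loglevel", "error", "-i", video_path,
                        "-f", "null", "-"], check=True)
        return time.perf_counter() - start

    t_avc = decode_time("clip_h264.mp4")    # assumed file name
    t_hevc = decode_time("clip_hevc.mp4")   # assumed file name
    print("HEVC/AVC software decoding time ratio: %.2fx" % (t_hevc / t_avc))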

The block diagram of the HEVC encoder/decoder is illustrated in Fig. A.1. The basis of video compression in HEVC, as in other video coding standards, is a hybrid of block-based intra-picture prediction and block-based motion-compensated inter-picture prediction. Compared to H.264/AVC, the higher coding performance of H.265/HEVC mainly stems from a larger and more flexible coding block structure, more accurate motion vector prediction, and the adoption of improved prediction tools, as briefly discussed here. First, instead of using macroblocks of 16 × 16 luma samples as the coding blocks, H.265/HEVC uses a quad-tree structured coding tree supporting coding blocks of up to 64 × 64 luma samples. Larger coding blocks increase the coding efficiency when higher resolution video is encoded. The HEVC coding tree is illustrated in Fig. A.2 and sketched in the example after the following list:

• Coding tree units and coding tree block (CTB) structure: At the first level, each frame is

partitioned into multiple coding tree units (CTU), where each CTU covers a rectangular

area of N × N luma samples (N = 16, 32, 64)1. The CTU size is selected by the encoder and signaled in the bitstream; larger CTUs typically provide better compression, especially for high-resolution content. Within each CTU, the quad-tree partitioning then adapts block sizes to the content: larger blocks suit low complexity areas of the frame, while smaller blocks suit high complexity areas or regions of interest.

Each CTU contains a coding tree block (CTB) for luma samples along with two chroma

CTBs and the associated syntax elements.

1 The structure of the coding blocks in Google’s VP9 is also very similar.

Figure A.1: The block diagram of the HEVC encoder/decoder (with decoder elements shaded in light gray) [125].

Figure A.2: Subdivision of a coding tree block (CTB) into coding blocks (CB) and transform blocks (TB). Solid lines indicate CB boundaries and dotted lines indicate TB boundaries. (a) CTB with its partitioning. (b) Corresponding quadtree [125].

• Coding units (CU) and coding blocks (CB): On the next level, each CTU is divided into one or more coding units (CU), where CUs are square but can be of different sizes.

Similar to CTU, each CU contains a coding block (CB) for luma samples along with

two chroma CBs and the associated syntax elements. The size of a CB can be as large

as the CTB, i.e., up to 64 × 64 luma samples, or as small as 8 × 8 luma samples. The

decision whether to code a picture area using intra-frame or inter-frame prediction is

made at the CU level. Each CU is further partitioned into prediction units (PU) and

also contains a tree of transform units (TU).

• Prediction units (PU) and prediction blocks (PB): Each PU contains one luma and two

chroma prediction blocks (PB). Each prediction block can be as large as 64 × 64 luma samples or as small as 4 × 4 luma samples and can be predicted from a same-size PB

using intra-frame or inter-frame prediction.

• Transform units (TU) and transform blocks (TB): The prediction residual is coded

using block transforms. A TU tree structure has its root at the CU level. The luma

CB residual may be identical to the luma transform block (TB) or may be further split

into smaller luma TBs. The same applies to the chroma TBs. Integer basis functions

similar to those of a discrete cosine transform (DCT) are defined for the square TB

sizes 4 × 4, 8 × 8, 16 × 16, and 32 × 32.
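
To make the recursive CTB-to-CB subdivision concrete, the following Python sketch performs a quad-tree split driven by a toy block-variance test. The variance threshold and the split rule are illustrative assumptions only; a real HEVC encoder chooses its partitioning through rate-distortion optimization rather than a fixed rule.

    import numpy as np

    MIN_CB, MAX_CTB = 8, 64  # smallest CB and largest CTB sizes in luma samples

    def split_ctb(luma, x, y, size, threshold=120.0):
        """Recursively partition a CTB into coding blocks (CBs): a block is
        split into four quadrants while it is larger than the minimum CB size
        and its sample variance exceeds the toy threshold."""
        block = luma[y:y + size, x:x + size]
        if size > MIN_CB and np.var(block) > threshold:
            half = size // 2
            cbs = []
            for dy in (0, half):
                for dx in (0, half):
                    cbs += split_ctb(luma, x + dx, y + dy, half, threshold)
            return cbs
        return [(x, y, size)]  # leaf node: one coding block

    # Assumed input: one 64x64 luma CTB filled with random samples.
    ctb = np.random.randint(0, 256, (MAX_CTB, MAX_CTB)).astype(np.float64)
    print(len(split_ctb(ctb, 0, 0, MAX_CTB)), "coding blocks")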

Along with the very flexible prediction block structure that allows a prediction block to range from 4 × 4 to 64 × 64 luma samples, HEVC uses an Advanced Motion Vector Prediction (AMVP) mechanism. In H.264/AVC, at most two motion vectors can be used to predict a block before the residual error is added, with the final prediction being the average of the luma and chroma samples of the two reference blocks. In HEVC, AMVP derives several most probable motion vector candidates from data in adjacent PBs and in the reference picture. A merge mode for MV coding can also be used, allowing the inheritance of MVs from temporally or spatially neighboring PBs. Furthermore, although both standards use quarter-sample motion vector precision, HEVC employs longer and stronger filters for the interpolation of fractional-sample positions, so motion can be represented more accurately, giving a better predicted block with less residual error. In addition, compared to the 9 intra-picture prediction modes of H.264/AVC, HEVC provides 35 intra-prediction modes, allowing the intra-frame predictor to follow directional structures in the picture much more precisely. Finally, in addition to the deblocking filter, HEVC introduces a new in-loop filter called Sample Adaptive Offset (SAO), which further reduces artifacts in the decoded video. These capabilities altogether allow HEVC to achieve the same video quality as H.264/AVC with a bitrate 40% to 60% lower.
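
As a simplified illustration of the AMVP idea (not the normative candidate derivation defined in the standard), the sketch below builds a short motion vector predictor list from neighboring PBs, picks the predictor closest to the actual motion vector, and signals only the candidate index plus the motion vector difference. The example vectors, in quarter-sample units, are assumed values.

    def amvp_encode(mv, spatial_neighbors, temporal_neighbor=None, max_candidates=2):
        """Pick the best motion vector predictor from neighboring PBs and
        return (candidate index, MV difference) to be entropy-coded."""
        neighbors = spatial_neighbors + ([temporal_neighbor] if temporal_neighbor else [])
        candidates = []
        for cand in neighbors:
            if cand is not None and cand not in candidates:
                candidates.append(cand)
            if len(candidates) == max_candidates:
                break
        # Choose the predictor that minimizes the MV difference magnitude.
        idx = min(range(len(candidates)),
                  key=lambda i: abs(mv[0] - candidates[i][0]) + abs(mv[1] - candidates[i][1]))
        mvd = (mv[0] - candidates[idx][0], mv[1] - candidates[idx][1])
        return idx, mvd

    # Assumed example motion vectors from two spatially neighboring PBs.
    index, mvd = amvp_encode(mv=(18, -7), spatial_neighbors=[(16, -8), (20, -4)])
    print(index, mvd)  # prints: 0 (2, 1)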
