
Audiovisual Quality Estimation for Mobile Video Services

Michal Ries, Bruno Gardlo

Abstract—Provisioning of mobile video services is rather challenging since in mobile environments bandwidth and processing resources are limited. Audiovisual content is present in most multimedia services; however, the user expectation of perceived audiovisual quality differs for speech and non-speech contents. The majority of recently proposed metrics for audiovisual quality estimation assumes only one continuous medium, either audio or video. In order to accurately predict the audiovisual quality of a multimedia system it is necessary to apply a metric that simultaneously takes into account audio as well as video quality. When assessing a multi-modal system, one cannot model it only as a simple combination of mono-modal models, because the pure combination of audio and video models does not give a robust perceived-quality performance metric. We show the importance of taking into account the cross-modal interaction between audio and video modes, also called the mutual compensation effect. In this contribution we report on measuring the cross-modal interaction and propose a content-adaptive audiovisual metric for video sequences that distinguishes between speech and non-speech audio. Furthermore, the proposed method allows for a reference-free audiovisual quality estimation, which reduces computational complexity and extends applicability.

Index Terms—audiovisual quality, multimedia, mobile services.

I. INTRODUCTION

While measuring or estimating speech quality has been a standard procedure for many decades, quality estimation of video and audio signals is a relatively new field. In particular for multimedia transmissions in mobile environments, mostly defined by simultaneous audio and video transmissions, quality estimation is challenging due to the limited data rate and processing resources. One of the challenges in mobile communications is to improve the subjective quality of audio and audio-visual services. Due to advances in audio and video compression and the wide-spread use of standard codecs such as AMR and AAC (audio) and MPEG-4/AVC (video), provisioning of audio-visual services is possible at low bit rates while preserving perceptual quality. The Universal Mobile Telecommunications System (UMTS) Release 4 (implemented by the first UMTS network elements and terminals) provides a maximum data rate of 1 920 kbps shared by all users in a cell, while Release 5 offers up to 14.4 Mbps in the downlink direction for High Speed Downlink Packet Access (HSDPA). The following audio and video codecs are supported by UMTS video services:

M. Ries is with the Institute of Communications and Radio Frequency Engineering, Vienna University of Technology, Gusshausstrasse 25/389, A-1040 Vienna, e-mail: [email protected]

B. Gardlo is with the Department of Telecommunications and Multimedia, University of Zilina, Univerzitna 1, 010 26 Zilina, e-mail: [email protected]

Manuscript received March 15, 2009.

For audio they include the AMR speech codec, AAC Low Complexity (AAC-LC) and AAC Long Term Prediction (AAC-LTP) [1]; for video they include H.263, MPEG-4 and MPEG-4/AVC [1]. The appropriate encoder settings for UMTS video services differ for various content and streaming application settings (resolution, frame and bit rate) [2].

End-user quality is influenced by a number of factors [2], [7], [8], [9] including mutual compensation effects between audio and video [4], [5], content, encoding, and network settings as well as transmission conditions [6]. This mutual compensation effect is very noticeable if, for example, a news speaker's face freezes while the speech continues. In this case it is not perceived as very harmful since most of the information is in the speech signal. A similar effect is given in a soccer match where the players freeze but the commentator's speech continues. If, however, there is no speaker involved and only the soccer scene is shown, a freeze of the video is perceived as very harmful and quality is rated as very low. Thus, audio and video are not only mixed in the multimedia stream, but there is even a synergy of the component media (audio and video), as was shown in [2], [7], [9], [10]. Mutual compensation effects cause perceptual differences in video with a dominant voice in the audio track rather than in video with other types of audio [9]. Video contents with a dominant voice include news, interviews, talk shows, and so on [3]. Finally, audio-visual quality estimation models tuned for video content with a dominant human voice perform better than general (universal) models [9], [2]. Therefore, our focus within this work is on the design of audiovisual metrics incorporating audio and video content features simultaneously.

We are looking at measures that do not need the original (non-compressed) sequence for the estimation of quality, because this reduces the complexity and at the same time broadens the possibilities for deploying the quality prediction. Furthermore, we investigated novel ensemble based estimation. Work on ensemble based estimation methods shows that such estimators are more beneficial than their single-classifier counterparts [13].

The paper is organized as follows: In Section II we describe a typical mobile video streaming scenario and a test setup for video quality evaluation. In Section III the extraction of video and audio features is described. The design of the ensemble based audiovisual estimator is presented in Section IV. Section V covers the performance evaluation of the ensemble based estimator and its comparison with state-of-the-art estimators. Section VI contains conclusions and provides an outlook on future work.


Fig. 1. Snapshots of selected sequences for the audiovisual test: Video clip (left), Soccer (middle), Video call (right).

II. AUDIOVISUAL QUALITY ASSESSMENT

A. Test Methodology

The proposed test methodology is based on ITU-T P.911 [14] and adapted to our specific purpose and limitations. For this particular application it was considered that the most suitable experimental method, among those proposed in the ITU-T Recommendation, is ACR, also called the Single Stimulus Method. The ACR method is a category judgement in which the test sequences are presented one at a time and are rated independently on a category scale. Only degraded sequences are displayed, and they are presented in arbitrary order. This method imitates the real world scenario, because the customers of mobile video services do not have access to the original videos (high quality versions). On the other hand, ACR introduces a higher variance in the results compared to other methods in which the original sequence is also presented and serves as a reference for the test subjects [11].

After each presentation the test subjects were asked to evaluate the overall quality of the sequence shown. In order to measure the quality perceived, a subjective scaling method is required. However, whatever the rating method, this measurement will only be meaningful if there actually exists a relation between the characteristics of the video sequence presented and the magnitude and nature of the sensation that it causes in the subject. The existence of this relation is assumed. Test subjects evaluated the video quality after each sequence in a prepared form using a five-grade MOS scale: "5–Excellent", "4–Good", "3–Fair", "2–Poor", "1–Bad". Higher discriminative power was not required, because our test subjects were used to five-grade scales (from school). According to our previous experience a five-grade MOS scale offers a better trade-off between the evaluation interval and the reliability of the results [2], [12]. Previous works [8], [12] in this field show that test subjects hesitate to use the entire range of 9- or 11-grade ACR scales.

For emulating the real world conditions of the UMTS video service, all audio and video sequences were played on the UE (Vodafone VPA IV). In this singular point the proposed methodology for audiovisual quality testing is not compliant with ITU-T P.911 [14]. Furthermore, since one of our intentions is to study the relation between audio quality and video quality, we decided to conduct all tests with a standard stereo headset. During the training session of three sequences the subjects were allowed to adjust the volume level of the headset to a comfortable level. The viewing distance from the phone was not fixed and was selected by the test person, but we noticed that all subjects were comfortable holding the cell-phone at a distance of 20–30 cm.

Resolution  Audio Codec  Video BR [kbps]  Video FR [fps]  Audio BR [kbps]  Audio SR [kHz]
QVGA        AAC          190.28           12.5             16               16
VGA         AAC          231.40           15               32               16
QVGA        AAC          173.80           12.5             32               16
VGA         AAC          229.80           12.5             32               16
QVGA        AAC           75.87           12.5             16               16

TABLE I
ENCODING SETTINGS OF VIDEO CLIP TRAINING SEQUENCE.

Resolution  Audio Codec  Video BR [kbps]  Video FR [fps]  Audio BR [kbps]  Audio SR [kHz]
QVGA        AAC          186.90           12.5             16               16
VGA         AAC          295.36           15               32               16
QVGA        AAC          171.16           12.5             32               16
VGA         AAC          269.96           12.5             32               16
QVGA        AAC          103.10           12.5             16               16

TABLE II
ENCODING SETTINGS OF VIDEO CLIP EVALUATION SEQUENCE.

B. Encoder Settings

All video sequences were encoded using typical settings for the UMTS environment. Due to limitations of mobile radio resources, bit rates were selected in the range of 59–320 kbps. Only comprehensible audio files were allowed in the set. The test sequences were encoded with the H.264/AVC baseline profile 1b codec. The audio was encoded with the AAC or AMR codec. The encoding parameters were selected according to our former experience described in [2] and [9]. In total, 12 encoding combinations were tested for the training set (see Tables I, III, V) and 13 encoding combinations for the evaluation set (see Tables II, IV, VI). Two sets, an evaluation and a training set, were defined. Both sets consisted of different sequences. In each of the sets we used, for each content type, the same subset of video sequences but slightly differently encoded (see Tables I and II for the encoding settings of the video clip and Tables III and IV for soccer). To evaluate the subjective perceptual audiovisual quality, a group of 15 people for the training set and a group of 16 people for the evaluation set was chosen. The chosen group covered different ages (between 22 and 30), genders, education and experience. The sequences were presented in a random order, with the additional condition that the same sequence (even differently degraded) did not appear in succession. Two rounds of each test were taken. The duration of each test round was about 20 minutes. The single evaluation MOS values with a variance higher than one between round one and round two were excluded. In total, 6% of the single evaluation MOS values were rejected, maintaining the mean of the set but significantly decreasing the variance.
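One plausible reading of this screening rule, assuming it is applied per subject and per sequence to the pair of ratings from the two rounds (the exact computation is not spelled out in the text), is sketched below; array shapes and names are our own.

```python
import numpy as np

def screen_mos(round1, round2, var_threshold=1.0):
    """Drop single evaluations whose two rounds vary too much.

    round1, round2: arrays of shape (n_subjects, n_sequences) with MOS ratings 1-5.
    A rating pair is kept if the sample variance of its two values is at most
    var_threshold; kept pairs are averaged, discarded ones become NaN.
    """
    r = np.stack([round1, round2]).astype(float)
    keep = r.var(axis=0, ddof=1) <= var_threshold      # variance of the two rounds
    rating = np.where(keep, r.mean(axis=0), np.nan)
    mos = np.nanmean(rating, axis=0)                   # per-sequence MOS over remaining subjects
    return mos, 100.0 * (~keep).mean()                 # MOS vector and rejection rate in %
```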

Resolution  Audio Codec  Video BR [kbps]  Video FR [fps]  Audio BR [kbps]  Audio SR [kHz]
QVGA        AAC          199.14           15               16               16
QVGA        AAC           92.30           15               16               16
QVGA        AAC          181.46           12.5             32               16
QVGA        AAC          196.98           12.5             16               16
QVGA        AAC          182.98           15               32               16

TABLE III
ENCODING SETTINGS OF SOCCER TRAINING SEQUENCE.


Resolution  Audio Codec  Video BR [kbps]  Video FR [fps]  Audio BR [kbps]  Audio SR [kHz]
QVGA        AAC          198.72           15               16               16
QVGA        AAC           93.07           15               16               16
QVGA        AAC          180.90           12.5             32               16
VGA         AAC          319.32           15               16               16
QVGA        AAC          195.69           12.5             16               16
QVGA        AAC          183.32           15               32               16
VGA         AAC          292.12           12.5             16               16

TABLE IV
ENCODING SETTINGS OF SOCCER EVALUATION SEQUENCE.

For the audiovisual quality tests three different content types (Video clip, Soccer and Video call) with different perception of the video and audio media were selected. The video snapshots are depicted in Figure 1. The first two sequences, Video clip and Soccer, contain a lot of local and global movement. The main difference between them is in their audio part. In Soccer the commentator's voice as well as loud support of the audience is present, where the commentator's voice is rather important. Especially important are small moving objects: players and ball. Figures 2 and 3 show the results from the subjective tests. In Video clip instrumental music with voice is present in the foreground. Figures 4 and 5 show the results from the subjective test for the video clip files. In Video call a human voice is the most dominant. Finally, Figure 6 shows the results from the subjective tests for the video call files.

Fig. 2. Measured MOS results for Soccer video training sequences.

Fig. 3. Measured MOS results for Soccer video evaluation sequences.

Furthermore, the obtained results for Video call and Soccer show that a higher resolution has little or no impact on the audiovisual quality. This was influenced by the granularity of the LCD on the test PDA.

Resolution  Audio Codec  Video BR [kbps]  Video FR [fps]  Audio BR [kbps]  Audio SR [kHz]
QVGA        AAC          202.84           12.5             16               16
QVGA        AMR           59.25            7                5                8

TABLE V
ENCODING SETTINGS OF VIDEO CALL TRAINING SEQUENCE.

Resolution  Audio Codec  Video BR [kbps]  Video FR [fps]  Audio BR [kbps]  Audio SR [kHz]
QVGA        AAC          191.86           12.5             16               16

TABLE VI
ENCODING SETTINGS OF VIDEO CALL EVALUATION SEQUENCE.

Fig. 4. Measured MOS results for Video clip training sequences.

Fig. 5. Measured MOS results for Video clip evaluation sequences.

C. Prior Art

In former work [2], [9] we investigated audiovisual quality on different content classes, codecs and encoding settings1. The obtained subjective video quality results clearly show the presence of the mutual compensation effect.

1 It is worth pointing out that the audio and video sets used in the former work are different from the ones used in the present work.


Fig. 6. Measured MOS results for Video call sequences.

Fig. 7. MOS results for the Video call content - codecs combination H.263/AMR.

Figures 7, 8 and 9 (the color code serves only for better visualization of the results) show the results of the audiovisual quality assessment based on H.263 encoding. In Video call the audiovisual quality is more influenced by the audio quality than by the video quality (see Figure 7). This problem is described in more detail in [2].

Further investigation within this work shows that it is beneficial to propose one audiovisual model with content-sensitive parameters for various video contents, depending on the presence (Video call) or absence of a dominant human voice (Video clip and Cinema trailer). Therefore, within the new work presented in this contribution an additional parameter was introduced for detecting speech and non-speech audio content (cf. Section III-B).

III. FEATURE EXTRACTION

The proposed method is focused on reference-free audiovisual quality estimation. The character of the sequence is determined by content dependent audio and video features between two scene changes. Therefore, the investigation of the audio and video stream was focused on sequence motion features as well as on audio content and quality. The video content significantly influences the subjective video quality [2], [15] and the sequence motion features reflect the video content very well.

Fig. 8. MOS results for the Video clip - codecs combination H.263/AAC.

Fig. 9. MOS results for the Cinema trailer - codecs combination H.263/AAC.

The well-known ITU-T standard P.563 [16] was used for audio quality estimation. ITU-T P.563 is a standard for speech quality evaluation, but in our setup it is also used for the evaluation of the audio quality, since at the time of writing this paper no other standard for reference-free audio quality evaluation was known. Furthermore, a speech/non-speech detector was introduced to account for the different influence of the mutual compensation effect between audio and video in speech and non-speech content. Finally, temporal segmentation was also used as a prerequisite in the process of video quality estimation. For this purpose a scene change detector was designed with an adaptive threshold based on the video dynamics. The scene change detector design is described in detail in [2].

A. Video feature extraction

The focus of our investigation is on the motion features of the video sequences. The motion features can be used directly as an input into the estimation formulas or models. Both possibilities were investigated in [17], [18] and [2], respectively.


The investigated motion features concentrate on the motion vector (MV) statistics, including the size distribution and the directional features of the MVs within one sequence of frames between two cuts. Zero MVs allow for estimating the size of the still regions in the video pictures. That, in turn, allows analyzing MV features for the regions with movement separately, which makes it possible to distinguish between rapid local movements and global movement. Moreover, the perceptual quality reduction in the spatial and temporal domain is very sensitive to the chosen motion features, making these very suitable for reference-free quality estimation, because a higher compression does not necessarily reduce the subjective video quality (e.g. in static sequences). The selection of MV features is based on multivariate statistical analysis and the details can be found in [2]. The following MV and BR features represent the motion characteristics (a code sketch of their computation is given after the list):

• Zero MV ratio within one shot, Z: The percentage of zero MVs is the proportion of the frame that does not change at all (or changes only very slightly) between two consecutive frames, averaged over all frames in the shot. This feature detects the proportion of a still region. A high proportion of still region refers to a very static sequence with small but significant local movement; the viewer's attention is focused mainly on this small moving region. A low proportion of still region indicates uniform global movement and/or a lot of local movement.

• Mean MV size within one shot, N: This is the mean size of the non-zero MVs, expressed as a percentage of the screen width. This parameter determines the intensity of a movement within a moving region. Low intensity indicates a static sequence. High intensity within a large moving region indicates a rapidly changing scene.

• Ratio of MV deviation within one shot, S: This ratio is the percentage of the standard MV deviation to the mean MV size within one shot. A high deviation indicates a lot of local movement and a low deviation indicates a global movement.

• Uniformity of movement within one shot, U: The percentage of MVs pointing in the dominant direction (the most frequent direction of MVs) within one shot. For this purpose the resolution of the direction is 10°. This feature expresses the proportion of uniform and local movement within one sequence.

• Average BR: This parameter refers to the pure video payload. The BR is calculated as an average over the whole stream. Furthermore, the parameter BR reflects the compression gain in the spatial and temporal domain. Moreover, the encoder performance depends on the motion characteristics. A BR reduction causes a loss of spatial and temporal information, which is usually annoying for viewers.
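As a rough illustration of how these five parameters could be obtained from the decoder's motion vectors, the following sketch computes Z, N, S, U and the average BR for one shot. It is only a minimal reconstruction based on the definitions above; the actual feature extraction of [2] (MV parsing, still-region handling, shot boundaries) is not reproduced here, and all function and variable names are our own.

```python
import numpy as np

def motion_features(mvs_per_frame, frame_width, frame_bits, fps):
    """Motion features of one shot, following the definitions above.

    mvs_per_frame : list of (K_i, 2) arrays with the MVs (in pixels) of each inter frame
    frame_width   : horizontal resolution in pixels (used to normalize mean MV size)
    frame_bits    : list/array with the encoded size of every frame in bits
    fps           : frame rate, used to turn bits per frame into an average bit rate
    """
    zero_ratios, sizes = [], []
    dominant_hits, nonzero_total = 0, 0
    for mv in mvs_per_frame:
        norms = np.linalg.norm(mv, axis=1)
        zero_ratios.append(np.mean(norms == 0))            # share of still macroblocks
        moving = norms[norms > 0]
        sizes.extend(moving)
        if moving.size:                                     # directions only for moving MVs
            angles = np.degrees(np.arctan2(mv[norms > 0, 1], mv[norms > 0, 0])) % 360
            bins = np.floor(angles / 10).astype(int)        # 10-degree direction resolution
            dominant_hits += np.bincount(bins, minlength=36).max()
            nonzero_total += moving.size

    sizes = np.asarray(sizes) if sizes else np.zeros(1)
    Z = 100 * np.mean(zero_ratios)                          # zero MV ratio [%]
    N = 100 * sizes.mean() / frame_width                    # mean MV size, % of screen width
    S = 100 * sizes.std() / sizes.mean() if sizes.mean() else 0.0       # MV deviation ratio [%]
    U = 100 * dominant_hits / nonzero_total if nonzero_total else 0.0   # uniformity [%]
    BR = np.mean(frame_bits) * fps / 1000.0                 # average video bit rate [kbps]
    return Z, N, S, U, BR
```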

B. Audio feature extraction

Many reliable estimators for audio quality were proposed recently, some of which became standards [19], [20] and [16]. For our purpose the reference-free estimation method called "Single ended method for objective speech quality assessment in narrow-band telephony applications" [16] turned out to be very suitable. The 3SQM [16] performs audio quality estimation in two stages: the first stage includes intermediate reference system filtering, signal normalization and voice activity detection. In the second stage of operation, twelve parameters based on the processed input signal are calculated. These parameters take into account speech level, noise, delay, repeated frames, disruptions in the pitch period and artificial components in the speech signal (beeps, clicks). These twelve parameters are then linearly combined to form the final audio quality prediction (on the MOS scale).

As previous work has shown [7], mutual compensation effects cause differences in the perception of video content with a dominant voice in the audio track rather than in video with other types of audio [9]. Video contents with a dominant voice include news, interviews, talk shows, and so on. Finally, audio-visual quality estimation models tuned for video content with a dominant human voice perform better than universal models [9]. Therefore, our further investigation was focused on the design of speech detection algorithms suitable for the mobile environment.

Due to the low complexity requirement of the algorithm, our investigation was initially focused on time-domain methods. For this purpose a pair of audio parameters, kurtosis (κx) [21] and the High Zero Crossing Rate Ratio (HZCRR) [22], extracted from the audio signal turned out to be suitable. The kurtosis of a zero-mean random process x(n) is defined as the dimensionless, scale invariant quantity2

\kappa_x = \frac{\frac{1}{N}\sum_{n=1}^{N}\left(x(n)-\bar{x}\right)^4}{\left(\frac{1}{N}\sum_{n=1}^{N}\left(x(n)-\bar{x}\right)^2\right)^2},   (1)

where in our case x(n) represents the n-th sample of an audio signal. A higher κx value is related to a more peaked distribution of samples, as is found in speech signals and depicted in Figure 10, whereas a lower value implies a flatter distribution, as is found in other types of audio signals. Therefore, kurtosis was selected as a basis for the detection of speech. However, accurate detection of speech in short-time frames is not always possible by kurtosis alone.

The second objective parameter under consideration is the HZCRR, defined as the ratio of the number of frames whose Zero Crossing Rate (ZCR) is greater than 1.5 times the average ZCR in the audio file [22]:

\mathrm{HZCRR}_M = \frac{1}{2N}\sum_{n=0}^{N-1}\left[\mathrm{sgn}\left(ZCR(n,M)-1.5\,\overline{ZCR}\right)+1\right],   (2)

2 Note that some texts define kurtosis as \kappa_x = \frac{\frac{1}{N}\sum_{n=1}^{N}(x(n)-\bar{x})^4}{\left(\frac{1}{N}\sum_{n=1}^{N}(x(n)-\bar{x})^2\right)^2} - 3. We shall however follow the definition in [21].



Fig. 10. Cumulative density function of kurtosis κx for speech and non-speech audio samples.

where ZCR(n,M) is the zero crossing rate of the n-th, length-M frame (equation given below), N is the total number of frames, and \overline{ZCR} is the average ZCR over the audio file. The ZCR is given by

ZCR(n,M) = \frac{1}{M}\sum_{m=0}^{M-1} \mathbf{1}_{<0}\left[x(nM+m)\,x(nM+m+1)\right],   (3)

where m denotes the sample index within the frame and the indicator function is defined as

\mathbf{1}_{<0}[d] = \begin{cases} 1, & d < 0 \\ 0, & d \ge 0. \end{cases}
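For concreteness, a minimal NumPy sketch of the two time-domain parameters is given below. It follows Eqs. (1)–(3) directly; the frame indexing uses the 10 ms / 50% overlap framing described in the following paragraph rather than the non-overlapping indices of Eq. (3), and the function names are our own.

```python
import numpy as np

def kurtosis(x):
    """Kurtosis as in Eq. (1): fourth central moment over the squared variance (no -3 offset)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    return np.mean(xc ** 4) / (np.mean(xc ** 2) ** 2)

def hzcrr(x, sample_rate, frame_ms=10, overlap=0.5):
    """High Zero Crossing Rate Ratio over short frames, Eqs. (2)-(3)."""
    x = np.asarray(x, dtype=float)
    M = int(sample_rate * frame_ms / 1000)            # frame length in samples (160 @ 16 kHz)
    hop = int(M * (1 - overlap))
    starts = range(0, len(x) - M, hop)
    # Eq. (3): fraction of sign changes between consecutive samples in each frame
    zcr = np.array([np.mean(x[s:s + M - 1] * x[s + 1:s + M] < 0) for s in starts])
    # Eq. (2): share of frames whose ZCR exceeds 1.5 times the file average
    return 0.5 * np.mean(np.sign(zcr - 1.5 * zcr.mean()) + 1)
```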

According to our further experience we use a frame length of 10 ms and the framing windows are overlapped by 50%. The 10 ms frame length3 contains a sufficient set of audio samples for further statistical processing. Moreover, a longer framing window would increase the calculation complexity and the length of the investigated audio sequence necessary for speech detection. Finally, we propose a two-stage voice detection algorithm (see Figure 11). During the first stage, non-speech audio frames are detected by a simple decision based on whether the kurtosis is less than a threshold (c0) of 4.96 (see Figure 10). The first stage is capable of recognizing 62.3% of the non-speech frames from our corpus with a 97% accuracy rate. The threshold (c0) was set in order to get the best trade-off between calculation complexity and estimation accuracy. For the second stage, we derive a more general decision rule based on a hypothesis test (log-likelihood ratio, LLR), and we use both the kurtosis and the HZCRRM of the frame as elements in a feature vector

F = \begin{bmatrix} \kappa_x \\ \mathrm{HZCRR} \end{bmatrix}.

For speech signals, we denote the mean vector of the speech feature vectors as µs and the covariance matrix as Σs; for non-speech feature vectors, we denote the mean vector as µm and the covariance matrix as Σm. Furthermore, the LLR test is performed on the first 20 frames only, in order to reduce the computational complexity. The log-likelihood ratio is calculated as follows:

3 E.g., for SR = 16 kHz the framing window contains M = 160 samples.

\Delta = \frac{\sum_{i=1}^{20} \log\left\{ \frac{1}{\sqrt{(2\pi)^2 \lVert\Sigma_s\rVert}} \exp(g) \right\}}{\sum_{i=1}^{20} \log\left\{ \frac{1}{\sqrt{(2\pi)^2 \lVert\Sigma_m\rVert}} \exp(h) \right\}},   (4)

where g and h are denoted as follows:

g = -\frac{1}{2}\left(F_i - \mu_s\right)\Sigma_s^{-1}\left(F_i - \mu_s\right)^T,   (5)

h = -\frac{1}{2}\left(F_i - \mu_m\right)\Sigma_m^{-1}\left(F_i - \mu_m\right)^T.   (6)

If the LLR is greater than the decision threshold c = c1 = 2.2 (see Figure 11), we declare it as a non-speech frame, otherwise we declare it as a speech frame.
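Putting the two stages together, a compact sketch of the decision logic could look as follows. The thresholds c0 = 4.96 and c1 = 2.2 are taken from the text; the Gaussian class statistics (mu_s, Sigma_s, mu_m, Sigma_m) would have to be estimated from a labelled training corpus, the per-frame features come from the kurtosis/HZCRR sketch above, and the majority vote in stage 1 is our own simplification of the per-frame decision described in the paper.

```python
import numpy as np
from numpy.linalg import det, inv

def log_gauss(F, mu, Sigma):
    """Log of a 2-D Gaussian density; its exponent matches g/h in Eqs. (5)-(6)."""
    d = np.asarray(F, float) - mu
    return np.log(1.0 / np.sqrt((2 * np.pi) ** 2 * det(Sigma))) - 0.5 * d @ inv(Sigma) @ d

def classify_segment(frames, mu_s, Sigma_s, mu_m, Sigma_m, c0=4.96, c1=2.2):
    """Two-stage speech/non-speech decision for one audio segment.

    frames: list of per-frame feature vectors F_i = [kurtosis_i, HZCRR_i];
    only the first 20 frames are used, as in Eq. (4).
    """
    first = frames[:20]
    # Stage 1: frames with kurtosis below c0 are flagged as non-speech outright
    if np.mean([F[0] < c0 for F in first]) > 0.5:       # majority vote over the segment
        return "non-speech"
    # Stage 2: ratio of summed log-likelihoods under the speech / non-speech models
    llr = (sum(log_gauss(F, mu_s, Sigma_s) for F in first)
           / sum(log_gauss(F, mu_m, Sigma_m) for F in first))
    return "non-speech" if llr > c1 else "speech"       # LLR above c1 means non-speech
```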

Fig. 11. Two-stage speech detector.

The training and evaluation of our speech detector were performed on a large audio corpus. Our corpus consists of 3 032 speech and non-speech audio files. The speech part of the corpus is in the German language and consists of ten speakers. The non-speech part of the corpus consists mainly of music files of various genres (for example rock, pop, hip-hop, live music). All audio files were encoded using typical settings for the UMTS environment. Each audio file was encoded using three codec types at different sampling rates: AAC and AMR-WB at 16 kHz and AMR-NB at 8 kHz. Due to limitations of mobile radio resources, bit rates were selected in the range of 8–32 kbps. The overall accuracy of the proposed method exceeds 92% for speech and non-speech content, averaged over all codecs.

For the further audiovisual quality estimation two audio parameters are used: the audio metric (MOSa) as defined by ITU-T P.563 and the proposed speech/non-speech detector (a binary decision on speech or non-speech content, CCa = {0, 1}).

IV. AUDIOVISUAL QUALITY ESTIMATION

The proposed model works in several stages as follows. From the video signal the five parameters described in the above section are computed. The audio stream from the audiovisual content is the input for the 3SQM quality metric, at whose output we get the MOS value of the audio.


Fig. 12. Block scheme of the proposed model.

The audio stream is also fed to the speech/non-speech detector, at whose output we get either a value of zero for the non-speech content class or a value of one for the speech content class. All these seven parameters (five for video and two for audio) form the feature vector of non-reference objective values, and they are the input for the ensemble based model, at whose output we get the final MOS value of the audiovisual content. The block scheme of the proposed model is shown in Figure 12.

According to our former experience with metric design [2], [18], we propose an ensemble based estimation for the investigated scenario. Ensemble based estimators average the outputs of several estimators in order to reduce the risk of an unfortunate selection of a poorly performing estimator. The very first idea to use more than one classifier for estimation comes from the neural network community [23]. In the last decade, research in this field has expanded regarding the strategies [24] for generating individual classifiers and/or the strategy employed for combining the classifiers. The aim is to train a defined ensemble of models with a feature vector based on audio and video content sensitive objective parameters:

X = \begin{bmatrix} Z \\ N \\ S \\ U \\ BR \\ \mathrm{MOS}_a \\ CC_a \end{bmatrix}.

The ensemble consists of different model classes to improve the performance in regression problems. The theoretical background [25] of this approach is that an ensemble of heterogeneous models usually leads to a reduction of the ensemble variance because the cross terms in the variance contribution have a higher ambiguity. A data set with input values of the feature vector X and output value (MOS) y with a functional relationship is considered, where e is an estimation error:

y = f(X) + e. (7)

The weighted average \bar{f}(X) of the ensemble of models is defined as follows:

\bar{f}(X) = \sum_{k=1}^{K} w_k f_k(X),   (8)

where f_k(X) denotes the k-th individual model and the positive weights w_k sum to one (\sum_k w_k = 1). The generalization (squared) error q(X) of the ensemble is given by:

q(X) = \left(y(X) - \bar{f}(X)\right)^2.   (9)

According to [25], the error can be decomposed as follows:

q(X) = \bar{q}(X) - \bar{a}(X).   (10)

This decomposition allows us to neglect the mixed terms, where the average error \bar{q}(X) of the individual models is:

\bar{q}(X) = \sum_{k=1}^{K} w_k \left(y(X) - f_k(X)\right)^2,   (11)

and the average ambiguity \bar{a}(X) of the ensemble is:

\bar{a}(X) = \sum_{k=1}^{K} w_k \left(f_k(X) - \bar{f}(X)\right)^2.   (12)

• A consequence of (10) is that the ensemble generalization error q(X) is never larger than the average error of the individual models \bar{q}(X) [25]: q(X) \le \bar{q}(X).

• The previous Equations (7)–(12) require that an ensemble should consist of well trained but diverse models in order to increase the ensemble ambiguity (a small numerical check of the decomposition follows).
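To make the decomposition in (10)–(12) concrete, the short sketch below checks it numerically for a toy ensemble; the numbers and model outputs are invented purely for illustration.

```python
import numpy as np

# Toy check of the ambiguity decomposition q = q_bar - a_bar, Eqs. (8)-(12).
y = 3.4                                   # "true" MOS of one sequence (invented)
f_k = np.array([3.0, 3.6, 3.9, 3.2])      # outputs of K = 4 individual estimators (invented)
w_k = np.full(4, 0.25)                    # equal positive weights summing to one

f_bar = np.sum(w_k * f_k)                 # Eq. (8): weighted ensemble prediction
q = (y - f_bar) ** 2                      # Eq. (9): ensemble squared error
q_bar = np.sum(w_k * (y - f_k) ** 2)      # Eq. (11): average individual error
a_bar = np.sum(w_k * (f_k - f_bar) ** 2)  # Eq. (12): average ensemble ambiguity

print(q, q_bar - a_bar)                   # identical, per Eq. (10)
assert np.isclose(q, q_bar - a_bar) and q <= q_bar
```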

This prerequisite was applied to an ensemble of universal models. In order to estimate the generalization error and to select models for the final ensemble, a cross-validation scheme for model training [26] was used. These algorithms increase the ambiguity and thus improve the generalization of a trained model. Furthermore, an unbiased estimator of the ensemble generalization error was obtained. The cross-validation works as follows (a sketch of this selection loop is given after the list):

• The data set is divided into two subsets and the models are trained on the first set.

• The models are evaluated on the second set; the model with the best performance becomes an ensemble member.

• The data set is divided, with small overlap with the previous subsets, into two new subsets and the models are trained on the first set.

• The cross-validation continues until the ensemble has the desired size. The best trade-off between ensemble complexity and performance was achieved for an ensemble of six estimators.
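A minimal sketch of this member-selection loop is given below. The split generation (how much the subsets overlap, which candidate models are tried) is not fully specified in the text, so the values here are placeholders; scikit-learn's KNeighborsRegressor and MLPRegressor (trained with a gradient method, not IRPROP+) stand in for the kNN and ANN models discussed later.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

def build_ensemble(X, y, n_members=6, rounds=50, seed=0):
    """Pick ensemble members via repeated train/validation splits (see the list above).

    Each round trains every candidate on one half of the data and keeps the
    best-correlating one; candidates below 50% validation correlation are
    rejected, as described in the text.
    """
    rng = np.random.default_rng(seed)
    ensemble, n = [], len(y)
    for _ in range(rounds):
        if len(ensemble) >= n_members:
            break
        idx = rng.permutation(n)                        # fresh, partly overlapping split
        train, val = idx[: n // 2], idx[n // 2:]
        candidates = [KNeighborsRegressor(n_neighbors=k) for k in (3, 5, 7)] + \
                     [MLPRegressor(hidden_layer_sizes=(90,), max_iter=2000,
                                   random_state=r) for r in (1, 2, 3)]
        best, best_r = None, 0.5                        # 50% correlation floor
        for model in candidates:
            model.fit(X[train], y[train])
            r = np.corrcoef(model.predict(X[val]), y[val])[0, 1]
            if r > best_r:
                best, best_r = model, r
        if best is not None:
            ensemble.append(best)
    return ensemble
```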

The final step in the design of an ensemble based system is to find a suitable combination of models. Due to outliers and overlaps in the data distribution of the data set, it is impossible to propose a single estimator with perfect generalization performance. Therefore, an ensemble of many classifiers was designed and their outputs were combined such that the combination improves upon the performance of a single classifier.


Moreover, classifiers with significantly different decision boundaries from the rest of the ensemble set were chosen. This property of an ensemble set is called diversity. The above mentioned cross-validation introduces model diversity, since training on slightly different data sets leads to different estimators (classifiers). Additionally, diversity was increased by using two independent model types: the k-nearest neighbor rule and an artificial neural network. Furthermore, during cross-validation, classifiers with a correlation worse than 50% on the second set were automatically excluded.

1) As the first estimation model, we chose a simple nonparametric method, the k-Nearest Neighbor rule (kNN) with an adaptive metric [26]. This method is very flexible and does not require any preprocessing of the training data. The kNN decision rule assigns to an unclassified sample point the classification of the nearest sample point of a set of previously classified points. Moreover, a locally adaptive form of the k-nearest neighbor rule was used for classification. The value of k is selected by cross-validation.

2) As the second model an Artificial Neural Network (ANN) was used. A network with three layers was proposed: an input layer, one hidden layer and an output layer, using five objective parameters as input and the estimated MOS as output. Each ANN has 90 neurons in the hidden layer. As the learning method, Improved Resilient Propagation (IRPROP+) with back propagation [27] was used. IRPROP+ is a fast and accurate learning method for solving estimation tasks on the data set.

The ensemble consists of six estimators, three based on kNN and three based on ANN. The three different estimators of each type were trained on the different subsets defined for training and evaluation, as explained before. Each of the six estimators results in a MOS value:

\mathrm{MOS}_i = f_i(X), \quad i = 1, \dots, 6.   (13)

Finally, the estimated MOS is the average over the individual MOSi values obtained from the proposed ensemble of estimators:

\mathrm{MOS} = \frac{1}{6}\sum_{i=1}^{6} \mathrm{MOS}_i.   (14)
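Continuing the earlier sketches, prediction then reduces to Eqs. (13)–(14): each selected estimator maps the seven-element feature vector X to a MOS value and the results are averaged. This only illustrates the structure; the trained members and feature extraction come from the sketches above, and the clipping to the 1–5 scale is our own safeguard rather than part of the paper.

```python
import numpy as np

def predict_mos(ensemble, x):
    """Eqs. (13)-(14): average the MOS predictions of all ensemble members.

    x is one feature vector [Z, N, S, U, BR, MOS_a, CC_a]; each member was
    trained on such vectors (see the selection sketch above).
    """
    x = np.asarray(x, dtype=float).reshape(1, -1)
    mos_i = np.array([m.predict(x)[0] for m in ensemble])   # MOS_i, i = 1..6
    return float(np.clip(mos_i.mean(), 1.0, 5.0))           # final MOS (clipping is our addition)
```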

V. PERFORMANCE EVALUATION

To validate the performance of the proposed ensemble based estimator, the Pearson (linear) correlation factor [28] was applied:

r = \frac{(\mathbf{x} - \bar{x})^T(\mathbf{y} - \bar{y})}{\sqrt{\left((\mathbf{x} - \bar{x})^T(\mathbf{x} - \bar{x})\right)\left((\mathbf{y} - \bar{y})^T(\mathbf{y} - \bar{y})\right)}}.   (15)

Here, the vector x corresponds to the MOS values, averaged over two runs, of all obtained subjective evaluations for a particular test sequence and one encoding setting, and x̄ is the average over x. The vector y corresponds to the prediction made by the proposed metric and ȳ is the average over y. The dimension of x and y corresponds to the number of tested sequences. The obtained Pearson correlation factor of 91% reflects the excellent fit (see Figure 13) with the independent evaluation set for all content types together.
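A direct NumPy transcription of Eq. (15) is shown below, only to illustrate how the reported correlation would be computed from the subjective and estimated MOS vectors (the values used here are invented).

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of Eq. (15) between subjective (x) and estimated (y) MOS vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return float(xd @ yd / np.sqrt((xd @ xd) * (yd @ yd)))

# e.g. subjective vs. estimated MOS for a handful of sequences (invented values)
print(pearson_r([4.1, 3.2, 2.5, 3.8, 1.9], [4.0, 3.4, 2.2, 3.9, 2.1]))
```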


Fig. 13. Estimated MOS over subjective MOS results.

Furthermore, it was necessary to provide an objective comparison with state-of-the-art estimation methods. For this purpose the audiovisual quality models developed in [10] were selected. The models relate the individual audio and video qualities (MOSa and MOSv) to the subjective audiovisual quality:

\mathrm{MOS} = a + b \cdot \mathrm{MOS}_a + c \cdot \mathrm{MOS}_v,   (16)

\mathrm{MOS} = a + d \cdot \mathrm{MOS}_a \times \mathrm{MOS}_v,   (17)

\mathrm{MOS} = a + b \cdot \mathrm{MOS}_a + c \cdot \mathrm{MOS}_v + d \cdot \mathrm{MOS}_a \times \mathrm{MOS}_v.   (18)

Several different forms of equations were analyzed, including cross products, cf. (17) and (18).
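For reference, fitting the coefficients of models (16)–(18) is an ordinary least-squares problem; the sketch below shows one way to do it with NumPy, using hypothetical per-sequence MOS_a, MOS_v and subjective MOS arrays.

```python
import numpy as np

def fit_av_models(mos_a, mos_v, mos_subjective):
    """Least-squares coefficients for the audiovisual models (16)-(18)."""
    mos_a = np.asarray(mos_a, float)
    mos_v = np.asarray(mos_v, float)
    y = np.asarray(mos_subjective, float)
    ones = np.ones_like(mos_a)
    designs = {
        "(16)": np.column_stack([ones, mos_a, mos_v]),                  # a, b, c
        "(17)": np.column_stack([ones, mos_a * mos_v]),                 # a, d
        "(18)": np.column_stack([ones, mos_a, mos_v, mos_a * mos_v]),   # a, b, c, d
    }
    return {name: np.linalg.lstsq(A, y, rcond=None)[0] for name, A in designs.items()}
```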

The MOSa was based on the Auditory Distance (AD) algorithm [29]. The subjective audio quality model was developed in [10]. AD is a measure of the difference between the original and the degraded audio signal; thus, a larger AD indicates a poorer quality of the degraded audio signal. AD is linearly fitted to the subjective audio quality:

\mathrm{MOS}_a = 4.388 - 0.638\,\mathrm{AD}.   (19)

The MOSv was obtained with the well-known ANSI T1.801.03 metric [30]. ANSI T1.801.03 is based on quality parameters that measure the perceptual effects of a wide range of impairments such as blurring, block distortion, unnatural motion, noise and error blocks. Each quality parameter is calculated through a quality feature, defined as a quantity of information associated with a spatial-temporal sub-region of a video stream. ANSI T1.801.03 is a full reference metric.

Estimator                   a         b         c         d        r
(16)                      -21.036     5.499     0.038     —        0.15
(17)                        2.630     —         —         0.029    0.10
(18)                      213.251   -48.026   -86.865    19.854    0.03
Ensemble based estimator    —         —         —         —        0.91

TABLE VII
COEFFICIENTS AND CORRELATIONS OF AUDIOVISUAL QUALITY ESTIMATORS.


In order to provide a fair comparison between our ensemble based estimator and the audiovisual quality models (16), (17) and (18), the latter were trained on our training set. The model coefficients were obtained by linear regression (see Table VII). The comparison was performed on the evaluation set.

Unfortunately, the audiovisual quality models (16), (17) and (18) have extremely poor correlations on our evaluation set (see Table VII). The ensemble based estimator significantly outperforms the simple quality models. This is mainly influenced by the following factors:

• The ANSI T1.801.03 and (19) estimators do not consider the audio and video content, which significantly influences the subjective quality.

• The ANSI T1.801.03 and (19) estimators were not originally designed for the mobile environment.

• It is possible that the poor performance of the model by Jones & Atkinson [10] is due to the poor (separate) predictions of MOSa and MOSv used in equations (16)-(18), rather than to the audiovisual model itself.

Moreover, the results depicted in Table VII show that it is essential to consider the audiovisual content and scenario in the estimator design.

VI. CONCLUSION

The recent development in handheld devices and video encoding brought a significant improvement in the processing power of handheld devices, thus allowing increasing screen resolutions of these devices. Multimedia applications are therefore of more and more interest and make up a large part of the transmitted data in mobile communications. As service providers are requested to guarantee high quality but at the same time desire to limit the resources necessary for delivery, it is important to optimize the perceived quality of multimedia services. Here, mutual compensation effects in audiovisual transmissions offer interesting trade-offs. Although in poor reception situations one part of the transmitted stream may not be updated, or only at a very low rate, the information is carried mostly by the other multimedia mode (for example audio) and thus the perceived quality is moderate to high although the required data rate is relatively low. Our contribution shows that clever estimation techniques extracting side information such as the speech/non-speech character of the audio stream can result in excellent quality estimation. Such a quality metric, on the other hand, is the basis for optimizing transmission methods for specific contents.

ACKNOWLEDGMENT

The authors would like to thank mobilkom austria AG for supporting their research. The views expressed in this paper are those of the authors and do not necessarily reflect the views within mobilkom austria AG.

REFERENCES

[1] 3GPP TS 26.234 V6.13.0, "Transparent end-to-end Packet-switched Streaming Service (PSS); Protocols and codecs," Mar. 2008.

[2] M. Ries, "Video Quality Estimation for Mobile Video Streaming," Doctoral thesis, INTHFT, Vienna University of Technology, Vienna, Austria, Oct. 2008, available at http://publik.tuwien.ac.at/files/PubDat170043.pdf.

[3] D. S. Hands, "A Basic Multimedia Quality Model," IEEE Trans. Multimedia, vol. 6, no. 6, pp. 806-816, Dec. 2004.

[4] J. G. Beerends and F. E. de Caluwe, "The Influence of Video Quality on Perceived Audio Quality and Vice-Versa," Journal of the Audio Engineering Society, vol. 47, no. 5, pp. 355-362, May 1999.

[5] M. P. Hollier, A. N. Rimmel, D. S. Hands, and R. M. Volcker, "Multimodal Perception," BT Technology Journal, vol. 17, no. 1, pp. 35-46, Jan. 1999.

[6] M. Rupp, "Video and Multimedia Transmissions over Cellular Networks," Vienna: Wiley, 2009.

[7] S. Tasaka, Y. Ishibashi, "Mutually Compensatory Property of Multimedia QoS," in Proc. of IEEE International Conference on Communications 2002, vol. 2, pp. 1105-1111, NY, USA, 2002.

[8] S. Winkler, C. Faller, "Perceived Audiovisual Quality of Low-Bitrate Multimedia Content," IEEE Transactions on Multimedia, vol. 8, no. 5, pp. 973-980, 2006.

[9] M. Ries, R. Puglia, T. Tebaldi, O. Nemethova, M. Rupp, "Audiovisual Quality Estimation for Mobile Streaming Services," in Proc. of 2nd Int. Symp. on Wireless Communications (ISWCS), pp. 173-177, Siena, Italy, Sep. 2005.

[10] C. Jones, D. J. Atkinson, "Development of Opinion-Based Audiovisual Quality Models for Desktop Video-Teleconferencing," 6th IEEE International Workshop on Quality of Service, Napa, CA, USA, May 1998.

[11] M. Brotherton, Q. Huynh-Thu, D. S. Hands and K. Brunnstrom, "Subjective Multimedia Quality Assessment," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E89-A, no. 11, pp. 2920-2932, Nov. 2006.

[12] O. Nemethova, M. Ries, M. Rupp, E. Siffel, "Quality Assessment for H.264 Coded Low-Rate and Low-Resolution Video Sequences," in Proc. of Communications, Internet, and Information Technology, pp. 508-512, 2004.

[13] R. Polikar, "Ensemble Based Systems in Decision Making," IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21-45, Third Quarter 2006.

[14] ITU-T Recommendation P.911, "Subjective Audiovisual Quality Assessment Methods for Multimedia Applications," International Telecommunication Union, 1998.

[15] G. Zhai, J. Cai, W. Lin, X. Yang, W. Zhang, M. Etoh, "Cross-dimensional Perceptual Quality Assessment for Low Bitrate Videos," IEEE Transactions on Multimedia, vol. 10, no. 7, pp. 1316-1324, Nov. 2008.

[16] ITU-T Recommendation P.563, "Single Ended Method for Objective Speech Quality Assessment in Narrow-band Telephony Applications," ITU, 2004.

[17] M. Ries, O. Nemethova, M. Rupp, "Motion Based Reference-Free Quality Estimation for H.264/AVC Video Streaming," in Proc. of IEEE Int. Symp. on Wireless Pervasive Computing (ISWPC), San Juan, Puerto Rico, US, Feb. 2007.

[18] M. Ries, O. Nemethova, M. Rupp, "Performance Evaluation of Mobile Video Quality Estimators," invited paper, in Proc. of 15th European Signal Processing Conference (EUSIPCO), Poznan, Poland, Sep. 2007.

[19] ITU-T Recommendation P.862, "Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs," International Telecommunication Union, 2001.

[20] ITU-R Recommendation BS.1387-1, "Method for Objective Measurements of Perceived Audio Quality," Geneva, Switzerland, 1998-2001.

[21] M. G. Bulmer, Principles of Statistics, New York: Dover Publications, 1967.

[22] L. Lu, H. Jiang, and H. J. Zhang, "Content Analysis for Audio Classification and Segmentation," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7, Oct. 2002.

[23] B. V. Dasarathy, B. V. Sheela, "Composite Classifier System Design: Concepts and Methodology," Proceedings of the IEEE, vol. 67, no. 5, pp. 708-713, 1979.

[24] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, New York: Wiley Interscience, 2005.

[25] A. Krogh, J. Vedelsby, "Neural Network Ensembles, Cross Validation and Active Learning," Advances in Neural Information Processing Systems 7, MIT Press, 1995.

[26] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, 2001.

[27] C. Igel, M. Hüsken, "Improving the Rprop Learning Algorithm," in Proc. of the 2nd Int. Symp. on Neural Computation, pp. 115-121, Berlin, ICSC Academic Press, 2000.

[28] VQEG, "Final Report from the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment," 2000, available at http://www.vqeg.org/.

[29] S. Voran, "Objective Estimation of Perceived Speech Quality. Part I: Development of the Measuring Normalizing Block Technique," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 4, Jul. 1999.

[30] ANSI T1.801.03, "American National Standard for Telecommunications - Digital Transport of One-Way Video Signals. Parameters for Objective Performance Assessment," American National Standards Institute, 2003.

Michal Ries is currently a postdoctoral fellow at the Institute of Communications and Radio-Frequency Engineering. He received his B.S. and M.S. degrees in 2002 and 2004 at the Slovak University of Technology, Faculty of Electrical Engineering and Information Technology in Bratislava, and his Dr. techn. (PhD) in Electrical Engineering from Vienna University of Technology in 2008. Before he joined TU Vienna he was working for Siemens PSE as a system engineer. His research interests include perceptual video and audiovisual quality evaluation, video and audiovisual metric design, monitoring of QoS in wireless networks, and video streaming in wireless network optimization.

Bruno Gardlo is currently a doctoral student at the University of Zilina, Faculty of Electrical Engineering, in Zilina. The topic of his dissertation work is Automatic Indexing of Speech Content in Audio Databases. At the Vienna University of Technology he worked on his diploma thesis; its topic was subjective audiovisual quality in multimedia networks. He received his M.S. degree in 2009 at the University of Zilina. During his studies he worked for Siemens PSE as a system engineer. His interests in telecommunications are signal processing, wireless systems and wireless networking.