
ACTA ACUSTICA UNITED WITH ACUSTICA
Vol. 101 (2015) 616-631

DOI 10.3813/AAA.918857

An Analysis of the Impact of Playout Delay Adjustments introduced by VoIP Jitter Buffers on Listening Speech Quality

Peter Počta 1), Hugh Melvin 2), Andrew Hines 3)

1) Dept. of Telecommunications and Multimedia, FEE, University of Žilina, Univerzitná 1, 01026 Žilina, Slovakia. peter.pocta@fel.uniza.sk

2) Discipline of Information Technology, College of Engineering & Informatics, National University of Ireland, Galway, Ireland. hugh.melvin@nuigalway.ie

3) Sigmedia, Trinity College Dublin, Ireland. andrew.hines@tcd.ie

Summary

This paper investigates the impact of frequent and small playout delay adjustments (time-shifting) of 30 ms or less, introduced to silence periods by Voice over IP (VoIP) jitter buffer strategies, on listening quality perceived by the end user. In particular, the quality impact is assessed using both a subjective method (quality scores obtained from a subjective listening test) and an objective method based on perceptual modelling. Two different objective methods are used: PESQ (Perceptual Evaluation of Speech Quality, ITU-T Recommendation P.862) and POLQA (Perceptual Objective Listening Quality Assessment, ITU-T Recommendation P.863). Moreover, the relative accuracy of both objective models is assessed by comparing their predictions with subjective assessments. The results show that the impact of the investigated playout delay adjustments on subjective listening quality scores is negligible. On the other hand, a significant impact is reported for objective listening quality scores predicted by the PESQ model, i.e. the PESQ model fails to correctly predict quality scores for this kind of degradation. Finally, the POLQA model is shown to perform significantly better than PESQ. We conclude the paper by identifying further related research that arises from this study.

PACS no. 43.71.Gv, 43.72.Kb

Received 04 October 2013, accepted 03 December 2014.

© S. Hirzel Verlag · EAA

1. Introduction

The default best-effort Internet presents significant challenges for delay-sensitive applications such as VoIP. To cope with non-determinism, VoIP applications employ receiver playout strategies that adapt to network conditions. Such strategies can be categorised as either per-talkspurt or per-packet. The former take advantage of silence periods within natural speech and adjust such silences to track network conditions, thus preserving the integrity of talkspurts. This approach thus minimises delay at the expense of silence period adjustments and some potential late packet loss. Examples of this approach include [1, 2]. Per-packet strategies are different in that adjustments are made both during silence periods and during talkspurts by scaling of packets, a technique also known as time-warping. This approach is more responsive to short network delay changes in that the per-talkspurt approach can only adapt during recognised silences, even though the timescale of many delay spikes may be less than that of a talkspurt. The main disadvantage of this approach is the degradation caused by the scaling of speech packets. Examples of the latter approach are described in [3, 4] and such techniques can be found in popular VoIP applications such as GoogleTalk and Skype. Other research has attempted to optimise buffer size and, in particular, the trade-off between late packet loss and delay based on customised objective models [5, 6, 7, 8]. Finally, previous research by one of the authors has proposed a hybrid playout strategy that utilises synchronised time in order to implement an informed fixed delay playout whenever possible, thus minimising the need for playout adjustments whilst minimising late packet loss impairments. It reverts to an adaptive approach when delays become excessive. Details of this approach can be found in [9]. In this research, we focus on applications that deploy per-talkspurt strategies, which are commonly found in current telecommunication networks.

Comparative performance analysis of the various per-talkspurt playout strategies has to date largely focused on metrics such as average delay and extent of late packet loss. We have found little research to date that has thoroughly and specifically examined the precise impact of multiple and frequent silence period adjustments, characteristic of such adaptive playout strategies, on speech quality. Although both Ramjee et al. [1] and Moon et al. [2] cite Montgomery [10] in claiming that such distortion does not have a noticeable effect, the latter, which was published in 1983, does not provide any evidence in this regard. All three simply qualify their assertion regarding the impact of silence period distortion by stating that small adjustments are not noticeable. On the other hand, research by Hoene et al. [6] has shown that playout delay adjustments during active speech have significant impacts on subjective listening speech quality, but the tests did not assess adjustments during silences. Hoene et al. also validated the use of the PESQ model to predict the impact of adjustments during active speech. In subsequent research by Hoene et al. [5], PESQ was used to estimate the impact of single large adjustments during both silences and active speech and, regarding the former, shows how adjustments of up to approximately 320 ms are deemed not noticeable. Finally, research using subjective listening tests by Voran [11] suggests that very large adjustments (430 ms) are noticeable and then examines the impact of various general impairments, but not specifically silence period adjustments.

All of the above tests, both subjective and objective, address listening quality only. Other research has examined the broader issue of conversational quality, which includes the interactive nature of voice communications. For example, Lee et al. [12] suggest that, in a wider context, playout delays typical of jitter buffer strategies can, when considered at both ends of a VoIP session, have an effect in a conversational environment and thus impact speech quality. They propose a time-scaling approach that, whilst impacting marginally on listening quality, minimises or eliminates the need for jitter buffer delays, thus minimising any impact on conversational quality. However, their testing approach is based solely on listening-only tests. The impact of playout adjustments on conversational speech quality was also raised by Gong et al. [13]. They discuss the ITU-T E-model, which takes into account end-to-end delays and thus goes some way towards examining conversational quality. Their analysis of the impact of delay on conversational quality is limited, as the work primarily examines listening quality for differing packet loss strategies using PESQ. Interestingly, they also suggest that small adjustments to silence periods have 'almost no effect on perceived quality', without any supporting research to validate this claim. Undoubtedly, there is significant merit in a full reference objective metric that could accurately predict conversational quality, taking into account issues such as the impact of playout delays on prosody or natural turn-taking rhythm, and ultimately on quality. However, whether such a metric is necessary or indeed feasible is a research question beyond the scope addressed in this paper.

In summary, significant research has examined the impact on listening quality of large scale silence adjustments and adjustments to both silence periods and active speech. None to date has addressed the impact of frequent and small playout delay adjustments (time-shifting) introduced to silence periods by Voice over IP (VoIP) jitter buffer strategies.

This gap in the literature provided the main motivation for our research, summarised and presented in this paper. This research is four-staged and structured as follows:

• Detailed subjective test, carried out in May 2012, to assess the precise impact of frequent and small (100 ms) silence period adjustments, typical of VoIP jitter buffer strategies, on subjective listening MOS scores (MOS-LQS).

• Comprehensive study to build on the work of Hoene et al. in [5] and investigate the impact of such silence-period adjustments (i.e. typical of VoIP jitter buffers) on objective listening MOS scores (MOS-LQO), specifically predicted by PESQ. This research was initially presented at [14] and is more exhaustively analysed here.

• A similar and previously unpublished study on the performance of the more recent objective model, POLQA.

• Comprehensive correlation analysis of both objective and subjective results.

The remainder of this paper is structured as follows. Section 2 provides background information and sets the context for our research. Section 2.1 summarises both subjective and objective approaches to speech quality measurement. Section 2.2 summarises related research. Section 2.3 outlines our research motivation and related research questions. Section 3 outlines our simulator-based approach to generating the impaired speech samples used for both objective and subjective testing. It deals with simulator details, delay profiles generated, adaptive algorithms and settings, speech samples chosen and also summarises our speech quality assessment procedures. Section 4 presents and discusses experimental results. Section 5 concludes the paper and suggests some areas for future research arising from this paper.

2. Background

This section sets the context for our research. It firstly summarises both objective and subjective approaches to speech quality measurement. It then briefly describes related research that has touched upon similar research questions. Finally, it describes our contribution by specifying our research motivation and related research questions.

2.1. Subjective and Objective Speech Quality Assessment

Speech quality is judged by human listeners and hence it is inherently subjective. Therefore, the most reliable approach for assessing speech quality is through subjective tests. The Absolute Category Rating (ACR) test, defined by ITU-T Recommendation P.800 [15], is one of the most widely accepted methods of listening speech quality assessment. In the test, listeners express their opinions on the quality of the speech material in terms of five categories: excellent, good, fair, poor and bad, with corresponding integer scores 5, 4, 3, 2 and 1, respectively. The ratings are averaged and the result is known as the Mean Opinion Score (MOS). Subjective testing is time-consuming, expensive and requires strict adherence to methodology to ensure applicability of results. As such, subjective testing is impractical for frequent testing such as routine network monitoring. An interested reader can find more details about subjective testing in [16].

Arising from such limitations, objective test methods have been developed in recent years. They are machine-executable and require little human involvement. In principle, objective methods can be classified into two categories: signal-based methods and parameter-based methods. The former require the availability of speech signals to realize the quality prediction process and, as detailed in [17], can be further divided into two categories: intrusive or non-intrusive. Intrusive signal-based methods use two signals as the input to the measurement, namely a reference signal and a degraded signal, which is the output of the system under test. They identify the audible distortions based on the perceptual domain representation of the two signals, incorporating human auditory models. Several intrusive models have been developed over recent years, such as Perceptual Speech Quality Measure (PSQM) [18], Measuring Normalizing Blocks (MNB) [19, 20], Perceptual Analysis Measurement System (PAMS) [21], Perceptual Evaluation of Speech Quality (PESQ) [22, 23] and Perceptual Objective Listening Quality Assessment (POLQA) [24, 25]. Among the models mentioned above, PSQM, PESQ and, most recently, POLQA have been standardised by the ITU-T as Recommendations P.861 [26], P.862 [27] and P.863 [28], respectively. Moreover, MNB is described in Appendix II of ITU-T Rec. P.861 in order to extend the scope of the recommendation. It should be noted here that ITU-T Rec. P.861 was withdrawn in 2001 and replaced by PESQ.

In contrast to intrusive methods, the idea of the single-ended (non-intrusive) signal-based methods is to generate an artificial reference (i.e. an "ideal" undistorted signal) from the degraded speech signal. Once a reference is available, a signal comparison similar to PESQ/POLQA can then be performed. The result of this comparison can further be modified by a parametric degradation analysis and integrated into an assessment of overall quality. The most widely used non-intrusive models include Auditory Non-Intrusive QUality Estimation (ANIQUE) [29] and the internationally standardized P.563 [30, 31].

Finally, parameter-based methods predict the speech quality through a computation model based on parameters rather than speech signals. The E-model is such a method, defined by ITU-T Recommendations G.107 [32] (narrowband version) and G.107.1 [33] (wideband version), and is primarily used for transmission planning purposes in narrowband and wideband telephony networks. This model includes a set of parameters characterising end-to-end voice transmission as its input, and the output (R-value) can then be transformed into MOS-Conversational Quality Estimated (MOS-CQE) values.
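As an illustration of the final step mentioned above, the transformation of an R-value into an estimated MOS can be sketched as follows. The cubic mapping is the one commonly quoted from ITU-T Rec. G.107 for the narrowband E-model; the function name `r_to_mos` is hypothetical, and the sketch is not part of the study reported here.

```python
def r_to_mos(r: float) -> float:
    """Map a narrowband E-model rating factor R to an estimated MOS (MOS-CQE)."""
    if r <= 0.0:
        return 1.0   # lower bound of the MOS scale
    if r >= 100.0:
        return 4.5   # upper bound reachable by the mapping
    # Cubic mapping between the two bounds
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


if __name__ == "__main__":
    for r in (50.0, 70.0, 93.2):
        print(f"R = {r:5.1f} -> MOS-CQE = {r_to_mos(r):.2f}")
```

The mapping is monotonic over the range of practical interest, so comparing two transmission scenarios by R-value or by estimated MOS leads to the same ordering.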

2.2. Related research

To date, comparative performance analysis of per-talkspurt playout strategies to cope with network jitter (such as [1, 2, 3]) has focused on metrics such as late loss rate and average delays, which are the indirect effects of such strategies, with little consideration given to either the extent or frequency of the silence period adjustments and the impact they might directly have on quality perceived by the end user. The frequency of such adjustments is set by the talkspurt/silence ratio and thus is very much dependent on inherent speech type, but also on Voice Activity Detection (VAD) settings within VoIP applications. Such VAD settings are often user-configurable and can vary greatly across differing VoIP applications. For that reason, a speech segment identified as a silence period by one application may be listed as active speech by another. As such, for a given speech segment and network conditions, the performance of a specific adaptive strategy will be directly impacted by such settings, as described by [34]. The extent of adjustments is influenced, needless to say, by network conditions, but also by the specific adaptive playout strategy. The qualifying phrase used by [10], that small adjustments are not noticeable, is of little practical value considering the variability in both frequency and extent of adjustments that can arise. Although no subjective listening testing, to our knowledge, has been done to precisely quantify this impact, some research dealing peripherally with the issue is summarised below.

Sun and Ifeachor in [7, 8] developed algorithms that seek optimum buffer parameters in a trade-off between delay and late packet loss. Moreover, the impact of jitter on speech quality using PESQ was investigated by Qiao et al. in [35], but this was done by black-box testing and thus it is unclear whether the precise impact of jitter is direct (silence period adjustments) or indirect (through late packet loss). In [36], an extension to the E-model was developed to include the indirect impact of jitter via late packet loss.

Hoene et al. in [5] used PESQ to investigate the impact on speech quality of a single adjustment (0-1000 ms) in an 8-second sample (typical of delay spikes) during both silences and active speech. They showed that PESQ predicts significant impacts during active speech but that adjustments of up to approx. 320 ms are not noticeable during silences. In other research by Hoene et al. [6], they validated, through subjective listening tests, the behaviour of PESQ in predicting the impact of a single adjustment during active speech, but these tests did not extend to similar analysis during silences.

The more extensive work of Voran [11] also deals somewhat peripherally with the issue and is summarised as follows. Voran evaluated, through subjective testing, the impact of temporal discontinuities and packet loss on listening speech quality. Similar to the work of Hoene et al. published in [6], discontinuities were applied to active speech segments only. A range of experiments were carried out to quantify the impact on MOS of such impairments. He introduced three impairments, termed loss, jump and pause, to speech, where loss refers to conventional packet loss and was compensated for through Packet Loss Concealment (PLC), jump refers to temporal contraction of speech by dropping packets (thus without any PLC) and pause refers to temporal elongation of speech through silence insertion, with PLC applied to the inserted silence. As such, the pause and jump impairments are of most interest, as they involve temporal discontinuity and thus most closely reflect the type of impairment caused by per-talkspurt playout strategies. However, one key distinction between Voran's work and the operation of per-talkspurt strategies is that he applied all impairments (pause/loss/jump) to active speech segments. It is important to note that his pause impairment, which introduced a silence gap within active speech, was then compensated for through PLC, and his jump impairment essentially removed a segment of active speech as if it never existed. The impact of both magnitude and frequency of each of the three impairments was examined independently, as well as a combination of pause and jump. Impairments were added at random locations within G.723-encoded active speech. From [37], his main findings are summarised as follows:

• For a given frequency and magnitude of impairment, the impact of the four impairments (loss, pause, jump, pause and jump) on MOS scores was found to be roughly similar.

• As the magnitude of impairment increased, the reported MOS scores decreased at an almost linear rate. For example, at a frequency of one impairment per 100 frames, a 30/60/120 ms pause impairment resulted in the MOS score dropping by 0.21/0.41/1.15, respectively.

• As the frequency of impairment increased, the reported MOS scores decreased at a non-linear rate.

In addition to the above findings, he showed that for a very large and noticeable single adjustment (430 ms silence removal), PESQ failed to register any impact. This to some extent agrees with the analysis of Hoene et al., in that adjustments within silences of up to approx. 320 ms are ignored by PESQ.

As emphasised earlier, the detailed work of both Hoene et al. and Voran introduced the impairments throughout active speech (talkspurts) only. As such, the results cannot be directly compared with per-talkspurt playout strategies, where the temporal adjustment impairments only occur during silences (i.e. the silence period is contracted or elongated). In particular, Voran's additional finding regarding the very noticeable 430 ms temporal adjustment (silence removal), coupled with the findings of both Hoene et al. and Voran that PESQ ignores such large single adjustments during silences, strengthened the argument that both PESQ and POLQA need to be tested for the impact of frequent though smaller playout delay adjustments more typical of VoIP, and thus prompted us to undertake this research.

2.3. Research motivation

As outlined thus far, the literature to date has not quantified, either objectively or subjectively, the precise impact of multiple small silence period adjustments typical of VoIP applications on quality perceived by the end user, expressed by MOS values. As described above, research by Voran and Hoene et al. makes some contribution in this area. They both outlined, firstly, that the PESQ scores were not impacted by very large and noticeable single adjustments during silences. Secondly, both showed that significant adjustments during active speech did impact on subjective results (MOS-LQS), though these are quite different to silence period adjustments.

Considering all this, our primary research motivation was to address the gap in the literature by assessing the impact of frequent and small silence period adjustments on listening quality perceived by the end user, through both objective and subjective tests. In particular, we identified a number of key research questions that we wished to answer, namely:

1. What impact do frequent and small silence period adjustments have on subjective listening MOS scores?

2. What impact do frequent and small silence period adjustments have on objective listening MOS scores, specifically those predicted by both PESQ and POLQA?

3. Can the PESQ and/or POLQA model correctly predict the impact of frequent and small silence period adjustments on listening quality perceived by the end user, as quantified by a subjective test? If so, how accurate are those predictions?

Further questions include:

4. What relationship exists between the magnitude of adjustments and objective and subjective listening MOS scores?

5. What impact does the position of adjustments within speech samples have on objective and subjective listening MOS scores?

3. Methodology

In this section, we describe the methodology used to generate the speech samples for both objective and subjective tests and then provide details of the testing process. A custom-built Matlab-based simulator was developed and used to generate playout adjustments (as depicted in Figure 1). The overall methodology comprised a number of stages, as follows:

• Generate a series of network packet delays consistent with varying network conditions.

• Using these delays and simulated voice patterns (talkspurt distribution), applied to different playout algorithms, generate a series of playout adjustments.

• Apply this set of adjustments to different locations within reference speech samples.

Using this set of degraded and reference speech samples, we carried out both an ITU-T standardised subjective listening test and objective tests, the latter using both PESQ and POLQA. This facilitated a comparison between both approaches and between PESQ and POLQA. The process of generating adjustments is described in Section 3.1. The section starts with a description of the playout adjustment simulator and ends with the simulation work flow and outputs. The process of applying adjustments to speech is described in Section 3.2. Details related to the actual testing are given in Section 3.3.

3.1. Playout Adjustment Generation

In order to assess, both objectively and subjectively, the impact on listening quality perceived by the end user of silence period adjustments typical of VoIP applications, a detailed simulator was built to generate such adjustments. Overall objectives were to:

• Generate a sequence of VoIP packets V with a talkspurt distribution typical of real speech.

• Generate a range n of network delay sequences D, where each range represents different network conditions, i.e. nD delay values.

• For each of the n delay sequences D, generate a series of playout adjustments A that would result from applying this sequence to the VoIP packets V using typical playout algorithms.

Figure 1 depicts the playout adjustment simulator, which was implemented using Matlab. The simulator was built by one of the authors for previous research, as outlined in [38]. As can be seen in Figure 1, the simulator consists of three separate module blocks, namely:

• Voice Simulator Block
• Delay Simulator Block
• Playout Algorithm Simulator Block

Each block is described in the following subsections.

3.1.1. Voice simulator block

The simulated voice streams were based on live speech samples. The critical factor here is the distribution of talkspurts, which were extracted directly from voice tests into text files and used to reproduce speech characteristics. This was done by recording normal VoIP speech with VAD enabled and extracting the Marker bits within the RTP packet headers, where '1' indicates the start of a talkspurt and '0' represents an active speech packet within a talkspurt. An array V thus represents the distribution of talkspurt packets from normal speech, e.g. a sample subset of V, [1 0 0 0 0 0 0 1 0 0 0], represents 2 separate talkspurts of duration 7 packets and 4 packets, respectively.
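The marker-bit representation described above can be illustrated with a short sketch. It is written in Python rather than the Matlab used for the simulator, and the function name `talkspurt_lengths` is hypothetical; it simply recovers talkspurt durations (in packets) from an array such as V.

```python
def talkspurt_lengths(markers):
    """Return the length, in packets, of each talkspurt in a marker-bit array.

    A '1' marks the first packet of a talkspurt; each following '0' is a
    further active speech packet of the same talkspurt.
    """
    lengths = []
    for bit in markers:
        if bit == 1:
            lengths.append(1)        # marker bit set: a new talkspurt starts
        elif lengths:
            lengths[-1] += 1         # continuation of the current talkspurt
        # zeros seen before the first marker are ignored (assumption)
    return lengths


V = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(talkspurt_lengths(V))  # [7, 4]
```

Silence durations are not encoded in V itself; in this representation they are what the playout algorithm is free to lengthen or shorten between consecutive talkspurts.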

3.1.2. Delay simulator block

Significant research has focused on modelling of Internet delay and loss characteristics [39, 40, 41, 42, 43, 44]. In [8], Sun and Ifeachor show that, for VoIP, a Weibull distribution models traffic better than exponential or Pareto. Our model is designed to model the temporal relationship or burstiness of delay traffic, which is commonly found and logically follows from the research that has proposed the use of bursty packet loss models. As such, we propose a series of 2-state Markov models to simulate varying network conditions. Figure 2 illustrates its application to delay modelling. The following summarises the most relevant characteristics of the models developed:

Figure 1. Playout adjustment simulator. [Block diagram: VoIP tests (talkspurt distribution etc.), Voice Simulator (talkspurt distribution V), Markov Delay Models (delay distribution D), Packet/Delay Mapping, Playout Algorithm Simulator (playout adjustment distribution A).]

Figure 2. 2-state Markov delay model.

• BAD/GOOD State Jitter Level: The delay models that were developed used different ranges of jitter to differentiate between GOOD and BAD states. Essentially, a GOOD state had a low jitter metric (set as a percentage of base delays) and a multiplier was applied to this metric to represent the BAD state.

• BAD State Probability: This represents the percentage of packets that are affected by high delay variance (jitter).

• Average BAD State Burst Length: This determines how the BAD state packets are distributed. Much of the literature on network analysis has reported that both loss and delay/jitter have strong temporal dependency or burstiness. Where strong temporal dependency of jitter/delay is present, this will result in clusters of BAD state packets, resulting in BAD delay/jitter bursts spanning more than one packet. Longer BAD bursts will be reflected in higher values for PBB from Figure 2.

• Using these values, we can derive values for all 4 probabilities:
  - PBB: Probability that packet n+1 will have high jitter (BAD state) given that packet n has high jitter.
  - PBG: Probability that packet n+1 will have low jitter given that packet n has high jitter, i.e. switch states.
  - PGG: Probability that packet n+1 will have low jitter (GOOD state) given that packet n has low jitter.
  - PGB: Probability that packet n+1 will have high jitter given that packet n has low jitter, i.e. switch states.

An additional important requirement from the delay block was to ensure that out-of-order packets would not arise; in reality, such events are largely due to route changes and occur infrequently, and thus it was important to reproduce this.
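A minimal sketch of such a 2-state delay model is given below, in Python rather than the Matlab of the original simulator. Several points are assumptions made purely for illustration and are not taken from the paper: the four probabilities are derived using the standard relations for a 2-state chain with geometrically distributed burst lengths, jitter is drawn uniformly, packets are sent every 20 ms, and out-of-order arrival is prevented by clamping each delay. All function and parameter names are hypothetical.

```python
import random


def transition_probs(p_bad, mean_burst):
    """Derive PBB, PBG, PGB, PGG from BAD state probability and mean burst length."""
    if not 0.0 <= p_bad < 1.0 or mean_burst < 1.0:
        raise ValueError("need 0 <= p_bad < 1 and mean_burst >= 1")
    p_bg = 1.0 / mean_burst                   # mean burst length = 1 / PBG
    p_bb = 1.0 - p_bg
    p_gb = p_bad * p_bg / (1.0 - p_bad)       # from stationary P(BAD) = PGB / (PGB + PBG)
    if p_gb > 1.0:
        raise ValueError("p_bad and mean_burst are inconsistent")
    return {"BB": p_bb, "BG": p_bg, "GB": p_gb, "GG": 1.0 - p_gb}


def simulate_delays(n, p_bad, mean_burst, base_ms=50.0, good_jitter=0.1,
                    bad_multiplier=5.0, interval_ms=20.0, seed=0):
    """Generate n per-packet network delays (ms) from the 2-state model."""
    rng = random.Random(seed)
    p = transition_probs(p_bad, mean_burst)
    bad = rng.random() < p_bad                # initial state from stationary distribution
    delays = []
    for _ in range(n):
        # GOOD jitter is a fraction of the base delay; BAD applies a multiplier
        jitter = base_ms * good_jitter * (bad_multiplier if bad else 1.0)
        delay = base_ms + rng.uniform(0.0, jitter)
        if delays:
            # a packet sent interval_ms later must not arrive before its predecessor
            delay = max(delay, delays[-1] - interval_ms)
        delays.append(delay)
        stay = p["BB"] if bad else p["GG"]
        if rng.random() >= stay:
            bad = not bad                     # switch states
    return delays
```

With, for example, a BAD state probability of 0.2 and an average burst length of 4 packets, these relations give PBG = 0.25, PBB = 0.75, PGB = 0.0625 and PGG = 0.9375.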


3.1.3. Playout algorithm simulator block

Two per-talkspurt adaptive strategies (namely algorithms 1 and 4 from [1], referred to here as algorithm 1 and 2, respectively) were simulated. Both algorithms utilise linear recursive filters in tracking network conditions but differ in that algorithm 2 responds more quickly due to different parameters and also includes a spike mode that responds more rapidly to changing network conditions, although it must still wait for the next silence period to do so. Both algorithms adjust playout time accordingly at the start of a talkspurt, as given by

$$d_i = \alpha d_{i-1} + (1 - \alpha)\, n_i \qquad (1)$$

$$p_i = t_i + d_i + \beta v_i \qquad (2)$$

In the above, $i$ refers to packet $i$, $d_i$ is the estimated end-to-end delay, $\alpha$ is the filter gain, $n_i$ is the measured delay, $p_i$ is the playout time, $t_i$ is the send time, $v_i$ is the estimated variation in delay and $\beta$ is a multiplication factor. For example, in [1] the authors choose a $\beta$ value of 4 for both algorithms, whereas the history factor $\alpha$ was set to 0.875 for algorithm 2 versus 0.998 for algorithm 1. The choice of parameters $\alpha$, $\beta$ and the spike detection threshold (for algorithm 2) impacts greatly on the performance of these algorithms, and they are usually tuned to match precise network conditions, i.e. from stable to unstable. For this reason, we utilised a range of values as described in the following section. As described earlier, adjusting on a per-talkspurt basis maintains the integrity of speech within talkspurts whilst altering the inter-talkspurt silence periods.

The delays from the Delay Block are mapped to the Voice talkspurt distribution series V and applied to the various playout algorithms. The delays are applied to each packet in turn and processed by the playout algorithm, and adjustments are made at the start of each talkspurt, indicated by a '1' in the V array. This then generates a series of playout adjustments A. As outlined in Section 2.2, the extent of adjustment is dependent not only on the network condition and playout algorithm but also on the specific VAD settings of the VoIP application, as the latter will greatly impact on the talkspurt distribution for a given speech segment. In any event, the resulting required adjustment (silence) can be added or removed at this point before the next talkspurt is processed.
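A minimal sketch of a per-talkspurt playout scheduler built on equations (1) and (2) is given below. The update rule for the delay variation $v_i$ is not stated in this section; the sketch assumes the same recursive filter applied to $|d_i - n_i|$, and it assumes the estimate is initialised with the first measured delay. The function and parameter names are hypothetical, and spike mode (algorithm 2) is not modelled.

```python
def playout_times(send_times, delays, talkspurt_start, alpha=0.998, beta=4.0):
    """Per-talkspurt adaptive playout following equations (1) and (2).

    send_times      -- t_i, send time of each packet (ms)
    delays          -- n_i, measured network delay of each packet (ms)
    talkspurt_start -- 1 where a packet starts a talkspurt, else 0 (array V)
    Returns the playout time p_i of every packet.
    """
    d = delays[0]        # assumption: initialise estimate with first delay
    v = 0.0              # assumption: no variation known at start-up
    offset = None        # playout delay held constant within a talkspurt
    result = []
    for t, n, first in zip(send_times, delays, talkspurt_start):
        d = alpha * d + (1.0 - alpha) * n            # equation (1)
        v = alpha * v + (1.0 - alpha) * abs(d - n)   # assumed variation filter
        if first or offset is None:
            offset = d + beta * v                    # d_i + beta * v_i, eq. (2)
        result.append(t + offset)                    # p_i = t_i + offset
    return result


# Five packets at 20 ms intervals; a new talkspurt starts at packet 3
times = playout_times([0, 20, 40, 60, 80], [50, 55, 52, 70, 65], [1, 0, 0, 1, 0])
```

Because the offset is only recomputed where V contains a 1, packet spacing inside a talkspurt is preserved and any change in playout delay falls in the preceding silence period.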

3.1.4. Simulation workflow and outputs

The overall simulator works as follows. The user first specifies a talkspurt distribution file, which is extracted from live speech and loaded as an array V. The user then specifies the network conditions for the test. The delay block returns a sequence of network delays D corresponding to those conditions. The input parameters are:

1. Number of packets
2. Packet interval (ms)
3. Base delay (ms)
4. BAD state burst length (ms)
5. BAD state probability (%)
6. GOOD state jitter (% of base delay)
7. BAD state jitter multiplier

For our testing, parameters 1–4 were kept constant while parameters 5–7 were varied as outlined below to give different network characteristics ranging from stable to unstable:

1. Number of packets = 4000
2. Packet interval = 20 ms
3. Base delay = 50 ms
4. BAD state burst length = 10 packets (200 ms)
5. BAD state probability = 20, 40, 60, 80 %
6. GOOD state jitter = 25, 50 %
7. BAD state jitter multiplier = 2, 3, 4
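The parameter grid above can be enumerated as follows. This is only an illustrative sketch: the dictionary keys are hypothetical names, and the values are taken as listed in the text.

```python
from itertools import product

# Parameters 1-4: kept constant across all network conditions
FIXED = {
    "packets": 4000,
    "packet_interval_ms": 20,
    "base_delay_ms": 50,
    "bad_burst_packets": 10,
}

# Parameters 5-7: varied to move from stable to unstable networks
BAD_STATE_PROBABILITY_PCT = (20, 40, 60, 80)
GOOD_STATE_JITTER_PCT = (25, 50)
BAD_STATE_JITTER_MULTIPLIER = (2, 3, 4)

network_conditions = [
    dict(FIXED, bad_prob_pct=p, good_jitter_pct=g, bad_jitter_mult=m)
    for p, g, m in product(
        BAD_STATE_PROBABILITY_PCT,
        GOOD_STATE_JITTER_PCT,
        BAD_STATE_JITTER_MULTIPLIER,
    )
]

# Burst length in milliseconds follows from the packet interval
burst_ms = FIXED["bad_burst_packets"] * FIXED["packet_interval_ms"]
```

The grid holds 4 x 2 x 3 = 24 combinations, and the burst length of 10 packets at a 20 ms interval corresponds to the quoted 200 ms.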

Note that in arriving at these parameters, particularly those relating to jitter, a detailed series of delay and jitter tests were undertaken, measuring delay/jitter between Ireland and the US/mainland Europe. Further details can be found in [38]. More recent testing by [45, 46] has highlighted the particular problem of very high jitter/delay in congested IEEE 802.11 networks. The final step was to map the delays to packets and apply them to adaptive jitter buffering (AJB) algorithms 1 and 2. This generates a series of playout adjustments A for every network test condition and playout algorithm. Figure 1 summarises the overall flow within the simulator.

In [47], details of the comprehensive tests using the simulator are presented. In summary, 24 different network delay models were used, generating 24 network delay sequences D. These delay values were fed to 2 different playout algorithms as described in Section 3.1.3. For each playout algorithm, tests were repeated using different parameters such as α (history weighting, varied from 0.8 to 0.998), β (jitter multiplier, varied from 4 to 6) and spike mode threshold (algorithm 2 only); see again Section 3.1.3 for details. Each combination resulted in a distinct set of playout adjustments A for each test scenario. Note that the voice samples V were based on 80 seconds of active speech with 40 talkspurts (Marker bit = 1), whereas the speech samples chosen for this experiment were 8 seconds long; thus, this also had to be factored in. Essentially, a pro-rata approach was taken: for the 80 seconds of speech used for tests there were 40 playout adjustments, so for our 8-second ITU-T speech samples we implemented 4 adjustments. Arising from the full range of test combinations described above, which numbered 96, and the resulting adjustments, a subset of 12 sets of playout adjustments containing 4 adjustments each was taken to represent a spectrum of network conditions ranging from a stable network to an unstable network. Table I lists the actual playout adjustments selected that were applied to the speech samples.

3.2. Speech samples

As is normal for quality testing, 4 reference speech samples were used. The English subset of the ITU-T P-series Supplement 23 [48] database was used for speech material, consisting of a pair of utterances with a small pause between the utterances. Two male and two female speakers uttering different sentences were included in the stimuli. The speech samples used (source samples available in the database) were 8 seconds in length and stored in 16 bit, 8000 Hz linear PCM.

Figure 3. Demonstration of how playout adjustments were applied to 1st Female speech sample, Variant A. Arrows indicate where adjustments were placed.

Figure 4. Demonstration of how playout adjustments were applied to 1st Female speech sample, Variant B. Arrows indicate where adjustments were placed.

Figure 5. Demonstration of how playout adjustments were applied to 2nd Female speech sample, Variant A. Arrows indicate where adjustments were placed.

Figure 6. Demonstration of how playout adjustments were applied to 2nd Female speech sample, Variant B. Arrows indicate where adjustments were placed.

Each speech sample was modified by inserting and removing silence periods to reflect the adjustments as specified above. The adjustments were a mix of positive and negative adjustments (adding and removing silence periods), as shown in Table I.

As a further experimental variable, the set of four adjustments was applied to each sample in two different locations (referred to hereafter as variant A and B). The only distinction between variant A and B is that the impairments in variant B were applied in the latter part of each sample. All the adjustments were made using a free sound editor. Figures 3–10 illustrate how the 4 playout adjustments were applied to all speech samples involved in the experiment in 2 different places (Variant A and B).

The overall result of this sampling created 96 speech samples (4 voices x 12 test conditions x 2 variants).
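For illustration, the silence adjustments can be expressed programmatically. The authors made the edits by hand in a sound editor, so the sketch below is a hypothetical equivalent: it assumes the sample is held as a list of 16-bit PCM values at 8000 Hz and that inserted silence is digital zero, which may differ from the background noise level of the real recordings.

```python
SAMPLE_RATE_HZ = 8000  # narrowband linear PCM, as used for the speech samples


def ms_to_samples(duration_ms: float) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(duration_ms * SAMPLE_RATE_HZ / 1000)


def apply_adjustment(pcm: list, position: int, adjust_ms: float) -> list:
    """Insert (positive) or remove (negative) silence at a sample position.

    position is assumed to lie inside a silence period, so that removing
    samples does not cut into active speech.
    """
    count = ms_to_samples(abs(adjust_ms))
    if adjust_ms >= 0:
        return pcm[:position] + [0] * count + pcm[position:]
    if position + count > len(pcm):
        raise ValueError("not enough samples left to remove")
    return pcm[:position] + pcm[position + count:]


# Toy signal: speech, 400 samples of silence, speech
signal = [1] * 100 + [0] * 400 + [2] * 100
longer = apply_adjustment(signal, 200, 30)    # add 30 ms of silence
shorter = apply_adjustment(signal, 120, -30)  # remove 30 ms of silence
total_samples = 4 * 12 * 2                    # voices x conditions x variants
```

At 8000 Hz, the largest adjustment in Table I (30 ms) corresponds to 240 samples.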

3.3. Speech quality assessment

The speech quality assessment process was divided into two parts, namely subjective assessment (listening test) and objective assessment (using the PESQ and POLQA models). Both assessment procedures are described in more detail below.

The ACR subjective listening test was performed in May 2012 in accordance with ITU-T Recommendation P.800 [15]. In every case, up to 2 listeners were seated in a small listening room (acoustically treated) with a background noise well below 20 dB SPL (A). Altogether, 30 naïve (non-expert) listeners (16 male, 14 female, 20–55 years, mean 34.43 years) participated in the test. All subjects were Irish nationals whose first language was English. The subjects were remunerated for their efforts. The samples (96 degraded samples + 4 reference samples) were played out using high quality studio equipment in a random order and diotically presented over Sennheiser HD 455 headphones (presentation level 73 dB SPL (A)) to the test subjects. The opinion scores, from 1 (bad) to 5 (excellent), were averaged to obtain MOS-Listening Quality Subjective narrowband (MOS-LQSn) values for each sample.

Figure 7. Demonstration of how playout adjustments were applied to 1st Male speech sample, Variant A. Arrows indicate where adjustments were placed.

Figure 8. Demonstration of how playout adjustments were applied to 1st Male speech sample, Variant B. Arrows indicate where adjustments were placed.

Figure 9. Demonstration of how playout adjustments were applied to 2nd Male speech sample, Variant A. Arrows indicate where adjustments were placed.

Figure 10. Demonstration of how playout adjustments were applied to 2nd Male speech sample, Variant B. Arrows indicate where adjustments were placed.

In the next step, the 100 samples (essentially the 96 degraded samples (using test conditions No. 1–12) and 4 reference samples (Ref. test condition)) were compared to their respective reference samples using both the PESQ model described in ITU-T Rec. P.862 [22, 23, 27] and the POLQA model described in P.863 [24, 25, 28] in order to get objective listening quality scores. In the case of PESQ, the output (raw PESQ scores) was converted to MOS-Listening Quality Objective narrowband (MOS-LQOn) values by the equation defined in [49].

4. Experimental results

In this section, we present experimental results for both subjective and objective assessment (PESQ and POLQA models), as well as a detailed analysis and comparison of both.


4.1. Experimental results for subjective assessment

In Figure 11, we summarise the results of the subjective listening test, averaged over the 4 different voices involved in the experiment, as described in Section 3. It can be seen that the impact of test conditions No. 1–12 (playout adjustments) on average MOS-LQSn scores relative to the reference samples is quite limited. On the other hand, we can also see that the subjects gave surprisingly low MOS-LQSn scores to all samples, including the reference samples involved in the test, i.e. the average values oscillate around 3.6 MOS. This value is quite low considering that the samples contain either no degradations (reference samples) or very moderate degradations in a narrowband context. This result warrants further analysis beyond the scope of this paper, but one possibility is that the subjects' opinion has been affected by their previous long-term experience with wideband telephony (wideband speech), though this was not validated. Such experience would alter their internal reference to wideband speech (extended frequency range resulting in higher speech quality) and thus explain the lower scores given to the narrowband samples involved in the test. One other possibility relates to the fact that the range of conditions (impairments) introduced into the speech samples was quite limited. This issue is discussed in more detail in [50]. As is evident, the impact of varying the extent of playout adjustments across all test conditions was very small (insignificant). However, one characteristic of note that emerged is the small impact of the location of the adjustments on scores, i.e. most of the test conditions using variant B obtained slightly lower scores than the same conditions with variant A. The biggest difference (0.2 MOS) between both investigated variants (location impact) was obtained for test condition No. 10. As discussed and evident from Figures 3–10, variant B adjustments were designed to be towards the end of the sample, so one explanation is that the distortions presented closer to the end of the sample were a bit more annoying for the subjects than those presented in the first half of the sample. In principle, this result has echoes of the so-called recency effect reported in the literature (e.g. [51, 52, 53]). However, such tests used samples longer than 60 seconds. In any event, and as discussed later, the differences are not significant in the context of the confidence interval.

One three-way analysis of variance (ANOVA) test was conducted on the subjective results using test condition, voice and variant (location) as fixed factors (Table V). It should be noted here that the voice factor is a combination of voice and content. The highest F-ratio was determined for the voice factor (F = 73.12, p < 0.001), and the effect of voice was found to be highly statistically significant. The test condition factor appeared to have a weaker effect on quality than the voice factor (F = 0.82, p = 0.635), and the effect of test condition was not statistically significant. The last factor investigated in the ANOVA test was the variant factor; it also turns out to have a weaker effect on quality than the voice factor and, similar to the test condition factor, not to be statistically significant (F = 1.07, p = 0.3022). Regarding interactions of all the involved factors, the results show that none of them is statistically significant.

Table I. Playout adjustments (in ms) applied to speech samples. Σ: absolute sum of adjustments.

Test condition   1st   2nd   3rd   4th    Σ
Ref                0     0     0     0    0
1                  2    −2     3    −3   10
2                  4    −4    −4     4   16
3                  3    −3    −6     6   18
4                  5    −5    −5     5   20
5                  3    −6    −7    10   26
6                 16   −12    −8     4   40
7                 10   −17    −6    13   46
8                 10   −15   −10    15   50
9                  8   −23    −3    18   52
10                 5    10   −30    15   60
11               −15    15   −15    15   60
12               −25    22    −8    11   66

Figure 11. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQSn. The vertical bars show the 95% CI computed over 120 MOS-LQSn values (30 subjective scores per sample for 4 samples). [Plot: MOS-LQSn (1–5) versus test conditions Ref and No. 1–12, for Variant A and Variant B.]

To summarise, the results of the ANOVA test revealed that subjects were more sensitive to the voice than to all the test conditions and variants, and no statistically significant interactions between all the investigated factors were found (assuming no impact of content due to carefully chosen speech samples). It should be noted here that variability caused by content is considered one of the sampling factors, as defined in the Handbook of subjective testing practical procedures [16]. The ANOVA test also revealed that the small differences between quality scores of variant A and B reported previously are not statistically significant. In other words, the results of the ANOVA test confirmed our assertion that the differences are not significant in the context of the confidence interval.

As is often reported in the literature, some impact relating to the voice effect was expected in this experiment, but not to such an extent. A diagnostic analysis of the test data revealed that one of the voices (1st male) was liked more than the others (i.e. over all conditions, this voice was rated on average approx. 0.4 MOS-LQSn higher than the second male voice and approx. 0.6 MOS-LQSn higher than the female voices). It is also worth noting that both male voices obtained higher scores than the female voices.

As can be seen above, the impact of playout adjustments is not statistically significant. This fact raises a question as to whether the impact would be higher if the Degradation Category Rating (DCR) testing approach was deployed. We considered this during the research design phase. We carried out limited DCR subjective testing (4 experts involved), and the subjects noted no degradation caused by time-shifting (playout adjustments). On this basis, we came to the conclusion that the introduced impairments would not be noticed by subjects in a DCR test. For that reason, we decided to use an ACR test.

Furthermore, it should be noted that in telephony, subjects have no access to the original speech from their conversational partner, and thus ACR testing is commonly used in telephony speech quality assessment.

To the best of our knowledge, the results presented in this section are a first proof of the assertion published in the literature [1, 2, 10] that small and frequent silence period adjustments typical of VoIP playout algorithms do not have a noticeable effect on listening quality perceived by the end user.

It is interesting at this juncture to compare our results with the subjective results of Voran [11] and Hoene et al. [6]. Voran introduced pause/jump impairments at the rate of 1 to 4 per 3-second sentence, with pause/jump magnitudes of 30, 60 and 120 ms. We introduced 4 impairments in an 8-second sample, based on our observations of actual speech and using adjustments which were derived from realistic network delay models and real playout algorithms. As a result, our adjustments were typically much smaller (the largest was 30 ms). Voran noted that for 1 pause/jump adjustment of 30 ms, MOS scores fell by 0.2, whereas we found no significant drop in MOS scores as the extent of adjustments increased, even for conditions 10–12, where the magnitude of some of the adjustments was similar to Voran's at 20–30 ms. One key distinction, as stated before, is that Voran applied such impairments to active speech, whereas all our adjustments were made to silence periods. Furthermore, in the case of Voran's pause impairment, PLC was used. Finally, Voran reported a significant impact of a very large single adjustment (silence removal) of 430 ms. The magnitude of adjustments introduced in our samples was much smaller and, as reported above, no significant subjective impact was found. Hoene et al. in [6] introduced very large (in comparison to adjustments introduced by jitter buffers) single adjustments into active speech and also noted a significant impact, consistent with PESQ.

4.2. Experimental results for objective assessment

Figure 12 depicts the results of objective assessment done by PESQ (MOS-LQOn (PESQ)) using the same test conditions and speech samples. We can observe that the severity of the test conditions (playout adjustments) has a relatively big impact on the predicted MOS values. In summary, the MOS scores decrease as adjustments increase, i.e. as network instability increases. This is interesting in the context of findings by Voran [11] that PESQ does not register any impact arising from a single 430 ms silence period adjustment (silence removal). Hoene et al.'s results from PESQ analysis published in [5] are somewhat similar to Voran's, showing no impact for single adjustments of up to approx. 320 ms. One possible explanation is that in our tests we introduced frequent and small adjustments rather than the single and large adjustments introduced by Voran and Hoene et al. We speculate that PESQ had difficulties with our adjustments profile during the time-alignment process. As detailed in Section 2.2, the frequency and extent of adjustments is impacted greatly by VAD settings, but our test design of 4 adjustments in 8 seconds is not atypical of VoIP.

Figure 12. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by PESQ. The vertical bars show the 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: MOS-LQOn (PESQ) (1–5) versus test conditions Ref and No. 1–12, for Variant A and Variant B.]

One positive correlation between PESQ and the subjective test results is that the biggest difference between variant A and B (0.47 MOS) was reported for test condition No. 10. It should also be noted that there is a significant difference between the scores obtained for conditions No. 10 and 11. This is interesting in that, whilst the absolute sum of the adjustments for both conditions is the same, the individual adjustments are different. Test condition No. 10 represents very different adjustments, varying from 5 to −30 ms. On the other hand, condition No. 11 contains adjustments of similar magnitude: two adjustments of −15 ms and two adjustments of 15 ms. The second one (condition No. 11) obtained much lower scores in both cases (Variant A and B). As such, it seems that PESQ is better able to cope with large adjustments (30 ms was the largest of all conditions). In contrast, the opposite results were obtained for the subjective data, i.e. listeners scored test condition No. 10 lower than No. 11. However, the differences between the subjective results obtained for test conditions No. 10 and 11 are much smaller than those obtained for the objective results and of the same order as the confidence interval (see Figure 11).

As with the subjective results, lower MOS-LQOn (PESQ) scores were reported for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by PESQ ranged from 3.665 to 4.55 MOS for variant A and from 3.3 to 4.55 MOS for variant B. It can be clearly seen in Figure 12 that the impact of adjustment location is noticeable for objective scores predicted by PESQ, especially for higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by PESQ revealed that the voice impact was much weaker than that reported above for the subjective data. In fact, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. Because of that, such models are sometimes called impairment or degradation models. However, it seems that there is some interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, because the deviation of the MOS-LQOn (PESQ) scores between the test conditions is different for all 4 voices involved in the test. A much weaker and not statistically significant interaction of test condition and voice was obtained for the subjective data, as shown in Table V.

Figure 13 shows the results of objective assessment done by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of PESQ predictions. Nonetheless, it seems that POLQA was impacted more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No. 7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend of POLQA scores, as has been done for PESQ above.

Regarding the biggest difference between variant A and B for test condition No. 10, reported above for both subjective scores and objective scores predicted by PESQ, it is worth noting that this effect was not captured by POLQA at all. In other words, the scores predicted by POLQA for variant A and B of test condition No. 10 were very similar (0.02 MOS difference).

Moreover, the scores predicted by POLQA largely do exhibit the same behaviour as scores obtained from the subjective test from a location perspective (Variant A and B). In other words, it has been reported above that PESQ and the subjects involved in the subjective test provided lower MOS scores for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by POLQA ranged from 4.10 to 4.43 MOS for variant A and from 4.08 to 4.43 MOS for variant B. This contrasts with the results for PESQ, where adjustment location had a significant impact. It can be clearly seen in Figure 13 that the adjustment location plays a less important role here, except for some of the higher magnitudes of investigated playout adjustments.

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show the 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: MOS-LQOn (POLQA) (1–5) versus test conditions Ref and No. 1–12, for Variant A and Variant B.]

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact was much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on special characteristics of voice. Regarding the interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, this effect was also obtained for the POLQA model, but only for female voices.

4.3. Comparison between subjective and objective quality scores

In the following subsection, subjective MOS values (MOS-LQSn) are compared to the predictions provided by both PESQ and POLQA (MOS-LQOn (PESQ, POLQA)). The comparison is performed for all experimental conditions, i.e. all combinations of voice, test conditions and both investigated location variants. However, the MOS-LQSn values will have been influenced by the choice of conditions in the actual experiment. In order to account for such influences, model predictions are commonly transformed to a range of conditions that are part of the respective test [54]. This may be done, for example, by using a monotonic 3rd-order mapping function, presuming such a function can be found.

The performance of the PESQ and POLQA models is quantified in terms of the Pearson correlation coefficient R, the respective root mean square error (rmse) and the epsilon-insensitive root mean square error (rmse*), as [55, 56]

$$R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2 \, \sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \qquad (3)$$

and

$$rmse = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} (X_i - Y_i)^2} \qquad (4)$$

with $X_i$ the subjective MOS value for stimulus $i$, $Y_i$ the objective (predicted) MOS value for stimulus $i$, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, $N$ the number of stimuli considered in the comparison, and $d$ the number of degrees of freedom provided by the mapping function ($d = 4$ in the case of a 3rd-order mapping function, $d = 1$ in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

$$Perror_i = \max\left(0, \; |X_i - Y_i| - ci95_i\right) \qquad (5)$$

where $ci95_i$ represents the 95% confidence interval and is defined by [56]

$$ci95_i = t(0.05, M) \, \frac{\delta_i}{\sqrt{M}} \qquad (6)$$

where $M$ denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of subjective scores for stimulus $i$. The final epsilon-insensitive root mean square error is calculated as usual, but based on $Perror_i$ from formula (5):

$$rmse^{*} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} Perror_i^2} \qquad (7)$$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to classical rmse, but rmse* considers only differences related to an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of MOS is taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and the rmse and rmse* = 0.0.
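The three performance measures in equations (3)–(7) can be sketched in a few lines of standard-library Python. This is an illustration, not the evaluation code used in the study; the function names are hypothetical, and the critical value t(0.05, M) is passed in by the caller rather than looked up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, equation (3)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y, d=1):
    """Root mean square error, equation (4); d = 1 without regression."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / (len(x) - d))


def ci95(std_dev, m, t_crit):
    """95 % confidence interval, equation (6); t_crit is t(0.05, M)."""
    return t_crit * std_dev / math.sqrt(m)


def rmse_star(x, y, ci, d=1):
    """Epsilon-insensitive rmse, equations (5) and (7)."""
    perror = [max(0.0, abs(a - b) - c) for a, b, c in zip(x, y, ci)]
    return math.sqrt(sum(p ** 2 for p in perror) / (len(x) - d))


# Toy data: predictions offset from subjective scores by 0.5 MOS
subjective = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
r = pearson_r(subjective, predicted)
err = rmse(subjective, predicted)
err_star = rmse_star(subjective, predicted, [0.5] * 4)
```

In the toy data, a constant offset leaves the correlation perfect while rmse is non-zero, and rmse* vanishes once the offset is no larger than the confidence interval, which is the behaviour the definitions above describe.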

All R, rmse and rmse* values are calculated for the raw (non-regressed) MOS-LQOn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined), and both (the regressed and the non-regressed MOS-LQOn predictions) are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*) are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (Variant A and B). The correlation coefficient is positive, though very low (R = 0.17), for variant A and negative for variant B (R = −0.15). As normal, positive correlation indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases, the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed) per sample. [Scatter plot, both axes 1–5, Variant A and Variant B.]

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed) per sample. [Scatter plot, both axes 1–5, Variant A and Variant B.]

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed) per sample. [Scatter plot, both axes 1–5, Variant A and Variant B.]

Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant      R     rmse    rmse*
A          0.17   0.639   0.367
B         −0.15   0.690   0.424

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant      R     rmse    rmse*
A          0.30   0.783   0.448
B          0.47   0.806   0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant      R     rmse    rmse*
A          0.30   0.265   0.243
B          0.47   0.270   0.217

The 3rd-order regression, as recommended in [54], leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outlier influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried 2nd- and 1st-order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this data set.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little intra-speaker difference between location variants, significant inter-speaker variability and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12 and significant inter-speaker variation.

Figure 17. Dominant experimental factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals. [Three panels: subjective results (MOS-LQS), PESQ (MOS-LQOn) and POLQA (MOS-LQOn), each on a 3–5 scale versus speaker/variant M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B.]

Regarding the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.
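The monotonicity requirement on the mapping function can be illustrated for the 1st-order case, where it reduces to the sign of the fitted slope. The sketch below is a plain least-squares line fit with hypothetical names and toy data; it is not the regression procedure of [54] and does not use the study's scores.

```python
def fit_first_order(predicted, subjective):
    """Least-squares line mapping objective scores onto subjective scores.

    Returns (slope, intercept) of subjective ~ slope * predicted + intercept.
    """
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_s = sum(subjective) / n
    spp = sum((p - mean_p) ** 2 for p in predicted)
    sps = sum((p - mean_p) * (s - mean_s)
              for p, s in zip(predicted, subjective))
    slope = sps / spp
    return slope, mean_s - slope * mean_p


def is_monotonically_increasing(slope):
    """A 1st-order mapping is monotonically increasing iff its slope > 0."""
    return slope > 0


# Toy data only: predictions that fall as subjective scores rise
slope_down, _ = fit_first_order([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
slope_up, intercept_up = fit_first_order([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
usable_down = is_monotonically_increasing(slope_down)
usable_up = is_monotonically_increasing(slope_up)
```

A negative slope corresponds to the situation described above for PESQ: the fit is monotonic but decreasing, so it cannot serve as a valid mapping between objective and subjective scores.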

Moving to POLQA, the results show little intra-speaker variation across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except for Female 1) and much less inter-speaker variation. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in both Tables III and IV, and the rmse data for POLQA (after regression; 1st-order polynomial regression applied) is also much better than for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.


Table V. Summary of ANOVA test conducted on the MOS-LQSn's.

Effect                 SS        df     MS        F       p
Test condition (TC)    9.23      12     0.7693    0.82    0.6350
Voice                  207.02    3      69.0073   73.12   0.0000
Variant                1.01      1      1.0051    1.07    0.3022
TC*Voice               18.21     36     0.5059    0.54    0.9895
TC*Variant             4.68      12     0.3899    0.41    0.9593
Voice*Variant          0.68      3      0.2282    0.24    0.8672
Error                  2880.37   3052   0.9438
Total                  3121.2    3119
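The F column of Table V can be sanity-checked from the mean squares, since each F statistic is the mean square of the effect divided by the error mean square. The sketch below assumes the Table V values read as 0.7693, 69.0073, 1.0051, 0.5059, 0.3899 and 0.2282 for the effects and 0.9438 for the error term (decimal points restored from context); small deviations from the tabulated F values are expected because the mean squares are themselves rounded.

```python
# Error mean square and (effect mean square, tabulated F) pairs as read from Table V.
# The placement of decimal points is an assumption based on SS / df = MS.
ERROR_MS = 0.9438
EFFECTS = {
    "Test condition (TC)": (0.7693, 0.82),
    "Voice": (69.0073, 73.12),
    "Variant": (1.0051, 1.07),
    "TC*Voice": (0.5059, 0.54),
    "TC*Variant": (0.3899, 0.41),
    "Voice*Variant": (0.2282, 0.24),
}

def f_ratio(ms_effect, ms_error=ERROR_MS):
    """F statistic of a fixed effect: effect mean square over error mean square."""
    return ms_effect / ms_error

def check_table(tolerance=0.01):
    """Return, per effect, whether the recomputed F matches the tabulated one."""
    return {name: abs(f_ratio(ms) - f_tab) < tolerance
            for name, (ms, f_tab) in EFFECTS.items()}

if __name__ == "__main__":
    for name, ok in check_table().items():
        print(f"{name:22s} F = {f_ratio(EFFECTS[name][0]):7.2f}  consistent: {ok}")
```

Only the Voice effect yields an F ratio far above 1, which is consistent with the observation that speaker voice, rather than test condition or variant, dominated the subjective scores.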

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12) and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ/POLQA)). Moreover, the accuracy of both PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions, outlined in Section 2.3, were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much less. Note that both Voran's and Hoene et al.'s research showed that single adjustments (430 ms in the case of Voran, approximately 0–320 ms in the case of Hoene et al.) were found to be disregarded by PESQ. Regarding question 3 and PESQ, a comparison of the subjective assessments and predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although Hoene et al.'s research [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either during its design and integration phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that, whilst some relationship exists, it is both much less noticeable and characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to subjective scores (insignificant), except for Female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research, and which are worthy of further investigation, include:

• Why in the subjective test did listeners return scores significantly lower than predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though not statistically significant. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in PESQ objective results but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal of Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: Psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] I D C D S of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D.123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v.1.0. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.


cite Montgomery [10] in claiming that such distortion does not have a noticeable effect, the latter, which was published in 1983, does not provide any evidence in this regard. All three simply qualify their assertion regarding the impact of silence period distortion by stating that small adjustments are not noticeable. On the other hand, research by Hoene et al. [6] has shown that playout delay adjustments during active speech have significant impacts on subjective listening speech quality, but the tests did not assess adjustments during silences. Hoene et al. also validated the use of the PESQ model to predict the impact of adjustments during active speech. In subsequent research by Hoene et al. [5], PESQ was used to estimate the impact of single large adjustments during both silences and active speech and, regarding the former, shows how adjustments of up to approximately 320 ms are deemed not noticeable. Finally, research using subjective listening tests by Voran [11] suggests that very large adjustments (430 ms) are noticeable and then examines the impact of various general impairments, but not specifically silence period adjustments.

All of the above tests, both subjective and objective, address listening quality only. Other research has examined the broader issue of conversational quality, which includes the interactive nature of voice communications. For example, Lee et al. [12] suggest that, in a wider context, playout delays typical of jitter buffer strategies can, when considered at both ends of a VoIP session, have an effect in a conversational environment and thus impact speech quality. They propose a time-scaling approach that, whilst impacting marginally on listening quality, minimises or eliminates the need for jitter buffer delays, thus minimising any impact on conversational quality. However, their testing approach is based solely on listening-only tests. The impact of playout adjustments on conversational speech quality was also raised by Gong et al. [13]. They discuss the ITU-T E-model, which takes into account end-to-end delays and thus goes some way towards examining conversational quality. Their analysis of the impact of delay on conversational quality is limited, as the work primarily examines listening quality for differing packet loss strategies using PESQ. Interestingly, they also suggest that small adjustments to silence periods have 'almost no effect on perceived quality', without any supporting research to validate this claim. Undoubtedly, there is significant merit in a full reference objective metric that could accurately predict conversational quality, taking into account issues such as the impact of playout delays on prosody or natural turn-taking rhythm, and ultimately on quality. However, whether such a metric is necessary, or indeed feasible, is a research question beyond the scope addressed in this paper.

In summary, significant research has examined the impact on listening quality of large scale silence adjustments and adjustments to both silence periods and active speech. None to date have addressed the impact of frequent and small playout delay adjustments (time-shifting) introduced to silence periods by Voice over IP (VoIP) jitter buffer strategies.

This gap in the literature provided the main motivation for our research, summarised and presented in this paper. This research is four-staged and structured as follows:

• Detailed subjective test, carried out in May 2012, to assess the precise impact of frequent and small (100 ms) silence period adjustments, typical of VoIP jitter buffer strategies, on subjective listening MOS scores (MOS-LQS).

• Comprehensive study to build on Hoene et al.'s work in [5] and investigate the impact of such silence-period adjustments (i.e. typical of VoIP jitter buffers) on objective listening MOS scores (MOS-LQO), specifically predicted by PESQ. This research was initially presented in [14] and is more exhaustively analysed here.

• A similar and previously unpublished study on the performance of the more recent objective model, POLQA.

• Comprehensive correlation analysis of both objective and subjective results.

The remainder of this paper is structured as follows. Section 2 provides background information and sets the context for our research. Section 2.1 summarises both subjective and objective approaches to speech quality measurement. Section 2.2 summarises related research. Section 2.3 outlines our research motivation and related research questions. Section 3 outlines our simulator-based approach to generating the impaired speech samples used for both objective and subjective testing. It deals with simulator details, delay profiles generated, adaptive algorithms and settings, speech samples chosen, and also summarises our speech quality assessment procedures. Section 4 presents and discusses experimental results. Section 5 concludes the paper and suggests some areas for future research arising from this paper.

2. Background

This section sets the context for our research. It firstly summarises both objective and subjective approaches to speech quality measurement. It then briefly describes related research that has touched upon similar research questions. Finally, it describes our contribution by specifying our research motivation and related research questions.

2.1. Subjective and Objective Speech Quality Assessment

Speech quality is judged by human listeners and hence it is inherently subjective. Therefore, the most reliable approach for assessing speech quality is through subjective tests. The Absolute Category Rating (ACR) test, defined by ITU-T Recommendation P.800 [15], is one of the most widely accepted methods of listening speech quality assessment. In the test, listeners express their opinions on the quality of the speech material in terms of five categories: excellent, good, fair, poor and bad, with corresponding integer scores 5, 4, 3, 2 and 1, respectively. The ratings are averaged and the result is known as the Mean Opinion Score (MOS). Subjective testing is time-consuming, expensive and requires strict adherence to methodology to ensure applicability of results. As such, subjective testing is impractical for frequent testing such as routine network monitoring. An interested reader can find more details about subjective testing in [16]. Arising from such limitations, objective test methods have been developed in recent years. They are machine-executable and require little human involvement. In principle, objective methods can be classified into two categories: signal-based methods and parameter-based methods. The former require availability of speech signals to realize the quality prediction process and, as detailed in [17], can be further divided into two categories: intrusive or non-intrusive. Intrusive signal-based methods use two signals as the input to the measurement, namely a reference signal and a degraded signal, which is the output of the system under test. They identify the audible distortions based on the perceptual domain representation of the two signals, incorporating human auditory models. Several intrusive models have been developed over recent years, like Perceptual Speech Quality Measure (PSQM) [18], Measuring Normalizing Blocks (MNB) [19, 20], Perceptual Analysis Measurement System (PAMS) [21], Perceptual Evaluation of Speech Quality (PESQ) [22, 23] and Perceptual Objective Listening Quality Assessment (POLQA) [24, 25]. Among the models mentioned above, PSQM, PESQ and, most recently, POLQA have been standardised by the ITU-T as Recommendations P.861 [26], P.862 [27] and P.863 [28], respectively. Moreover, MNB is described in Appendix II of ITU-T Rec. P.861 in order to extend the scope of the recommendation. It should be noted here that ITU-T Rec. P.861 was withdrawn in 2001 and replaced by PESQ. In contrast to intrusive methods, the idea of the single-ended (non-intrusive) signal-based methods is to generate an artificial reference (i.e. an "ideal" undistorted signal) from the degraded speech signal. Once a reference is available, a signal comparison similar to PESQ/POLQA can then be performed. The result of this comparison can further be modified by a parametric degradation analysis and integrated into an assessment of overall quality. The most widely used non-intrusive models include Auditory Non-Intrusive QUality Estimation (ANIQUE) [29] and the internationally standardized P.563 [30, 31].
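The ACR averaging described at the start of this subsection can be sketched as follows. The category-to-score mapping is that of the five-point ACR scale; the 95% confidence interval uses a normal approximation (z = 1.96), which is an assumption made here for brevity, and the function name is hypothetical.

```python
import math

# Five-point ACR listening-quality scale.
ACR_SCALE = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def mos_with_ci(votes, z=1.96):
    """Return (MOS, half-width of the confidence interval) for a list of ACR votes.

    votes: category labels such as "good"; z: normal quantile (1.96 for ~95%).
    """
    scores = [ACR_SCALE[v] for v in votes]
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, z * math.sqrt(variance / n)
```

For instance, the votes excellent, good, good, fair, good average to a MOS of 4.0; with so few votes the interval is wide, which is one reason subjective tests need many listeners.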

Finally, parameter-based methods predict the speech quality through a computational model based on parameters rather than speech signals. The E-model is such a method, defined by ITU-T Recommendations G.107 [32] (narrowband version) and G.107.1 [33] (wideband version), and is primarily used for transmission planning purposes in narrowband and wideband telephony networks. This model includes a set of parameters characterising end-to-end voice transmission as its input, and the output (R-value) can then be transformed into MOS-Conversational Quality Estimated (MOS-CQE) values.
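For the narrowband E-model, the transformation from the R-value to an estimated MOS is a fixed cubic mapping given in ITU-T Rec. G.107. A minimal sketch is shown below; the function name is hypothetical, and the wideband version (G.107.1), which uses an extended R scale, is not covered.

```python
def r_to_mos(r):
    """Map a narrowband E-model rating R to an estimated MOS (G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    # Cubic mapping between the two saturation points.
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
```

With the default G.107 parameter values the model yields R = 93.2, which this mapping converts to a MOS of about 4.41.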

2.2. Related research

To date, comparative performance analysis of per-talkspurt playout strategies to cope with network jitter (such as [1, 2, 3]) has focused on metrics such as late loss rate and average delays, which are the indirect effects of such strategies, with little consideration given to either the extent or frequency of the silence period adjustments and the impact they might directly have on quality perceived by the end user. The frequency of such adjustments is set by the talkspurt/silence ratio and thus is very much dependent on inherent speech type but also on Voice Activity Detection (VAD) settings within VoIP applications. Such VAD settings are often user-configurable and can vary greatly across differing VoIP applications. For that reason, a speech segment identified as a silence period by one application may be treated as active speech by another. As such, for a given speech segment and network conditions, the performance of a specific adaptive strategy will be directly impacted by such settings, as described by [34]. The extent of adjustments is influenced, needless to say, by network conditions but also by the specific adaptive playout strategy. The qualifying phrase used by [10], that small adjustments are not noticeable, is of little practical value considering the variability in both frequency and extent of adjustments that can arise. Although no subjective listening testing, to our knowledge, has been done to precisely quantify this impact, some research dealing peripherally with the issue is summarised below.
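To make the notion of a per-talkspurt silence period adjustment concrete, the sketch below implements a simple autoregressive delay estimator in the spirit of the adaptive playout mechanisms of [1]: the playout delay is re-computed from running estimates of delay and delay variation, but is only applied at the start of each talkspurt, so any change is absorbed by the preceding silence period. The weighting factor alpha, the safety factor beta, the function names and all delay values are illustrative assumptions, not the settings used in this study.

```python
def playout_delays(network_delays_ms, talkspurt_starts, alpha=0.998, beta=4.0):
    """Per-packet playout delay, updated only at talkspurt boundaries.

    network_delays_ms: measured one-way delay of each packet (ms).
    talkspurt_starts: indices of the first packet of each talkspurt.
    """
    d_hat = float(network_delays_ms[0])  # running delay estimate
    v_hat = 0.0                          # running delay-variation estimate
    starts = set(talkspurt_starts)
    current = None
    playout = []
    for i, n in enumerate(network_delays_ms):
        d_hat = alpha * d_hat + (1.0 - alpha) * n
        v_hat = alpha * v_hat + (1.0 - alpha) * abs(d_hat - n)
        if current is None or i in starts:
            current = d_hat + beta * v_hat  # new delay takes effect here
        playout.append(current)
    return playout

def silence_adjustments(playout, talkspurt_starts):
    """Change in playout delay at each talkspurt boundary (ms).

    Positive values lengthen the preceding silence, negative values shorten it.
    """
    return [playout[s] - playout[s - 1] for s in sorted(talkspurt_starts) if s > 0]
```

Under constant network delay the adjustments are zero; a jump in delay between talkspurts produces a positive adjustment, i.e. an elongated silence period, whose size depends on both the network conditions and the estimator settings.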

Sun and Ifeachor in [7, 8] developed algorithms that seek to find optimum buffer parameters in a trade-off between delay and late packet loss. Moreover, the impact of jitter on speech quality using PESQ was investigated by Qiao et al. in [35], but this was done by black-box testing and thus it is unclear whether the precise impact of jitter is direct (silence period adjustments) or indirect (through late packet loss). In [36], an extension to the E-model was developed to include the indirect impact of jitter via late packet loss.

Hoene et al. in [5] used PESQ to investigate the impact of a single adjustment (0–1000 ms) in an 8 second sample (typical of delay spikes) during both silences and active speech on speech quality. They showed that PESQ predicts significant impacts during active speech but that adjustments of up to approx. 320 ms are not noticeable during silences. In other research by Hoene et al. [6], they validated, through subjective listening tests, the behaviour of PESQ in predicting the impact of a single adjustment during active speech, but these tests did not extend to similar analysis during silences.

The more extensive work of Voran [11] also deals somewhat peripherally with the issue and is summarised as follows. Voran evaluated, through subjective testing, the impact of temporal discontinuities and packet loss on listening speech quality. Similar to Hoene et al.'s work published in [6], discontinuities were applied to active speech segments only. A range of experiments were carried out to quantify the impact on MOS of such impairments. He introduced three impairments, termed loss, jump and pause, to speech, where loss refers to conventional packet loss and was compensated for through Packet Loss Concealment (PLC), jump refers to temporal contraction of speech by dropping packets (thus without any PLC) and pause refers to temporal elongation of speech through silence insertion, with PLC applied to the inserted silence. As such, the pause and jump impairments are of most interest as they involve temporal discontinuity and thus most closely reflect the type of impairment caused by per-talkspurt playout strategies. However, one key distinction between Voran's work and the operation of per-talkspurt strategies is that he applied all impairments (pause/loss/jump) to active speech segments. It is important to note that his pause impairment, which introduced a silence gap within active speech, was then compensated for through PLC, and his jump impairment essentially removed a segment of active speech as if it never existed. The impact of both magnitude and frequency of each of the three impairments was examined independently, as well as a combination of pause and jump. Impairments were added at random locations within G.723-encoded active speech. From [37], his main findings are summarised as follows:

• For a given frequency and magnitude of impairment, the impact of the four impairments (loss, pause, jump, pause and jump) on MOS scores was found to be roughly similar.

• As the magnitude of impairment increased, the reported MOS scores decreased at an almost linear rate. For example, at a frequency of one impairment per 100 frames, a 30/60/120 ms pause impairment resulted in the MOS score dropping by 0.21/0.41/1.15, respectively.

• As the frequency of impairment increased, the reported MOS scores decreased at a non-linear rate.
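The jump and pause impairments described above can be illustrated on a sequence of speech frames. The sketch below is not Voran's implementation: it assumes 30 ms frames (the frame length of G.723.1), represents frames as opaque list items, and does not model the PLC that was applied to the inserted silence. All names are hypothetical.

```python
FRAME_MS = 30  # assumed frame length (G.723.1 uses 30 ms frames)

def ms_to_frames(duration_ms):
    """Number of whole frames covering an impairment of the given duration."""
    return duration_ms // FRAME_MS

def apply_jump(frames, position, n_frames):
    """Temporal contraction: drop n_frames starting at position, with no concealment."""
    return frames[:position] + frames[position + n_frames:]

def apply_pause(frames, position, n_frames, silence_frame):
    """Temporal elongation: insert n_frames of silence before position."""
    return frames[:position] + [silence_frame] * n_frames + frames[position:]
```

Under this assumption the 30/60/120 ms pause impairments correspond to inserting 1, 2 and 4 frames, respectively; a per-talkspurt playout strategy differs in that the equivalent contraction or elongation falls within a silence period rather than within active speech.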

In addition to the above findings, he showed that for a very large and noticeable single adjustment (430 ms silence removal), PESQ failed to register any impact. This to some extent agrees with Hoene et al.'s analysis in that adjustments within silences of up to approx. 320 ms are ignored by PESQ.

As emphasised earlier, both Hoene et al.'s and Voran's detailed work introduced the impairments throughout active speech (talkspurts) only. As such, the results cannot be directly compared with per-talkspurt playout strategies, where the temporal adjustment impairments only occur during silences (i.e. the silence period is contracted or elongated). In particular, Voran's additional finding regarding the very noticeable 430 ms temporal adjustment (silence removal), coupled with both Hoene et al.'s and Voran's findings that PESQ ignores such large single adjustments during silences, strengthened the argument that both PESQ and POLQA need to be tested for the impact of frequent, though smaller, playout delay adjustments more typical of VoIP, and thus prompted us to undertake this research.

2.3. Research motivation

As outlined thus far, the literature to date has not quantified, either objectively or subjectively, the precise impact of multiple small silence period adjustments typical of VoIP applications on the quality perceived by the end user, expressed by MOS values. As described above, research by Voran and Hoene et al. makes some contribution in this area. They both outlined, firstly, that the PESQ scores were not impacted by very large and noticeable single adjustments during silences. Secondly, both showed that significant adjustments during active speech did impact on subjective results (MOS-LQS), though these are quite different to silence period adjustments.

Considering all this, our primary research motivation was to address the gap in the literature by assessing the impact of frequent and small silence period adjustments on listening quality perceived by the end user, both through objective and subjective tests. In particular, we identified a number of key research questions that we wished to answer, namely:

1. What impact do frequent and small silence period adjustments have on subjective listening MOS scores?

2. What impact do frequent and small silence period adjustments have on objective listening MOS scores, specifically those predicted by both PESQ and POLQA?

3. Can the PESQ and/or POLQA model correctly predict the impact of frequent and small silence period adjustments on listening quality perceived by the end user, as quantified by a subjective test? If so, how accurate are those predictions?

Further questions include:

4. What relationship exists between the magnitude of adjustments and objective and subjective listening MOS scores?

5. What impact does the position of adjustments within speech samples have on objective and subjective listening MOS scores?

3. Methodology

In this section, we describe the methodology used to generate the speech samples for both objective and subjective tests, and then provide details of the testing process. A custom-built Matlab-based simulator was developed and used to generate playout adjustments (as depicted in Figure 1). The overall methodology comprised a number of stages, as follows:

• Generate a series of network packet delays consistent with varying network conditions.

• Using these delays and simulated voice patterns (talkspurt distribution), applied to different playout algorithms, generate a series of playout adjustments.

• Apply this set of adjustments to different locations within reference speech samples.

Using this set of degraded and reference speech samples, we carried out both an ITU-T standardised subjective listening test and objective tests, the latter using both PESQ and POLQA. This facilitated a comparison between both approaches and between PESQ and POLQA. The process of generating adjustments is described in Section 3.1. The section starts with a description of the playout adjustment simulator and ends with the simulation work flow and outputs. The process of applying adjustments to speech is described in Section 3.2. Details related to the actual testing are given in Section 3.3.

3.1. Playout Adjustment Generation

In order to assess, both objectively and subjectively, the impact on listening quality perceived by the end user of silence period adjustments typical of VoIP applications, a detailed simulator was built to generate such adjustments. The overall objectives were to:

• Generate a sequence of VoIP packets V with a talkspurt distribution typical of real speech.

• Generate a range n of network delay sequences D, where each sequence represents different network conditions, i.e. nD delay values.

• For each of the n delay sequences D, generate a series of playout adjustments A that would result from applying this sequence to the VoIP packets V using typical playout algorithms.

Figure 1 depicts the playout adjustment simulator, which was implemented using Matlab. The simulator was built by one of the authors for previous research, as outlined in [38]. As can be seen in Figure 1, the simulator consists of three separate module blocks, namely:

• Voice Simulator Block
• Delay Simulator Block
• Playout Algorithm Simulator Block

Each block is described in the following subsections.

3.1.1. Voice simulator block

The simulated voice streams were based on live speech samples. The critical factor here is the distribution of talkspurts, which were extracted directly from voice tests into text files and used to reproduce speech characteristics. This was done by recording normal VoIP speech with VAD enabled and extracting the Marker bits within the RTP packet headers, where '1' indicates the start of a talkspurt and '0' represents an active speech packet within a talkspurt. An array V thus represents the distribution of talkspurt packets from normal speech, e.g. a sample subset of V = [1 0 0 0 0 0 0 1 0 0 0] represents 2 separate talkspurts of duration 7 packets and 4 packets, respectively.
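The reading of the marker-bit array V can be illustrated with a short sketch. This is not the authors' Matlab code; the function name `talkspurt_lengths` is hypothetical, and the sketch assumes that any packets preceding the first marker bit are simply ignored.

```python
def talkspurt_lengths(markers):
    """Return the length, in packets, of each talkspurt in a marker-bit array.

    A 1 marks the first packet of a talkspurt; each following 0 is a further
    packet of the same talkspurt. Packets before the first 1 are ignored
    (an assumption made for this sketch).
    """
    lengths = []
    for bit in markers:
        if bit == 1:
            lengths.append(1)      # a new talkspurt starts here
        elif lengths:
            lengths[-1] += 1       # extend the current talkspurt
    return lengths


# The sample subset of V quoted in the text.
V = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(talkspurt_lengths(V))  # [7, 4]
```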

3.1.2. Delay simulator block

Significant research has focused on modelling of Internet delay and loss characteristics [39, 40, 41, 42, 43, 44]. In [8], Sun and Ifeachor show that for VoIP a Weibull distribution models traffic better than exponential or Pareto. Our model is designed to model the temporal relationship or burstiness of delay traffic, which is commonly found and logically follows from the research that has proposed the use of bursty packet loss models. As such, we propose a series of 2-state Markov models to simulate varying network conditions. Figure 2 illustrates its application to delay modelling. The following summarises the most relevant characteristics of the models developed.

Figure 1. Playout adjustment simulator. (Block labels: VoIP Tests; Voice Simulator; Talkspurt Distribution etc.; Markov Delay Models; Packet/Delay Mapping; Playout Algorithm Simulator. Signals: talkspurt distribution V, delay distribution D, playout adjustment distribution A.)

Figure 2. 2-state Markov Delay Model.

• BAD/GOOD State Jitter Level: The delay models that were developed used different ranges of jitter to differentiate between GOOD and BAD states. Essentially, a GOOD state had a low jitter metric (set as a % of base delay), and a multiplier was applied to this metric to represent the BAD state.

• BAD State Probability: This represents the percentage of packets that are affected by high delay variance (jitter).

• Average BAD State Burst Length: This determines how the BAD state packets are distributed. Much of the literature on network analysis has reported that both loss and delay/jitter have strong temporal dependency or burstiness. Where strong temporal dependency of jitter/delay is present, this will result in clusters of BAD state packets, resulting in BAD delay/jitter bursts spanning more than one packet. Longer BAD bursts will be reflected in higher values for $P_{BB}$ from Figure 2.

• Using these values, we can derive values for all 4 probabilities:
  – $P_{BB}$: Probability that packet n+1 will have high jitter (BAD state), given that packet n has high jitter.
  – $P_{BG}$: Probability that packet n+1 will have low jitter, given that packet n has high jitter, i.e. switch states.
  – $P_{GG}$: Probability that packet n+1 will have low jitter (GOOD state), given that packet n has low jitter.
  – $P_{GB}$: Probability that packet n+1 will have high jitter, given that packet n has low jitter, i.e. switch states.
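The text does not spell out how the four probabilities are derived, so the following is only a sketch under stated assumptions: the BAD state probability is treated as the stationary probability of the BAD state, and BAD bursts are assumed geometrically distributed so that the average burst length equals 1/P_BG. The function names and dictionary keys are hypothetical.

```python
import random


def transition_probabilities(p_bad, mean_burst_packets):
    """Derive the four transition probabilities of a 2-state Markov model.

    Assumptions of this sketch (not stated in the paper):
      * p_bad is the long-run (stationary) fraction of BAD-state packets;
      * BAD bursts are geometric, so mean burst length = 1 / P_BG.
    """
    if not 0.0 < p_bad < 1.0 or mean_burst_packets < 1:
        raise ValueError("need 0 < p_bad < 1 and mean_burst_packets >= 1")
    p_bg = 1.0 / mean_burst_packets
    p_bb = 1.0 - p_bg
    # Stationary balance: p_bad * P_BG = (1 - p_bad) * P_GB
    p_gb = p_bad * p_bg / (1.0 - p_bad)
    if p_gb > 1.0:
        raise ValueError("burst length too short for this BAD probability")
    return {"P_BB": p_bb, "P_BG": p_bg, "P_GG": 1.0 - p_gb, "P_GB": p_gb}


def simulate_states(num_packets, probs, seed=0):
    """Generate a GOOD/BAD state label for each packet, starting in GOOD."""
    rng = random.Random(seed)
    state, states = "GOOD", []
    for _ in range(num_packets):
        states.append(state)
        stay = probs["P_BB"] if state == "BAD" else probs["P_GG"]
        if rng.random() >= stay:
            state = "GOOD" if state == "BAD" else "BAD"
    return states


probs = transition_probabilities(p_bad=0.2, mean_burst_packets=10)
states = simulate_states(4000, probs)
```

With a BAD state probability of 20% and an average burst of 10 packets, this gives P_BG = 0.1 and P_GB = 0.025 under the assumptions above.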

An additional important requirement of the delay block was to ensure that out-of-order packets would not arise. In reality, such events are largely due to route changes and occur infrequently, and thus it was important to reproduce this behaviour.


3.1.3. Playout algorithm simulator block

Two per-talkspurt adaptive strategies (namely algorithms 1 and 4 from [1], referred to here as algorithms 1 and 2, respectively) were simulated. Both algorithms utilise linear recursive filters in tracking network conditions, but differ in that algorithm 2 responds more quickly due to different parameters and also includes a spike mode that responds more rapidly to changing network conditions, although it must still wait for the next silence period to do so. Both algorithms adjust playout time accordingly at the start of a talkspurt, as given by

$$d_i = \alpha\, d_{i-1} + (1 - \alpha)\, n_i \qquad (1)$$

$$p_i = t_i + d_i + \beta\, v_i \qquad (2)$$

In the above, $i$ refers to packet $i$, $d_i$ is the estimated end-to-end delay, $\alpha$ is the filter gain, $n_i$ is the measured delay, $p_i$ is the playout time, $t_i$ is the send time, $v_i$ is the estimated variation in delay and $\beta$ is a multiplication factor. For example, in [1] the authors choose a $\beta$ value of 4 for both algorithms, whereas the history factor $\alpha$ was set to 0.875 for algorithm 2 versus 0.998 for algorithm 1. The choice of parameters $\alpha$, $\beta$ and the spike detection threshold (for algorithm 2) impacts greatly on the performance of these algorithms, and they are usually tuned to match precise network conditions, i.e. from stable to unstable. For this reason, we utilised a range of values, as described in the following section. As described earlier, adjusting on a per-talkspurt basis maintains the integrity of speech within talkspurts whilst altering the inter-talkspurt silence periods.

The delays from the Delay Block are mapped to the Voice talkspurt distribution series V and applied to the various playout algorithms. The delays are applied to each packet in turn and processed by the playout algorithm, and adjustments are made at the start of each talkspurt, indicated by a '1' in the V array. This then generates a series of playout adjustments A. As outlined in Section 2.2, the extent of adjustment is dependent not only on the network condition and playout algorithm, but also on the specific VAD settings of the VoIP application, as the latter will greatly impact on the talkspurt distribution for a given speech segment. In any event, the resulting required adjustment (silence) can be added or removed at this point before the next talkspurt is processed.
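How Equations (1) and (2) turn a delay sequence into per-talkspurt adjustments can be sketched as follows. This is an illustration only, not the authors' simulator: the update rule for the delay variation estimate is not given in the text, so an exponentially weighted mean absolute deviation is assumed here, spike mode is omitted, and the function name `playout_adjustments` is hypothetical.

```python
def playout_adjustments(markers, delays_ms, alpha=0.998, beta=4.0):
    """Return the playout adjustment (ms) made at the start of each talkspurt
    after the first one.

    markers   -- marker-bit array V (1 = first packet of a talkspurt)
    delays_ms -- measured network delay n_i for each packet
    """
    d = delays_ms[0]   # delay estimate d_i, seeded with the first measurement
    v = 0.0            # delay variation estimate v_i
    offset = None      # playout offset (p_i - t_i) of the current talkspurt
    adjustments = []
    for marker, n in zip(markers, delays_ms):
        d = alpha * d + (1 - alpha) * n            # Eq. (1)
        v = alpha * v + (1 - alpha) * abs(d - n)   # assumed variation update
        if marker == 1:
            new_offset = d + beta * v              # Eq. (2) minus send time t_i
            if offset is not None:
                # Positive: silence is lengthened; negative: silence is shortened.
                adjustments.append(new_offset - offset)
            offset = new_offset
    return adjustments


# A delay jump on the first packet of the second talkspurt lengthens the silence.
print(playout_adjustments([1, 0, 0, 1], [50, 50, 50, 90], alpha=0.875))  # [22.5]
```

Because the offset is only recomputed where the marker bit is 1, packets within a talkspurt keep a fixed playout offset, which mirrors the per-talkspurt behaviour described above.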

3.1.4. Simulation work flow and outputs

The overall simulator works as follows. The user first specifies a talkspurt distribution file, which is extracted from live speech and loaded as an array V. The user then specifies the network conditions for the test. The delay block returns a sequence of network delays D corresponding to those conditions. The input parameters are:

1. Number of packets
2. Packet interval (ms)
3. Base delay (ms)
4. BAD state burst length (ms)
5. BAD state probability (%)
6. GOOD state jitter (% of base delay)
7. BAD state jitter multiplier

For our testing, parameters 1–4 were kept constant, while parameters 5–7 were varied as outlined below to give different network characteristics ranging from stable to unstable:

1. Number of packets = 4000
2. Packet interval = 20 ms
3. Base delay = 50 ms
4. BAD state burst length = 10 packets (200 ms)
5. BAD state probability = 20%, 40%, 60%, 80%
6. GOOD state jitter = 25%, 50%
7. BAD state jitter multiplier = 2, 3, 4
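As a quick check of the parameter space, the combinations of the three varied parameters can be enumerated. The sketch below uses hypothetical key names; only the numeric values are taken from the list above. The resulting count matches the 24 network delay models the paper reports using.

```python
from itertools import product

# Parameters 1-4, held constant in the tests.
FIXED = {
    "num_packets": 4000,
    "packet_interval_ms": 20,
    "base_delay_ms": 50,
    "bad_burst_packets": 10,
}

# Parameters 5-7, varied to span stable to unstable networks.
BAD_STATE_PROBABILITY_PCT = [20, 40, 60, 80]
GOOD_STATE_JITTER_PCT = [25, 50]
BAD_STATE_JITTER_MULTIPLIER = [2, 3, 4]


def network_conditions():
    """Return one parameter dictionary per combination of the varied values."""
    return [
        dict(FIXED, bad_prob_pct=p, good_jitter_pct=j, bad_jitter_multiplier=m)
        for p, j, m in product(
            BAD_STATE_PROBABILITY_PCT,
            GOOD_STATE_JITTER_PCT,
            BAD_STATE_JITTER_MULTIPLIER,
        )
    ]


print(len(network_conditions()))  # 24
```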

Note that in arriving at these parameters, particularly those relating to jitter, a detailed series of delay and jitter tests were undertaken, measuring delay/jitter between Ireland and the US/mainland Europe. Further details can be found in [38]. More recent testing by [45, 46] has highlighted the particular problem of very high jitter/delay in congested IEEE 802.11 networks. The final step was to map the delays to packets and apply them to adaptive jitter buffering (AJB) algorithms 1 and 2. This generates a series of playout adjustments A for every network test condition and playout algorithm. Figure 1 summarises the overall flow within the simulator.

In [47], details of the comprehensive tests using the simulator are presented. In summary, 24 different network delay models were used, generating 24 network delay sequences D. These delay values were fed to 2 different playout algorithms, as described in Section 3.1.3. For each playout algorithm, tests were repeated using different parameters such as α (history weighting, varied from 0.8 to 0.998), β (jitter multiplier, varied from 4 to 6) and spike mode threshold (algorithm 2 only); see again Section 3.1.3 for details. Each combination resulted in a distinct set of playout adjustments A for each test scenario. Note that the voice samples V were based on 80 seconds of active speech with 40 talkspurts (Marker bit = 1), whereas the speech samples chosen for this experiment were 8 seconds long; thus this also had to be factored in. Essentially, a pro-rata approach was taken: for the 80 seconds of speech used for the tests there were 40 playout adjustments, so for our 8-second ITU-T speech samples we implemented 4 adjustments. From the full range of test combinations described above, which numbered 96, and the resulting adjustments, a subset of 12 sets of playout adjustments containing 4 adjustments each was taken to represent a spectrum of network conditions, ranging from a stable network to an unstable network. Table I shows the actual playout adjustments selected that were applied to the speech samples.

3.2. Speech samples

As is normal for quality testing, 4 reference speech samples were used. The English subset of the ITU-T P. Supplement 23 [48] database was used for speech material, each sample consisting of a pair of utterances with a small pause between the utterances. Two male and two female speakers uttering different sentences were included in the stimuli. The speech samples used (source samples available in the database) were 8 seconds in length and stored in 16-bit, 8000 Hz linear PCM.

Figure 3. Demonstration of how playout adjustments were applied to 1st Female speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 4. Demonstration of how playout adjustments were applied to 1st Female speech sample - Variant B. Arrows indicate where adjustments were placed.

Figure 5. Demonstration of how playout adjustments were applied to 2nd Female speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 6. Demonstration of how playout adjustments were applied to 2nd Female speech sample - Variant B. Arrows indicate where adjustments were placed.

Each speech sample was modified by inserting and removing silence periods to reflect the adjustments as specified above. The adjustments were a mix of positive and negative adjustments (adding and removing silence periods), as shown in Table I.

As a further experimental variable, the set of four adjustments was applied to each sample in two different locations (referred to hereafter as variant A and B). The only distinction between variant A and B is that the impairments in variant B were applied in the latter part of each sample. All the adjustments were made using a free sound editor. Figures 3–10 illustrate how the 4 playout adjustments were applied to all speech samples involved in the experiment in 2 different places (Variant A and B).
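The authors made the adjustments by hand in a sound editor; the following sketch only illustrates the same operation programmatically, under several assumptions: samples are held as a plain list of 8000 Hz PCM values, the given positions are sample indices that already lie inside silence periods, and inserted silence is digital zero (real silence would normally contain background noise). The function name `apply_adjustments` is hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # narrowband PCM, as used for the speech samples


def apply_adjustments(samples, positions, adjustments_ms, rate=SAMPLE_RATE_HZ):
    """Lengthen or shorten silence at the given sample positions.

    A positive adjustment inserts that many milliseconds of zero-valued
    samples at the position; a negative one removes that many milliseconds
    starting at the position.
    """
    out = list(samples)
    # Work from the last position backwards so earlier indices stay valid.
    for pos, adj_ms in sorted(zip(positions, adjustments_ms), reverse=True):
        count = round(abs(adj_ms) * rate / 1000)
        if adj_ms >= 0:
            out[pos:pos] = [0] * count
        else:
            del out[pos:pos + count]
    return out


# Test condition No. 10 from Table I: the four adjustments sum to zero,
# so the overall sample length is unchanged.
silence = [0] * 16000
shifted = apply_adjustments(silence, [1000, 5000, 9000, 13000], [5, 10, -30, 15])
print(len(shifted))  # 16000
```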

Overall, this process created 96 speech samples (4 voices × 12 test conditions × 2 variants).

3.3. Speech quality assessment

The speech quality assessment process was divided into two parts, namely subjective assessment (listening test) and objective assessment (using the PESQ and POLQA models). Both assessment procedures are described in more detail below.

The ACR subjective listening test was performed in May 2012 in accordance with ITU-T Recommendation P.800 [15]. In every case, up to 2 listeners were seated in a small, acoustically treated listening room with a background noise well below 20 dB SPL (A). Altogether, 30 naïve (non-expert) listeners (16 male, 14 female, 20–55 years, mean 34.43 years) participated in the test. All subjects were Irish nationals whose first language was English. The subjects were remunerated for their efforts. The samples (96 degraded samples + 4 reference samples) were played out using high quality studio equipment in a random order and diotically presented over Sennheiser HD 455 headphones (presentation level 73 dB SPL (A)) to the test subjects. The opinion scores, from 1 (bad) to 5 (excellent), were averaged to obtain MOS-Listening Quality Subjective narrowband (MOS-LQSn) values for each sample.

Figure 7. Demonstration of how playout adjustments were applied to 1st Male speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 8. Demonstration of how playout adjustments were applied to 1st Male speech sample - Variant B. Arrows indicate where adjustments were placed.

Figure 9. Demonstration of how playout adjustments were applied to 2nd Male speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 10. Demonstration of how playout adjustments were applied to 2nd Male speech sample - Variant B. Arrows indicate where adjustments were placed.

In the next step, the 100 samples (the 96 degraded samples using test conditions No. 1–12 and the 4 reference samples (Ref. test condition)) were compared to their respective reference samples using both the PESQ model described in ITU-T Rec. P.862 [22, 23, 27] and the POLQA model described in ITU-T Rec. P.863 [24, 25, 28] in order to obtain objective listening quality scores. In the case of PESQ, the output (raw PESQ scores) was converted to MOS-Listening Quality Objective narrowband (MOS-LQOn) values by the equation defined in [49].
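The conversion equation itself is not reproduced in the text. The sketch below assumes that [49] refers to the narrowband mapping function of ITU-T Rec. P.862.1, which is the usual way of converting raw PESQ scores to MOS-LQO; if [49] defines a different mapping, the coefficients would differ. The function name is hypothetical.

```python
import math


def pesq_to_mos_lqo(raw_pesq):
    """Map a raw P.862 PESQ score to MOS-LQO (narrowband).

    Logistic mapping as given in ITU-T Rec. P.862.1; assumed here to be
    the equation the paper cites as [49].
    """
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_pesq + 4.6607))


# Raw PESQ scores range from -0.5 to 4.5.
for raw in (-0.5, 2.0, 3.0, 4.5):
    print(raw, round(pesq_to_mos_lqo(raw), 2))
```

Under this mapping, the maximum raw score of 4.5 corresponds to approximately 4.55 MOS-LQO.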

4. Experimental results

In this section, we present experimental results for both subjective and objective assessment (PESQ and POLQA models), as well as a detailed analysis and comparison of both.


4.1. Experimental results for subjective assessment

In Figure 11, we summarise the results of the subjective listening test, averaged over the 4 different voices involved in the experiment, as described in Section 3. It can be seen that the impact of test conditions No. 1–12 (playout adjustments) on average MOS-LQSn scores relative to the reference samples is quite limited. On the other hand, we can also see that the subjects gave surprisingly low MOS-LQS scores to all samples, including the reference samples involved in the test, i.e. the average values oscillate around 3.6 MOS. This value is quite low considering that the samples contain either no degradations (reference samples) or very moderate degradations in a narrowband context. This result warrants further analysis beyond the scope of this paper, but one possibility is that the subjects' opinion has been affected by their previous long-term experience with wideband telephony (wideband speech), though this was not validated. Such experience would alter their internal reference to wideband speech (extended frequency range resulting in higher speech quality) and thus explain the lower scores given to the narrowband samples involved in the test. One other possibility relates to the fact that the range of conditions (impairments) introduced into the speech samples was quite limited. This issue is discussed in more detail in [50]. As is evident, the impact of varying the extent of playout adjustments across all test conditions was very small (insignificant). However, one characteristic of note that emerged is the small impact of the location of the adjustments on scores, i.e. most of the test conditions using variant B obtained slightly lower scores than the same conditions with variant A. The biggest difference (0.2 MOS) between both investigated variants (location impact) was observed for test condition No. 10. As discussed and evident from Figures 3–10, variant B adjustments were designed to be towards the end of the sample, so one explanation is that the distortions presented closer to the end of the sample were a bit more annoying for the subjects than those presented in the first half of the sample. In principle, this result has echoes of the so-called recency effect reported in the literature (e.g. [51, 52, 53]). However, such tests used samples longer than 60 seconds. In any event, and as discussed later, the differences are not significant in the context of the confidence interval.

A three-way analysis of variance (ANOVA) test was conducted on the subjective results using test condition, voice and variant (location) as fixed factors (Table V). It should be noted here that the voice factor is a combination of voice and content. The highest F-ratio was determined for the voice factor (F = 73.12, p < 0.001); the effect of voice was found to be highly statistically significant. The test condition factor appeared to have a weaker effect on quality than the voice factor (F = 0.82, p = 0.635), and its effect was not statistically significant. The last factor investigated in the ANOVA test was the variant factor, which also turned out to have a weaker effect on quality than the voice factor and, similar to the test condition factor, not to be statistically significant (F = 1.07, p = 0.3032). Regarding interactions of all the involved factors, the results show that none of them is statistically significant.

Table I. Playout adjustments (in ms) applied to speech samples. Σ: Absolute sum of adjustments.

Test condition   1st   2nd   3rd   4th    Σ
Ref                0     0     0     0    0
1                  2    −2     3    −3   10
2                  4    −4    −4     4   16
3                  3    −3    −6     6   18
4                  5    −5    −5     5   20
5                  3    −6    −7    10   26
6                 16   −12    −8     4   40
7                 10   −17    −6    13   46
8                 10   −15   −10    15   50
9                  8   −23    −3    18   52
10                 5    10   −30    15   60
11               −15    15   −15    15   60
12               −25    22    −8    11   66

Figure 11. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQSn. The vertical bars show 95% CI computed over 120 MOS-LQSn values (30 subjective scores per sample for 4 samples). (Plot: x-axis Test Conditions, Ref and No. 1 to No. 12; y-axis MOS-LQSn, 1 to 5; series Variant A and Variant B.)
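The MOS-LQSn averages and 95% confidence intervals reported for the subjective test can be computed along the following lines. The paper does not state which interval formula was used, so this sketch assumes the common normal approximation (z = 1.96) on the sample standard deviation; the function name `mos_with_ci` and the example scores are hypothetical.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (MOS, half-width of the 95% CI) for a list of opinion scores.

    Assumes a normal approximation: half-width = z * s / sqrt(n), with s
    the sample standard deviation. For the 120 scores per test condition
    used in the paper, the difference from a t-based interval is small.
    """
    n = len(scores)
    mos = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(n)
    return mos, half_width


# Illustrative scores only, not data from the listening test.
mos, ci = mos_with_ci([3, 4, 4, 5])
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```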

To summarise, the results of the ANOVA test revealed that subjects were more sensitive to the voice than to the test conditions and variants, and no statistically significant interactions between the investigated factors were found (assuming no impact of content due to carefully chosen speech samples). It should be noted here that variability caused by content is considered one of the sampling factors, as defined in the Handbook of subjective testing practical procedures [16]. The ANOVA test also revealed that the small differences between quality scores of variant A and B reported previously are not statistically significant. In other words, the results of the ANOVA test confirmed our assertion that the differences are not significant in the context of the confidence interval.

As is often reported in the literature, some impact relating to the voice effect was expected in this experiment, but not to such an extent. A diagnostic analysis of the test data revealed that one of the voices (1st male) was liked more than the others (i.e. over all conditions, this voice was rated on average approx. 0.4 MOS-LQSn higher than the second male voice and approx. 0.6 MOS-LQSn higher than the female voices). It is also worth noting that both male voices obtained higher scores than the female voices.

As can be seen above, the impact of playout adjustments is not statistically significant. This fact raises the question as to whether the impact would be higher if the Degradation Category Rating (DCR) testing approach was deployed. We considered this during the research design phase. We carried out limited DCR subjective testing (4 experts involved), and the subjects noted no degradation caused by time-shifting (playout adjustments). On this basis, we came to the conclusion that the introduced impairments would not be noticed by subjects in a DCR test. For that reason, we decided to use an ACR test.

Furthermore, it should be noted that in telephony, subjects have no access to the reference speech from their conversational partner, and thus ACR testing is commonly used in telephony speech quality assessment.

To the best of our knowledge, the results presented in this section are a first proof of the assertion published in the literature [1, 2, 10] that small and frequent silence period adjustments typical of VoIP playout algorithms do not have a noticeable effect on listening quality perceived by the end user.

It is interesting at this juncture to compare our results with the subjective results of Voran [11] and Hoene et al. [6]. Voran introduced pause/jump impairments at the rate of 1 to 4 per 3-second sentence, with pause/jump magnitudes of 30, 60 and 120 ms. We introduced 4 impairments in an 8-second sample, based on our observations of actual speech and using adjustments which were derived from realistic network delay models and real playout algorithms. As a result, our adjustments were typically much smaller (the largest was 30 ms). Voran noted that for 1 pause/jump adjustment of 30 ms, MOS scores fell by 0.2, whereas we found no significant drop in MOS scores as the extent of adjustments increased, even for conditions 10–12, where the magnitude of some of the adjustments was similar to Voran's at 20–30 ms. One key distinction, as stated before, is that Voran applied such impairments to active speech, whereas all our adjustments were made to silence periods. Furthermore, in the case of Voran's pause impairment, PLC was used. Finally, Voran reported a significant impact of a very large single adjustment (silence removal) of 430 ms. The magnitude of adjustments introduced in our samples was much smaller and, as reported above, no significant subjective impact was found. Hoene et al. in [6] introduced very large (in comparison to adjustments introduced by jitter buffers) single adjustments into active speech and also noted a significant impact, consistent with PESQ.

4.2. Experimental results for objective assessment

Figure 12 depicts the results of the objective assessment done by PESQ (MOS-LQOn (PESQ)) using the same test conditions and speech samples. We can observe that the severity of the test conditions (playout adjustments) has a relatively big impact on the predicted MOS values. In summary, the MOS scores decrease as adjustments increase, i.e. as network instability increases. This is interesting in the context of the findings by Voran [11] that PESQ does not register any impact arising from a single 430 ms silence period adjustment (silence removal). Hoene et al.'s results from the PESQ analysis published in [5] are somewhat similar to Voran's, showing no impact for single adjustments of up to approx. 320 ms. One possible explanation is that in our tests we introduced frequent and small adjustments, rather than the single and large adjustments introduced by Voran/Hoene et al. We speculate that PESQ had difficulties with our adjustment profile during the time-alignment process. As detailed in Section 2.2, the frequency and extent of adjustments is impacted greatly by VAD settings, but our test design of 4 adjustments in 8 seconds is not atypical of VoIP.

One positive correlation between PESQ and the subjective test results is that the biggest difference between variant A and B (0.47 MOS) was reported for test condition No. 10. It should also be noted that there is a significant difference between the scores obtained for conditions No. 10 and 11. This is interesting in that, whilst the absolute sum of the adjustments for both conditions is the same, the individual adjustments are different. Test condition No. 10 represents very different adjustments, varying from 5 to −30 ms. On the other hand, condition No. 11 contains adjustments of similar magnitude: two adjustments of −15 ms and two adjustments of 15 ms. The latter (condition No. 11) obtained much lower scores in both cases (Variant A and B). As such, it seems that PESQ is better able to cope with large adjustments (30 ms was the largest of all conditions). In contrast, the opposite results were obtained for the subjective data, i.e. listeners scored test condition No. 10 lower than No. 11. However, the differences between the subjective results obtained for test conditions No. 10 and 11 are much smaller than those obtained for the objective results and of the same order as the confidence interval (see Figure 11).

Figure 12. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by PESQ. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). (Plot: x-axis Test Conditions, Ref and No. 1 to No. 12; y-axis MOS-LQOn (PESQ), 1 to 5; series Variant A and Variant B.)

As with the subjective results, lower MOS-LQOn (PESQ) scores were reported for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by PESQ ranged from 3.665 to 4.55 MOS for variant A and from 3.3 to 4.55 MOS for variant B. It can be clearly seen in Figure 12 that the impact of adjustment location is noticeable for objective scores predicted by PESQ, especially for higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by PESQ revealed that the voice impact was much weaker than that reported above for the subjective data. In fact, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. Because of this, such models are sometimes called impairment or degradation models. However, it seems that there is some interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, because the deviation of the MOS-LQOn (PESQ) scores between the test conditions is different for all 4 voices involved in the test. A much weaker and not statistically significant interaction of test condition and voice was obtained for the subjective data, as shown in Table V.

Figure 13 shows the results of the objective assessment done by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of PESQ predictions. Nonetheless, it seems that POLQA was impacted more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No. 7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend of POLQA scores, as has been done for PESQ above.

Regarding the biggest difference between variant A and B for test condition No. 10, reported above for both subjective scores and objective scores predicted by PESQ, it is worth noting that this effect has not been captured by POLQA at all. In other words, the scores predicted by POLQA for variant A and B of test condition No. 10 were very similar (0.02 MOS difference).

Moreover, the scores predicted by POLQA largely do exhibit the same behaviour as the scores obtained from the subjective test from a location perspective (Variant A and B). In other words, it has been reported above that PESQ and the subjects involved in the subjective test provided lower MOS scores for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by POLQA ranged from 4.10 to 4.43 MOS for variant A and from 4.08 to 4.43 MOS for variant B. This contrasts with the results for PESQ, where adjustment location had a significant impact. It can be clearly seen in Figure 13 that the adjustment location plays a less important role here, except for some of the higher magnitudes of investigated playout adjustments.

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). (Plot: x-axis Test Conditions, Ref and No. 1 to No. 12; y-axis MOS-LQOn (POLQA), 1 to 5; series Variant A and Variant B.)

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact has been much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on special characteristics of voice. Regarding the interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, this effect has also been obtained for the POLQA model, but only for female voices.

4.3. Comparison between subjective and objective quality scores

In the following subsection, subjective MOS values (MOS-LQSn) are compared to the predictions provided by both PESQ and POLQA (MOS-LQOn (PESQ, POLQA)). The comparison is performed for all experimental conditions, i.e. all combinations of voice, test conditions and both investigated location variants. However, the MOS-LQSn values will have been influenced by the choice of conditions in the actual experiment. In order to account for such influences, model predictions are commonly transformed to the range of conditions that are part of the respective test [54]. This may be done, for example, by using a monotonic 3rd-order mapping function, presuming such a function can be found.

The performance of the PESQ and POLQA models is quantified in terms of the Pearson correlation coefficient R, the respective root mean square error (rmse) and the epsilon-insensitive root mean square error (rmse*), as [55, 56]

$$R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \qquad (3)$$

and

$$\mathrm{rmse} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} (X_i - Y_i)^2} \qquad (4)$$

with $X_i$ the subjective MOS value for stimulus $i$, $Y_i$ the objective (predicted) MOS value for stimulus $i$, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, $N$ the number of stimuli considered in the comparison, and $d$ the number of degrees of freedom provided by the mapping function ($d = 4$ in the case of a 3rd-order mapping function, $d = 1$ in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

$$\mathrm{Perror}_i = \max\left(0,\ |X_i - Y_i| - \mathrm{ci}_{95,i}\right) \qquad (5)$$

where $\mathrm{ci}_{95,i}$ represents the 95% confidence interval and is defined by [56]

$$\mathrm{ci}_{95,i} = t(0.05, M)\,\frac{\delta_i}{\sqrt{M}} \qquad (6)$$

where $M$ denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of subjective scores for stimulus $i$. The final epsilon-insensitive root mean square error is calculated as usual, but based on Perror from formula (5):

$$\mathrm{rmse}^{*} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} \mathrm{Perror}_i^{2}} \qquad (7)$$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* considers only differences related to an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of the MOS is taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and both rmse and rmse* would be 0.0.
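As an illustration only, the measures defined in (3)–(7) can be sketched in plain Python. The function names, the default d = 1 and the choice to pass the Student t quantile t(0.05, M) in as an argument (the Python standard library has no t-distribution) are assumptions made for this sketch and are not part of the evaluation procedures in [55, 56].

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between subjective x and objective y, Eq. (3)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    return num / den


def rmse(x, y, d=1):
    """Root mean square error, Eq. (4); d = 1 means no regression was applied."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / (len(x) - d))


def ci95(std_dev, m, t_value):
    """95% confidence interval of one stimulus, Eq. (6).

    t_value stands for t(0.05, M) and must be supplied by the caller (assumption).
    """
    return t_value * std_dev / math.sqrt(m)


def rmse_star(x, y, ci, d=1):
    """Epsilon-insensitive rmse, Eqs. (5) and (7); ci holds one ci95 value per stimulus."""
    perror = [max(0.0, abs(a - b) - c) for a, b, c in zip(x, y, ci)]
    return math.sqrt(sum(p * p for p in perror) / (len(x) - d))
```

Differences that fall inside the confidence interval of the subjective score contribute nothing to rmse*, which is why rmse* is never larger than rmse for the same data and the same d.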

All R, rmse and rmse* values are calculated for the raw (non-regressed) MOSn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined), and both (the regressed and the non-regressed MOSn predictions) are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*) are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (variants A and B). The correlation coefficient is positive, though very low (R = 0.17), for variant A and negative for variant B (R = −0.15). As normal, positive correlation indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases, the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed) per sample. [Plot not reproduced. Axes: MOS-LQSn and MOS-LQOn (PESQ), each from 1 to 5; series: Variant A, Variant B.]

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed) per sample. [Plot not reproduced. Axes: MOS-LQSn and MOS-LQOn (POLQA), each from 1 to 5; series: Variant A, Variant B.]

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed) per sample. [Plot not reproduced. Axes: MOS-LQSn and MOS-LQOn (POLQA), each from 1 to 5; series: Variant A, Variant B.]

Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant   R       rmse    rmse*
A         0.17    0.639   0.367
B         -0.15   0.690   0.424

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant   R       rmse    rmse*
A         0.30    0.783   0.448
B         0.47    0.806   0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant   R       rmse    rmse*
A         0.30    0.265   0.243
B         0.47    0.270   0.217

The 3rd-order regression, as recommended in [54], leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outlier influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried 2nd- and 1st-order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this data set.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little difference between location variants intra-speaker, significant inter-speaker variability and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12 and significant inter-speaker variation.

Regarding the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.

Figure 17. Dominant Experimental Factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals. [Plot not reproduced. Panels: Subjective results (MOS-LQS), PESQ (MOS-LQOn), POLQA (MOS-LQOn); categories: Speaker/Variant M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B; score axis from 3 to 5.]

Moving to POLQA, the results show little intra-speaker variation across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except for Female 1) and much less inter-speaker variation. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in both Tables III and IV, and the rmse data for POLQA after regression (1st-order polynomial regression applied) is also much better than for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.
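The search for a monotonically increasing polynomial mapping function, as attempted above for both models, can be sketched as follows. This is a minimal, standard-library-only illustration under stated assumptions: `polyfit`, `polyval` and `is_monotonically_increasing` are hypothetical helper names, the fit is an ordinary least-squares polynomial solved through the normal equations, and monotonicity is checked numerically on a grid over the score range rather than analytically. It is not the regression procedure of [54].

```python
def polyfit(x, y, order):
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1]*x + c[2]*x**2 + ..."""
    n = order + 1
    # Normal equations A c = b, with A[j][k] = sum(x^(j+k)) and b[j] = sum(y * x^j).
    a = [[sum(xi ** (j + k) for xi in x) for k in range(n)] for j in range(n)]
    b = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(n)]
    # Gaussian elimination with partial pivoting (assumes a non-singular system).
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the polynomial with coefficients in ascending order at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def is_monotonically_increasing(coeffs, lo=1.0, hi=5.0, steps=200):
    """True if the first derivative is non-negative on a grid over [lo, hi]."""
    deriv = [k * c for k, c in enumerate(coeffs)][1:]
    return all(
        polyval(deriv, lo + (hi - lo) * i / steps) >= 0 for i in range(steps + 1)
    )
```

In such a scheme the objective scores would be the `x` values and the subjective scores the `y` values, and the order would be lowered from 3 to 2 to 1 until the check passes. A 1st-order fit with a negative slope, as reported for PESQ, fails the check at every order.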

Table V. Summary of ANOVA test conducted on the MOS-LQSn's.

Effect                SS        df     MS        F       p
Test condition (TC)   9.23      12     0.7693    0.82    0.6350
Voice                 207.02    3      69.0073   73.12   0.0000
Variant               1.01      1      1.0051    1.07    0.3022
TC*Voice              18.21     36     0.5059    0.54    0.9895
TC*Variant            4.68      12     0.3899    0.41    0.9593
Voice*Variant         0.68      3      0.2282    0.24    0.8672
Error                 2880.37   3052   0.9438
Total                 3121.2    3119
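As a quick consistency check on Table V, each mean square is the sum of squares divided by its degrees of freedom, and each F value is the effect mean square divided by the error mean square. The sketch below recomputes F from the SS and df columns; the dictionary layout and the function name `f_ratios` are assumptions for illustration, and small differences from the printed values are expected because the tabulated SS entries are rounded.

```python
# (SS, df) pairs as read from Table V.
effects = {
    "Test condition (TC)": (9.23, 12),
    "Voice": (207.02, 3),
    "Variant": (1.01, 1),
    "TC*Voice": (18.21, 36),
    "TC*Variant": (4.68, 12),
    "Voice*Variant": (0.68, 3),
}
error_ss, error_df = 2880.37, 3052


def f_ratios(effects, error_ss, error_df):
    """Return F = MS_effect / MS_error for every effect."""
    ms_error = error_ss / error_df
    return {name: (ss / df) / ms_error for name, (ss, df) in effects.items()}


f_values = f_ratios(effects, error_ss, error_df)
for name, f in f_values.items():
    print(f"{name}: F = {f:.2f}")
```

Only the voice effect yields a large F value, which matches the p values listed in the table.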

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12) and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may also have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ, POLQA)). Moreover, the accuracy of both the PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much smaller. Note that both Voran's and Hoene et al.'s research showed that single adjustments (430 ms in the case of Voran, approximately 0–320 ms in the case of Hoene et al.) were found to be disregarded by PESQ.

Regarding question 3 and PESQ, a comparison of the subjective assessments and predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although Hoene et al.'s research [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either, during its design and integration phase, for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and the impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that whilst some relationship exists, it is both much less noticeable and less characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to subjective scores (insignificant), except for Female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research, and which are worthy of further investigation, include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though not statistically significantly so. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results, but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google, Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002, Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: Psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] ITU-T Del. Contr.: Study of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau), International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D.123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v1.0. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.


impractical for frequent testing, such as routine network monitoring. An interested reader can find more details about subjective testing in [16]. Arising from such limitations, objective test methods have been developed in recent years. They are machine-executable and require little human involvement. In principle, objective methods can be classified into two categories: signal-based methods and parameter-based methods. The former require the availability of speech signals to realize the quality prediction process and, as detailed in [17], can be further divided into two categories: intrusive or non-intrusive. Intrusive signal-based methods use two signals as the input to the measurement, namely a reference signal and a degraded signal, which is the output of the system under test. They identify the audible distortions based on the perceptual domain representation of the two signals, incorporating human auditory models. Several intrusive models have been developed over recent years, like Perceptual Speech Quality Measure (PSQM) [18], Measuring Normalizing Blocks (MNB) [19, 20], Perceptual Analysis Measurement System (PAMS) [21], Perceptual Evaluation of Speech Quality (PESQ) [22, 23] and Perceptual Objective Listening Quality Assessment (POLQA) [24, 25]. Among the models mentioned above, PSQM, PESQ and, most recently, POLQA have been standardised by the ITU-T as Recommendations P.861 [26], P.862 [27] and P.863 [28], respectively. Moreover, MNB is described in Appendix II of ITU-T Rec. P.861 in order to extend the scope of the recommendation. It should be noted here that ITU-T Rec. P.861 was withdrawn in 2001 and replaced by PESQ. In contrast to intrusive methods, the idea of the single-ended (non-intrusive) signal-based methods is to generate an artificial reference (i.e. an "ideal" undistorted signal) from the degraded speech signal. Once a reference is available, a signal comparison similar to PESQ/POLQA can then be performed. The result of this comparison can further be modified by a parametric degradation analysis and integrated into an assessment of overall quality. The most widely used non-intrusive models include Auditory Non-Intrusive QUality Estimation (ANIQUE) [29] and the internationally standardized P.563 [30, 31].

Finally, parameter-based methods predict the speech quality through a computation model based on parameters rather than speech signals. The E-model is such a method, defined by ITU-T Recommendations G.107 [32] (narrowband version) and G.107.1 [33] (wideband version), and is primarily used for transmission planning purposes in narrowband and wideband telephony networks. This model takes a set of parameters characterising end-to-end voice transmission as its input, and the output (R-value) can then be transformed into MOS-Conversational Quality Estimated (MOS-CQE) values.
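For illustration, the transformation from the R-value to an estimated MOS can be sketched with the conversion formula given in ITU-T Rec. G.107 (narrowband E-model). The function name `r_to_mos` is an assumption for this sketch, and the formula is quoted from the Recommendation rather than from this paper; the wideband model in G.107.1 uses an extended R scale and is not covered here.

```python
def r_to_mos(r):
    """Convert a narrowband E-model rating R into an estimated MOS-CQE value."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    # Cubic conversion curve from ITU-T Rec. G.107.
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


for rating in (50, 70, 80, 90):
    print(rating, round(r_to_mos(rating), 2))
```

The mapping is nonlinear: a change of ten R points near the top of the scale moves the MOS less than the same change in the middle of the scale.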

2.2. Related research

To date, comparative performance analyses of per-talkspurt playout strategies to cope with network jitter (such as [1, 2, 3]) have focused on metrics such as late loss rate and average delays, which are the indirect effects of such strategies, with little consideration given to either the extent or frequency of the silence period adjustments and the impact they might directly have on quality perceived by the end user. The frequency of such adjustments is set by the talkspurt/silence ratio and thus is very much dependent on inherent speech type but also on Voice Activity Detection (VAD) settings within VoIP applications. Such VAD settings are often user-configurable and can vary greatly across differing VoIP applications. For that reason, a speech segment identified as a silence period by one application may be listed as active speech by another. As such, for a given speech segment and network conditions, the performance of a specific adaptive strategy will be directly impacted by such settings, as described by [34]. The extent of adjustments is influenced, needless to say, by network conditions but also by the specific adaptive playout strategy. The qualifying phrase used by [10] that small adjustments are not noticeable is of little practical value considering the variability in both frequency and extent of adjustments that can arise. Although no subjective listening testing, to our knowledge, has been done to precisely quantify this impact, some research dealing peripherally with the issue is summarised below.
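To illustrate why VAD settings change where silence period adjustments can occur, the following sketch classifies frames as active or silent with a simple energy threshold. It is a deliberately simplified, hypothetical detector (the function names, frame values and thresholds are all assumptions); VAD algorithms used in real VoIP applications are considerably more elaborate.

```python
import math


def frame_rms(frame):
    """Root mean square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def classify_frames(frames, threshold):
    """Return True for frames treated as active speech, False for silence."""
    return [frame_rms(f) > threshold for f in frames]


# Three hypothetical frames: digital silence, low-level speech, loud speech.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.05, -0.05, 0.05, -0.05],
    [0.5, -0.5, 0.5, -0.5],
]

sensitive = classify_frames(frames, threshold=0.01)    # low threshold
conservative = classify_frames(frames, threshold=0.1)  # high threshold
print(sensitive, conservative)
```

The low-level middle frame is active speech under the first setting and silence under the second, so the two settings give a per-talkspurt strategy different opportunities to adjust the playout delay for the same signal.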

Sun and Ifeachor in [7, 8] developed algorithms that seek to determine optimum buffer parameters in a trade-off between delay and late packet loss. Moreover, the impact of jitter on speech quality using PESQ was investigated by Qiao et al. in [35], but this was done by black-box testing and thus it is unclear whether the precise impact of jitter is direct (silence period adjustments) or indirect (through late packet loss). In [36], an extension to the E-model was developed to include the indirect impact of jitter via late packet loss.

Hoene et al. in [5] used PESQ to investigate the impact of a single adjustment (0–1000 ms) in an 8 second sample (typical of delay spikes) during both silences and active speech on speech quality. They showed that PESQ predicts significant impacts during active speech but that adjustments of up to approx. 320 ms are not noticeable during silences. In other research by Hoene et al. [6], they validated through subjective listening tests the behaviour of PESQ in predicting the impact of a single adjustment during active speech, but these tests did not extend to similar analysis during silences.

The more extensive work of Voran [11] also deals somewhat peripherally with the issue and is summarised as follows. Voran evaluated through subjective testing the impact of temporal discontinuities and packet loss on listening speech quality. Similar to Hoene et al.'s work published in [6], discontinuities were applied to active speech segments only. A range of experiments were carried out to quantify the impact on MOS of such impairments. He introduced three impairments, termed loss, jump and pause, to speech, where loss refers to conventional packet loss and was compensated for through Packet Loss Concealment (PLC), jump refers to temporal contraction of speech by dropping packets (thus without any PLC) and pause refers to temporal elongation of speech through silence insertion with PLC applied to the inserted silence. As such, the pause and jump impairments are of most interest as they involve temporal discontinuity and thus most closely reflect the type of impairment caused by per-talkspurt playout strategies. However, one key distinction between Voran's work and the operation of per-talkspurt strategies is that he applied all impairments (pause/loss/jump) to active speech segments. It is important to note that his pause impairment, which introduced a silence gap within active speech, was then compensated for through PLC, and his jump impairment essentially removed a segment of active speech as if it never existed. The impact of both magnitude and frequency of each of the three impairments was examined independently, as well as a combination of pause and jump. Impairments were added at random locations within G.723-encoded active speech. From [37], his main findings are summarised as follows:

• For a given frequency and magnitude of impairment, the impact of the four impairments (loss, pause, jump, pause and jump) on MOS scores was found to be roughly similar.

• As the magnitude of impairment increased, the reported MOS scores decreased at an almost linear rate. For example, at a frequency of one impairment per 100 frames, a 30/60/120 ms pause impairment resulted in the MOS score dropping by 0.21/0.41/1.15 respectively.

• As the frequency of impairment increased, the reported MOS scores decreased at a non-linear rate.

In addition to the above findings, he showed that for a very large and noticeable single adjustment (430 ms silence removal), PESQ failed to register any impact. This to some extent agrees with the analysis of Hoene et al. in that adjustments within silences of up to approx. 320 ms are ignored by PESQ.

As emphasised earlier, the detailed work of both Hoene et al. and Voran introduced the impairments throughout active speech (talkspurts) only. As such, the results cannot be directly compared with per-talkspurt playout strategies, where the temporal adjustment impairments only occur during silences (i.e. the silence period is contracted or elongated). In particular, Voran's additional finding regarding the very noticeable 430 ms temporal adjustment (silence removal), coupled with the findings of both Hoene et al. and Voran that PESQ ignores such large single adjustments during silences, strengthened the argument that both PESQ and POLQA need to be tested for the impact of frequent though smaller playout delay adjustments more typical of VoIP, and thus prompted us to undertake this research.

2.3. Research motivation

As outlined thus far, the literature to date has not quantified, either objectively or subjectively, the precise impact of multiple small silence period adjustments, typical of VoIP applications, on quality perceived by the end user, expressed by MOS values. As described above, research by Voran and Hoene et al. makes some contribution in this area. They both outlined, firstly, that the PESQ scores were not impacted by very large and noticeable single adjustments during silences. Secondly, both showed that significant adjustments during active speech did impact on subjective results (MOS-LQS), though these are quite different to silence period adjustments.

Considering all this, our primary research motivation was to address the gap in the literature by assessing the impact of frequent and small silence period adjustments on listening quality perceived by the end user, both through objective and subjective tests. In particular, we identified a number of key research questions that we wished to answer, namely:

1. What impact do frequent and small silence period adjustments have on subjective listening MOS scores?

2. What impact do frequent and small silence period adjustments have on objective listening MOS scores, specifically those predicted by both PESQ and POLQA?

3. Can the PESQ and/or POLQA model correctly predict the impact of frequent and small silence period adjustments on listening quality perceived by the end user, as quantified by a subjective test? If so, how accurate are those predictions?

Further questions include:

4. What relationship exists between the magnitude of adjustments and objective and subjective listening MOS scores?

5. What impact does the position of adjustments within speech samples have on objective and subjective listening MOS scores?

3. Methodology

In this section, we describe the methodology used to generate the speech samples for both objective and subjective tests and then provide details of the testing process. A custom-built Matlab-based simulator was developed and used to generate playout adjustments (as depicted in Figure 1). The overall methodology comprised a number of stages, as follows:

• Generate a series of network packet delays consistent with varying network conditions.

• Using these delays and simulated voice patterns (talkspurt distribution), applied to different playout algorithms, generate a series of playout adjustments.

• Apply this set of adjustments to different locations within reference speech samples.

Using this set of degraded and reference speech samples, we carried out both an ITU-T standardised subjective listening test and objective tests, the latter using both PESQ and POLQA. This facilitated a comparison between both approaches and between PESQ and POLQA. The process of generating adjustments is described in Section 3.1. The section starts with a description of the playout adjustment simulator and ends with the simulation work flow and outputs. The process of applying adjustments to speech is described in Section 3.2. Details related to the actual testing are given in Section 3.3.

3.1. Playout Adjustment Generation

In order to assess, both objectively and subjectively, the impact on listening quality perceived by the end user of silence period adjustments typical of VoIP applications, a detailed simulator was built to generate such adjustments. Overall objectives were to:

• Generate a sequence of VoIP packets V with a talkspurt distribution typical of real speech.

• Generate a range n of network delay sequences D, where each range represents different network conditions, i.e. nD delay values.

• For each of the n delay sequences D, generate a series of playout adjustments A that would result from applying this sequence to the VoIP packets V using typical playout algorithms.

Figure 1 depicts the playout adjustment simulator, which was implemented using Matlab. The simulator was built by one of the authors for previous research, as outlined in [38]. As can be seen in Figure 1, the simulator consists of three separate module blocks, namely:

• Voice Simulator Block
• Delay Simulator Block
• Playout Algorithm Simulator Block

Each block is described in the following subsections.

3.1.1. Voice simulator block

The simulated voice streams were based on live speech samples. The critical factor here is the distribution of talkspurts, which were extracted directly from voice tests into text files and used to reproduce speech characteristics. This was done by recording normal VoIP speech with VAD enabled and extracting the Marker bits within the RTP packet headers, where '1' indicates the start of a talkspurt and '0' represents an active speech packet within a talkspurt. An array V thus represents the distribution of talkspurt packets from normal speech; e.g. a sample subset of V, [1 0 0 0 0 0 0 1 0 0 0], represents 2 separate talkspurts of duration 7 packets and 4 packets respectively.
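The marker-bit encoding of V can be illustrated with a minimal Python sketch. The function name `talkspurt_lengths` is hypothetical and is not part of the authors' Matlab simulator; it simply counts the packets belonging to each talkspurt.

```python
def talkspurt_lengths(markers):
    """Return the length (in packets) of each talkspurt in a marker-bit array."""
    lengths = []
    for bit in markers:
        if bit == 1:
            lengths.append(1)    # marker bit set: a new talkspurt starts here
        elif lengths:
            lengths[-1] += 1     # continuation packet of the current talkspurt
        # packets seen before the first marker are ignored
    return lengths


V = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(talkspurt_lengths(V))  # [7, 4]
```

For the sample subset quoted in the text, this yields the two talkspurts of 7 and 4 packets.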

3.1.2. Delay simulator block

Significant research has focused on modelling of Internet delay and loss characteristics [39, 40, 41, 42, 43, 44]. In [8], Sun and Ifeachor show that, for VoIP, a Weibull distribution models traffic better than exponential or Pareto. Our model is designed to model the temporal relationship or burstiness of delay traffic, which is commonly found and logically follows from the research that has proposed the use of bursty packet loss models. As such, we propose a series of 2-state Markov models to simulate varying network conditions. Figure 2 illustrates its application to delay modelling. The following summarises the most relevant characteristics of the models developed:

Figure 1. Playout adjustment simulator. [Block diagram: VoIP tests (talkspurt distribution etc.) feed the Voice Simulator, which outputs the talkspurt distribution V; the Markov delay models output the delay distribution D; after packet/delay mapping, the Playout Algorithm Simulator outputs the playout adjustment distribution A.]

Figure 2. 2-state Markov Delay Model.

• BAD/GOOD State Jitter Level: The delay models that were developed used different ranges of jitter to differentiate between GOOD and BAD states. Essentially, a GOOD state had a low jitter metric (set as a % of base delays) and a multiplier was applied to this metric to represent the BAD state.

• BAD State Probability: This represents the percentage of packets that are affected by high delay variance (jitter).

• Average BAD State Burst Length: This determines how the BAD state packets are distributed. Much of the literature on network analysis has reported that both loss and delay/jitter have strong temporal dependency or burstiness. Where strong temporal dependency of jitter/delay is present, this will result in clusters of BAD state packets, resulting in BAD delay/jitter bursts spanning more than one packet. Longer BAD bursts will be reflected in higher values for P_BB from Figure 2.

• Using these values, we can derive values for all 4 probabilities:
  – P_BB: Probability that packet n+1 will have high jitter (BAD state) given that packet n has high jitter.
  – P_BG: Probability that packet n+1 will have low jitter given that packet n has high jitter, i.e. switch states.
  – P_GG: Probability that packet n+1 will have low jitter (GOOD state) given that packet n has low jitter.
  – P_GB: Probability that packet n+1 will have high jitter given that packet n has low jitter, i.e. switch states.

An additional important requirement of the delay block was to ensure that out-of-order packets would not arise. In reality, such events are largely due to route changes and occur infrequently, and thus it was important to reproduce this.
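One way the four transition probabilities might be derived and used is sketched below in Python. This is an illustration under stated assumptions and not the authors' Matlab code: it assumes geometrically distributed BAD bursts, so that P_BG = 1/(mean burst length in packets), and a stationary BAD-state share equal to the BAD state probability, so that P_GB = p_bad · P_BG / (1 − p_bad). The uniform jitter draw, the function names and the seed are likewise hypothetical. Out-of-order arrivals are avoided by never letting a delay fall below the previous delay minus the packet interval.

```python
import random


def transition_probs(p_bad, burst_len):
    """Derive the four transition probabilities of a 2-state chain.

    p_bad     -- long-run fraction of packets in the BAD state (0 <= p_bad < 1)
    burst_len -- mean BAD burst length in packets (>= 1)
    """
    p_bg = 1.0 / burst_len                # leave BAD (assumed geometric bursts)
    p_gb = p_bad * p_bg / (1.0 - p_bad)   # enter BAD (stationary balance)
    return {"BB": 1.0 - p_bg, "BG": p_bg, "GG": 1.0 - p_gb, "GB": p_gb}


def simulate_delays(n, interval_ms, base_ms, p_bad, burst_len,
                    good_jitter, bad_mult, seed=0):
    """Generate n packet delays (ms) from the 2-state model."""
    rng = random.Random(seed)
    p = transition_probs(p_bad, burst_len)
    bad = False
    delays = []
    for _ in range(n):
        # Decide whether to switch state before drawing this packet's delay.
        if rng.random() < (p["BG"] if bad else p["GB"]):
            bad = not bad
        jitter = base_ms * good_jitter * (bad_mult if bad else 1.0)
        delay = base_ms + rng.uniform(0.0, jitter)   # assumed jitter shape
        if delays:
            # Keep arrival order: packet i must not overtake packet i-1.
            delay = max(delay, delays[-1] - interval_ms)
        delays.append(delay)
    return delays


D = simulate_delays(4000, 20, 50, p_bad=0.2, burst_len=10,
                    good_jitter=0.25, bad_mult=2)
```

With a BAD state probability of 20% and a mean burst length of 10 packets, these assumptions give P_BG = 0.1 and P_GB = 0.025.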


3.1.3. Playout algorithm simulator block

Two per-talkspurt adaptive strategies (namely algorithm 1 and 4 from [1], referred to here as algorithm 1 and 2 respectively) were simulated. Both algorithms utilise linear recursive filters in tracking network conditions but differ in that algorithm 2 responds more quickly due to different parameters and also includes a spike mode that responds more rapidly to changing network conditions, although it must still wait for the next silence period to do so. Both algorithms adjust playout time accordingly at the start of a talkspurt, as given by

$$d_i = \alpha\, d_{i-1} + (1 - \alpha)\, n_i \tag{1}$$

$$p_i = t_i + d_i + \beta\, v_i \tag{2}$$

In the above, i refers to packet i, d_i is the estimated end-to-end delay, α is the filter gain, n_i is the measured delay, p_i is the playout time, t_i is the send time, v_i is the estimated variation in delay and β is a multiplication factor. For example, in [1] the authors choose a β value of 4 for both algorithms, whereas the history factor α was set to 0.875 for algorithm 2 versus 0.998 for algorithm 1. The choice of parameters α, β and the spike detection threshold (for algorithm 2) impacts greatly on the performance of these algorithms, and they are usually tuned to match precise network conditions, i.e. from stable to unstable. For this reason, we utilised a range of values, as described in the following section. As described earlier, adjusting on a per-talkspurt basis maintains the integrity of speech within talkspurts whilst altering the inter-talkspurt silence periods.

The delays from the Delay Block are mapped to the Voice talkspurt distribution series V and applied to the various playout algorithms. The delays are applied to each packet in turn and processed by the playout algorithm, and adjustments are made at the start of each talkspurt, indicated by a '1' in the V array. This then generates a series of playout adjustments A. As outlined in Section 2.2, the extent of adjustment is dependent not only on the network condition and playout algorithm but also on the specific VAD settings of the VoIP application, as the latter will greatly impact on the talkspurt distribution for a given speech segment. In any event, the resulting required adjustment (silence) can be added or removed at this point before the next talkspurt is processed.
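The mapping from V and D to a series of adjustments A can be sketched in Python as follows. This is a minimal illustration of Equations (1) and (2) only, not the authors' simulator: the update used for the variation estimate v_i is an assumption (the same recursive filter applied to |d_i − n_i|, as commonly quoted for the algorithm in [1]), the seeding of the estimates is arbitrary, and spike mode (algorithm 2) is omitted. The function name is hypothetical.

```python
def playout_adjustments(markers, delays, alpha=0.998, beta=4.0):
    """Return the change in playout delay (ms) applied at each talkspurt start.

    markers -- 1 at the first packet of a talkspurt, 0 otherwise (array V)
    delays  -- measured network delay n_i per packet in ms (array D)
    A positive value lengthens the preceding silence, a negative one shortens it.
    """
    d = delays[0]      # delay estimate d_i, seeded with the first measurement
    v = 0.0            # delay-variation estimate v_i (assumed initial value)
    offset = None      # playout delay p_i - t_i of the current talkspurt
    adjustments = []
    for marker, n in zip(markers, delays):
        d = alpha * d + (1.0 - alpha) * n            # Equation (1)
        v = alpha * v + (1.0 - alpha) * abs(d - n)   # assumed variation filter
        if marker == 1:
            new_offset = d + beta * v                # Equation (2) minus t_i
            if offset is not None:
                adjustments.append(new_offset - offset)
            offset = new_offset
    return adjustments


# Delay rises from 50 ms to 80 ms before the second talkspurt.
A = playout_adjustments([1, 0, 1, 0], [50.0, 50.0, 80.0, 80.0], alpha=0.875)
```

Because the playout delay is only changed when a marker bit is seen, packets within a talkspurt keep their relative timing, which is the per-talkspurt property described above.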

3.1.4. Simulation work flow and outputs

The overall simulator works as follows. The user first specifies a talkspurt distribution file, which is extracted from live speech and loaded as an array V. The user then specifies network conditions for the test. The delay block returns a sequence of network delays D corresponding to those conditions. The input parameters are:

1. Number of packets
2. Packet interval (ms)
3. Base delay (ms)
4. BAD state burst length (ms)
5. BAD state probability (%)
6. GOOD state jitter (% of base delay)
7. BAD state jitter multiplier

For our testing, parameters 1–4 were kept constant, while parameters 5–7 were varied as outlined below to give different network characteristics ranging from stable to unstable:

1. Number of packets = 4000
2. Packet interval = 20 ms
3. Base delay = 50 ms
4. BAD state burst length = 10 packets (200 ms)
5. BAD state probability = 20, 40, 60, 80 %
6. GOOD state jitter = 25, 50 %
7. BAD state jitter multiplier = 2, 3, 4
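As a quick check of the size of this parameter space, the combinations of the three varied parameters can be enumerated in a short Python sketch. The dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

bad_probability = [0.20, 0.40, 0.60, 0.80]   # parameter 5
good_jitter = [0.25, 0.50]                   # parameter 6 (fraction of base delay)
bad_multiplier = [2, 3, 4]                   # parameter 7

delay_models = [
    {"packets": 4000, "interval_ms": 20, "base_ms": 50, "burst_packets": 10,
     "p_bad": p, "good_jitter": g, "bad_mult": m}
    for p, g, m in product(bad_probability, good_jitter, bad_multiplier)
]
print(len(delay_models))  # 24
```

The 4 × 2 × 3 = 24 combinations correspond to the 24 network delay models used in the simulations.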

Note that, in arriving at these parameters, particularly those relating to jitter, a detailed series of delay and jitter tests was undertaken, measuring delay/jitter between Ireland and the US/mainland Europe. Further details can be found in [38]. More recent testing by [45, 46] has highlighted the particular problem of very high jitter/delay in congested IEEE 802.11 networks. The final step was to map the delays to packets and apply them to adaptive jitter buffering (AJB) algorithms 1 and 2. This generates a series of playout adjustments A for every network test condition and playout algorithm. Figure 1 summarises the overall flow within the simulator.

In [47], details of the comprehensive tests using the simulator are presented. In summary, 24 different network delay models were used, generating 24 network delay sequences D. These delay values were fed to 2 different playout algorithms, as described in Section 3.1.3. For each playout algorithm, tests were repeated using different parameters such as α (history weighting, varied from 0.8 to 0.998), β (jitter multiplier, varied from 4 to 6) and spike mode threshold (algorithm 2 only); see again Section 3.1.3 for details. Each combination resulted in a distinct set of playout adjustments A for each test scenario. Note that the voice samples V were based on 80 seconds of active speech with 40 talkspurts (Marker bit = 1), whereas the speech samples chosen for this experiment were 8 seconds long; this also had to be factored in. Essentially, a pro-rata approach was taken: for the 80 seconds of speech used for tests there were 40 playout adjustments, so for our 8 second ITU-T speech samples we implemented 4 adjustments. Arising from the full range of test combinations described above, which numbered 96, and the resulting adjustments, a subset of 12 sets of playout adjustments containing 4 adjustments each was taken to represent a spectrum of network conditions ranging from a stable network to an unstable network. Table I lists the actual playout adjustments selected that were applied to the speech samples.

3.2. Speech samples

As is normal for quality testing, 4 reference speech samples were used. The English subset of the ITU-T P-series Supplement 23 [48] database was used for speech material, each sample consisting of a pair of utterances with a small pause between the utterances. Two male and two female speakers uttering different sentences were included in the stimuli. The speech samples used (source samples available in the database) were 8 seconds in length and stored in 16-bit, 8000 Hz linear PCM.

Figure 3. Demonstration of how playout adjustments were applied to 1st Female speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 4. Demonstration of how playout adjustments were applied to 1st Female speech sample - Variant B. Arrows indicate where adjustments were placed.

Figure 5. Demonstration of how playout adjustments were applied to 2nd Female speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 6. Demonstration of how playout adjustments were applied to 2nd Female speech sample - Variant B. Arrows indicate where adjustments were placed.

Each speech sample was modified by inserting and removing silence periods to reflect the adjustments as specified above. The adjustments were a mix of positive and negative adjustments (adding and removing silence periods), as shown in Table I.

As a further experimental variable, the set of four adjustments was applied to each sample in two different locations (referred to hereafter as variant A and B). The only distinction between variant A and B is that the impairments in variant B were applied in the latter part of each sample. All the adjustments were made using a free sound editor. Figures 3–10 illustrate how the 4 playout adjustments were applied to all speech samples involved in the experiment in 2 different places (Variant A and B).

The overall result of this sampling was 96 speech samples (4 voices × 12 test conditions × 2 variants).
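The adjustments in this study were made by hand in a sound editor. Purely as an illustration of the operation, the following Python sketch shows how silence could be inserted into or removed from an 8000 Hz PCM sample list. The function name and the choice of sample positions are hypothetical, and positions are assumed to fall inside existing silence periods.

```python
def apply_adjustments(samples, edits, rate_hz=8000):
    """Insert or remove silence at the given sample positions.

    samples -- list of PCM sample values
    edits   -- list of (position, adjust_ms); positive adds silence,
               negative removes that many milliseconds starting at position
    """
    out = list(samples)
    # Work from the end so that earlier positions stay valid after each edit.
    for position, adjust_ms in sorted(edits, reverse=True):
        n = round(abs(adjust_ms) * rate_hz / 1000)
        if adjust_ms > 0:
            out[position:position] = [0] * n      # lengthen the silence
        elif adjust_ms < 0:
            del out[position:position + n]        # shorten the silence
    return out


# Test condition No. 1 from Table I (2, -2, 3, -3 ms) on 8 s of digital silence.
silence = [0] * 64000
adjusted = apply_adjustments(
    silence, [(8000, 2), (24000, -2), (40000, 3), (56000, -3)])
```

At 8000 Hz a 1 ms adjustment corresponds to 8 samples, so the largest adjustment in Table I (30 ms) changes a silence period by 240 samples.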

3.3. Speech quality assessment

The speech quality assessment process was divided into two parts, namely subjective assessment (listening test) and objective assessment (using the PESQ and POLQA models). Both assessment procedures are described in more detail below.

Figure 7. Demonstration of how playout adjustments were applied to 1st Male speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 8. Demonstration of how playout adjustments were applied to 1st Male speech sample - Variant B. Arrows indicate where adjustments were placed.

Figure 9. Demonstration of how playout adjustments were applied to 2nd Male speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 10. Demonstration of how playout adjustments were applied to 2nd Male speech sample - Variant B. Arrows indicate where adjustments were placed.

The ACR subjective listening test was performed in May 2012 in accordance with ITU-T Recommendation P.800 [15]. In every case, up to 2 listeners were seated in a small listening room (acoustically treated) with a background noise well below 20 dB SPL (A). Altogether, 30 naïve (non-expert) listeners (16 male, 14 female, 20-55 years, mean 34.43 years) participated in the test. All subjects were Irish nationals whose first language was English. The subjects were remunerated for their efforts. The samples (96 degraded samples + 4 reference samples) were played out using high quality studio equipment in a random order and diotically presented over Sennheiser HD 455 headphones (presentation level 73 dB SPL (A)) to the test subjects. The opinion scores, from 1 (bad) to 5 (excellent), were averaged to obtain MOS-Listening Quality Subjective narrowband (MOS-LQSn) values for each sample.
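The averaging of opinion scores into a MOS value, together with a confidence interval, can be sketched in Python as follows. The normal-approximation interval (z = 1.96) is an assumption made for illustration; the paper does not state how its 95% confidence intervals were computed, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return m, half_width


# Four illustrative ACR votes on the 1 (bad) to 5 (excellent) scale.
mos, ci = mos_with_ci([3, 4, 4, 5])
```

In practice each MOS-LQSn value would be computed from the 30 votes collected for a sample, or from 120 votes when averaging over the 4 voices.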

In the next step, the 100 samples (the 96 degraded samples using test conditions No. 1–12 and the 4 reference samples, the Ref test condition) were compared to their respective reference samples using both the PESQ model described in ITU-T Rec. P.862 [22, 23, 27] and the POLQA model described in P.863 [24, 25, 28] in order to get objective listening quality scores. In the case of PESQ, the output (raw PESQ scores) was converted to MOS-Listening Quality Objective narrowband (MOS-LQOn) values by the equation defined in [49].
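The conversion of raw PESQ scores can be illustrated with a short Python sketch. It assumes that the equation in [49] is the narrowband mapping function of ITU-T Rec. P.862.1, which the text does not state explicitly; the function name is hypothetical.

```python
from math import exp


def pesq_to_mos_lqo(raw_pesq):
    """Map a raw P.862 score to MOS-LQO (assumed P.862.1 narrowband mapping)."""
    return 0.999 + 4.0 / (1.0 + exp(-1.4945 * raw_pesq + 4.6607))


print(round(pesq_to_mos_lqo(4.5), 2))  # 4.55
```

Under this assumption, the maximum raw PESQ score of 4.5 maps to approximately 4.55 MOS-LQOn.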

4. Experimental results

In this section, we present experimental results for both subjective and objective assessment (PESQ and POLQA models), as well as a detailed analysis and comparison of both.


4.1. Experimental results for subjective assessment

In Figure 11, we summarise the results of the subjective listening test, averaged over the 4 different voices involved in the experiment, as described in Section 3. It can be seen that the impact of test conditions No. 1–12 (playout adjustments) on average MOS-LQSn scores relative to the reference samples is quite limited. On the other hand, we can also see that the subjects gave surprisingly low MOS-LQS scores to all samples, including the reference samples involved in the test, i.e. the average values oscillate around 3.6 MOS. This value is quite low considering that the samples contain either no degradations (reference samples) or very moderate degradations in a narrowband context. This result warrants further analysis beyond the scope of this paper, but one possibility is that the subjects' opinion has been affected by their previous long-term experience with wideband telephony (wideband speech), though this was not validated. Such experience would alter their internal reference to wideband speech (extended frequency range resulting in higher speech quality) and thus explain the lower scores given to the narrowband samples involved in the test. One other possibility relates to the fact that the range of conditions (impairments) introduced into the speech samples was quite limited. This issue is discussed in more detail in [50]. As is evident, the impact of varying the extent of playout adjustments across all test conditions was very small (insignificant). However, one characteristic of note that emerged is the small impact of the location of the adjustments on scores, i.e. most of the test conditions using variant B obtained slightly lower scores than the same conditions with variant A. The biggest difference (0.2 MOS) between both investigated variants (location impact) was obtained for test condition No. 10. As discussed and evident from Figures 3–10, variant B adjustments were designed to be towards the end of the sample, so one explanation is that the distortions presented closer to the end of the sample were a bit more annoying for the subjects than those presented in the first half of the sample. In principle, this result has echoes of the so-called recency effect reported in the literature (e.g. [51, 52, 53]). However, such tests used samples longer than 60 seconds. In any event, and as discussed later, the differences are not significant in the context of the confidence interval.

Table I. Playout adjustments (in ms) applied to speech samples. Σ: Absolute sum of adjustments.

Test condition   1st   2nd   3rd   4th     Σ
Ref                0     0     0     0     0
1                  2    −2     3    −3    10
2                  4    −4    −4     4    16
3                  3    −3    −6     6    18
4                  5    −5    −5     5    20
5                  3    −6    −7    10    26
6                 16   −12    −8     4    40
7                 10   −17    −6    13    46
8                 10   −15   −10    15    50
9                  8   −23    −3    18    52
10                 5    10   −30    15    60
11               −15    15   −15    15    60
12               −25    22    −8    11    66

Figure 11. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQSn. The vertical bars show 95% CI computed over 120 MOS-LQSn values (30 subjective scores per sample for 4 samples). [Plot: x-axis Test Conditions (Ref, No. 1 to No. 12); y-axis MOS-LQSn (1 to 5); series Variant A and Variant B.]

A three-way analysis of variance (ANOVA) test was conducted on the subjective results using test condition, voice and variant (location) as fixed factors (Table V). It should be noted here that the voice factor is a combination of voice and content. The highest F-ratio was obtained for the voice factor (F = 73.12, p < 0.001), and the effect of voice was found to be highly statistically significant. The test condition factor had a weaker effect on quality than the voice factor (F = 0.82, p = 0.635) and was not statistically significant. The last factor investigated in the ANOVA test was the variant factor. It also has a weaker effect on quality than the voice factor and, similar to the test condition factor, is not statistically significant (F = 1.07, p = 0.3032). Regarding interactions of all the involved factors, the results show that none of them is statistically significant.

To summarise, the results of the ANOVA test revealed that subjects were more sensitive to the voice than to the test conditions and variants, and no statistically significant interactions between the investigated factors were found (assuming no impact of content due to carefully chosen speech samples). It should be noted here that variability caused by content is considered one of the sampling factors, as defined in the Handbook of subjective testing practical procedures [16]. The ANOVA test also revealed that the small differences between quality scores of variant A and B reported previously are not statistically significant. In other words, the results of the ANOVA test confirmed our assertion that the differences are not significant in the context of the confidence interval.

As is often reported in the literature, some impact relating to the voice effect was expected in this experiment, but not to such an extent. A diagnostic analysis of the test data revealed that one of the voices (1st male) was liked more than the others (i.e. over all conditions, this voice was rated on average approx. 0.4 MOS-LQSn higher than the second male voice and approx. 0.6 MOS-LQSn higher than the female voices). It is also worth noting that both male voices obtained higher scores than the female voices.

As can be seen above, the impact of playout adjustments is not statistically significant. This fact raises the question as to whether the impact would be higher if the Degradation Category Rating (DCR) testing approach was deployed. We considered this during the research design phase. We carried out limited DCR subjective testing (4 experts involved) and the subjects noted no degradation caused by time-shifting (playout adjustments). On this basis, we came to the conclusion that the introduced impairments would not be noticed by subjects in a DCR test. For that reason, we decided to use an ACR test.

Furthermore, it should be noted that in telephony, subjects have no access to the original speech from their conversational partner, and thus ACR testing is commonly used in telephony speech quality assessment.

To the best of our knowledge, the results presented in this section are a first proof of the assertion published in the literature [1, 2, 10] that small and frequent silence period adjustments typical of VoIP playout algorithms do not have a noticeable effect on listening quality perceived by the end user.

It is interesting at this juncture to compare our results with the subjective results of Voran [11] and Hoene et al. [6]. Voran introduced pause/jump impairments at the rate of 1 to 4 per 3 second sentence with pause/jump magnitudes of 30, 60 and 120 ms. We introduced 4 impairments in an 8 second sample, based on our observations of actual speech and using adjustments which were derived from realistic network delay models and real playout algorithms. As a result, our adjustments were typically much smaller (the largest was 30 ms). Voran noted that for 1 pause/jump adjustment of 30 ms, MOS scores fell by 0.2, whereas we found no significant drop in MOS scores as the extent of adjustments increased, even for conditions 10-12, where the magnitude of some of the adjustments was similar to Voran's at 20–30 ms. One key distinction, as stated before, is that Voran applied such impairments to active speech, whereas all our adjustments were made to silence periods. Furthermore, in the case of Voran's pause impairment, PLC was used. Finally, Voran reported a significant impact of a very large single adjustment (silence removal) of 430 ms. The magnitude of adjustments introduced in our samples was much smaller and, as reported above, no significant subjective impact was found. Hoene et al. in [6] introduced very large (in comparison to adjustments introduced by jitter buffers) single adjustments into active speech and also noted a significant impact, consistent with PESQ.

4.2. Experimental results for objective assessment

Figure 12. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by PESQ. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: x-axis Test Conditions (Ref, No. 1 to No. 12); y-axis MOS-LQOn (PESQ) (1 to 5); series Variant A and Variant B.]

Figure 12 depicts the results of objective assessment done by PESQ (MOS-LQOn (PESQ)) using the same test conditions and speech samples. We can observe that the severity of the test conditions (playout adjustments) has a relatively big impact on the predicted MOS values. In summary, the MOS scores decrease as adjustments increase, i.e. as network instability increases. This is interesting in the context of findings by Voran [11] that PESQ does not register any impact arising from a single 430 ms silence period adjustment (silence removal). The results of Hoene et al. from the PESQ analysis published in [5] are somewhat similar to Voran's, showing no impact for single adjustments of up to approx. 320 ms. One possible explanation is that in our tests we introduced frequent and small adjustments, rather than the single and large adjustments introduced by Voran and Hoene et al. We speculate that PESQ had difficulties with our adjustments profile during the time-alignment process. As detailed in Section 2.2, the frequency and extent of adjustments is impacted greatly by VAD settings, but our test design of 4 adjustments in 8 seconds is not atypical of VoIP.

One positive correlation between PESQ and the subjective test results is that the biggest difference between variant A and B (0.47 MOS) was reported for test condition No. 10. It should also be noted that there is a significant difference between the scores obtained for conditions No. 10 and 11. This is interesting in that, whilst the absolute sum of the adjustments for both conditions is the same, the individual adjustments are different. Test condition No. 10 represents very different adjustments, varying from 5 to −30 ms. On the other hand, condition No. 11 contains adjustments of similar magnitude: two adjustments of −15 ms and two adjustments of 15 ms. The second one (condition No. 11) obtained much lower scores in both cases (Variant A and B). As such, it seems that PESQ is better able to cope with large adjustments (30 ms was the largest of all conditions). In contrast, the opposite results were obtained for the subjective data, i.e. listeners scored test condition No. 10 lower than No. 11. However, differences between the subjective results obtained for test conditions No. 10 and 11 are much smaller than those obtained for objective results and of the same order as the confidence interval (see Figure 11).

As with the subjective results lower MOS-LOQn(PESQ) scores have been reported for most of the testconditions of variant B than for the same test conditionsof variant A The average quality scores (averaged overvoices) predicted by PESQ ranged from 3665 to 455MOS for variant A and from 33 to 455 MOS for vari-ant B It can be clearly seen in Figure 12 that the impact ofadjustment location is noticeable for objective scores pre-dicted by PESQ especially for higher magnitudes of theinvestigated playout adjustments

A diagnostic analysis of the objective data predictedby PESQ revealed that the voice impact has been muchweaker than that reported above for the subjective dataIn fact intrusive signal-based models (eg PESQ andPOLQA) are designed to focus more on impairments thanon the special characteristics of voice Due to that suchmodels are sometimes called impairment or degradationmodels However it seems that there is some interactionbetween the extent of impairments introduced in a sample(test condition) and the voice sample used because thedeviation of the MOS-LQOn (PESQ) scores between thetest conditions is different for all 4 voices involved in thetest Much weaker and not statistically significant interac-tion of test condition and voice has been obtained for thesubjective data as shown in Table V

Figure 13 shows the results of the objective assessment done by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of the POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of the PESQ predictions. Nonetheless, it seems that POLQA was impacted more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No. 7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend in the POLQA scores, as has been done for PESQ above.

Regarding the biggest difference between variant A and B for test condition No. 10, reported above for both subjective scores and objective scores predicted by PESQ, it is worth noting that this effect has not been captured by POLQA at all. In other words, the scores predicted by POLQA for variant A and B of test condition No. 10 were very similar (0.02 MOS difference).

Moreover, the scores predicted by POLQA largely exhibit the same behaviour as the scores obtained from the subjective test from a location perspective (variant A and B). In other words, it has been reported above that PESQ and the subjects involved in the subjective test provided lower MOS scores for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by POLQA ranged from 4.10 to 4.43 MOS for variant A and from 4.08 to 4.43 MOS for variant B. This contrasts with the results for PESQ, where adjustment location had a significant impact. It can be clearly seen in Figure 13 that the adjustment location plays a less important role here, except for some

[Figure 13 plot: MOS-LQOn (POLQA), scale 1 to 5, against Test Conditions (Ref, No. 1 to No. 12); series: Variant A, Variant B.]

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples).

of the higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact was much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on special characteristics of voice. Regarding the interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, this effect was also obtained for the POLQA model, but only for female voices.

4.3. Comparison between subjective and objective quality scores

In the following subsection, subjective MOS values (MOS-LQSn) are compared to the predictions provided by both PESQ and POLQA (MOS-LQOn (PESQ, POLQA)). The comparison is performed for all experimental conditions, i.e. all combinations of voice, test conditions and both investigated location variants. However, the MOS-LQSn values will have been influenced by the choice of conditions in the actual experiment. In order to account for such influences, model predictions are commonly transformed to the range of conditions that are part of the respective test [54]. This may be done, for example, by using a monotonic 3rd-order mapping function, presuming such a function can be found.

The performance of the PESQ and POLQA models is quantified in terms of the Pearson correlation coefficient R, the respective root mean square error (rmse) and the epsilon-insensitive root mean square error (rmse*), as [55, 56]

\[ R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^{2}}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^{2}}} \qquad (3) \]


and

\[ \mathrm{rmse} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} (X_i - Y_i)^{2}} \qquad (4) \]

with $X_i$ the subjective MOS value for stimulus i, $Y_i$ the objective (predicted) MOS value for stimulus i, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, N the number of stimuli considered in the comparison and d the number of degrees of freedom provided by the mapping function (d = 4 in the case of a 3rd-order mapping function, d = 1 in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

\[ \mathrm{Perror}_i = \max\left(0,\; |X_i - Y_i| - ci95_i\right) \qquad (5) \]

where $ci95_i$ represents the 95% confidence interval and is defined by [56]

\[ ci95_i = t(0.05, M)\,\frac{\delta_i}{\sqrt{M}} \qquad (6) \]

where M denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of the subjective scores for stimulus i. The final epsilon-insensitive root mean square error is calculated as usual, but based on Perror from formula (5):

\[ \mathrm{rmse}^{*} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} \mathrm{Perror}_i^{2}} \qquad (7) \]

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* considers only differences related to an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of MOS is taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and both rmse and rmse* would be 0.0.
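The three measures defined in formulas (3) to (7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: the function names are hypothetical, NumPy is assumed to be available, and the per-stimulus confidence intervals `ci95` are passed in directly rather than derived from formula (6).

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient R, formula (3)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))


def rmse(x, y, d=1):
    """Root mean square error, formula (4); d = 1 means no regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2) / (len(x) - d)))


def rmse_star(x, y, ci95, d=1):
    """Epsilon-insensitive rmse, formulas (5) and (7)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ci95 = np.asarray(ci95, dtype=float)
    # Only the part of each error outside the confidence interval counts.
    perror = np.maximum(0.0, np.abs(x - y) - ci95)
    return float(np.sqrt(np.sum(perror ** 2) / (len(x) - d)))


# Made-up scores for illustration only (not data from the paper).
subjective = [4.0, 4.2, 4.4, 4.1]
objective = [3.5, 4.3, 4.0, 4.1]
ci = [0.2, 0.2, 0.2, 0.2]
print(pearson_r(subjective, objective),
      rmse(subjective, objective),
      rmse_star(subjective, objective, ci))
```

Because Perror discards the part of each error that lies inside the confidence interval, rmse* can never exceed rmse for the same data and the same d.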

All of R, rmse and rmse* are calculated for the raw (non-regressed) MOSn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined), and both (the regressed and the non-regressed MOSn predictions) are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*)

[Figure 14 plot: MOS-LQSn against MOS-LQOn (PESQ), both axes 1 to 5; series: Variant A, Variant B.]

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed) per sample.

[Figure 15 plot: MOS-LQSn against MOS-LQOn (POLQA), both axes 1 to 5; series: Variant A, Variant B.]

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed) per sample.

[Figure 16 plot: MOS-LQSn against MOS-LQOn (POLQA), both axes 1 to 5; series: Variant A, Variant B.]

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed) per sample.

are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (variant A and B). The correlation coefficient is positive, though very low (R = 0.17), for variant A and negative for variant B (R = −0.15). As normal, positive correlation


Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant    R        rmse     rmse*
A          0.17     0.639    0.367
B          -0.15    0.690    0.424

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant    R        rmse     rmse*
A          0.30     0.783    0.448
B          0.47     0.806    0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant    R        rmse     rmse*
A          0.30     0.265    0.243
B          0.47     0.270    0.217

indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases, the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

The 3rd-order regression, as recommended in [54], leads in this case to a non-monotonic, decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. weighting the influence of outliers, changing the polynomial order or using non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried 2nd- and 1st-order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this data set.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little difference between location variants intra-speaker, significant inter-speaker variability and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12 and significant inter-speaker variation.

Regarding the PESQ results, i.e. very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.

[Figure 17 plot: three panels (Subjective results, MOS-LQS; PESQ, MOS-LQOn; POLQA, MOS-LQOn), each on a scale of 3 to 5, against Speaker/Variant (M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B).]

Figure 17. Dominant Experimental Factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals.
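The search for a monotonically increasing polynomial mapping function, stepping down from 3rd to 1st order as described above, can be sketched as follows. This is an illustrative sketch only: it assumes an ordinary least-squares fit via `numpy.polyfit` and checks monotonicity on a grid over the observed range of objective scores, neither of which is confirmed as the procedure used in the study. The function names are hypothetical.

```python
import numpy as np


def is_increasing(coeffs, lo, hi, points=200):
    """True if the polynomial has a positive slope everywhere on [lo, hi]."""
    grid = np.linspace(lo, hi, points)
    slope = np.polyval(np.polyder(coeffs), grid)
    return bool(np.all(slope > 0))


def find_mapping(objective, subjective, orders=(3, 2, 1)):
    """Return (order, coeffs) of the first increasing fit, or None."""
    objective = np.asarray(objective, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    for order in orders:
        # Least-squares polynomial mapping objective scores to subjective ones.
        coeffs = np.polyfit(objective, subjective, order)
        if is_increasing(coeffs, objective.min(), objective.max()):
            return order, coeffs
    return None
```

A return value of `None` corresponds to the situation reported for PESQ above, where no tried polynomial order yielded a monotonically increasing function.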

Moving to POLQA, the results show little variation intra-speaker across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except Female 1) and much less variation inter-speaker. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in both Tables III and IV, and the rmse data for POLQA (after regression (1st-order polynomial regression applied)) is also much better than for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.


Table V. Summary of ANOVA test conducted on the MOS-LQSn's.

Effect                 SS         df      MS         F        p
Test condition (TC)    9.23       12      0.7693     0.82     0.6350
Voice                  207.02     3       69.0073    73.12    0.0000
Variant                1.01       1       1.0051     1.07     0.3022
TC × Voice             18.21      36      0.5059     0.54     0.9895
TC × Variant           4.68       12      0.3899     0.41     0.9593
Voice × Variant        0.68       3       0.2282     0.24     0.8672
Error                  2880.37    3052    0.9438
Total                  3121.2     3119
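As a reading aid for Table V, each mean square is the sum of squares divided by its degrees of freedom, and each F ratio is the effect mean square divided by the error mean square. The short sketch below recomputes the F column from the SS and df values as read from the table; the variable and function names are hypothetical, and small differences in the last digit come from rounding in the printed values.

```python
# SS, df and reported F per effect, as read from Table V.
effects = {
    "Test condition (TC)": (9.23, 12, 0.82),
    "Voice": (207.02, 3, 73.12),
    "Variant": (1.01, 1, 1.07),
    "TC x Voice": (18.21, 36, 0.54),
    "TC x Variant": (4.68, 12, 0.41),
    "Voice x Variant": (0.68, 3, 0.24),
}
ss_error, df_error = 2880.37, 3052
ms_error = ss_error / df_error  # error mean square, about 0.9438


def f_ratio(ss, df):
    """F = (SS / df) / MS_error."""
    return (ss / df) / ms_error


for name, (ss, df, reported) in effects.items():
    print(f"{name}: F = {f_ratio(ss, df):.2f} (reported {reported:.2f})")
```

Only the Voice effect yields an F ratio far above 1, which matches the paper's statement that voice was the dominant factor in the subjective scores.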

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12) and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may also have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ, POLQA)). Moreover, the accuracy of both the PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much less. Note that both Voran's and Hoene et al.'s research showed that single adjustments (430 ms in the case of Voran, 0–320 ms (approximately) in the case of Hoene et al.) were found to be disregarded by PESQ. Regarding question 3 and PESQ, a comparison of the subjective assessments and predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although Hoene et al.'s research [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either during its design and integration phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in a very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and the impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that, whilst some relationship exists, it is both much less noticeable and less characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to subjective scores (insignificant), except for Female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research, and which are worthy of further investigation, include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though the difference was not statistically significant. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results, but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual Service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] I. D. C. D. S. of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D.123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v.1.0. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.


tion with PLC applied to the inserted silence. As such, the pause and jump impairments are of most interest, as they involve temporal discontinuity and thus most closely reflect the type of impairment caused by per-talkspurt playout strategies. However, one key distinction between Voran's work and the operation of per-talkspurt strategies is that he applied all impairments (pause/loss/jump) to active speech segments. It is important to note that his pause impairment, which introduced a silence gap within active speech, was then compensated for through PLC, and his jump impairment essentially removed a segment of active speech as if it never existed. The impact of both the magnitude and the frequency of each of the three impairments was examined independently, as well as that of a combination of pause and jump. Impairments were added at random locations within G.723-encoded active speech. From [37], his main findings are summarised as follows:

• For a given frequency and magnitude of impairment, the impact of the four impairments (loss, pause, jump, pause and jump) on MOS scores was found to be roughly similar.

• As the magnitude of impairment increased, the reported MOS scores decreased at an almost linear rate. For example, at a frequency of one impairment per 100 frames, a 30/60/120 ms pause impairment resulted in the MOS score dropping by 0.21/0.41/1.15, respectively.

• As the frequency of impairment increased, the reported MOS scores decreased at a non-linear rate.

In addition to the above findings, he showed that for a very large and noticeable single adjustment (430 ms silence removal), PESQ failed to register any impact. This to some extent agrees with Hoene et al.'s analysis, in that adjustments within silences of up to approx. 320 ms are ignored by PESQ.

As emphasised earlier, both Hoene et al.'s and Voran's detailed work introduced the impairments throughout active speech (talkspurts) only. As such, the results cannot be directly compared with per-talkspurt playout strategies, where the temporal adjustment impairments only occur during silences (i.e. the silence period is contracted or elongated). In particular, Voran's additional finding regarding the very noticeable 430 ms temporal adjustment (silence removal), coupled with both Hoene et al.'s and Voran's findings that PESQ ignores such large single adjustments during silences, strengthened the argument that both PESQ and POLQA need to be tested for the impact of frequent, though smaller, playout delay adjustments more typical of VoIP, and thus prompted us to undertake this research.

2.3. Research motivation

As outlined thus far, the literature to date has not quantified, either objectively or subjectively, the precise impact of multiple small silence period adjustments typical of VoIP applications on quality perceived by the end user, expressed by MOS values. As described above, research by Voran and Hoene et al. makes some contribution in this area. They both outlined, firstly, that the PESQ scores were not impacted by very large and noticeable single adjustments during silences. Secondly, both showed that significant adjustments during active speech did impact on subjective results (MOS-LQS), though these are quite different to silence period adjustments.

Considering all this our primary research motivationwas to address the gap in the literature by assessing the im-pact of frequent and small silence period adjustments onlistening quality perceived by the end user both throughobjective and subjective tests In particular we identifieda number of key research questions that we wished to an-swer namely

1. What impact do frequent and small silence period adjustments have on subjective listening MOS scores?

2. What impact do frequent and small silence period adjustments have on objective listening MOS scores, specifically those predicted by both PESQ and POLQA?

3. Can the PESQ and/or POLQA model correctly predict the impact of frequent and small silence period adjustments on listening quality perceived by the end user, as quantified by a subjective test? If so, how accurate are those predictions?

Further questions include:

4. What relationship exists between the magnitude of adjustments and objective and subjective listening MOS scores?

5. What impact does the position of adjustments within speech samples have on objective and subjective listening MOS scores?

3. Methodology

In this section, we describe the methodology used to generate the speech samples for both objective and subjective tests and then provide details of the testing process. A custom-built Matlab-based simulator was developed and used to generate playout adjustments (as depicted in Figure 1). The overall methodology comprised a number of stages, as follows:

• Generate a series of network packet delays consistent with varying network conditions.

• Using these delays and simulated voice patterns (talkspurt distribution), applied to different playout algorithms, generate a series of playout adjustments.

• Apply this set of adjustments to different locations within reference speech samples.

Using this set of degraded and reference speech samples, we carried out both an ITU-T standardised subjective listening test and objective tests, the latter using both PESQ and POLQA. This facilitated a comparison between both approaches and between PESQ and POLQA. The process of generating adjustments is described in Section 3.1. That section starts with a description of the playout adjustment simulator and ends with the simulation work flow and outputs. The process of applying adjustments to speech is described in Section 3.2. Details related to the actual testing are given in Section 3.3.

3.1. Playout Adjustment Generation

In order to assess, both objectively and subjectively, the impact on listening quality perceived by the end user of silence period adjustments typical of VoIP applications, a detailed simulator was built to generate such adjustments. The overall objectives were to:

• Generate a sequence of VoIP packets V with a talkspurt distribution typical of real speech.

• Generate a number n of network delay sequences D, where each sequence represents different network conditions, i.e. nD delay values.

• For each of the n delay sequences D, generate a series of playout adjustments A that would result from applying this sequence to the VoIP packets V using typical playout algorithms.

Figure 1 depicts the playout adjustment simulator, which was implemented using Matlab. The simulator was built by one of the authors for previous research, as outlined in [38]. As can be seen in Figure 1, the simulator consists of three separate module blocks, namely:

• Voice Simulator Block
• Delay Simulator Block
• Playout Algorithm Simulator Block

Each block is described in the following subsections.

3.1.1. Voice simulator block

The simulated voice streams were based on live speech samples. The critical factor here is the distribution of talkspurts, which were extracted directly from voice tests into text files and used to reproduce speech characteristics. This was done by recording normal VoIP speech with VAD enabled and extracting the Marker bits within the RTP packet headers, where '1' indicates the start of a talkspurt and '0' represents an active speech packet within a talkspurt. An array V thus represents the distribution of talkspurt packets from normal speech; e.g. a sample subset of V, [1 0 0 0 0 0 0 1 0 0 0], represents 2 separate talkspurts of duration 7 packets and 4 packets, respectively.
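The marker-bit encoding can be illustrated with a short Python sketch (the authors' simulator was written in Matlab; the function name `talkspurt_lengths` is hypothetical and used only for illustration). It recovers talkspurt durations, in packets, from a marker-bit array such as the example subset of V:

```python
def talkspurt_lengths(marker_bits):
    """Return the length in packets of each talkspurt in a marker-bit sequence.

    A '1' starts a new talkspurt; a '0' is an active speech packet that
    belongs to the talkspurt currently in progress.
    """
    lengths = []
    for bit in marker_bits:
        if bit == 1:
            lengths.append(1)      # marker bit set: a new talkspurt starts here
        elif lengths:
            lengths[-1] += 1       # packet inside the current talkspurt
        # packets seen before the first marker bit are ignored
    return lengths


V = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(talkspurt_lengths(V))  # [7, 4]
```

With a 20 ms packet interval, these two talkspurts would correspond to 140 ms and 80 ms of active speech.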

3.1.2. Delay simulator block

Significant research has focused on modelling of Internet delay and loss characteristics [39, 40, 41, 42, 43, 44]. In [8], Sun and Ifeachor show that, for VoIP, a Weibull distribution models traffic better than exponential or Pareto. Our model is designed to capture the temporal relationship, or burstiness, of delay traffic, which is commonly found and logically follows from the research that has proposed the use of bursty packet loss models. As such, we propose a series of 2-state Markov models to simulate varying network conditions. Figure 2 illustrates its application to delay modelling. The following summarises the most relevant characteristics of the models developed:

Figure 1. Playout adjustment simulator. [Block diagram; block labels: VoIP Tests (talkspurt distribution etc.), Voice Simulator, Markov Delay Models, Packet/Delay Mapping, Playout Algorithm Simulator; signal labels: talkspurt distribution V, delay distribution D, playout adjustment distribution A.]

Figure 2. 2-state Markov delay model.

• BAD/GOOD State Jitter Level: The delay models that were developed used different ranges of jitter to differentiate between GOOD and BAD states. Essentially, a GOOD state had a low jitter metric (set as a % of base delay) and a multiplier was applied to this metric to represent the BAD state.

• BAD State Probability: This represents the percentage of packets that are affected by high delay variance (jitter).

• Average BAD State Burst Length: This determines how the BAD state packets are distributed. Much of the literature on network analysis has reported that both loss and delay/jitter have strong temporal dependency, or burstiness. Where strong temporal dependency of jitter/delay is present, this will result in clusters of BAD state packets, i.e. BAD delay/jitter bursts spanning more than one packet. Longer BAD bursts will be reflected in higher values for PBB from Figure 2.

• Using these values, we can derive values for all 4 probabilities:
  – PBB: Probability that packet n+1 will have high jitter (BAD state), given that packet n has high jitter.
  – PBG: Probability that packet n+1 will have low jitter, given that packet n has high jitter, i.e. switch states.
  – PGG: Probability that packet n+1 will have low jitter (GOOD state), given that packet n has low jitter.
  – PGB: Probability that packet n+1 will have high jitter, given that packet n has low jitter, i.e. switch states.

An additional important requirement of the delay block was to ensure that out-of-order packets would not arise. In reality, such events are largely due to route changes and occur infrequently, and thus it was important to reproduce this.
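The text does not spell out how the four transition probabilities follow from the BAD state probability and the average BAD burst length. A minimal Python sketch is given below, assuming the usual two-state (Gilbert-type) relations: a geometrically distributed burst length, so that the mean burst length equals 1/PBG, and a stationary BAD fraction equal to PGB/(PGB + PBG). Both relations and all function names are assumptions made for illustration, not the authors' Matlab code.

```python
import random


def transition_probabilities(p_bad, mean_burst_len):
    """Derive PBB, PBG, PGG and PGB for a 2-state Markov chain.

    p_bad          -- long-run fraction of packets in the BAD state (0 < p_bad < 1)
    mean_burst_len -- average BAD burst length in packets (>= 1)
    Assumes geometric burst lengths and a stationary chain (see lead-in).
    """
    p_bg = 1.0 / mean_burst_len              # leave BAD: mean burst = 1 / PBG
    p_bb = 1.0 - p_bg
    p_gb = p_bad * p_bg / (1.0 - p_bad)      # balance: pi_G * PGB = pi_B * PBG
    if p_gb > 1.0:
        raise ValueError("combination of p_bad and burst length is not feasible")
    p_gg = 1.0 - p_gb
    return {"PBB": p_bb, "PBG": p_bg, "PGG": p_gg, "PGB": p_gb}


def simulate_states(n_packets, probs, seed=0):
    """Generate a GOOD/BAD state label ('G' or 'B') for each packet."""
    rng = random.Random(seed)
    state, states = "G", []                  # starting in GOOD is an assumption
    for _ in range(n_packets):
        states.append(state)
        stay = probs["PBB"] if state == "B" else probs["PGG"]
        if rng.random() >= stay:             # switch state with probability 1 - stay
            state = "G" if state == "B" else "B"
    return states


probs = transition_probabilities(p_bad=0.20, mean_burst_len=10)
states = simulate_states(4000, probs)
```

Under these assumptions, a BAD state probability of 20% with a 10-packet burst length gives PBB = 0.9 and PGB = 0.025. The per-packet jitter drawn in each state, and the reordering guard described above, are left out of the sketch.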


3.1.3. Playout algorithm simulator block

Two per-talkspurt adaptive strategies (namely algorithms 1 and 4 from [1], referred to here as algorithms 1 and 2, respectively) were simulated. Both algorithms utilise linear recursive filters in tracking network conditions, but differ in that algorithm 2 responds more quickly due to different parameters and also includes a spike mode that responds more rapidly to changing network conditions, although it must still wait for the next silence period to do so. Both algorithms adjust playout time accordingly at the start of a talkspurt, as given by

\[ d_i = \alpha d_{i-1} + (1 - \alpha) n_i \tag{1} \]

\[ p_i = t_i + d_i + \beta v_i \tag{2} \]

In the above, i refers to packet i, d_i is the estimated end-to-end delay, α is the filter gain, n_i is the measured delay, p_i is the playout time, t_i is the send time, v_i is the estimated variation in delay and β is a multiplication factor. For example, in [1] the authors choose a β value of 4 for both algorithms, whereas the history factor α was set to 0.875 for algorithm 2 versus 0.998 for algorithm 1. The choice of parameters α, β and the spike detection threshold (for algorithm 2) impacts greatly on the performance of these algorithms, and they are usually tuned to match precise network conditions, i.e. from stable to unstable. For this reason, we utilised a range of values, as described in the following section. As described earlier, adjusting on a per-talkspurt basis maintains the integrity of speech within talkspurts whilst altering the inter-talkspurt silence periods.

The delays from the Delay Block are mapped to the Voice talkspurt distribution series V and applied to the various playout algorithms. The delays are applied to each packet in turn and processed by the playout algorithm, and adjustments are made at the start of each talkspurt, indicated by a '1' in the V array. This then generates a series of playout adjustments A. As outlined in Section 2.2, the extent of adjustment is dependent not only on the network condition and playout algorithm but also on the specific VAD settings of the VoIP application, as the latter will greatly impact on the talkspurt distribution for a given speech segment. In any event, the resulting required adjustment (silence) can be added or removed at this point, before the next talkspurt is processed.
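The per-talkspurt procedure can be sketched in Python as follows. The sketch applies Eq. (1) to every packet and evaluates the playout offset p_i − t_i of Eq. (2) only at talkspurt starts. Three things are assumptions that the text does not state: the variation estimate is updated with the same filter, v_i = α v_{i−1} + (1 − α)|d_i − n_i|; the estimates are seeded from the first packet; and an adjustment is defined as the change in playout offset between consecutive talkspurts. The spike mode of algorithm 2 is omitted, and the function name is hypothetical.

```python
def playout_adjustments(marker_bits, delays_ms, alpha=0.998, beta=4.0):
    """Per-talkspurt playout adjustments (ms), following Eqs. (1) and (2).

    marker_bits -- 1 at the first packet of a talkspurt, 0 otherwise (array V)
    delays_ms   -- measured network delay n_i of each packet (sequence D)
    Returns one value per talkspurt after the first; a positive value means
    the preceding silence period is lengthened, a negative value shortened.
    """
    d = float(delays_ms[0])   # delay estimate d_i, seeded with n_0 (assumption)
    v = 0.0                   # variation estimate v_i, seeded with 0 (assumption)
    current_offset = None     # playout offset p_i - t_i of the running talkspurt
    adjustments = []
    for marker, n in zip(marker_bits, delays_ms):
        d = alpha * d + (1.0 - alpha) * n            # Eq. (1)
        v = alpha * v + (1.0 - alpha) * abs(d - n)   # assumed estimator for v_i
        if marker == 1:                              # playout time changes only here
            offset = d + beta * v                    # p_i - t_i from Eq. (2)
            if current_offset is not None:
                adjustments.append(offset - current_offset)
            current_offset = offset
    return adjustments


V = [1, 0, 0, 0, 0, 0, 1]
D = [50, 50, 50, 80, 80, 80, 80]
print(playout_adjustments(V, D, alpha=0.875))   # one positive adjustment
```

A smaller α (such as 0.875) makes the estimate follow recent delays more closely, which is the faster response attributed to algorithm 2 above.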

3.1.4. Simulation work flow and outputs

The overall simulator works as follows. The user first specifies a talkspurt distribution file, which is extracted from live speech and loaded as an array V. The user then specifies the network conditions for the test. The delay block returns a sequence of network delays D corresponding to those conditions. The input parameters are:

1. Number of packets
2. Packet interval (ms)
3. Base delay (ms)
4. BAD state burst length (ms)
5. BAD state probability (%)
6. GOOD state jitter (% of base delay)
7. BAD state jitter multiplier

For our testing, parameters 1–4 were kept constant, while parameters 5–7 were varied as outlined below to give different network characteristics ranging from stable to unstable:

1. Number of packets = 4000
2. Packet interval = 20 ms
3. Base delay = 50 ms
4. BAD state burst length = 10 packets (200 ms)
5. BAD state probability = 20%, 40%, 60%, 80%
6. GOOD state jitter = 25%, 50%
7. BAD state jitter multiplier = 2, 3, 4
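As a quick check on the size of this parameter space, the varied values can be enumerated in Python. The dictionary keys and function name are hypothetical; the values are those listed above, with percentages written as fractions.

```python
from itertools import product

# Parameters 1-4: held constant across all network conditions
FIXED = {
    "n_packets": 4000,
    "packet_interval_ms": 20,
    "base_delay_ms": 50,
    "bad_burst_len_packets": 10,
}

# Parameters 5-7: varied to move from stable to unstable conditions
BAD_STATE_PROBABILITY = [0.20, 0.40, 0.60, 0.80]
GOOD_STATE_JITTER = [0.25, 0.50]          # fraction of base delay
BAD_JITTER_MULTIPLIER = [2, 3, 4]


def network_conditions():
    """Return one parameter dictionary per network condition."""
    return [
        dict(FIXED, bad_prob=p, good_jitter=g, bad_multiplier=m)
        for p, g, m in product(
            BAD_STATE_PROBABILITY, GOOD_STATE_JITTER, BAD_JITTER_MULTIPLIER
        )
    ]


print(len(network_conditions()))  # 24
```

The 4 × 2 × 3 = 24 combinations match the 24 network delay models reported for the simulator tests.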

Note that in arriving at these parameters, particularly those relating to jitter, a detailed series of delay and jitter tests was undertaken, measuring delay/jitter between Ireland and the US/mainland Europe. Further details can be found in [38]. More recent testing by [45, 46] has highlighted the particular problem of very high jitter/delay in congested IEEE 802.11 networks. The final step was to map the delays to packets and apply them to adaptive jitter buffering (AJB) algorithms 1 and 2. This generates a series of playout adjustments A for every network test condition and playout algorithm. Figure 1 summarises the overall flow within the simulator.

In [47], details of the comprehensive tests using the simulator are presented. In summary, 24 different network delay models were used, generating 24 network delay sequences D. These delay values were fed to 2 different playout algorithms, as described in Section 3.1.3. For each playout algorithm, tests were repeated using different parameters such as α (history weighting, varied from 0.8 to 0.998), β (jitter multiplier, varied from 4 to 6) and spike mode threshold (algorithm 2 only); see again Section 3.1.3 for details. Each combination resulted in a distinct set of playout adjustments A for each test scenario. Note that the voice samples V were based on 80 seconds of active speech with 40 talkspurts (Marker bit = 1), whereas the speech samples chosen for this experiment were 8 seconds long; thus, this also had to be factored in. Essentially, a pro-rata approach was taken, in that for the 80 seconds of speech used for tests there were 40 playout adjustments, so for our 8-second ITU-T speech samples we implemented 4 adjustments. Arising from the full range of test combinations described above, which numbered 96, and the resulting adjustments, a subset of 12 sets of playout adjustments containing 4 adjustments each was taken to represent a spectrum of network conditions, ranging from a stable network to an unstable network. Table I lists the actual playout adjustments selected that were applied to the speech samples.

3.2. Speech samples

As is normal for quality testing, 4 reference speech samples were used. The English subset of the ITU-T P-series Supplement 23 [48] database was used for speech material, each sample consisting of a pair of utterances with a small pause between the utterances. Two male and two female speakers uttering different sentences were included in the stimuli. The speech samples used (source samples available in the database) were 8 seconds in length and stored as 16-bit, 8000 Hz linear PCM.

Figure 3. Demonstration of how playout adjustments were applied to the 1st Female speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 4. Demonstration of how playout adjustments were applied to the 1st Female speech sample - Variant B. Arrows indicate where adjustments were placed.

Figure 5. Demonstration of how playout adjustments were applied to the 2nd Female speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 6. Demonstration of how playout adjustments were applied to the 2nd Female speech sample - Variant B. Arrows indicate where adjustments were placed.

Each speech sample was modified by inserting and removing silence periods to reflect the adjustments, as specified above. The adjustments were a mix of positive and negative adjustments (adding and removing silence periods), as shown in Table I.

As a further experimental variable, the set of four adjustments was applied to each sample in two different locations (referred to hereafter as variants A and B). The only distinction between variants A and B is that the impairments in variant B were applied in the latter part of each sample. All the adjustments were made using a free sound editor. Figures 3–10 illustrate how the 4 playout adjustments were applied to all speech samples involved in the experiment in 2 different places (variants A and B).

Overall, this process created 96 speech samples (4 voices x 12 test conditions x 2 variants).
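The authors made the adjustments manually in a sound editor. Purely as an illustration of the same operation in code, the Python sketch below lengthens or shortens silence at chosen sample indices of an 8000 Hz PCM signal. The function name and the example positions are hypothetical, and the sketch assumes that every position lies inside a silence period long enough to absorb any removal.

```python
def apply_adjustments(samples, positions, adjustments_ms, fs=8000):
    """Insert (positive) or remove (negative) silence at the given positions.

    samples        -- list of PCM sample values
    positions      -- sample indices in the original signal, inside silences
    adjustments_ms -- one signed adjustment in ms per position
    fs             -- sampling rate in Hz
    """
    out = list(samples)
    # Work from the last position backwards so earlier indices stay valid.
    for pos, adj_ms in sorted(zip(positions, adjustments_ms), reverse=True):
        n = round(abs(adj_ms) * fs / 1000)   # adjustment length in samples
        if adj_ms > 0:
            out[pos:pos] = [0] * n           # lengthen the silence period
        elif adj_ms < 0:
            del out[pos:pos + n]             # shorten the silence period
    return out


# Test condition No. 10 from Table I on a dummy 8-second signal
signal = [0] * (8 * 8000)
degraded = apply_adjustments(signal, [8000, 24000, 40000, 56000], [5, 10, -30, 15])
print(len(degraded))  # 64000
```

The four adjustments of condition No. 10 sum to zero, so the overall duration is unchanged; at 8000 Hz, a 30 ms adjustment corresponds to 240 samples.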

3.3. Speech quality assessment

The speech quality assessment process was divided into two parts, namely subjective assessment (listening test) and objective assessment (using the PESQ and POLQA models). Both assessment procedures are described in more detail below.

The ACR subjective listening test was performed in May 2012 in accordance with ITU-T Recommendation P.800 [15]. In every case, up to 2 listeners were seated in a small listening room (acoustically treated) with a background noise well below 20 dB SPL (A). Altogether, 30 naïve (non-expert) listeners (16 male, 14 female, aged 20–55 years, mean 34.43 years) participated in the test. All subjects were Irish nationals whose first language was English. The subjects were remunerated for their efforts. The samples (96 degraded samples + 4 reference samples) were played out using high-quality studio equipment in a random order and diotically presented over Sennheiser HD 455 headphones (presentation level 73 dB SPL (A)) to the test subjects. The opinion scores, from 1 (bad) to 5 (excellent), were averaged to obtain MOS-Listening Quality Subjective narrowband (MOS-LQSn) values for each sample.

Figure 7. Demonstration of how playout adjustments were applied to the 1st Male speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 8. Demonstration of how playout adjustments were applied to the 1st Male speech sample - Variant B. Arrows indicate where adjustments were placed.

Figure 9. Demonstration of how playout adjustments were applied to the 2nd Male speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 10. Demonstration of how playout adjustments were applied to the 2nd Male speech sample - Variant B. Arrows indicate where adjustments were placed.

In the next step, the 100 samples (the 96 degraded samples (test conditions No. 1–12) and the 4 reference samples (Ref. test condition)) were compared to their respective reference samples using both the PESQ model described in ITU-T Rec. P.862 [22, 23, 27] and the POLQA model described in ITU-T Rec. P.863 [24, 25, 28], in order to obtain objective listening quality scores. In the case of PESQ, the output (raw PESQ scores) was converted to MOS-Listening Quality Objective narrowband (MOS-LQOn) values by the equation defined in [49].
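The mapping equation itself is not reproduced in the text. The Python sketch below assumes that it is the narrowband mapping function published in ITU-T Rec. P.862.1, which is the standard conversion from raw P.862 scores to MOS-LQO; whether [49] is that Recommendation cannot be confirmed from this section. The function name is hypothetical.

```python
import math


def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw P.862 PESQ score to MOS-LQO (narrowband).

    Uses the logistic mapping of ITU-T Rec. P.862.1 (assumed to be the
    equation referred to as [49]).
    """
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


# The maximum raw PESQ score of 4.5 maps to roughly 4.55 MOS-LQO
print(round(pesq_raw_to_mos_lqo(4.5), 2))  # 4.55
```

This upper bound of about 4.55 is consistent with the highest PESQ scores reported in the results.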

4. Experimental results

In this section, we present experimental results for both subjective and objective assessment (PESQ and POLQA models), as well as a detailed analysis and comparison of both.


4.1. Experimental results for subjective assessment

In Figure 11, we summarise the results of the subjective listening test, averaged over the 4 different voices involved in the experiment, as described in Section 3. It can be seen that the impact of test conditions No. 1–12 (playout adjustments) on average MOS-LQSn scores relative to the reference samples is quite limited. On the other hand, we can also see that the subjects gave surprisingly low MOS-LQS scores to all samples, including the reference samples involved in the test, i.e. the average values oscillate around 3.6 MOS. This value is quite low considering that the samples contain either no degradations (reference samples) or very moderate degradations in a narrowband context. This result warrants further analysis beyond the scope of this paper, but one possibility is that the subjects' opinions have been affected by their previous long-term experience with wideband telephony (wideband speech), though this was not validated. Such experience would alter their internal reference to wideband speech (extended frequency range resulting in higher speech quality) and thus explain the lower scores given to the narrowband samples involved in the test. One other possibility relates to the fact that the range of conditions (impairments) introduced into the speech samples was quite limited. This issue is discussed in more detail in [50].

As is evident, the impact of varying the extent of playout adjustments across all test conditions was very small (insignificant). However, one characteristic of note that emerged is the small impact of the location of the adjustments on scores, i.e. most of the test conditions using variant B obtained slightly lower scores than the same conditions with variant A. The biggest difference (0.2 MOS) between the two investigated variants (location impact) was obtained for test condition No. 10. As discussed and evident from Figures 3–10, variant B adjustments were designed to be towards the end of the sample, so one explanation is that the distortions presented closer to the end of the sample were a bit more annoying for the subjects than those presented in the first half of the sample. In principle, this result has echoes of the so-called recency effect reported in the literature (e.g. [51, 52, 53]). However, such tests used samples longer than 60 seconds. In any event, and as discussed later, the differences are not significant in the context of the confidence interval.

One three-way analysis of variance (ANOVA) test was conducted on the subjective results using test condition, voice and variant (location) as fixed factors (Table V). It should be noted here that the voice factor is a combination of voice and content. The highest F-ratio was determined for the voice factor (F = 73.12, p < 0.001), i.e. the effect of voice was found to be highly statistically significant. The test condition factor appeared to have a weaker effect on quality than the voice factor (F = 0.82, p = 0.635) and was not statistically significant. The last factor investigated in the ANOVA test was the variant factor, which also turned out to have a weaker effect on quality than the voice factor and, similar to the test condition factor, not to be statistically significant (F = 1.07, p = 0.3032). Regarding interactions of all the involved factors, the results show that none of them is statistically significant.

Table I. Playout adjustments (in ms) applied to speech samples. Σ: absolute sum of adjustments.

Test condition   1st   2nd   3rd   4th    Σ
Ref                0     0     0     0    0
1                  2    −2     3    −3   10
2                  4    −4    −4     4   16
3                  3    −3    −6     6   18
4                  5    −5    −5     5   20
5                  3    −6    −7    10   26
6                 16   −12    −8     4   40
7                 10   −17    −6    13   46
8                 10   −15   −10    15   50
9                  8   −23    −3    18   52
10                 5    10   −30    15   60
11               −15    15   −15    15   60
12               −25    22    −8    11   66

Figure 11. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQSn. The vertical bars show the 95% CI computed over 120 MOS-LQSn values (30 subjective scores per sample for 4 samples). [Plot: x-axis, test conditions Ref and No. 1–12; y-axis, MOS-LQSn from 1 to 5; series, Variant A and Variant B.]

To summarise, the results of the ANOVA test revealed that subjects were more sensitive to the voice than to the test conditions and variants, and no statistically significant interactions between the investigated factors were found (assuming no impact of content due to carefully chosen speech samples). It should be noted here that variability caused by content is considered one of the sampling factors, as defined in the Handbook of subjective testing practical procedures [16]. The ANOVA test also revealed that the small differences between quality scores of variants A and B reported previously are not statistically significant. In other words, the results of the ANOVA test support our assertion that the differences are not significant in the context of the confidence interval.

As is often reported in the literature, some impact relating to the voice effect was expected in this experiment, but not to such an extent. A diagnostic analysis of the test data revealed that one of the voices (1st male) was liked more than the others (i.e. over all conditions, this voice was rated on average approx. 0.4 MOS-LQSn higher than the second male voice and approx. 0.6 MOS-LQSn higher than the female voices). It is also worth noting that both male voices obtained higher scores than the female voices.

As can be seen above, the impact of playout adjustments is not statistically significant. This fact raises the question of whether the impact would be higher if the Degradation Category Rating (DCR) testing approach were deployed. We considered this during the research design phase. We carried out limited DCR subjective testing (4 experts involved) and the subjects noted no degradation caused by time-shifting (playout adjustments). On this basis, we came to the conclusion that the introduced impairments would not be noticed by subjects in a DCR test. For that reason, we decided to use an ACR test.

Furthermore, it should be noted that in telephony, subjects have no access to the original (undegraded) speech from their conversational partner, and thus ACR testing is commonly used in telephony speech quality assessment.

To the best of our knowledge, the results presented in this section are the first confirmation of the assertion published in the literature [1, 2, 10] that small and frequent silence period adjustments typical of VoIP playout algorithms do not have a noticeable effect on listening quality perceived by the end user.

It is interesting at this juncture to compare our results with the subjective results of Voran [11] and Hoene et al. [6]. Voran introduced pause/jump impairments at the rate of 1 to 4 per 3-second sentence, with pause/jump magnitudes of 30, 60 and 120 ms. We introduced 4 impairments in an 8-second sample, based on our observations of actual speech and using adjustments which were derived from realistic network delay models and real playout algorithms. As a result, our adjustments were typically much smaller (the largest was 30 ms). Voran noted that for 1 pause/jump adjustment of 30 ms, MOS scores fell by 0.2, whereas we found no significant drop in MOS scores as the extent of adjustments increased, even for conditions 10–12, where the magnitude of some of the adjustments was similar to Voran's at 20–30 ms. One key distinction, as stated before, is that Voran applied such impairments to active speech, whereas all our adjustments were made to silence periods. Furthermore, in the case of Voran's pause impairment, PLC was used. Finally, Voran reported a significant impact of a very large single adjustment (silence removal) of 430 ms. The magnitude of adjustments introduced in our samples was much smaller and, as reported above, no significant subjective impact was found. Hoene et al. in [6] introduced very large (in comparison to adjustments introduced by jitter buffers) single adjustments into active speech and also noted a significant impact, consistent with PESQ.

4.2. Experimental results for objective assessment

Figure 12 depicts the results of the objective assessment performed by PESQ (MOS-LQOn (PESQ)) using the same test conditions and speech samples. We can observe that the severity of the test conditions (playout adjustments) has a relatively big impact on the predicted MOS values. In summary, the MOS scores decrease as adjustments increase, i.e. as network instability increases. This is interesting in the context of the finding by Voran [11] that PESQ does not register any impact arising from a single 430 ms silence period adjustment (silence removal). The results of the PESQ analysis by Hoene et al. published in [5] are somewhat similar to Voran's, showing no impact for single adjustments of up to approx. 320 ms. One possible explanation is that in our tests we introduced frequent and small adjustments, rather than the single and large adjustments introduced by Voran and Hoene et al. We speculate that PESQ had difficulties with our adjustment profile during the time-alignment process. As detailed in Section 2.2, the frequency and extent of adjustments are impacted greatly by VAD settings, but our test design of 4 adjustments in 8 seconds is not atypical of VoIP.

Figure 12. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by PESQ. The vertical bars show the 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: x-axis, test conditions Ref and No. 1–12; y-axis, MOS-LQOn (PESQ) from 1 to 5; series, Variant A and Variant B.]

One positive correlation between PESQ and the subjective test results is that the biggest difference between variants A and B (0.47 MOS) was reported for test condition No. 10. It should also be noted that there is a significant difference between the scores obtained for conditions No. 10 and 11. This is interesting in that, whilst the absolute sum of the adjustments for both conditions is the same, the individual adjustments are different. Test condition No. 10 represents very different adjustments, varying from 5 to −30 ms. On the other hand, condition No. 11 contains adjustments of similar magnitude: two adjustments of −15 ms and two adjustments of 15 ms. The second one (condition No. 11) obtained much lower scores in both cases (variants A and B). As such, it seems that PESQ is better able to cope with large adjustments (30 ms was the largest of all conditions). In contrast, the opposite results were obtained for the subjective data, i.e. listeners scored test condition No. 10 lower than No. 11. However, the differences between the subjective results obtained for test conditions No. 10 and 11 are much smaller than those obtained for the objective results and of the same order as the confidence interval (see Figure 11).

As with the subjective results, lower MOS-LQOn (PESQ) scores were reported for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by PESQ ranged from 3.665 to 4.55 MOS for variant A and from 3.3 to 4.55 MOS for variant B. It can be clearly seen in Figure 12 that the impact of adjustment location is noticeable for objective scores predicted by PESQ, especially for higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by PESQ revealed that the voice impact was much weaker than that reported above for the subjective data. In fact, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. Because of that, such models are sometimes called impairment or degradation models. However, it seems that there is some interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, because the deviation of the MOS-LQOn (PESQ) scores between the test conditions is different for all 4 voices involved in the test. A much weaker, and not statistically significant, interaction of test condition and voice was obtained for the subjective data, as shown in Table V.

Figure 13 shows the results of the objective assessment performed by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of the POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of the PESQ predictions. Nonetheless, it seems that POLQA was impacted more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No. 7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend in the POLQA scores, as has been done for PESQ above.

Regarding the biggest difference between variants A and B for test condition No. 10, reported above for both subjective scores and objective scores predicted by PESQ, it is worth noting that this effect was not captured by POLQA at all. In other words, the scores predicted by POLQA for variants A and B of test condition No. 10 were very similar (0.02 MOS difference).

Moreover, the scores predicted by POLQA largely exhibit the same behaviour as the scores obtained from the subjective test from a location perspective (variants A and B). In other words, it has been reported above that PESQ and the subjects involved in the subjective test provided lower MOS scores for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by POLQA ranged from 4.10 to 4.43 MOS for variant A and from 4.08 to 4.43 MOS for variant B. This contrasts with the results for PESQ, where adjustment location had a significant impact. It can be clearly seen in Figure 13 that the adjustment location plays a less important role here, except for some of the higher magnitudes of the investigated playout adjustments.

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show the 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: x-axis, test conditions Ref and No. 1–12; y-axis, MOS-LQOn (POLQA) from 1 to 5; series, Variant A and Variant B.]

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact was much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. The interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, was also observed for the POLQA model, but only for female voices.

4.3. Comparison between subjective and objective quality scores

In the following subsection subjective MOS values(MOS-LQSn) are compared to the predictions provided byboth PESQ and POLQA (MOS-LQOn (PESQ POLQA))The comparison is performed for all experimental condi-tions ie all combinations of voice test conditions andboth investigated location variants However the MOS-LQSn values will have been influenced by the choice ofconditions in the actual experiment In order to account forsuch influences model predictions are commonly trans-formed to a range of conditions that are part of the respec-tive test [54] This may be done for example by using amonotonic 3rd order mapping function presuming such afunction can be foundThe performance of PESQ and POLQAmodels is quan-

tified in terms of the Pearson correlation coefficient Rthe respective root mean square error (rmse) and epsilon-insensitive root mean square error (rmselowast) as [55 56]

$$R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \tag{3}$$


and

$$\mathrm{rmse} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} (X_i - Y_i)^2} \tag{4}$$

with $X_i$ the subjective MOS value for stimulus i, $Y_i$ the objective (predicted) MOS value for stimulus i, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, N the number of stimuli considered in the comparison, and d the number of degrees of freedom provided by the mapping function (d = 4 in the case of a 3rd order mapping function, d = 1 in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

$$\mathrm{Perror}_i = \max\left(0,\; |X_i - Y_i| - \mathrm{ci95}_i\right) \tag{5}$$

where $\mathrm{ci95}_i$ represents the 95% confidence interval and is defined by [56]

$$\mathrm{ci95}_i = t(0.05, M)\,\frac{\delta_i}{\sqrt{M}} \tag{6}$$

where M denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of subjective scores for stimulus i. The final epsilon-insensitive root mean square error is calculated as usual, but based on Perror from formula (5):

$$\mathrm{rmse}^{*} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} \mathrm{Perror}_i^{2}} \tag{7}$$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* considers only differences related to an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of MOS is taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and the rmse and rmse* = 0.0.
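The three measures in formulas (3) to (7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: the function names are hypothetical, the 95% confidence interval of each subjective score is passed in precomputed (so the t-distribution lookup of formula (6) is left to the caller), and the absolute difference is used inside Perror.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, formula (3)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den

def rmse(x, y, d=1):
    """Root mean square error, formula (4); d = 1 means no regression."""
    n = len(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / (n - d))

def rmse_star(x, y, ci95, d=1):
    """Epsilon-insensitive rmse, formulas (5) and (7).

    ci95 holds one precomputed 95% confidence interval per stimulus
    (assumption: the caller has already evaluated formula (6)).
    """
    n = len(x)
    perror = [max(0.0, abs(a - b) - c) for a, b, c in zip(x, y, ci95)]
    return math.sqrt(sum(p * p for p in perror) / (n - d))
```

Differences that fall inside the confidence interval contribute nothing to rmse*, which is why rmse* is never larger than rmse for the same data and the same d.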

All R, rmse and rmse* are calculated for the raw (non-regressed) MOSn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined), and both (the regressed and the non-regressed MOSn predictions) are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*)

[Figure 14: scatter plot of MOS-LQOn (PESQ) against MOS-LQSn, both on a 1 to 5 scale, for Variant A and Variant B.]

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed) per sample.

[Figure 15: scatter plot of MOS-LQOn (POLQA) against MOS-LQSn, both on a 1 to 5 scale, for Variant A and Variant B.]

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed) per sample.

[Figure 16: scatter plot of MOS-LQOn (POLQA) against MOS-LQSn, both on a 1 to 5 scale, for Variant A and Variant B.]

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed) per sample.

are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (Variant A and B). The correlation coefficient is positive, though very low (R = 0.17), for variant A and negative for variant B (R = −0.15). As normal, positive correlation


Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant   R       rmse    rmse*
A         0.17    0.639   0.367
B         -0.15   0.690   0.424

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant   R       rmse    rmse*
A         0.30    0.783   0.448
B         0.47    0.806   0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant   R       rmse    rmse*
A         0.30    0.265   0.243
B         0.47    0.270   0.217

indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases, the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

The 3rd order regression, as recommended in [54], leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outliers influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried the 2nd and 1st order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this dataset.

Figure 17 presents the results broken down by speaker

and variant. The subjective scores confirm again that there was little difference between location variants intra-speaker, significant inter-speaker variability and very little intra-speaker variation across conditions No.1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No.1–12 and significant inter-speaker variation.

Regarding the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.

[Figure 17: three panels (Subjective results, PESQ, POLQA) plotting MOS-LQS or MOS-LQOn, on a 3 to 5 scale, against Speaker/Variant (M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B).]

Figure 17. Dominant Experimental Factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals.

Moving to POLQA, results show little variation intra-speaker across variants (except for Female 1), very little intra-speaker variation across conditions No.1–12 (except Female 1) and much less variation inter-speaker. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No.1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in both Tables III and IV, and the rmse data for POLQA (after regression (1st order polynomial regression applied)) is also much better than for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.


Table V. Summary of ANOVA test conducted on the MOS-LQSn's.

Effect                SS        df     MS        F       p
Test condition (TC)   9.23      12     0.7693    0.82    0.6350
Voice                 207.02    3      69.0073   73.12   0.0000
Variant               1.01      1      1.0051    1.07    0.3022
TC × Voice            18.21     36     0.5059    0.54    0.9895
TC × Variant          4.68      12     0.3899    0.41    0.9593
Voice × Variant       0.68      3      0.2282    0.24    0.8672
Error                 2880.37   3052   0.9438
Total                 3121.2    3119

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12) and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may also have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ/POLQA)). Moreover, the accuracy of both the PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much less. Note that both Voran's and Hoene et al.'s research showed that single adjustments (430ms in the case of Voran, 0-320ms (approximately) in the case of Hoene et al.) were found to be disregarded by PESQ. Regarding question 3 and PESQ, a comparison of the subjective assessments and predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although Hoene et al.'s research [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either during its design and integration phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and the impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that whilst some relationship exists, it is both much less noticeable and characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to subjective scores (insignificant), except for female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research and which are worthy of further investigation include:

• Why in the subjective test did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though not statistically significant. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results, but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google, Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual Service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA): The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA): The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and fec adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on circuits and systems for video technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in copying with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] I D C D S of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D.123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v1.0. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.


flow and outputs. The process of applying adjustments to speech is described in Section 3.2. Details related to the actual testing are given in Section 3.3.

3.1. Playout Adjustment Generation

In order to assess, both objectively and subjectively, the impact on listening quality perceived by the end user of silence period adjustments typical of VoIP applications, a detailed simulator was built to generate such adjustments. Overall objectives were to:

• Generate a sequence of VoIP packets V with a talkspurt distribution typical of real speech.

• Generate a range n of network delay sequences D, where each range represents different network conditions, i.e. nD delay values.

• For each of the n delay sequences D, generate a series of playout adjustments A that would result from applying this sequence to the VoIP packets V using typical playout algorithms.

Figure 1 depicts the playout adjustment simulator, which was implemented using Matlab. The simulator was built by one of the authors for previous research, as outlined in [38]. As can be seen in Figure 1, the simulator consists of three separate module blocks, namely:

• Voice Simulator Block
• Delay Simulator Block
• Playout Algorithm Simulator Block

Each block is described in the following subsections.

3.1.1. Voice simulator block

The simulated voice streams were based on live speech samples. The critical factor here is the distribution of talkspurts, which were extracted directly from voice tests into text files and used to reproduce speech characteristics. This was done by recording normal VoIP speech with VAD enabled and extracting the Marker bits within the RTP packet headers, where '1' indicates the start of a talkspurt and '0' represents an active speech packet within a talkspurt. An array V thus represents the distribution of talkspurt packets from normal speech; e.g. a sample subset of V, [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], represents 2 separate talkspurts of duration 7 packets and 4 packets respectively.
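The Marker-bit encoding described above can be illustrated with a short Python sketch that recovers talkspurt durations from such an array. The function name is hypothetical and the original simulator was written in Matlab, so this is only an illustration of the encoding, not the authors' code.

```python
def talkspurt_lengths(markers):
    """Return the length in packets of each talkspurt in a Marker-bit array.

    A 1 starts a new talkspurt; a 0 is a further packet of the current one.
    Packets seen before the first 1 are ignored.
    """
    lengths = []
    for bit in markers:
        if bit == 1:
            lengths.append(1)      # start of a new talkspurt
        elif lengths:
            lengths[-1] += 1       # packet inside the current talkspurt
    return lengths

# The sample subset of V quoted in the text.
V = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(talkspurt_lengths(V))  # [7, 4]
```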

3.1.2. Delay simulator block

Significant research has focused on modelling of Internet delay and loss characteristics [39, 40, 41, 42, 43, 44]. In [8], Sun and Ifeachor show that for VoIP, a Weibull distribution models traffic better than exponential or Pareto. Our model is designed to model the temporal relationship or burstiness of delay traffic, which is commonly found and logically follows from the research that has proposed the use of bursty packet loss models. As such, we propose a series of 2-state Markov models to simulate varying network conditions. Figure 2 illustrates its application to delay modelling. The following summarises the most relevant characteristics of the models developed:

[Figure 1: block diagram with the blocks VoIP Tests (talkspurt distribution etc.), Voice Simulator, Markov Delay Models, Packet/Delay Mapping and Playout Algorithm Simulator, and the signals talkspurt distribution V, delay distribution D and playout adjustment distribution A.]

Figure 1. Playout adjustment simulator.

Figure 2. 2-state Markov Delay Model.

• BAD/GOOD State Jitter Level: The delay models that were developed used different ranges of jitter to differentiate between GOOD and BAD states. Essentially, a GOOD state had a low jitter metric (set as a % of base delays) and a multiplier was applied to this metric to represent the BAD state.

• BAD State Probability: This represents the percentage of packets that are affected by high delay variance (jitter).

• Average BAD State Burst Length: This determines how the BAD state packets are distributed. Much of the literature on network analysis has reported that both loss and delay/jitter have strong temporal dependency or burstiness. Where strong temporal dependency of jitter/delay is present, this will result in clusters of BAD state packets, resulting in BAD delay/jitter bursts spanning more than one packet. Longer BAD bursts will be reflected in higher values for PBB from Figure 2.

• Using these values, we can derive values for all 4 probabilities:
  – PBB: Probability that packet n+1 will have high jitter (BAD state), given that packet n has high jitter.
  – PBG: Probability that packet n+1 will have low jitter, given that packet n has high jitter, i.e. switch states.
  – PGG: Probability that packet n+1 will have low jitter (GOOD state), given that packet n has low jitter.
  – PGB: Probability that packet n+1 will have high jitter, given that packet n has low jitter, i.e. switch states.

An additional important requirement from the delay block was to ensure that out-of-order packets would not arise; in reality, such events are largely due to route changes and occur infrequently, and thus it was important to reproduce this.
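One way the four transition probabilities could be derived from the BAD state probability and the average BAD burst length is sketched below in Python. This is an assumption about the derivation, not the authors' Matlab code: it treats the BAD state probability as the stationary fraction of BAD packets and the burst length as the mean of a geometric sojourn time, which are standard properties of a 2-state Markov chain. All function names are hypothetical.

```python
import random

def transition_probabilities(p_bad, burst_len):
    """Derive PBB, PBG, PGG, PGB for a 2-state Markov chain.

    p_bad:     stationary fraction of BAD packets, 0 < p_bad < 1 (assumption)
    burst_len: mean BAD burst length in packets, >= 1 (assumption)
    """
    p_bg = 1.0 / burst_len                  # mean BAD sojourn is 1 / PBG
    p_bb = 1.0 - p_bg
    p_gb = p_bad * p_bg / (1.0 - p_bad)     # from p_bad = PGB / (PGB + PBG)
    if not 0.0 <= p_gb <= 1.0:
        raise ValueError("p_bad and burst_len are not jointly feasible")
    return {"PBB": p_bb, "PBG": p_bg, "PGG": 1.0 - p_gb, "PGB": p_gb}

def simulate_states(n_packets, probs, seed=0):
    """Generate a GOOD ('G') / BAD ('B') state label for each packet."""
    rng = random.Random(seed)
    state, states = "G", []
    for _ in range(n_packets):
        states.append(state)
        stay = probs["PBB"] if state == "B" else probs["PGG"]
        if rng.random() >= stay:            # leave the current state
            state = "B" if state == "G" else "G"
    return states

probs = transition_probabilities(p_bad=0.2, burst_len=10)
states = simulate_states(4000, probs)
```

With a BAD state probability of 20% and a burst length of 10 packets, this gives PBG = 0.1 and PGB = 0.025, so BAD packets arrive in long clusters rather than being spread evenly.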


3.1.3. Playout algorithm simulator block

Two per-talkspurt adaptive strategies (namely algorithms 1 and 4 from [1], referred to here as algorithm 1 and 2 respectively) were simulated. Both algorithms utilise linear recursive filters in tracking network conditions, but differ in that algorithm 2 responds more quickly due to different parameters and also includes a spike mode that responds more rapidly to changing network conditions, although it must still wait for the next silence period to do so. Both algorithms adjust playout time accordingly at the start of a talkspurt, as given by

$$d_i = \alpha\, d_{i-1} + (1 - \alpha)\, n_i \tag{1}$$

$$p_i = t_i + d_i + \beta\, v_i \tag{2}$$

In the above, i refers to packet i, $d_i$ is the estimated end-to-end delay, α is the filter gain, $n_i$ is the measured delay, $p_i$ is the playout time, $t_i$ is the send time, $v_i$ is the estimated variation in delay and β is a multiplication factor. For example, in [1] the authors choose a β value of 4 for both algorithms, whereas the history factor α was set to 0.875 for algorithm 2 versus 0.998 for algorithm 1. The choice of parameters α, β and the spike detection threshold (for algorithm 2) impact greatly on the performance of these algorithms and are usually tuned to match precise network conditions, i.e. from stable to unstable. For this reason, we utilised a range of values, as described in the following section. As described earlier, adjusting on a per-talkspurt basis maintains the integrity of speech within talkspurts whilst altering the inter-talkspurt silence periods.

The delays from the Delay Block are mapped to the Voice talkspurt distribution series V and applied to the various playout algorithms. The delays are applied to each packet in turn and processed by the playout algorithm, and adjustments are made at the start of each talkspurt, indicated by a '1' in the V array. This then generates a series of playout adjustments A. As outlined in Section 2.2, the extent of adjustment is dependent not only on the network condition and playout algorithm, but also on the specific VAD settings of the VoIP application, as the latter will greatly impact on the talkspurt distribution for a given speech segment. In any event, the resulting required adjustment (silence) can be added or removed at this point before the next talkspurt is processed.
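The per-talkspurt mechanism can be sketched in Python as follows. This is a minimal illustration of algorithm 1 only (no spike mode), under stated assumptions: the variation estimate uses the recursive form v_i = α v_{i-1} + (1 − α)|d_i − n_i| from [1], the estimates are initialised from the first packet, and the function name and the representation of an adjustment as the change in playout delay between consecutive talkspurts are hypothetical choices made here.

```python
def playout_adjustments(markers, delays, alpha=0.998, beta=4.0):
    """Return the change in playout delay (ms) at each talkspurt start.

    markers: Marker-bit array V (1 = first packet of a talkspurt)
    delays:  measured network delay n_i in ms for each packet
    """
    d = delays[0]        # delay estimate d_i (assumed initial value)
    v = 0.0              # delay variation estimate v_i (assumed initial value)
    current = None       # playout delay in force for the current talkspurt
    adjustments = []
    for marker, n in zip(markers, delays):
        d = alpha * d + (1 - alpha) * n              # formula (1)
        v = alpha * v + (1 - alpha) * abs(d - n)     # variation filter, per [1]
        if marker == 1:
            target = d + beta * v                    # p_i - t_i from formula (2)
            if current is not None:
                adjustments.append(target - current) # silence added or removed
            current = target
    return adjustments
```

A positive value lengthens the silence before the talkspurt and a negative value shortens it; packets inside a talkspurt update the estimates but never change the playout delay, which is what preserves the integrity of the speech.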

3.1.4. Simulation workflow and outputs

The overall simulator works as follows. The user first specifies a talkspurt distribution file, which is extracted from live speech and loaded as an array V. The user then specifies the network conditions for the test. The Delay Block returns a sequence of network delays D corresponding to those conditions. The input parameters are:

1. Number of packets
2. Packet interval (ms)
3. Base delay (ms)
4. BAD state burst length (ms)
5. BAD state probability (%)
6. GOOD state jitter (% of base delay)
7. BAD state jitter multiplier

For our testing, parameters 1–4 were kept constant, while parameters 5–7 were varied as outlined below to give different network characteristics ranging from stable to unstable:

1. Number of packets = 4000
2. Packet interval = 20 ms
3. Base delay = 50 ms
4. BAD state burst length = 10 packets (200 ms)
5. BAD state probability = 20, 40, 60, 80 %
6. GOOD state jitter = 25, 50 %
7. BAD state jitter multiplier = 2, 3, 4
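The parameter lists above define a small grid of network conditions. The following sketch enumerates that grid; the dictionary keys and the function name `network_conditions` are illustrative assumptions rather than names from the original simulator, and the values are taken as listed in the text.

```python
from itertools import product

# Parameters 1-4: held constant for all tests.
FIXED = {
    "num_packets": 4000,
    "packet_interval_ms": 20,
    "base_delay_ms": 50,
    "bad_burst_packets": 10,
}

# Parameters 5-7: varied to move from stable to unstable conditions.
BAD_STATE_PROBABILITY_PCT = (20, 40, 60, 80)
GOOD_STATE_JITTER_PCT = (25, 50)
BAD_STATE_JITTER_MULTIPLIER = (2, 3, 4)


def network_conditions():
    """Enumerate every combination of the varied parameters."""
    return [
        dict(FIXED, bad_prob_pct=p, good_jitter_pct=j, bad_jitter_mult=m)
        for p, j, m in product(
            BAD_STATE_PROBABILITY_PCT,
            GOOD_STATE_JITTER_PCT,
            BAD_STATE_JITTER_MULTIPLIER,
        )
    ]


conditions = network_conditions()
burst_ms = FIXED["bad_burst_packets"] * FIXED["packet_interval_ms"]
```

The grid has 4 x 2 x 3 = 24 entries, and the burst length of 10 packets at a 20 ms packet interval corresponds to the 200 ms quoted in the list.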

Note that in arriving at these parameters, particularly those relating to jitter, a detailed series of delay and jitter tests was undertaken, measuring delay/jitter between Ireland and the US/mainland Europe. Further details can be found in [38]. More recent testing by [45, 46] has highlighted the particular problem of very high jitter/delay in congested IEEE 802.11 networks. The final step was to map the delays to packets and apply them to adaptive jitter buffering (AJB) algorithms 1 and 2. This generates a series of playout adjustments A for every network test condition and playout algorithm. Figure 1 summarises the overall flow within the simulator.

In [47], details of the comprehensive tests using the simulator are presented. In summary, 24 different network delay models were used, generating 24 network delay sequences D. These delay values were fed to 2 different playout algorithms, as described in Section 3.1.3. For each playout algorithm, tests were repeated using different parameters such as $\alpha$ (history weighting, varied from 0.8 to 0.998), $\beta$ (jitter multiplier, varied from 4 to 6) and the spike mode threshold (algorithm 2 only); see again Section 3.1.3 for details. Each combination resulted in a distinct set of playout adjustments A for each test scenario. Note that the voice samples V were based on 80 seconds of active speech with 40 talkspurts (Marker bit = 1), whereas the speech samples chosen for this experiment were 8 seconds long, so this also had to be factored in. Essentially, a pro-rata approach was taken: for the 80 seconds of speech used for the tests there were 40 playout adjustments, so for our 8-second ITU-T speech samples we implemented 4 adjustments. From the full range of test combinations described above, which numbered 96, and the resulting adjustments, a subset of 12 sets of playout adjustments containing 4 adjustments each was taken to represent a spectrum of network conditions ranging from a stable network to an unstable network. Table I lists the playout adjustments selected and applied to the speech samples.
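As a quick consistency check, the selected sets of adjustments can be written down and their absolute sums recomputed. The values below are transcribed from Table I; the names `TABLE_I` and `absolute_sum` are illustrative only.

```python
# Playout adjustments in ms (positive = silence added, negative = removed),
# transcribed from Table I. Keys are the test condition labels.
TABLE_I = {
    "Ref": (0, 0, 0, 0),
    "1": (2, -2, 3, -3),
    "2": (4, -4, -4, 4),
    "3": (3, -3, -6, 6),
    "4": (5, -5, -5, 5),
    "5": (3, -6, -7, 10),
    "6": (16, -12, -8, 4),
    "7": (10, -17, -6, 13),
    "8": (10, -15, -10, 15),
    "9": (8, -23, -3, 18),
    "10": (5, 10, -30, 15),
    "11": (-15, 15, -15, 15),
    "12": (-25, 22, -8, 11),
}


def absolute_sum(adjustments):
    """Sum of adjustment magnitudes, the ordering key used in Table I."""
    return sum(abs(a) for a in adjustments)


sums = [absolute_sum(adj) for adj in TABLE_I.values()]
largest = max(abs(a) for adj in TABLE_I.values() for a in adj)
```

The recomputed sums reproduce the Σ column, and the largest single adjustment is 30 ms (test condition No. 10), in line with the "30 ms or less" stated in the Summary.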

3.2. Speech samples

As is normal for quality testing, 4 reference speech samples were used. The English subset of the ITU-T P-series Supplement 23 [48] database was used for speech material, each sample consisting of a pair of utterances with a small pause between the utterances. Two male and two female speakers uttering different sentences were included in the stimuli. The speech samples used (source samples available in the database) were 8 seconds in length and stored as 16-bit, 8000 Hz linear PCM.

Figure 3. Demonstration of how playout adjustments were applied to the 1st Female speech sample, Variant A. Arrows indicate where adjustments were placed.

Figure 4. Demonstration of how playout adjustments were applied to the 1st Female speech sample, Variant B. Arrows indicate where adjustments were placed.

Figure 5. Demonstration of how playout adjustments were applied to the 2nd Female speech sample, Variant A. Arrows indicate where adjustments were placed.

Figure 6. Demonstration of how playout adjustments were applied to the 2nd Female speech sample, Variant B. Arrows indicate where adjustments were placed.

Each speech sample was modified by inserting and removing silence periods to reflect the adjustments specified above. The adjustments were a mix of positive and negative adjustments (adding and removing silence periods), as shown in Table I.

As a further experimental variable, the set of four adjustments was applied to each sample in two different locations (referred to hereafter as variant A and B). The only distinction between variant A and B is that the impairments in variant B were applied in the latter part of each sample. All the adjustments were made using a free sound editor. Figures 3–10 illustrate how the 4 playout adjustments were applied to all speech samples involved in the experiment in 2 different places (Variant A and B).

Overall, this procedure produced 96 speech samples (4 voices × 12 test conditions × 2 variants).
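The adjustments in this study were made by hand in a sound editor. Purely as an illustration of the operation, the sketch below inserts or removes silence in a list of PCM samples; the function name `apply_adjustments` and the millisecond positions are hypothetical, and the code assumes that every position lies inside a silence period long enough to absorb a removal.

```python
def apply_adjustments(samples, positions_ms, adjustments_ms, fs=8000):
    """Insert (positive) or remove (negative) silence at the given positions.

    samples        : sequence of PCM sample values
    positions_ms   : where each adjustment is placed (ms from the start)
    adjustments_ms : signed adjustment sizes in ms, as in Table I
    fs             : sampling rate in Hz (8000 Hz for the samples used here)
    """
    out = list(samples)
    # Work from the last position backwards so earlier indices stay valid.
    for pos_ms, adj_ms in sorted(zip(positions_ms, adjustments_ms), reverse=True):
        idx = pos_ms * fs // 1000
        count = abs(adj_ms) * fs // 1000
        if adj_ms >= 0:
            out[idx:idx] = [0] * count   # lengthen the silence period
        else:
            del out[idx:idx + count]     # shorten the silence period
    return out
```

For test condition No. 1 in Table I (2, −2, 3, −3 ms) the signed adjustments cancel, so the modified sample has the same length as the reference, while the speech after each adjustment point is shifted in time.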

3.3. Speech quality assessment

The speech quality assessment process was divided into two parts, namely subjective assessment (listening test) and objective assessment (using the PESQ and POLQA models). Both assessment procedures are described in more detail below.

The ACR subjective listening test was performed in May 2012 in accordance with ITU-T Recommendation P.800 [15]. In every case, up to 2 listeners were seated in a small, acoustically treated listening room with a background noise well below 20 dB SPL (A). Altogether, 30 naïve (non-expert) listeners (16 male, 14 female, 20–55 years, mean 34.43 years) participated in the test. All subjects were Irish nationals whose first language was English. The subjects were remunerated for their efforts. The samples (96 degraded samples + 4 reference samples) were played out using high quality studio equipment in a random order and diotically presented over Sennheiser HD 455 headphones (presentation level 73 dB SPL (A)) to the test subjects. The opinion scores, from 1 (bad) to 5 (excellent), were averaged to obtain MOS-Listening Quality Subjective narrowband (MOS-LQSn) values for each sample.

Figure 7. Demonstration of how playout adjustments were applied to the 1st Male speech sample, Variant A. Arrows indicate where adjustments were placed.

Figure 8. Demonstration of how playout adjustments were applied to the 1st Male speech sample, Variant B. Arrows indicate where adjustments were placed.

Figure 9. Demonstration of how playout adjustments were applied to the 2nd Male speech sample, Variant A. Arrows indicate where adjustments were placed.

Figure 10. Demonstration of how playout adjustments were applied to the 2nd Male speech sample, Variant B. Arrows indicate where adjustments were placed.
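The averaging of opinion scores into a MOS value, together with a 95% confidence interval, can be sketched as follows. This is an illustrative helper, not the authors' analysis script: the name `mos_with_ci` is hypothetical, and the default Student t quantile (2.045) is only appropriate for the 30 scores per sample collected here.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student t quantile for 29 degrees of freedom (30 listeners).
T_95_DF29 = 2.045


def mos_with_ci(scores, t_value=T_95_DF29):
    """Return (MOS, half-width of the 95% confidence interval).

    scores  : individual opinion scores on the 1 (bad) to 5 (excellent) scale
    t_value : Student t quantile matching len(scores) - 1 degrees of freedom
    """
    mos = mean(scores)
    half_width = t_value * stdev(scores) / sqrt(len(scores))
    return mos, half_width
```

A caller with a different number of scores would need to pass the matching t quantile explicitly.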

In the next step, the 100 samples (the 96 degraded samples using test conditions No. 1–12 and the 4 reference samples forming the Ref test condition) were compared to their respective reference samples using both the PESQ model described in ITU-T Rec. P.862 [22, 23, 27] and the POLQA model described in P.863 [24, 25, 28], in order to obtain objective listening quality scores. In the case of PESQ, the output (raw PESQ scores) was converted to MOS-Listening Quality Objective narrowband (MOS-LQOn) values by the equation defined in [49].
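For illustration, the conversion from raw PESQ scores to MOS-LQOn can be sketched as below. The sketch assumes that the equation referred to in [49] is the narrowband mapping function of ITU-T Rec. P.862.1; the function name `pesq_raw_to_mos_lqo` is illustrative.

```python
from math import exp


def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw P.862 score to MOS-LQO (narrowband), assuming the
    ITU-T P.862.1 mapping function."""
    return 0.999 + 4.0 / (1.0 + exp(-1.4945 * raw_score + 4.6607))


# The raw P.862 scale tops out at 4.5, which maps to roughly 4.55 MOS-LQO.
ceiling = pesq_raw_to_mos_lqo(4.5)
```

Under this mapping the highest attainable value is approximately 4.55, which is the upper end of the PESQ score ranges reported in Section 4.2.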

4. Experimental results

In this section, we present experimental results for both subjective and objective assessment (PESQ and POLQA models), as well as a detailed analysis and comparison of both.


4.1. Experimental results for subjective assessment

In Figure 11, we summarise the results of the subjective listening test, averaged over the 4 different voices involved in the experiment, as described in Section 3. It can be seen that the impact of test conditions No. 1–12 (playout adjustments) on average MOS-LQSn scores relative to the reference samples is quite limited. On the other hand, we can also see that the subjects gave surprisingly low MOS-LQSn scores to all samples, including the reference samples involved in the test, i.e. the average values oscillate around 3.6 MOS. This value is quite low considering that the samples contain either no degradations (reference samples) or very moderate degradations in a narrowband context. This result warrants further analysis beyond the scope of this paper, but one possibility is that the subjects' opinion has been affected by their previous long-term experience with wideband telephony (wideband speech), though this was not validated. Such experience would shift their internal reference towards wideband speech (extended frequency range resulting in higher speech quality) and thus explain the lower scores given to the narrowband samples involved in the test. One other possibility relates to the fact that the range of conditions (impairments) introduced into the speech samples was quite limited. This issue is discussed in more detail in [50].

As is evident, the impact of varying the extent of playout adjustments across all test conditions was very small (insignificant). However, one characteristic of note that emerged is the small impact of the location of the adjustments on scores, i.e. most of the test conditions using variant B obtained slightly lower scores than the same conditions with variant A. The biggest difference (0.2 MOS) between the two investigated variants (location impact) was obtained for test condition No. 10. As discussed, and as is evident from Figures 3–10, variant B adjustments were designed to be towards the end of the sample, so one explanation is that the distortions presented closer to the end of the sample were a bit more annoying for the subjects than those presented in the first half of the sample. In principle, this result has echoes of the so-called recency effect reported in the literature (e.g. [51, 52, 53]). However, such tests used samples longer than 60 seconds. In any event, and as discussed later, the differences are not significant in the context of the confidence interval.

One three-way analysis of variance (ANOVA) test was conducted on the subjective results using test condition, voice and variant (location) as fixed factors (Table V). It should be noted here that the voice factor is a combination of voice and content. The highest F-ratio was obtained for the voice factor (F = 73.12, p < 0.001), i.e. the effect of voice was found to be highly statistically significant. The test condition factor had a much weaker effect on quality than the voice factor (F = 0.82, p = 0.635) and was not statistically significant. The last factor investigated in the ANOVA test was the variant factor, which also turned out to have a weaker effect on quality than the voice factor and, like the test condition factor, was not statistically significant (F = 1.07, p = 0.3022). Regarding interactions between the factors involved, the results show that none of them is statistically significant.

Table I. Playout adjustments (in ms) applied to speech samples. Σ: absolute sum of adjustments.

Test condition   1st   2nd   3rd   4th    Σ
Ref                0     0     0     0    0
1                  2    −2     3    −3   10
2                  4    −4    −4     4   16
3                  3    −3    −6     6   18
4                  5    −5    −5     5   20
5                  3    −6    −7    10   26
6                 16   −12    −8     4   40
7                 10   −17    −6    13   46
8                 10   −15   −10    15   50
9                  8   −23    −3    18   52
10                 5    10   −30    15   60
11               −15    15   −15    15   60
12               −25    22    −8    11   66

Figure 11. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQSn. The vertical bars show 95% CI computed over 120 MOS-LQSn values (30 subjective scores per sample for 4 samples). [Plot: MOS-LQSn (scale 1 to 5) against test conditions Ref and No. 1–12, for Variant A and Variant B.]

To summarise, the results of the ANOVA test revealed that subjects were more sensitive to the voice than to the test conditions and variants, and no statistically significant interactions between the investigated factors were found (assuming no impact of content due to carefully chosen speech samples). It should be noted here that variability caused by content is considered one of the sampling factors, as defined in the Handbook of subjective testing practical procedures [16]. The ANOVA test also revealed that the small differences between the quality scores of variant A and B reported previously are not statistically significant. In other words, the results of the ANOVA test confirmed our assertion that the differences are not significant in the context of the confidence interval.

As is often reported in the literature, some impact relating to the voice effect was expected in this experiment, but not to such an extent. A diagnostic analysis of the test data revealed that one of the voices (1st male) was liked more than the others (i.e. over all conditions, this voice was rated on average approx. 0.4 MOS-LQSn higher than the second male voice and approx. 0.6 MOS-LQSn higher than the female voices). It is also worth noting that both male voices obtained higher scores than the female voices.

As can be seen above, the impact of playout adjustments is not statistically significant. This raises the question of whether the impact would be higher if the Degradation Category Rating (DCR) testing approach were deployed. We considered this during the research design phase. We carried out limited DCR subjective testing (4 experts involved) and the subjects noted no degradation caused by time-shifting (playout adjustments). On this basis, we came to the conclusion that the introduced impairments would not be noticed by subjects in a DCR test. For that reason, we decided to use an ACR test.

Furthermore, it should be noted that in telephony, subjects have no access to the original (undegraded) speech from their conversational partner, and thus ACR testing is commonly used in telephony speech quality assessment.

To the best of our knowledge, the results presented in this section are a first proof of the assertion published in the literature [1, 2, 10] that small and frequent silence period adjustments typical of VoIP playout algorithms do not have a noticeable effect on listening quality perceived by the end user.

It is interesting at this juncture to compare our results with the subjective results of Voran [11] and Hoene et al. [6]. Voran introduced pause/jump impairments at the rate of 1 to 4 per 3-second sentence, with pause/jump magnitudes of 30, 60 and 120 ms. We introduced 4 impairments in an 8-second sample, based on our observations of actual speech and using adjustments derived from realistic network delay models and real playout algorithms. As a result, our adjustments were typically much smaller (the largest was 30 ms). Voran noted that for 1 pause/jump adjustment of 30 ms, MOS scores fell by 0.2, whereas we found no significant drop in MOS scores as the extent of adjustments increased, even for conditions 10–12, where the magnitude of some of the adjustments was similar to Voran's at 20–30 ms. One key distinction, as stated before, is that Voran applied such impairments to active speech, whereas all our adjustments were made to silence periods. Furthermore, in the case of Voran's pause impairment, PLC was used. Finally, Voran reported a significant impact of a very large single adjustment (silence removal) of 430 ms. The magnitude of adjustments introduced in our samples was much smaller and, as reported above, no significant subjective impact was found. Hoene et al. in [6] introduced very large (in comparison to adjustments introduced by jitter buffers) single adjustments into active speech and also noted a significant impact, consistent with PESQ.

4.2. Experimental results for objective assessment

Figure 12 depicts the results of objective assessment done by PESQ (MOS-LQOn (PESQ)) using the same test conditions and speech samples. We can observe that the severity of the test conditions (playout adjustments) has a relatively big impact on the predicted MOS values. In summary, the MOS scores decrease as adjustments increase, i.e. as network instability increases. This is interesting in the context of the finding by Voran [11] that PESQ does not register any impact arising from a single 430 ms silence period adjustment (silence removal). Hoene et al.'s results from PESQ analysis published in [5] are somewhat similar to Voran's, showing no impact for single adjustments of up to approx. 320 ms. One possible explanation is that in our tests we introduced frequent and small adjustments rather than the single and large adjustments introduced by Voran/Hoene et al. We speculate that PESQ had difficulties with our adjustment profile during the time-alignment process. As detailed in Section 2.2, the frequency and extent of adjustments is affected greatly by VAD settings, but our test design of 4 adjustments in 8 seconds is not atypical of VoIP.

Figure 12. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by PESQ. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: MOS-LQOn (PESQ) (scale 1 to 5) against test conditions Ref and No. 1–12, for Variant A and Variant B.]

One positive correlation between PESQ and the subjective test results is that the biggest difference between variant A and B (0.47 MOS) was reported for test condition No. 10. It should also be noted that there is a significant difference between the scores obtained for conditions No. 10 and 11. This is interesting in that, whilst the absolute sum of the adjustments for both conditions is the same, the individual adjustments are different. Test condition No. 10 represents very different adjustments, varying from 5 to −30 ms. On the other hand, condition No. 11 contains adjustments of similar magnitude: two adjustments of −15 ms and two adjustments of 15 ms. The second one (condition No. 11) obtained much lower scores in both cases (Variant A and B). As such, it seems that PESQ is better able to cope with large adjustments (30 ms was the largest of all conditions). In contrast, the opposite results were obtained for the subjective data, i.e. listeners scored test condition No. 10 lower than No. 11. However, the differences between the subjective results obtained for test conditions No. 10 and 11 are much smaller than those obtained for the objective results and of the same order as the confidence interval (see Figure 11).

As with the subjective results, lower MOS-LQOn (PESQ) scores have been reported for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by PESQ ranged from 3.665 to 4.55 MOS for variant A and from 3.3 to 4.55 MOS for variant B. It can be clearly seen in Figure 12 that the impact of adjustment location is noticeable for objective scores predicted by PESQ, especially for higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by PESQ revealed that the voice impact was much weaker than that reported above for the subjective data. In fact, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. For this reason, such models are sometimes called impairment or degradation models. However, it seems that there is some interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, because the deviation of the MOS-LQOn (PESQ) scores between the test conditions is different for all 4 voices involved in the test. A much weaker and not statistically significant interaction of test condition and voice was obtained for the subjective data, as shown in Table V.

Figure 13 shows the results of objective assessment done by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of PESQ predictions. Nonetheless, it seems that POLQA was affected more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No. 7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend in POLQA scores as has been done for PESQ above.

Regarding the biggest difference between variant A and B for test condition No. 10, reported above for both subjective scores and objective scores predicted by PESQ, it is worth noting that this effect has not been captured by POLQA at all. In other words, the scores predicted by POLQA for variant A and B of test condition No. 10 were very similar (0.02 MOS difference).

Moreover, the scores predicted by POLQA largely do exhibit the same behaviour as the scores obtained from the subjective test from a location perspective (Variant A and B). In other words, it has been reported above that PESQ and the subjects involved in the subjective test provided lower MOS scores for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by POLQA ranged from 4.10 to 4.43 MOS for variant A and from 4.08 to 4.43 MOS for variant B. This contrasts with the results for PESQ, where adjustment location had a significant impact. It can be clearly seen in Figure 13 that the adjustment location plays a less important role here, except for some of the higher magnitudes of investigated playout adjustments.

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: MOS-LQOn (POLQA) (scale 1 to 5) against test conditions Ref and No. 1–12, for Variant A and Variant B.]

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact was much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on special characteristics of voice. Regarding the interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, this effect was also observed for the POLQA model, but only for female voices.

4.3. Comparison between subjective and objective quality scores

In the following subsection, subjective MOS values (MOS-LQSn) are compared to the predictions provided by both PESQ and POLQA (MOS-LQOn (PESQ, POLQA)). The comparison is performed for all experimental conditions, i.e. all combinations of voice, test conditions and both investigated location variants. However, the MOS-LQSn values will have been influenced by the choice of conditions in the actual experiment. In order to account for such influences, model predictions are commonly transformed to the range of conditions that are part of the respective test [54]. This may be done, for example, by using a monotonic 3rd-order mapping function, presuming such a function can be found.

The performance of the PESQ and POLQA models is quantified in terms of the Pearson correlation coefficient R, the respective root mean square error (rmse) and the epsilon-insensitive root mean square error (rmse*), as [55, 56]:

$$ R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2 \, \sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \tag{3} $$

and

$$ \mathrm{rmse} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} (X_i - Y_i)^2} \tag{4} $$

with $X_i$ the subjective MOS value for stimulus $i$, $Y_i$ the objective (predicted) MOS value for stimulus $i$, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, $N$ the number of stimuli considered in the comparison, and $d$ the number of degrees of freedom provided by the mapping function ($d = 4$ in the case of a 3rd-order mapping function, $d = 1$ in the case of no regression). On the other hand, the epsilon-insensitive root mean square error is based on the per-stimulus error

$$ \mathrm{Perror}_i = \max\left(0,\; |X_i - Y_i| - ci_{95,i}\right) \tag{5} $$

where $ci_{95,i}$ represents the 95% confidence interval and is defined by [56]

$$ ci_{95,i} = t(0.05, M)\, \frac{\delta_i}{\sqrt{M}} \tag{6} $$

where $M$ denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of the subjective scores for stimulus $i$. The final epsilon-insensitive root mean square error is calculated as usual, but based on the Perror from formula (5):

$$ \mathrm{rmse}^{*} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} \mathrm{Perror}_i^{2}} \tag{7} $$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* considers only differences that fall outside an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of the MOS is thus taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and both rmse and rmse* would be 0.0.
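The three performance measures in equations (3)–(7) can be sketched directly in Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the per-stimulus confidence intervals $ci_{95,i}$ of equation (6) are assumed to have been computed beforehand and are passed in as a list.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient, Eq. (3)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return num / den


def rmse(x, y, d=1):
    """Root mean square error, Eq. (4); d = 1 when no regression is applied."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / (len(x) - d))


def rmse_star(x, y, ci95, d=1):
    """Epsilon-insensitive rmse, Eqs. (5) and (7).

    Differences smaller than the confidence interval of the subjective
    score contribute nothing to the error.
    """
    perror = [max(0.0, abs(xi - yi) - ci) for xi, yi, ci in zip(x, y, ci95)]
    return sqrt(sum(p * p for p in perror) / (len(x) - d))
```

A constant offset between subjective and predicted scores leaves R at 1.0 but produces a non-zero rmse, which is why the text reports both; rmse* additionally discounts the part of each difference that lies within the confidence interval.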

R, rmse and rmse* are all calculated for the raw (non-regressed) MOS-LQOn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined), and both (the regressed and the non-regressed MOS-LQOn predictions) are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*) are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (Variant A and B). The correlation coefficient is positive though very low (R = 0.17) for variant A and negative for variant B (R = −0.15). As usual, a positive correlation indicates that both variables increase or decrease together, whereas a negative correlation indicates that as one variable increases the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed) per sample. [Scatter plot: MOS-LQOn (PESQ) against MOS-LQSn, both on a scale of 1 to 5, for Variant A and Variant B.]

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed) per sample. [Scatter plot: MOS-LQOn (POLQA) against MOS-LQSn, both on a scale of 1 to 5, for Variant A and Variant B.]

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed) per sample. [Scatter plot: MOS-LQOn (POLQA) against MOS-LQSn, both on a scale of 1 to 5, for Variant A and Variant B.]

Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant       R    rmse   rmse*
A          0.17   0.639   0.367
B         −0.15   0.690   0.424

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant       R    rmse   rmse*
A          0.30   0.783   0.448
B          0.47   0.806   0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant       R    rmse   rmse*
A          0.30   0.265   0.243
B          0.47   0.270   0.217

The 3rd-order regression, as recommended in [54], leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outlier influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried 2nd- and 1st-order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this data set.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little difference between location variants intra-speaker, significant inter-speaker variability and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12 and significant inter-speaker variation.

Regarding the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.

Figure 17. Dominant experimental factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals. [Three panels (Subjective results, PESQ, POLQA): MOS-LQS or MOS-LQOn (scale 3 to 5) against Speaker/Variant (M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B).]
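The monotonicity check on a first-order mapping function, as discussed above, can be sketched as follows. The helper names are hypothetical and the data in the usage note are made up for illustration; no values from the study are reproduced here.

```python
def fit_first_order(predicted, subjective):
    """Least-squares fit subjective ~ a * predicted + b; returns (a, b)."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_s = sum(subjective) / n
    sxx = sum((p - mean_p) ** 2 for p in predicted)
    sxy = sum((p - mean_p) * (s - mean_s) for p, s in zip(predicted, subjective))
    a = sxy / sxx
    b = mean_s - a * mean_p
    return a, b


def is_monotonically_increasing(slope):
    """A first-order mapping is increasing exactly when its slope is positive."""
    return slope > 0
```

For example, `fit_first_order([1, 2, 3], [3, 2, 1])` yields a slope of −1, so the mapping is rejected. The slope of such a fit has the same sign as the Pearson correlation of the fitted data, so a negative correlation between predictions and subjective scores cannot yield an increasing first-order mapping.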

Moving to POLQA, the results show little intra-speaker variation across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except for Female 1) and much less inter-speaker variation. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in both Tables III and IV, and the rmse data for POLQA after regression (1st-order polynomial regression applied) is also much better than for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.


Table V. Summary of ANOVA test conducted on the MOS-LQSn's.

Effect                     SS     df        MS       F        p
Test condition (TC)      9.23     12    0.7693    0.82   0.6350
Voice                  207.02      3   69.0073   73.12   0.0000
Variant                  1.01      1    1.0051    1.07   0.3022
TC × Voice              18.21     36    0.5059    0.54   0.9895
TC × Variant             4.68     12    0.3899    0.41   0.9593
Voice × Variant          0.68      3    0.2282    0.24   0.8672
Error                 2880.37   3052    0.9438
Total                 3121.2    3119
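The F-ratios in Table V follow from the sums of squares and degrees of freedom: each mean square is SS/df, and each F is the effect mean square divided by the error mean square. The sketch below recomputes them from the rounded table values; the variable and function names are illustrative.

```python
# (sum of squares, degrees of freedom, reported F) transcribed from Table V.
EFFECTS = {
    "Test condition (TC)": (9.23, 12, 0.82),
    "Voice": (207.02, 3, 73.12),
    "Variant": (1.01, 1, 1.07),
    "TC x Voice": (18.21, 36, 0.54),
    "TC x Variant": (4.68, 12, 0.41),
    "Voice x Variant": (0.68, 3, 0.24),
}
ERROR_SS, ERROR_DF = 2880.37, 3052


def f_ratio(ss, df, ss_error=ERROR_SS, df_error=ERROR_DF):
    """F = (SS_effect / df_effect) / (SS_error / df_error)."""
    return (ss / df) / (ss_error / df_error)


recomputed = {name: f_ratio(ss, df) for name, (ss, df, _) in EFFECTS.items()}
total_ss = sum(ss for ss, _, _ in EFFECTS.values()) + ERROR_SS
total_df = sum(df for _, df, _ in EFFECTS.values()) + ERROR_DF
```

Within rounding, the recomputed ratios agree with the reported F column, and the sums of squares and degrees of freedom add up to the Total row (3121.2 and 3119).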

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12), and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may also have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ/POLQA)). Moreover, the accuracy of both PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much less. Note that both Voran's and Hoene et al.'s research showed that single adjustments (430 ms in the case of Voran, 0–320 ms (approximately) in the case of Hoene et al.) were found to be disregarded by PESQ. Regarding question 3 and PESQ, a comparison of the subjective assessments and predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although Hoene et al.'s research [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either during its design and integration phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that, whilst some relationship exists, it is both much less noticeable and characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to subjective scores (insignificant), except for Female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research and which are worthy of further investigation include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though not statistically significantly so. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in PESQ objective results but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google, Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual Service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: Psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] I D C D S of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D.123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v1.0. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.


3.1.3. Playout algorithm simulator block

Two per-talkspurt adaptive strategies (namely algorithms 1 and 4 from [1], referred to here as algorithms 1 and 2 respectively) were simulated. Both algorithms utilise linear recursive filters in tracking network conditions but differ in that algorithm 2 responds more quickly due to different parameters and also includes a spike mode that responds more rapidly to changing network conditions, although it must still wait for the next silence period to do so. Both algorithms adjust playout time accordingly at the start of a talkspurt, as given by

$$ d_i = \alpha\, d_{i-1} + (1 - \alpha)\, n_i \qquad (1) $$

$$ p_i = t_i + d_i + \beta\, v_i \qquad (2) $$

In the above, $i$ refers to packet $i$, $d_i$ is the estimated end-to-end delay, $\alpha$ is the filter gain, $n_i$ is the measured delay, $p_i$ is the playout time, $t_i$ is the send time, $v_i$ is the estimated variation in delay and $\beta$ is a multiplication factor. For example, in [1] the authors choose a $\beta$ value of 4 for both algorithms, whereas the history factor $\alpha$ was set to 0.875 for algorithm 2 versus 0.998 for algorithm 1. The choice of the parameters $\alpha$, $\beta$ and the spike detection threshold (for algorithm 2) impacts greatly on the performance of these algorithms, and they are usually tuned to match precise network conditions, i.e. from stable to unstable. For this reason, we utilised a range of values, as described in the following section. As described earlier, adjusting on a per-talkspurt basis maintains the integrity of speech within talkspurts whilst altering the inter-talkspurt silence periods.

The delays from the Delay Block are mapped to the Voice talkspurt distribution series V and applied to the various playout algorithms. The delays are applied to each packet in turn and processed by the playout algorithm, and adjustments are made at the start of each talkspurt, indicated by a '1' in the V array. This then generates a series of playout adjustments A. As outlined in Section 2.2, the extent of adjustment is dependent not only on the network condition and playout algorithm but also on the specific VAD settings of the VoIP application, as the latter will greatly impact on the talkspurt distribution for a given speech segment. In any event, the resulting required adjustment (silence) can be added or removed at this point before the next talkspurt is processed.
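The per-talkspurt update described by equations (1) and (2) can be sketched as follows. This is a minimal illustration, not the simulator used in the study: the recursive update of the variation estimate v_i is not given in this section, so the exponentially weighted form below is an assumption, the spike mode of algorithm 2 is not modelled, and the function and argument names are hypothetical.

```python
def playout_adjustments(delays_ms, talkspurt_start, alpha=0.998, beta=4.0):
    """Return the change in playout offset (ms) at each talkspurt after the first.

    delays_ms       -- measured network delay n_i for each packet (ms)
    talkspurt_start -- the V series: 1 where a packet starts a talkspurt, else 0
    alpha, beta     -- filter gain and multiplication factor of Eqs. (1) and (2)
    """
    d = float(delays_ms[0])  # d_0 seeded with the first measured delay (assumption)
    v = 0.0                  # v_0 seeded with zero (assumption)
    offset = None            # playout offset p_i - t_i currently in force
    adjustments = []
    for n, is_start in zip(delays_ms, talkspurt_start):
        d = alpha * d + (1.0 - alpha) * n           # Eq. (1)
        v = alpha * v + (1.0 - alpha) * abs(d - n)  # assumed form of the v_i update
        if is_start:
            new_offset = d + beta * v               # Eq. (2), expressed as p_i - t_i
            if offset is not None:
                # Positive value: silence is added; negative: silence is removed.
                adjustments.append(new_offset - offset)
            offset = new_offset
    return adjustments


if __name__ == "__main__":
    delays = [50] * 5 + [100] * 5
    v_series = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
    print(playout_adjustments(delays, v_series, alpha=0.875))
```

The offset is only changed where the V series marks a talkspurt start, which mirrors the point made above that speech within a talkspurt is left intact and only the silence between talkspurts is stretched or shortened.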

3.1.4. Simulation work flow and outputs

The overall simulator works as follows. The user first specifies a talkspurt distribution file, which is extracted from live speech and loaded as an array V. The user then specifies the network conditions for the test. The Delay block returns a sequence of network delays D corresponding to those conditions. The input parameters are:

1. Number of packets
2. Packet interval (ms)
3. Base delay (ms)
4. BAD state burst length (ms)
5. BAD state probability (%)
6. GOOD state jitter (% of base delay)
7. BAD state jitter multiplier

For our testing, parameters 1–4 were kept constant, while parameters 5–7 were varied as outlined below to give different network characteristics ranging from stable to unstable:

1. Number of packets = 4000
2. Packet interval = 20 ms
3. Base delay = 50 ms
4. BAD state burst length = 10 packets (200 ms)
5. BAD state probability = 20, 40, 60, 80 %
6. GOOD state jitter = 25, 50 %
7. BAD state jitter multiplier = 2, 3, 4
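The parameter list above can be expanded into individual network test conditions as sketched below. The sketch assumes that every combination of the varied parameters 5–7 is used, which gives 4 × 2 × 3 = 24 combinations; the dictionary keys and function name are hypothetical, and the probability and jitter values are read here as percentages.

```python
from itertools import product

# Parameters 1-4, held constant across all tests.
FIXED = {
    "num_packets": 4000,
    "packet_interval_ms": 20,
    "base_delay_ms": 50,
    "bad_burst_packets": 10,
}

# Parameters 5-7, varied to move from stable to unstable networks.
BAD_STATE_PROBABILITY_PCT = (20, 40, 60, 80)
GOOD_STATE_JITTER_PCT = (25, 50)
BAD_STATE_JITTER_MULTIPLIER = (2, 3, 4)


def network_conditions():
    """Enumerate every combination of the varied parameters (assumed full grid)."""
    return [
        dict(FIXED, bad_prob_pct=p, good_jitter_pct=j, bad_jitter_mult=m)
        for p, j, m in product(
            BAD_STATE_PROBABILITY_PCT,
            GOOD_STATE_JITTER_PCT,
            BAD_STATE_JITTER_MULTIPLIER,
        )
    ]


if __name__ == "__main__":
    conditions = network_conditions()
    burst_ms = FIXED["bad_burst_packets"] * FIXED["packet_interval_ms"]
    print(len(conditions), "conditions; BAD burst length =", burst_ms, "ms")
```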

Note that in arriving at these parameters, particularly those relating to jitter, a detailed series of delay and jitter tests was undertaken, measuring delay/jitter between Ireland and the US/mainland Europe. Further details can be found in [38]. More recent testing by [45, 46] has highlighted the particular problem of very high jitter/delay in congested IEEE 802.11 networks. The final step was to map the delays to packets and apply them to adaptive jitter buffering (AJB) algorithms 1 and 2. This generates a series of playout adjustments A for every network test condition and playout algorithm. Figure 1 summarises the overall flow within the simulator.

In [47], details of the comprehensive tests using the simulator are presented. In summary, 24 different network delay models were used, generating 24 network delay sequences D. These delay values were fed to 2 different playout algorithms, as described in Section 3.1.3. For each playout algorithm, tests were repeated using different parameters such as α (history weighting, varied from 0.8 to 0.998), β (jitter multiplier, varied from 4 to 6) and spike mode threshold (algorithm 2 only); see again Section 3.1.3 for details. Each combination resulted in a distinct set of playout adjustments A for each test scenario. Note that the voice samples V were based on 80 seconds of active speech with 40 talkspurts (Marker bit = 1), whereas the speech samples chosen for this experiment were 8 seconds long; thus, this also had to be factored in. Essentially, a pro-rata approach was taken in that, for the 80 seconds of speech used for tests, there were 40 playout adjustments, so for our 8 second ITU-T speech samples we implemented 4 adjustments. Arising from the full range of test combinations described above, which numbered 96, and the resulting adjustments, a subset of 12 sets of playout adjustments containing 4 adjustments each was taken to represent a spectrum of network conditions ranging from a stable network to an unstable network. Table I lists the actual playout adjustments selected that were applied to the speech samples.

3.2. Speech samples

As normal for quality testing, 4 reference speech samples were used. The English subset of the ITU-T P Supplement 23 [48] database was used for speech material, consisting of a pair of utterances with a small pause between the utterances. Two male and two female speakers uttering different sentences were included in the stimuli. The speech samples used (source samples available in the database) were 8 seconds in length and stored in 16 bit, 8000 Hz linear PCM.

Figure 3. Demonstration of how playout adjustments were applied to 1st Female speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 4. Demonstration of how playout adjustments were applied to 1st Female speech sample - Variant B. Arrows indicate where adjustments were placed.

Figure 5. Demonstration of how playout adjustments were applied to 2nd Female speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 6. Demonstration of how playout adjustments were applied to 2nd Female speech sample - Variant B. Arrows indicate where adjustments were placed.

Each speech sample was modified by inserting and removing silence periods to reflect the adjustments as specified above. The adjustments were a mix of positive and negative adjustments (adding and removing silence periods), as shown in Table I.

As a further experimental variable, the set of four adjustments was applied to each sample in two different locations (referred to hereafter as variants A and B). The only distinction between variants A and B is that the impairments in variant B were applied in the latter part of each sample. All the adjustments were made using a free sound editor. Figures 3–10 illustrate how the 4 playout adjustments were applied to all speech samples involved in the experiment in 2 different places (Variants A and B).

The overall result of this sampling was 96 speech samples (4 voices × 12 test conditions × 2 variants).
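The adjustments in this study were made by hand in a sound editor. Purely as an illustration of the operation, the sketch below applies signed adjustments to a list of 8000 Hz PCM samples: a positive value inserts silence and a negative value removes samples. It assumes that each position lies inside a silence period and that inserted silence can be represented by zero-valued samples; the function name and the (position, milliseconds) input format are hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # narrowband linear PCM, as used for the speech samples


def apply_adjustments(samples, adjustments):
    """Return a copy of `samples` with silence added or removed.

    adjustments -- list of (position, delta_ms) pairs, where `position` is a
                   sample index in the ORIGINAL signal (assumed to fall inside
                   a silence period) and `delta_ms` is the signed adjustment.
    """
    out = list(samples)
    # Work from the last position to the first so earlier indices stay valid.
    for position, delta_ms in sorted(adjustments, reverse=True):
        count = abs(delta_ms) * SAMPLE_RATE_HZ // 1000
        if delta_ms >= 0:
            out[position:position] = [0] * count   # insert digital silence
        else:
            del out[position:position + count]     # remove part of the silence
    return out


if __name__ == "__main__":
    signal = [0] * (8 * SAMPLE_RATE_HZ)            # placeholder 8 second sample
    test_condition_1 = [(8000, 2), (24000, -2), (40000, 3), (56000, -3)]
    print(len(apply_adjustments(signal, test_condition_1)))
```

With a mix of positive and negative adjustments that cancel, such as the placeholder values above, the overall length of the sample is unchanged while the timing of the speech between the adjustment points shifts.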

3.3. Speech quality assessment

The speech quality assessment process was divided into two parts, namely subjective assessment (listening test) and objective assessment (using the PESQ and POLQA models). Both assessment procedures are described in more detail below.

Figure 7. Demonstration of how playout adjustments were applied to 1st Male speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 8. Demonstration of how playout adjustments were applied to 1st Male speech sample - Variant B. Arrows indicate where adjustments were placed.

Figure 9. Demonstration of how playout adjustments were applied to 2nd Male speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 10. Demonstration of how playout adjustments were applied to 2nd Male speech sample - Variant B. Arrows indicate where adjustments were placed.

The ACR subjective listening test was performed in May 2012 in accordance with ITU-T Recommendation P.800 [15]. In every case, up to 2 listeners were seated in a small listening room (acoustically treated) with a background noise well below 20 dB SPL (A). Altogether, 30 naïve (non-expert) listeners (16 male, 14 female, 20–55 years, mean 34.43 years) participated in the test. All subjects were Irish nationals whose first language was English. The subjects were remunerated for their efforts. The samples (96 degraded samples + 4 reference samples) were played out using high quality studio equipment in a random order and diotically presented over Sennheiser HD 455 headphones (presentation level 73 dB SPL (A)) to the test subjects. The opinion scores, from 1 (bad) to 5 (excellent), were averaged to obtain MOS-Listening Quality Subjective narrowband (MOS-LQSn) values for each sample.
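The averaging of opinion scores into a MOS value, together with a 95% confidence interval of the kind shown as vertical bars in the result figures, can be sketched as follows. The exact interval formula used in the study is not stated in this section; the sketch assumes the common normal approximation (mean ± 1.96 standard errors), and the function name and example scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (MOS, half-width of the confidence interval).

    scores -- individual opinion scores on the 1 (bad) to 5 (excellent) scale
    z      -- normal quantile; 1.96 corresponds to a 95% interval (assumption)
    """
    mos = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return mos, half_width


if __name__ == "__main__":
    example_scores = [3, 4, 4, 3, 4, 3, 4, 4]  # made-up scores, for illustration only
    mos, ci = mos_with_ci(example_scores)
    print(f"MOS-LQSn = {mos:.2f} +/- {ci:.2f}")
```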

In the next step, the 100 samples (essentially the 96 degraded samples (using test conditions No. 1–12) and 4 reference samples (Ref. test condition)) were compared to their respective reference samples using both the PESQ model described in ITU-T Rec. P.862 [22, 23, 27] and the POLQA model described in P.863 [24, 25, 28] in order to get objective listening quality scores. In the case of PESQ, the output (raw PESQ scores) was converted to MOS-Listening Quality Objective narrowband (MOS-LQOn) values by the equation defined in [49].
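The conversion from raw PESQ scores to MOS-LQOn is a logistic mapping. A minimal sketch is given below; the constants are those commonly quoted for the ITU-T Rec. P.862.1 mapping function and should be checked against the Recommendation itself before use. The function name is hypothetical.

```python
from math import exp


def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw P.862 score (about -0.5 to 4.5) to MOS-LQO (about 1.02 to 4.55).

    Constants as commonly quoted for ITU-T Rec. P.862.1; verify before relying
    on them.
    """
    return 0.999 + (4.999 - 0.999) / (1.0 + exp(-1.4945 * raw_score + 4.6607))


if __name__ == "__main__":
    for raw in (-0.5, 1.0, 2.5, 3.5, 4.5):
        print(f"raw PESQ {raw:4.1f} -> MOS-LQOn {pesq_raw_to_mos_lqo(raw):.2f}")
```

The mapping is monotonic, so it changes the scale of the PESQ output but not the ranking of test conditions.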

4. Experimental results

In this section, we present experimental results for both subjective and objective assessment (PESQ and POLQA models), as well as a detailed analysis and comparison of both.


4.1. Experimental results for subjective assessment

In Figure 11, we summarise the results of the subjective listening test averaged over the 4 different voices involved in the experiment, as described in Section 3. It can be seen that the impact of test conditions No. 1–12 (playout adjustments) on average MOS-LQSn scores relative to the reference samples is quite limited. On the other hand, we can also see that the subjects gave surprisingly low MOS-LQSn scores to all samples, including the reference samples involved in the test, i.e. the average values oscillate around 3.6 MOS. This value is quite low considering that the samples contain either no degradations (reference samples) or very moderate degradations in a narrowband context. This result warrants further analysis beyond the scope of this paper, but one possibility is that the subjects' opinion has been affected by their previous long-term experience with wideband telephony (wideband speech), though this was not validated. Such experience would alter their internal reference to wideband speech (extended frequency range resulting in higher speech quality) and thus explain the lower scores given to the narrowband samples involved in the test. One other possibility relates to the fact that the range of conditions (impairments) introduced into the speech samples was quite limited. This issue is discussed in more detail in [50]. As evident, the impact of varying the extent of playout adjustments across all test conditions was very small (insignificant). However, one characteristic of note that emerged is the small impact of the location of the adjustments on scores, i.e. most of the test conditions using variant B obtained slightly lower scores than the same conditions with variant A. The biggest difference (0.2 MOS) between both investigated variants (location impact) was found for test condition No. 10. As discussed and evident from Figures 3–10, variant B adjustments were designed to be towards the end of the sample, so one explanation is that the distortions presented closer to the end of the sample were a bit more annoying for the subjects than those presented in the first half of the sample. In principle, this result has echoes of the so-called recency effect reported in the literature (e.g. [51, 52, 53]). However, such tests used samples longer than 60 seconds. In any event, and as discussed later, the differences are not significant in the context of the confidence interval.

One three-way analysis of variance (ANOVA) test was conducted on the subjective results using test condition, voice and variant (location) as fixed factors (Table V). It should be noted here that the voice factor is a combination of voice and content. The highest F-ratio was determined for the voice factor (F = 73.12, p < 0.001). The effect of voice was found to be highly statistically significant. Moreover, the test condition factor appeared to have a weaker effect on quality than the voice factor, with F = 0.82, p = 0.635. Furthermore, the effect of test condition was not statistically significant, whereas the voice factor was. The last factor investigated in the ANOVA test was the variant factor, and it turns out to have a weaker effect on quality than the voice factor on its own and to not be statistically significant, similar to the test condition factor (F = 1.07, p = 0.3032). Regarding interactions of all the involved factors, the results show that none of them is statistically significant.

Table I. Playout adjustments (in ms) applied to speech samples. Σ: absolute sum of adjustments.

Test condition   1st    2nd    3rd    4th    Σ

Ref              0      0      0      0      0
1                2      −2     3      −3     10
2                4      −4     −4     4      16
3                3      −3     −6     6      18
4                5      −5     −5     5      20
5                3      −6     −7     10     26
6                16     −12    −8     4      40
7                10     −17    −6     13     46
8                10     −15    −10    15     50
9                8      −23    −3     18     52
10               5      10     −30    15     60
11               −15    15     −15    15     60
12               −25    22     −8     11     66

[Figure 11: plot of average MOS-LQSn (vertical axis, 1 to 5) against test conditions Ref and No. 1 to No. 12 (horizontal axis) for Variant A and Variant B.]

Figure 11. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQSn. The vertical bars show 95% CI computed over 120 MOS-LQSn values (30 subjective scores per sample for 4 samples).
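The Σ column of Table I (the absolute sum of the four adjustments per test condition) can be reproduced with a few lines of code. The sketch below transcribes the adjustment values from Table I; the dictionary layout and function names are illustrative only. It also reports the net time shift, which is not tabulated in the paper.

```python
# Playout adjustments (ms) per test condition, transcribed from Table I.
TABLE_I = {
    "Ref": (0, 0, 0, 0),
    "1": (2, -2, 3, -3),
    "2": (4, -4, -4, 4),
    "3": (3, -3, -6, 6),
    "4": (5, -5, -5, 5),
    "5": (3, -6, -7, 10),
    "6": (16, -12, -8, 4),
    "7": (10, -17, -6, 13),
    "8": (10, -15, -10, 15),
    "9": (8, -23, -3, 18),
    "10": (5, 10, -30, 15),
    "11": (-15, 15, -15, 15),
    "12": (-25, 22, -8, 11),
}


def absolute_sum(adjustments):
    """Sum of the magnitudes of the adjustments (the sigma column of Table I)."""
    return sum(abs(a) for a in adjustments)


def net_shift(adjustments):
    """Signed sum: the overall change in sample duration in ms."""
    return sum(adjustments)


if __name__ == "__main__":
    for condition, adjustments in TABLE_I.items():
        print(f"{condition:>3s}  sigma = {absolute_sum(adjustments):2d} ms"
              f"  net = {net_shift(adjustments):+d} ms")
```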

To summarise, the results of the ANOVA test revealed that subjects were more sensitive to the voice than to all the test conditions and variants, and no statistically significant interactions between all the investigated factors were found (assuming no impact of content due to carefully chosen speech samples). It should be noted here that a variability caused by content is considered one of the sampling factors, as defined in the Handbook of subjective testing practical procedures [16]. The ANOVA test also revealed that the small differences between quality scores of variants A and B reported previously are not statistically significant. In other words, the results of the ANOVA test confirmed our assertion that the differences are not significant in the context of the confidence interval.

As is often reported in the literature, some impact relating to the voice effect was expected in this experiment, but not to such an extent. A diagnostic analysis of the test data revealed that one of the voices (1st male) was liked more than the others (i.e. over all conditions, this voice was rated on average approx. 0.4 MOS-LQSn higher than the second male voice and approx. 0.6 MOS-LQSn higher than the female voices). It is also worth noting that both male voices obtained higher scores than the female voices.

As can be seen above the impact of playout adjustmentsis not statistically significant This fact raises a questionas to whether the impact would be higher if the Degra-dation Category Rating (DCR) testing approach was de-ployed We considered this during the research designphase We carried out limited DCR subjective testing (4experts involved) and the subjects noted no degradationcaused by time-shifting (playout adjustments) On this ba-sis we came to the conclusion that the introduced impair-ments would not be noticed by subjects in a DCR test Forthat reason we decided to use an ACR test

Furthermore, it should be noted that in telephony, subjects have no access to the speech from their conversational partner, and thus ACR testing is commonly used in telephony speech quality assessment.

To the best of our knowledge, the results presented in this section are a first proof of the assertion published in the literature [1, 2, 10] that small and frequent silence period adjustments, typical of VoIP playout algorithms, do not have a noticeable effect on listening quality perceived by the end user.

It is interesting at this juncture to compare our results with the subjective results of Voran [11] and Hoene et al. [6]. Voran introduced pause/jump impairments at the rate of 1 to 4 per 3 second sentence, with pause/jump magnitudes of 30, 60 and 120 ms. We introduced 4 impairments in an 8 second sample, based on our observations of actual speech and using adjustments which were derived from realistic network delay models and real playout algorithms. As a result, our adjustments were typically much smaller (the largest was 30 ms). Voran noted that for 1 pause/jump adjustment of 30 ms, MOS scores fell by 0.2, whereas we found no significant drop in MOS scores as the extent of adjustments increased, even for conditions 10–12, where the magnitude of some of the adjustments was similar to Voran's at 20–30 ms. One key distinction, as stated before, is that Voran applied such impairments to active speech, whereas all our adjustments were made to silence periods. Furthermore, in the case of Voran's pause impairment, PLC was used. Finally, Voran reported a significant impact of a very large single adjustment (silence removal) of 430 ms. The magnitude of adjustments introduced in our samples was much smaller and, as reported above, no significant subjective impact was found. Hoene et al. in [6] introduced very large (in comparison to adjustments introduced by jitter buffers) single adjustments into active speech and also noted a significant impact, consistent with PESQ.

4.2. Experimental results for objective assessment

Figure 12 depicts the results of objective assessment done by PESQ (MOS-LQOn (PESQ)) using the same test conditions and speech samples. We can observe that the severity of the test conditions (playout adjustments) has a relatively big impact on the predicted MOS values. In summary, the MOS scores decrease as adjustments increase, i.e. as network instability increases. This is interesting in the context of findings by Voran [11] that PESQ does not register any impact arising from a single 430 ms silence period adjustment (silence removal). The results of Hoene et al. from the PESQ analysis published in [5] are somewhat similar to Voran's, showing no impact for single adjustments of up to approx. 320 ms. One possible explanation is that in our tests we introduced frequent and small adjustments rather than the single and large adjustments introduced by Voran and Hoene et al. We speculate that PESQ had difficulties with our adjustments profile during the time-alignment process. As detailed in Section 2.2, the frequency and extent of adjustments is impacted greatly by VAD settings, but our test design of 4 adjustments in 8 seconds is not atypical of VoIP.

Figure 12. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by PESQ. The vertical bars show the 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: x-axis Test Conditions (Ref, No. 1 to No. 12); y-axis MOS-LQOn (PESQ), 1 to 5; series Variant A and Variant B.]

One positive correlation between PESQ and the subjective test results is that the biggest difference between variant A and B (0.47 MOS) was reported for test condition No. 10. It should also be noted that there is a significant difference between the scores obtained for conditions No. 10 and 11. This is interesting in that, whilst the absolute sum of the adjustments for both conditions is the same, the individual adjustments are different. Test condition No. 10 represents very different adjustments, varying from 5 to −30 ms. On the other hand, condition No. 11 contains adjustments of similar magnitude: two adjustments of −15 ms and two adjustments of 15 ms. The second one (condition No. 11) obtained much lower scores in both cases (Variant A and B). As such, it seems that PESQ is better able to cope with large adjustments (30 ms was the largest of all conditions). In contrast, the opposite results were obtained for the subjective data, i.e. listeners scored test condition No. 10 lower than No. 11. However, the differences between the subjective results obtained for test conditions No. 10 and 11 are much smaller than those obtained for the objective results and of the same order as the confidence interval (see Figure 11).

As with the subjective results, lower MOS-LQOn (PESQ) scores have been reported for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by PESQ ranged from 3.665 to 4.55 MOS for variant A and from 3.3 to 4.55 MOS for variant B. It can be clearly seen in Figure 12 that the impact of adjustment location is noticeable for objective scores predicted by PESQ, especially for higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by PESQ revealed that the voice impact was much weaker than that reported above for the subjective data. In fact, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. For that reason, such models are sometimes called impairment or degradation models. However, it seems that there is some interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, because the deviation of the MOS-LQOn (PESQ) scores between the test conditions is different for all 4 voices involved in the test. A much weaker and not statistically significant interaction of test condition and voice was obtained for the subjective data, as shown in Table V.

Figure 13 shows the results of objective assessment done by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of the POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of the PESQ predictions. Nonetheless, it seems that POLQA was impacted more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No. 7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend in the POLQA scores, as has been done for PESQ above.

Regarding the biggest difference between variant A and B for test condition No. 10, reported above for both the subjective scores and the objective scores predicted by PESQ, it is worth noting that this effect has not been captured by POLQA at all. In other words, the scores predicted by POLQA for variant A and B of test condition No. 10 were very similar (0.02 MOS difference).

Moreover, the scores predicted by POLQA largely do exhibit the same behaviour as the scores obtained from the subjective test from a location perspective (Variant A and B). In other words, it has been reported above that PESQ and the subjects involved in the subjective test provided lower MOS scores for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by POLQA ranged from 4.10 to 4.43 MOS for variant A and from 4.08 to 4.43 MOS for variant B. This contrasts with the results for PESQ, where adjustment location had a significant impact. It can be clearly seen in Figure 13 that the adjustment location plays a less important role here, except for some of the higher magnitudes of the investigated playout adjustments.

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show the 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: x-axis Test Conditions (Ref, No. 1 to No. 12); y-axis MOS-LQOn (POLQA), 1 to 5; series Variant A and Variant B.]

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact was much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. Regarding the interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, this effect was also obtained for the POLQA model, but only for female voices.

4.3. Comparison between subjective and objective quality scores

In the following subsection, subjective MOS values (MOS-LQSn) are compared to the predictions provided by both PESQ and POLQA (MOS-LQOn (PESQ, POLQA)). The comparison is performed for all experimental conditions, i.e. all combinations of voice, test conditions and both investigated location variants. However, the MOS-LQSn values will have been influenced by the choice of conditions in the actual experiment. In order to account for such influences, model predictions are commonly transformed to the range of conditions that are part of the respective test [54]. This may be done, for example, by using a monotonic 3rd order mapping function, presuming such a function can be found.

The performance of the PESQ and POLQA models is quantified in terms of the Pearson correlation coefficient R, the respective root mean square error (rmse) and the epsilon-insensitive root mean square error (rmse*), as [55, 56]

$$R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \qquad (3)$$

and

$$\mathrm{rmse} = \sqrt{\frac{1}{N-d} \sum_{i=1}^{N} (X_i - Y_i)^2} \qquad (4)$$

with $X_i$ the subjective MOS value for stimulus i, $Y_i$ the objective (predicted) MOS value for stimulus i, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, N the number of stimuli considered in the comparison, and d the number of degrees of freedom provided by the mapping function (d = 4 in the case of a 3rd order mapping function, d = 1 in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

$$\mathrm{Perror}_i = \max\left(0,\ |X_i - Y_i| - \mathrm{ci}_{95,i}\right) \qquad (5)$$

where $\mathrm{ci}_{95,i}$ represents the 95% confidence interval and is defined by [56]

$$\mathrm{ci}_{95,i} = t(0.05, M)\,\frac{\delta_i}{\sqrt{M}} \qquad (6)$$

where M denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of the subjective scores for stimulus i. The final epsilon-insensitive root mean square error is calculated as usual, but based on Perror from formula (5):

$$\mathrm{rmse}^{*} = \sqrt{\frac{1}{N-d} \sum_{i=1}^{N} \mathrm{Perror}_i^{2}} \qquad (7)$$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* considers only differences related to an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of the MOS is taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and both rmse and rmse* would be 0.0.
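The three measures defined in formulas (3) to (7) can be sketched in a few lines of Python. This is only an illustrative sketch, not the evaluation code used in the study: the function names are hypothetical, the standard deviation is assumed to be the sample standard deviation, and the critical value t(0.05, M) is assumed to be looked up by the caller (e.g. from a t-table) and passed in as `t_crit`.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient R, formula (3)."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)) * math.sqrt(
        sum((yi - my) ** 2 for yi in y)
    )
    return num / den


def rmse(x, y, d=1):
    """Root mean square error, formula (4); d = 1 means no regression."""
    n = len(x)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / (n - d))


def ci95(scores, t_crit):
    """95% confidence interval of one stimulus, formula (6).

    scores: the M individual subjective scores for the stimulus.
    t_crit: t(0.05, M), supplied by the caller (assumption of this sketch).
    """
    return t_crit * stdev(scores) / math.sqrt(len(scores))


def rmse_star(x, y, ci, d=1):
    """Epsilon-insensitive rmse, formulas (5) and (7).

    ci: one 95% confidence interval per stimulus.
    """
    n = len(x)
    perror = [max(0.0, abs(xi - yi) - ci_i) for xi, yi, ci_i in zip(x, y, ci)]
    return math.sqrt(sum(p ** 2 for p in perror) / (n - d))
```

Because Perror is never larger than the absolute difference it is derived from, `rmse_star` can never exceed `rmse` for the same data and the same d, which matches the pattern seen in Tables II to IV.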

All R, rmse and rmse* values are calculated for the raw (non-regressed) MOSn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined), and both (the regressed and the non-regressed MOSn predictions) are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*) are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (Variant A and B). The correlation coefficient is positive, though very low (R = 0.17), for variant A and negative for variant B (R = −0.15). As normal, positive correlation indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases, the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed), per sample. [Scatter plot: x-axis MOS-LQSn, y-axis MOS-LQOn (PESQ), both 1 to 5; series Variant A and Variant B.]

Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant   R       rmse    rmse*
A         0.17    0.639   0.367
B         −0.15   0.690   0.424

The 3rd order regression, as recommended in [54], leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outlier influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried 2nd and 1st order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this data set.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little difference between location variants intra-speaker, significant inter-speaker variability, and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12, and significant inter-speaker variation.

Regarding the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed), per sample. [Scatter plot: x-axis MOS-LQSn, y-axis MOS-LQOn (POLQA), both 1 to 5; series Variant A and Variant B.]

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed), per sample. [Scatter plot: x-axis MOS-LQSn, y-axis MOS-LQOn (POLQA), both 1 to 5; series Variant A and Variant B.]

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant   R       rmse    rmse*
A         0.30    0.783   0.448
B         0.47    0.806   0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant   R       rmse    rmse*
A         0.30    0.265   0.243
B         0.47    0.270   0.217

Figure 17. Dominant experimental factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals. [Three panels: Subjective results (MOS-LQS), PESQ (MOS-LQOn), POLQA (MOS-LQOn); x-axis Speaker/Variant (M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B); y-axis 3 to 5.]

Moving to POLQA, the results show little intra-speaker variation across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except for Female 1) and much less inter-speaker variation. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in Tables III and IV, and the rmse data for POLQA (after regression, with 1st order polynomial regression applied) is also much better than for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.
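The search for a monotonically increasing mapping function described in this subsection can be illustrated with a small sketch. It is a simplified illustration under stated assumptions rather than the procedure used in the study: only the 1st order fit is implemented (closed-form least squares), the monotonicity check samples the derivative on a grid over the MOS range 1 to 5, and the function names and example scores are hypothetical.

```python
def fit_first_order(objective, subjective):
    """Least-squares line: subjective ~ a * objective + b. Returns (a, b)."""
    n = len(objective)
    mo = sum(objective) / n
    ms = sum(subjective) / n
    sxx = sum((o - mo) ** 2 for o in objective)
    sxy = sum((o - mo) * (s - ms) for o, s in zip(objective, subjective))
    a = sxy / sxx
    b = ms - a * mo
    return a, b


def is_monotonically_increasing(coeffs, lo=1.0, hi=5.0, steps=400):
    """Check a polynomial mapping on [lo, hi] by sampling its derivative.

    coeffs: polynomial coefficients, lowest order first (c0 + c1*v + ...).
    Returns True only if the derivative is positive at every grid point.
    """
    def derivative(v):
        return sum(k * c * v ** (k - 1) for k, c in enumerate(coeffs) if k > 0)

    return all(
        derivative(lo + (hi - lo) * i / steps) > 0 for i in range(steps + 1)
    )


# Hypothetical scores, chosen so that the objective and subjective
# values move in opposite directions.
objective = [3.0, 3.5, 4.0, 4.5]
subjective = [3.6, 3.5, 3.3, 3.2]
a, b = fit_first_order(objective, subjective)
usable = is_monotonically_increasing([b, a])
```

For a 1st order fit, the sign of the slope equals the sign of the correlation coefficient, so a negatively correlated data set such as the hypothetical one above yields a monotonically decreasing line and `usable` is false. A higher order fit would be checked with the same `is_monotonically_increasing` helper once its coefficients are available.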


Table V. Summary of ANOVA test conducted on the MOS-LQSn's.

Effect                 SS        df     MS        F       p
Test condition (TC)    9.23      12     0.7693    0.82    0.6350
Voice                  207.02    3      69.0073   73.12   0.0000
Variant                1.01      1      1.0051    1.07    0.3022
TC × Voice             18.21     36     0.5059    0.54    0.9895
TC × Variant           4.68      12     0.3899    0.41    0.9593
Voice × Variant        0.68      3      0.2282    0.24    0.8672
Error                  2880.37   3052   0.9438
Total                  3121.2    3119

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12), and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may also have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and on listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ, POLQA)). Moreover, the accuracy of both the PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. The five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much smaller. Note that the research of both Voran and Hoene et al. showed that single adjustments (430 ms in the case of Voran, approximately 0–320 ms in the case of Hoene et al.) were disregarded by PESQ.

Regarding question 3 and PESQ, a comparison of the subjective assessments and the predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although the research of Hoene et al. [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for the frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either, during its design and integration phase, for the frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and the impact on objective and subjective listening quality scores. The results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and the objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that whilst some relationship exists, it is both much less noticeable and less characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, the results were much closer to the subjective scores (insignificant), except for Female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research, and which are worthy of further investigation, include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or the low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though the difference was not statistically significant. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results, but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.
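As a quick consistency check on Table V, each F value should equal the effect's mean square (MS) divided by the error mean square. The sketch below, with hypothetical names, recomputes the F ratios from the MS column as transcribed from the table. The p values are not recomputed, and agreement is only expected to within rounding because the tabulated MS and F values are themselves rounded.

```python
# Error mean square, transcribed from the "Error" row of Table V.
MS_ERROR = 0.9438

# effect name -> (MS from Table V, F as reported in Table V)
TABLE_V = {
    "Test condition (TC)": (0.7693, 0.82),
    "Voice": (69.0073, 73.12),
    "Variant": (1.0051, 1.07),
    "TC x Voice": (0.5059, 0.54),
    "TC x Variant": (0.3899, 0.41),
    "Voice x Variant": (0.2282, 0.24),
}


def f_ratio(ms_effect, ms_error=MS_ERROR):
    """F statistic of one effect: effect mean square over error mean square."""
    return ms_effect / ms_error


def check_table(table=TABLE_V, tolerance=0.01):
    """Return, per effect, whether the recomputed F matches the reported F."""
    return {
        name: abs(f_ratio(ms) - f_reported) < tolerance
        for name, (ms, f_reported) in table.items()
    }
```

Only the Voice effect produces a large F ratio (about 73), which is in line with the observation in Section 4.1 that subjects were more sensitive to the voice than to the test conditions and variants.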

Acknowledgement

Andrew Hines thanks Google Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal of Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J G Beerends A P Hekstra A W Rix M P Hollier Per-ceptual evaluation of speech quality (PESQ) - The new ITUstandard for objective measurement of perceived speechquality Part II psychoacoustic model J Audio Eng Soc50 (2002) 765ndash778

[24] J G Beerends C Schmidmer J Berger M Obermann RUllman J Pomy M Keyhl Perceptual objective listeningquality assessment (POLQA) The third generation ITU-Tstandard for end-to-end speech quality measurement PartI Temporal alignment J Audio Eng Soc 61 (2013) 366ndash384

[25] J G Beerends C Schmidmer J Berger M Obermann RUllman J Pomy M Keyhl Perceptual objective listeningquality assessment (POLQA) The third generation ITU-Tstandard for end-to-end speech quality measurement PartII Perceptual model J Audio Eng Soc 61 (2013) 385ndash402

[26] ITU-T Rec P861 Objective quality measurement of tele-phone-band (300ndash3400Hz) speech codecs InternationalTelecommunication Union Geneva Switzerland 1998

[27] ITU-T Rec P862 Perceptual evaluation of speech qual-ity (PESQ) An objective method for end-to-end speechquality assessment of narrow-band telephone networks andspeech codecs International Telecommunication UnionGeneva Switzerland 2001

[28] ITU-T Rec P863 Perceptual objective listening qual-ity assessment International Telecommunication UnionGeneva Switzerland 2011

[29] D-S Kim ANIQUE An auditory model for single-endedspeech quality estimation IEEE Transaction on Speech andAudio Processing 13 (2005) 821ndash831

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in copying with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] ITU-T Del. Contr.: Study of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D.123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis). ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v1.0. SwissQual AG (Author: Jens Berger). ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.


Figure 3. Demonstration of how playout adjustments were applied to the 1st Female speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 4. Demonstration of how playout adjustments were applied to the 1st Female speech sample - Variant B. Arrows indicate where adjustments were placed.

Figure 5. Demonstration of how playout adjustments were applied to the 2nd Female speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 6. Demonstration of how playout adjustments were applied to the 2nd Female speech sample - Variant B. Arrows indicate where adjustments were placed.

a pair of utterances with a small pause between the utterances. Two male and two female speakers uttering different sentences were included in the stimuli. The speech samples used (source samples available in the database) were 8 seconds in length and stored in 16-bit, 8000 Hz linear PCM.

Each speech sample was modified by inserting and removing silence periods to reflect the adjustments as specified above. The adjustments were a mix of positive and negative adjustments (adding and removing silence periods), as shown in Table I.

As a further experimental variable, the set of four adjustments was applied to each sample in two different locations (referred to hereafter as variant A and B). The only distinction between variant A and B is that the impairments in variant B were applied in the latter part of each sample. All the adjustments were made using a free sound editor. Figures 3–10 illustrate how the 4 playout adjustments were applied to all speech samples involved in the experiment in 2 different places (Variant A and B).

Overall, this procedure produced 96 speech samples (4 voices x 12 test conditions x 2 variants).
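The sample modification described above can be illustrated with a minimal sketch. It is not the procedure used in the experiment (the adjustments there were made by hand in a sound editor); the function names, the insertion positions and the use of digital silence (zero-valued samples) are assumptions made purely for illustration. Only the sampling rate, the sample length and the adjustment values of test condition No. 10 are taken from the text and Table I.

```python
SAMPLE_RATE_HZ = 8000  # narrowband sampling rate stated in the text


def ms_to_samples(ms, rate=SAMPLE_RATE_HZ):
    """Convert an adjustment magnitude in milliseconds to a sample count."""
    return round(abs(ms) * rate / 1000)


def apply_adjustment(samples, position, adjustment_ms, rate=SAMPLE_RATE_HZ):
    """Insert (positive) or remove (negative) silence at `position`.

    Assumes `position` lies inside a silence period that is long enough
    for the requested removal; this is not checked here.
    """
    n = ms_to_samples(adjustment_ms, rate)
    if adjustment_ms >= 0:
        return samples[:position] + [0] * n + samples[position:]
    return samples[:position] + samples[position + n:]


def absolute_sum(adjustments_ms):
    """Absolute sum of adjustments (the last column of Table I)."""
    return sum(abs(a) for a in adjustments_ms)


# Test condition No. 10 from Table I (values in ms).
condition_10 = [5, 10, -30, 15]

# Stand-in for an 8 s sample; real speech would replace this list.
signal = [0] * (8 * SAMPLE_RATE_HZ)

# Hypothetical adjustment positions (sample indices), one per adjustment.
positions = [8000, 24000, 40000, 56000]

adjusted = signal
# Apply from the last position to the first so earlier indices stay valid.
for pos, adj in sorted(zip(positions, condition_10), reverse=True):
    adjusted = apply_adjustment(adjusted, pos, adj)
```

Because the four adjustments of condition No. 10 cancel out (5 + 10 − 30 + 15 = 0 ms), the adjusted signal in this sketch has the same length as the original, while the absolute sum of adjustments is 60 ms.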

3.3. Speech quality assessment

The speech quality assessment process was divided into two parts, namely subjective assessment (listening test) and objective assessment (using the PESQ and POLQA models). Both assessment procedures are described in more detail below.

The ACR subjective listening test was performed in May 2012 in accordance with ITU-T Recommendation P.800 [15]. In every case, up to 2 listeners were seated in a small, acoustically treated listening room with background noise well below 20 dB SPL (A). Altogether, 30 naïve (non-expert) listeners (16 male, 14 female, aged 20–55 years, mean 34.43 years) participated in the test. All subjects were Irish nationals whose first language was English. The subjects were remunerated for their efforts. The samples (96 degraded samples + 4 reference samples) were played out in a random order using high-quality studio equipment and presented diotically over Sennheiser HD 455 headphones (presentation level 73 dB SPL (A)). The opinion scores, from 1 (bad) to 5 (excellent), were averaged to obtain MOS-Listening Quality Subjective narrowband (MOS-LQSn) values for each sample.

Figure 7. Demonstration of how playout adjustments were applied to the 1st Male speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 8. Demonstration of how playout adjustments were applied to the 1st Male speech sample - Variant B. Arrows indicate where adjustments were placed.

Figure 9. Demonstration of how playout adjustments were applied to the 2nd Male speech sample - Variant A. Arrows indicate where adjustments were placed.

Figure 10. Demonstration of how playout adjustments were applied to the 2nd Male speech sample - Variant B. Arrows indicate where adjustments were placed.

In the next step, the 100 samples (the 96 degraded samples using test conditions No. 1–12 and the 4 reference samples of the Ref test condition) were compared to their respective reference samples using both the PESQ model described in ITU-T Rec. P.862 [22, 23, 27] and the POLQA model described in P.863 [24, 25, 28] in order to obtain objective listening quality scores. In the case of PESQ, the output (raw PESQ scores) was converted to MOS-Listening Quality Objective narrowband (MOS-LQOn) values by the equation defined in [49].
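For orientation, the conversion from raw PESQ scores to MOS-LQOn can be sketched as follows. The logistic form and its constants are quoted here from ITU-T Rec. P.862.1 as we recall them and should be checked against the Recommendation before use; the function name is our own.

```python
import math


def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw P.862 score (about -0.5 to 4.5) to MOS-LQO.

    Constants as recalled from ITU-T Rec. P.862.1; verify against the
    Recommendation before relying on them.
    """
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


# Map the two ends of the raw score range and a mid-range value.
for raw in (-0.5, 2.0, 4.5):
    print(f"raw PESQ {raw:+.1f} -> MOS-LQO {pesq_raw_to_mos_lqo(raw):.3f}")
```

Under these constants the mapping is monotonically increasing and saturates at about 4.55 MOS for the highest raw score of 4.5, so an unimpaired narrowband sample cannot be mapped above that value.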

4. Experimental results

In this section, we present experimental results for both subjective and objective assessment (PESQ and POLQA models), as well as a detailed analysis and comparison of both.


4.1. Experimental results for subjective assessment

In Figure 11 we summarise the results of the subjective listening test, averaged over the 4 different voices involved in the experiment, as described in Section 3. It can be seen that the impact of test conditions No. 1–12 (playout adjustments) on average MOS-LQSn scores relative to the reference samples is quite limited. On the other hand, we can also see that the subjects gave surprisingly low MOS-LQSn scores to all samples, including the reference samples, i.e. the average values oscillate around 3.6 MOS. This value is quite low considering that the samples contain either no degradations (reference samples) or very moderate degradations in a narrowband context. This result warrants further analysis beyond the scope of this paper, but one possibility is that the subjects' opinion has been affected by their previous long-term experience with wideband telephony (wideband speech), though this was not validated. Such experience would alter their internal reference to wideband speech (extended frequency range resulting in higher speech quality) and thus explain the lower scores given to the narrowband samples involved in the test. Another possibility relates to the fact that the range of conditions (impairments) introduced into the speech samples was quite limited. This issue is discussed in more detail in [50].

As is evident, the impact of varying the extent of playout adjustments across all test conditions was very small (insignificant). However, one characteristic of note is the small impact of the location of the adjustments on the scores, i.e. most of the test conditions using variant B obtained slightly lower scores than the same conditions with variant A. The biggest difference (0.2 MOS) between the two variants (location impact) was obtained for test condition No. 10. As discussed and evident from Figures 3–10, variant B adjustments were designed to be towards the end of the sample, so one explanation is that distortions presented closer to the end of the sample were a bit more annoying for the subjects than those presented in the first half of the sample. In principle, this result has echoes of the so-called recency effect reported in the literature (e.g. [51, 52, 53]). However, such tests used samples longer than 60 seconds. In any event, and as discussed later, the differences are not significant in the context of the confidence interval.

One three-way analysis of variance (ANOVA) test was conducted on the subjective results using test condition, voice and variant (location) as fixed factors (Table V). It should be noted here that the voice factor is a combination of voice and content. The highest F-ratio was obtained for the voice factor (F = 73.12, p < 0.001), whose effect was found to be highly statistically significant. The test condition factor appeared to have a much weaker effect on quality than the voice factor (F = 0.82, p = 0.635), and this effect was not statistically significant. The last factor investigated in the ANOVA test was the variant factor; it also turned out to have a weaker effect on quality than the voice factor and, like the test condition factor, not to be statistically significant (F = 1.07, p = 0.3022). Regarding interactions of the involved factors, the results show that none of them is statistically significant.

Table I. Playout adjustments (in ms) applied to speech samples. Σ: absolute sum of adjustments.

Test condition   1st   2nd   3rd   4th    Σ
Ref                0     0     0     0    0
1                  2    −2     3    −3   10
2                  4    −4    −4     4   16
3                  3    −3    −6     6   18
4                  5    −5    −5     5   20
5                  3    −6    −7    10   26
6                 16   −12    −8     4   40
7                 10   −17    −6    13   46
8                 10   −15   −10    15   50
9                  8   −23    −3    18   52
10                 5    10   −30    15   60
11               −15    15   −15    15   60
12               −25    22    −8    11   66

Figure 11. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQSn. The vertical bars show 95% CI computed over 120 MOS-LQSn values (30 subjective scores per sample for 4 samples). [Plot: x-axis Test Conditions (Ref, No. 1–No. 12); y-axis MOS-LQSn (1–5); series Variant A and Variant B.]

To summarise, the results of the ANOVA test revealed that subjects were more sensitive to the voice than to the test conditions and variants, and that there were no statistically significant interactions between the investigated factors (assuming no impact of content due to carefully chosen speech samples). It should be noted here that variability caused by content is considered one of the sampling factors, as defined in the Handbook of subjective testing practical procedures [16]. The ANOVA test also revealed that the small differences between quality scores of variant A and B reported previously are not statistically significant. In other words, the results of the ANOVA test confirmed our assertion that the differences are not significant in the context of the confidence interval.

As is often reported in the literature, some impact relating to the voice effect was expected in this experiment, but not to such an extent. A diagnostic analysis of the test data revealed that one of the voices (1st male) was liked more than the others (i.e. over all conditions, this voice was rated on average approx. 0.4 MOS-LQSn higher than the second male voice and approx. 0.6 MOS-LQSn higher than the female voices). It is also worth noting that both male voices obtained higher scores than the female voices.

As can be seen above, the impact of playout adjustments is not statistically significant. This raises the question of whether the impact would be higher if the Degradation Category Rating (DCR) testing approach were deployed. We considered this during the research design phase. We carried out limited DCR subjective testing (4 experts involved), and the subjects noted no degradation caused by time-shifting (playout adjustments). On this basis, we came to the conclusion that the introduced impairments would not be noticed by subjects in a DCR test. For that reason, we decided to use an ACR test.

Furthermore, it should be noted that in telephony, subjects have no access to the original speech of their conversational partner, and thus ACR testing is commonly used in telephony speech quality assessment.

To the best of our knowledge, the results presented in this section are a first proof of the assertion published in the literature [1, 2, 10] that small and frequent silence period adjustments typical of VoIP playout algorithms do not have a noticeable effect on listening quality perceived by the end user.

It is interesting at this juncture to compare our results with the subjective results of Voran [11] and Hoene et al. [6]. Voran introduced pause/jump impairments at the rate of 1 to 4 per 3 second sentence, with pause/jump magnitudes of 30, 60 and 120 ms. We introduced 4 impairments in an 8 second sample, based on our observations of actual speech and using adjustments which were derived from realistic network delay models and real playout algorithms. As a result, our adjustments were typically much smaller (the largest was 30 ms). Voran noted that for 1 pause/jump adjustment of 30 ms, MOS scores fell by 0.2, whereas we found no significant drop in MOS scores as the extent of adjustments increased, even for conditions 10–12, where the magnitude of some of the adjustments was similar to Voran's at 20–30 ms. One key distinction, as stated before, is that Voran applied such impairments to active speech, whereas all our adjustments were made to silence periods. Furthermore, in the case of Voran's pause impairment, PLC was used. Finally, Voran reported a significant impact of a very large single adjustment (silence removal) of 430 ms. The magnitude of adjustments introduced in our samples was much smaller and, as reported above, no significant subjective impact was found. Hoene et al. in [6] introduced very large (in comparison to adjustments introduced by jitter buffers) single adjustments into active speech and also noted a significant impact, consistent with PESQ.

4.2. Experimental results for objective assessment

Figure 12 depicts the results of the objective assessment done by PESQ (MOS-LQOn (PESQ)) using the same test conditions and speech samples. We can observe that the severity of the test conditions (playout adjustments) has a relatively big impact on the predicted MOS values. In summary, the MOS scores decrease as adjustments increase, i.e. as network instability increases. This is interesting in the context of the finding by Voran [11] that PESQ does not register any impact arising from a single 430 ms silence period adjustment (silence removal). The results of the PESQ analysis by Hoene et al. published in [5] are somewhat similar to Voran's, showing no impact for single adjustments of up to approx. 320 ms. One possible explanation is that in our tests we introduced frequent and small adjustments rather than the single and large adjustments introduced by Voran and Hoene et al. We speculate that PESQ had difficulties with our adjustment profile during the time-alignment process. As detailed in Section 2.2, the frequency and extent of adjustments is impacted greatly by VAD settings, but our test design of 4 adjustments in 8 seconds is not atypical of VoIP.

One positive correlation between PESQ and the subjective test results is that the biggest difference between variant A and B (0.47 MOS) was reported for test condition No. 10. It should also be noted that there is a significant difference between the scores obtained for conditions No. 10 and 11. This is interesting in that, whilst the absolute sum of the adjustments for both conditions is the same, the individual adjustments are different. Test condition No. 10 represents very different adjustments, varying from 5 to −30 ms. On the other hand, condition No. 11 contains adjustments of similar magnitude: two adjustments of −15 ms and two adjustments of 15 ms. The latter (condition No. 11) obtained much lower scores in both cases (Variant A and B). As such, it seems that PESQ is better able to cope with large adjustments (30 ms was the largest of all conditions). In contrast, the opposite results were obtained for the subjective data, i.e. listeners scored test condition No. 10 lower than No. 11. However, the differences between the subjective results obtained for test conditions No. 10 and 11 are much smaller than those obtained for the objective results and of the same order as the confidence interval (see Figure 11).

Figure 12. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by PESQ. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: x-axis Test Conditions (Ref, No. 1–No. 12); y-axis MOS-LQOn (PESQ) (1–5); series Variant A and Variant B.]

As with the subjective results, lower MOS-LQOn (PESQ) scores were reported for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by PESQ ranged from 3.665 to 4.55 MOS for variant A and from 3.3 to 4.55 MOS for variant B. It can be clearly seen in Figure 12 that the impact of adjustment location is noticeable for objective scores predicted by PESQ, especially for higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by PESQ revealed that the voice impact was much weaker than that reported above for the subjective data. In fact, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of a voice. For this reason, such models are sometimes called impairment or degradation models. However, it seems that there is some interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, because the deviation of the MOS-LQOn (PESQ) scores between the test conditions is different for all 4 voices involved in the test. A much weaker and not statistically significant interaction of test condition and voice was obtained for the subjective data, as shown in Table V.

Figure 13 shows the results of the objective assessment done by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of the POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of the PESQ predictions. Nonetheless, it seems that POLQA was impacted more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No. 7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend in the POLQA scores as was done for PESQ above.

Regarding the biggest difference between variant A and B for test condition No. 10, reported above for both the subjective scores and the objective scores predicted by PESQ, it is worth noting that this effect was not captured by POLQA at all. In other words, the scores predicted by POLQA for variant A and B of test condition No. 10 were very similar (0.02 MOS difference).

Moreover, the scores predicted by POLQA largely do exhibit the same behaviour as the scores obtained from the subjective test from a location perspective (Variant A and B). In other words, it has been reported above that PESQ and the subjects involved in the subjective test provided lower MOS scores for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by POLQA ranged from 4.10 to 4.43 MOS for variant A and from 4.08 to 4.43 MOS for variant B. This contrasts with the results for PESQ, where adjustment location had a significant impact. It can be clearly seen in Figure 13 that the adjustment location plays a less important role here, except for some of the higher magnitudes of the investigated playout adjustments.

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: x-axis Test Conditions (Ref, No. 1–No. 12); y-axis MOS-LQOn (POLQA) (1–5); series Variant A and Variant B.]

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact was much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of a voice. The interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, was also observed for the POLQA model, but only for female voices.

4.3. Comparison between subjective and objective quality scores

In the following subsection, subjective MOS values (MOS-LQSn) are compared to the predictions provided by both PESQ and POLQA (MOS-LQOn (PESQ, POLQA)). The comparison is performed for all experimental conditions, i.e. all combinations of voice, test conditions and both investigated location variants. However, the MOS-LQSn values will have been influenced by the choice of conditions in the actual experiment. In order to account for such influences, model predictions are commonly transformed to the range of conditions that are part of the respective test [54]. This may be done, for example, by using a monotonic 3rd order mapping function, presuming such a function can be found.

The performance of the PESQ and POLQA models is quantified in terms of the Pearson correlation coefficient R, the respective root mean square error (rmse) and the epsilon-insensitive root mean square error (rmse*), as [55, 56]

$$R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \tag{3}$$


and

$$\mathrm{rmse} = \sqrt{\frac{1}{N-d}\sum_{i=1}^{N} (X_i - Y_i)^2} \tag{4}$$

with $X_i$ the subjective MOS value for stimulus $i$, $Y_i$ the objective (predicted) MOS value for stimulus $i$, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, $N$ the number of stimuli considered in the comparison, and $d$ the number of degrees of freedom provided by the mapping function ($d = 4$ in the case of a 3rd order mapping function, $d = 1$ in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

$$\mathrm{Perror}_i = \max\left(0,\ |X_i - Y_i| - \mathrm{ci95}_i\right) \tag{5}$$

where $\mathrm{ci95}_i$ represents the 95% confidence interval and is defined by [56]

$$\mathrm{ci95}_i = t(0.05, M)\,\frac{\delta_i}{\sqrt{M}} \tag{6}$$

where $M$ denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of the subjective scores for stimulus $i$. The final epsilon-insensitive root mean square error is calculated as usual, but based on Perror from formula (5):

$$\mathrm{rmse}^{*} = \sqrt{\frac{1}{N-d}\sum_{i=1}^{N} \mathrm{Perror}_i^{2}} \tag{7}$$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* considers only differences that fall outside an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of the MOS is thus taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and both rmse and rmse* would be 0.0.
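The three measures in formulas (3)–(7) can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the results in this paper: the function names are our own, the example scores and standard deviation are invented, and the Student's t quantile is passed in as a parameter (2.045 is the two-sided 95% value for 29 degrees of freedom, which would correspond to 30 listeners).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, formula (3)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    return num / den


def rmse(x, y, d=1):
    """Root mean square error, formula (4); d = 1 means no regression."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / (len(x) - d))


def ci95(std_dev, m, t_value):
    """95% confidence interval of one subjective MOS value, formula (6)."""
    return t_value * std_dev / math.sqrt(m)


def rmse_star(x, y, ci, d=1):
    """Epsilon-insensitive rmse, formulas (5) and (7)."""
    perror = [max(0.0, abs(a - b) - c) for a, b, c in zip(x, y, ci)]
    return math.sqrt(sum(p * p for p in perror) / (len(x) - d))


# Invented example values, for illustration only.
subjective = [3.5, 3.6, 3.7, 3.4]
objective = [4.4, 4.5, 4.2, 4.3]
# Assumed standard deviation of 0.9 and 30 scores per stimulus.
intervals = [ci95(0.9, 30, t_value=2.045)] * len(subjective)

print("R     =", round(pearson_r(subjective, objective), 3))
print("rmse  =", round(rmse(subjective, objective), 3))
print("rmse* =", round(rmse_star(subjective, objective, intervals), 3))
```

Since Perror can never exceed the absolute prediction error, rmse* is always less than or equal to rmse for the same data and the same d.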

R, rmse and rmse* are all calculated for the raw (non-regressed) MOS-LQOn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined). Both the regressed and the non-regressed predictions are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*) are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (Variant A and B). The correlation coefficient is positive, though very low (R = 0.17), for variant A and negative for variant B (R = −0.15). As usual, a positive correlation indicates that both variables increase or decrease together, whereas a negative correlation indicates that as one variable increases, the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed), per sample. [Scatter plot: x-axis MOS-LQSn (1–5); y-axis MOS-LQOn (PESQ) (1–5); series Variant A and Variant B.]

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed), per sample. [Scatter plot: x-axis MOS-LQSn (1–5); y-axis MOS-LQOn (POLQA) (1–5); series Variant A and Variant B.]

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed), per sample. [Scatter plot: x-axis MOS-LQSn (1–5); y-axis MOS-LQOn (POLQA) (1–5); series Variant A and Variant B.]

Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant    R       rmse    rmse*
A          0.17    0.639   0.367
B         −0.15    0.690   0.424

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant    R       rmse    rmse*
A          0.30    0.783   0.448
B          0.47    0.806   0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant    R       rmse    rmse*
A          0.30    0.265   0.243
B          0.47    0.270   0.217

The 3rd order regression, as recommended in [54], leads in this case to a mapping function that is non-monotonic and decreasing, whereas the function should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outlier influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried 2nd and 1st order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this data set.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little difference between location variants intra-speaker, significant inter-speaker variability, and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12, and significant inter-speaker variation.

Given the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.

Figure 17. Dominant experimental factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals. [Three panels: Subjective results (MOS-LQS), PESQ (MOS-LQOn) and POLQA (MOS-LQOn); x-axis Speaker/Variant (M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B); y-axis range 3–5.]

Moving to POLQA, the results show little variation intra-speaker across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except Female 1) and much less variation inter-speaker. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in both Tables III and IV, and the rmse data for POLQA after regression (1st order polynomial regression applied) is also much better than that for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.
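A 1st order polynomial mapping of the kind mentioned above can be sketched as an ordinary least-squares fit of the subjective scores on the objective scores, followed by a check that the fitted function is monotonically increasing. The exact fitting procedure used for the results in this paper is not described here, so the code below is an assumption-laden illustration: the function names are our own and the example scores are invented.

```python
def fit_first_order(objective, subjective):
    """Least-squares fit of subjective = slope * objective + intercept."""
    n = len(objective)
    mean_x = sum(objective) / n
    mean_y = sum(subjective) / n
    sxx = sum((x - mean_x) ** 2 for x in objective)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(objective, subjective))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def apply_mapping(objective, slope, intercept):
    """Regressed objective scores on the scale of the subjective test."""
    return [slope * x + intercept for x in objective]


# Invented example values, for illustration only.
objective = [4.10, 4.25, 4.30, 4.43]
subjective = [3.45, 3.55, 3.60, 3.70]

slope, intercept = fit_first_order(objective, subjective)
if slope > 0:
    # A positive slope means the mapping is monotonically increasing.
    regressed = apply_mapping(objective, slope, intercept)
else:
    # A decreasing mapping is not acceptable, so no regressed scores
    # are produced (the situation reported above for PESQ).
    regressed = None
```

For a straight line, monotonicity reduces to the sign of the slope, which is why a data set with a negative correlation between objective and subjective scores cannot yield an acceptable 1st order mapping.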


Table V. Summary of ANOVA test conducted on the MOS-LQSn's.

Effect                SS        df     MS        F       p
Test condition (TC)   9.23      12     0.7693    0.82    0.6350
Voice                 207.02    3      69.0073   73.12   0.0000
Variant               1.01      1      1.0051    1.07    0.3022
TC*Voice              18.21     36     0.5059    0.54    0.9895
TC*Variant            4.68      12     0.3899    0.41    0.9593
Voice*Variant         0.68      3      0.2282    0.24    0.8672
Error                 2880.37   3052   0.9438
Total                 3121.2    3119
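As a consistency check on Table V, the F-ratios can be recomputed from the sums of squares (SS) and degrees of freedom (df), since each mean square is MS = SS/df and each F-ratio is the effect MS divided by the error MS. The sketch below reads the table entries as decimal values; the variable names are our own, and small deviations from the tabulated F-ratios are expected because the SS values are rounded.

```python
# (SS, df, reported F) per effect, read from Table V.
table_v = {
    "Test condition (TC)": (9.23, 12, 0.82),
    "Voice": (207.02, 3, 73.12),
    "Variant": (1.01, 1, 1.07),
    "TC*Voice": (18.21, 36, 0.54),
    "TC*Variant": (4.68, 12, 0.41),
    "Voice*Variant": (0.68, 3, 0.24),
}
error_ss, error_df = 2880.37, 3052

ms_error = error_ss / error_df  # mean square of the error term

f_ratios = {}
for effect, (ss, df, reported_f) in table_v.items():
    f_ratios[effect] = (ss / df) / ms_error
    # Flag any recomputed value that differs from the table by 0.01 or more.
    status = "ok" if abs(f_ratios[effect] - reported_f) < 0.01 else "check"
    print(f"{effect:<20} F = {f_ratios[effect]:7.3f} (reported {reported_f}) {status}")

# The degrees of freedom of all rows should add up to the total of 3119.
total_df = sum(df for _, df, _ in table_v.values()) + error_df
```

The voice factor stands out with an F-ratio roughly two orders of magnitude larger than that of any other effect, which is the basis for the statement that subjects were far more sensitive to the voice than to the test conditions or variants.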

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12) and that the impact grows with the extent of the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment as well.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and on listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ, POLQA)). Moreover, the accuracy of both the PESQ and POLQA models has been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question we report that the impactof frequent and small silence period adjustments (playoutadjustments introduced by jitter buffers in VoIP) on sub-jective listening quality scores is insignificant To the bestof our knowledge the subjective results presented in thispaper are a first proof of the assertion published in the lit-erature [1 2 10] that the playout adjustments introducedby jitter buffers in VoIP scenarios do not have a noticeableeffect on listening quality perceived by the end user

Regarding the second question we report that the inves-tigated impairments (playout adjustments introduced byjitter buffers) have a significant impact on objective listen-ing MOS scores predicted by PESQ model whereas theimpact on POLQA though present was much less Notethat both Voranrsquos and Hoenersquos et al research showed thatsingle adjustments (430ms in the case of Voran 0-320ms(approximately) in the case of Hoene et al) were found tobe disregarded by PESQ Regarding question 3 and PESQ

a comparison of the subjective assessments and predic-tions provided by PESQ has shown that PESQ is not ableto accurately predict the impact of frequent adjustmentsintroduced by VoIP jitter buffers on listening quality per-ceived by the end user Although Hoenersquos et al research[6] has shown that PESQ model provides relatively accu-rate predictions for adjustments in active speech our resultsuggest that PESQ performance for multiple small adjust-ments within silences is inaccurate It has to be empha-sized here that the PESQ model was not explicitly verifiedduring its integration and characterization phase for fre-quent time shifting (playout adjustments) that results fromVoIP applications with adaptive buffering over congestednetworks As such our research represents a somewhat out-of-domain use case for this model

Regarding question 3 and POLQA our results showthat POLQA is noticeably better at predicting the subjec-tive scores It has to be emphasized here that the POLQAmodel was not explicitly verified either during its designand integration phase for frequent time shifting (play-out adjustments) that results from VoIP applications withadaptive buffering over congested networks As such ourresearch represents a somewhat out-of-domain use case forthis model It should also be noted that the POLQA resultspresented in this paper are also a part of characterizationphase of POLQA model (and will be published in an ap-plication guide of P863 in very limited form)

Question 4 sought to determine a relationship betweenthe magnitude of adjustments and impact on objective andsubjective listening quality scores Results indicate an in-significant impact on subjective scores in this study Incontrast we report a strong relationship between the extentof adjustments and objective scores predicted by PESQ Inparticular the impact of the investigated impairment in-creases with its extent Regarding POLQA we report thatwhilst some relationship exists it is both much less notice-able and characterisable

Addressing the last question the impact of the positionin the sample where adjustments are made is insignificantfor subjective scores On the other hand the effect of theposition was found to be both noticeable and consistent forobjective scores predicted by PESQ model especially forhigher magnitudes of the investigated playout adjustmentsRegarding POLQA however results were much closer tosubjective scores (insignificant) except for female 1


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research and which are worthy of further investigation include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or the low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though not statistically significantly so. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results, but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google, Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal of Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: Psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] I D C D S of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D.123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v.1.0. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.


Figure 7. Demonstration of how playout adjustments were applied to the 1st male speech sample, Variant A. Arrows indicate where adjustments were placed.

Figure 8. Demonstration of how playout adjustments were applied to the 1st male speech sample, Variant B. Arrows indicate where adjustments were placed.

Figure 9. Demonstration of how playout adjustments were applied to the 2nd male speech sample, Variant A. Arrows indicate where adjustments were placed.

Figure 10. Demonstration of how playout adjustments were applied to the 2nd male speech sample, Variant B. Arrows indicate where adjustments were placed.

P.800 [15]. In every case, up to 2 listeners were seated in a small, acoustically treated listening room with background noise well below 20 dB SPL (A). Altogether, 30 naïve (non-expert) listeners (16 male, 14 female; 20–55 years, mean 34.43 years) participated in the test. All subjects were Irish nationals whose first language was English. The subjects were remunerated for their efforts. The samples (96 degraded samples + 4 reference samples) were played out in a random order using high-quality studio equipment and presented diotically over Sennheiser HD 455 headphones (presentation level 73 dB SPL (A)) to the test subjects. The opinion scores, from 1 (bad) to 5 (excellent), were averaged to obtain MOS-Listening Quality Subjective narrowband (MOS-LQSn) values for each sample.
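The averaging of opinion scores into a MOS value, together with a 95% confidence interval of the kind shown as vertical bars in the result figures, can be sketched as follows. This is a minimal illustration only: the normal approximation (z = 1.96) is an assumption, since the text does not state how the intervals were computed, and the function name is hypothetical.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (MOS, half-width of the confidence interval).

    scores: opinion scores on the 1 (bad) to 5 (excellent) scale.
    z: normal quantile; 1.96 gives an approximate 95% interval (assumed).
    """
    if any(s < 1 or s > 5 for s in scores):
        raise ValueError("opinion scores must lie between 1 and 5")
    mos = statistics.mean(scores)
    # Sample standard deviation; needs at least two scores.
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mos, half_width
```

With 30 listeners per sample, as in this test, each per-sample MOS-LQSn would be computed from 30 such scores.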

In the next step, the 100 samples (essentially the 96 degraded samples (using test conditions No. 1–12) and 4 reference samples (Ref. test condition)) were compared to their respective reference samples using both the PESQ model described in ITU-T Rec. P.862 [22, 23, 27] and the POLQA model described in P.863 [24, 25, 28] in order to get objective listening quality scores. In the case of PESQ, the output (raw PESQ scores) was converted to MOS-Listening Quality Objective narrowband (MOS-LQOn) values by the equation defined in [49].
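For orientation, the conversion from raw PESQ scores to MOS-LQO can be sketched as below. The logistic function and its constants are quoted from memory of ITU-T Rec. P.862.1 [49] and should be checked against the Recommendation before use; the function name is illustrative.

```python
import math


def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw P.862 score (nominally -0.5 to 4.5) to MOS-LQO.

    Logistic mapping as recalled from ITU-T Rec. P.862.1; verify the
    constants against the Recommendation itself.
    """
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))
```

Under this mapping, the maximum raw score of 4.5 corresponds to roughly 4.55 MOS-LQO, which matches the upper end of the PESQ score range reported in Section 4.2.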

4. Experimental results

In this section, we present experimental results for both subjective and objective assessment (PESQ and POLQA models), as well as a detailed analysis and comparison of both.


4.1. Experimental results for subjective assessment

In Figure 11, we summarise the results of the subjective listening test, averaged over the 4 different voices involved in the experiment, as described in Section 3. It can be seen that the impact of test conditions No. 1–12 (playout adjustments) on average MOS-LQSn scores relative to the reference samples is quite limited. On the other hand, we can also see that the subjects gave surprisingly low MOS-LQS scores to all samples, including the reference samples involved in the test, i.e. the average values oscillate around 3.6 MOS. This value is quite low considering that the samples contain either no degradations (reference samples) or very moderate degradations in a narrowband context. This result warrants further analysis beyond the scope of this paper, but one possibility is that the subjects' opinion has been affected by their previous long-term experience with wideband telephony (wideband speech), though this was not validated. Such experience would alter their internal reference to wideband speech (extended frequency range resulting in higher speech quality) and thus explain the lower scores given to the narrowband samples involved in the test. One other possibility relates to the fact that the range of conditions (impairments) introduced into the speech samples was quite limited. This issue is discussed in more detail in [50].

As is evident, the impact of varying the extent of playout adjustments across all test conditions was very small (insignificant). However, one characteristic of note that emerged is the small impact of the location of the adjustments on scores, i.e. most of the test conditions using variant B obtained slightly lower scores than the same conditions with variant A. The biggest difference (0.2 MOS) between both investigated variants (location impact) was obtained for test condition No. 10. As discussed and evident from Figures 3–10, variant B adjustments were designed to be towards the end of the sample, so one explanation is that the distortions presented closer to the end of the sample were a bit more annoying for the subjects than those presented in the first half of the sample. In principle, this result has echoes of the so-called recency effect reported in the literature (e.g. [51, 52, 53]). However, such tests used samples longer than 60 seconds. In any event, and as discussed later, the differences are not significant in the context of the confidence interval.

One three-way analysis of variance (ANOVA) test was conducted on the subjective results using test condition, voice and variant (location) as fixed factors (Table V). It should be noted here that the voice factor is a combination of voice and content. The highest F-ratio was determined for the voice factor (F = 73.12, p < 0.001), i.e. the effect of voice was found to be highly statistically significant. The test condition factor appeared to have a much weaker effect on quality than the voice factor (F = 0.82, p = 0.635), and this effect was not statistically significant. The last factor investigated in the ANOVA test was the variant factor; it also turns out to have a weaker effect on quality than the voice factor and, like the test condition factor, is not statistically significant (F = 1.07, p = 0.3022). Regarding interactions of all the involved factors, the results show that none of them is statistically significant.

Table I. Playout adjustments (in ms) applied to speech samples. Σ: absolute sum of adjustments.

Test condition   1st    2nd    3rd    4th    Σ
Ref              0      0      0      0      0
1                2      −2     3      −3     10
2                4      −4     −4     4      16
3                3      −3     −6     6      18
4                5      −5     −5     5      20
5                3      −6     −7     10     26
6                16     −12    −8     4      40
7                10     −17    −6     13     46
8                10     −15    −10    15     50
9                8      −23    −3     18     52
10               5      10     −30    15     60
11               −15    15     −15    15     60
12               −25    22     −8     11     66

Figure 11. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQSn. The vertical bars show the 95% CI computed over 120 MOS-LQSn values (30 subjective scores per sample for 4 samples). [Plot: x-axis, test conditions Ref and No. 1–12; y-axis, MOS-LQSn from 1 to 5; series, Variant A and Variant B.]
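The Σ column of Table I (the absolute sum of the four adjustments) can be reproduced with a short sketch. The adjustment values below are those listed in Table I; the dictionary and function names are illustrative.

```python
# Playout adjustments in ms per test condition, as listed in Table I.
ADJUSTMENTS_MS = {
    "Ref": (0, 0, 0, 0),
    "1": (2, -2, 3, -3),
    "2": (4, -4, -4, 4),
    "3": (3, -3, -6, 6),
    "4": (5, -5, -5, 5),
    "5": (3, -6, -7, 10),
    "6": (16, -12, -8, 4),
    "7": (10, -17, -6, 13),
    "8": (10, -15, -10, 15),
    "9": (8, -23, -3, 18),
    "10": (5, 10, -30, 15),
    "11": (-15, 15, -15, 15),
    "12": (-25, 22, -8, 11),
}


def absolute_sum(adjustments):
    """Sum of the magnitudes of the individual adjustments (the table's sigma)."""
    return sum(abs(a) for a in adjustments)


def largest_adjustment(adjustments):
    """Magnitude of the largest single adjustment in a condition."""
    return max(abs(a) for a in adjustments)
```

Conditions No. 10 and No. 11 illustrate why the absolute sum alone does not characterise a condition: both sum to 60 ms, yet the largest single adjustment is 30 ms in No. 10 and 15 ms in No. 11.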

To summarise, the results of the ANOVA test revealed that subjects were more sensitive to the voice than to all the test conditions and variants, and no statistically significant interactions between the investigated factors were found (assuming no impact of content due to carefully chosen speech samples). It should be noted here that variability caused by content is considered one of the sampling factors, as defined in the Handbook of practical procedures for subjective testing [16]. The ANOVA test also revealed that the small differences between the quality scores of variants A and B reported previously are not statistically significant. In other words, the results of the ANOVA test confirmed our assertion that the differences are not significant in the context of the confidence interval.

As is often reported in the literature, some impact relating to the voice effect was expected in this experiment, but not to such an extent. A diagnostic analysis of the test data revealed that one of the voices (1st male) was liked more than the others (i.e. over all conditions, this voice was rated on average approx. 0.4 MOS-LQSn higher than the second male voice and approx. 0.6 MOS-LQSn higher than the female voices). It is also worth noting that both male voices obtained higher scores than the female voices.

As can be seen above, the impact of playout adjustments is not statistically significant. This fact raises the question of whether the impact would be higher if the Degradation Category Rating (DCR) testing approach were deployed. We considered this during the research design phase. We carried out limited DCR subjective testing (4 experts involved) and the subjects noted no degradation caused by time-shifting (playout adjustments). On this basis, we came to the conclusion that the introduced impairments would not be noticed by subjects in a DCR test. For that reason, we decided to use an ACR test.

Furthermore, it should be noted that in telephony, subjects have no access to the original speech from their conversational partner, and thus ACR testing is commonly used in telephony speech quality assessment.

To the best of our knowledge, the results presented in this section are a first proof of the assertion published in the literature [1, 2, 10] that small and frequent silence period adjustments typical of VoIP playout algorithms do not have a noticeable effect on listening quality perceived by the end user.

It is interesting at this juncture to compare our results with the subjective results of Voran [11] and Hoene et al. [6]. Voran introduced pause/jump impairments at the rate of 1 to 4 per 3 second sentence, with pause/jump magnitudes of 30, 60 and 120 ms. We introduced 4 impairments in an 8 second sample, based on our observations of actual speech and using adjustments which were derived from realistic network delay models and real playout algorithms. As a result, our adjustments were typically much smaller (the largest was 30 ms). Voran noted that for 1 pause/jump adjustment of 30 ms, MOS scores fell by 0.2, whereas we found no significant drop in MOS scores as the extent of adjustments increased, even for conditions 10–12, where the magnitude of some of the adjustments was similar to Voran's at 20–30 ms. One key distinction, as stated before, is that Voran applied such impairments to active speech, whereas all our adjustments were made to silence periods. Furthermore, in the case of Voran's pause impairment, PLC was used. Finally, Voran reported a significant impact of a very large single adjustment (silence removal) of 430 ms. The magnitude of adjustments introduced in our samples was much smaller and, as reported above, no significant subjective impact was found. Hoene et al. in [6] introduced very large (in comparison to adjustments introduced by jitter buffers) single adjustments into active speech and also noted a significant impact, consistent with PESQ.

4.2. Experimental results for objective assessment

Figure 12 depicts the results of the objective assessment done by PESQ (MOS-LQOn (PESQ)) using the same test conditions and speech samples. We can observe that the severity of the test conditions (playout adjustments) has a relatively big impact on the predicted MOS values. In summary, the MOS scores decrease as adjustments increase, i.e. as network instability increases. This is interesting in the context of the findings by Voran [11] that PESQ does not register any impact arising from a single 430 ms silence period adjustment (silence removal). Hoene et al.'s results from PESQ analysis, published in [5], are somewhat similar to Voran's, showing no impact for single adjustments of up to approx. 320 ms. One possible explanation is that in our tests we introduced frequent and small adjustments rather than the single and large adjustments introduced by Voran and Hoene et al. We speculate that PESQ had difficulties with our adjustment profile during the time-alignment process. As detailed in Section 2.2, the frequency and extent of adjustments is impacted greatly by VAD settings, but our test design of 4 adjustments in 8 seconds is not atypical of VoIP.

One positive correlation between PESQ and the subjective test results is that the biggest difference between variants A and B (0.47 MOS) was reported for test condition No. 10. It should also be noted that there is a significant difference between the scores obtained for conditions No. 10 and 11. This is interesting in that, whilst the absolute sum of the adjustments for both conditions is the same, the individual adjustments are different. Test condition No. 10 represents very different adjustments, varying from 5 to −30 ms. On the other hand, condition No. 11 contains adjustments of similar magnitude: two adjustments of −15 ms and two adjustments of 15 ms. The second one (condition No. 11) obtained much lower scores in both cases (variants A and B). As such, it seems that PESQ is better able to cope with large adjustments (30 ms was the largest of all conditions). In contrast, the opposite results were obtained for the subjective data, i.e. listeners scored test condition No. 10 lower than No. 11. However, the differences between the subjective results obtained for test conditions No. 10 and 11 are much smaller than those obtained for the objective results and of the same order as the confidence interval (see Figure 11).

Figure 12. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by PESQ. The vertical bars show the 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: x-axis, test conditions Ref and No. 1–12; y-axis, MOS-LQOn (PESQ) from 1 to 5; series, Variant A and Variant B.]

As with the subjective results, lower MOS-LQOn (PESQ) scores have been reported for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by PESQ ranged from 3.665 to 4.55 MOS for variant A and from 3.3 to 4.55 MOS for variant B. It can be clearly seen in Figure 12 that the impact of adjustment location is noticeable for objective scores predicted by PESQ, especially for higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by PESQ revealed that the voice impact was much weaker than that reported above for the subjective data. In fact, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. For that reason, such models are sometimes called impairment or degradation models. However, it seems that there is some interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, because the deviation of the MOS-LQOn (PESQ) scores between the test conditions is different for all 4 voices involved in the test. A much weaker and not statistically significant interaction of test condition and voice has been obtained for the subjective data, as shown in Table V.

Figure 13 shows the results of the objective assessment done by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of the POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of the PESQ predictions. Nonetheless, it seems that POLQA was impacted more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No. 7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend in the POLQA scores as has been done for PESQ above.

Regarding the biggest difference between variants A and B for test condition No. 10, reported above for both subjective scores and objective scores predicted by PESQ, it is worth noting that this effect has not been captured by POLQA at all. In other words, the scores predicted by POLQA for variants A and B of test condition No. 10 were very similar (0.02 MOS difference).

Moreover the scores predicted by POLQA largely doexhibit same behaviour as scores obtained from subjectivetest from a location perspective (Variant A and B) In otherwords it has been reported above that PESQ and sub-jects involved in the subjective test provided lower MOSscores for most of the test conditions of variant B than forthe same test conditions of variant A The average qual-ity scores (averaged over voices) predicted by POLQAranged from 410 to 443 MOS for variant A and from 408to 443 MOS for variant B This contrasts with the resultsfor PESQ where adjustment location had a significant im-pact It can be clearly seen in Figure 13 that the adjustmentlocation plays a less important role here except for some

[Figure 13: MOS-LQOn (POLQA), on a scale of 1 to 5, plotted against test conditions (Ref, No. 1–12) for Variant A and Variant B.]

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show the 95% CI computed over 4 MOS-LQOn values (4 samples).

of the higher magnitudes of investigated playout adjustments.

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact was much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of a voice. Regarding the interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, this effect was also obtained for the POLQA model, but only for female voices.

4.3. Comparison between subjective and objective quality scores

In the following subsection, subjective MOS values (MOS-LQSn) are compared to the predictions provided by both PESQ and POLQA (MOS-LQOn (PESQ, POLQA)). The comparison is performed for all experimental conditions, i.e. all combinations of voice, test conditions and both investigated location variants. However, the MOS-LQSn values will have been influenced by the choice of conditions in the actual experiment. In order to account for such influences, model predictions are commonly transformed to the range of conditions that are part of the respective test [54]. This may be done, for example, by using a monotonic 3rd-order mapping function, presuming such a function can be found.

The performance of the PESQ and POLQA models is quantified in terms of the Pearson correlation coefficient R, the respective root mean square error (rmse) and the epsilon-insensitive root mean square error (rmse*), as [55, 56]

$$R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \tag{3}$$


and

$$\mathrm{rmse} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} (X_i - Y_i)^2} \tag{4}$$

with $X_i$ the subjective MOS value for stimulus $i$, $Y_i$ the objective (predicted) MOS value for stimulus $i$, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, $N$ the number of stimuli considered in the comparison, and $d$ the number of degrees of freedom provided by the mapping function ($d = 4$ in the case of a 3rd-order mapping function, $d = 1$ in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

$$\mathrm{Perror}_i = \max\left(0,\ |X_i - Y_i| - \mathrm{ci95}_i\right) \tag{5}$$

where $\mathrm{ci95}_i$ represents the 95% confidence interval and is defined by [56]

$$\mathrm{ci95}_i = t(0.05, M)\,\frac{\delta_i}{\sqrt{M}} \tag{6}$$

where $M$ denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of the subjective scores for stimulus $i$. The final epsilon-insensitive root mean square error is calculated as usual, but based on Perror from formula (5):

$$\mathrm{rmse}^{*} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} \mathrm{Perror}_i^{2}} \tag{7}$$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* considers only differences related to an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of the MOS is taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and both rmse and rmse* would be 0.0.
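The three measures defined in equations (3) to (7) can be illustrated with a short Python sketch. This is not the authors' evaluation code: the function names are hypothetical, the sample scores are invented for illustration, and the Student t quantile t(0.05, M) is passed in as a plain number (`t_value`) rather than computed, since the source does not state how it was obtained.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient R, equation (3)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    return num / den


def rmse(x, y, d=1):
    """Root mean square error, equation (4); d = 1 means no regression."""
    n = len(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / (n - d))


def ci95(std, m, t_value):
    """95% confidence interval, equation (6); t_value stands in for t(0.05, M)."""
    return t_value * std / math.sqrt(m)


def rmse_star(x, y, ci, d=1):
    """Epsilon-insensitive rmse, equations (5) and (7)."""
    n = len(x)
    perror = [max(0.0, abs(a - b) - c) for a, b, c in zip(x, y, ci)]
    return math.sqrt(sum(p * p for p in perror) / (n - d))


# Invented example values (not data from the paper).
subjective = [3.5, 3.6, 3.7, 3.8]   # X_i, e.g. MOS-LQSn per stimulus
objective = [4.1, 4.2, 4.3, 4.4]    # Y_i, e.g. MOS-LQOn per stimulus
print(pearson_r(subjective, objective))
print(rmse(subjective, objective))
print(rmse_star(subjective, objective, ci=[0.3] * 4))
```

With a confidence interval of zero, `rmse_star` reduces to `rmse`; a wide confidence interval drives it to zero, which is how the uncertainty of the subjective MOS enters the measure.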

All R, rmse and rmse* values are calculated for the raw (non-regressed) MOS-LQOn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined). Both (the regressed and the non-regressed MOS-LQOn predictions) are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*)

[Figures 14 to 16: scatter plots of MOS-LQOn against MOS-LQSn, both on a scale of 1 to 5, for Variant A and Variant B.]

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed) per sample.

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed) per sample.

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed) per sample.

are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (variants A and B). The correlation coefficient is positive, though very low (R = 0.17), for variant A and negative for variant B (R = −0.15). As normal, positive correlation


Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant   R       rmse    rmse*
A          0.17   0.639   0.367
B         −0.15   0.690   0.424

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant   R       rmse    rmse*
A          0.30   0.783   0.448
B          0.47   0.806   0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant   R       rmse    rmse*
A          0.30   0.265   0.243
B          0.47   0.270   0.217

indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases, the other decreases and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

The 3rd-order regression, as recommended in [54], leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outlier influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried the 2nd- and 1st-order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this data set.

Figure 17 presents the results broken down by speaker

and variant. The subjective scores confirm again that there was little intra-speaker difference between location variants, significant inter-speaker variability and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12 and significant inter-speaker variation.

Regarding the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.

[Figure 17: three panels, "Subjective results" (MOS-LQS), "PESQ" (MOS-LQOn) and "POLQA" (MOS-LQOn), each on a scale of 3 to 5, plotted against Speaker/Variant (M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B).]

Figure 17. Dominant experimental factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals.
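The mapping-function check described above (fit a polynomial from objective to subjective scores, then verify that it increases monotonically over the MOS range) can be sketched in Python as follows. This is an illustrative sketch only: the helper names are hypothetical, the data points are invented, and only the 1st-order case is fitted here, using ordinary least squares.

```python
def fit_first_order(objective, subjective):
    """Least-squares line subjective ~ slope * objective + intercept."""
    n = len(objective)
    mean_o = sum(objective) / n
    mean_s = sum(subjective) / n
    sxx = sum((o - mean_o) ** 2 for o in objective)
    sxy = sum((o - mean_o) * (s - mean_s) for o, s in zip(objective, subjective))
    slope = sxy / sxx
    return slope, mean_s - slope * mean_o


def is_monotonically_increasing(coeffs, lo=1.0, hi=5.0, steps=400):
    """Check a polynomial (coefficients in ascending power order) on a grid
    spanning the MOS range [lo, hi]."""
    def poly(v):
        return sum(c * v ** k for k, c in enumerate(coeffs))

    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    values = [poly(v) for v in grid]
    return all(b > a for a, b in zip(values, values[1:]))


# Invented data in which subjective scores fall as predictions rise,
# mimicking the kind of outcome reported for PESQ (not the paper's data).
objective = [3.0, 3.5, 4.0, 4.5]
subjective = [3.8, 3.7, 3.6, 3.5]

slope, intercept = fit_first_order(objective, subjective)
usable = is_monotonically_increasing([intercept, slope])
print(slope, intercept, usable)
```

A negative slope yields a monotonic but decreasing mapping, so the check fails and the fitted function would not be used as a regression mapping.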

Moving to POLQA, the results show little intra-speaker variation across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except for Female 1) and much less inter-speaker variation. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in both Tables III and IV, and the rmse data for POLQA after regression (1st-order polynomial regression applied) is also much better than for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.


Table V. Summary of the ANOVA test conducted on the MOS-LQSn's.

Effect                SS        df     MS        F       p
Test condition (TC)      9.23     12    0.7693    0.82   0.6350
Voice                  207.02      3   69.0073   73.12   0.0000
Variant                  1.01      1    1.0051    1.07   0.3022
TC × Voice              18.21     36    0.5059    0.54   0.9895
TC × Variant             4.68     12    0.3899    0.41   0.9593
Voice × Variant          0.68      3    0.2282    0.24   0.8672
Error                 2880.37   3052    0.9438
Total                 3121.2    3119
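As a worked check of Table V, each mean square is the sum of squares divided by its degrees of freedom, and each F-ratio is the effect's mean square divided by the error mean square. The Python sketch below recomputes these from the SS and df columns. The values are transcribed from Table V on the assumption that the decimal points lost in extraction are placed as shown; p-values are not recomputed, since that would need the F distribution.

```python
# (SS, df) pairs transcribed from Table V; decimal placement is an assumption.
TABLE_V = {
    "Test condition (TC)": (9.23, 12),
    "Voice": (207.02, 3),
    "Variant": (1.01, 1),
    "TC x Voice": (18.21, 36),
    "TC x Variant": (4.68, 12),
    "Voice x Variant": (0.68, 3),
}
ERROR_SS, ERROR_DF = 2880.37, 3052


def mean_square(ss, df):
    """MS = SS / df."""
    return ss / df


def f_ratio(ss, df, error_ss=ERROR_SS, error_df=ERROR_DF):
    """F = MS_effect / MS_error."""
    return mean_square(ss, df) / mean_square(error_ss, error_df)


F_RATIOS = {name: f_ratio(ss, df) for name, (ss, df) in TABLE_V.items()}

for name, f in F_RATIOS.items():
    print(f"{name:22s} F = {f:.2f}")
```

Because the SS column is rounded to two decimals, the recomputed F-ratios agree with the tabulated ones only to within rounding; the voice factor remains by far the largest.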

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12), and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may also have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and on listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ, POLQA)). Moreover, the accuracy of both the PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much smaller. Note that the research of both Voran and Hoene et al. showed that single adjustments (430 ms in the case of Voran, approximately 0–320 ms in the case of Hoene et al.) were found to be disregarded by PESQ.

Regarding question 3 and PESQ, a comparison of the subjective assessments and the predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although the research of Hoene et al. [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either during its design and integration phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and the impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that whilst some relationship exists, it is both much less noticeable and much less characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to subjective scores (insignificant), except for female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research and which are worthy of further investigation include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or the low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though not to a statistically significant degree. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results, but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: Psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] I D C D S of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v.1.0. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.


4.1. Experimental results for subjective assessment

In Figure 11, we summarise the results of the subjective listening test, averaged over the 4 different voices involved in the experiment, as described in Section 3. It can be seen that the impact of test conditions No. 1–12 (playout adjustments) on average MOS-LQSn scores relative to the reference samples is quite limited. On the other hand, we can also see that the subjects gave surprisingly low MOS-LQS scores to all samples, including the reference samples involved in the test, i.e. the average values oscillate around 3.6 MOS. This value is quite low considering that the samples contain either no degradations (reference samples) or very moderate degradations in a narrowband context. This result warrants further analysis beyond the scope of this paper, but one possibility is that the subjects' opinion has been affected by their previous long-term experience with wideband telephony (wideband speech), though this was not validated. Such experience would alter their internal reference to wideband speech (extended frequency range resulting in higher speech quality) and thus explain the lower scores given to the narrowband samples involved in the test. One other possibility relates to the fact that the range of conditions (impairments) introduced into the speech samples was quite limited. This issue is discussed in more detail in [50].

As is evident, the impact of varying the extent of playout adjustments across all test conditions was very small (insignificant). However, one characteristic of note that emerged is the small impact of the location of the adjustments on scores, i.e. most of the test conditions using variant B obtained slightly lower scores than the same conditions with variant A. The biggest difference (0.2 MOS) between both investigated variants (location impact) was observed for test condition No. 10. As discussed and evident from Figures 3–10, variant B adjustments were designed to be towards the end of the sample, so one explanation is that the distortions presented closer to the end of the sample were a bit more annoying for the subjects than those presented in the first half of the sample. In principle, this result has echoes of the so-called recency effect reported in the literature (e.g. [51, 52, 53]). However, such tests used samples longer than 60 seconds. In any event, and as discussed later, the differences are not significant in the context of the confidence interval.

A three-way analysis of variance (ANOVA) test was

conducted on the subjective results using test condition, voice and variant (location) as fixed factors (Table V). It should be noted here that the voice factor is a combination of voice and content. The highest F-ratio was determined for the voice factor (F = 73.12, p < 0.001). The effect of voice was found to be highly statistically significant. Moreover, the test condition factor appeared to have a weaker effect on quality than the voice factor, with F = 0.82, p = 0.635. Furthermore, the effect of test condition was not statistically significant, whereas the voice factor was. The last factor investigated in the ANOVA test was the variant factor; it turns out to have a weaker effect on quality than the voice factor on its own and to not be statistically significant, similar to the test condition factor (F = 1.07, p = 0.3032). Regarding interactions of all the involved factors, the results show that none of them is statistically significant.

Table I. Playout adjustments (in ms) applied to speech samples. Σ: absolute sum of adjustments.

Test condition   1st   2nd   3rd   4th    Σ
Ref                0     0     0     0    0
1                  2    −2     3    −3   10
2                  4    −4    −4     4   16
3                  3    −3    −6     6   18
4                  5    −5    −5     5   20
5                  3    −6    −7    10   26
6                 16   −12    −8     4   40
7                 10   −17    −6    13   46
8                 10   −15   −10    15   50
9                  8   −23    −3    18   52
10                 5    10   −30    15   60
11               −15    15   −15    15   60
12               −25    22    −8    11   66

[Figure 11: MOS-LQSn, on a scale of 1 to 5, plotted against test conditions (Ref, No. 1–12) for Variant A and Variant B.]

Figure 11. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQSn. The vertical bars show the 95% CI computed over 120 MOS-LQSn values (30 subjective scores per sample for 4 samples).
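The Σ column of Table I is the sum of the absolute values of the four adjustments in each test condition. A brief Python sketch of that arithmetic follows; the adjustment values are transcribed from Table I, while the dictionary layout and function names are illustrative assumptions rather than anything from the original study.

```python
# Adjustments in ms (1st to 4th) per test condition, transcribed from Table I.
ADJUSTMENTS_MS = {
    "Ref": (0, 0, 0, 0),
    "1": (2, -2, 3, -3),
    "2": (4, -4, -4, 4),
    "3": (3, -3, -6, 6),
    "4": (5, -5, -5, 5),
    "5": (3, -6, -7, 10),
    "6": (16, -12, -8, 4),
    "7": (10, -17, -6, 13),
    "8": (10, -15, -10, 15),
    "9": (8, -23, -3, 18),
    "10": (5, 10, -30, 15),
    "11": (-15, 15, -15, 15),
    "12": (-25, 22, -8, 11),
}


def absolute_sum(adjustments):
    """Sigma in Table I: sum of the magnitudes of all adjustments."""
    return sum(abs(a) for a in adjustments)


def net_shift(adjustments):
    """Signed sum: the overall time shift left at the end of the sample."""
    return sum(adjustments)


SIGMA = {cond: absolute_sum(adj) for cond, adj in ADJUSTMENTS_MS.items()}
ABOVE_45_MS = [cond for cond, total in SIGMA.items() if total > 45]
print(SIGMA)
print(ABOVE_45_MS)
```

In the values as transcribed, the signed sum is zero for every condition, so each sample ends with no net time shift even though the absolute sum grows from 10 ms to 66 ms.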

To summarise, the results of the ANOVA test revealed that subjects were more sensitive to the voice than to all the test conditions and variants, and no statistically significant interactions between all the investigated factors were found (assuming no impact of content due to carefully chosen speech samples). It should be noted here that variability caused by content is considered one of the sampling factors, as defined in the Handbook of subjective testing practical procedures [16]. The ANOVA test also revealed that the small differences between quality scores of variants A and B reported previously are not statistically significant. In other words, the results of the ANOVA test proved our assertion that the differences are not significant in the context of the confidence interval.

As is often reported in the literature, some impact relating to the voice effect was expected in this experiment, but not to such an extent. A diagnostic analysis of the test data revealed that one of the voices (1st male) was liked


more than the others (ie over all conditions this voicewas rated on average by approx 04 MOS-LQSn higherthan second male voice and by approx 06 MOS-LQSnhigher than the female voices) It is also worth noting thatboth male voices have obtained higher scores than the fe-male voices

As can be seen above, the impact of playout adjustments is not statistically significant. This raises the question of whether the impact would be higher if the Degradation Category Rating (DCR) testing approach were deployed. We considered this during the research design phase. We carried out limited DCR subjective testing (4 experts involved), and the subjects noted no degradation caused by time-shifting (playout adjustments). On this basis, we concluded that the introduced impairments would not be noticed by subjects in a DCR test. For that reason, we decided to use an ACR test.

Furthermore, it should be noted that in telephony, subjects have no access to the original speech from their conversational partner, and thus ACR testing is commonly used in telephony speech quality assessment.

To the best of our knowledge, the results presented in this section are a first proof of the assertion published in the literature [1, 2, 10] that small and frequent silence period adjustments typical of VoIP playout algorithms do not have a noticeable effect on listening quality perceived by the end user.

It is interesting at this juncture to compare our results with the subjective results of Voran [11] and Hoene et al. [6]. Voran introduced pause/jump impairments at the rate of 1 to 4 per 3-second sentence, with pause/jump magnitudes of 30, 60 and 120 ms. We introduced 4 impairments in an 8-second sample, based on our observations of actual speech and using adjustments which were derived from realistic network delay models and real playout algorithms. As a result, our adjustments were typically much smaller (the largest was 30 ms). Voran noted that for 1 pause/jump adjustment of 30 ms, MOS scores fell by 0.2, whereas we found no significant drop in MOS scores as the extent of adjustments increased, even for conditions 10–12, where the magnitude of some of the adjustments was similar to Voran's at 20–30 ms. One key distinction, as stated before, is that Voran applied such impairments to active speech, whereas all our adjustments were made to silence periods. Furthermore, in the case of Voran's pause impairment, PLC was used. Finally, Voran reported a significant impact of a very large single adjustment (silence removal) of 430 ms. The magnitude of adjustments introduced in our samples was much smaller and, as reported above, no significant subjective impact was found. Hoene et al. in [6] introduced very large (in comparison to adjustments introduced by jitter buffers) single adjustments into active speech and also noted a significant impact, consistent with PESQ.

4.2. Experimental results for objective assessment

Figure 12 depicts the results of objective assessment done by PESQ (MOS-LQOn (PESQ)) using the same test conditions and speech samples. We can observe that the severity of the test conditions (playout adjustments) has a relatively big impact on the predicted MOS values. In summary, the MOS scores decrease as adjustments increase, i.e. as network instability increases. This is interesting in the context of the findings by Voran [11] that PESQ does not register any impact arising from a single 430 ms silence period adjustment (silence removal). Hoene et al.'s results from PESQ analysis published in [5] are somewhat similar to Voran's, showing no impact for single adjustments of up to approx. 320 ms. One possible explanation is that in our tests we introduced frequent and small adjustments rather than the single and large adjustments introduced by Voran/Hoene et al. We speculate that PESQ had difficulties with our adjustment profile during the time-alignment process. As detailed in Section 2.2, the frequency and extent of adjustments are impacted greatly by VAD settings, but our test design of 4 adjustments in 8 seconds is not atypical of VoIP.

Figure 12. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by PESQ. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: MOS-LQOn (PESQ), scale 1 to 5, versus test conditions Ref and No. 1 to No. 12, for Variant A and Variant B.]

One positive correlation between PESQ and the subjective test results is that the biggest difference between variant A and B (0.47 MOS) was reported for test condition No. 10. It should also be noted that there is a significant difference between the scores obtained for conditions No. 10 and 11. This is interesting in that, whilst the absolute sum of the adjustments for both conditions is the same, the individual adjustments are different. Test condition No. 10 represents very different adjustments, varying from 5 to −30 ms. On the other hand, condition No. 11 contains adjustments of similar magnitude: two adjustments of −15 ms and two adjustments of 15 ms. The second one (condition No. 11) obtained much lower scores in both cases (variant A and B). As such, it seems that PESQ is better able to cope with large adjustments (30 ms was the largest of all conditions). In contrast, the opposite results were obtained for the subjective data, i.e. listeners scored test condition No. 10 lower than No. 11. However, the differences between the subjective results obtained for test conditions No. 10 and 11 are much smaller than those obtained for the objective results and of the same order as the confidence interval (see Figure 11).

As with the subjective results, lower MOS-LQOn (PESQ) scores have been reported for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by PESQ ranged from 3.665 to 4.55 MOS for variant A and from 3.3 to 4.55 MOS for variant B. It can be clearly seen in Figure 12 that the impact of adjustment location is noticeable for objective scores predicted by PESQ, especially for higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by PESQ revealed that the voice impact has been much weaker than that reported above for the subjective data. In fact, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. Because of that, such models are sometimes called impairment or degradation models. However, it seems that there is some interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, because the deviation of the MOS-LQOn (PESQ) scores between the test conditions is different for all 4 voices involved in the test. A much weaker and not statistically significant interaction of test condition and voice has been obtained for the subjective data, as shown in Table V.

Figure 13 shows the results of objective assessment done by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of the POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of the PESQ predictions. Nonetheless, it seems that POLQA was impacted more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No. 7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend in the POLQA scores, as has been done for PESQ above.

Regarding the biggest difference between variant A and B for test condition No. 10, reported above for both subjective scores and objective scores predicted by PESQ, it is worth noting that this effect has not been captured by POLQA at all. In other words, the scores predicted by POLQA for variant A and B of test condition No. 10 were very similar (0.02 MOS difference).

Moreover, the scores predicted by POLQA largely exhibit the same behaviour as the scores obtained from the subjective test from a location perspective (variant A and B). In other words, it has been reported above that PESQ and the subjects involved in the subjective test provided lower MOS scores for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by POLQA ranged from 4.10 to 4.43 MOS for variant A and from 4.08 to 4.43 MOS for variant B. This contrasts with the results for PESQ, where adjustment location had a significant impact. It can be clearly seen in Figure 13 that the adjustment location plays a less important role here, except for some of the higher magnitudes of the investigated playout adjustments.

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: MOS-LQOn (POLQA), scale 1 to 5, versus test conditions Ref and No. 1 to No. 12, for Variant A and Variant B.]

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact has been much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. Regarding the interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, this effect has also been obtained for the POLQA model, but only for female voices.

4.3. Comparison between subjective and objective quality scores

In this subsection, subjective MOS values (MOS-LQSn) are compared to the predictions provided by both PESQ and POLQA (MOS-LQOn (PESQ, POLQA)). The comparison is performed for all experimental conditions, i.e. all combinations of voice, test conditions and both investigated location variants. However, the MOS-LQSn values will have been influenced by the choice of conditions in the actual experiment. In order to account for such influences, model predictions are commonly transformed to the range of conditions that are part of the respective test [54]. This may be done, for example, by using a monotonic 3rd-order mapping function, presuming such a function can be found.

The performance of the PESQ and POLQA models is quantified in terms of the Pearson correlation coefficient R, the respective root mean square error (rmse) and the epsilon-insensitive root mean square error (rmse*), as [55, 56]:

$$R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \tag{3}$$

and

$$\mathrm{rmse} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} (X_i - Y_i)^2} \tag{4}$$

with $X_i$ the subjective MOS value for stimulus $i$, $Y_i$ the objective (predicted) MOS value for stimulus $i$, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, $N$ the number of stimuli considered in the comparison, and $d$ the number of degrees of freedom provided by the mapping function ($d = 4$ in the case of a 3rd-order mapping function, $d = 1$ in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

$$\mathrm{Perror}_i = \max\left(0, \left|X_i - Y_i\right| - \mathrm{ci95}_i\right) \tag{5}$$

where $\mathrm{ci95}_i$ represents the 95% confidence interval and is defined by [56]

$$\mathrm{ci95}_i = t(0.05, M)\,\frac{\delta_i}{\sqrt{M}} \tag{6}$$

where $M$ denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of subjective scores for stimulus $i$. The final epsilon-insensitive root mean square error is calculated as usual, but based on $\mathrm{Perror}_i$ from formula (5):

$$\mathrm{rmse}^{*} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} \mathrm{Perror}_i^{2}} \tag{7}$$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* considers only differences related to an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of the MOS is taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and both rmse and rmse* would be 0.0.
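To make the three figures of merit concrete, the sketch below implements formulas (3) to (7) in plain Python. It is an illustration under stated assumptions rather than the evaluation code used in the study: the function names are hypothetical, the critical t value is passed in by the caller instead of being looked up, and the numbers in the usage example are invented purely to exercise the functions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, formula (3)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den


def rmse(x, y, d=1):
    """Root mean square error, formula (4); d=1 means no regression."""
    n = len(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / (n - d))


def ci95(std_dev, m, t_crit):
    """95% confidence interval, formula (6).

    t_crit stands for t(0.05, M) and must be supplied by the caller
    (an assumption of this sketch; no t-table lookup is done here).
    """
    return t_crit * std_dev / math.sqrt(m)


def rmse_star(x, y, ci, d=1):
    """Epsilon-insensitive rmse, formulas (5) and (7).

    ci holds one ci95 value per stimulus.
    """
    n = len(x)
    perror = [max(0.0, abs(a - b) - c) for a, b, c in zip(x, y, ci)]
    return math.sqrt(sum(p * p for p in perror) / (n - d))


if __name__ == "__main__":
    # Invented example values, not data from the study.
    subjective = [3.4, 3.6, 3.1, 3.8]
    objective = [4.2, 4.4, 4.0, 4.3]
    intervals = [0.2, 0.2, 0.2, 0.2]
    print(pearson_r(subjective, objective))
    print(rmse(subjective, objective))
    print(rmse_star(subjective, objective, intervals))
```

Because rmse* subtracts the confidence interval before squaring, it can never exceed rmse for the same data and the same d.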

R, rmse and rmse* are all calculated for the raw (non-regressed) MOSn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined), and both (the regressed and the non-regressed MOSn predictions) are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*) are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (variant A and B). The correlation coefficient is positive though very low (R = 0.17) for variant A and negative for variant B (R = −0.15). As usual, a positive correlation indicates that both variables increase or decrease together, whereas a negative correlation indicates that as one variable increases, the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed) per sample. [Scatter plot: MOS-LQOn (PESQ) versus MOS-LQSn, both on a scale of 1 to 5, for Variant A and Variant B.]

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed) per sample. [Scatter plot: MOS-LQOn (POLQA) versus MOS-LQSn, both on a scale of 1 to 5, for Variant A and Variant B.]

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed) per sample. [Scatter plot: MOS-LQOn (POLQA) versus MOS-LQSn, both on a scale of 1 to 5, for Variant A and Variant B.]

Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant       R      rmse    rmse*
A           0.17    0.639    0.367
B          −0.15    0.690    0.424

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant       R      rmse    rmse*
A           0.30    0.783    0.448
B           0.47    0.806    0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant       R      rmse    rmse*
A           0.30    0.265    0.243
B           0.47    0.270    0.217

The 3rd-order regression, as recommended in [54], leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outlier influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried 2nd- and 1st-order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this dataset.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little difference between location variants intra-speaker, significant inter-speaker variability, and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12, and significant inter-speaker variation.

Figure 17. Dominant experimental factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals. [Three panels: Subjective results (MOS-LQS), PESQ (MOS-LQOn) and POLQA (MOS-LQOn), each on a scale of 3 to 5, versus Speaker/Variant M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B.]

Regarding the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this dataset, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.
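The search for a monotonically increasing mapping function described above can be sketched as a two-step check: fit a polynomial, then test the sign of its derivative over the MOS range. The Python sketch below covers the 1st-order fit in closed form and a generic derivative test for any polynomial. The function names, the 1 to 5 evaluation range and the sample numbers are assumptions made for illustration; this is not the regression procedure or the data of the study.

```python
def fit_first_order(objective, subjective):
    """Least-squares line subjective ~ a * objective + b; returns (a, b)."""
    n = len(objective)
    mean_x = sum(objective) / n
    mean_y = sum(subjective) / n
    sxx = sum((x - mean_x) ** 2 for x in objective)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(objective, subjective))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def is_monotonically_increasing(coeffs, lo=1.0, hi=5.0, steps=400):
    """Check that a polynomial never decreases on a grid over [lo, hi].

    coeffs are ordered from the highest power down to the constant term.
    The test samples the first derivative, so it is a numerical check on
    a grid and not a formal proof.
    """
    order = len(coeffs) - 1
    # Coefficients of the first derivative, highest power first.
    deriv = [c * (order - k) for k, c in enumerate(coeffs[:-1])]
    for s in range(steps + 1):
        x = lo + (hi - lo) * s / steps
        slope = 0.0
        for c in deriv:
            slope = slope * x + c  # Horner evaluation of the derivative
        if slope < 0:
            return False
    return True


if __name__ == "__main__":
    # Invented scores in which the objective value falls as the
    # subjective value rises, so the fitted line slopes downwards.
    objective = [4.5, 4.2, 3.9, 3.6]
    subjective = [3.2, 3.4, 3.5, 3.7]
    a, b = fit_first_order(objective, subjective)
    print(a, b, is_monotonically_increasing([a, b]))
```

A fitted line with a negative slope fails the check, which corresponds to the situation reported for the 1st-order regression of the PESQ predictions.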

Moving to POLQA, the results show little intra-speaker variation across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except for Female 1), and much less inter-speaker variation. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA are significantly better, as shown in Tables III and IV, and the rmse data for POLQA after regression (1st-order polynomial regression applied) are also much better than for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.


Table V. Summary of the ANOVA test conducted on the MOS-LQSn's.

Effect                     SS       df        MS        F         p
Test condition (TC)       9.23      12     0.7693     0.82    0.6350
Voice                   207.02       3    69.0073    73.12    0.0000
Variant                   1.01       1     1.0051     1.07    0.3022
TC × Voice               18.21      36     0.5059     0.54    0.9895
TC × Variant              4.68      12     0.3899     0.41    0.9593
Voice × Variant           0.68       3     0.2282     0.24    0.8672
Error                  2880.37    3052     0.9438
Total                  3121.2     3119
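The internal consistency of Table V can be checked with the usual ANOVA identities, MS = SS/df and F = MS/MS(error). The short Python sketch below recomputes them from the SS and df columns; the values are transcribed from Table V as we read it, and the names are illustrative assumptions. Small deviations from the printed MS and F values are expected because the SS column is rounded to two decimals.

```python
# Sum of squares and degrees of freedom per effect, transcribed from Table V.
# ANOVA_ROWS, mean_square and f_ratio are hypothetical names for this sketch.
ANOVA_ROWS = {
    "Test condition (TC)": (9.23, 12),
    "Voice": (207.02, 3),
    "Variant": (1.01, 1),
    "TC x Voice": (18.21, 36),
    "TC x Variant": (4.68, 12),
    "Voice x Variant": (0.68, 3),
}
ERROR_SS, ERROR_DF = 2880.37, 3052


def mean_square(ss, df):
    """Mean square: sum of squares divided by its degrees of freedom."""
    return ss / df


def f_ratio(ss, df, error_ss=ERROR_SS, error_df=ERROR_DF):
    """F statistic: effect mean square over error mean square."""
    return mean_square(ss, df) / mean_square(error_ss, error_df)


if __name__ == "__main__":
    for effect, (ss, df) in ANOVA_ROWS.items():
        print(effect, round(mean_square(ss, df), 4), round(f_ratio(ss, df), 2))
    total_ss = sum(ss for ss, _ in ANOVA_ROWS.values()) + ERROR_SS
    total_df = sum(df for _, df in ANOVA_ROWS.values()) + ERROR_DF
    print("Total", round(total_ss, 2), total_df)
```

The recomputed totals match the Total row, and only the Voice effect yields an F ratio far above 1.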

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12), and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may also have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and on listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ, POLQA)). Moreover, the accuracy of both the PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much smaller. Note that both Voran's and Hoene et al.'s research showed that single adjustments (430 ms in the case of Voran, approximately 0–320 ms in the case of Hoene et al.) were found to be disregarded by PESQ. Regarding question 3 and PESQ, a comparison of the subjective assessments and predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although Hoene et al.'s research [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either, during its design and integration phase, for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and the impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that whilst some relationship exists, it is both much less noticeable and much less characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to subjective scores (insignificant), except for Female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research and which are worthy of further investigation include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or the low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what impact, if any, will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though not statistically significantly so. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results, but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google, Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal of Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: Psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] ITU-T Del. Contr.: Study of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D.123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v.1.0. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.

more than the others (i.e. over all conditions, this voice was rated on average approx. 0.4 MOS-LQSn higher than the second male voice and approx. 0.6 MOS-LQSn higher than the female voices). It is also worth noting that both male voices obtained higher scores than the female voices.

As can be seen above, the impact of playout adjustments is not statistically significant. This fact raises the question of whether the impact would be higher if the Degradation Category Rating (DCR) testing approach were deployed. We considered this during the research design phase. We carried out limited DCR subjective testing (4 experts involved) and the subjects noted no degradation caused by time-shifting (playout adjustments). On this basis, we came to the conclusion that the introduced impairments would not be noticed by subjects in a DCR test. For that reason, we decided to use an ACR test.

Furthermore, it should be noted that in telephony, subjects have no access to the speech from their conversational partner, and thus ACR testing is commonly used in telephony speech quality assessment.

To the best of our knowledge, the results presented in this section are a first proof of the assertion published in the literature [1, 2, 10] that small and frequent silence period adjustments, typical of VoIP playout algorithms, do not have a noticeable effect on listening quality perceived by the end user.

It is interesting at this juncture to compare our results with the subjective results of Voran [11] and Hoene et al. [6]. Voran introduced pause/jump impairments at the rate of 1 to 4 per 3-second sentence, with pause/jump magnitudes of 30, 60 and 120 ms. We introduced 4 impairments in an 8-second sample, based on our observations of actual speech and using adjustments which were derived from realistic network delay models and real playout algorithms. As a result, our adjustments were typically much smaller (the largest was 30 ms). Voran noted that for 1 pause/jump adjustment of 30 ms, MOS scores fell by 0.2, whereas we found no significant drop in MOS scores as the extent of adjustments increased, even for conditions 10–12, where the magnitude of some of the adjustments was similar to Voran's at 20–30 ms. One key distinction, as stated before, is that Voran applied such impairments to active speech, whereas all our adjustments were made to silence periods. Furthermore, in the case of Voran's pause impairment, PLC was used. Finally, Voran reported a significant impact of a very large single adjustment (silence removal) of 430 ms. The magnitude of adjustments introduced in our samples was much smaller and, as reported above, no significant subjective impact was found. Hoene et al. in [6] introduced very large (in comparison to adjustments introduced by jitter buffers) single adjustments into active speech and also noted a significant impact, consistent with PESQ.

4.2. Experimental results for objective assessment

Figure 12 depicts the results of objective assessment done by PESQ (MOS-LQOn (PESQ)) using the same test conditions and speech samples. We can observe that the severity of the test conditions (playout adjustments) has a relatively big impact on the predicted MOS values. In summary, the MOS scores decrease as adjustments increase, i.e. as network instability increases. This is interesting in the context of the finding by Voran [11] that PESQ does not register any impact arising from a single 430 ms silence period adjustment (silence removal). The results of Hoene et al. from the PESQ analysis published in [5] are somewhat similar to Voran's, showing no impact for single adjustments of up to approx. 320 ms. One possible explanation is that in our tests we introduced frequent and small adjustments, rather than the single and large adjustments introduced by Voran and Hoene et al. We speculate that PESQ had difficulties with our adjustment profile during the time-alignment process. As detailed in Section 2.2, the frequency and extent of adjustments is impacted greatly by VAD settings, but our test design of 4 adjustments in 8 seconds is not atypical of VoIP.

Figure 12. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by PESQ. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: x-axis "Test Conditions" (Ref, No.1 to No.12); y-axis "MOS-LQOn (PESQ)" (1 to 5); series Variant A and Variant B.]

One positive correlation between PESQ and the subjective test results is that the biggest difference between variant A and B (0.47 MOS) was reported for test condition No.10. It should also be noted that there is a significant difference between the scores obtained for conditions No.10 and No.11. This is interesting in that, whilst the absolute sum of the adjustments for both conditions is the same, the individual adjustments are different. Test condition No.10 represents very different adjustments, varying from 5 to −30 ms. On the other hand, condition No.11 contains adjustments of similar magnitude: two adjustments of −15 ms and two adjustments of 15 ms. The second one (condition No.11) obtained much lower scores in both cases (Variant A and B). As such, it seems that PESQ is better able to cope with large adjustments (30 ms was the largest of all conditions). In contrast, the opposite results were obtained for the subjective data, i.e. listeners scored test condition No.10 lower than No.11. However, the differences between the subjective results obtained for test conditions No.10 and No.11 are much smaller than those obtained for the objective results and of the same order as the confidence interval (see Figure 11).
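The distinction between the absolute sum of the adjustments and the size of the individual adjustments can be made concrete with a short sketch. The No.11 profile below follows the description in the text (two adjustments of −15 ms and two of 15 ms). The No.10 profile is purely hypothetical: only its range (5 to −30 ms) and the fact that its absolute sum equals that of No.11 are taken from the text, and the individual values are invented for illustration. The sign convention (negative meaning a shortened silence period) is likewise an assumption.

```python
# Playout adjustment profiles in milliseconds.
# Assumed sign convention: negative = silence period shortened.
NO11 = [-15, -15, 15, 15]              # as described in the text
NO10_HYPOTHETICAL = [5, -30, -20, 5]   # invented values, illustration only


def absolute_sum(adjustments_ms):
    """Total amount of time-shifting in a sample, ignoring direction."""
    return sum(abs(a) for a in adjustments_ms)


def largest_magnitude(adjustments_ms):
    """Size of the single largest adjustment, ignoring direction."""
    return max(abs(a) for a in adjustments_ms)


for name, profile in (("No.10 (hypothetical)", NO10_HYPOTHETICAL),
                      ("No.11", NO11)):
    print(name, absolute_sum(profile), largest_magnitude(profile))
```

Both profiles give the same absolute sum while differing in the largest individual adjustment, which is the property the comparison between conditions No.10 and No.11 relies on.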

As with the subjective results, lower MOS-LQOn (PESQ) scores have been reported for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by PESQ ranged from 3.665 to 4.55 MOS for variant A and from 3.3 to 4.55 MOS for variant B. It can be clearly seen in Figure 12 that the impact of adjustment location is noticeable for objective scores predicted by PESQ, especially for higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by PESQ revealed that the voice impact has been much weaker than that reported above for the subjective data. In fact, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. For that reason, such models are sometimes called impairment or degradation models. However, it seems that there is some interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, because the deviation of the MOS-LQOn (PESQ) scores between the test conditions is different for all 4 voices involved in the test. A much weaker and not statistically significant interaction of test condition and voice has been obtained for the subjective data, as shown in Table V.

Figure 13 shows the results of objective assessment done by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of the POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of the PESQ predictions. Nonetheless, it seems that POLQA was impacted more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No.7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend in the POLQA scores, as has been done for PESQ above.

Regarding the biggest difference between variant A and B for test condition No.10, reported above for both subjective scores and objective scores predicted by PESQ, it is worth noting that this effect has not been captured by POLQA at all. In other words, the scores predicted by POLQA for variant A and B of test condition No.10 were very similar (0.02 MOS difference).

Moreover, the scores predicted by POLQA largely exhibit the same behaviour as the scores obtained from the subjective test from a location perspective (Variant A and B). In other words, it has been reported above that PESQ and the subjects involved in the subjective test provided lower MOS scores for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by POLQA ranged from 4.10 to 4.43 MOS for variant A and from 4.08 to 4.43 MOS for variant B. This contrasts with the results for PESQ, where adjustment location had a significant impact. It can be clearly seen in Figure 13 that the adjustment location plays a less important role here, except for some of the higher magnitudes of the investigated playout adjustments.

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). [Plot: x-axis "Test Conditions" (Ref, No.1 to No.12); y-axis "MOS-LQOn (POLQA)" (1 to 5); series Variant A and Variant B.]

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact has been much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. The interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, has also been observed for the POLQA model, but only for the female voices.

4.3. Comparison between subjective and objective quality scores

In the following subsection, subjective MOS values (MOS-LQSn) are compared to the predictions provided by both PESQ and POLQA (MOS-LQOn (PESQ, POLQA)). The comparison is performed for all experimental conditions, i.e. all combinations of voice, test conditions and both investigated location variants. However, the MOS-LQSn values will have been influenced by the choice of conditions in the actual experiment. In order to account for such influences, model predictions are commonly transformed to the range of conditions that are part of the respective test [54]. This may be done, for example, by using a monotonic 3rd order mapping function, presuming such a function can be found.

The performance of the PESQ and POLQA models is quantified in terms of the Pearson correlation coefficient R, the respective root mean square error (rmse) and the epsilon-insensitive root mean square error (rmse*), as [55, 56]

$$R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \qquad (3)$$

and

$$\mathrm{rmse} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} (X_i - Y_i)^2} \qquad (4)$$

with $X_i$ the subjective MOS value for stimulus $i$, $Y_i$ the objective (predicted) MOS value for stimulus $i$, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, $N$ the number of stimuli considered in the comparison, and $d$ the number of degrees of freedom provided by the mapping function ($d = 4$ in the case of a 3rd order mapping function, $d = 1$ in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

$$\mathrm{Perror}_i = \max\left(0,\ |X_i - Y_i| - \mathrm{ci95}_i\right) \qquad (5)$$

where $\mathrm{ci95}_i$ represents the 95% confidence interval and is defined by [56]

$$\mathrm{ci95}_i = t(0.05, M)\,\frac{\delta_i}{\sqrt{M}} \qquad (6)$$

where $M$ denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of the subjective scores for stimulus $i$. The final epsilon-insensitive root mean square error is calculated as usual, but based on Perror from formula (5):

$$\mathrm{rmse}^{*} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} \mathrm{Perror}_i^{2}} \qquad (7)$$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* considers only differences related to an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of the MOS is taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and the rmse and rmse* would both be 0.0.
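The three figures of merit defined in equations (3) to (7) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the evaluation code used in this study: the function names and the sample scores are hypothetical, and the per-stimulus confidence intervals (ci95) are passed in directly rather than derived from individual votes via equation (6).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, equation (3)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den


def rmse(x, y, d=1):
    """Root mean square error, equation (4); d = 1 means no regression."""
    n = len(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / (n - d))


def rmse_star(x, y, ci95, d=1):
    """Epsilon-insensitive rmse, equations (5) and (7)."""
    n = len(x)
    perror = [max(0.0, abs(a - b) - c) for a, b, c in zip(x, y, ci95)]
    return math.sqrt(sum(p * p for p in perror) / (n - d))


# Hypothetical scores for four stimuli (not data from this study).
subjective = [3.0, 3.5, 4.0, 4.5]   # X_i
objective = [3.2, 3.4, 4.3, 4.4]    # Y_i
ci95 = [0.25, 0.25, 0.25, 0.25]     # assumed 95% confidence intervals

print(pearson_r(subjective, objective))
print(rmse(subjective, objective))
print(rmse_star(subjective, objective, ci95))
```

Because prediction errors that fall inside the confidence interval are set to zero, rmse* can never exceed rmse for the same data and the same d.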

R, rmse and rmse* are all calculated for the raw (non-regressed) MOSn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined). Both the regressed and the non-regressed MOSn predictions are separated according to the variants in order to get an indication of the characteristics of the PESQ and POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*) are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (Variant A and B). The correlation coefficient is positive though very low (R = 0.17) for variant A and negative for variant B (R = −0.15). As normal, positive correlation indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases, the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed) per sample. [Scatter plot: x-axis "MOS-LQSn" (1 to 5); y-axis "MOS-LQOn (PESQ)" (1 to 5); series Variant A and Variant B.]

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed) per sample. [Scatter plot: x-axis "MOS-LQSn" (1 to 5); y-axis "MOS-LQOn (POLQA)" (1 to 5); series Variant A and Variant B.]

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed) per sample. [Scatter plot: x-axis "MOS-LQSn" (1 to 5); y-axis "MOS-LQOn (POLQA)" (1 to 5); series Variant A and Variant B.]

Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant    R       rmse     rmse*
A           0.17   0.639    0.367
B          -0.15   0.690    0.424

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant    R       rmse     rmse*
A           0.30   0.783    0.448
B           0.47   0.806    0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant    R       rmse     rmse*
A           0.30   0.265    0.243
B           0.47   0.270    0.217

The 3rd order regression, as recommended in [54], leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outlier influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried the 2nd and 1st order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this data set.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little intra-speaker difference between location variants, significant inter-speaker variability and very little intra-speaker variation across conditions No.1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No.1–12 and significant inter-speaker variation.

Figure 17. Dominant Experimental Factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals. [Plot: three panels (Subjective results, PESQ, POLQA); x-axis "Speaker/Variant" (M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B); y-axes "MOS-LQS" and "MOS-LQOn" (3 to 5).]

Regarding the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.
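The search for a monotonically increasing mapping function can be illustrated with a small sketch. It assumes that the mapping takes objective scores as input and subjective scores as output, fits a 1st order polynomial by ordinary least squares, and then checks the sign of the derivative over the MOS range 1 to 5. The function names and the sample scores are hypothetical and are not taken from this study; the same monotonicity check applies to higher-order polynomials given their coefficients.

```python
def fit_line(x, y):
    """Least-squares 1st order fit y ~ a*x + b; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxy = sum((v - mean_x) * (w - mean_y) for v, w in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def is_monotonically_increasing(coeffs, lo=1.0, hi=5.0, steps=400):
    """Check that a polynomial has a positive derivative on a grid over [lo, hi].

    coeffs lists the polynomial coefficients, highest order first.
    """
    order = len(coeffs) - 1
    # Coefficients of the derivative, highest order first.
    deriv = [c * (order - k) for k, c in enumerate(coeffs[:-1])]
    for s in range(steps + 1):
        x = lo + (hi - lo) * s / steps
        slope = 0.0
        for c in deriv:  # Horner evaluation of the derivative at x
            slope = slope * x + c
        if slope <= 0:
            return False
    return True


# Hypothetical scores in which the objective model moves against the listeners.
objective = [4.5, 4.2, 3.9, 3.6]
subjective = [3.4, 3.5, 3.6, 3.8]

a, b = fit_line(objective, subjective)
print(a, b, is_monotonically_increasing([a, b]))
```

A 1st order fit is always monotonic, so the only remaining question is the sign of its slope; a negative slope corresponds to the monotonically decreasing case described above, for which no usable mapping exists.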

Moving to POLQA, the results show little intra-speaker variation across variants (except for Female 1), very little intra-speaker variation across conditions No.1–12 (except for Female 1) and much less inter-speaker variation. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No.1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in both Tables III and IV, and the rmse data for POLQA after regression (1st order polynomial regression applied) is also much better than that for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.

Table V. Summary of ANOVA test conducted on the MOS-LQSn's.

Effect                     SS      df         MS       F        p
Test condition (TC)      9.23      12     0.7693    0.82   0.6350
Voice                  207.02       3    69.0073   73.12   0.0000
Variant                  1.01       1     1.0051    1.07   0.3022
TC*Voice                18.21      36     0.5059    0.54   0.9895
TC*Variant               4.68      12     0.3899    0.41   0.9593
Voice*Variant            0.68       3     0.2282    0.24   0.8672
Error                 2880.37    3052     0.9438
Total                 3121.2     3119
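The internal consistency of Table V can be checked with the usual ANOVA relations: each mean square is the sum of squares divided by its degrees of freedom, and each F value is the effect mean square divided by the error mean square. The sketch below assumes the decimal-point placement used when reading the table (for example 9.23 and 0.7693 for the test condition row); the variable and function names are illustrative only.

```python
# (effect, SS, df, MS, F) as read from Table V; decimal placement is an assumption.
ROWS = [
    ("Test condition (TC)", 9.23, 12, 0.7693, 0.82),
    ("Voice", 207.02, 3, 69.0073, 73.12),
    ("Variant", 1.01, 1, 1.0051, 1.07),
    ("TC*Voice", 18.21, 36, 0.5059, 0.54),
    ("TC*Variant", 4.68, 12, 0.3899, 0.41),
    ("Voice*Variant", 0.68, 3, 0.2282, 0.24),
]
ERROR_SS, ERROR_DF, ERROR_MS = 2880.37, 3052, 0.9438
TOTAL_SS, TOTAL_DF = 3121.2, 3119


def recompute(rows, error_ms):
    """Return (effect, SS/df, MS/MS_error) for each row of the table."""
    return [(name, ss / df, ms / error_ms) for name, ss, df, ms, _ in rows]


for name, ms_calc, f_calc in recompute(ROWS, ERROR_MS):
    print(f"{name:22s} MS = {ms_calc:8.4f}  F = {f_calc:6.2f}")

# The effect and error rows should add up to the Total row.
print(sum(r[1] for r in ROWS) + ERROR_SS, sum(r[2] for r in ROWS) + ERROR_DF)
```

Only the voice effect yields an F value far above 1, in line with the observation that voice was the dominant factor in the subjective scores.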

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12) and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may also have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ, POLQA)). Moreover, the accuracy of both the PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much smaller. Note that the research of both Voran and Hoene et al. showed that single adjustments (430 ms in the case of Voran, approximately 0–320 ms in the case of Hoene et al.) were disregarded by PESQ. Regarding question 3 and PESQ, a comparison of the subjective assessments and the predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although the research of Hoene et al. [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either, during its design and integration phase, for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and the impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that whilst some relationship exists, it is both much less noticeable and harder to characterise.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to the subjective scores (insignificant), except for Female 1.

Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research and which are worthy of further investigation include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or the low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what impact, if any, will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though not statistically significantly so. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results, but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google, Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.

630

Pocta et al Impact of playout delay adjustments ACTA ACUSTICA UNITED WITH ACUSTICAVol 101 (2015)

[23] J G Beerends A P Hekstra A W Rix M P Hollier Per-ceptual evaluation of speech quality (PESQ) - The new ITUstandard for objective measurement of perceived speechquality Part II psychoacoustic model J Audio Eng Soc50 (2002) 765ndash778

[24] J G Beerends C Schmidmer J Berger M Obermann RUllman J Pomy M Keyhl Perceptual objective listeningquality assessment (POLQA) The third generation ITU-Tstandard for end-to-end speech quality measurement PartI Temporal alignment J Audio Eng Soc 61 (2013) 366ndash384

[25] J G Beerends C Schmidmer J Berger M Obermann RUllman J Pomy M Keyhl Perceptual objective listeningquality assessment (POLQA) The third generation ITU-Tstandard for end-to-end speech quality measurement PartII Perceptual model J Audio Eng Soc 61 (2013) 385ndash402

[26] ITU-T Rec P861 Objective quality measurement of tele-phone-band (300ndash3400Hz) speech codecs InternationalTelecommunication Union Geneva Switzerland 1998

[27] ITU-T Rec P862 Perceptual evaluation of speech qual-ity (PESQ) An objective method for end-to-end speechquality assessment of narrow-band telephone networks andspeech codecs International Telecommunication UnionGeneva Switzerland 2001

[28] ITU-T Rec P863 Perceptual objective listening qual-ity assessment International Telecommunication UnionGeneva Switzerland 2011

[29] D-S Kim ANIQUE An auditory model for single-endedspeech quality estimation IEEE Transaction on Speech andAudio Processing 13 (2005) 821ndash831

[30] L Malfait J Berger M Kastner P563 - The ITU-T stan-dard for single-ended speech quality assessment IEEETransaction on Audio Speech and Language Processing 14(2006) 1924ndash1934

[31] ITU-T Rec P563 Single-ended method for objectivespeech quality assessment in narrow-band telephony appli-cations International Telecommunication Union GenevaSwitzerland 2004

[32] ITU-T Rec G107 The E-model a computational modelfor use in transmission planning International Telecom-munication Union Geneva Switzerland 2009

[33] ITU-T Rec G1071 Wideband E-model InternationalTelecommunication Union Geneva Switzerland 2011

[34] W Jiang H Schulzrinne Analysis of on-off patterns inVoIP and their effect on voice traffic aggregation Proceed-ings of International Conference on Computer Communi-cations and Networks (ICCCN 2000) Las Vegas USA Oc-tober 2000

[35] Z Qiao R Venkatasubramanian L Sun E Ifeachor Anew buffer algorithm for speech quality improvement inVoIP systems Wireless Personal Communications 45(2008) 189ndash207

[36] L Ding R Goubran Speech quality prediction in VoIP us-ing the extended E-model Proceedings of IEEE GlobecomSan Francisco USA 2003 vol7 3974ndash3978

[37] H Melvin L Murphy Exploring the extent and impact ofplayout adjustments within VoIP applications on the MOSProceedings of the International Conference on Measure-ment of Speech and Audio Quality in Networks 2005Prague Czech Republic Jun 2005

[38] H Melvin The use of synchronised time in Voice overIP (VoIP) applications PhD Thesis University CollegeDublin October 2004

[39] W Jiang H Schulzrinne QoS measurement of internetreal-time multimedia services Technical report CUCS-05-99m Columbia Univ NY Dec 1999

[40] W Jiang H Schulzrinne Modeling of packet loss and de-lay and their effect on real-time multimedia service qualityProceedings of NOSSDAV 2000 Chapel Hill USA Jun2000

[41] W Kellerer E Steinbach P Eisert B Girod A real-timeinternet streaming media testbed Proceedings of the IEEEInternational Conference on Multimedia and Expo Lau-sanne Switzerland Aug 2002

[42] R Koodli R Ravikanth One-way loss pattern sample met-rics IETF RFC 3357 Aug 2002

[43] C Boutremans J Boudec Adaptive joint playout bufferand fec adjustment for internet telephony Proceedings ofIEEE Infocom 2003 San Francisco USA Mar 2003

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] I D C D S of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech, and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis). ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v1.0. SwissQual AG (Author: Jens Berger). ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.


for objective results and of the same order as the confidence interval (see Figure 11).

As with the subjective results, lower MOS-LQOn (PESQ) scores have been reported for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by PESQ ranged from 3.665 to 4.55 MOS for variant A and from 3.3 to 4.55 MOS for variant B. It can be clearly seen in Figure 12 that the impact of adjustment location is noticeable for objective scores predicted by PESQ, especially for higher magnitudes of the investigated playout adjustments.

A diagnostic analysis of the objective data predicted by PESQ revealed that the voice impact has been much weaker than that reported above for the subjective data. In fact, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. For that reason, such models are sometimes called impairment or degradation models. However, it seems that there is some interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, because the deviation of the MOS-LQOn (PESQ) scores between the test conditions is different for all 4 voices involved in the test. A much weaker and not statistically significant interaction of test condition and voice has been obtained for the subjective data, as shown in Table V.

Figure 13 shows the results of the objective assessment done by POLQA (MOS-LQOn (POLQA)) using the same test conditions and speech samples. The trend of the POLQA predictions is much more in line with the subjective results presented in Figure 11 than that of the PESQ predictions. Nonetheless, it seems that POLQA was impacted more by the test conditions introducing playout adjustments with an absolute sum of adjustments above 45 ms (test conditions No. 7–12), especially those belonging to variant A. However, it is not possible to clearly identify a trend of POLQA scores, as has been done for PESQ above.

Regarding the biggest difference between variant A and B for test condition No. 10, reported above for both subjective scores and objective scores predicted by PESQ, it is worth noting that this effect has not been captured by POLQA at all. In other words, the scores predicted by POLQA for variant A and B of test condition No. 10 were very similar (0.02 MOS difference).

Moreover, the scores predicted by POLQA largely exhibit the same behaviour as the scores obtained from the subjective test from a location perspective (Variant A and B). In other words, it has been reported above that PESQ and the subjects involved in the subjective test provided lower MOS scores for most of the test conditions of variant B than for the same test conditions of variant A. The average quality scores (averaged over voices) predicted by POLQA ranged from 4.10 to 4.43 MOS for variant A and from 4.08 to 4.43 MOS for variant B. This contrasts with the results for PESQ, where adjustment location had a significant impact. It can be clearly seen in Figure 13 that the adjustment location plays a less important role here, except for some of the higher magnitudes of investigated playout adjustments.

Figure 13. Effect of test conditions (see Table I for more information about the investigated test conditions) on average MOS-LQOn predicted by POLQA. The vertical bars show 95% CI computed over 4 MOS-LQOn values (4 samples). (Plot: MOS-LQOn (POLQA), scale 1 to 5, against test conditions Ref and No. 1–12, for Variant A and Variant B.)

A diagnostic analysis of the objective data predicted by POLQA revealed that the voice impact has been much weaker than that reported above for the subjective data. As already stated above, intrusive signal-based models (e.g. PESQ and POLQA) are designed to focus more on impairments than on the special characteristics of voice. The interaction between the extent of impairments introduced in a sample (test condition) and the voice sample used, reported above for the PESQ model, has also been obtained for the POLQA model, but only for female voices.

4.3 Comparison between subjective and objective quality scores

In the following subsection, subjective MOS values (MOS-LQSn) are compared to the predictions provided by both PESQ and POLQA (MOS-LQOn (PESQ, POLQA)). The comparison is performed for all experimental conditions, i.e. all combinations of voice, test conditions and both investigated location variants. However, the MOS-LQSn values will have been influenced by the choice of conditions in the actual experiment. In order to account for such influences, model predictions are commonly transformed to a range of conditions that are part of the respective test [54]. This may be done, for example, by using a monotonic 3rd-order mapping function, presuming such a function can be found.

The performance of the PESQ and POLQA models is quantified in terms of the Pearson correlation coefficient R, the respective root mean square error (rmse) and the epsilon-insensitive root mean square error (rmse*) as [55, 56]:

$$R = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2 \, \sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \tag{3}$$

and

$$\mathrm{rmse} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} (X_i - Y_i)^2} \tag{4}$$

with $X_i$ the subjective MOS value for stimulus $i$, $Y_i$ the objective (predicted) MOS value for stimulus $i$, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, $N$ the number of stimuli considered in the comparison, and $d$ the number of degrees of freedom provided by the mapping function ($d = 4$ in the case of a 3rd-order mapping function, $d = 1$ in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

$$\mathrm{Perror}_i = \max\left(0, \left|X_i - Y_i\right| - ci_{95_i}\right) \tag{5}$$

where $ci_{95_i}$ represents the 95% confidence interval and is defined by [56]

$$ci_{95_i} = t(0.05, M) \, \frac{\delta_i}{\sqrt{M}} \tag{6}$$

where $M$ denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of subjective scores for stimulus $i$. The final epsilon-insensitive root mean square error is calculated as usual, but based on Perror from formula (5):

$$\mathrm{rmse}^{*} = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} \mathrm{Perror}_i^2} \tag{7}$$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* considers only differences related to an epsilon-wide band around the target value. The 'epsilon' is defined as the 95% confidence interval of the subjective MOS value. By definition, the uncertainty of the MOS is taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and both rmse and rmse* would be 0.0.
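The three figures of merit defined in formulas (3) to (7) can be illustrated with a minimal Python sketch. It is not the evaluation code used in the study: the function names are hypothetical, the per-stimulus confidence intervals are passed in precomputed, and the Student t quantile t(0.05, M) is supplied by the caller (for example from a statistics table) rather than computed here.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient R, as in formula (3)."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den


def rmse(x, y, d=1):
    """Root mean square error, as in formula (4).

    d is the number of degrees of freedom of the mapping function
    (d = 1 without regression, d = 4 for a 3rd-order mapping).
    """
    n = len(x)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / (n - d))


def ci95(std_dev, m, t_value):
    """95% confidence interval of one stimulus, as in formula (6).

    t_value stands for t(0.05, M) and is an input here (assumption).
    """
    return t_value * std_dev / math.sqrt(m)


def rmse_star(x, y, ci, d=1):
    """Epsilon-insensitive rmse, as in formulas (5) and (7)."""
    n = len(x)
    perror = [max(0.0, abs(xi - yi) - ci_i)
              for xi, yi, ci_i in zip(x, y, ci)]
    return math.sqrt(sum(p * p for p in perror) / (n - d))
```

Here `x` holds the subjective MOS values and `y` the predicted ones. Because Perror discounts each absolute difference by the confidence interval of the corresponding subjective score, `rmse_star` can never exceed `rmse` for the same data and the same `d`.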

All R, rmse and rmse* values are calculated for the raw (non-regressed) MOSn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined). Both the regressed and the non-regressed MOSn predictions are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*) are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (Variant A and B). The correlation coefficient is positive though very low (R = 0.17) for variant A and negative for variant B (R = −0.15). As usual, positive correlation indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases, the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed), per sample. (Scatter plot, both axes from 1 to 5, Variant A and Variant B.)

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed), per sample. (Scatter plot, both axes from 1 to 5, Variant A and Variant B.)

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed), per sample. (Scatter plot, both axes from 1 to 5, Variant A and Variant B.)

Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

| Variant | R | rmse | rmse* |
|---|---|---|---|
| A | 0.17 | 0.639 | 0.367 |
| B | −0.15 | 0.690 | 0.424 |

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

| Variant | R | rmse | rmse* |
|---|---|---|---|
| A | 0.30 | 0.783 | 0.448 |
| B | 0.47 | 0.806 | 0.475 |

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

| Variant | R | rmse | rmse* |
|---|---|---|---|
| A | 0.30 | 0.265 | 0.243 |
| B | 0.47 | 0.270 | 0.217 |

The 3rd-order regression, as recommended in [54],
leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outlier influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried 2nd- and 1st-order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this dataset.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little intra-speaker difference between location variants, significant inter-speaker variability and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12 and significant inter-speaker variation.

Figure 17. Dominant experimental factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals. (Three panels: subjective results (MOS-LQS), PESQ (MOS-LQOn) and POLQA (MOS-LQOn), each on a scale from 3 to 5 against speaker/variant M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B.)

Regarding the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.
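The check that ruled out a mapping function for PESQ, namely that a fitted polynomial must be monotonically increasing over the MOS range, can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, only the 1st-order case is fitted (ordinary least squares in pure Python), and the sample values are invented for demonstration and are not data from the study.

```python
from statistics import mean


def fit_first_order(objective, subjective):
    """Least-squares line: subjective ~ a * objective + b."""
    mx, my = mean(objective), mean(subjective)
    sxx = sum((x - mx) ** 2 for x in objective)
    sxy = sum((x - mx) * (y - my) for x, y in zip(objective, subjective))
    a = sxy / sxx
    b = my - a * mx
    return a, b


def is_monotonically_increasing(mapping, lo=1.0, hi=5.0, steps=400):
    """Numerically check that mapping(x) strictly increases on [lo, hi]."""
    xs = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    ys = [mapping(x) for x in xs]
    return all(y2 > y1 for y1, y2 in zip(ys, ys[1:]))


# Invented example: predictions rise while subjective scores fall,
# so the fitted line has a negative slope and is rejected.
objective = [3.4, 3.8, 4.1, 4.5]
subjective = [3.9, 3.8, 3.7, 3.6]
a, b = fit_first_order(objective, subjective)
usable = is_monotonically_increasing(lambda x: a * x + b)
print(f"slope = {a:.3f}, usable as mapping function: {usable}")
```

Since `is_monotonically_increasing` accepts any callable, the same check could be applied to a 2nd- or 3rd-order polynomial fitted by other means.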

Moving to POLQA, the results show little intra-speaker variation across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except Female 1) and much less inter-speaker variation. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA are significantly better, as shown in both Tables III and IV, and the rmse data for POLQA after regression (1st-order polynomial regression applied) are also much better than for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.


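Table V. Summary of ANOVA test conducted on the MOS-LQSn's.

| Effect | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Test condition (TC) | 9.23 | 12 | 0.7693 | 0.82 | 0.6350 |
| Voice | 207.02 | 3 | 69.0073 | 73.12 | 0.0000 |
| Variant | 1.01 | 1 | 1.0051 | 1.07 | 0.3022 |
| TC × Voice | 18.21 | 36 | 0.5059 | 0.54 | 0.9895 |
| TC × Variant | 4.68 | 12 | 0.3899 | 0.41 | 0.9593 |
| Voice × Variant | 0.68 | 3 | 0.2282 | 0.24 | 0.8672 |
| Error | 2880.37 | 3052 | 0.9438 | | |
| Total | 3121.2 | 3119 | | | |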
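As a worked check of how the columns of Table V relate to each other, the sketch below recomputes each F ratio as the effect mean square (SS/df) divided by the error mean square. The numbers are the Table V entries read with their decimal points restored, which is an assumption about the extracted text. The p-values are not recomputed, since that would need the F distribution.

```python
# Table V entries with decimal points restored (assumption):
# effect -> (sum of squares, degrees of freedom, F as reported).
TABLE_V = {
    "Test condition (TC)": (9.23, 12, 0.82),
    "Voice": (207.02, 3, 73.12),
    "Variant": (1.01, 1, 1.07),
    "TC x Voice": (18.21, 36, 0.54),
    "TC x Variant": (4.68, 12, 0.41),
    "Voice x Variant": (0.68, 3, 0.24),
}
SS_ERROR, DF_ERROR = 2880.37, 3052


def mean_square(ss, df):
    """Mean square: sum of squares divided by degrees of freedom."""
    return ss / df


def f_ratio(ss, df, ss_error=SS_ERROR, df_error=DF_ERROR):
    """F statistic: effect mean square over error mean square."""
    return mean_square(ss, df) / mean_square(ss_error, df_error)


for effect, (ss, df, reported_f) in TABLE_V.items():
    print(f"{effect:20s} F = {f_ratio(ss, df):6.2f} (reported {reported_f})")
```

Only the voice effect yields a large F ratio, which matches the reading of the table that voice is the only significant factor in the subjective data.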

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12) and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may also have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ, POLQA)). Moreover, the accuracy of both the PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much smaller. Note that the research of both Voran and Hoene et al. showed that single adjustments (430 ms in the case of Voran, approximately 0-320 ms in the case of Hoene et al.) were found to be disregarded by PESQ. Regarding question 3 and PESQ, a comparison of the subjective assessments and predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although the research of Hoene et al. [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either during its design and integration phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and the impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that, whilst some relationship exists, it is both much less noticeable and much less characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to subjective scores (insignificant), except for Female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research and which are worthy of further investigation include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or the low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what impact, if any, will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though not statistically significantly so. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results, but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002, Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal of Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Počta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: Psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech, and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z Li J Chakareski X Niu Y Zhang W Gu Modelingand analysis of distortion caused by Markov-model burstpacket losses in video transmission IEEE Transactions oncircuits and systems for video technology 19 (2009) 917ndash931

[45] H Melvin P O Flaithearta J Shannon L B YusteTime synchronisation at application level Potential bene-fits challenges and solutions International Telecom Syn-chronisation Forum (ITSF) 2009bdquo Rome Italy Nov 2009

[46] P O Flaithearta H Melvin E-Model based prioritiza-tion of multiple VoIP sessions over 80211e Proceedingsof Digital Technologies 2010 conference Žilina Slovakia2010

[47] O Stapleton Quantifying the effectiveness of PESQ (Per-ceptual Evaluation of Speech Quality) in copying with fre-quent time shifting Master Thesis NUI GalwayRegisUniversity Colorado USA 2010

[48] ITU-T P Supplement 23 ITU-T coded-speech databaseInternational Telecommunication Union Geneva Switzer-land 1998

[49] ITU-T Rec P8621 Mapping function for transform-ing P862 raw result scores to MOS-LQO InternationalTelecommunication Union Geneva Switzerland 2003

[50] A W Rix Comparison between subjective listening qual-ity and P862 PESQ White paper Psytechnics LtdSeptember 2003

[51] I D C D S of the relationship between instantaneous andoverall subjective speech quality for time-varying qualityspeech sequences Influence of a recency effect SourceFrance Telecom RampD France (N Chateau) InternationalTelecommunication Union Geneva Switzerland 2000

[52] J H Rosenbluth Testing the quality of connections hav-ing time varying impairments Committee contributionT1A1798-031

[53] L Gros N Chateau Instantaneous and overall judgmentsfor time-varying speech quality Assessments and relation-ships Acta Acustica united with Acustica 87 (2001) 367ndash377

[54] A W Rix J G Beerends D-S Kim P Kroon O GhitzaObjective assessment of speech and audio quality - Tech-nology and applications IEEE Transaction on AudioSpeech and Language Processing 14 (2006) 1890ndash1901

[55] ITU-T Del Contr D123 Proposed procedure for the eval-uation of objective metrics L M Ericsson (Author IrinaCotanis) ITU-T SG 12 Meeting Geneva SwitzerlandJune 5ndash13 2006

[56] ITU-T TD12rev1 Statistical evaluation procedure forPOLQA v10 SwissQual AG (Author Jens Berger) ITU-T SG 12 Meeting Geneva Switzerland March 10ndash192009

631

Pocta et al Impact of playout delay adjustments ACTA ACUSTICA UNITED WITH ACUSTICAVol 101 (2015)

and

$$\mathrm{rmse} = \sqrt{\frac{1}{N-d}\sum_{i=1}^{N}\left(X_i - Y_i\right)^2} \qquad (4)$$

with $X_i$ the subjective MOS value for stimulus $i$, $Y_i$ the objective (predicted) MOS value for stimulus $i$, $\bar{X}$ and $\bar{Y}$ the corresponding arithmetic mean values, $N$ the number of stimuli considered in the comparison, and $d$ the number of degrees of freedom provided by the mapping function ($d = 4$ in the case of a 3rd-order mapping function, $d = 1$ in the case of no regression). On the other hand, the epsilon-insensitive root mean square error can be described as

$$\mathrm{Perror}_i = \max\left(0,\ \left|X_i - Y_i\right| - ci95_i\right) \qquad (5)$$

where $ci95_i$ represents the 95 % confidence interval and is defined by [56]:

$$ci95_i = t(0.05, M)\,\frac{\delta_i}{\sqrt{M}} \qquad (6)$$

where $M$ denotes the number of individual subjective scores and $\delta_i$ is the standard deviation of the subjective scores for stimulus $i$. The final epsilon-insensitive root mean square error is calculated as usual, but based on Perror from formula (5):

$$\mathrm{rmse}^{*} = \sqrt{\frac{1}{N-d}\sum_{i=1}^{N}\mathrm{Perror}_i^{2}} \qquad (7)$$

The correlation R indicates the strength and the direction of a linear relationship between the subjective (auditory) and the predicted MOS values; it is largely influenced by the existence of data points at the extremities of the scales. The root mean square error (rmse) describes the spread of the data points around the linear relationship. The epsilon-insensitive root mean square error (rmse*) is a similar measure to the classical rmse, but rmse* counts only the part of each difference that exceeds an epsilon-wide band around the target value. The 'epsilon' is defined as the 95 % confidence interval of the subjective MOS value. By definition, the uncertainty of the MOS is therefore taken into account in this evaluation. In the case of perfect agreement between subjective and objective scores, the correlation would be R = 1.0 and both rmse and rmse* would be 0.0.
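The three statistics described above can be illustrated with a minimal Python sketch. The function names, the example scores and the listener count are hypothetical and not taken from the study. The Student-t quantile is passed in by the caller, because the Python standard library does not provide it, and the absolute value in Perror is assumed.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def rmse(x, y, d=1):
    """Root mean square error with N - d in the denominator (d = 1: no regression)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / (len(x) - d))


def ci95(std_dev, m, t_value):
    """95 % confidence interval of one stimulus from m individual scores."""
    return t_value * std_dev / math.sqrt(m)


def rmse_star(x, y, ci, d=1):
    """Epsilon-insensitive rmse: only the error outside the confidence band counts."""
    perror = [max(0.0, abs(a - b) - c) for a, b, c in zip(x, y, ci)]
    return math.sqrt(sum(p ** 2 for p in perror) / (len(x) - d))


# Made-up illustrative values (NOT data from the paper).
subjective = [3.8, 3.6, 4.0, 3.7]   # X_i
objective = [4.1, 3.2, 4.0, 3.9]    # Y_i
std_devs = [0.9, 1.0, 0.8, 0.9]     # delta_i
M = 24                              # assumed number of listeners per stimulus
T_VALUE = 2.069                     # assumed two-sided 95 % t quantile, 23 d.o.f.

bands = [ci95(s, M, T_VALUE) for s in std_devs]
print(pearson_r(subjective, objective))
print(rmse(subjective, objective))
print(rmse_star(subjective, objective, bands))
```

Because rmse* discards any error that lies inside the confidence band, it can never exceed rmse for the same data and the same d.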

R, rmse and rmse* are all calculated for the raw (non-regressed) MOSn predictions and for the regressed MOS-LQOn values obtained with the help of the monotonic mapping function (if such a function can be determined). Both the regressed and the non-regressed MOSn predictions are separated according to the variants in order to get an indication of the characteristics of the PESQ/POLQA models on different types of test data.

Figure 14 compares the MOS-LQSn values with the raw model predictions (MOS-LQOn (PESQ)). The corresponding correlations R, root mean square errors (rmse) and epsilon-insensitive root mean square errors (rmse*) are given in Table II. The correlation is calculated over all test conditions and voices for both adjustment locations (Variant A and B). The correlation coefficient is positive, though very low (R = 0.17), for variant A and negative for variant B (R = -0.15). As normal, positive correlation indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases, the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

Figure 14. Subjective results (MOS-LQSn) versus MOS-LQOn (PESQ) scores (non-regressed) per sample (Variant A and Variant B; both axes span 1 to 5).

Figure 15. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (non-regressed) per sample (Variant A and Variant B; both axes span 1 to 5).

Figure 16. Subjective results (MOS-LQSn) versus MOS-LQOn (POLQA) scores (regressed) per sample (Variant A and Variant B; both axes span 1 to 5).

Table II. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression.

Variant   R       rmse    rmse*
A         0.17    0.639   0.367
B         -0.15   0.690   0.424

Table III. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression.

Variant   R       rmse    rmse*
A         0.30    0.783   0.448
B         0.47    0.806   0.475

Table IV. Pearson correlation coefficient, root mean square error and epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression.

Variant   R       rmse    rmse*
A         0.30    0.265   0.243
B         0.47    0.270   0.217

The 3rd-order regression, as recommended in [54], leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outlier influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried 2nd- and 1st-order polynomial regression. The latter led to monotonic results, but unfortunately the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this data set.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little intra-speaker difference between location variants, significant inter-speaker variability, and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12 and significant inter-speaker variation.

Figure 17. Dominant experimental factors: results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and the predictions provided by PESQ and POLQA are presented along with the 95 % confidence intervals. Panels: Subjective results (MOS-LQS), PESQ (MOS-LQOn) and POLQA (MOS-LQOn); horizontal axis: Speaker/Variant (M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B); vertical axes span 3 to 5.

Regarding the PESQ results, i.e. the very low correlation between subjective and objective data and the inability to find a monotonically increasing mapping function for this data set, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.

Moving to POLQA, the results show little intra-speaker variation across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except for Female 1) and much less inter-speaker variation. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in both Tables III and IV, and the rmse data for POLQA after regression (1st-order polynomial regression applied) is also much better than that for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.
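The requirement placed on the mapping function (fit a polynomial, then accept it only if it is monotonically increasing over the MOS range) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure used in the study: only a 1st-order least-squares fit is implemented, monotonicity is checked numerically on a grid, the function names are hypothetical and the scores are made up.

```python
from statistics import mean


def fit_line(objective, subjective):
    """Least-squares 1st-order mapping: subjective ~ a + b * objective."""
    mx, my = mean(objective), mean(subjective)
    sxx = sum((x - mx) ** 2 for x in objective)
    sxy = sum((x - mx) * (y - my) for x, y in zip(objective, subjective))
    b = sxy / sxx
    return [my - b * mx, b]  # polynomial coefficients, lowest order first


def is_monotonically_increasing(coeffs, lo=1.0, hi=5.0, steps=400):
    """True if the polynomial's derivative is positive on a grid over [lo, hi]."""
    deriv = [k * c for k, c in enumerate(coeffs)][1:]
    for n in range(steps + 1):
        x = lo + (hi - lo) * n / steps
        if sum(c * x ** k for k, c in enumerate(deriv)) <= 0:
            return False
    return True


# Made-up illustrative scores (NOT data from the paper): the subjective
# score falls as the objective score rises, so the fitted slope is negative.
objective = [3.2, 3.6, 3.9, 4.1, 4.3]
subjective = [3.9, 3.8, 3.8, 3.7, 3.6]

mapping = fit_line(objective, subjective)
print(mapping)
print(is_monotonically_increasing(mapping))  # False: the line is decreasing
```

A fit that fails this check is monotonic but decreasing, which is the situation reported for the PESQ data, so no regressed scores can be derived from it.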


Table V. Summary of the ANOVA test conducted on the MOS-LQSn's.

Effect                 SS        df     MS        F       p
Test condition (TC)    9.23      12     0.7693    0.82    0.6350
Voice                  207.02    3      69.0073   73.12   0.0000
Variant                1.01      1      1.0051    1.07    0.3022
TC × Voice             18.21     36     0.5059    0.54    0.9895
TC × Variant           4.68      12     0.3899    0.41    0.9593
Voice × Variant        0.68      3      0.2282    0.24    0.8672
Error                  2880.37   3052   0.9438
Total                  3121.2    3119
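The entries of Table V can be cross-checked with the usual ANOVA identities MS = SS / df and F = MS / MS_error. The short Python sketch below does this, reading the values as SS = 9.23, df = 12, MS = 0.7693, F = 0.82 for the test condition, and so on. The decimal placement is an assumption inferred from these identities, and the tolerance of 0.01 allows for rounding in the printed figures.

```python
# Effect -> (SS, df, MS, F), read from Table V with decimal points assumed.
table_v = {
    "Test condition (TC)": (9.23, 12, 0.7693, 0.82),
    "Voice": (207.02, 3, 69.0073, 73.12),
    "Variant": (1.01, 1, 1.0051, 1.07),
    "TC x Voice": (18.21, 36, 0.5059, 0.54),
    "TC x Variant": (4.68, 12, 0.3899, 0.41),
    "Voice x Variant": (0.68, 3, 0.2282, 0.24),
}
error_ss, error_df, error_ms = 2880.37, 3052, 0.9438


def check_row(ss, df, ms, f, tol=0.01):
    """Check MS = SS / df and F = MS / MS_error within a rounding tolerance."""
    return abs(ss / df - ms) <= tol and abs(ms / error_ms - f) <= tol


for effect, row in table_v.items():
    print(effect, check_row(*row))

# The sums of squares and degrees of freedom should add up to the Total row.
total_ss = sum(row[0] for row in table_v.values()) + error_ss
total_df = sum(row[1] for row in table_v.values()) + error_df
print(round(total_ss, 2), total_df)
```

Only the Voice effect has an F value far above 1, which matches its reported p value of 0.0000 and the strong inter-speaker variability noted in the text.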

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12), and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and on listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ/POLQA)). Moreover, the accuracy of both the PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much smaller. Note that the research of both Voran and Hoene et al. showed that single adjustments (430 ms in the case of Voran, 0-320 ms (approximately) in the case of Hoene et al.) were found to be disregarded by PESQ. Regarding question 3 and PESQ, a comparison of the subjective assessments and the predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although the research of Hoene et al. [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for the frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was likewise not explicitly verified during its design and integration phase for the frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model too. It should also be noted that the POLQA results presented in this paper are part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and the impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and the objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that, whilst some relationship exists, it is both much less noticeable and much less characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to the subjective scores (insignificant), except for Female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research and which are worthy of further investigation include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or the low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though the difference was not statistically significant. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results, but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal of Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: Psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] ITU-T Del. Contr.: Study of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D.123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v1.0. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.

631

ACTA ACUSTICA UNITED WITH ACUSTICA Pocta et al Impact of playout delay adjustmentsVol 101 (2015)

Table II Pearson correlation coefficient root mean square errorand epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (PESQ) before regression

Variant R rmse rmselowast

A 017 0639 0367B -015 0690 0424

Table III Pearson correlation coefficient root mean square errorand epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) before regression

Variant R rmse rmselowast

A 030 0783 0448B 047 0806 0475

Table IV Pearson correlation coefficient root mean square errorand epsilon-insensitive root mean square error between MOS-LQSn and MOS-LQOn (POLQA) after regression

Variant R rmse rmselowast

A 030 0265 0243B 047 0270 0217

indicates that both variables increase or decrease together, whereas negative correlation indicates that as one variable increases the other decreases, and vice versa. Moreover, the smallest rmse and rmse* were also obtained for variant A.

The 3rd-order regression, as recommended in [54], leads in this case to a non-monotonically decreasing mapping function, as opposed to a function that should be monotonically increasing. There are several options available to try to achieve monotonicity in such cases (e.g. outliers influence weighting, polynomial order change or non-polynomial function regression). In an attempt to use common polynomial regression and to avoid the sometimes questionable outlier penalization, we tried the 2nd- and 1st-order polynomial regression. The latter led to monotonic results but, unfortunately, the function was still monotonically decreasing. As such, we were not able to find a monotonically increasing mapping function for this dataset.

Figure 17 presents the results broken down by speaker and variant. The subjective scores confirm again that there was little difference between location variants intra-speaker, significant inter-speaker variability and very little intra-speaker variation across conditions No. 1–12. As previously shown, the PESQ results exhibit a trend for higher MOS-LQOn scores in variant A over variant B, significant intra-speaker variation across conditions No. 1–12 and significant inter-speaker variation.

Regarding PESQ results, i.e. very low correlation between subjective and objective data and inability to find a monotonically increasing mapping function for this dataset, we conclude that PESQ fails to correctly predict quality scores for this kind of degradation. In other words, PESQ is not able to correctly model the average user perception of the impact of frequent playout delay adjustments introduced by VoIP jitter buffers.

Figure 17. Dominant Experimental Factors. Results aggregated by speaker (e.g. M1 is male speaker 1) and by playout variant (i.e. A or B). The subjective scores and predictions provided by PESQ and POLQA are presented along with the 95% confidence intervals. [Three panels: Subjective results (MOS-LQS), PESQ (MOS-LQOn) and POLQA (MOS-LQOn); x-axis: Speaker/Variant (M1A, M1B, M2A, M2B, F1A, F1B, F2A, F2B); y-axis from 3 to 5.]
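As an illustration of how a mean opinion score and a 95% confidence interval of the kind plotted in Figure 17 can be obtained, the following minimal sketch uses a normal approximation. The function name `mos_with_ci95` and the example votes are hypothetical and are not taken from the study; the exact interval procedure used in the paper is not stated here, so the 1.96 factor is an assumption.

```python
import math
import statistics


def mos_with_ci95(votes):
    """Return (mean opinion score, half-width of the 95% confidence interval)."""
    n = len(votes)
    mean = statistics.mean(votes)
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(votes)
    # Normal approximation (assumption): 1.96 is the two-sided 95% quantile.
    half_width = 1.96 * sd / math.sqrt(n)
    return mean, half_width


# Hypothetical votes on a 5-point scale, NOT data from the listening test.
example_votes = [4, 3, 4, 5, 3, 4, 4, 3, 5, 4]
mos, ci = mos_with_ci95(example_votes)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With few votes per condition a Student-t quantile would give a slightly wider interval than the normal approximation used in this sketch.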

Moving to POLQA, results show little variation intra-speaker across variants (except for Female 1), very little intra-speaker variation across conditions No. 1–12 (except Female 1) and much less variation inter-speaker. This clearly shows that POLQA performs much better than PESQ in predicting the insignificant impact of conditions No. 1–12 and also the relatively insignificant impact of variants A/B (except for Female 1). Finally, the correlation data for POLQA is significantly better, as shown in both Tables III and IV, and rmse data for POLQA (after regression, with 1st-order polynomial regression applied) is also much better than for PESQ. It is worth noting that the low correlations obtained for both models are due to individual user preferences for voice.
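The evaluation steps discussed above (correlation, 1st-order polynomial mapping of objective onto subjective scores, a monotonicity check and rmse after mapping) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and the score lists are hypothetical, the rmse is the plain root mean square error without any degrees-of-freedom correction, and rmse* is not reproduced.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_first_order(objective, subjective):
    """Least-squares line mapping objective scores onto subjective scores."""
    n = len(objective)
    mx, my = sum(objective) / n, sum(subjective) / n
    sxx = sum((a - mx) ** 2 for a in objective)
    sxy = sum((a - mx) * (b - my) for a, b in zip(objective, subjective))
    slope = sxy / sxx
    return slope, my - slope * mx


def rmse(predicted, target):
    """Plain root mean square error (no degrees-of-freedom correction)."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
    )


# Hypothetical per-condition scores, NOT values from the paper.
mos_lqo = [4.3, 4.1, 3.9, 3.7, 3.5]
mos_lqs = [3.6, 3.5, 3.6, 3.4, 3.5]

slope, intercept = fit_first_order(mos_lqo, mos_lqs)
mapped = [slope * x + intercept for x in mos_lqo]

print("Pearson r:", round(pearson(mos_lqo, mos_lqs), 3))
# A 1st-order mapping is monotonically increasing only if its slope is positive.
print("monotonically increasing:", slope > 0)
print("rmse before mapping:", round(rmse(mos_lqo, mos_lqs), 3))
print("rmse after mapping:", round(rmse(mapped, mos_lqs), 3))
```

A negative slope in such a fit corresponds to the situation reported for PESQ above, where the mapping function was monotonic but decreasing.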


Table V. Summary of ANOVA test conducted on the MOS-LQSn's.

Effect                 SS        df     MS        F       p
Test condition (TC)    9.23      12     0.7693    0.82    0.6350
Voice                  207.02    3      69.0073   73.12   0.0000
Variant                1.01      1      1.0051    1.07    0.3022
TC*Voice               18.21     36     0.5059    0.54    0.9895
TC*Variant             4.68      12     0.3899    0.41    0.9593
Voice*Variant          0.68      3      0.2282    0.24    0.8672
Error                  2880.37   3052   0.9438
Total                  3121.2    3119
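The entries of Table V are linked by MS = SS/df and F = MS(effect)/MS(error). The short sketch below recomputes the F ratios from the sums of squares and degrees of freedom as a consistency check. The numeric values are read from Table V with decimal points restored, which is an assumption made for this illustration; the variable and function names are hypothetical.

```python
# (effect, sum of squares, degrees of freedom, F as reported in Table V)
rows = [
    ("Test condition (TC)", 9.23, 12, 0.82),
    ("Voice", 207.02, 3, 73.12),
    ("Variant", 1.01, 1, 1.07),
    ("TC*Voice", 18.21, 36, 0.54),
    ("TC*Variant", 4.68, 12, 0.41),
    ("Voice*Variant", 0.68, 3, 0.24),
]
error_ss, error_df = 2880.37, 3052

# Mean square of the error term: MS = SS / df.
ms_error = error_ss / error_df


def f_ratio(ss, df):
    """F statistic of an effect: its mean square divided by the error mean square."""
    return (ss / df) / ms_error


results = [(name, f_ratio(ss, df), reported) for name, ss, df, reported in rows]
for name, computed, reported in results:
    print(f"{name:20s} F computed = {computed:6.2f}   F reported = {reported:6.2f}")
```

Only the Voice effect yields a large F ratio, which agrees with the observation that speaker voice dominated the subjective scores.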

We suggest a number of reasons for the particularly poor performance of the PESQ model in predicting quality scores for the investigated conditions. Firstly, we have shown that PESQ is more sensitive to the investigated adjustments than subjects are (see Figures 11 and 12) and the impact is proportional to the adjustments. Secondly, although the impact of voice on subjective scores is well known, the impact was more significant in our subjective test than expected. Thirdly, as discussed in subsection 4.1, we speculate that exposure to wideband telephony and/or the small range of impairments also influenced the subjective results and thus the prediction performance of the PESQ model. It should also be noted here that these factors may also have had an impact on the performance of the POLQA model (correlation between the objective and subjective data) in this experiment.

5. Conclusions and future work

In this paper, we have investigated the impact of playout adjustments introduced by VoIP applications on quality scores obtained from a subjective listening test (MOS-LQSn) and listening quality scores predicted by both the PESQ and POLQA models (MOS-LQOn (PESQ/POLQA)). Moreover, the accuracy of both PESQ and POLQA models has also been assessed by comparing their predicted values with subjective scores. Five specific questions outlined in Section 2.3 were addressed in our study.

Addressing the first question, we report that the impact of frequent and small silence period adjustments (playout adjustments introduced by jitter buffers in VoIP) on subjective listening quality scores is insignificant. To the best of our knowledge, the subjective results presented in this paper are a first proof of the assertion published in the literature [1, 2, 10] that the playout adjustments introduced by jitter buffers in VoIP scenarios do not have a noticeable effect on listening quality perceived by the end user.

Regarding the second question, we report that the investigated impairments (playout adjustments introduced by jitter buffers) have a significant impact on objective listening MOS scores predicted by the PESQ model, whereas the impact on POLQA, though present, was much less. Note that both Voran's and Hoene et al.'s research showed that single adjustments (430ms in the case of Voran, 0-320ms (approximately) in the case of Hoene et al.) were found to be disregarded by PESQ.

Regarding question 3 and PESQ, a comparison of the subjective assessments and predictions provided by PESQ has shown that PESQ is not able to accurately predict the impact of frequent adjustments introduced by VoIP jitter buffers on listening quality perceived by the end user. Although Hoene et al.'s research [6] has shown that the PESQ model provides relatively accurate predictions for adjustments in active speech, our results suggest that PESQ performance for multiple small adjustments within silences is inaccurate. It has to be emphasized here that the PESQ model was not explicitly verified during its integration and characterization phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model.

Regarding question 3 and POLQA, our results show that POLQA is noticeably better at predicting the subjective scores. It has to be emphasized here that the POLQA model was not explicitly verified either during its design and integration phase for frequent time shifting (playout adjustments) that results from VoIP applications with adaptive buffering over congested networks. As such, our research represents a somewhat out-of-domain use case for this model. It should also be noted that the POLQA results presented in this paper are also a part of the characterization phase of the POLQA model (and will be published in an application guide of P.863 in very limited form).

Question 4 sought to determine a relationship between the magnitude of adjustments and impact on objective and subjective listening quality scores. Results indicate an insignificant impact on subjective scores in this study. In contrast, we report a strong relationship between the extent of adjustments and objective scores predicted by PESQ. In particular, the impact of the investigated impairment increases with its extent. Regarding POLQA, we report that whilst some relationship exists, it is both much less noticeable and characterisable.

Addressing the last question, the impact of the position in the sample where adjustments are made is insignificant for subjective scores. On the other hand, the effect of the position was found to be both noticeable and consistent for objective scores predicted by the PESQ model, especially for higher magnitudes of the investigated playout adjustments. Regarding POLQA, however, results were much closer to subjective scores (insignificant), except for female 1.


Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research and which are worthy of further investigation include:

• Why, in the subjective test, did listeners return scores significantly lower than predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though not statistically significant. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in PESQ objective results but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002 Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual Service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and fec adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on circuits and systems for video technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in copying with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] I D C D S of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1798-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v10. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.


[47] O Stapleton Quantifying the effectiveness of PESQ (Per-ceptual Evaluation of Speech Quality) in copying with fre-quent time shifting Master Thesis NUI GalwayRegisUniversity Colorado USA 2010

[48] ITU-T P Supplement 23 ITU-T coded-speech databaseInternational Telecommunication Union Geneva Switzer-land 1998

[49] ITU-T Rec P8621 Mapping function for transform-ing P862 raw result scores to MOS-LQO InternationalTelecommunication Union Geneva Switzerland 2003

[50] A W Rix Comparison between subjective listening qual-ity and P862 PESQ White paper Psytechnics LtdSeptember 2003

[51] I D C D S of the relationship between instantaneous andoverall subjective speech quality for time-varying qualityspeech sequences Influence of a recency effect SourceFrance Telecom RampD France (N Chateau) InternationalTelecommunication Union Geneva Switzerland 2000

[52] J H Rosenbluth Testing the quality of connections hav-ing time varying impairments Committee contributionT1A1798-031

[53] L Gros N Chateau Instantaneous and overall judgmentsfor time-varying speech quality Assessments and relation-ships Acta Acustica united with Acustica 87 (2001) 367ndash377

[54] A W Rix J G Beerends D-S Kim P Kroon O GhitzaObjective assessment of speech and audio quality - Tech-nology and applications IEEE Transaction on AudioSpeech and Language Processing 14 (2006) 1890ndash1901

[55] ITU-T Del Contr D123 Proposed procedure for the eval-uation of objective metrics L M Ericsson (Author IrinaCotanis) ITU-T SG 12 Meeting Geneva SwitzerlandJune 5ndash13 2006

[56] ITU-T TD12rev1 Statistical evaluation procedure forPOLQA v10 SwissQual AG (Author Jens Berger) ITU-T SG 12 Meeting Geneva Switzerland March 10ndash192009

631

ACTA ACUSTICA UNITED WITH ACUSTICA Pocta et al Impact of playout delay adjustmentsVol 101 (2015)

Future work will focus on repeating this study in wideband and super-wideband telecommunication scenarios. Other questions that arise from this research and which are worthy of further investigation include:

• Why, in the subjective test, did listeners return scores significantly lower than the predicted PESQ/POLQA scores, even for reference samples? We suggest that wideband experience and/or low impairment range plays a role.

• In our study, the extent of adjustments introduced in the samples is typical of VoIP applications experiencing moderate to severe network jitter. For VoIP applications that introduce more extreme adjustments (e.g. the impact of the TCP fallback process utilised by some VoIP applications to bypass firewalls), what, if any, impact will this have on subjective listening quality scores?

• What precise relationship can be established between the location of adjustments and the impact on subjective/objective listening quality scores? We noted that subjective scores for variant B were slightly lower, though the difference was not statistically significant. Variant B adjustments were designed to be towards the end of the speech segment. We raised the point that these results suggested a recency effect, albeit with much smaller samples than used in similar tests published in the literature [51, 52, 53]. Interestingly, this relationship was more clearly evident in the PESQ objective results but not so in POLQA, where a consistent trend was absent.

Table V shows the results of the ANOVA test carried out on the subjective data (dependent variable: MOS-LQSn), described in more detail in Section 4.1.

Acknowledgement

Andrew Hines thanks Google, Inc. for support.

References

[1] R. Ramjee, J. Kurose, D. Towsley, H. Schulzrinne: Adaptive playout mechanisms for packetized audio applications in wide-area networks. Proceedings of IEEE Infocom 1994, Los Alamitos, USA, June 1994, 680–688.

[2] S. Moon, J. Kurose, D. Towsley: Packet audio playout delay adjustment: performance bounds and algorithms. ACM/Springer Multimedia Systems, vol. 6, January 1998, 17–28.

[3] Y. Liang, N. Farber, B. Girod: Adaptive playout scheduling using time-scale modification in packet voice communications. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2001, Salt Lake City, Utah, USA, May 2001.

[4] F. Liu, J. Kim, C. Jay Kuo: Quality enhancement of packet audio with time-scale modification. Proceedings of SPIE ITCOM 2002, Multimedia Systems and Applications, Boston, USA, July 2002.

[5] C. Hoene, H. Karl, A. Wolisz: A perceptual quality model intended for adaptive VoIP applications. International Journal of Communication Systems 19 (2005) 299–316.

[6] C. Hoene, S. Wietholter, A. Wolisz: Predicting the perceptual service quality using a trace of VoIP packets. Proceedings of Fifth International Workshop on Quality of future Internet Services (QofIS'04), LNCS 3266, Barcelona, Spain, September 2004, 21–30.

[7] L. Sun, E. Ifeachor: Prediction of perceived conversational speech quality and effects of playout buffer algorithms. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003, vol. 1, 1–6.

[8] L. Sun, E. Ifeachor: New models for perceived voice quality prediction and their optimization for VoIP networks. Proceedings of IEEE International Conference on Communications, Paris, France, June 2004, vol. 3, 1478–1483.

[9] H. Melvin, L. Murphy: An evaluation of the potential of synchronized time to improve VoIP quality. Proceedings of International IEEE Conference on Communications (ICC 2003), Anchorage, USA, May 2003.

[10] W. Montgomery: Techniques for packet voice synchronization. IEEE Journal on Selected Areas in Communication, vol. SAC-1, no. 6, December 1983.

[11] S. Voran: Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results. Proceedings of the 2nd International Conference on Measurement of Speech and Audio Quality in Networks 2003, Prague, Czech Republic, May 2003.

[12] M. Lee, J. W. McGowan, M. C. Recchione: Enabling wireless VoIP. Bell Labs Technical Journal 11 (2007) 201–215.

[13] Q. Gong, P. Kabal: Improved quality for conversational VoIP using path diversity. Proceedings of Interspeech 2011, Florence, Italy, 2011, 2549–2552.

[14] O. Stapleton, M. Melvin, P. Pocta: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2011, Prague, Czech Republic, June 2011.

[15] ITU-T Rec. P.800: Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, 1996.

[16] Handbook of practical procedures for subjective testing. International Telecommunication Union, Geneva, Switzerland, 2011.

[17] S. Moeller, A. Raake: Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication 38 (2002) 47–75.

[18] J. G. Beerends, J. A. Stemerdink: A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (1994) 115–123.

[19] S. Voran: Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 371–382.

[20] S. Voran: Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique. IEEE Trans. on Speech and Audio Processing 7 (1999) 383–390.

[21] A. W. Rix, M. P. Hollier: The perceptual analysis measurement system for robust end-to-end speech quality assessment. Proceedings of IEEE ICASSP 2000, Istanbul, Turkey, 2000, vol. 3, 1515–1518.

[22] A. W. Rix, M. P. Hollier, A. P. Hekstra, J. G. Beerends: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part I: Time-delay compensation. J. Audio Eng. Soc. 50 (2002) 755–764.


[23] J. G. Beerends, A. P. Hekstra, A. W. Rix, M. P. Hollier: Perceptual evaluation of speech quality (PESQ) - The new ITU standard for objective measurement of perceived speech quality, Part II: Psychoacoustic model. J. Audio Eng. Soc. 50 (2002) 765–778.

[24] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part I: Temporal alignment. J. Audio Eng. Soc. 61 (2013) 366–384.

[25] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullman, J. Pomy, M. Keyhl: Perceptual objective listening quality assessment (POLQA), The third generation ITU-T standard for end-to-end speech quality measurement, Part II: Perceptual model. J. Audio Eng. Soc. 61 (2013) 385–402.

[26] ITU-T Rec. P.861: Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. International Telecommunication Union, Geneva, Switzerland, 1998.

[27] ITU-T Rec. P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union, Geneva, Switzerland, 2001.

[28] ITU-T Rec. P.863: Perceptual objective listening quality assessment. International Telecommunication Union, Geneva, Switzerland, 2011.

[29] D.-S. Kim: ANIQUE: An auditory model for single-ended speech quality estimation. IEEE Transaction on Speech and Audio Processing 13 (2005) 821–831.

[30] L. Malfait, J. Berger, M. Kastner: P.563 - The ITU-T standard for single-ended speech quality assessment. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1924–1934.

[31] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union, Geneva, Switzerland, 2004.

[32] ITU-T Rec. G.107: The E-model, a computational model for use in transmission planning. International Telecommunication Union, Geneva, Switzerland, 2009.

[33] ITU-T Rec. G.107.1: Wideband E-model. International Telecommunication Union, Geneva, Switzerland, 2011.

[34] W. Jiang, H. Schulzrinne: Analysis of on-off patterns in VoIP and their effect on voice traffic aggregation. Proceedings of International Conference on Computer Communications and Networks (ICCCN 2000), Las Vegas, USA, October 2000.

[35] Z. Qiao, R. Venkatasubramanian, L. Sun, E. Ifeachor: A new buffer algorithm for speech quality improvement in VoIP systems. Wireless Personal Communications 45 (2008) 189–207.

[36] L. Ding, R. Goubran: Speech quality prediction in VoIP using the extended E-model. Proceedings of IEEE Globecom, San Francisco, USA, 2003, vol. 7, 3974–3978.

[37] H. Melvin, L. Murphy: Exploring the extent and impact of playout adjustments within VoIP applications on the MOS. Proceedings of the International Conference on Measurement of Speech and Audio Quality in Networks 2005, Prague, Czech Republic, Jun. 2005.

[38] H. Melvin: The use of synchronised time in Voice over IP (VoIP) applications. PhD Thesis, University College Dublin, October 2004.

[39] W. Jiang, H. Schulzrinne: QoS measurement of internet real-time multimedia services. Technical report CUCS-05-99m, Columbia Univ., NY, Dec. 1999.

[40] W. Jiang, H. Schulzrinne: Modeling of packet loss and delay and their effect on real-time multimedia service quality. Proceedings of NOSSDAV 2000, Chapel Hill, USA, Jun. 2000.

[41] W. Kellerer, E. Steinbach, P. Eisert, B. Girod: A real-time internet streaming media testbed. Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, Aug. 2002.

[42] R. Koodli, R. Ravikanth: One-way loss pattern sample metrics. IETF RFC 3357, Aug. 2002.

[43] C. Boutremans, J. Boudec: Adaptive joint playout buffer and FEC adjustment for internet telephony. Proceedings of IEEE Infocom 2003, San Francisco, USA, Mar. 2003.

[44] Z. Li, J. Chakareski, X. Niu, Y. Zhang, W. Gu: Modeling and analysis of distortion caused by Markov-model burst packet losses in video transmission. IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 917–931.

[45] H. Melvin, P. O Flaithearta, J. Shannon, L. B. Yuste: Time synchronisation at application level: Potential benefits, challenges and solutions. International Telecom Synchronisation Forum (ITSF) 2009, Rome, Italy, Nov. 2009.

[46] P. O Flaithearta, H. Melvin: E-Model based prioritization of multiple VoIP sessions over 802.11e. Proceedings of Digital Technologies 2010 conference, Žilina, Slovakia, 2010.

[47] O. Stapleton: Quantifying the effectiveness of PESQ (Perceptual Evaluation of Speech Quality) in coping with frequent time shifting. Master Thesis, NUI Galway/Regis University, Colorado, USA, 2010.

[48] ITU-T P. Supplement 23: ITU-T coded-speech database. International Telecommunication Union, Geneva, Switzerland, 1998.

[49] ITU-T Rec. P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO. International Telecommunication Union, Geneva, Switzerland, 2003.

[50] A. W. Rix: Comparison between subjective listening quality and P.862 PESQ. White paper, Psytechnics Ltd., September 2003.

[51] I D C D S of the relationship between instantaneous and overall subjective speech quality for time-varying quality speech sequences: Influence of a recency effect. Source: France Telecom R&D, France (N. Chateau). International Telecommunication Union, Geneva, Switzerland, 2000.

[52] J. H. Rosenbluth: Testing the quality of connections having time varying impairments. Committee contribution T1A1.7/98-031.

[53] L. Gros, N. Chateau: Instantaneous and overall judgments for time-varying speech quality: Assessments and relationships. Acta Acustica united with Acustica 87 (2001) 367–377.

[54] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, O. Ghitza: Objective assessment of speech and audio quality - Technology and applications. IEEE Transaction on Audio, Speech and Language Processing 14 (2006) 1890–1901.

[55] ITU-T Del. Contr. D.123: Proposed procedure for the evaluation of objective metrics. L. M. Ericsson (Author: Irina Cotanis), ITU-T SG 12 Meeting, Geneva, Switzerland, June 5–13, 2006.

[56] ITU-T TD12rev1: Statistical evaluation procedure for POLQA v1.0. SwissQual AG (Author: Jens Berger), ITU-T SG 12 Meeting, Geneva, Switzerland, March 10–19, 2009.
