
Speech Communication 49 (2007) 292–304

Limited error based event localizing temporal decomposition and its application to variable-rate speech coding

Phu Chien Nguyen *, Masato Akagi, Binh Phu Nguyen

Graduate School of Information Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan

Received 16 November 2005; received in revised form 4 January 2007; accepted 4 February 2007

Abstract

This paper proposes a novel algorithm for temporal decomposition (TD) of speech, called 'limited error based event localizing temporal decomposition' (LEBEL-TD), and its application to variable-rate speech coding. In previous work with TD, TD analysis was usually performed on each speech segment of about 200–300 ms or more, making it impractical for online applications. In the present work, the event localization is determined based on a limited error criterion and a local optimization strategy, which results in an average algorithmic delay of 65 ms. Simulation results show that an average log spectral distortion of about 1.5 dB is achievable at an event rate of 20 events/s. Also, LEBEL-TD uses neither the computationally costly singular value decomposition routine nor the event refinement process, thus significantly reducing the computational cost of TD. Further, a method for variable-rate speech coding at an average rate of around 1.8 kbps based on STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum), which is a high-quality speech analysis–synthesis framework, using LEBEL-TD is also realized. Subjective test results indicate that the performance of the proposed speech coding method is comparable to that of the 4.8 kbps FS-1016 CELP coder.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Temporal decomposition; Event vector; Event function; STRAIGHT; Speech coding; Line spectral frequency

1. Introduction

Most existing low rate speech coders analyze speech in frames according to a model of speech production. One such model is the linear predictive coding (LPC) model. However, speech production can be considered as a sequence of overlapping articulatory gestures, each producing an acoustic event that should approximate an articulatory target (Fallside and Woods, 1985). Due to co-articulation and reduction in fluent speech, a target may not be reached before articulation towards the next phonetic target begins. The non-uniform distribution of these speech events is not exploited in frame-based systems.

0167-6393/$ - see front matter © 2007 Elsevier B.V. All rights reserved.

doi:10.1016/j.specom.2007.02.007

* Corresponding author. Present address: Institute of Scientific and Industrial Research, Osaka University, Japan. Tel.: +81 6 6879 8542; fax: +81 6 6879 8544.

E-mail addresses: [email protected] (P.C. Nguyen), [email protected] (M. Akagi), [email protected] (B.P. Nguyen).

The so-called temporal decomposition (TD) method (Atal, 1983) for analyzing speech signals achieves the objective of decomposing speech into targets and their temporal evolutionary patterns without recourse to any explicit phonetic knowledge. This model of speech takes into account the above articulatory considerations and results in a description of speech in terms of a sequence of overlapping event functions and corresponding event vectors, as given in Eq. (1):

\hat{y}(n) = \sum_{k=1}^{K} a_k \phi_k(n), \quad 1 \le n \le N \qquad (1)

where a_k, the kth event vector, is the vector of speech parameters corresponding to the kth target. The temporal evolution of this target is described by the kth event function, \phi_k(n). \hat{y}(n) is the approximation of y(n), the nth spectral parameter vector, produced by the TD model. N and K are the number of frames and the number of events in the block of spectral parameters under consideration, respectively.
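Numerically, the model of Eq. (1) is just a weighted sum of target vectors. The following minimal NumPy sketch (array shapes and the toy values are our own illustration, not taken from the paper) reconstructs a parameter trajectory from event vectors and event functions:

```python
import numpy as np

def td_reconstruct(event_vectors, event_functions):
    """Approximate y(n) as the sum over k of a_k * phi_k(n), as in Eq. (1).

    event_vectors   : (K, P) array, one P-dimensional target a_k per event
    event_functions : (K, N) array, phi_k(n) sampled at N frames
    returns         : (N, P) array of reconstructed vectors y_hat(n)
    """
    # y_hat(n) = sum_k a_k * phi_k(n) is a plain matrix product.
    return event_functions.T @ event_vectors

# Toy example: 2 events, 3-dimensional parameter vectors, 5 frames.
a = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
phi = np.array([[1.0, 0.75, 0.5, 0.25, 0.0],   # phi_1 decays ...
                [0.0, 0.25, 0.5, 0.75, 1.0]])  # ... while phi_2 grows
y_hat = td_reconstruct(a, phi)
print(y_hat.shape)  # (5, 3)
```

Note that the two event functions here overlap and sum to one at every frame, anticipating the restricted second order model discussed in Section 2.1.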

Despite the fact that TD has the potential to become a versatile tool in speech analysis, its high computational complexity and long algorithmic delay make it impractical for online applications. In the original TD method by Atal (1983), TD analysis is performed on each speech segment of about 200–300 ms, thus resulting in an algorithmic delay of more than 200 ms. In addition, Atal's method is computationally very costly, which has been mainly attributed to the use of the singular value decomposition (SVD) routine and the iterative refinement process (Van Dijk-Kappers and Marcus, 1989). These drawbacks prevent Atal's method from being used in online applications.

Most modified algorithms for TD have been proposed to overcome the high computational cost incurred by the original TD method. The algorithm for TD proposed in (Nandasena et al., 2001), S2BEL-TD, reduces the computational cost of TD by avoiding the use of SVD, but the long algorithmic delay has remained more or less the same. S2BEL-TD uses a spectral stability criterion to determine the initial event locations. Meanwhile, the event localization in the optimized TD (OTD) method (Athaudage et al., 1999) is performed using an optimized approach (dynamic programming). Although the OTD method can achieve very good results in terms of reconstruction accuracy, its long algorithmic delay (more than 450 ms) makes it suitable for speech storage related applications only. Also, both the OTD and S2BEL-TD methods use the line spectral frequency (LSF) parameters (Itakura, 1975) as input, which might cause the corresponding LPC synthesis filter to be unstable. This is because there is no guarantee that the selected LSF parameters are valid after the spectral transformation performed by these two TD methods. The restricted TD (RTD) (Kim and Oh, 1999) and the modified RTD (MRTD) (Nguyen and Akagi, 2002a) methods, on the other hand, consider the ordering property of LSFs to make LSF parameters amenable to TD. These methods require an average algorithmic delay of about 95 ms, while achieving relatively good results.

In this paper, we propose a novel algorithm for temporal decomposition of speech called 'limited error based event localizing temporal decomposition' (LEBEL-TD). This method employs the restricted second order model and a novel approach to event localization. Here, the event localization is initially performed based on a limited error criterion, and then further refined by a local optimization strategy. The event vectors are then set as the original spectral parameter vectors at the event locations; thus, the method can be applied to decomposing LSF parameters without considering the ordering property of LSFs. This algorithm for TD requires only a 65 ms average algorithmic delay,1 while achieving results comparable to

1 The frame period is 10 ms long, as considered in S2BEL-TD, MRTD, and OTD.

the S2BEL-TD, RTD and MRTD methods. Moreover, LEBEL-TD uses neither the computationally costly SVD routine nor the iterative refinement process, thus resulting in a very low computational cost for TD analysis.

We have also investigated the usefulness of LEBEL-TD in speech coding. In this paper, a method for variable-rate speech coding based on STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) (Kawahara et al., 1999), which is a high-quality speech analysis–synthesis framework, using LEBEL-TD is presented. For encoding the spectral information of speech, LEBEL-TD based vector quantization (VQ) is utilized, whilst other speech parameters are quantized using scalar quantization (SQ). Subjective results indicate that the performance of the proposed speech coding method at an average rate of around 1.8 kbps is comparable to that of the 4.8 kbps US Federal Standard FS-1016 CELP coder (Campbell et al., 1991).

2. LEBEL-TD of speech spectral parameters

2.1. Restricted second order TD model

Assuming that the co-articulation in speech production described by the TD model in terms of overlapping event functions is limited to adjacent events, the second order TD model (Niranjan and Fallside, 1989; Shiraki and Honda, 1991; Athaudage et al., 1999), where only two adjacent event functions can overlap as depicted in Fig. 1, is given by Eq. (2):

\hat{y}(n) = a_k \phi_k(n) + a_{k+1} \phi_{k+1}(n), \quad n_k \le n < n_{k+1} \qquad (2)

where n_k and n_{k+1} are the locations of event k and event (k + 1), respectively.

The so-called restricted second order TD model was utilized in (Dix and Bloothooft, 1994; Kim and Oh, 1999; Nguyen and Akagi, 2002a,b) and in this work, with an additional restriction on the event functions of the second order TD model: all event functions at any time sum up to one. The argument for imposing this constraint on the event functions can be found in (Dix and Bloothooft, 1994). Eq. (2) can be rewritten as follows:

Fig. 1. Example of two adjacent event functions in the second order TD model.



\hat{y}(n) = a_k \phi_k(n) + a_{k+1} (1 - \phi_k(n)), \quad n_k \le n < n_{k+1} \qquad (3)


2.2. Determination of event functions

Assume that the locations n_k and n_{k+1} of two consecutive events are known. Then, the right half of the kth event function and the left half of the (k + 1)th event function can be optimally evaluated by using a_k = y(n_k) and a_{k+1} = y(n_{k+1}). The reconstruction error, E(n), for the nth spectral parameter vector is

E(n) = \| y(n) - \hat{y}(n) \|^2 = \| (y(n) - a_{k+1}) - (a_k - a_{k+1}) \phi_k(n) \|^2 \qquad (4)

where n_k \le n < n_{k+1}. Therefore, \phi_k(n) should be determined so that E(n) is minimized.


Fig. 3. Determination of the event functions in the transition interval [n_k, n_{k+1}]. The point of the line segment between a_k and a_{k+1} (a), or between \hat{y}(n-1) and a_{k+1} (b), with minimum distance from y(n) is taken as the best approximation.

2.2.1. Geometric interpretation of TD

TD yields an approximation of a sequence of spectral parameters by a linear combination of event vectors. Since TD's underlying distance metric is Euclidean, a natural requirement is to have this approximation be invariant with respect to a translation or rotation of the spectral parameters. Dix and Bloothooft (1994) considered the geometric interpretation of TD results and found that TD is rotation and scale invariant, but it is not translation invariant.

In order to overcome this shortcoming and describe TD as a breakpoint analysis procedure in a multidimensional vector space, where breakpoints are connected by straight line segments, Dix and Bloothooft (1994) enforced two constraints on the event functions: (i) at any moment of time only two event functions, which are adjacent in time, are non-zero; and (ii) all event functions at any time sum up to one. In other words, the restricted second order TD model was utilized in (Dix and Bloothooft, 1994). These constraints are needed to approximate the path in parameter space by means of straight line segments between breakpoints (see Fig. 2).

Geometrically speaking, the two event vectors a_k and a_{k+1} define a plane in P-dimensional vector space. The determination of event functions \phi_k(n) and \phi_{k+1}(n) in the interval [n_k, n_{k+1}] is now depicted in Fig. 3a as the projection of vector y(n) onto this plane. Clearly the following holds: \phi_k(n_k) = 1, \phi_k(n_{k+1}) = 0, and 0 \le \phi_k(n) \le 1 for n_k \le n \le n_{k+1}.

Fig. 2. The path in parameter space described by the sequence of spectral parameters y(n) is approximated by means of straight line segments between breakpoints.

While n ranges from n_k to n_{k+1}, the movement of vector y(n) is described by the transition of \hat{y}(n) along the straight line segment connecting the two breakpoints a_k and a_{k+1}. As time moves forward, the transition of \hat{y}(n) should be monotonic.

2.2.2. New determination of event functions

The TD model is based on the hypothesis of articulatory movements towards and away from targets. An appealing result of the above properties of event functions is that one can interpret the values \phi_k(n) as a kind of activation values of the corresponding event. During the transition from one event towards the next, the activation value of the left event decreases from one to zero, whilst the right event increases its activation value from zero to one. As mentioned earlier, to model the temporal structure of speech more effectively, no backwards transitions are allowed. Therefore, each event function should have a growth cycle, during which the event function grows from zero to one, and a decay cycle, during which it decays from one to zero. In other words, each event function should have only one peak, which is called the well-shapedness property. On the contrary, an ill-shaped event function can be viewed as an event function which has several growth and decay cycles, i.e. more than one peak.

Fig. 4 shows examples of well-shaped and ill-shaped event functions. It can be seen that well-shaped event functions are desirable from a speech coding point of view because the well-shapedness property helps reduce the quantization error of event functions when they are vector quantized.

However, the determination of event functions in (Dix and Bloothooft, 1994) does not guarantee the well-shapedness property, since their changes during the transition from one event towards the next may not be monotonic, which results in ill-shaped event functions. In particular, an event function may take values of one interlaced with other values, causing the next event function to have more than one lobe, which is not acceptable in the conventional TD method. Ill-shaped event functions are also undesirable from a speech coding point of view. They increase the quantization error when vector quantized, because the uncharacteristic valleys and secondary peaks are not normally captured by the codebook functions. This is because an event function is quantized by its length and shape in the interval between its own location and that of the next event function. In that interval, a well-shaped event function is always a decreasing function, while an ill-shaped event function is always non-monotonic.

Taking the above considerations into account, we have modified the determination of event functions to use the point of the line segment between \hat{y}(n-1) and a_{k+1} (see Fig. 3b), instead of between a_k and a_{k+1} as considered in (Dix and Bloothooft, 1994), with minimum distance from y(n).

Fig. 4. Examples of a well-shaped event function (a) and an ill-shaped event function (b).

In mathematical form, the above determination of event functions can be written as

\phi_k(n) =
\begin{cases}
1 - \phi_{k-1}(n), & \text{if } n_{k-1} < n < n_k \\
1, & \text{if } n = n_k \\
\min(\phi_k(n-1), \max(0, \tilde{\phi}_k(n))), & \text{if } n_k < n < n_{k+1} \\
0, & \text{otherwise}
\end{cases} \qquad (5)

where

\tilde{\phi}_k(n) = \frac{\langle (y(n) - a_{k+1}), (a_k - a_{k+1}) \rangle}{\| a_k - a_{k+1} \|^2} \qquad (6)

Here, \langle\cdot,\cdot\rangle and \|\cdot\| denote the inner product of two vectors and the norm of a vector, respectively.

This modification ensures that the value of event function \phi_k at n is never greater than its value at n - 1 in the interval [n_k, n_{k+1}] (see the third line of Eq. (5)), and thereby the well-shapedness property is guaranteed. It should be noted that in (Dix and Bloothooft, 1994), \phi_k(n) is determined as \min(1, \max(0, \tilde{\phi}_k(n))) if n_k < n < n_{k+1}.
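As a concrete illustration, Eqs. (5) and (6) reduce to a clamped projection onto a line segment. The sketch below is our own rendering of the inner case of Eq. (5) (function and variable names are assumptions, not the authors' code):

```python
import numpy as np

def event_function_value(y_n, a_k, a_k1, phi_prev):
    """One value of phi_k(n) inside (n_k, n_{k+1}), per Eqs. (5) and (6).

    y_n       : spectral parameter vector y(n)
    a_k, a_k1 : event vectors a_k and a_{k+1}
    phi_prev  : phi_k(n-1), used to enforce monotonic decay (well-shapedness)
    """
    d = a_k - a_k1
    # Eq. (6): project (y(n) - a_{k+1}) onto the segment direction a_k - a_{k+1}.
    phi_tilde = np.dot(y_n - a_k1, d) / np.dot(d, d)
    # Eq. (5), third case: clamp to [0, phi_k(n-1)] so phi_k can never grow again.
    return min(phi_prev, max(0.0, phi_tilde))

# Toy example: y(n) lies 30% of the way from a_{k+1} back towards a_k.
a_k  = np.array([1.0, 0.0])
a_k1 = np.array([0.0, 1.0])
y_n  = 0.3 * a_k + 0.7 * a_k1
print(event_function_value(y_n, a_k, a_k1, phi_prev=0.5))  # ~0.3
```

With `phi_prev=0.1` the same call would return 0.1: the clamp against the previous value is exactly what rules out the secondary peaks of an ill-shaped event function.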

2.3. LEBEL-TD algorithm

The section of spectral parameters, y(n), where n_k \le n < n_{k+1}, is termed a segment. The total accumulated error, E_{seg}(n_k, n_{k+1}), for the segment is

E_{seg}(n_k, n_{k+1}) = \sum_{n=n_k}^{n_{k+1}-1} E(n) \qquad (7)

where E(n) can be calculated for every n_k \le n < n_{k+1} using Eq. (4) once n_k and n_{k+1} are known.

Fig. 5. Buffering technique for LEBEL-TD.

The buffering technique for LEBEL-TD is depicted in Fig. 5, and the whole algorithm is described as follows:

Step 0. Set k ← 1, n_1 ← 1, a_1 ← y(1); set n_2 as the last location from n_1 on such that the reconstruction error for every frame in the interval (n_1, n_2) is less than a predetermined number ε.

Step 1. Similarly, set n_3 as the last location from n_2 on such that the reconstruction error for every frame in the interval (n_2, n_3) is less than ε.

Step 2. Locally optimize the location of n_2 in the interval (n_1, n_3):

n_2^* = \arg\min_{n_1 < n_2 < n_3} \{ E_{seg}(n_1, n_2) + E_{seg}(n_2, n_3) \}

where only those n_2 that make E(n) < ε for every n_1 < n < n_3 are taken into account. If n_3 is the last frame, set k ← k + 1, a_k ← y(n_2^*), a_{k+1} ← y(n_3), and exit.

Step 3. Set k ← k + 1, a_k ← y(n_2^*); then set n_1 ← n_2^*, n_2 ← n_3, and go back to Step 1.
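The steps above can be sketched in code. This is a simplified illustration under our own assumptions, not the authors' implementation: the per-frame error uses a linear-in-time interpolation between the candidate event targets as a stand-in for the projection-based event functions of Eqs. (4)–(6), and all names are ours:

```python
import numpy as np

def frame_errors(y, n1, n2):
    """Reconstruction errors E(n) for frames n1..n2-1 when events sit at n1
    and n2.  Simplification: phi decays linearly in time, not by Eq. (6)."""
    errs = []
    for n in range(n1, n2):
        phi = (n2 - n) / (n2 - n1)
        y_hat = phi * y[n1] + (1.0 - phi) * y[n2]
        errs.append(float(np.sum((y[n] - y_hat) ** 2)))
    return errs

def lebel_td_events(y, eps):
    """Event locations per the LEBEL-TD steps: greedy extension under the
    limited-error criterion, then local optimization of the middle event."""
    N = len(y)

    def extend(n_from):
        # Last location n such that every frame in (n_from, n) stays under eps.
        n = n_from + 1
        while n + 1 < N and max(frame_errors(y, n_from, n + 1)) < eps:
            n += 1
        return n

    events = [0]
    n1 = 0
    n2 = extend(n1)                      # Step 0
    while True:
        n3 = extend(n2)                  # Step 1
        best, best_cost = n2, float("inf")
        for cand in range(n1 + 1, n3):   # Step 2: locally optimize n2
            e12 = frame_errors(y, n1, cand)
            e23 = frame_errors(y, cand, n3)
            if max(e12 + e23) < eps:     # limited-error constraint
                cost = sum(e12) + sum(e23)
                if cost < best_cost:
                    best, best_cost = cand, cost
        events.append(best)
        if n3 >= N - 1:                  # last frame reached: finish
            events.append(n3)
            return events
        n1, n2 = best, n3                # Step 3

# Toy trajectory: two linear pieces with a knee at frame 5.
y = np.array([[t] for t in [0, 1, 2, 3, 4, 5, 5, 5, 5, 5, 5]], dtype=float)
print(lebel_td_events(y, eps=0.01))  # [0, 5, 10]
```

On this toy trajectory the algorithm places events at the end points and at the knee, which is the behaviour one would hope for: events gravitate towards spectrally stationary targets.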

The predetermined number ε is called the reconstruction error threshold, and it is the only parameter that affects the number and locations of the events. The reconstruction error threshold controls the event rate, i.e. the number of events per second, and can be appropriately selected to achieve the optimal performance of TD analysis for different applications. This is why the above TD algorithm is named 'limited error based event localizing temporal decomposition' (LEBEL-TD).

ε = 0.045 was empirically chosen as a suitable value for the reconstruction error threshold to produce an event rate of about 20 events/s. It should be noted that the spectral parameter here is LSF, with the frame period set to 10 ms. This event rate results in an average buffering delay of about 50 ms, i.e. 5 frames, along with a 10 ms, i.e. one frame, look-ahead. On the other hand, the LPC analysis window is 30 ms long, which implies a 15 ms look-ahead. The calculation of algorithmic delay for LEBEL-TD is depicted in Fig. 6. Note that this calculation applies to analyzing the first segment. From the second segment on, the look-ahead frame in the last segment can be employed. Therefore, the average algorithmic delay for LEBEL-TD is about 65 ms, which is the lowest algorithmic delay for TD so far. Moreover, LEBEL-TD has significantly reduced the computational cost of TD because it uses neither the computationally costly SVD routine nor the iterative refinement process. These properties make LEBEL-TD suitable for online applications.

Fig. 6. Algorithmic delay for LEBEL-TD.

In the LEBEL-TD method, the event vectors are set as the spectral parameter vectors corresponding to the event locations. Obviously, the event vectors are valid spectral parameter vectors, and the stability of the corresponding LPC synthesis filter can thus be ensured after the spectral transformation performed by LEBEL-TD. Consequently, LEBEL-TD can be applied to analyzing any current type of parametric representation of speech. Meanwhile, most conventional TD methods use an iterative refinement of event vectors, which might cause the reconstructed spectral parameter vectors to be invalid, for example when TD is applied to decomposing LSF parameters, thus resulting in an unstable LPC synthesis filter.

Fig. 7 shows the plot of event functions obtained from LEBEL-TD analysis of LSF parameters for an example of a female/Japanese sentence utterance 'shimekiri ha geNshu desu ka.' As can be seen from the figure, all event functions are well-shaped. In Fig. 8, the plots of original and reconstructed LSF parameters after LEBEL-TD analysis are shown for the same utterance as utilized in Fig. 7.

3. Performance evaluation

The ATR Japanese speech database was used for the speech data. Line spectral frequency (LSF) parameters, introduced by Itakura (1975), have been selected as the spectral parameter for LEBEL-TD. This is because it is well known that LSF parameters have the best interpolation (Paliwal, 1995) and quantization (Paliwal and Atal, 1993) properties among the LPC related spectral parameters.

Fig. 7. Plot of the event functions obtained from the LEBEL-TD method for the female/Japanese sentence utterance 'shimekiri ha geNshu desu ka.' The speech waveform is also shown together with the phonetic transcription for reference. The numerals indicate the frame numbers.

Fig. 8. Plots of the original and reconstructed LSF parameters obtained from the LEBEL-TD method for the female/Japanese speech utterance 'shimekiri ha geNshu desu ka.' The solid line indicates the original LSF parameter vector trajectory and the dashed line the reconstructed trajectory. The average log spectral distortion was found to be 1.6276 dB.

Log spectral distortion (LSD) is a commonly used measure in evaluating the performance of LPC quantization (Paliwal and Atal, 1993) and interpolation (Paliwal, 1995). The LSD measure is also used for evaluating the interpolation performance of TD algorithms (Shiraki and Honda, 1991; Athaudage et al., 1999; Nandasena et al., 2001). This criterion is a function of the distortion introduced in the spectral density of speech in each particular frame. Log spectral distortion, D_n, for the nth frame is defined (in dB) as follows:

D_n = \sqrt{ \frac{1}{F_s} \int_0^{F_s} \left[ 10 \log_{10}(P_n(f)) - 10 \log_{10}(\hat{P}_n(f)) \right]^2 df }

where F_s is the sampling frequency, and P_n(f) and \hat{P}_n(f) are the LPC power spectra corresponding to the nth frame of the original spectral parameters, y(n), and the reconstructed spectral parameters, \hat{y}(n), respectively. The results are provided in terms of log spectral distortion histograms, average log spectral distortion, and percentage outliers having log spectral distortion greater than 2 dB. The outliers are divided into two types: Type 1 consists of outliers in the range 2–4 dB, and Type 2 consists of outliers having spectral distortion greater than 4 dB.
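The integral defining D_n can be approximated on a discrete frequency grid. Below is a sketch of such a computation (our own helper names; it works directly from LPC predictor coefficients rather than LSFs, and replaces the integral with a mean over the grid):

```python
import numpy as np

def lpc_power_spectrum(a, gain=1.0, n_freqs=256):
    """LPC power spectrum P(f) = gain^2 / |A(e^{jw})|^2 on a uniform grid
    over [0, Fs/2); the integrand is symmetric, so half the band suffices.

    a : LPC predictor coefficients [a_1, ..., a_p] of A(z) = 1 + sum a_i z^-i
    """
    order = len(a)
    w = np.pi * np.arange(n_freqs) / n_freqs
    # Evaluate A(e^{jw}) = 1 + sum_i a_i e^{-jwi} at every grid frequency.
    A = 1.0 + np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
    return gain ** 2 / np.abs(A) ** 2

def log_spectral_distortion(a_orig, a_rec, gain=1.0):
    """D_n: RMS difference of the two log power spectra, in dB."""
    p1 = 10 * np.log10(lpc_power_spectrum(a_orig, gain))
    p2 = 10 * np.log10(lpc_power_spectrum(a_rec, gain))
    return float(np.sqrt(np.mean((p1 - p2) ** 2)))

# Identical filters give zero distortion; a perturbed one does not.
a = np.array([-0.9, 0.4])
print(log_spectral_distortion(a, a))               # 0.0
print(log_spectral_distortion(a, a + 0.01) > 0.0)  # True
```

In an evaluation loop, the per-frame values D_n would be averaged over all frames and binned into the 2–4 dB and >4 dB outlier classes described above.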

A set of 250 sentence utterances from the ATR Japanese speech database was selected as the speech data. This speech dataset consists of about 20 min of speech spoken by 10 speakers (five male and five female), re-sampled at an 8 kHz sampling frequency. Tenth order LSF parameters were calculated using an LPC analysis window of 30 ms at 10 ms frame intervals, and then LEBEL-TD analyzed. Additionally, log spectral distortion was also evaluated over the same speech dataset for three other TD methods: S2BEL-TD (Nandasena et al., 2001), RTD (Kim and Oh, 1999), and MRTD (Nguyen and Akagi, 2002a), with LSF as the spectral parameter. The event rate was set to around 20 events/s for all four methods.

Table 1
Event rate, average LSD, and percentage number of outlier frames obtained from the LEBEL-TD, S2BEL-TD, RTD and MRTD methods

Method      Event rate   Avg. LSD (dB)   2–4 dB (%)   >4 dB (%)
LEBEL-TD    19.996       1.5125          32.52        0.07
S2BEL-TD    19.455       1.4643          18.48        0.94
RTD         20.163       1.5629          22.97        0.96
MRTD        20.163       1.5681          23.15        0.98

The spectral parameter is LSF. Speech dataset consists of 250 sentence utterances spoken by 10 speakers (five male and five female) of the ATR Japanese speech database.


Table 1 gives a comparison of the log spectral distortion results for the LEBEL-TD, S2BEL-TD, RTD, and MRTD algorithms. The distribution of the log spectral distortion in the form of histograms is shown in Fig. 9. Results indicate slightly better performance for S2BEL-TD over the others, followed by LEBEL-TD and then RTD. However, it has been shown in (Nguyen and Akagi, 2002a) that the S2BEL-TD and RTD methods, in their current forms, cannot always be applied to analyzing LSF parameters due to stability problems in the LPC synthesis filter. Also, LEBEL-TD requires a lower computational cost for TD analysis than MRTD, which is mainly attributed to the fact that LEBEL-TD does not employ the iterative refinement process. In addition, LEBEL-TD also needs a shorter algorithmic delay than MRTD. For these reasons, LEBEL-TD is hereafter used for analyzing the LSF parameters.

Fig. 9. Distribution of the log spectral distortion (LSD) between the original and reconstructed LSF parameters in the form of histograms. Top left: LSD histogram for LEBEL-TD (avg. 1.5125 dB). Top right: S2BEL-TD (avg. 1.4643 dB). Bottom left: RTD (avg. 1.5629 dB). Bottom right: MRTD (avg. 1.5681 dB). Speech dataset consists of 250 sentence utterances spoken by 10 speakers (five male and five female) of the ATR Japanese speech database.

We have also evaluated the performance of LEBEL-TD on the above speech dataset for various values of ε. Table 2 gives a summary of the LSD and the event rate obtained from LEBEL-TD analysis for several different values of ε. As can be seen from the table, the event rate decreases and the average LSD increases as ε increases. Fig. 10 illustrates the average log spectral distortion versus the event rate.


Table 2
Event rate, average LSD, and percentage number of outlier frames obtained from the LEBEL-TD method for various ε

ε        Event rate   Avg. LSD (dB)   2–4 dB (%)   >4 dB (%)
0.072    15.059       1.9220          48.52        1.60
0.065    16.051       1.8255          45.54        0.89
0.058    17.156       1.7270          41.91        0.46
0.053    18.107       1.6491          38.69        0.23
0.049    18.999       1.5802          35.64        0.13
0.045    19.996       1.5125          32.52        0.07
0.041    21.124       1.4400          29.02        0.04
0.038    22.117       1.3833          26.21        0.02
0.0355   23.050       1.3331          23.75        0.014
0.033    24.106       1.2795          20.96        0.01
0.031    25.028       1.2336          18.69        0.00

The spectral parameter is LSF. Speech dataset consists of 250 sentence utterances spoken by 10 speakers (five male and five female) of the ATR Japanese speech database.

Fig. 10. Average log spectral distortion (dB) versus the event rate (events/s).

Fig. 11. Average log spectral distortion (dB) versus the average algorithmic delay (ms).


It is clear that the event rate controls the delay. The event rate also controls the coding quality and the bit-rate. Fig. 11 shows an example of average log spectral distortion versus average algorithmic delay obtained from LEBEL-TD analysis of the above speech dataset. It is demonstrated that the longer the algorithmic delay, the larger the log spectral distortion.

4. Variable-rate speech coding based on STRAIGHT using LEBEL-TD

As shown earlier, in the temporal decomposition (TD) framework, the speech is no longer represented by a vector updated frame by frame, but instead by the continuous trajectory of a vector. The trajectory is decomposed into a set of phoneme-like events, i.e. a series of temporally overlapping event functions and a corresponding series of event vectors. Since the event rate varies in time, TD can be considered as a technique to be used for variable-rate speech coding.

The linear predictive coding (LPC) model of speech has been widely adopted in many speech coding systems (Campbell and Tremain, 1986; Campbell et al., 1991; Paliwal and Atal, 1993; Paliwal, 1995). However, since the line spectral frequency (LSF) parameters derived from LPC analysis are extracted independently on a frame-by-frame basis, the corresponding LSF parameter vector trajectory is not very smooth. In contrast, STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum), invented by Kawahara et al. (1999), can extract a very smooth spectrogram by employing a time–frequency interpolation procedure. As a consequence, the LSF parameters extracted from the spectrogram are correlated among frames, and the corresponding LSF parameter vector trajectory is thus smooth, which is desirable for TD algorithms. In addition, STRAIGHT is known as a high-quality speech analysis–synthesis framework with many promising applications in speech synthesis and modification (Kawahara et al., 1999). This versatile speech manipulation toolkit can roughly decompose input speech signals into spectral envelopes, i.e. the spectrogram, F0 (fundamental frequency) information, and noise ratios. These parameters, together with the maximum amplitude value, are required for resynthesizing high-quality speech. To make STRAIGHT applicable to low-bit-rate speech coding, the bit rate required to represent the spectral envelope must be minimized. Since the spectral envelopes can be further analyzed into LSF parameters and gain information, the LEBEL-TD algorithm can be combined with STRAIGHT to construct high-quality speech coders working at low bit rates.

In this section, we introduce a new method for variable-rate speech coding based on STRAIGHT using LEBEL-TD. The proposed speech encoder and decoder block

[Figure 12: block diagrams omitted. Encoder: input speech → STRAIGHT analysis → spectral envelope → LSF analysis → TD analysis; the event vectors and event functions are vector quantized (VQ), the gain is scalar quantized (SQ), and F0 targets and noise ratio targets are derived from the F0 and noise ratio tracks. Decoder: the quantized event vectors, event functions, gain, F0, and noise ratios drive TD synthesis and LSF synthesis to rebuild the spectral envelope, which STRAIGHT synthesis converts to synthesized speech.]

Fig. 12. Proposed speech encoder and decoder block diagrams (top: encoder, bottom: decoder).


diagrams are shown in Fig. 12, and a detailed description of the proposed speech coding method is given in the subsections that follow. Experimental results show that this speech coding method can produce good-quality speech at an average rate below 2 kbps.

4.1. Derivation of LSF parameters

The amplitude spectrum X[m], where 0 ≤ m ≤ M/2 and M is the number of samples in the frequency domain, obtained from STRAIGHT analysis is transformed to the power spectrum using Eq. (8):

  S[m] = |X[m]|²,   0 ≤ m ≤ M/2                                    (8)

The ith autocorrelation coefficient, R[i], is then calculated using the inverse Fourier transform of the power spectrum as follows:

  R[i] = (1/M) Σ_{m=0}^{M−1} S[m] exp(j2πmi/M),   0 ≤ i ≤ M − 1    (9)

where S[m] = S[M − m]. Assuming that the speech samples can be estimated by a Pth-order all-pole model, where 0 < P < M, the reconstruction error is calculated as given in Eq. (10):

  P_L = R[0] − Σ_{l=1}^{P} a_l^P R[l]                               (10)

where {a_l^P}, l = 1, 2, ..., P, are the corresponding linear predictive coding (LPC) coefficients. P_L is hereafter referred to as the gain. The a_l^P are evaluated by minimizing P_L with respect to a_l^P, l = 1, 2, ..., P, and are then transformed to the LSF parameters.
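The pipeline above (power spectrum → autocorrelation via inverse FFT → LPC by minimizing the prediction error) can be sketched as follows. We use the standard Levinson–Durbin recursion to perform the minimization; the residual E it returns plays the role of the gain P_L. The final LPC-to-LSF conversion is omitted, and the function names are ours, not the paper's.

```python
import numpy as np

def autocorr_from_spectrum(S_half):
    """Autocorrelation R[i] from a one-sided power spectrum S[0..M/2]."""
    # Mirror using S[m] = S[M - m], then inverse DFT (Eq. (9)).
    S_full = np.concatenate([S_half, S_half[-2:0:-1]])
    return np.fft.ifft(S_full).real

def levinson_durbin(R, P):
    """LPC coefficients a[0..P] (a[0] = 1, predictor sign convention
    x[n] ~ -sum_j a[j] x[n-j]) and residual error E (the gain P_L)."""
    a = np.zeros(P + 1)
    a[0] = 1.0
    E = R[0]
    for i in range(1, P + 1):
        acc = R[i] + sum(a[j] * R[i - j] for j in range(1, i))
        k = -acc / E                     # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        E *= 1.0 - k * k
    return a, E

# Flat (white) spectrum: nothing is predictable, so the gain equals R[0].
S = np.full(129, 2.0)                    # one-sided spectrum, M = 256
R = autocorr_from_spectrum(S)
a, gain = levinson_durbin(R, P=10)
```

For a flat spectrum the autocorrelation is an impulse, all reflection coefficients are zero, and the residual stays at R[0], which matches Eq. (10).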

4.2. LEBEL-TD based vector quantization of LSF parameters

4.2.1. Vector quantization of event vectors
Since the event vectors obtained from the LEBEL-TD method are valid LSF parameter vectors, they can be quantized by the usual quantization methods for LSF parameters. Here, the split vector quantization introduced in Paliwal and Atal (1993) was adopted. In this work, the order of the LSFs was empirically selected as 32 to increase the quality of the reconstructed speech. Every event vector was divided into four subvectors of dimensions 7, 8, 8, and 9, according to the distribution of the LSFs, and each subvector was quantized independently. We assigned 8 bits to each subvector, which resulted in 32 bits allocated to one event vector.
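A minimal sketch of this split vector quantization (our own code; the random codebooks below are placeholders for trained ones, e.g. from LBG training, and the names are hypothetical):

```python
import numpy as np

SPLITS = [7, 8, 8, 9]   # subvector dimensions for a 32nd-order LSF vector

def split_vq(lsf, codebooks):
    """Quantize a 32-dim event vector with four independent codebooks.

    lsf       : (32,) LSF event vector
    codebooks : list of four (256, d) arrays (8 bits -> 256 entries each)
    returns   : (list of 4 indices, reconstructed 32-dim vector)
    """
    indices, parts = [], []
    start = 0
    for dim, cb in zip(SPLITS, codebooks):
        sub = lsf[start:start + dim]
        # Nearest codeword by squared Euclidean distance.
        i = int(np.argmin(((cb - sub) ** 2).sum(axis=1)))
        indices.append(i)
        parts.append(cb[i])
        start += dim
    return indices, np.concatenate(parts)

# Toy demo with tiny random "codebooks" (256 entries each).
rng = np.random.default_rng(0)
cbs = [rng.random((256, d)) for d in SPLITS]
vec = rng.random(32)
idx, rec = split_vq(vec, cbs)
```

Four 8-bit indices give the 32 bits per event vector quoted in the text; splitting keeps each codebook search over only 256 entries instead of 2^32.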

4.2.2. Vector quantization of event functions
In the case of event functions, normalization of the event functions is necessary to fix the dimension of the event function vector space. Notice that quantizing φ_k(n) only in the interval [n_k, n_{k+1}] is enough to reconstruct the whole event function φ_k(n). Moreover, φ_k(n) always starts from one and decreases to zero in that interval, and the type of decrease (after normalizing the length of φ_k(n)) can be vector quantized. Therefore, an event function φ_k(n) can be quantized by its length L(k) = n_{k+1} − n_k and its shape in [n_k + 1, n_{k+1} − 1]. In this work, 15 equidistant samples were taken from each event function for length normalization and then vector quantized using a 7-bit codebook.

Considering that all intervals between two consecutive event locations are less than 256 frames long (note that the frame period used in STRAIGHT analysis is 1 ms), we used 8 bits to quantize the length of each event function. In short, each φ_k(n) was quantized by its length and its type of decrease.
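The length normalization step can be sketched as follows (our own illustration; the function name is hypothetical). The segment of φ_k between consecutive event locations is resampled to 15 equidistant points so that all shapes live in one fixed-dimension VQ space, while the integer length L(k) is coded separately:

```python
import numpy as np

def normalize_event_function(phi_segment, n_samples=15):
    """Resample phi_k on [n_k, n_{k+1}] to n_samples equidistant points.

    phi_segment : values of phi_k(n) over the interval
                  (starts at 1, decreases to 0)
    """
    L = len(phi_segment)
    x_old = np.linspace(0.0, 1.0, L)
    x_new = np.linspace(0.0, 1.0, n_samples)
    return np.interp(x_new, x_old, phi_segment)

# A 40-frame linear decay from 1 to 0, normalized to 15 samples.
seg = np.linspace(1.0, 0.0, 40)
shape = normalize_event_function(seg)   # VQ'd with a 7-bit codebook
length = len(seg)                       # L(k), coded with 8 bits (< 256)
```

The 15-sample shape goes to the 7-bit shape codebook and the length to the 8-bit scalar code, matching the bit counts in the text.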

4.3. Coding speech excitation parameters

4.3.1. Coding noise ratio parameters

The speech production mechanism is assumed to be a synchronously controlled process with respect to the movement of the different articulators, i.e. jaws, tongue, larynx, glottis, etc., so the temporal evolutionary patterns of different properties of speech, e.g. spectrum, F0, and noise ratio, can be described by a common set of event functions (Nandasena et al., 2001). Therefore, the same event functions obtained from LEBEL-TD analysis of the LSF parameters are also used to describe the temporal evolution of the noise ratio parameters. Let i(n) be a noise ratio parameter. We have 0 ≤ i(n) ≤ 1, where i(n) = 1 for white noise and i(n) = 0 for a pure pulse. Then i(n) is approximated by î(n), the reconstructed noise ratio parameter for the nth frame, as follows, in terms of the noise ratio targets, i_k, and the event functions, φ_k(n). Since the event functions are quantized and transmitted anyway, this description also helps reduce the bit rate required for encoding the noise ratio information:

  î(n) = Σ_{k=1}^{K} i_k φ_k(n),   1 ≤ n ≤ N

The noise ratio targets are determined by minimizing the sum squared error, E_i, between the original and the interpolated noise ratio parameters with respect to the i_k:

  E_i = Σ_{n=1}^{N} (i(n) − î(n))² = Σ_{n=1}^{N} (i(n) − Σ_{k=1}^{K} i_k φ_k(n))²

where i(n) is the original noise ratio parameter for the nth frame. Finally, the noise ratio targets are quantized using scalar quantization. In this work, we used 6 bits to quantize each noise ratio target.
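Minimizing E_i with respect to the targets is an ordinary linear least-squares problem; a sketch (our naming, not the paper's) using NumPy:

```python
import numpy as np

def fit_targets(param, phi):
    """Least-squares targets t minimizing sum_n (param(n) - sum_k t_k phi_k(n))^2.

    param : (N,) original parameter track (noise ratio here; F0 in Sec. 4.3.2)
    phi   : (K, N) event functions from LEBEL-TD analysis of the LSFs
    """
    t, *_ = np.linalg.lstsq(phi.T, param, rcond=None)
    return t

# Toy check: a track generated exactly from two event functions.
phi = np.array([[1.0, 0.6, 0.2, 0.0],
                [0.0, 0.4, 0.8, 1.0]])
true_targets = np.array([0.3, 0.9])
track = phi.T @ true_targets
targets = fit_targets(track, phi)       # recovers the generating targets
recon = phi.T @ targets                 # interpolated parameter track
```

Because the track was synthesized from the same event functions, the fit is exact here; for a real noise ratio track the residual is the RMS error reported below.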

Fig. 13 shows the plots of the original and reconstructed noise ratio parameters and the plot of the frame-wise noise ratio error, e_i(n) = i(n) − î(n), for a male/Japanese sentence utterance. The root mean squared (RMS) noise ratio error, √E_i, where E_i = (1/N) Σ_{n=1}^{N} e_i²(n), was found to be about 0.1166.

[Figure 13: four panels over 0–5000 ms omitted (speech waveform; noise ratio; reconstructed noise ratio; noise ratio error).]

Fig. 13. Original noise ratio parameters, i(n), reconstructed noise ratio parameters, î(n), and frame-wise noise ratio error, e_i(n) = i(n) − î(n), for the sentence utterance 'kaigi ni happyou surunodeha nakute choukou surudake dato, hiyou ha ikura kakari masu ka,' of the ATR Japanese speech database. The RMS noise ratio error is 0.1166. The speech waveform is also shown together for reference.

4.3.2. Coding F0 parameters

For encoding the F0 information, the lengths of the voiced and unvoiced segments were first quantized by scalar quantization, with an average bit rate of 36 bps. Next, linear interpolation was used within the unvoiced segments to form a continuous F0 contour. Similarly to the method presented in Section 4.3.1, the continuous F0 contour was then described by the event functions obtained from LEBEL-TD analysis of the LSF parameters and the so-called F0 targets. As mentioned earlier, this description also helps reduce the bit rate required for encoding the F0 information, since we can reuse the encoded event functions. The F0 targets were then quantized by a 6-bit logarithmic quantizer. In the decoder, the F0 values were reconstructed from the quantized event functions and F0 targets using TD synthesis. F0 values in unvoiced intervals were set to zero.
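The interpolation step that makes the F0 contour continuous across unvoiced frames can be sketched as follows (our own hypothetical code, not the authors'):

```python
import numpy as np

def continuous_f0(f0, voiced):
    """Linearly interpolate F0 across unvoiced frames so the contour is
    continuous and can be described by the shared event functions.

    f0     : (N,) frame-wise F0 in Hz (ignored where voiced is False)
    voiced : (N,) boolean voicing decisions
    """
    n = np.arange(len(f0))
    return np.interp(n, n[voiced], f0[voiced])

f0 = np.array([100.0, 110.0, 0.0, 0.0, 150.0])
voiced = np.array([True, True, False, False, True])
contour = continuous_f0(f0, voiced)    # unvoiced gap bridged linearly
```

After TD synthesis at the decoder, the frames known to be unvoiced are simply reset to zero, which is why only the voiced/unvoiced segment lengths need to be transmitted alongside the targets.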

Fig. 14 shows the plots of the original and reconstructed F0 parameters and the plot of the frame-wise F0 error, e_p(n) = p(n) − p̂(n), where p(n) and p̂(n) are the original and reconstructed F0 parameters, respectively, for the same

[Figure 14: four panels over 0–5000 ms omitted (speech waveform; F0 up to 400 Hz; reconstructed F0; F0 error within ±40 Hz).]

Fig. 14. Original F0 parameters, p(n), reconstructed F0 parameters, p̂(n), and frame-wise F0 error, e_p(n) = p(n) − p̂(n), for the sentence utterance 'kaigi ni happyou surunodeha nakute choukou surudake dato, hiyou ha ikura kakari masu ka,' of the ATR Japanese speech database. The F0 error is shown only for the voiced segments of the utterance. The RMS F0 error is 3.6183 Hz. The speech waveform is also shown together for reference.

Table 3
Bit allocation for the proposed speech coder

Parameter                                  Proposed speech coder
Event vector                               32 bits (8 + 8 + 8 + 8)
Event function                             7 bits
Event location                             8 bits
F0 target                                  6 bits
Noise ratio target                         6 bits
Subtotal A (sum × event rate)              1475 bps
Gain                                       300 bps
Lengths of voiced and unvoiced segments    36 bps
Maximum amplitude of input speech          5 bps
Subtotal B                                 341 bps
Total (A + B)                              1816 bps


sentence utterance as in Fig. 13. The RMS F0 error, √E_p, where E_p = (1/N) Σ_{n=1}^{N} e_p²(n), was found to be about 3.6183 Hz.

4.3.3. Coding gain parameters

The gain contour was re-sampled at 20 ms intervals. Logarithmic quantization was performed using 6 bits for each sampled value. The quantized samples and spline interpolation were used in the decoder to form the reconstructed gain contour. It should be noted that we did not describe the gain information of speech using the event functions obtained from LEBEL-TD analysis of the LSF parameters, as in Sections 4.3.1 and 4.3.2. This is because the gain parameters change quickly and frequently, which would result in low reconstruction accuracy if the method described in the two previous subsections were applied.
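The re-sample/interpolate round trip can be sketched as follows. This is our own illustration: it omits the 6-bit logarithmic quantization, and it uses SciPy's `CubicSpline` as a stand-in for whatever spline the authors used in the decoder.

```python
import numpy as np
from scipy.interpolate import CubicSpline

FRAME_MS = 1      # STRAIGHT frame period
STEP_MS = 20      # gain re-sampling interval

def encode_gain(gain):
    """Keep every 20th 1-ms gain sample (plus its frame index)."""
    t = np.arange(0, len(gain), STEP_MS // FRAME_MS)
    return t, gain[t]

def decode_gain(t, samples, n_frames):
    """Spline-interpolate the kept samples back onto the 1-ms grid."""
    spline = CubicSpline(t, samples)
    return spline(np.arange(n_frames))

gain = np.sin(np.arange(200) / 40.0) + 2.0   # a smooth toy gain contour
t, kept = encode_gain(gain)                  # 10 samples for 200 frames
rec = decode_gain(t, kept, len(gain))
```

At 6 bits per 20 ms sample this yields the 300 bps gain stream listed in Table 3.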

4.4. Bit allocation

Table 3 shows the bit allocation for the proposed speech coding method. An example of the bit-rate contour for a male/Japanese utterance is shown in Fig. 15. Note that the average number of events per second, i.e. the event rate, was set to 25 events/s, resulting in an average algorithmic delay of 55 ms. The larger the event rate, the better the speech quality, but at the cost of an increased bit rate for encoding speech. We can control the peak bit rate by, for example, setting a minimum length between the locations of two consecutive events.
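The arithmetic behind Table 3 can be checked directly: 59 bits per event at 25 events/s gives subtotal A, and the fixed-rate side streams give subtotal B (a sketch with our own variable names):

```python
# Per-event bits and fixed-rate streams from Table 3.
EVENT_BITS = {"event vector": 32, "event function": 7,
              "event location": 8, "F0 target": 6, "noise ratio target": 6}
FIXED_BPS = {"gain": 300, "voiced/unvoiced lengths": 36, "max amplitude": 5}

event_rate = 25                                      # events/s (Sec. 4.4)
subtotal_a = sum(EVENT_BITS.values()) * event_rate   # 59 bits/event * 25
subtotal_b = sum(FIXED_BPS.values())
total_bps = subtotal_a + subtotal_b                  # average rate of coder
```

Because only subtotal A scales with the event rate, the instantaneous bit rate varies with the local event density, which is exactly what Fig. 15 plots.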

4.5. Subjective test

In order to evaluate the performance of the proposed speech coding method, the quality of the reconstructed speech was compared to that of other low-bit-rate speech coders, namely the 4.8 kbps FS-1016 CELP (Campbell et al., 1991) and 2.4 kbps FS-1015 LPC-10E (Campbell and Tremain, 1986) coders. By this we show that the proposed speech coding method can achieve good-quality speech at less than 2 kbps.

A subjective test was carried out using Scheffé's method of paired comparison (Scheffe, 1952). Six graduate students known to have normal hearing ability were recruited for the listening experiment. Each listener was

Fig. 15. Bit-rate contour for a male/Japanese sentence utterance 'konkai no kokusaikaigi ha tuuyaku denwa ni kansuru naiyou wo subete fukundeimasu.' [Plot omitted: bit rate varying between roughly 1000 and 4000 bps over about 6 s.]

Fig. 16. Results of the listening experiment. [Plot omitted: preference scores on a scale from −1 ('distortion is large') to 1 ('distortion is small') for STRAIGHT-LSF, STRAIGHT-LSF & LEBEL-TD, CELP 4.8 kbps, the proposed 1.8 kbps speech coder, and LPC-10E 2.4 kbps.]


asked to grade from −2 to 2 the degradation perceived in speech quality when comparing the second stimulus to the first in each pair. The Japanese speech dataset used in Section 3 and an English speech dataset collected from the TIMIT speech corpus (Garofolo et al., 1993), which consists of 192 sentence utterances spoken by 24 speakers (18 male and 6 female), were selected as the training data for the proposed speech coder. They were re-sampled at an 8 kHz sampling frequency and STRAIGHT analyzed using a frame shift of 1 ms. LSF transformation was then performed, and the resulting 32nd-order LSF parameters were TD analyzed using the LEBEL-TD method. It should be noted that the higher the order of the LSFs, the better the quality of the encoded speech. However, it was empirically observed that no considerable improvement in speech quality is achieved when the order of the LSFs exceeds 32.

Eight phonemically balanced sentences outside the training set, uttered by four English (two male and two female) and four Japanese (two male and two female) speakers, were used as the test data; that is, the test data comprise four male and four female utterances. Stimuli were synthesized using the following coders: the 4.8 kbps FS-1016 CELP, the 2.4 kbps FS-1015 LPC-10E, and the proposed 1.8 kbps speech coder. Also, 16 other stimuli were STRAIGHT synthesized using the unquantized speech parameters obtained from STRAIGHT analysis and LSF transformation (STRAIGHT-LSF), as well as from STRAIGHT analysis, LSF transformation, and LEBEL-TD analysis (STRAIGHT-LSF and LEBEL-TD) of the above eight utterances.

Results of the listening experiment are shown in Fig. 16. It can be seen from this figure that the quality of the reconstructed speech obtained from the proposed speech coder is comparable to that of the 4.8 kbps FS-1016 CELP coder and is much better than that of the 2.4 kbps FS-1015 LPC-10E coder. This justifies the usefulness of the proposed LEBEL-TD algorithm when applied to coding speech at low bit rates.

5. Conclusion

In this paper we have presented a new algorithm for temporal decomposition of speech. The proposed LEBEL-TD method uses the limited error criterion for initially estimating the event locations, and then further refines them using the local optimization strategy. This method achieves results comparable to those of other TD methods such as S2BEL-TD and MRTD while requiring less algorithmic delay and less computational cost. Moreover, the buffering technique used for continuous speech analysis has been well developed, and the stability of the corresponding LPC synthesis filter after the spectral transformation performed by LEBEL-TD has been completely ensured. It is shown that the temporal pattern of the speech excitation parameters can also be well described using the LEBEL-TD technique.

We have also described a method for variable-rate speech coding based on STRAIGHT using LEBEL-TD. For encoding the spectral information of speech, LEBEL-TD based vector quantization was used; the other speech parameters were quantized by scalar quantization. As a result, a variable-rate speech coder operating at rates around 1.8 kbps was produced. According to the listening experiment, the quality of the reconstructed speech is comparable to that of the 4.8 kbps FS-1016 CELP coder. This shows that the proposed speech coding method can produce good-quality speech at less than 2 kbps.

Acknowledgements

This work was partially supported by CREST (Core Research for Evolutional Science and Technology) of JST (Japan Science and Technology Corporation). The authors thank the anonymous reviewers for their suggestions and comments, which improved the presentation of the paper.

References

Atal, B.S., 1983. Efficient coding of LPC parameters by temporal decomposition. In: Proc. ICASSP, pp. 81–84.


Athaudage, C.N., Bradley, A.B., Lech, M., 1999. Optimization of a temporal decomposition model of speech. In: Proc. Internat. Symposium on Signal Processing and its Applications, Australia, pp. 471–474.

Campbell Jr., J.P., Tremain, T.E., 1986. Voiced/unvoiced classification of speech with applications to the US government LPC-10E algorithm. In: Proc. ICASSP, pp. 473–476.

Campbell Jr., J.P., Tremain, T.E., Welch, V.C., 1991. The Federal Standard 1016 4800 bps CELP voice coder. Digital Signal Process. 1 (3), 145–155.

Dix, P.J., Bloothooft, G., 1994. A breakpoint analysis procedure based on temporal decomposition. IEEE Trans. Speech Audio Process. 2 (1), 9–17.

Fallside, F., Woods, W.A. (Eds.), 1985. Computer Speech Processing. Prentice-Hall, New York.

Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L., 1993. DARPA TIMIT acoustic phonetic continuous speech corpus CDROM, NIST.

Itakura, F., 1975. Line spectrum representation of linear predictive coefficients of speech signals. J. Acoust. Soc. Amer. 57, S35.

Kawahara, H., Masuda-Katsuse, I., de Cheveigne, A., 1999. Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Comm. 27 (3–4), 187–207.

Kim, S.J., Oh, Y.H., 1999. Efficient quantization method for LSF parameters based on restricted temporal decomposition. Electron. Lett. 35 (12), 962–964.

Nandasena, A.C.R., Nguyen, P.C., Akagi, M., 2001. Spectral stability based event localizing temporal decomposition. Comput. Speech Language 15 (4), 381–401.

Nguyen, P.C., Akagi, M., 2002a. Improvement of the restricted temporal decomposition method for line spectral frequency parameters. In: Proc. ICASSP, pp. 265–268.

Nguyen, P.C., Akagi, M., 2002b. Limited error based event localizing temporal decomposition. In: Proc. EUSIPCO, pp. 239–242.

Niranjan, M., Fallside, F., 1989. Temporal decomposition: a framework for enhanced speech recognition. In: Proc. ICASSP, pp. 655–658.

Paliwal, K.K., 1995. Interpolation properties of linear prediction parametric representations. In: Proc. Eurospeech, pp. 1029–1032.

Paliwal, K.K., Atal, B.S., 1993. Efficient vector quantization of LPC parameters at 24 bits/frame. IEEE Trans. Speech Audio Process. 1 (1), 3–14.

Scheffe, H., 1952. An analysis of variance for paired comparisons. J. Amer. Statist. Assoc. 47, 381–400.

Shiraki, Y., Honda, M., 1991. Extraction of temporal pattern of spectral sequence based on minimum distortion criterion. In: Proc. Autumn Meeting of the Acoustical Society of Japan, pp. 233–234 (in Japanese).

Van Dijk-Kappers, A.M.L., Marcus, S.M., 1989. Temporal decomposition of speech. Speech Comm. 8, 125–135.