Inter-frame dependence arising from preceding and succeeding frames – Application to speech recognition

P. Hanna *, J. Ming 1, F.J. Smith 2

School of Electrical Engineering and Computer Science, The Queen's University of Belfast, Belfast, BT7 1NN, Northern Ireland, Ireland

Received 6 May 1998; received in revised form 15 February 1999; accepted 19 April 1999

Abstract

This paper extends the notion of capturing temporal information within an HMM framework by permitting the observed frame not only to be dependent upon preceding frames, but also upon succeeding frames. In particular the IFD–HMM (Ming and Smith, 1996) is extended to support any number of preceding and/or succeeding frame dependencies. The means through which such a dependency might be integrated into an HMM framework are explored, and details given of the resultant changes to the IFD–HMM. Experimental results are provided, contrasting the use of bi-directional frame dependencies to the use of preceding-only frame dependencies and exploring how such dependencies can be best employed. It was found that a dependency upon succeeding frames enabled dynamic spectral information not found in the preceding frames to be usefully employed, resulting in a significant increase in recognition accuracy. It was also found that the use of frame dependencies proved to be a more effective means of increasing recognition accuracy than the use of multiple mixtures. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Speech recognition; HMM; Interframe dependence

1. Introduction

It has long been recognised that the independent and identically distributed (IID) assumption inherent to HMMs limits the HMM in its ability to form an accurate model of a complex, non-stationary time series such as human speech. Stated otherwise, the IID assumption implies that within a state there is no correlation between successive speech segments and that the statistical characteristics of speech are stationary over time.

However, this assumption is inappropriate as the temporal characteristics of speech (represented as a sequence of frame feature vectors within an HMM framework) change slowly over time. Additionally, although for certain utterances (e.g. steady-state vowels) the stationarity assumption is largely justifiable, for most cases (e.g. glides, diphthongs, vowel-consonant/consonant-vowel transitions, etc.) this assumption is invalid.

Speech Communication 28 (1999) 301–312
www.elsevier.nl/locate/specom

* Corresponding author. Tel.: +44 1 232 244 733; e-mail: [email protected]
1 E-mail: [email protected]
2 E-mail: [email protected]

0167-6393/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0167-6393(99)00019-9

A number of broad approaches have so far been employed towards lessening the IID assumption by permitting the temporal correlation between frame feature vectors to be captured and utilised within an HMM framework. One such approach involves redefining the HMM's definition, so that it explicitly imposes a dependency between feature vectors. Examples of this approach include the bi-gram constrained HMM (Paliwal, 1993; Wellekens, 1987; Takahashi et al., 1993), the linear-predictive HMM (Kenny et al., 1990; Woodland, 1992) and, more recently, the interframe dependence (IFD)–HMM (Ming and Smith, 1996). Generally speaking, all of these approaches assume the observed frame is dependent upon one or more previous frames (through the use of a conditional observation density).

We extend the principle adopted by these approaches by modifying the IFD–HMM so that a dependence of the observed frame upon both succeeding and preceding frames is supported. Given the non-stationary nature of speech, it is reasonable to assume that, for a particular frame, the succeeding frames contain useful dynamic information that may not be encapsulated in the preceding frames. Hence, the inclusion of succeeding frame dependencies provides a mechanism through which additional temporal correlation may be utilised.

As an attempt to improve the levels of recognition accuracy returned by the IFD–HMM, one might introduce multiple mixtures in addition to the use of frame dependencies. Doing so will increase the model parameter size, with an associated increase in the parameter estimation/decoding times and in the minimum amount of training data needed to obtain robust parameter estimation. In this paper, we compare the relative advantages of the use of multiple mixtures and frame dependencies in terms of recognition accuracy versus model parameter size. However, firstly we develop the theory.

2. Mathematical foundations

Initially the IFD–HMM was constructed to permit the observed frame to be dependent upon a single preceding frame (Smith et al., 1995); this was later extended to include any number of preceding frame dependencies by using a weighted mixture of first-order conditional Gaussian densities (Ming and Smith, 1996). A preliminary study of a further extension of this model, accommodating both preceding and succeeding frame dependencies (otherwise termed bi-directional frame dependencies), has been presented elsewhere by the authors (Hanna et al., 1997).

2.1. The preceding frame IFD–HMM's definition

In the IFD–HMM we define the probability density function (pdf) of the observation sequence, given a state sequence and a set of frame dependencies, as follows:

p(x \mid s, \tau) = \prod_{t=1}^{T} b_{s_t}\left(x_t \mid x_{t-\tau_t(1)}, \ldots, x_{t-\tau_t(N)}\right),   (1)

where b_i(\cdot) is the conditional observation density for state i and s represents the state sequence \{s_1 \ldots s_T\}. Apart from the usual state-dependence, the above formulation assumes that the probability of each observed frame, x_t, is conditional upon N previous frames, x_{t-\tau_t(1)}, \ldots, x_{t-\tau_t(N)}, defined by a sequence of time-lags, \tau = \{\tau_t(1) \ldots \tau_t(N)\}_{t=1 \ldots T}, with \tau_t(n) > 0 and \tau_t(n-1) < \tau_t(n). A fixed time-lag model can be obtained by assuming that the time-lags remain unchanged over time, i.e. \tau = \{\tau(1) \ldots \tau(N)\}.

The observation density function b_s(x \mid x_1 \ldots x_N) is approximated as a weighted mixture of a set of first-order conditional densities (Ming and Smith, 1996), as introduced in Eq. (2):

b_s(x \mid x_1 \ldots x_N) = \sum_{n=1}^{N} w_{sn}\, f_{sn}(x \mid x_n),   (2)

where f_{sn}(x \mid x_n) represents the nth mixture component density in state s, and models the correlation between x and the nth conditional frame, x_n, and w_{sn} the corresponding weight, satisfying the constraints w_{sn} \geq 0 and \sum_{n=1}^{N} w_{sn} = 1. This model contains the bi-gram constrained HMM (Paliwal, 1993; Wellekens, 1987) as a special case by assuming N equals one. The density function f_{sn}(\cdot) is defined as a multivariate conditional Gaussian density, entailing that

f(x \mid x_n) \propto \exp\{-\tfrac{1}{2} h(x \mid x_n)\},   (3)

for which h(x \mid x_n) is given by

h(x \mid x_n) = (x - H_n x_n - \mu_n)^{*}\, U_n^{-1}\, (x - H_n x_n - \mu_n) + \log |U_n|,   (4)


where \mu_n is a K \times 1 vector and both H_n and U_n are K \times K matrices, for which K is the dimensionality of the frame feature vectors. Furthermore, an asterisk denotes vector transpose, and |U_n| represents the determinant of U_n. In comparison to the non-conditional Gaussian density, the parameter size of the above density is increased by a quantity corresponding to the size of H_n, i.e. K if H_n has a diagonal form, and K \times K if a full matrix form is used.
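As an illustrative aside, the conditional Gaussian of Eqs. (3)-(4) and the mixture of Eq. (2) can be sketched in a few lines of Python. This is a minimal sketch, not code from the paper; all function names and parameter values are hypothetical.

```python
import numpy as np

def cond_gauss_pdf(x, xn, H, mu, U):
    # f(x | x_n): conditional Gaussian with mean H @ xn + mu and
    # covariance U, following Eqs. (3)-(4).
    K = x.shape[0]
    d = x - H @ xn - mu
    expo = -0.5 * d @ np.linalg.solve(U, d)
    norm = (2 * np.pi) ** (-K / 2) * np.linalg.det(U) ** -0.5
    return norm * np.exp(expo)

def obs_density(x, cond_frames, weights, Hs, mus, Us):
    # b_s(x | x_1 ... x_N): weighted mixture of the N first-order
    # conditional densities, one per conditional frame (Eq. (2)).
    return sum(w * cond_gauss_pdf(x, xn, H, mu, U)
               for xn, w, H, mu, U in zip(cond_frames, weights, Hs, mus, Us))
```

Note that, per the text above, a diagonal H_n keeps the extra parameter cost at K per component; the sketch accepts a full matrix either way.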

Using the definitions shown above, the likelihood function of an observation sequence x = (x_1, \ldots, x_T), given a time-lag sequence \tau, can be expressed as follows:

p(x \mid \tau, \lambda) = \sum_{s} p_\lambda(x \mid s, \tau)\, p_\lambda(s \mid \lambda)
    = \sum_{s} \pi_{s_0} \prod_{t=1}^{T} a_{s_{t-1} s_t} \left[ \sum_{n=1}^{N} w_{s_t n}\, f_{s_t n}(x_t \mid x_{t-\tau_t(n)}) \right],   (5)

where \lambda denotes the parameter set, i.e. ([\pi_i], [a_{ij}], [f_{in}], [w_{in}]), where [a_{ij}] is the state transition matrix and [\pi_i] the initial state probability vector. The model shown in Eq. (5) is termed the preceding frame IFD–HMM.

2.2. Introducing the bi-directional IFD–HMM

The aim is to extend the previous formulation of the IFD–HMM so that a dependency upon both succeeding and preceding frames is supported. Consider the time series x_1, \ldots, x_n, whose probability of occurrence can be determined by computing either p^-(x_1)\, p^-(x_2 \mid x_1) \ldots p^-(x_n \mid x_{n-1}, \ldots, x_1) or, alternatively, p^+(x_1 \mid x_2, \ldots, x_n) \ldots p^+(x_{n-1} \mid x_n)\, p^+(x_n), where p^-(x_i \mid \ldots) and p^+(x_i \mid \ldots) denote the type of dependency that is employed. Should the time series be non-stationary, as is the case with speech, then the probability p^-(x_i \mid \ldots) is likely to differ from p^+(x_i \mid \ldots). Both a geometric and an arithmetic combination of p^-(x_i \mid \ldots) and p^+(x_i \mid \ldots) offer a means through which the information contained within the two distributions may be jointly employed. In particular, a product-based combination, which is equivalent to a squared geometric combination, was adopted as it permits efficient training and recognition algorithms to be readily developed, as will be shown.

The approach involves a product-based combination of the two observation densities; one density modelling dependency upon succeeding frames, and the other density modelling dependency upon preceding frames, i.e.

p^-(x \mid s, \tau^-)\; p^+(x \mid s, \tau^+) = \prod_{t=1}^{T} b^-_{s_t}\left(x_t \mid x_{t-\tau^-_t(1)} \ldots x_{t-\tau^-_t(N)}\right)\, b^+_{s_t}\left(x_t \mid x_{t+\tau^+_t(1)} \ldots x_{t+\tau^+_t(M)}\right),   (6)

where \tau^+ is a time-lag sequence defining M succeeding frame dependencies and \tau^- a time-lag sequence of N preceding frame dependencies, as previously defined. In Eq. (6), b^-_s(\cdot) and b^+_s(\cdot) assume the form defined in Eq. (2), and by definition \tau^{+,-}_t(k) > 0 and \tau^{+,-}_t(k-1) < \tau^{+,-}_t(k). The likelihood function of an observation sequence x, given time-lag sequences \tau^+ and \tau^-, is expressed as follows:

p(x \mid \tau^-, \tau^+, \lambda) = \sum_{s} \pi_{s_0} \prod_{t=1}^{T} a_{s_{t-1} s_t} \left[ \sum_{n=1}^{N} w^-_{s_t n}\, f^-_{s_t n}(x_t \mid x_{t-\tau^-_t(n)}) \right] \left[ \sum_{m=1}^{M} w^+_{s_t m}\, f^+_{s_t m}(x_t \mid x_{t+\tau^+_t(m)}) \right],   (7)

where \lambda denotes the model parameter set, consisting of \{\{\pi_i\}, \{a_{ij}\}, \{w^-_{ik}\}, \{w^+_{ik}\}, \{f^-_{ik}\}, \{f^+_{ik}\}\}. In effect, the model shown in Eq. (7) utilises a dependency upon both preceding and succeeding frames. We term this model the bi-directional IFD–HMM, with N preceding frame and M succeeding frame dependencies.
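The per-frame factor of Eq. (7) is simply the product of a preceding-frame mixture and a succeeding-frame mixture. The sketch below illustrates this with the component densities abstracted as callables; the function and argument names are hypothetical, not taken from the paper.

```python
def bidir_frame_density(x, t, lags_prev, lags_succ,
                        w_prev, w_succ, f_prev, f_succ):
    # Per-frame factor of Eq. (7): the preceding-frame mixture times
    # the succeeding-frame mixture.  f_prev[n](x_t, x_cond) plays the
    # role of f^-_{s,n}, and f_succ[m] that of f^+_{s,m}.
    b_minus = sum(w * f(x[t], x[t - lag])
                  for w, f, lag in zip(w_prev, f_prev, lags_prev))
    b_plus = sum(w * f(x[t], x[t + lag])
                 for w, f, lag in zip(w_succ, f_succ, lags_succ))
    return b_minus * b_plus
```

In practice this product would be accumulated in the log domain over all frames, mirroring the product over t in Eq. (7).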

2.3. Bi-directional IFD–HMM parameter estimation

Ming and Smith (1996) show that the preceding frame IFD–HMM can be readily estimated through the application of the traditional Baum–Welch re-estimation procedure. The estimation procedure is similar to that of the multiple mixture Gaussian HMM (Juang, 1985), the prime difference being that in the multiple-mixture Gaussian approach a mixture is formed over the instantaneous observation space, whereas for the IFD–HMM a mixture is formed over the observation history.

Likewise, a Baum–Welch estimation procedure can be applied to the bi-directional IFD–HMM. As will be shown, the estimation procedure for this model mirrors that of the estimation of the joint densities of normal and delta spectral parameters, with p^-(x \mid s, \tau^-) corresponding to the density based on the normal spectral components and p^+(x \mid s, \tau^+) corresponding to the density based on the delta spectral components. In order to simplify the following derivation, and without loss of generality, we assume \tau^-_t(n) = n and \tau^+_t(m) = m. The formulas shown below apply to any \tau^{-,+}_t(\cdot) by replacing the appropriate x_{t-n} with x_{t-\tau^-_t(n)} and x_{t+m} with x_{t+\tau^+_t(m)}. We start by expanding the form shown in Eq. (7):

p(x \mid \lambda, \tau^-, \tau^+) = \sum_{s} \sum_{n_1=1}^{N} \ldots \sum_{n_T=1}^{N} \sum_{m_1=1}^{M} \ldots \sum_{m_T=1}^{M} \pi_{s_0} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, w^-_{s_t n_t} f^-_{s_t n_t}(x_t \mid x_{t-n_t})\, w^+_{s_t m_t} f^+_{s_t m_t}(x_t \mid x_{t+m_t})
    = \sum_{s} \sum_{C} \sum_{K} p(x, s, C, K \mid \tau^+, \tau^-, \lambda),   (8)

where

p(x, s, C, K \mid \tau^+, \tau^-, \lambda) = \pi_{s_0} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, w^-_{s_t n_t} f^-_{s_t n_t}(x_t \mid x_{t-n_t})\, w^+_{s_t m_t} f^+_{s_t m_t}(x_t \mid x_{t+m_t})   (9)

and C and K represent the T-tuples (n_1 \ldots n_T) and (m_1 \ldots m_T), respectively, whose summations are over all possible (n_1 \ldots n_T) and (m_1 \ldots m_T). Following the usual practice, a maximum-likelihood estimation of \lambda, based on the form shown in Eq. (8), may be achieved through an iterative maximisation of Baum's auxiliary function (Baum, 1972). In this case, this function can be expressed as follows:

Q(\lambda, \bar{\lambda}) = \sum_{s} \sum_{C} \sum_{K} p(x, s, C, K \mid \tau^-, \tau^+, \lambda) \log p(x, s, C, K \mid \tau^-, \tau^+, \bar{\lambda}),   (10)

where \bar{\lambda} is the new estimate of the parameters obtained from a previous estimate \lambda. The form shown in Eq. (10) can be straightforwardly maximised following the standard procedure (Juang, 1985), resulting in the re-estimation formulas. For example, the new estimates for w^-_{in}, \mu^-_{in}, H^-_{in} and U^-_{in} can be derived, respectively, as

\bar{w}^-_{in} = \frac{\sum_{t=1}^{T} \zeta^-_{in}(t)}{\sum_{n=1}^{N} \sum_{t=1}^{T} \zeta^-_{in}(t)},   (11)

\bar{\mu}^-_{in} = \frac{\sum_{t=1}^{T} \zeta^-_{in}(t)\, (x_t - H_{in} x_{t-n})}{\sum_{t=1}^{T} \zeta^-_{in}(t)},   (12)

\bar{H}^-_{in} = \left[ \sum_{t=1}^{T} \zeta^-_{in}(t)\, (x_t - \mu^-_{in})\, x^{*}_{t-n} \right] \left[ \sum_{t=1}^{T} \zeta^-_{in}(t)\, x_{t-n} x^{*}_{t-n} \right]^{-1}   (13)

and

\bar{U}^-_{in} = \frac{\sum_{t=1}^{T} \zeta^-_{in}(t)\, (x_t - H_{in} x_{t-n} - \mu_{in})(x_t - H_{in} x_{t-n} - \mu_{in})^{*}}{\sum_{t=1}^{T} \zeta^-_{in}(t)},   (14)

where

\zeta^-_{in}(t) = \sum_{j} \alpha_{t-1}(j)\, a_{ji}\, w^-_{in} f^-_{in}(x_t \mid x_{t-n})\, b^+_i(x_t \mid x_{t+1} \ldots x_{t+M})\, \beta_t(i),   (15)

for which \alpha_t(i) and \beta_t(j) are the forward and backward probabilities, computed recursively as follows:

\alpha_t(j) = \sum_{i} \alpha_{t-1}(i)\, a_{ij}\, b^-_j(x_t \mid x_{t-1}, \ldots, x_{t-N})\, b^+_j(x_t \mid x_{t+1} \ldots x_{t+M}),   (16)

\beta_{t-1}(i) = \sum_{j} \beta_t(j)\, a_{ij}\, b^-_j(x_t \mid x_{t-1}, \ldots, x_{t-N})\, b^+_j(x_t \mid x_{t+1} \ldots x_{t+M}).   (17)

In the above equations, b^-_i(x_t \mid x_{t-1} \ldots x_{t-N}) and b^+_i(x_t \mid x_{t+1} \ldots x_{t+M}) assume the form defined by Eq. (2). Similar formulas can be derived for the parameters based on succeeding frame dependencies.
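The forward recursion of Eq. (16) differs from the standard HMM case only in that the observation term is the product b^-_j · b^+_j. A log-domain sketch follows; the combined log observation scores are assumed precomputed into an array, and all names are illustrative rather than from the paper.

```python
import numpy as np

def forward_log(log_obs, log_A, log_pi):
    # Eq. (16) in the log domain:
    #   alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b^-_j(...) * b^+_j(...),
    # where log_obs[t, j] holds log[b^-_j(x_t | ...) b^+_j(x_t | ...)].
    T, S = log_obs.shape
    log_alpha = np.empty((T, S))
    log_alpha[0] = log_pi + log_obs[0]
    for t in range(1, T):
        for j in range(S):
            log_alpha[t, j] = (np.logaddexp.reduce(log_alpha[t - 1] + log_A[:, j])
                               + log_obs[t, j])
    return log_alpha
```

The backward recursion of Eq. (17) is the mirror image, accumulating from t = T down to 1 with the same combined observation scores.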


2.4. Time lag optimisation

As noted, the time-lag sequences \tau^- = \{\tau^-_t(1) \ldots \tau^-_t(N)\}_{t=1 \ldots T} and \tau^+ = \{\tau^+_t(1) \ldots \tau^+_t(M)\}_{t=1 \ldots T} are constructed to permit the actual frame dependencies to be non-linear and time-varying, thereby providing a means to adapt to changing temporal characteristics (for example, varying pronunciation speeds). A modified Viterbi algorithm can be used to determine the best time-lag sequences, in terms of increased likelihood, for a given set of IFD–HMM parameters, \lambda.

The optimisation process may be more straightforwardly performed by the joint maximisation of p(x, s \mid \tau^-, \tau^+, \lambda) with respect to both s and (\tau^-, \tau^+). This joint optimisation can be achieved through the use of an induction formula, i.e.

\delta_t(j) = \max_i \left[\delta_{t-1}(i) + \log a_{ij}\right] + \max_{\tau^-_t} \log \sum_{n=1}^{N} w^-_{jn} f^-_{jn}(x_t \mid x_{t-\tau^-_t(n)}) + \max_{\tau^+_t} \log \sum_{m=1}^{M} w^+_{jm} f^+_{jm}(x_t \mid x_{t+\tau^+_t(m)}),   (18)

where \delta_t(j) is the log-likelihood of the best state and time-lag sequences ending in state j for the observations \{x_l\}_{l=1,t}, and \delta_0(i) = \log \pi_i. The optimal \tau^-_t and \tau^+_t for each t may be determined through the use of a dynamic programming (DP) algorithm, for which each of the mixture expressions is used as a separate objective function. More specifically, for the preceding frame-based mixture, we define \beta^-_l(n) as the maximised likelihood of the first n conditional frames (N \geq n \geq 1) ending in some frame x_{t-l} within a given observation history L. Hence, for l = 1, \ldots, L we have

\beta^-_l(n) = \max_{n-1 \leq k \leq l-1}\; \max_{k+1 \leq \tau \leq l} \left\{\beta^-_k(n-1)\, w^-_{jn} f^-_{jn}(x_t \mid x_{t-\tau})\right\}.   (19)

The optimal time-lags for a particular x_t are found from the \tau s which maximise \beta^-_l(n) for each n and l. The same principle applies to the succeeding frame-based mixture. During the training process, the time-lag optimisation is implemented through the use of the EM principle, iterating successive Expectation (E) and Maximisation (M) steps. In the E-step, the parameter set \lambda is estimated by using a given time-lag sequence for each training signal. In the M-step, an optimised time-lag sequence is obtained for each training signal based on the given model. These two steps are iterated to achieve convergent estimates for both the parameter set and the time-lag sequences. During recognition, the M-step process is again performed in order to determine the optimal time-lag sequence for each test observation. As will be shown, time-lag optimisation can additionally be applied to the IFD–HMM as a means of increasing the recognition accuracy, at the cost of increased computational complexity during both the training and recognition phases.
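For small N and history length L, the inner maximisations of Eq. (18) can also be carried out by exhaustive search over increasing lag tuples rather than by the DP of Eq. (19); since log is monotonic, maximising the summed mixture score maximises the log term. The brute-force stand-in below is purely illustrative (its names are hypothetical, and it is equivalent in result only for small problems where enumeration is feasible).

```python
import itertools

def best_time_lags(score, L, N):
    # Choose N increasing lags 0 < tau(1) < ... < tau(N) <= L so that
    # the summed mixture score  sum_n w_n f_n(x_t | x_{t-tau(n)})  is
    # maximised; score(n, tau) supplies the n-th weighted component.
    best_lags, best_val = None, float("-inf")
    for lags in itertools.combinations(range(1, L + 1), N):
        val = sum(score(n, tau) for n, tau in enumerate(lags, start=1))
        if val > best_val:
            best_lags, best_val = lags, val
    return best_lags, best_val
```

The DP of Eq. (19) avoids this combinatorial enumeration by reusing the maximised partial products \beta^-_k(n-1).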

3. Results

Throughout this section, the results collected for the bi-directional IFD–HMM are typically compared against either the preceding-only IFD–HMM outlined in Ming and Smith (1996) or standard HMM models. We explore the extent to which frame dependencies may be used to improve the recognition accuracy, and contrast this with the use of multiple static mixtures. Additionally, during recognition we can introduce an exponential weight to balance the combination of the preceding and succeeding time-lag dependent densities, thereby exploring the effects of varying the emphasis given to the preceding or succeeding frame dependencies. Specifically, during recognition we can calculate

p_w(x \mid s, \tau^-, \tau^+) = p^-(x \mid s, \tau^-)^{w}\; p^+(x \mid s, \tau^+)^{1-w},   (20)

where w (0 \leq w \leq 1) is the weighting term.
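In the log domain, the weighted combination of Eq. (20) is just a convex combination of the two log-likelihoods. A one-line sketch (function name is illustrative):

```python
import math

def weighted_combination(logp_prev, logp_succ, w):
    # Eq. (20) in the log domain:
    #   log p_w = w * log p^-  +  (1 - w) * log p^+,  with 0 <= w <= 1.
    assert 0.0 <= w <= 1.0
    return w * logp_prev + (1.0 - w) * logp_succ
```

At w = 1 the score reduces to the preceding-only model, and at w = 0 to the succeeding-only model.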

3.1. Experimental conditions

The same experimental conditions as employed by Woodland (1992), Harte et al. (1996) and Ming and Smith (1996) were adopted, i.e. all experiments are based on the Connex speaker-independent alphabetic database (provided by British Telecom Research Laboratories). The database contains three repetitions of each letter by a total of 104 speakers (of which 53 are male and the remaining 51 female). The database is roughly balanced with respect to both age and sex. Experiments are based on the highly confusable E-set (consisting of the b, c, d, e, g, p, t and v letters), for which 52 speakers were designated for training and 52 for testing, resulting in a total of approximately 1250 unique utterances being available for training and the remaining utterances being used for testing.

The speech, which was sampled at 20 kHz, was divided into 25.6 ms frames with a consecutive frame overlap of 15.6 ms. Each frame was passed through a mel-frequency filter-bank, from which 12 mel-frequency cepstral coefficients (MFCCs) were extracted. First-order differential coefficients (delta-MFCCs) were also employed in the standard HMM, but not in the IFD–HMM. Either a five or fifteen state left-to-right topology was used throughout testing. In all cases, diagonal covariance matrices were employed. In addition, unless noted otherwise, static mixtures were not employed in the IFD–HMM.
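The framing arithmetic implied above (25.6 ms windows with a 15.6 ms overlap, i.e. a 10 ms hop, at 20 kHz) works out as follows. This is a back-of-envelope sketch, not code from the paper:

```python
SAMPLE_RATE = 20_000                        # Hz
FRAME_MS, OVERLAP_MS = 25.6, 15.6           # per the experimental setup

frame_len = round(SAMPLE_RATE * FRAME_MS / 1000)          # samples per frame
hop = frame_len - round(SAMPLE_RATE * OVERLAP_MS / 1000)  # samples per step

def n_frames(n_samples):
    # Number of complete frames extractable from a signal.
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop
```

Each frame thus spans 512 samples, advanced 200 samples (10 ms) at a time, giving roughly 100 frames per second of speech.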

3.2. Terminology

Two types of time-lag sequences, i.e. base time-lag sequences and fixed-length time-lag sequences, are introduced in order to permit experimental comparisons between the different types of IFD–HMMs to be more readily made. A base time-lag sequence is a collection of frame offsets, from which corresponding preceding-only, succeeding-only and bi-directional time-lag representations can be formed. For example, the [3 4] base time-lag sequence can be expressed as a [−4 −3] preceding-only, [+3 +4] succeeding-only or [−4 −3 +3 +4] bi-directional time-lag sequence, thereby permitting similar time-lag sequences to be compared with each other.

A fixed-length time-lag sequence is defined to be a general (unspecified) time-lag sequence of a certain, specified, number of frames. For example, [+2 +4 +6], [−3 −2 −1] and [−2 +1 +2] are all examples of a fixed-length time-lag sequence of length three. An optimal fixed-length time-lag sequence is that time-lag sequence which obtains the highest recognition accuracy out of the set of time-lag sequences of the specified fixed length. Evidently, it was not feasible to test all possible time-lag sequences; hence, optimal fixed-length time-lag sequences were drawn from the set of all test cases (about 350 in total). The use of optimal fixed-length time-lag sequences permits an unbiased comparison of the recognition accuracy obtained from the use of bi-directional, preceding-only and succeeding-only time-lag sequences (unbiased in the sense that the number of frame dependencies is kept constant during any comparison).

3.3. Experimental exploration of the bi-directional IFD–HMM

3.3.1. Introduction of bi-directional time-lag sequences

The recognition accuracy variations across a spread of base time-lag sequences for a five state IFD–HMM employing MFCC features are shown in Fig. 1. For a given base time-lag sequence (for example [4 6]), the relevant preceding-only time-lag sequence (i.e. [−6 −4]) obtained a higher recognition accuracy than the corresponding succeeding-only time-lag sequence (i.e. [+4 +6]). Furthermore, the bi-directional time-lag sequence (i.e. [−6 −4 +4 +6]) improved upon the relevant preceding-only time-lag sequence in all cases. These findings need to be carefully qualified.

The higher levels of recognition accuracy returned when a preceding-only time-lag sequence is used over a succeeding-only time-lag sequence are an artefact of basing the tests on the E-set, for which most of the discriminative information is available at the start of the utterance. Under such circumstances a preceding-only time-lag sequence will better capture the useful dynamic discriminative information, as it offers a more comprehensive modelling of the start of the utterance. As an aside, when the entire alphabet was used as the test set, it was found that preceding-only and succeeding-only time-lag sequences returned highly comparable levels of recognition accuracy.

Whilst it is evident from Fig. 1 that a bi-directional time-lag sequence offers a significant increase in recognition accuracy, it is also the case that it contains twice as many frame dependencies as either of the other types of time-lag sequence. Nonetheless, the results provide clear evidence that the succeeding frames provide useful dynamic information that is not captured in the preceding frames.

The recognition accuracies for various optimal fixed-length time-lag sequences, collected using a five state IFD–HMM, are shown in Fig. 2. Also shown are the actual time-lag sequences which resulted in the reported optimal recognition accuracy.

For a fixed number of frame dependencies, use of the optimal bi-directional time-lag sequence outperforms the optimal preceding-only time-lag sequence, which in turn outperforms the optimal succeeding-only sequence. Hence, for a fixed number of frame dependencies, the highest recognition accuracy is obtained when a mixture of succeeding and preceding frame dependencies is employed.

3.3.2. Increasing numbers of frame dependencies

A brief examination of Fig. 2 shows that, in general, the inclusion of additional frame dependencies results in an increased recognition accuracy, with the bi-directional time-lag sequences returning the largest gains in recognition accuracy. However, it is reasonable to suspect that beyond a certain point the inclusion of frame dependencies will have little effect on the returned level of recognition accuracy, due to the majority of useful discriminative information already having been captured by the other frame dependencies. Indeed, the additional frame dependencies might result in a lowered recognition accuracy should there be insufficient training data to adequately train the increased number of model parameters. Hence, there is an optimal number of frame dependencies,

Fig. 2. Optimal fixed-length time-lag sequences.

Fig. 1. Recognition accuracy of various base time-lag sequences.


which ensures that as much useful temporal information as possible is captured without increasing the model parameter size to the extent where model estimation is compromised.

The recognition accuracies for succeeding-only, preceding-only and bi-directional time-lag sequences, as the number of frame dependencies in the time-lag sequence is varied, are shown in Fig. 3. As before, a five state IFD–HMM employing MFCC features was used. The plot lines shown in Fig. 3 are intended only to improve clarity (evidently, they do not offer an interpolation between data points).

Fig. 3 shows that when the number of dependencies in either the succeeding-only or preceding-only time-lag sequence is increased from one to two, or from two to three, an increase in recognition accuracy is reaped. Beyond three dependencies, there is a slow and erratic increase in the levels of recognition accuracy as the number of dependencies is increased. Although not fully shown in Fig. 3, the same pattern was also observed for the bi-directional time-lag sequence; i.e. beyond six dependencies (consisting of three succeeding and three preceding frames) there is little increase in recognition accuracy. Nevertheless, the use of bi-directional time-lag sequences permits a greater number of frame dependencies to be used before additional frame dependencies cease to contribute to the levels of recognition accuracy.

The somewhat erratic variations in recognition accuracy beyond three dependencies (six in the case of a bi-directional time-lag sequence) suggest that little new dynamic information is being provided by the added dependencies. This is not to say that frame dependencies beyond the third do not provide useful dynamic information; they do. Rather, the inclusion of additional dependencies does not provide a significant amount of new dynamic information. This finding suggests that beyond a certain number of frame dependencies it may be more profitable to look towards other means of increasing the recognition accuracy. In what follows, the addition of multiple mixtures to the bi-directional IFD–HMM will be explored.

3.3.3. The use of multiple mixtures with the bi-directional IFD–HMM

Multiple mixtures were employed as an aid towards obtaining higher recognition accuracies, in effect offering a more comprehensive modelling of the instantaneous spectra (whereas the use of longer time-lag sequences can be regarded as a more comprehensive modelling of the dynamic features).

Fig. 3. Recognition accuracy variations due to differing number of frame dependencies.

Table 1
IFD–HMM parameter size versus recognition accuracy for 15 state IFD–HMMs

                          1 mixture     2 mixtures    3 mixtures    4 mixtures    5 mixtures
Time-lag sequence         %     Size    %     Size    %     Size    %     Size    %     Size
IFD–HMM:
  2 frame                 90.09   3     90.35   6     90.61   9     90.69  12     90.82  15
  4 frame                 91.22   6     90.73  12     91.96  18     91.55  24     91.22  30
  6 frame                 90.98   9     91.80  18     90.89  27     90.75  36     90.33  45
  12 frame                92.37  18     91.71  36     90.48  54     89.50  72     89.02  90
HMM:
  MFCC + DMFCC            80.9   2.17   83.4   4.33   85.0   6.5    86.6   8.66   85.5   10.8
  MFCC + DMFCC + DDMFCC   83.4   3.25   84.7   6.5    85.3   9.75   85.6   13     86.7   16.25


Table 1 provides recognition accuracies collected from fifteen state IFD–HMMs employing state tying of the last nine states (Woodland and Cole, 1991). Also shown is the model parameter size relative to a single mixture, five state, [−2 +2] IFD–HMM (which is arbitrarily assumed to have a model parameter size of one, equal to 360 free parameters). A comparison of model parameter size versus recognition accuracy should enable the best combination of the number of multiple mixtures and frame dependencies to be determined.

A study of Table 1 shows that the introduction of multiple mixtures to IFD–HMMs with a large number of frame dependencies often results in relatively poor levels of recognition accuracy. This is attributable to the limited size of the Connex database, where approximately 155 utterances were available for training each word. As such, there were insufficient training data to permit a good estimation of the IFD–HMM model parameters when both multiple mixtures and a large number of frame dependencies are employed. If more extensive training data were available, then it would be reasonable to expect higher recognition accuracies for the larger models. The results presented in Table 1 also show that, should insufficient data enforce an upper limit on the model parameter size, it is best to favour longer time-lag sequences over the use of multiple mixtures. For reference, standard HMM recognition accuracies are also provided, using MFCC features augmented with first- and second-order differential coefficients.

The superiority of longer time-lag sequences over multiple mixtures is more readily seen from Table 2, where five state IFD–HMMs are employed (thereby keeping the model parameter size sufficiently small to permit an adequate training of the model parameters). The results show that, for a fixed model parameter size, somewhat higher recognition accuracies are obtained when emphasis is given to longer time-lag sequences over an increased number of multiple mixtures. Hence, whilst multiple mixtures may be used to extend the capabilities of the IFD–HMM, they are generally less effective than the use of additional frame dependencies; a combination of both is better still if sufficient training data are available.

3.3.4. Variation of w, the time-lag combination weighting term

So far it has been assumed that both time-lag components are equally weighted (equivalent to a w value of 0.5 in Eq. (20)). However, it was found that for E-set utterances, preceding-only time-lag sequences yielded an improved performance over the corresponding succeeding-only time-lag sequences. By varying w it is possible to alter the relative importance of each of the two time-lag components, thereby potentially enabling a better modelling should the distribution of temporal information be highly biased towards either succeeding or preceding frames.

Two alphabets were used to explore the effects of varying w: namely, the E-set (comprising the ``b'', ``c'', ``d'', ``e'', ``g'', ``p'', ``t'' and ``v'' characters) and the full alphabet (consisting of the entire 26-character alphabet from which the

Table 2
IFD–HMM parameter size versus recognition accuracy for 5-state IFD–HMMs

                           1 mixture    2 mixtures   3 mixtures   4 mixtures   5 mixtures
Time-lag sequence          %     Size   %     Size   %     Size   %     Size   %     Size
IFD–HMM
  2 frame                  79.1  1      80.2  2      84.7  3      84.3  4      83.7  5
  4 frame                  82.5  2      85.4  4      82.6  6      83.7  8      83.9  10
  6 frame                  84.8  3      85.4  6      83.0  9      83.9  12     83.4  15
HMM
  MFCC + DMFCC             74.0  0.72   76.0  1.44   74.8  2.17   76.2  2.88   76.8  3.61
  MFCC + DMFCC + DDMFCC    74.2  1.08   81.1  2.17   80.3  3.25   80.9  4.33   82.4  5.42

P. Hanna et al. / Speech Communication 28 (1999) 301–312

Connex database is composed). The results can be seen in Fig. 4, where w is varied over the range [0, 1.0] in intervals of 0.1.

As expected, due to the characteristics of the E-set, emphasis on modelling the temporal correlation from the preceding frames returns better recognition accuracies than those obtained when a similar emphasis is given to the corresponding succeeding-frame-only time-lag sequences. For the entire alphabet character set, a greater degree of parity was found, to the extent that succeeding and preceding frame dependencies can be regarded as approximately equal in importance. For time-lag sequences other than [−2, +2], the optimal value of w was found to be 0.5. This was the case even when the recognition accuracies obtained with w = 1 and w = 0 (i.e. those of the corresponding preceding-only and succeeding-only time-lag sequences) differed considerably. The collected results show that a w value of 0.5 should prove close to optimal in the majority of cases.

3.3.5. Time-lag optimisation of the bi-directional IFD–HMM

In this section, results are presented for the time-lag optimisation of a two-frame bi-directional time-lag sequence, consisting of one preceding and one succeeding frame dependency. It is for this type of time-lag sequence that the highest gains in recognition accuracy over a fixed time-lag sequence should result. The reason is that as a greater number of frame dependencies is employed, it becomes increasingly likely that a fixed time-lag sequence will capture the important frame dependencies, whereas there is a greater probability that the important frame dependencies will be missed should a small fixed time-lag sequence be employed.
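As a rough illustration of what such an optimisation involves, the sketch below exhaustively searches the candidate (preceding, succeeding) lag pairs for a two-frame bi-directional sequence. The `score` function is hypothetical (e.g. a training-set likelihood or accuracy supplied by the caller), and the actual optimisation procedure is that of Ming and Smith (1996), not this sketch.

```python
import itertools

# Sketch only: exhaustive search over two-frame bi-directional time-lag
# sequences [-p, +s] with 1 <= p, s <= max_lag. `score(p_lag, s_lag)` is a
# hypothetical caller-supplied figure of merit (e.g. training likelihood).
def optimise_time_lags(score, max_lag: int = 4):
    """Return the (preceding, succeeding) lag pair maximising `score`."""
    candidates = itertools.product(range(1, max_lag + 1), repeat=2)
    best = max(candidates, key=lambda ps: score(-ps[0], +ps[1]))
    return (-best[0], +best[1])

# Example with a toy score that prefers the sequence [-2, +1]:
lags = optimise_time_lags(lambda p, s: -abs(p + 2) - abs(s - 1))
print(lags)  # (-2, 1)
```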

Table 3 shows recognition accuracy versus model parameter size for a number of time-lag optimised and non-time-lag optimised IFD–HMMs. All results are based on a five-state IFD–HMM. The one-frame-dependency, preceding-only time-lag optimisation results are taken from Ming and Smith (1996). As might be expected, the results confirm that the use of time-lag optimisation results in an improved level of recognition accuracy (although a two-frame time-lag optimised model does not outperform the best four-frame non-time-lag optimised model). The use of time-lag optimisation results in a 10.5% reduction in the

Fig. 4. Variation of w for a 5-state IFD–HMM over different character sets and time-lag sequences.


error recognition rate when compared to the corresponding IFD–HMM without optimisation. Such a reduction is less than the reductions reported by Ming and Smith (1996) for the basic IFD–HMM (where the average reduction in error recognition rate was 22.9%). However, the differences are not viewed as significant: for example, even if one of the dependencies of a two-frame bi-directional time-lag sequence provides relatively useless information, the other dependency may yet provide sufficient correlation to permit correct recognition.

In conclusion, the application of time-lag optimisation to the bi-directional IFD–HMM results in a reasonable (10–13%) reduction of the error recognition rate. The application of time-lag optimisation to longer bi-directional time-lag sequences should also reduce the error recognition rate, although to a lesser degree as a greater number of frame dependencies is employed.

4. Conclusions

Extending the IFD–HMM so that it supports a dependency upon succeeding frames permits an improved modelling of the available dynamic spectral information, thereby resulting in a significant decrease in the recognition error rate when compared to models that are only dependent upon preceding frames.

The introduction of bi-directional time-lag sequences to the IFD–HMM was based on the assumption that the non-stationary nature of speech entails that succeeding frames relative to the observed frame will contain useful discriminative dynamic information not found in the preceding frames. This assumption has been shown to be correct, and furthermore, the temporal information contained in succeeding frames has been shown to be as useful as that contained in the preceding frames.

As shown, the introduction of bi-directional time-lag sequences does not entail an increased model parameter size or increased parameter estimation complexity for a given number of dependencies, thereby affording a powerful and inexpensive means of increasing recognition accuracy. Indeed, the highest Connex E-set recognition accuracy obtained through the use of a bi-directional IFD–HMM was 92.4%, comparing favourably to the previous highest preceding-frame-only IFD–HMM recognition accuracy of 90.7% (Ming and Smith, 1996) and the standard HMM (using delta MFCC features) recognition accuracy of 85.7% (Woodland and Cole, 1991). The bi-directional IFD–HMM's result of 92.4% was obtained without the use of multiple mixtures, time-lag optimisation or delta features. Should sufficient training data be available, it is reasonable to expect that the introduction of these approaches would result in a further increase in recognition accuracy.

The presented results demonstrate that beyond a certain number of frame dependencies, the introduction of additional frame dependencies is unlikely to result in the capture of new temporal information not already captured by the existing dependencies. Under such circumstances, alternative means of increasing the recognition accuracy may prove more beneficial than adding further frame dependencies. It was also shown that a weighting term of 0.5 for balancing the contributions of the preceding and succeeding frame dependencies is nearly optimal.

The use of multiple mixtures in conjunction with the IFD–HMM led to improved levels of recognition accuracy (provided there were sufficient training data). However, the collected results show that emphasis on modelling the dynamic spectral information is at least as profitable as, and normally more so than, emphasis on modelling the instantaneous spectra.

Table 3
Comparison of time-lag optimised IFD–HMMs to non-optimised models. Note that the recognition accuracies reported for the non-optimised models are those provided by the best non-optimised models.

IFD–HMM                                   Recognition accuracy (%)   Model parameter size
1 frame preceding-only, optimised         73.1                       180
Highest 2-frame bi-directional            79.1                       360
2-frame bi-directional, optimised         81.3                       360
Highest 4-frame preceding-only            80.1                       720
Highest 4-frame bi-directional            83.4                       720


Acknowledgements

The authors would like to thank both referees for their useful comments, and also BT for financial support. Additionally, this work is supported by EPSRC grant number GR/K82505.

References

Baum, L.E., 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3, 1–8.

Hanna, P., Ming, J., O'Boyle, P., Smith, F.J., 1997. Modelling inter-frame dependence with preceding and succeeding frames. In: Proceedings of EuroSpeech 97, Vol. 3, pp. 1167–1170.

Harte, H., Vaseghi, S., Milner, B., 1996. Dynamic features for segmental speech recognition. In: Proceedings of ICSLP 96, pp. 933–936.

Juang, B.H., 1985. Maximum-likelihood estimation for mixture multivariate stochastic observations of Markov chains. AT&T Technical Journal 64, 1235–1249.

Kenny, P., Lennig, M., Mermelstein, P., 1990. A linear predictive HMM for vector-valued observations with applications to speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing 38, 220–225.

Ming, J., Smith, F.J., 1996. Modelling of the interframe dependence in an HMM using conditional Gaussian mixtures. Computer Speech and Language 10, 229–247.

Paliwal, K.K., 1993. Use of temporal correlation between successive frames in a hidden Markov model based speech recognizer. In: Proceedings of ICASSP 93, pp. 209–212.

Smith, F.J., Ming, J., O'Boyle, P., Irvine, A.D., 1995. A hidden Markov model with optimized inter-frame dependence. In: Proceedings of ICASSP 95, pp. 209–212.

Takahashi, S., Matsuoka, T., Minami, Y., Shikano, K., 1993. Phoneme HMMs constrained by frame correlations. In: Proceedings of ICASSP 93, pp. 219–222.

Wellekens, C.J., 1987. Explicit correlation in hidden Markov models for speech recognition. In: Proceedings of ICASSP 87, pp. 384–387.

Woodland, P.C., 1992. Hidden Markov models using vector linear prediction and discriminative output distributions. In: Proceedings of ICASSP 92, pp. 509–512.

Woodland, P.C., Cole, D.R., 1991. Optimizing hidden Markov models using discriminative output distributions. In: Proceedings of ICASSP 91, pp. 545–548.
