Speech Enhancement Combining Optimal Smoothing and Errors-In-Variables Identification of Noisy AR Processes

5564 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 12, DECEMBER 2007

William Bobillet, Roberto Diversi, Eric Grivel, Roberto Guidorzi, Mohamed Najim, Fellow, IEEE, and Umberto Soverini

Abstract—In the framework of speech enhancement, several parametric approaches based on an a priori model for a speech signal have been proposed. When using an autoregressive (AR) model, three issues must be addressed. 1) How to deal with AR parameter estimation? Indeed, due to additive noise, the standard least squares criterion leads to biased estimates of the AR parameters. 2) Can an estimation of the variance of the additive noise for each speech frame be obtained? A voice activity detector is often used for its estimation. 3) Which estimation rules and techniques (filtering, smoothing, etc.) can be considered to retrieve the speech signal? Our contribution in this paper is threefold. First, we propose to view the identification of the noisy AR process as an errors-in-variables problem. This blind method has the advantage of providing accurate estimations of both the AR parameters and the variance of the additive noise. Second, we propose an alternative algorithm to standard Kalman smoothing, based on a constrained minimum variance estimation procedure with a lower computational cost. Third, the combination of these two steps is investigated. It provides better results than some existing speech enhancement approaches in terms of signal-to-noise ratio (SNR), segmental SNR, and informal subjective tests.

Index Terms—Autoregressive (AR) parameter estimation, Kalman filtering, smoothing, speech enhancement.

Manuscript received July 21, 2006; revised February 21, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ilya Pollak.

W. Bobillet, E. Grivel, and M. Najim are with Equipe Signal et Image-LAPS, UMR 5218 IMS, 33405 Talence, France (e-mail: [email protected]; [email protected]; [email protected]).

R. Diversi, R. Guidorzi, and U. Soverini are with the Dipartimento di Elettronica, Informatica e Sistemistica, University of Bologna, 40136 Bologna, Italy (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2007.898787

I. INTRODUCTION

IN MANY applications where speech plays a key role, such as mobile communication or speech recognition, the recorded signal is often disturbed by surrounding noises. Therefore, a great deal of attention has been paid to one-microphone speech enhancement over the last 25 years. The purpose is to retrieve the original signal from a single sequence of noisy observations.

In addition to short-term spectral attenuation [1], [2], parametric approaches based on an a priori model of speech have been developed. In the so-called subspace methods [3], [4], the speech signal is modeled by a sum of complex exponentials. Speech enhancement is then based on the decomposition of the subspace spanned by the noisy data into two orthogonal subspaces: the noise subspace, which can provide information about the noise variance, and the signal subspace, which is used to estimate the frequencies and the amplitudes of the complex exponentials [5]. When the additive noise is white, both subspaces can be estimated by computing the eigenvalue decomposition of the autocorrelation matrix of the noisy observations. The highest eigenvalues are used to characterize the signal subspace, while the lowest ones allow the additive noise variance to be estimated.

An alternative approach consists in considering the highest singular values and the corresponding singular vectors of the Hankel (or Toeplitz) matrix of the noisy observations. Then, various estimation rules such as least squares (LS) or minimum variance (MV) [6] have been proposed to retrieve the clean speech. Nevertheless, the choice of the model order can be critical. Indeed, when dealing with noise-like frames such as the unvoiced phonemes /m/ or /p/, the order cannot be easily defined. Thus, selecting an a priori constant order introduces the so-called musical noise.1 This problem can be partially solved if the noise variance is available, by taking into account only the complex exponentials with power higher than the noise variance. In [7], the authors propose a procedure that relies on an average of the eigenvalues related to the current frame with the corresponding (from a frequential point of view) eigenvalues related to the previous frames.

An alternative model for the speech signal is the autoregressive (AR) model, defined as follows:

    s(k) = −a_1 s(k−1) − ⋯ − a_p s(k−p) + e(k)    (1)

where a_1, …, a_p are the autoregressive parameters, e(k) is the zero-mean white driving process with variance σ_e², and p denotes the order. This model is used in speech analysis [8] and coding (with CELP, for instance [9]) since it approximates the short-term spectral envelope of the signal.

Here, the signal s(k) is disturbed by an additive zero-mean Gaussian white noise b(k), with variance σ_b², uncorrelated with e(k). The noisy observations y(k) are hence given by the relation

    y(k) = s(k) + b(k).    (2)

1A musical noise corresponds to short-term sinusoids randomly distributed over time and frequency.

1053-587X/$25.00 © 2007 IEEE
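As a minimal illustration of the model (1)–(2), the following numpy sketch (ours, not from the paper; the AR(1) pole 0.9 and the unit variances are arbitrary choices) simulates a noisy AR process. The estimate `pole_hat` only checks the simulation against the sample autocorrelation of the clean signal.

```python
import numpy as np

# Illustrative simulation of (1)-(2): an AR(1) signal s(k) plus white noise b(k).
# With s(k) = 0.9 s(k-1) + e(k), the parameter of (1) is a_1 = -0.9.
rng = np.random.default_rng(0)

N = 5000
pole = 0.9        # AR(1) pole (illustrative)
sigma_e = 1.0     # std of the driving process e(k)
sigma_b = 1.0     # std of the additive noise b(k)

e = rng.normal(0.0, sigma_e, N)
s = np.empty(N)
s[0] = e[0]
for k in range(1, N):
    s[k] = pole * s[k - 1] + e[k]    # relation (1)

b = rng.normal(0.0, sigma_b, N)
y = s + b                            # relation (2)

# Sample autocorrelation ratio r(1)/r(0) of the clean signal: close to the pole.
r0 = np.dot(s, s) / N
r1 = np.dot(s[1:], s[:-1]) / N
pole_hat = r1 / r0
```
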



Speech enhancement based on (1) and (2) usually operates in the following two steps:

1) estimation of the AR parameters a_i and of the variances σ_e² and σ_b²;
2) retrieval of the speech signal s(k) from the noisy observations y(k).

Various approaches can be considered. Thus, MAP estimation techniques can be conceived to estimate the speech signal and the corresponding AR parameters [10], [11]. Enhancement techniques based on Kalman filtering allow the estimation error variance to be minimized [12]–[15], whereas H∞ algorithms prevent disturbances, such as the additive noise, from generating large signal estimation errors [16], [17]. However, the quality2 of the enhanced speech mostly depends on how the AR parameters and the noise variances are estimated. In [12] and [16], the speech parameters are obtained from the clean speech, assumed to be available. In [12], Paliwal and Basu use a delayed version of the Kalman filter to estimate the speech signal, whereas the authors in [16] have recently studied the relevance of H∞ techniques. While convenient for comparing different filtering or smoothing procedures on simulated data, this procedure obviously cannot be used in real cases. In [19], Gibson et al. propose to consider a simplified version of the expectation-maximization (EM) algorithm, operating as follows. At the first iteration, the speech parameters and the variance σ_e² are estimated from the noisy signal, which is filtered by means of Kalman filtering. Then, the parameters are alternately estimated from the enhanced signal and used to filter the noisy signal.

In the previous methods, the estimation of the AR signal parameters from the noisy observations is not really addressed and remains an open topic. Standard "online" LS approaches based on LMS, Kalman filtering, and recursive least squares (RLS) techniques, or "offline" approaches such as the Yule–Walker equations, yield biased estimates [20]. Indeed, the AR spectrum estimated from the noisy observations is flatter than the AR spectrum estimated from the clean signal. Therefore, during recent years, several studies have been carried out to counteract the effects of this bias. An iterative "offline" bias correction scheme has been proposed by Zheng [21], whereas Davila's method [22] consists in solving the so-called noise-compensated Yule–Walker (NCYW) equations, which were initially introduced by Kay [23]. "Online" methods have also been developed. Among them, the γ-LMS filter [24] consists in updating the tap-weight vector from the previous tap-weight vector multiplied by a value γ, defined by the variance of the noise and by the LMS gain. The ρ-LMS filter operates as an LMS filter on enhanced observations [25]. Although these "online" noise-compensated techniques provide significant results when thousands of noisy observations are used, their relevance remains questionable when only a few hundred noisy samples are available. As an alternative, a recursive instrumental variable (IV) solution [26] has recently been proposed to estimate both the AR parameters and the speech signal. It is based on the use of two cross-coupled Kalman filters operating in parallel. The comparative study carried out by the authors shows that their approach outperforms the above methods and the modified Yule–Walker (MYW) equations.
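The bias mentioned above can be made concrete with a deterministic toy computation (ours, using theoretical autocorrelations of an AR(1) signal; the pole 0.9 and variances are illustrative): additive white noise inflates only the zero-lag correlation, so the Yule–Walker pole estimate shrinks toward zero, i.e., a flatter spectrum.

```python
import numpy as np

# Yule-Walker pole estimate r(1)/r(0) on clean vs. noisy correlations.
a   = 0.9    # true AR(1) pole (s(k) = a s(k-1) + e(k))
se2 = 1.0    # driving-process variance
sb2 = 1.0    # additive-noise variance

r0_s = se2 / (1.0 - a**2)    # theoretical r_s(0) of an AR(1) process
r1_s = a * r0_s              # theoretical r_s(1)

pole_clean = r1_s / r0_s             # = a : unbiased on clean data
# Noise only changes lag 0: r_y(0) = r_s(0) + sb2, r_y(1) = r_s(1).
pole_noisy = r1_s / (r0_s + sb2)     # shrunk estimate -> flatter spectrum
```
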

2As mentioned in [18], the quality of the speech is a subjective criterion that measures the way the speech signal is perceived by listeners.

In addition to the AR parameter estimation issue, the variance of the additive noise must also be estimated. This can be done during silent periods, but implies the use of a voice activity detector (VAD). To avoid it, other methods have been proposed. In [14], Kalman filter-based speech enhancement is formulated as a realization issue in the framework of identification. In that case, subspace methods for identification allow state space models to be directly estimated without any intermediate estimation of the AR parameters. In addition, the variance of the additive noise for each speech frame can be obtained. Furthermore, an EM algorithm could also be considered to both estimate the AR parameters and the noise statistics [13], [27]. Another solution [28], based on Mehra's work [29], does not require the explicit knowledge of the variances of the additive noise and the driving process.

Remark 1: In [30], the joint estimations of the noise variances and the AR parameters from noisy observations are studied by means of the factor graph. The resulting approach consists in combining three kinds of methods. Particle filtering is useful to solve the noise variance issue, while Kalman filtering is coupled with an LMS-type algorithm for the joint estimations of the AR parameters and the AR process.

Remark 2: For the previous approaches, perceptual criteria can be taken into account to weaken the residual noise. Thus, in [31], masking properties of the auditory system and spectral subtraction are combined. In [32], speech enhancement is viewed as a minimization problem that limits the frequency components of the residual noise to levels lower than the masking threshold. Moreover, when incorporating human hearing properties into subspace methods, signal distortion can be minimized under temporal or spectral constraints (respectively denoted as TDC and SDC) [4]. In [18] and [33], Ma et al. propose to derive a Kalman filter-based approach by taking into account both temporal and frequency-domain simultaneous masking properties. For this purpose, the variance of the estimation error is first minimized in order to be under the masking threshold. However, this leads to an optimization algorithm subject to several constraints and whose computational cost is high. For this reason, the authors suggest using a perceptual postfiltering combined with a Kalman filtering in the wavelet domain.

In this paper, our contribution is threefold. First, we propose to view the AR parameter estimation as an errors-in-variables (EIV) issue. EIV models, previously developed in the framework of statistics and derived in other fields such as control and identification, assume that every observation contains an additive error term [34]. In our case, the errors correspond to both the additive noise b(k) and the driving process e(k), as shown in Fig. 1.

The approach described here is based on the determination of the null space of suitable positive semidefinite matrices. It has the advantage of blindly estimating the AR parameters and the variances of both the additive noise and the driving process, even when the number of observations is reduced and the signal-to-noise ratio (SNR)3 is low (e.g., 5 dB). To illustrate the relevance of the solution, we compare it with existing "offline" [21], [22], [26] and "online" [24], [25] methods, as well as with the EM algorithm [13].

3SNR = 10 log10 [ Σ_k s²(k) / Σ_k b²(k) ].


Fig. 1. EIV scheme for AR identification. The standard (a) AR model is viewed as (b) an EIV model.

Second, we propose an alternative algorithm to Kalman smoothing to estimate the signal, based on a constrained minimum variance estimation procedure that leads to a less critical implementation. Indeed, there is no need to estimate the statistical mean of the initial state and its covariance matrix.

Third, we combine the previous AR parameter estimation method with the optimal smoothing in the framework of speech enhancement. The resulting algorithm provides better results in terms of both SNR and segmental SNR improvements and informal subjective tests (IST) than existing speech enhancement approaches such as [17] and [19].

The rest of this paper is organized as follows. Section II deals with the EIV-based formulation of the AR parameter estimation issue. This approach is then analyzed in Section III, where connections with standard AR parameter estimation techniques are also presented. In Section IV, the optimal MV smoothing of the signal is developed. In Section V, we apply the proposed method to speech enhancement. Some conclusions are given in Section VI.

II. AR PARAMETER ESTIMATION

A. EIV Formulation of the AR+Noise Parameter Estimation

Given a generic process described by n measurable variables x_1, …, x_n, the formulation of an EIV estimation problem consists in determining the set of n-tuples (α_1, …, α_n) satisfying the following linear relation:

    α_1 x_1 + α_2 x_2 + ⋯ + α_n x_n = 0

and compatible with the noisy observations y_i = x_i + b_i, without any a priori knowledge of the variances of the noise terms b_i, which are assumed to be independent from the process values [35]. By denoting with Σ_y the covariance matrix of the noisy observations, this problem can be reformulated as the search of the family of noise covariance matrices Σ̃ which enables the difference

    Σ_y − Σ̃

to be positive semidefinite, i.e., positive and singular.

In the following, let us describe why and how an EIV scheme can be considered to extract, from noisy observations, the parameters of an AR process disturbed by an additive white noise. For this purpose, let us consider the N × (p+1) Hankel matrix

    S = [ s(1)   s(2)    ⋯  s(p+1)
          s(2)   s(3)    ⋯  s(p+2)
           ⋮      ⋮            ⋮
          s(N)   s(N+1)  ⋯  s(N+p) ]

and the matrix Y, with the same Hankel structure, built from the noisy observations y(k).    (3)

Given (1), it follows that

    S θ = e    (4)

where

    θ = [a_p ⋯ a_1 1]ᵀ,  e = [e(p+1) ⋯ e(N+p)]ᵀ.    (5)

Then, when introducing the positive semidefinite covariance matrix of the noise vector [b(k−p) ⋯ b(k−1) b(k)+e(k)]ᵀ

    Σ̃ = σ_b² I_{p+1} + σ_e² E,  E = diag(0, …, 0, 1)    (6)

the relation (4) implies, since e(k) is uncorrelated with s(k−1), …, s(k−p) and E[s(k)e(k)] = σ_e², that

    R_s θ = σ_e² E θ.    (7)

Let us now define the following (p+1) × (p+1) covariance matrices of the AR process s(k), of the noisy observations y(k), and of the additive noise b(k):

    R_s = E[ Sᵀ S ] / N,  R_y = E[ Yᵀ Y ] / N,  R_b = E[ Bᵀ B ] / N = σ_b² I_{p+1}

where the Hankel matrices Y and B have the same structure as S. Due to (2), and as s(k) and b(k) are assumed to be uncorrelated, it follows that

    R_y = R_s + σ_b² I_{p+1}.    (8)

It is also straightforward to show, on the basis of (6) and (7), that

    ( R_s − σ_e² E ) θ = 0.    (9)
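The relations above can be checked numerically. The sketch below (ours, not from the paper) builds the theoretical matrices for a toy AR(1)+noise case (p = 1, pole 0.9, unit variances, all illustrative), with θ = [a_1, 1]ᵀ and E = diag(0, 1), and verifies that the residuals of (9) and of the combined relation vanish.

```python
import numpy as np

# Toy AR(1)+noise check of (8)-(9): s(k) = 0.9 s(k-1) + e(k), so a_1 = -0.9.
a1, se2, sb2 = -0.9, 1.0, 1.0

r0 = se2 / (1.0 - a1**2)      # theoretical r_s(0)
r1 = -a1 * r0                 # theoretical r_s(1)
Rs = np.array([[r0, r1], [r1, r0]])   # covariance of [s(k-1), s(k)]
Ry = Rs + sb2 * np.eye(2)             # relation (8)
E = np.diag([0.0, 1.0])
theta = np.array([a1, 1.0])

resid9 = (Rs - se2 * E) @ theta                        # relation (9)
resid10 = (Ry - sb2 * np.eye(2) - se2 * E) @ theta     # combining (8) and (9)
```

At the true variances, R_y − σ_b² I − σ_e² E is therefore singular (and positive semidefinite), which is the key fact exploited next.
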


Fig. 2. EIV curves of an eighth-order AR process, driven by a zero-mean white process with variance σ_e² = 1. The additive noise is a white process with variance σ_b² = 1. Here, A = (β, ε) = (2, 1).

By denoting with R̄(σ_b²) the quantity R_y − σ_b² I_{p+1}, and by combining (8) and (9), we finally obtain

    ( R̄(σ_b²) − σ_e² E ) θ = 0.    (10)

At that stage, the solution consists in determining [36] the points P = (β, ε), with β, ε ≥ 0 denoting candidate values of σ_b² and σ_e², so that

    det( R_y − β I_{p+1} − ε E ) = 0,  with  R_y − β I_{p+1} − ε E ≥ 0.    (11)

The set of points P = (β, ε) satisfying (11) belongs to a continuous convex curve C(p+1) lying in the first quadrant of the plane (β, ε), where β and ε stand for the abscissa and the ordinate, respectively. The concavity of this curve faces the origin [37]. It can be observed that every point P ∈ C(p+1) can be associated with the vector of parameters θ(P), with last entry normalized to 1, given by

    ( R_y − β I_{p+1} − ε E ) θ(P) = 0.    (12)

The point P* = (σ_b², σ_e²) corresponding to the true variances is the only point belonging to C(p+1) and to C(p+1+j) for all j ≥ 1 [37] (see Fig. 2). Consequently, θ* = θ(P*) is the only vector with last entry normalized to 1 satisfying the following equations:

    ( R_y^{(j)} − σ_b² I_{p+1+j} − σ_e² E^{(j)} ) θ^{(j)} = 0,  j ≥ 0    (13)

where R_y^{(j)} denotes the (p+1+j) × (p+1+j) autocorrelation matrix of the noisy observations, θ^{(j)} is obtained by appending j leading zeros to θ*, and where the matrix E^{(j)} is defined like E in (6).

B. Using the EIV Approach in Real Cases

In all real cases, since the assumptions of whiteness and uncorrelation do not hold exactly, no set (β, ε, θ) verifies relation (13). This has led to the development of several criteria that make it possible to apply the EIV scheme to real data. In this paper, two criteria originally presented in [36] and [38] are considered.

To introduce the first one, denoted as the shifted relations (SR) criterion, let us consider in the plane (β, ε) the intersections P and P′ of a line from the origin with the curves C(p+1) and C(p+2), respectively. Since P* is the only point belonging to C(p+1+j) for all j ≥ 0, the parameter vectors associated with P and P′ coincide if and only if P = P′ = P*. As this will not be verified at any point in all real cases, the SR criterion determines the solution that minimizes the squared norm of the vector

    θ(P′) − [0 θ(P)ᵀ]ᵀ.    (14)

The resulting algorithm is thus the following [36].

1) Start from a generic ratio q = ε/β, defined by the line from the origin in the plane (β, ε) with slope q, and compute its intersections P(q) and P′(q) with C(p+1) and C(p+2), where q ∈ (0, +∞).
2) Compute the matrices R_y − β I − ε E and the corresponding vectors θ(P) and θ(P′) by using (11) and (12).
3) Compute the cost function given in (14).
4) Search for the value of q that minimizes it.

The second criterion is based on the so-called high-order Yule–Walker (HOYW) equations (16), which can be derived in the sequel. By defining another AR process Hankel matrix, built from the past samples, as follows:

    S̄ = [ s(0)    s(−1)   ⋯  s(1−m)
           s(1)    s(0)    ⋯  s(2−m)
            ⋮       ⋮            ⋮
           s(N−1)  s(N−2)  ⋯  s(N−m) ]

one obtains

    E[ S̄ᵀ e ] = 0    (15)

since the driving process e(k) is uncorrelated with the past signal samples. Let us now introduce the matrix Ȳ, with the same structure as S̄ but built from the noisy observations, as well as the matrices

    R̄_s = E[ S̄ᵀ S ] / N,  R̄_y = E[ Ȳᵀ Y ] / N

and R̄_b, defined in a similar way. Since S θ = e, it follows from (15) that

    R̄_s θ = 0.    (16)

Since the additive noise b(k) is uncorrelated with s(k), the autocorrelation functions of y(k) and s(k) are the same for every nonzero lag; this leads to the following equality:

    R̄_y = R̄_s.    (17)

Therefore, according to (16) and (17), the AR parameter vector θ* satisfies the relation

    R̄_y θ* = 0.

Let us now consider the locus of admissible solutions C(p+1) and the parameter vectors θ(P) associated with its points P. It can be shown [38] that R̄_y annihilates θ(P) only if P = P*. In all real cases, this property will not be satisfied at any point of C(p+1); the HOYW criterion makes it possible to estimate P* by minimizing the squared norm of the vector

    R̄_y θ(P(q)).    (18)

This second algorithm is thus described by the following steps [38].

1) Start from a generic ratio q = ε/β, defined by the line from the origin with slope q, and compute its intersection P(q) with C(p+1).
2) Compute the matrix R_y − β I − ε E and the corresponding vector θ(P) by using (11) and (12).
3) Compute the cost function (18).
4) Search for the value of q that minimizes it.

The minimum of the cost functions (14) and (18) along q ∈ (0, +∞) can be found by means of any standard search algorithm since these functions have a single absolute minimum.
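Both criteria thus reduce to a scalar minimization along the ratio q. The sketch below (ours, not from the paper; `frisch_point` and `hoyw_cost` are our names) works on a toy AR(1) case with theoretical correlations: the curve point on the line of slope q is obtained as a minimum generalized eigenvalue, and an HOYW-style cost built from the residuals of (18) is minimized on a grid. At the true ratio q = σ_e²/σ_b² = 1 the cost vanishes and the true variances and parameter are recovered.

```python
import numpy as np

# Toy model: s(k) = 0.9 s(k-1) + e(k) (a_1 = -0.9), unit variances.
a1, se2, sb2 = -0.9, 1.0, 1.0
r0 = se2 / (1.0 - a1**2)
r_s = lambda m: (-a1) ** abs(m) * r0          # AR(1) autocorrelation r_s(m)
Ry = np.array([[r0 + sb2, r_s(1)], [r_s(1), r0 + sb2]])
E = np.diag([0.0, 1.0])

def frisch_point(Ry, q):
    """Curve point on the line of slope q: the largest beta such that
    Ry - beta (I + q E) remains positive semidefinite, and its null vector."""
    D = np.eye(len(Ry)) + q * E
    Ds = np.diag(1.0 / np.sqrt(np.diag(D)))
    w, V = np.linalg.eigh(Ds @ Ry @ Ds)       # eigenvalues in ascending order
    beta = w[0]                                # candidate sigma_b^2
    theta = Ds @ V[:, 0]
    theta = theta / theta[-1]                  # last entry normalized to 1
    return beta, q * beta, theta

def hoyw_cost(theta, n_lags=10):
    # High-order YW residuals: rows [r_y(l-1), r_y(l)], l >= 2; for nonzero
    # lags r_y = r_s, so the true theta annihilates every row.
    return sum((theta[0] * r_s(l - 1) + theta[1] * r_s(l)) ** 2
               for l in range(2, 2 + n_lags))

qs = np.linspace(0.2, 5.0, 49)                 # grid containing the true q = 1
costs = [hoyw_cost(frisch_point(Ry, q)[2]) for q in qs]
q_hat = qs[int(np.argmin(costs))]
sb2_hat, se2_hat, theta_hat = frisch_point(Ry, q_hat)
```
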

C. Simulation Results

Here, the proposed method is compared with the EM algorithm [13], [27] and two "offline" methods, namely the IV method based on two interacting Kalman filters [26] and Davila's method [22]. The standard Yule–Walker solution is also studied to point out the bias that must be compensated. Moreover, two "online" approaches, using the γ-LMS [24] or the ρ-LMS [25], are tested. Note that both the EIV (SR) and the EIV (HOYW) approaches are considered.

Since only a limited number of samples is available when modeling speech frames by an AR process, we consider a sequence of 300 noisy observations for the simulations. In addition, the order of the AR process to be estimated has been set to 10, which is a common choice for speech AR modeling. The AR process has then been disturbed by an additive white noise.

Since AR models describe the short-term spectrum of a signal by retrieving its first formants, let us first compare the AR poles in the z-plane and the corresponding AR spectra obtained by each tested method with the true ones. These results are presented in Figs. 3 and 4, for SNR = 10 dB and SNR = 5 dB, respectively. Moreover, 20 realizations of the additive noise are plotted.

In both figures, one can see that the Levinson algorithm provides very smoothed spectra: this is due to the bias introduced by the additive noise, which moves the poles away from the unit circle [20]. The "online" approaches, namely the γ-LMS and the ρ-LMS, do not allow the spectrum to be estimated accurately, due to the reduced number of available samples. In some cases, Davila's method diverges, since it provides poles outside the unit circle in the z-plane. In addition, the variance of the driving process may be poorly estimated. On the contrary, for SNR = 5 dB, the EM algorithm, the method proposed in Section II-B, and the IV method constitute reliable alternatives since they can retrieve the three peaks of the true spectrum. The first two have the advantage of blindly estimating the variance of the additive noise.

Moreover, we have also investigated the relevance of the approach in terms of the average Itakura distance [39]. The Itakura distance between the parameter vector θ and its ith estimate θ̂_i is given by the following relation:

    d(θ, θ̂_i) = log( θ̂_iᵀ R̂ θ̂_i / θᵀ R̂ θ )

where R̂ denotes the estimated autocorrelation matrix of the clean speech, calculated by means of the available 300 samples.
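An Itakura-style distance of this form can be sketched as follows (ours, not from the paper; toy AR(1) numbers with a theoretical autocorrelation matrix, all illustrative). It is zero for the true vector and positive for a poor estimate, since the true parameters minimize the quadratic form.

```python
import numpy as np

# Itakura-style distance d = log(theta_hat^T R theta_hat / theta^T R theta).
a1, se2 = -0.9, 1.0
r0 = se2 / (1.0 - a1**2)
R = np.array([[r0, -a1 * r0], [-a1 * r0, r0]])   # theoretical AR(1) correlations

def itakura(theta_hat, theta, R):
    num = theta_hat @ R @ theta_hat
    den = theta @ R @ theta
    return float(np.log(num / den))

theta = np.array([a1, 1.0])          # true parameter vector [a_1, 1]
theta_bad = np.array([-0.5, 1.0])    # a deliberately poor estimate

d_same = itakura(theta, theta, R)
d_bad = itakura(theta_bad, theta, R)
```
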

According to the results presented in Table I, the EIV approach based on the SR and HOYW criteria provides the lowest Itakura distance for every SNR.

Concerning computational cost, it should be noted that the proposed method is based on the minimization of a cost function along q ∈ (0, +∞) and hence is computationally more intensive than the Levinson algorithm. However, simulation tests show


Fig. 3. Estimated AR spectra and corresponding poles in the z-plane for various methods. SNR = 10 dB.

that our method is not more computationally intensive than, for instance, Davila's method.

III. EIV SCHEME ANALYSIS AND CONNECTION WITH SOME STANDARD IDENTIFICATION METHODS

A. Connection With the Yule–Walker Solution

If σ_b² is known, the EIV scheme helps to retrieve the sets (β, ε, θ) corresponding to the pth-order AR process+noise models whose theoretical autocorrelation matrix is R_y. Here, we compare these solutions with the noise-compensated Yule–Walker (NCYW) solution [23]. More precisely, when the noise variance σ_b² is known, we are going to prove that the YW solution (a_YW, σ_e,YW²) corresponds to the EIV solution associated with the point P = (σ_b², ε).

The YW solution is obtained by applying, for instance, the Levinson algorithm to the noise-compensated correlation matrix R_y − σ_b² I_{p+1}, whereas the EIV solution ε is obtained by means of (10), as follows:

    det( R_y − σ_b² I_{p+1} − ε E ) = 0.    (19)

To show that ε = σ_e,YW², let us verify that σ_e,YW² is a solution of (19). By using the following decomposition of the noisy observation autocorrelation matrix

    R_y − σ_b² I_{p+1} = [ R_p     r
                           rᵀ   r_s(0) ]

where R_p is the p × p autocorrelation matrix of the clean signal and r = [r_s(p) ⋯ r_s(1)]ᵀ, the AR parameters can be expressed as follows:

    a_YW = −R_p⁻¹ r.


Fig. 4. Estimated AR spectra and corresponding poles in the z-plane for various methods. SNR = 5 dB.

The corresponding variance of the driving process is given by

    σ_e,YW² = r_s(0) + rᵀ a_YW.

Thus, by using the notations introduced in (5), a_YW and σ_e,YW² satisfy the conditions

    R_p a_YW + r = 0    (20)

and

    rᵀ a_YW + r_s(0) = σ_e,YW².    (21)

Then, let us replace ε with σ_e,YW² in relation (19) and develop the resulting determinant (see Appendix I for the proof):

    det( R_y − σ_b² I_{p+1} − σ_e,YW² E ) = det(R_p) ( r_s(0) − rᵀ R_p⁻¹ r − σ_e,YW² ).

It is thus possible to conclude, according to (20) and (21), that

    det( R_y − σ_b² I_{p+1} − σ_e,YW² E ) = 0.

Hence

    ε = σ_e,YW².    (22)


TABLE I
AVERAGE ITAKURA DISTANCE BASED ON 100 RUNS

Fig. 5. Representation of the specific points in the EIV scheme.

Moreover, it is easy to see from (20) and (21) that

    ( R_y − σ_b² I_{p+1} − σ_e,YW² E ) [a_YWᵀ 1]ᵀ = 0.

Consequently, according to (7)–(9), it follows that

    θ(P) = [a_YWᵀ 1]ᵀ.    (23)
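The NCYW principle used above can be sketched numerically (ours, not from the paper; a toy AR(2) with illustrative poles 0.8e^{±jπ/3}): subtracting the known σ_b² from the diagonal of the noisy correlation matrix before solving the Yule–Walker system recovers the parameters that plain YW on noisy data misses.

```python
import numpy as np

# Toy AR(2): s(k) = 0.8 s(k-1) - 0.64 s(k-2) + e(k) (regression form).
phi = np.array([0.8, -0.64])
se2, sb2 = 1.0, 1.0

# Theoretical r_s(0..2) from the Yule-Walker relations, as a linear system.
A = np.array([[phi[0], phi[1] - 1.0, 0.0],
              [phi[1], phi[0],      -1.0],
              [1.0,   -phi[0],      -phi[1]]])
r0, r1, r2 = np.linalg.solve(A, np.array([0.0, 0.0, se2]))

Rp_noisy = np.array([[r0 + sb2, r1], [r1, r0 + sb2]])   # noisy correlations
rhs = np.array([r1, r2])

phi_biased = np.linalg.solve(Rp_noisy, rhs)                   # plain YW: biased
phi_ncyw = np.linalg.solve(Rp_noisy - sb2 * np.eye(2), rhs)   # compensated: exact
```
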

B. Specific Cases

Let us now focus our attention on the specific cases associated with the minimum and maximum values of the candidate additive noise variance β.

First, when β = 0, the observations are considered as noise free. In Fig. 5, the corresponding point of the curve lies on the ordinate axis of the plane (β, ε) and, according to relations (22) and (23), the associated EIV solution is the standard YW solution.

Second, the maximal value of β corresponds to ε = 0 and is denoted by β_max. The corresponding point P_max = (β_max, 0) is represented in Fig. 5. According to relation (8), β_max is the maximum value of β such that the matrix R_y − β I_{p+1} is still positive semidefinite. Therefore, one has

    β_max = λ_min(R_y)

where λ_min(R_y) is the minimum eigenvalue of R_y. The resulting autocorrelation matrix of the signal, denoted as R_s^max, is then defined as follows:

    R_s^max = R_y − λ_min(R_y) I_{p+1}.

The matrix R_s^max is singular and hence necessarily corresponds to a discrete spectrum, which can be obtained by means of Pisarenko's decomposition [40]

    S_max(ω) = Σ_{i=1}^{p} P_i δ(ω − ω_i)

where δ(ω − ω_i) denotes the Dirac measure centered on the angular frequency ω_i. In that case, the transfer function 1/A(z) associated with the vector θ(P_max) = [a_p ⋯ a_1 1]ᵀ, with A(z) defined by

    A(z) = 1 + a_1 z⁻¹ + ⋯ + a_p z⁻ᵖ    (24)

has all its roots on the unit circle; their arguments are the angular frequencies ω_i. It corresponds to the spectrum of the sum of complex exponentials

    s(k) = Σ_{i=1}^{p} α_i e^{jω_i k}

satisfying the prediction equation

    s(k) = −a_1 s(k−1) − ⋯ − a_p s(k−p)

and associated with the standard subspace methods [41].
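A Pisarenko-style check of this limiting case can be sketched as follows (ours, not from the paper; a single real sinusoid in white noise, with illustrative frequency π/4 and noise variance 0.1): the minimum eigenvalue of the order-3 autocorrelation matrix equals the noise variance, and the roots of the polynomial built from the corresponding eigenvector lie on the unit circle at ± the sinusoid frequency.

```python
import numpy as np

omega, sn2 = np.pi / 4.0, 0.1              # sinusoid frequency, noise variance
r = 0.5 * np.cos(omega * np.arange(3))     # r_x(m) for a unit-amplitude sinusoid
Rx = np.array([[r[0], r[1], r[2]],
               [r[1], r[0], r[1]],
               [r[2], r[1], r[0]]])
Ry = Rx + sn2 * np.eye(3)                  # noisy autocorrelation matrix

w, V = np.linalg.eigh(Ry)                  # ascending eigenvalues
lam_min = w[0]                             # equals the noise variance
v = V[:, 0] / V[0, 0]                      # normalize leading coefficient to 1
roots = np.roots(v)                        # roots of v0 z^2 + v1 z + v2
angles = np.sort(np.angle(roots))          # should be -pi/4 and +pi/4
```
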

C. Features of the Estimated AR Spectrum

In this section, we propose to analyze the spectrum of the AR processes that can be obtained with the EIV scheme. For this purpose, let us first consider an AR process defined by (θ, σ_e²) and the corresponding transfer function 1/A(z), with A(z) given in (24). Then, let us define its spectral smoothness as follows:

    Ŝ = h²(0) / Σ_{n=0}^{∞} h²(n)    (25)

where h(n) denotes the impulse response of 1/A(z)


which is the inverse of the so-called dispersion of 1/A(z). (See Appendix II for more details.) Since h(0) = 1 and, by Parseval's relation, Σ_n h²(n) is the output power for a unit-variance white input, relation (25) can be rewritten as follows:

    Ŝ = 1/G    (26)

where G is the power of the AR process driven by a unit-variance white process

    G = (1/2π) ∫_{−π}^{π} dω / |A(e^{jω})|².    (27)

It is thereby possible to analyze the smoothness of the AR spectrum obtained when choosing a point on the curve C(p+1). On the one hand, for each point P = (β, ε) belonging to C(p+1), the power P_s of the corresponding AR process can be deduced from relation (8), as follows:

    P_s = P_y − β    (28)

where P_y is the power of the observations, which can be estimated by the relation

    P_y = (1/N) Σ_{k=1}^{N} y²(k).    (29)

On the other hand, since the variance of the driving process associated with P is ε, P_s can be decomposed into two factors

    P_s = ε G(P)    (30)

where G(P) is defined as in (27) when considering the transfer function associated with θ(P). Therefore, by combining (26), (28), and (30), the smoothness of the spectrum obtained at the point P is given by

    Ŝ(P) = 1/G(P) = ε / (P_y − β)

which is a decreasing function of β along the curve. As a consequence, the spectrum obtained using the EIV scheme is an intermediate spectrum, in terms of smoothness, between the smoothest one, obtained by the Yule–Walker equations applied to the noisy data, and the discrete spectrum associated with P_max, obtained when modeling the speech signal as the sum of complex exponentials.
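The quantity G of (27) is easy to check numerically (a sketch of ours, not from the paper; AR(1) with illustrative pole 0.9): averaging 1/|A(e^{jω})|² over a dense frequency grid matches the closed-form AR(1) power 1/(1 − a²).

```python
import numpy as np

# Numerical check of (27) for A(z) = 1 - 0.9 z^{-1}.
a_pole = 0.9
Nfft = 8192
w = 2.0 * np.pi * np.arange(Nfft) / Nfft
A = 1.0 - a_pole * np.exp(-1j * w)        # A(e^{jw}) on a dense grid
G_num = np.mean(1.0 / np.abs(A) ** 2)     # (1/2pi) integral approximation
G_ref = 1.0 / (1.0 - a_pole ** 2)         # closed form for AR(1)
```
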

D. SNR Associated With a Point of the EIV Curve

For every point of the EIV curve defined by itsset , one can deduce the corresponding SNR

According to (28), its minimum value is obtained when ismaximum, and hence one has

where

Relation (29) allows to write

(31)

where is the mean value of the eigenvalues of the covariance matrix.

Remark 3: This is especially the case when the poles of the noisy AR process are close to the unit circle in the z-plane. In that case

Remark 4: When the SNR is too low, the following relation holds:

It is thus possible to write

Finally, satisfies, according to (31)

To illustrate both remarks, the true SNR versus q is plotted, in decibels, for 150 frames of two noisy AR processes in Fig. 6(a). One of them has its poles close to the unit circle in the z-plane, whereas the other has its poles closer to the origin. In Fig. 6(b), the true SNR versus q is plotted, for every 256-sample frame of the French speech signal “le tribunal va bientôt rendre son jugement,” sampled at 8 kHz.

It is possible to observe the following three cases.

1) Case A: dB dB.
The true SNR is lower than 0 dB for pure AR processes and most often between −10 and −20 dB when dealing with speech frames. In that case, it is too difficult to distinguish the speech from the additive white noise4 and the EIV algorithm will tend to significantly underestimate the variance of the additive noise. Then, a clipping procedure should be carried out.

2) Case B: dB dB dB.
The EIV method can then be applied to the data and a small-order model is usually sufficient. In practice, the order can be a suitable choice. In speech processing, the corresponding frames are either silent frames or noise-like frames. The distinction can be performed by considering the power of the signal and the estimated variance of the additive noise.

3) Case C: dB dB.
The true SNR is high enough to apply the EIV method without restrictions. In that case, when dealing with noisy AR processes whose poles are close to the unit circle in

4It is also true for all the methods tested in Section V-B.


Fig. 6. True SNR versus q for (a) AR processes and (b) the frames of the French speech sentence “le tribunal va bientôt rendre son jugement.”

the z-plane, the estimated SNR is a good approximation of the true SNR. If the poles are closer to the origin, the true SNR will be higher than this estimate. In practice, orders between 10 and 16 are suitable.

Therefore, in theory, the proposed method should provide a solution which corresponds to an SNR higher than this minimum value. However, in real cases, due to the short-term analysis, this property does not necessarily hold, especially when the SNR is lower than 0 dB.
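The SNR attached to a point of the EIV curve can be sketched numerically. The sketch assumes, in the spirit of relations (28) and (29), that the speech power is the observation power minus the noise variance q, with the observation power estimated as a sample mean of squares; the function names are ours.

```python
import numpy as np

def observation_power(y):
    # Sample estimate of the observation power, in the spirit of (29).
    return float(np.mean(np.asarray(y, dtype=float) ** 2))

def snr_of_point(q, p_y):
    """SNR (in dB) associated with a point q of the EIV curve,
    assuming relation (28): speech power = observation power - q."""
    p_x = p_y - q
    if p_x <= 0.0 or q <= 0.0:
        raise ValueError("q must lie strictly between 0 and the observation power")
    return 10.0 * np.log10(p_x / q)
```

Since the SNR decreases as q grows, the minimum SNR discussed above is reached at the largest admissible q on the curve.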

IV. OPTIMAL SMOOTHING

A. A Solution for Optimal Smoothing of AR+Noise Models

Since observations are available, (1) can be evaluated over the whole frame. Thus, the set of relations can be written in the compact form

(32)

where is the vector stacking speech samples padded with driving process samples as follows:

(33)

and is the matrix

(34)

where

(35)

The zero-padded observation vector and the noise vector padded with driving process samples can now be defined as follows:

(36)

and

(37)

Relations (2) and (32)–(37) lead to the following compact relations:

In the following, they will be used to solve the optimal smoothing problem, i.e., to obtain the minimum variance estimate of the speech without introducing a state-space realization of the AR process.

Since the random noise vectors are uncorrelated and Gaussian, they are jointly Gaussian and thus the observation vector is also Gaussian. Consequently, the maximum likelihood (ML) estimation coincides with the minimum variance one and the smoothing problem can be solved by maximizing the likelihood function under the constraint (32)

(38)

where is a constant and denotes the covariance matrix of

(39)


The solution (38) can be obtained by introducing the vector of Lagrange multipliers and by minimizing the following Lagrangian:

(40)

By equating the gradient vectors of (40) to zero with respect to both unknowns, we obtain

The maximum likelihood solution is thus given by the expression

The optimal speech reconstruction is finally given by the first entries of the solution that, because of (33), (34), (36), and (39), are given by

(41)

B. Computational Complexity

The computational complexity of the optimal smoothing formula (41) is of order . This section shows that the band-Toeplitz structure of the matrix allows a very efficient implementation of (41), with a number of operations lower than that required by standard Kalman smoothing.

The enhanced speech can be obtained by means of the following steps.

1) Determine the matrix appearing in (41). This matrix is band-Toeplitz with a finite bandwidth and is completely characterized by a few distinct terms whose computation requires

multiplications.

2) Compute the intermediate vector, which requires

multiplications.

3) Solve the resulting banded linear system. This step can be efficiently implemented by noting that the matrix admits a factorization into a lower triangular matrix with the same bandwidth and a diagonal matrix [42]. Due to this decomposition, the solution can be obtained with a number of operations (multiplications and divisions) given by [42]

4) Compute the final vector, which requires

multiplications.

The total number of operations required to obtain the enhanced speech is thus given by

Standard Kalman smoothing is based on the following state-space realization of the model (1), (2) [13]:

where

and

To obtain the smoothed signal, the following filtering equations must first be computed for each sample of the frame:

(42)

(43)

(44)

(45)

(46)

Subsequently, the following smoother equations must be computed, in backward recursion, for each sample:

(47)

(48)

(49)

Note that (43) and (46) must be initialized by choosing suitable initial values.
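As a reference point for the operation counts below, the standard Kalman filter and fixed-interval (RTS-type) smoother for the AR-plus-noise model can be sketched as follows. The companion-form realization and the function name are our assumptions; a = [1, a_1, ..., a_p] holds the AR coefficients.

```python
import numpy as np

def kalman_rts_smooth(y, a, sigma2_e, sigma2_b):
    """Kalman filtering (forward) and RTS smoothing (backward) for an
    AR(p) signal observed in additive white noise."""
    a = np.asarray(a, dtype=float)
    y = np.asarray(y, dtype=float)
    p, n = len(a) - 1, len(y)
    # Companion-form state-space realization of the AR model.
    F = np.zeros((p, p))
    F[0, :] = -a[1:]
    if p > 1:
        F[1:, :-1] = np.eye(p - 1)
    H = np.zeros(p)
    H[0] = 1.0
    Q = np.zeros((p, p))
    Q[0, 0] = sigma2_e  # driving noise enters the first state only
    x_pred = np.zeros((n, p)); P_pred = np.zeros((n, p, p))
    x_filt = np.zeros((n, p)); P_filt = np.zeros((n, p, p))
    x, P = np.zeros(p), np.eye(p)
    for k in range(n):                       # forward pass, cf. (42)-(46)
        x_pred[k], P_pred[k] = x, P
        S = H @ P @ H + sigma2_b             # innovation variance
        K = P @ H / S                        # Kalman gain
        x = x + K * (y[k] - H @ x)
        P = P - np.outer(K, H @ P)
        x_filt[k], P_filt[k] = x, P
        x, P = F @ x, F @ P @ F.T + Q        # time update
    x_s, P_s = x_filt.copy(), P_filt.copy()
    for k in range(n - 2, -1, -1):           # backward pass, cf. (47)-(49)
        C = P_filt[k] @ F.T @ np.linalg.inv(P_pred[k + 1])
        x_s[k] = x_filt[k] + C @ (x_s[k + 1] - x_pred[k + 1])
        P_s[k] = P_filt[k] + C @ (P_s[k + 1] - P_pred[k + 1]) @ C.T
    return x_s[:, 0]                         # smoothed speech samples
```

The per-sample matrix products in both passes are what make the operation counts discussed next grow with the cube of the model order.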

The number of operations (multiplications and divisions) associated with the Kalman filter (42)–(46) is given by [43], [44]

The computational cost of Kalman filtering is thus . The operations required by the smoother (47)–(49) are the following:


The resulting computational cost is thus . The total number of operations associated with Kalman smoothing is finally given by

It can be noted that the total number of operations required by the optimal smoothing procedure is not only lower than the number of operations required by standard Kalman smoothing but also lower than that required by standard Kalman filtering.
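The banded steps above can be illustrated with a banded solver. The sketch below uses a Wiener-type reformulation of the minimum-variance smoother, (DᵀD/σ²_e + I/σ²_b) ŝ = y/σ²_b with D the banded prediction-error matrix, as a stand-in for formula (41) (an assumption on our part); only the bandwidth-p structure matters for the cost argument, and the band matrix is built densely here for clarity.

```python
import numpy as np
from scipy.linalg import solveh_banded

def smooth_ar_plus_noise(y, a, sigma2_e, sigma2_b):
    """Minimum-variance smoothing of an AR(p) signal in white noise via
    the banded normal equations (D^T D / s_e + I / s_b) s = y / s_b,
    where D is built from the AR coefficients a = [1, a_1, ..., a_p]."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=float)
    n, p = y.size, len(a) - 1
    # Rows of D hold [a_p, ..., a_1, 1] along a band of width p+1.
    D = np.zeros((n - p, n))
    for i in range(n - p):
        D[i, i:i + p + 1] = a[::-1]
    M = D.T @ D / sigma2_e + np.eye(n) / sigma2_b
    # Keep only the p+1 lower diagonals; the banded Cholesky-type
    # factorization then solves the system in O(n p^2) operations [42].
    ab = np.zeros((p + 1, n))
    for d in range(p + 1):
        ab[d, :n - d] = np.diag(M, -d)
    return solveh_banded(ab, y / sigma2_b, lower=True)
```

In a production implementation, the p + 1 distinct diagonals of the normal matrix would be formed directly, so that no dense n × n array is ever allocated.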

V. APPLICATION TO SPEECH ENHANCEMENT

In this section, our purpose is to combine the EIV-based approach for AR parameter estimation and the optimal smoothing procedure to enhance a speech signal.

A. Simulation Protocol

Here, the purpose is to enhance four French speech signals, sampled at 8 kHz and disturbed by an additive noise for an input SNR varying from 5 to 15 dB. The four clean speech signals are “le tribunal va bientôt rendre son jugement,”5 “mes yeux s’accoutument lentement à la pénombre,”6 “plusieurs éditeurs lui renvoyèrent son manuscrit,”7 and “le musée ouvrira ses portes en décembre.”8 It should be noted that most of the phonemes of the French language appear in these sentences.

We carry out a comparative study between the following twelve methods.

1) the so-called EIV (SR) method that combines the algorithms given in Section II (using the SR criterion) and Section IV;
2) the so-called EIV (HOYW) method which combines the algorithms given in Section II (using the HOYW criterion) and Section IV;
3) the EM-based algorithm [13];
4) the approach proposed by Paliwal [12], using a standard Kalman filter;
5) Paliwal’s method [12], using a Kalman smoother;
6) Labarre’s method [15], using two cross Kalman filters;
7) Gibson’s approach [19];
8) Shen’s approach based on H∞ filtering [17];
9) Ephraim’s method [2], based on short-time spectral amplitude estimation;
10) the subspace method using LS estimation [6];
11) the subspace method introducing temporal perception criteria (TDC) as suggested by Ephraim [4];
12) the subspace method introducing spectral perception criteria (SDC) as suggested by Ephraim [4].

They all operate frame by frame, using an overlap-add method with 50% overlap; 256 samples are used for each frame. In methods 1–8, the autoregressive order has been set to 10 to fit the first five formants of the speech signal. The proposed algorithms do not require the knowledge of the variance of the

5The Court will soon render its verdict.
6My eyes slowly get accustomed to the darkness.
7Several editors sent back his manuscript.
8The museum’s grand opening is in December.

additive noise, unlike methods 4–12. In our simulations, it is estimated during silent frames that are assumed to be known. In practice, a VAD should be used.
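The frame-by-frame protocol (256-sample frames, 50% overlap, overlap-add reconstruction) can be sketched as follows; the Hann analysis window and the function name are our assumptions, since the paper only specifies the frame length and the overlap.

```python
import numpy as np

def overlap_add_process(x, enhance_frame, frame_len=256):
    """Frame-by-frame processing with 50% overlap and overlap-add
    reconstruction.  `enhance_frame` maps one windowed frame to its
    enhanced version (identity here would reproduce the input)."""
    hop = frame_len // 2
    window = np.hanning(frame_len)  # assumed analysis window
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = window * x[start:start + frame_len]
        out[start:start + frame_len] += enhance_frame(frame)
    return out
```

With a Hann window and a 50% hop, overlapping windows sum approximately to one, so the interior of the signal is reconstructed without a separate synthesis window.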

B. Simulation Results

The simulation results are presented in Table II, in terms of both global SNR and segmental SNR improvements. Indeed, another quality measure [45] consists in averaging the SNR measured over short frames

where is the SNR defined in the third footnote and calculated for the th frame, excluding silent periods and frames where the SNR is higher than 35 dB or lower than 0 dB. Moreover, denotes the number of such frames. Monte Carlo simulations are based on 100 noise realizations for each sentence. The results are then averaged over the four sentences.
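The segmental SNR measure described above can be sketched as follows; the energy-based silence detection threshold and the function name are our assumptions.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=0.0, hi=35.0):
    """Segmental SNR [45]: frame SNRs (in dB) are averaged, excluding
    silent frames (here, frames below a small energy threshold -- an
    assumption) and frames whose SNR falls outside [0, 35] dB."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    silence_thr = 1e-3 * np.mean(clean ** 2)
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        p_sig = np.mean(c ** 2)
        p_err = np.mean((c - e) ** 2)
        if p_sig < silence_thr or p_err == 0.0:
            continue  # skip silent or perfectly reconstructed frames
        snr = 10.0 * np.log10(p_sig / p_err)
        if lo <= snr <= hi:
            snrs.append(snr)
    return float(np.mean(snrs)) if snrs else float("nan")
```

Clipping the per-frame SNR to [0, 35] dB keeps a few nearly perfect or nearly destroyed frames from dominating the average.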

Informal subjective tests confirm that subspace methods provide an enhanced signal disturbed by a residual musical noise. The perception criteria proposed in [4], introduced in methods 11 and 12, make it possible to improve the quality of the enhanced speech, even if the SNR improvements are somewhat low. Moreover, when an AR model is used for speech, the enhanced signals are disturbed by a residual broadband noise with methods 4, 6, and 7. The subspace-based methods 10–12 provide better numerical results, but they are computationally intensive since they require a singular value decomposition of a data matrix for every frame. Moreover, as mentioned in Section I, they introduce a musical noise.

It should be noted that the proposed enhancement method leads to the presence of a slight residual musical noise. However, as suggested by Ma et al. [18], [33], a postfiltering that combines frequency and time-domain masking properties can be carried out to weaken this residual noise.

C. Noise Variance Tracking

In this subsection, the relevance of our approach is studied when tracking the variance of the additive noise. Indeed, the noise variance can be obtained from every frame by means of the EIV method. However, since its estimations may vary significantly from one frame to another, the following tracking procedure can be used:

where and are the estimation of the variance of the additive noise for the th frame and the value obtained using the EIV algorithm, respectively. The forgetting factor makes it possible to adjust the algorithm to slow variations of the additive noise variance.
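The tracking procedure can be sketched as a standard exponential-forgetting recursion; the exact form of the update is our assumption, as is the function name.

```python
def track_noise_variance(q_frames, forgetting=0.9):
    """Exponential smoothing of the frame-wise EIV noise-variance
    estimates q_frames, assumed here to follow the standard recursion
        sigma2_t = forgetting * sigma2_{t-1} + (1 - forgetting) * q_t.
    A forgetting factor close to 1 tracks slow variations smoothly."""
    sigma2 = float(q_frames[0])
    track = [sigma2]
    for q in q_frames[1:]:
        sigma2 = forgetting * sigma2 + (1.0 - forgetting) * q
        track.append(sigma2)
    return track
```

Raising the forgetting factor smooths out the frame-to-frame scatter of the raw EIV estimates at the cost of a slower response to genuine changes of the noise level.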

Here, we compare our method with the EM algorithm when sentence 1 is disturbed by an additive noise, the variance of which varies slowly in time. The estimations of this variance provided by both the EIV approach and the EM method are represented in Fig. 7 and illustrate the noise tracking capabilities of our method.


TABLE II: Average SNR and segmental SNR improvements, in decibels, based on the four sentences and 100 noise realizations for each one. Three global input SNRs are addressed: 5, 10, and 15 dB.

Fig. 7. Variance tracking for the French sentence “le tribunal va bientôt rendre son jugement.” SNR = 10 dB.

VI. CONCLUSION

In this paper, a solution to the single-microphone speech enhancement problem has been investigated; this solution is based on an AR model for the speech signal. The EIV formulation of this problem makes it possible to carry out the joint estimation of the AR parameters and the noise variances. Finally, the minimum variance smoothing of the speech, based on the AR model estimated for each frame, has been derived.

It should be of great interest to investigate the extension of this approach to the additive colored noise case.

APPENDIX I

Our purpose is to express the determinant of the following symmetric matrix:

where the sizes of the three blocks are, respectively, , , and (1 × 1). The first block is also assumed to be invertible. From the product

we can immediately derive the relation
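With generic symbols Q, b, and c for the three blocks (our notation), the product in question is the standard Schur-complement factorization, which yields the determinant directly:

```latex
\begin{pmatrix} Q & b \\ b^{T} & c \end{pmatrix}
=
\begin{pmatrix} I & 0 \\ b^{T}Q^{-1} & 1 \end{pmatrix}
\begin{pmatrix} Q & b \\ 0 & c - b^{T}Q^{-1}b \end{pmatrix}
\qquad\Longrightarrow\qquad
\det \begin{pmatrix} Q & b \\ b^{T} & c \end{pmatrix}
= \det(Q)\,\bigl(c - b^{T}Q^{-1}b\bigr).
```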

APPENDIX II

Let us consider the AR transfer function defined in (24). The mean value of the function on , with respect to the measure , where is the Lebesgue measure, is the following:

Proof: Assume for the sake of simplicity that it can be written as follows:

BOBILLET et al.: SPEECH ENHANCEMENT COMBINING OPTIMAL SMOOTHING AND EIVs IDENTIFICATION 5577

where denotes the th root. The function can thus be expressed in the following way:

for some , . We can thus obtain

where denotes the unit circle. However, since , it is possible to write the relation

that completes the proof.

It can be noted that the dispersion of the function on , with respect to the measure , is given by

REFERENCES

[1] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113–120, Apr. 1979.

[2] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1109–1121, Dec. 1984.

[3] S. Doclo and M. Moonen, “GSVD-based optimal filtering for single and multimicrophone speech enhancement,” IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2230–2244, Sep. 2002.

[4] Y. Ephraim and H. L. Van Trees, “A signal subspace approach for speech enhancement,” IEEE Trans. Speech Audio Process., vol. 3, no. 4, pp. 251–266, Jul. 1995.

[5] S. Van Huffel, “Enhanced resolution based on minimum variance estimation and exponential data modeling,” Signal Process., vol. 33, pp. 333–355, Sep. 1993.

[6] S. H. Jensen, P. C. Hansen, S. D. Hansen, and J. A. Sørensen, “Reduction of broad-band noise in speech by truncated QSVD,” IEEE Trans. Speech Audio Process., vol. 3, no. 6, pp. 439–448, Nov. 1995.

[7] J. Jensen, R. C. Hendriks, R. Heusdens, and S. H. Jensen, “Smoothed subspace based noise suppression with application to speech enhancement,” presented at the EUSIPCO, Singapore, 2005.

[8] B. D. Kovacevic, M. M. Milosavljevic, and M. D. Veinovic, “Robust recursive AR speech analysis,” Signal Process., vol. 44, pp. 125–138, Jun. 1995.

[9] “Coding of speech at 16 kbit/s using low-delay code excited linear prediction,” CCITT Recommendation G.728, 1992.

[10] J. Lim and A. Oppenheim, “All-pole modeling of degraded speech,” IEEE Trans. Acoust., Speech, Signal Process., vol. 26, no. 3, pp. 197–210, Jun. 1978.

[11] J. H. L. Hansen and M. Clements, “Constrained iterative speech enhancement with application to speech recognition,” IEEE Trans. Signal Process., vol. 39, no. 4, pp. 795–805, Apr. 1991.

[12] K. K. Paliwal and A. Basu, “A speech enhancement method based on Kalman filtering,” in Proc. ICASSP, 1987, pp. 177–180.

[13] S. Gannot, D. Burshtein, and E. Weinstein, “Iterative and sequential Kalman filter-based speech enhancement algorithms,” IEEE Trans. Speech Audio Process., vol. 6, no. 4, pp. 373–385, Jul. 1998.

[14] E. Grivel, M. Gabrea, and M. Najim, “Speech enhancement as a realization issue,” Signal Process., vol. 82, pp. 1963–1978, Dec. 2002.

[15] D. Labarre, E. Grivel, M. Najim, and E. Todini, “Two-Kalman filters based instrumental variable techniques for speech enhancement,” in Proc. 6th IEEE Workshop Multimedia Signal Process., 2004, pp. 375–378.

[16] D. Labarre, E. Grivel, N. Christov, and M. Najim, “Relevance of H∞ filtering for speech enhancement,” in Proc. ICASSP, 2005, pp. 169–172.

[17] X. Shen and L. Deng, “A dynamic system approach to speech enhancement using the H∞ filtering algorithm,” IEEE Trans. Speech Audio Process., vol. 7, no. 4, pp. 391–399, Jul. 1999.

[18] N. Ma, M. Bouchard, and R. A. Goubran, “Speech enhancement using a masking threshold constrained Kalman filter and its heuristic implementation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 19–32, Jan. 2006.

[19] J. D. Gibson, B. Koo, and S. D. Gray, “Filtering of colored noise for speech enhancement and coding,” IEEE Trans. Signal Process., vol. 39, no. 8, pp. 1732–1742, Aug. 1991.

[20] S. M. Kay, “The effects of noise on the autoregressive spectral estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 5, pp. 478–485, Oct. 1979.

[21] W. X. Zheng, “Fast identification of autoregressive signals from noisy data,” IEEE Trans. Circuits Syst. II, Express Briefs, vol. 52, no. 1, pp. 43–48, Jan. 2005.

[22] C. E. Davila, “A subspace approach to estimation of autoregressive parameters from noisy measurements,” IEEE Trans. Signal Process., vol. 46, no. 2, pp. 531–534, Feb. 1998.

[23] S. M. Kay, “Noise compensation for autoregressive spectral estimates,” IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 3, pp. 292–303, Mar. 1980.

[24] J. R. Treichler, “Transient and convergent behavior of the adaptive line enhancer,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 1, pp. 53–62, Feb. 1979.

[25] W. R. Wu and P. C. Chen, “Adaptive AR modeling in white Gaussian noise,” IEEE Trans. Signal Process., vol. 45, no. 5, pp. 1184–1191, May 1997.

[26] D. Labarre, E. Grivel, Y. Berthoumieu, M. Najim, and E. Todini, “Consistent estimation of AR parameters from noisy observations based on two interacting Kalman filters,” Signal Process., vol. 86, pp. 2863–2876, Oct. 2006.

[27] M. Deriche, “AR parameter estimation from noisy data using the EM algorithm,” in Proc. ICASSP, 1994, pp. 69–72.

[28] M. Gabrea, E. Grivel, and M. Najim, “A single microphone Kalman filter-based noise canceller,” IEEE Signal Process. Lett., vol. 6, no. 3, pp. 55–57, Mar. 1999.

[29] R. K. Mehra, “On the identification of variances and adaptive Kalman filtering,” IEEE Trans. Autom. Control, vol. 15, no. 2, pp. 175–184, Apr. 1970.

[30] S. Korl, H.-A. Loeliger, and A. G. Lindgren, “AR model parameter estimation: From factor graphs to algorithms,” in Proc. ICASSP, 2004, pp. 509–512.

[31] N. Virag, “Single channel speech enhancement based on masking properties of the human auditory system,” IEEE Trans. Speech Audio Process., vol. 7, no. 2, pp. 126–137, Mar. 1999.

[32] Y. Hu and P. C. Loizou, “Incorporating a psychoacoustical model in frequency domain speech enhancement,” IEEE Signal Process. Lett., vol. 11, no. 2, pp. 270–273, Feb. 2004.

[33] N. Ma, M. Bouchard, and R. A. Goubran, “Perceptual Kalman filtering for speech enhancement in colored noise,” in Proc. ICASSP, 2004, pp. 717–720.

[34] S. Van Huffel and P. Lemmerling, Eds., Total Least Squares and Errors-in-Variables Modeling: Analysis, Algorithms and Applications. Norwell, MA: Kluwer, 2002.

[35] R. P. Guidorzi, “Certain models from uncertain data: The algebraic case,” Syst. Control Lett., vol. 17, pp. 415–424, Dec. 1991.

[36] R. Diversi, U. Soverini, and R. Guidorzi, “A new estimation approach for AR models in presence of noise,” in Proc. Preprints 16th IFAC World Congr., 2005, pp. 290–294.

[37] S. Beghelli, R. Guidorzi, and U. Soverini, “The Frisch scheme in dynamic system identification,” Automatica, vol. 26, pp. 171–176, Jan. 1990.


[38] R. Diversi, R. Guidorzi, and U. Soverini, “A noise-compensated estimation scheme for AR processes,” in Proc. 44th IEEE Conf. Decision Control, Eur. Control Conf., 2005, pp. 4146–4151.

[39] F. Itakura, “Minimum prediction residual principle applied to speech recognition,” IEEE Trans. Acoust., Speech, Signal Process., vol. 23, no. 1, pp. 67–72, Feb. 1975.

[40] V. Pisarenko, “The retrieval of harmonics from a covariance function,” Geophys. J. Roy. Astronom. Soc., vol. 33, pp. 347–366, 1973.

[41] C. W. Therrien, Discrete Random Signals and Statistical Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1992.

[42] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD: Johns Hopkins Univ. Press, 1983.

[43] I. A. Gura and A. B. Bierman, “On computational efficiency of linear filtering algorithms,” Automatica, vol. 7, pp. 299–314, May 1971.

[44] J. M. Mendel, “Computational requirements for a discrete Kalman filter,” IEEE Trans. Autom. Control, vol. 16, no. 6, pp. 748–758, Dec. 1971.

[45] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals. Piscataway, NJ: IEEE Press, 2000.

William Bobillet received the electronics engineering degree from the École Nationale Supérieure d’Électronique, Informatique et Radiocommunications de Bordeaux, Bordeaux, France, in 2002, and the M.S. degree in signal processing from the University of Bordeaux 1, Bordeaux, France, in 2003, where he is currently pursuing the Ph.D. degree in image and signal.

His research interests include system modeling and errors-in-variables identification, with applications in mobile communication and speech processing.

Roberto Diversi was born in Faenza, Italy, in July 1970. He received the “Laurea” degree in electronic engineering and the Ph.D. degree in system engineering from the University of Bologna, Bologna, Italy, in 1996 and 2000, respectively.

Since 1997, he has been with the Department of Electronics, Computer Science, and Systems (DEIS), University of Bologna, where he is currently a Research Associate. His research interests include system identification, optimal filtering, signal processing, and fault detection.

Eric Grivel received the diploma of engineer in electronics and the Ph.D. degree in signal processing from the University of Bordeaux, Bordeaux, France, in 1996 and 2000, respectively.

He is currently an Associate Professor with the Telecommunications Department, ENSEIRB, a graduate national engineering school in Bordeaux which belongs to the Signal and Image Research Group, University of Bordeaux. His research interests include the design of parametric approaches for signal processing with applications in mobile communication and speech processing.

Roberto Guidorzi has held the chair of System Theory at the University of Bologna, Bologna, Italy, since 1980. From 1987 to 1992, he was the Director of the Computer Centre, Engineering School of Bologna University, and since January 1997, he has been the Director of the Interfaculty Center for Advanced and Multimedia Technologies (CITAM).

He has been a Visiting Professor and Invited Speaker in European and American universities and has collaborated with several industries in the development of advanced projects, among them management systems for gas pipeline networks, satellite navigation systems, computerized injection systems, software for the identification of natural gas reservoirs, tracking and data fusion, and early diagnosis in railway systems. He is the author of about 200 publications dealing with subjects of a methodological nature as well as applications of system theory methodologies. His present research interests include errors-in-variables identification and filtering, blind channel equalization, and the development of e-learning environments.

Mohamed Najim (M’74–SM’83–F’89) received the Dr.Sci. degree (Doctorat d’Etat) from the University of Toulouse, Toulouse, France, in 1972.

In 1972, he joined the University of Rabat, Rabat, Morocco, as an Associate Professor and he became a Professor in 1974. Since 1988, he has been a Professor with the ENSEIRB/University of Bordeaux, Bordeaux, France, where he founded the Signal and Image Processing Laboratory. He has worked in various fields, including microwaves, modeling and identification, adaptive filtering including H infinity, and control. His research interests presently include modeling and identification in multidimensional signal and image processing with applications in speech, seismic, biomedical, radar signal processing, textures, and image enhancement. He supervised more than 50 Ph.D. theses and published over 220 scientific papers and coauthored several books, including Parametric Modelling in Image Processing (Masson, 1994). He is the author of the book Modelling and Identification in Signal Processing (Masson, 1988). He set up the “CAD Software Library” Group within the GDR-CNRS TdSI, which is a French research program on signal and image processing.

Dr. Najim has organized 15 international conferences on control, signal processing, and friendly exchange through the Internet. He coorganized with Prof. Th. Kailath (Stanford University) and Prof. P. Dewilde (Delft University) two workshops respectively dedicated to maths and systems (Saint-Émilion, France, 1997) and structured algorithms (Cadzand, The Netherlands, 2002). He was the Cochairman of the IEEE Statistical Signal Processing Conference ’05 organized in Bordeaux, France, July 2005. He was the Editor of numerous symposia proceedings. From 1981 to 1990, he was a member of the IFAC Technical Board. He is currently a member of the Technical Committee on Digital Signal Processing of the IEEE Circuits and Systems Society. Since 1999, he has been an elected Associate Member of the Third World Academy of Sciences. He has managed various projects with industrial partners (including Texas Instruments, Digital Equipment, ST Microelectronics). He is currently the head of LASIS, a joint TOTAL/CNRS laboratory.

Umberto Soverini was born in Bologna, Italy, in 1959. He received the “Laurea” degree in electronic engineering and the Ph.D. degree in system engineering from the University of Bologna, Bologna, Italy, in 1985 and 1990, respectively.

In 1992, he was appointed as a Research Associate with the Department of Electronics, Computer Science, and Systems, University of Bologna, where he is currently an Associate Professor. In 1999, he was a Visiting Researcher with the Department of Systems and Control, Uppsala University, Uppsala, Sweden. His research interests include stochastic realization theory, signal processing, and system identification.