HKIE TRANSACTIONS, 2017, VOL. 24, NO. 1, 24–34
http://dx.doi.org/10.1080/1023697X.2016.1210545

A speech enhancement method based on sparse reconstruction on log-spectra

Tak Wai Shen and Daniel P K Lun
Centre for Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, People's Republic of China

ABSTRACT
A new speech enhancement method using sparse reconstruction of the log-spectra is presented. Similar to the traditional sparse coding methods, the proposed algorithm makes use of the least angle regression (LARS) with a coherence criterion (LARC) algorithm to reconstruct the log power spectrum of clean speech. However, a new stopping criterion is introduced to allow the LARC algorithm to adapt to various background noise environments. In addition, a modified two-step noise reduction with a log-MMSE filter is applied which solves the bias of the estimated a-priori signal-to-noise ratio (SNR). A notable improvement of the proposed algorithm over traditional speech enhancement methods is its adaptability to changes in the SNR of noisy speech. The performance of the proposed algorithm is evaluated using standard measures based on a large set of speech and noise signals. The results show that a significant improvement is achieved compared to traditional approaches, especially in non-stationary noise environments where most traditional algorithms fail to perform.

ARTICLE HISTORY
Received 7 January 2016
Accepted 5 July 2016

KEYWORDS
Speech enhancement; dictionary learning; sparse representation

1. Introduction

Single-channel speech enhancement is widely used in applications such as speech recognition, voice communications and information forensics [1]. However, it is hard to remove noise successfully without causing distortion in the estimated speech. The efficiency of most frequency-domain speech enhancement approaches relies on the accurate estimation of the clean speech power spectrum (PS), which is often difficult to achieve [1–6]. Recently, a speech enhancement algorithm was proposed using an expectation-maximisation (EM) framework working in the cepstral domain [7]. Since the coefficients of speech in the cepstral domain are sparse, a good estimation of the speech PS is achieved, which in turn improves the speech enhancement performance. While the sparsity of the cepstrum representation is the key to the success of the algorithm proposed in [7], a question naturally arises as to whether there are other representations of speech that can give an even better performance in terms of sparsity.

Recently, the techniques of sparse representation and dictionary learning have been widely investigated and have provided possible solutions for many signal processing problems. The goal of these techniques is to look for the sparsest representation of a signal as a linear combination of atoms in an overcomplete dictionary. They have been adopted in many applications in speech processing, such as speech recognition [8], voice activity detection (VAD) [9] and speaker identification [10]. They have also been introduced to speech enhancement methods [11–14]. For instance, in [13], the approximate K-singular value decomposition (K-SVD) [15] algorithm and the least angle regression (LARS) with a coherence criterion (LARC) algorithm were used to learn a composite dictionary and reconstruct the speech spectral amplitude. The LARC method extends the LARS algorithm with a residual coherence stopping criterion and optimises it to solve a large number of simultaneous coding problems efficiently. The residual coherence stopping criterion of LARC is invariant to changes in signal energy. In addition, the LARC method allows the code and dictionary entries to assume values over the entire real domain. However, the composite dictionary depends on a noise dictionary, which is difficult to train and inconvenient to use because background noise types are diverse and their properties are not known in advance in general applications. On the other hand, a speech enhancement method using the sparse reconstruction of the approximated power spectral density (PSD) through the magnitude-squared spectrum was presented in [14]. The approximate K-SVD algorithm with a non-negative constraint was applied to train the PSD dictionary of the clean speech signal. The enhanced speech was obtained by combining the estimated PSD with the signal subspace approach based on the short-time spectral amplitude (SSB-STSA). The above studies demonstrated that sparse coding methods can improve the quality of noisy speech. In this work,

CONTACT Tak Wai Shen [email protected]

© 2017 The Hong Kong Institution of Engineers


the above-mentioned methods are further extended to apply sparse coding techniques to the reconstruction of the log power spectrum (LPS) of speech. It is well known that distortion measures based on the mean-square error of the log-spectra are more appropriate for speech processing [3]. In addition, a new adaptive residual coherence threshold is proposed as the stopping criterion, which enables the speech dictionary to adapt to various noise environments in order to improve the enhancement performance. As shown in the simulation results, the proposed algorithm demonstrates an outstanding performance, particularly when the contaminating noises are not totally random but contain a certain structure in the frequency domain. Better performance is obtained in all evaluated cases under different standard measures as compared with state-of-the-art speech enhancement techniques.

This paper is organised as follows. In Section 2, some key formulations used in traditional speech enhancement algorithms are recalled. This is followed by a brief introduction to the traditional framework of sparse coding and dictionary learning for speech enhancement in Section 3. The new framework of sparse representation, dictionary learning and reconstruction on log power spectra for speech enhancement is described in Section 4. The simulation results are then presented in Section 5 and conclusions are drawn in Section 6.

2. Related works in speech enhancement

A speech signal $x$ is contaminated by additive noise $n$ such that the noisy speech $y = x + n$. In the frequency domain, this process can be modelled as $Y(k,i) = X(k,i) + N(k,i)$, where $Y(k,i)$, $X(k,i)$ and $N(k,i)$ represent the $k$th spectral component of the short-time frame $i$ of the Fourier transform of $y$, $x$ and $n$, respectively. For improved readability, the segment index $i$ is dropped wherever possible; thus, $S_y(k)$, $S_x(k)$ and $S_n(k)$ are the PS of the noisy speech, the underlying clean speech and the input additive noise, respectively. The noise reduction process consists in the application of a spectral gain $G(k)$ to each short-time spectrum value $Y(k)$. In practice, the spectral gain requires the evaluation of two parameters. The a-posteriori signal-to-noise ratio (SNR) $\gamma(k)$ can be evaluated based on the observed noisy speech, and the a-priori SNR $\xi_{DD}(k)$ can be estimated by the decision-directed approach:

$$\gamma(k,i) = \frac{|Y(k,i)|^2}{\hat{S}_n(k,i)}, \qquad \xi_{DD}(k,i) = \alpha \frac{\hat{S}_x(k,i-1)}{\hat{S}_n(k,i)} + (1-\alpha)\max\{\gamma(k,i)-1,\, 0\}, \tag{1}$$

where $\hat{S}_n$ is the estimator of $S_n$. The enhanced speech can be obtained by applying the gain function to the spectrum of the noisy speech as follows:

$$\hat{X}(k) = G(k)Y(k) \quad \text{where} \quad G(k) = g(\gamma(k), \xi_{DD}(k)). \tag{2}$$

The function $g$ can be any of the different gain functions proposed in the literature [3]. Two of the most popular gain functions are the Wiener filter and the log-MMSE gain functions. Their most commonly used basic forms are [1]:

$$G_{\text{Wiener}}(k) = \frac{\xi_{DD}(k)}{1+\xi_{DD}(k)}, \tag{3}$$

$$G_{\text{log-MMSE}}(k) = \frac{\xi_{DD}(k)}{1+\xi_{DD}(k)} \exp\left\{\frac{1}{2}\int_{v(k)}^{\infty} \frac{e^{-t}}{t}\,dt\right\} \quad \text{where} \quad v(k) = \frac{\xi_{DD}(k)}{1+\xi_{DD}(k)}\,\gamma(k). \tag{4}$$
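As a concrete illustration, the decision-directed estimate of Equation (1) and the gains of Equations (3) and (4) can be sketched in a few lines. This is a minimal sketch with our own function and variable names; `scipy.special.exp1` is assumed for the exponential integral in Equation (4).

```python
import numpy as np
from scipy.special import exp1  # E1(x) = integral from x to infinity of e^-t / t dt

def dd_snr(Y_mag2, Sx_prev, Sn_hat, alpha=0.98):
    """A-posteriori SNR and decision-directed a-priori SNR, Eq. (1).
    Y_mag2: |Y|^2 per bin; Sx_prev: previous frame's clean-speech PS
    estimate; Sn_hat: noise PS estimate."""
    gamma = Y_mag2 / Sn_hat
    xi = alpha * Sx_prev / Sn_hat + (1 - alpha) * np.maximum(gamma - 1, 0)
    return gamma, xi

def wiener_gain(xi):
    """Wiener gain, Eq. (3)."""
    return xi / (1 + xi)

def log_mmse_gain(xi, gamma):
    """log-MMSE gain, Eq. (4)."""
    v = xi / (1 + xi) * gamma
    return xi / (1 + xi) * np.exp(0.5 * exp1(v))
```

Since $\exp\{E_1(v)/2\} > 1$, the log-MMSE gain always exceeds the Wiener gain for the same $\xi$, reflecting its gentler suppression of low-energy bins.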

In [4], a two-step noise reduction (TSNR) technique was introduced to improve the noise reduction performance. In the first step of the TSNR, the spectral gain $G_{\text{Wiener}}(k)$ is computed as in Equation (3) and is used to generate an initial guess of the enhanced speech. This is used in the second step to estimate the a-priori SNR as follows:

$$\xi_{TSNR}(k,i) = \alpha' \frac{|G_{\text{Wiener}}(k,i)Y(k,i)|^2}{\hat{S}_n(k,i)} + (1-\alpha')\max\{\gamma(k,i+1)-1,\, 0\}. \tag{5}$$

In order to avoid the additional processing delay caused by the use of the future frame $(i+1)$, the parameter $\alpha'$ is set to 1. Thus, Equation (5) becomes:

$$\xi_{TSNR}(k,i) = \frac{|G_{\text{Wiener}}(k,i)Y(k,i)|^2}{\hat{S}_n(k,i)}. \tag{6}$$

Finally, the new a-priori SNR is used in the Wiener filter gain function as shown below:

$$G_{TSNR}(k) = \frac{\xi_{TSNR}(k)}{1+\xi_{TSNR}(k)}. \tag{7}$$

This TSNR technique will be modified and applied to the proposed algorithm.
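Under the same illustrative naming, the TSNR refinement of Equations (3), (6) and (7) with $\alpha' = 1$ amounts to squaring the first-step Wiener gain inside the a-priori SNR estimate. A sketch, not the authors' implementation:

```python
import numpy as np

def tsnr_gain(Y_mag2, Sx_prev, Sn_hat, alpha=0.98):
    """Two-step noise reduction gain (Eqs. (3), (6) and (7)), per frame."""
    gamma = Y_mag2 / Sn_hat
    xi_dd = alpha * Sx_prev / Sn_hat + (1 - alpha) * np.maximum(gamma - 1, 0)
    g1 = xi_dd / (1 + xi_dd)               # first step: Wiener gain, Eq. (3)
    xi_tsnr = g1 ** 2 * Y_mag2 / Sn_hat    # refined a-priori SNR, Eq. (6)
    return xi_tsnr / (1 + xi_tsnr)         # second step: Eq. (7)
```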

In addition, it is shown in [6] that the temporal cepstrum smoothing (TCS) method can give a good estimation of the a-priori SNR for some non-stationary noise environments. The algorithm can be implemented by first computing the maximum-likelihood (ML) estimation of the a-priori clean speech PS as follows:

$$\hat{S}_x^{ML}(k,i) = \hat{S}_n(k,i)\max\{\xi_{ML}(k),\ \xi_{ML}^{min}\}, \tag{8}$$

where $\xi_{ML}^{min}$ is the minimum value allowed for $\xi_{ML}(k)$, and $\xi_{ML}(k) = \gamma(k) - 1$ is the ML estimation of the


a-priori SNR. Next, the cepstral representation of $\hat{S}_x^{ML}(k,i)$ is computed as:

$$C_x(q) = \text{IDFT}\{\log(\hat{S}_x^{ML}(k))\}, \tag{9}$$

where $q$ is the cepstral index, also known as the quefrency index [16]. Then the selected cepstral coefficients are recursively smoothed over time with a quefrency-dependent parameter $\alpha_{TCS}(q)$ as follows:

$$C_x^{TCS}(q,i) = \alpha_{TCS}(q)\, C_x^{TCS}(q,i-1) + (1-\alpha_{TCS}(q))\, C_x(q,i). \tag{10}$$

The parameter $\alpha_{TCS}(q)$ is itself smoothed recursively using:

$$\alpha_{TCS}(q,i) = \begin{cases} \alpha_{pitch} & \text{if } q \in C_{pitch}(i) \\ \beta\,\alpha_{TCS}(q,i-1) + (1-\beta)\,\alpha_q^{const} & \text{otherwise,} \end{cases} \tag{11}$$

where $C_{pitch}$ refers to the set of cepstral bin indices associated with the fundamental frequency. All parameters, including $\alpha_{pitch}$, $\beta$ and $\alpha_q^{const}$ for different values of $q$, need to be determined empirically [6]. In this case, the estimation of the clean speech PS can be computed by:

$$\hat{S}_x^{TCS} = \exp(\text{DFT}(C_x^{TCS})). \tag{12}$$
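One frame of the cepstral smoothing recursion (Equations (9), (10) and (12)) can be sketched as follows; the pitch-dependent update of Equation (11) is omitted for brevity, and the helper names are ours.

```python
import numpy as np

def tcs_smooth(Sx_ml, C_prev, alpha_q):
    """One TCS frame: cepstrum of the ML power spectrum estimate (Eq. 9),
    recursive smoothing with per-quefrency factors alpha_q (Eq. 10), and
    transformation back to a power spectrum estimate (Eq. 12)."""
    C = np.fft.ifft(np.log(Sx_ml)).real           # real cepstrum, Eq. (9)
    C_tcs = alpha_q * C_prev + (1 - alpha_q) * C  # Eq. (10)
    S_tcs = np.exp(np.fft.fft(C_tcs).real)        # Eq. (12)
    return C_tcs, S_tcs
```

With `alpha_q = 0` (no smoothing) the round trip returns the input spectrum; values of `alpha_q` near 1 at noise-dominated quefrencies suppress frame-to-frame fluctuations.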

3. Sparse coding techniques for speech enhancement

The basic model of sparse coding suggests that natural signals can be efficiently explained as linear combinations of prespecified atom signals in overcomplete dictionaries, where the linear coefficients are sparse (most of them are zero). Formally, if $y$ is a column signal and $D$ is the dictionary (whose columns are the atom signals), then sparse coding can be described using a cardinality constraint:

$$c^* = \arg\min_c \|y - Dc\|_2 \quad \text{subject to} \quad \|c\|_0 \le K \ll P, \tag{13}$$

or using an error constraint:

$$c^* = \arg\min_c \|c\|_0 \quad \text{subject to} \quad \|y - Dc\|_2 \le \varepsilon, \tag{14}$$

where $\varepsilon$ is the error tolerance, $\|\cdot\|_0$ is the $L_0$ pseudo-norm counting the non-zero entries of a vector, and $K$ is the target sparsity. The matrix $D \in \mathbb{R}^{N \times P}$ with $N < P$ is an overcomplete dictionary, usually normalised by the $L_2$ norm. The vector $c \in \mathbb{R}^P$ is the sparse coefficient vector of the signal $y \in \mathbb{R}^N$. Since $N < P$, the system is underdetermined and admits infinitely many solutions; finding the sparsest one is an NP-hard problem. The desired solution can be estimated by using (i) greedy search methods such as matching pursuit (MP), orthogonal MP (OMP) or gradient pursuit (GP); (ii) nonconvex local optimisation methods such as the focal underdetermined system solver (FOCUSS); or (iii) convex relaxation methods such as the least absolute shrinkage and selection operator (LASSO), the LARS and LARC, and others [15,17]. The LARS [18] is a very efficient model selection algorithm that gives a solution closely resembling that of the LASSO (and with a simple modification, it can be made to give the LASSO solution exactly). As with OMP, each iteration of the LARS consists of an atom selection and a coding coefficient update step. Atom selection is based on the maximal correlation with the current residual. For the coefficient update step, the LARS moves the active coefficients in the equiangular direction until a new atom has equal correlation with the residual as all atoms in the active set. The termination criterion of the LARS is based on a coding cardinality or a residual norm value. Based on the LARS, the coding algorithm LARC [13] uses a residual coherence threshold as the sparsity parameter. As the LARC method is also a greedy algorithm, the coherent components are coded before the incoherent components, and the maximum residual coherence decreases in each iteration. A residual coherence threshold $\tau_l$ is used as the stopping criterion, which does not depend on the magnitude of the observation. In contrast to specifying a cardinality or a residual norm, it is not necessary to adapt the residual coherence threshold to the data on a frame-by-frame basis. This enables a trade-off between source distortion and source confusion by controlling the coding sparsity.
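For reference, the greedy pursuit family mentioned above can be illustrated with a minimal OMP for the cardinality-constrained problem of Equation (13). This is a generic textbook sketch, not the LARS/LARC update:

```python
import numpy as np

def omp(D, y, K):
    """Orthogonal matching pursuit: pick the atom most correlated with the
    residual, then refit all active coefficients by least squares.
    Assumes D has unit-norm columns."""
    active = []
    c = np.zeros(D.shape[1])
    r = y.astype(float).copy()
    for _ in range(K):
        j = int(np.argmax(np.abs(D.T @ r)))       # atom selection
        if j not in active:
            active.append(j)
        c_a, *_ = np.linalg.lstsq(D[:, active], y, rcond=None)
        r = y - D[:, active] @ c_a                # residual update
    c[active] = c_a
    return c
```

LARS differs in the coefficient update (the equiangular step grows all active coefficients together), but shares this select-then-update structure.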

The above methods are based on the assumption that the speech dictionary is already known. When applying them to speech enhancement problems, a proper speech dictionary needs to be designed. Methods of dictionary learning were presented in [19]. The K-SVD algorithm is one of the most popular methods for dictionary learning [20]. It takes an initial overcomplete dictionary $D_0$, a set of training signals arranged as the columns of a matrix $Y$ and the iteration number $k$. It iteratively improves the dictionary to achieve a sparse coding of the signals in $Y$ by solving the following optimisation problem:

$$\min_{D, C_s} \|Y - DC_s\|_F^2 \quad \text{subject to} \quad \forall i\ \|c_i\|_0 \le K, \tag{15}$$

where $\|\cdot\|_F^2$ denotes the squared Frobenius norm, $c_i$ is the $i$th column of $C_s$ (the coefficient vector of the $i$th training signal), and $K$ is the desired sparsity. The approximation error can be reduced by increasing the target sparsity $K$, but this also increases the computational time. As $D$ and $C_s$ are both unknown, the objective function of Equation (15) is not convex. The K-SVD algorithm solves it by alternating between the sparse coding of the examples based on the current dictionary and updating the dictionary's atoms. The sparse coding step is commonly implemented using OMP. When updating the dictionary, one atom is processed at a time to optimise the target function while keeping the rest fixed.


To further reduce complexity, an approximate K-SVD algorithm was proposed that uses power iterations in the dictionary update step [15].
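The per-atom update at the heart of K-SVD can be sketched with an exact rank-1 SVD; the approximate variant replaces the SVD with one or two power-iteration-style steps. Names here are illustrative:

```python
import numpy as np

def ksvd_atom_update(D, C, Y, j):
    """Exact K-SVD update of atom j (cf. Eq. 15): restrict to the signals
    that use the atom, remove its contribution from the reconstruction,
    and refit atom and coefficients from the leading singular pair of the
    residual matrix."""
    I = np.flatnonzero(C[j, :])                  # signals using atom j
    if I.size == 0:
        return D, C
    E = Y[:, I] - D @ C[:, I] + np.outer(D[:, j], C[j, I])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, j] = U[:, 0]                            # new unit-norm atom
    C[j, I] = s[0] * Vt[0, :]                    # new coefficients
    return D, C
```

Each such update can only decrease the Frobenius objective of Equation (15), which is why the alternating scheme converges to a local optimum.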

When applying the sparse coding technique to speech enhancement, it is desirable to have the dictionary $D^{(s)}$ trained to be coherent with the speech signal and incoherent with the background noise signal. This can be achieved relatively easily when the background noise is white Gaussian and does not contain any structure; such background noise is incoherent with any fixed dictionary, and in particular with the speech dictionary [21]. However, many relevant kinds of background noise contain structure. If the background noise is partially coherent with the speech dictionary, it will also incur strong coding coefficients that are very difficult to remove. To solve this problem, it is suggested that a coherent noise dictionary $D^{(i)}$ for structured background noises should be trained. It is shown in [13] that by using a composite dictionary $D = [D^{(s)}\ D^{(i)}]$, a significantly improved enhancement performance can be achieved compared with using a single speech dictionary.

4. The new framework of sparse coding and dictionary learning of the LPS for speech enhancement

It is well known that distortion measures based on the mean square error of the log-spectra are more appropriate for speech processing [3]. In fact, the LPS of a signal is related to its log periodogram by the following formulation [22]:

$$\log(\hat{S}_x(k)) = \log(S_x(k)) + R(k) \quad \text{where} \quad R(k) = \epsilon(k) - \gamma, \tag{16}$$

where the $\epsilon(k)$ are i.i.d. with zero mean and a fixed variance $\pi^2/6$, and $\gamma \approx 0.577216$ is Euler's constant. The LPS of the estimated speech can also be expressed in vector form as follows:

$$\bar{S}_{\log} = D_{\log} c_{\log} + R, \tag{17}$$

where $\bar{S}_{\log} = \log(\hat{S}_x) \in \mathbb{R}^N$ is a column vector, the matrix $D_{\log} = \{d_{\log j}\}_{j=1}^P \in \mathbb{R}^{N \times P}$ $(N < P)$ containing $P$ atoms is an overcomplete dictionary, usually normalised by the $L_2$ norm, and the vector $c_{\log} \in \mathbb{R}^P$ is the sparse coding coefficient vector such that:

$$S_{\log} = D_{\log} c_{\log}, \tag{18}$$

where $S_{\log} = \log(S_x)$. Using an error constraint, the sparse coding can be described as the following minimisation process:

$$\hat{c}_{\log} = \arg\min_{c_{\log}} \|c_{\log}\|_0 \quad \text{subject to} \quad \|\bar{S}_{\log} - D_{\log} c_{\log}\|_2 \le \varepsilon. \tag{19}$$

To solve the above minimisation problem, the LARC algorithm is directly adopted; it is a greedy method consisting of an atom selection and a coding coefficient update step in each iteration. Different from [13], the LARC algorithm is here applied to reconstruct the LPS of the clean speech. The detailed procedure is shown in Algorithm 1.

Algorithm 1 Batch LARC on LPS.
1. Input: $\hat{S}_x \in \mathbb{R}^N$; $D_{\log} \in \mathbb{R}^{N \times P}$; $G = D_{\log}^T D_{\log}$; $\tau_l$
2. $\bar{S}_{\log} \leftarrow \log(\hat{S}_x)$
3. Normalise $x_N \leftarrow \{\bar{S}_{\log} - \text{mean}(\bar{S}_{\log})\}/\text{std}(\bar{S}_{\log})$
4. Initialise coefficient vector $c_{\log}^{(0)} \leftarrow 0$ and fitted vector $y_N^{(0)} \leftarrow 0$
5. Initialise active set $A \leftarrow \{\}$; number of active atoms $K_A \leftarrow 0$
6. $\mu^{(x)} \leftarrow D_{\log}^T x_N$; $\mu^{(y)} \leftarrow 0$
7. While $|A| < P$ do
8. $\quad \mu \leftarrow \mu^{(x)} - \mu^{(y)}$
9. $\quad j^* \leftarrow \arg\max_j |\mu_j|,\ j \in A^c$
10. $\quad A \leftarrow A \cup \{j^*\}$; $K_A \leftarrow K_A + 1$
11. $\quad$ If $|\mu_{j^*}|/\|x_N - y_N\|_2 < \tau_l$, then break
12. $\quad s \leftarrow \text{sign}(\mu_A)$
13. $\quad g \leftarrow G^{-1}(A,A)\, s$
14. $\quad b \leftarrow (g^T s)^{-1/2}$
15. $\quad w \leftarrow bg$
16. $\quad u \leftarrow D_{\log}(:,A)\, w$
17. $\quad a \leftarrow G(:,A)\, w$
18. $\quad$ Calculate step length $\varphi \leftarrow \min^{+}_{e \in A^c}\left(\dfrac{|\mu_{j^*}| - \mu_e}{b - a_e},\ \dfrac{|\mu_{j^*}| + \mu_e}{b + a_e}\right)$
19. $\quad \mu^{(y)} \leftarrow \mu^{(y)} + \varphi a$
20. $\quad$ Update fitted vector $y_N \leftarrow y_N + \varphi u$
21. $\quad$ Update regression coefficients $c_{\log A} \leftarrow c_{\log A} + \varphi w$
22. End while
23. $\hat{S}_{\log} \leftarrow D_{\log} c_{\log}$
24. Denormalise $\hat{S}_{\log} \leftarrow \hat{S}_{\log} \times \text{std}(\bar{S}_{\log}) + \text{mean}(\bar{S}_{\log})$
25. Output: $\hat{S}_{\log}$ and $K_A$
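The coherence-based stopping rule of Algorithm 1 (step 11) can be illustrated with a simplified greedy coder. Note that this sketch substitutes an OMP-style least-squares refit for the LARS equiangular step, so it demonstrates the stopping criterion only; all names are ours.

```python
import numpy as np

def coherence_stopped_coding(D, x, tau):
    """Greedy coding that stops when the maximum residual coherence
    |mu_j| / ||r||_2 drops below tau, mirroring step 11 of Algorithm 1.
    Returns the coefficients and the active-set size K_A."""
    r = x.astype(float).copy()
    active = []
    c = np.zeros(D.shape[1])
    while len(active) < D.shape[1] and np.linalg.norm(r) > 1e-12:
        mu = D.T @ r
        j = int(np.argmax(np.abs(mu)))
        if np.abs(mu[j]) / np.linalg.norm(r) < tau:
            break                                  # residual coherence test
        active.append(j)
        c_a, *_ = np.linalg.lstsq(D[:, active], x, rcond=None)
        c[:] = 0.0
        c[active] = c_a
        r = x - D @ c
    return c, len(active)
```

A larger `tau` stops earlier (sparser codes, more distortion); a smaller `tau` codes more of the residual, including any noise components that are coherent with the dictionary.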

The above procedure shows that Algorithm 1 can be used to obtain a good estimate of the sparse coding coefficients of the true PS of the clean speech from its periodogram, based on the speech dictionary $D_{\log}$. The problem is how to obtain the periodogram of the clean speech in the first place. To do so, a modified TSNR procedure is proposed which is similar to the EM algorithm [7]. Firstly, a rough estimate of the periodogram of the clean speech is made from the noisy observation using a traditional speech enhancement algorithm. Then, based on Algorithm 1, the initial estimate of the true PS of the clean speech is obtained. This is then used to compute the a-priori


SNR and in turn the minimum mean square error log-spectral amplitude (MMSE-LSA) gain function to refine the estimate of the clean speech periodogram. Finally, the enhanced speech is obtained.

More specifically, the TCS method [6] is adopted to obtain the initial estimate of the clean speech periodogram, as it works reasonably well for non-stationary noises and can give an acceptable estimate of $S_x$ from the noisy $S_y$ without the need for a very accurate a-priori SNR estimator. The estimated LPS of the clean speech $\hat{S}_{\log}$ is then obtained by Algorithm 1, and the enhanced speech $\hat{X}$ is obtained by using the modified TSNR gain function, in which the a-priori SNR is computed based on the current estimate $\hat{S}_{\log}$. The MMSE-LSA gain function theoretically gives the minimum mean square error estimate of the log-magnitude spectra of speech. More specifically, the estimate of the a-priori SNR is refined in the first step as follows:

$$\xi_{TSL}(k) = \frac{|G_{FS}(k)Y(k)|^2}{\hat{S}_n(k)}, \tag{20}$$

where $G_{FS}(k) = G_{\text{log-MMSE}}(\gamma(k), \xi_{FS}(k))$ and $\xi_{FS} = \exp(\hat{S}_{\log})/\hat{S}_n$. Then, the MMSE-LSA gain function is computed and applied to obtain the clean speech estimate as follows:

$$G_{SRTSL}(k) = \frac{\xi_{TSL}(k)}{1+\xi_{TSL}(k)} \exp\left\{\frac{1}{2}\int_{v_{TSL}(k)}^{\infty} \frac{e^{-t}}{t}\,dt\right\} \quad \text{where} \quad v_{TSL}(k) = \frac{\xi_{TSL}(k)}{1+\xi_{TSL}(k)}\,\gamma(k). \tag{21}$$

The enhanced speech $\hat{X}$ can thus be obtained by:

$$|\hat{X}| = G_{SRTSL}(k) \cdot |Y|, \qquad \hat{X} = |\hat{X}|\exp(j\angle Y), \tag{22}$$

where $\angle Y$ is the phase angle of $Y$.

As mentioned above, the dictionary $D_{\log}$ needs to

be obtained before the sparse reconstruction stage. The training of the dictionary $D_{\log}$ can be carried out by solving the following optimisation problem:

$$\min_{D_{\log}, C_{\log}} \|S_{\log} - D_{\log} C_{\log}\|_F^2 \quad \text{subject to} \quad \forall i\ \|c_{\log i}\|_0 \le K, \tag{23}$$

where $c_{\log i}$ is the $i$th column of $C_{\log}$, and $S_{\log}$ is the log true PS of the clean speech. In practice, these are approximated by their log periodograms. The approximate K-SVD method is directly adopted to train the dictionary $D_{\log}$. The complete algorithm is shown in Algorithm 2.

Algorithm 2 Approximate K-SVD on log power spectra.
1. Input: signal set $S_x \in \mathbb{R}^{N \times \text{frameNum}}$; initial dictionary $D_{\log 0} \in \mathbb{R}^{N \times P}$; target sparsity $K$; number of iterations $n$
2. Set $S_{\log} \leftarrow \log(S_x)$; $D_{\log} \leftarrow D_{\log 0}$
3. Normalise $X_N \leftarrow \{S_{\log} - \text{mean}(S_{\log})\}/\text{std}(S_{\log})$
4. For iter $= 1$ to $n$ do
5. $\quad \forall i:\ C_{\log i} \leftarrow \arg\min_{c_{\log}} \|X_{Ni} - D_{\log} c_{\log}\|_2^2$ subject to $\|c_{\log}\|_0 \le K$
6. $\quad$ For $j = 1$ to $P$ do
7. $\qquad D_{\log}(:,j) \leftarrow 0$
8. $\qquad I \leftarrow$ {indices of the signals in $S_{\log}$ whose representations use $d_j$}
9. $\qquad g \leftarrow C_{\log j,I}^T$
10. $\qquad d \leftarrow X_{NI}\, g - D_{\log} C_{\log I}\, g$
11. $\qquad d \leftarrow d/\|d\|_2$
12. $\qquad g \leftarrow X_{NI}^T d - (D_{\log} C_{\log I})^T d$
13. $\qquad D_{\log}(:,j) \leftarrow d$
14. $\qquad C_{\log j,I} \leftarrow g^T$
15. $\quad$ End for
16. End for
17. Output dictionary $D_{\log} \in \mathbb{R}^{N \times P}$

4.1. Adaptive residual coherence threshold

For the LARC algorithm as described in Algorithm 1, a residual coherence threshold $\tau_l$ is used to define the stopping criterion of the iterative sparse coding process (Algorithm 1, step 11). Basically, its selection is not critical if the background noise is incoherent with the speech dictionary (such as white noise) [13]. This is the case to begin with when applying the proposed algorithm, since the true PS of speech differs from its periodogram by an i.i.d. error function (Equation (16)). Nevertheless, since the initial estimate of $\hat{S}_x$ actually comes from the TCS, some of the background noise – which can be structured – may still remain in the initial estimate of $\hat{S}_x$. Such residual background noise components are thus coherent with the speech dictionary. The selection of $\tau_l$ becomes critical in this case, as it can significantly affect the final speech enhancement performance. To illustrate this, a comparison of the enhancement performance of the proposed speech enhancement algorithm with various residual coherence thresholds $\tau_l$ is shown in Figure 1. More specifically, the perceptual evaluation of speech quality (PESQ) scores of the enhanced speech generated by the proposed method are compared for different values of $\tau_l$. PESQ is an International Telecommunication Union standard for evaluating speech quality. The results show that when the residual coherence threshold $\tau_l$ is too high (e.g. $\tau_l = 0.4$), poor performance results for all types of background noise, as distortion occurs during reconstruction. For white noise, which is incoherent with the speech dictionary, better results are obtained with a


[Figure 1 shows three bar charts of PESQ scores for babble, white and buccaneer noise at $\tau_l$ = 0.1, 0.2, 0.3 and 0.4.]

Figure 1. Enhancement performance of the proposed speech enhancement algorithm SRLPS-TSL for $\tau_l$ = 0.1, 0.2, 0.3 and 0.4 with an input SNR of: (a) 0 dB; (b) 10 dB; and (c) 20 dB.

lower threshold value (e.g. $\tau_l = 0.1$). For babble noise, which is highly coherent with the speech dictionary, a higher threshold value (e.g. $\tau_l = 0.3$) results in a better performance. For buccaneer noise, which is partially coherent with the speech dictionary, lower threshold values are more favourable when the input SNR is low (0 dB and 10 dB), but when the input SNR is higher, a higher threshold value is preferred (e.g. $\tau_l = 0.3$). The above shows that the speech enhancement performance is affected not only by the coherence between the background noise and the dictionary, but also by the input SNR.

To deal with this problem, the traditional approach uses a composite dictionary that consists of both a speech and a background noise dictionary. However, it is difficult to train the background noise dictionary, as the noise properties are not known in advance for most applications. Instead, an adaptive residual coherence threshold is proposed here, such that the threshold value can be adjusted automatically for different kinds of noise and different levels of input SNR.

When closely examining Algorithm 1, it is noticed that the parameter $K_A$ (Algorithm 1, step 5) can give a good indication of the coherence between the input signal and the speech dictionary. So, if the PS of a noise signal is input into Algorithm 1, $K_A$ can be used to indicate the coherence between the noise and the speech dictionary. To illustrate this, the first step is to use VAD [1] to detect the noise frames in a noisy speech signal. Then, the PS of the noise is estimated by taking the average of the periodograms of all the detected noise frames. This is then fed into the LARC algorithm with the trained speech dictionary $D_{\log}$ and a fixed $\tau_l$ (e.g. $\tau_l = 0.5$). The number of iterations that the algorithm requires to converge (i.e. the parameter $K_A$ in Algorithm 1, step 10) is recorded and averaged with that of the previous noise frames. The experimental results show that such an average $K_A$ has a close relationship to the coherence between the background noise and the speech. Table 1 shows the results. For noises with higher coherence (e.g. babble noise), the value of $K_A$ can be very large (e.g. 14.05), while for noises with lower coherence (e.g. pink noise), the value of $K_A$ is smaller (e.g. 3.51). The results in Table 1 show that $K_A$ can be used as an indicator for choosing $\tau_l$ for noisy speech frames. Here $\tau_l = 0.5$ is selected because when $\tau_l$ is too high, only one iteration is needed for some types of noise with lower coherence, while when $\tau_l$ is too low, the computation time per frame increases. Therefore, $\tau_l = 0.5$ is deemed a suitable threshold for distinguishing the coherence between background noise and speech.

As mentioned above, the input SNR can also affect the selection of the residual coherence threshold $\tau_l$. However, its effect is not linear but similar to the Wiener gain function, i.e. it flattens out when the noise level is sufficiently small. Since the Wiener filter gain function $G_{\text{Wiener}}(k)$ in Equation (3) can be updated in every frame, it is proposed that the mean of the Wiener filter gain function is used as a parameter to compute the residual coherence threshold. More specifically, a parameter $h_i$ is defined for a noisy speech frame as follows:

$$h_i = \frac{1}{N}\sum_{k=0}^{N-1} G_{\text{Wiener}}(k). \tag{24}$$

Then, the residual coherence threshold for a specific noise $\tau_n$ can be obtained as follows:

$$\tau_n = \min\{\max\{\max\{(K_A - b_1)/b_2,\, 0\} + b_3 h_i,\ \tau_{min}\},\ \tau_{max}\}, \tag{25}$$

Table 1. Summary of the average number of active sets $K_A$ versus different background noises ($\tau_l$ = 0.5).

Noise                    Average $K_A$
White                    1.00
Speech babble            14.05
Destroyer engine room    9.45
F16 cockpit              8.99
Pink                     3.51
Buccaneer cockpit        6.03
M109 tank                7.50
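Equations (24) and (25), together with the recursive smoothing of Equation (26), combine $K_A$ and the mean Wiener gain into the adaptive threshold. A sketch using the parameter values quoted in the text, with illustrative function and argument names:

```python
import numpy as np

def adaptive_threshold(K_A, g_wiener, tau_prev, alpha_a=0.8,
                       b1=20.0, b2=25.0, b3=0.5, tau_min=0.1, tau_max=0.3):
    """Adaptive residual coherence threshold, Eqs. (24)-(26). K_A is the
    averaged active-set size of the noise estimate coded by Algorithm 1;
    g_wiener holds the current frame's Wiener gains; tau_prev is the
    smoothed threshold from the previous frame."""
    h_i = float(np.mean(g_wiener))                                   # Eq. (24)
    tau_n = min(max(max((K_A - b1) / b2, 0.0) + b3 * h_i, tau_min),
                tau_max)                                             # Eq. (25)
    return alpha_a * tau_prev + (1.0 - alpha_a) * tau_n              # Eq. (26)
```

Noise that is incoherent with the dictionary (small $K_A$) and heavy noise (small mean gain) both push the threshold towards $\tau_{min}$, allowing more atoms to be coded; coherent noise and high input SNR push it towards $\tau_{max}$.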


where τmin = 0.1 and τmax = 0.3; three parametersb1 = 20, b2 = 25 and b3 = 0.5 are set empirically.Equation (25) is formulated based on the observationexplained earlier that the residual coherence thresholdτn should be proportional to the coherence betweenthe residual background noise and the speech dictio-nary (represented by the parameter KA) and the inputSNR (represented by the parameter hi). The parameterb1 is the offset, while b2 and b3 are the weighting fac-tors, selected by fitting the PESQ results of the proposedalgorithm using Equation (25) to over 80 speech sam-ples of different genders, noise types and noise levels(in fact, their selection is not sensitive to these factors).To reduce the fluctuation between frames, τn is fur-ther smoothed by taking a weighted average with theestimate in the previous frame as follows:

τa(i) = αa τa(i − 1) + (1 − αa) τn(i), (26)

where the smoothing factor αa is set to 0.8; τa is thus the proposed adaptive residual coherence threshold. To summarise, the proposed speech enhancement algorithm based on the new framework of sparse reconstruction on log power spectra with the two-step MMSE-LSA filtering (SRLPS-TSL) can be described as follows:

(1) Use Algorithm 2 to train Dlog from the clean speech.

Figure 2. The operation of the proposed speech enhancement algorithm SRLPS-TSL. (Block diagram: clean speech → dictionary learning (Algorithm 2) → Dlog; noisy speech → initial guess by temporal cepstrum smoothing (TCS) → sparse coding (Algorithm 1) → Slog; noise estimation by voice activity detection (VAD) → sparse coding (Algorithm 1, τl = 0.5) → τa; first step of the TSL, Equations (4) and (20) → ξTSL(k); second step of the TSL, Equations (21) and (22) → enhanced speech X.)

(2) For the noisy speech input, compute the initial guess of the LPS Slog using the TCS method of Equation (12), i.e. Slog = log(S_x^TCS).

(3) Compute the adaptive residual coherence threshold τa using Equation (26) and Algorithm 1 with the noise estimation Sn(k).

(4) Estimate the LPS Slog of the clean speech using Algorithm 1 and τl = τa.

(5) For the first step of TSL, estimate the a-priori SNR ξTSL(k) using Equations (4) and (20).

(6) For the second step of TSL, obtain the enhanced speech spectrum X using Equations (21) and (22).

(7) Obtain the enhanced speech signal by the inverse discrete Fourier transform (IDFT) and overlap-add.
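Step (7) is the standard synthesis stage. A minimal sketch, assuming rectangular analysis frames of 512 samples, a 1024-point FFT and a 128-sample hop (the settings used later in the simulations):

```python
import numpy as np

def overlap_add(frames_spec, frame_len=512, hop=128):
    """IDFT each enhanced one-sided spectrum and overlap-add (step 7).

    frames_spec: array of shape (n_frames, fft_len // 2 + 1).
    Assumes rectangular frames with a constant 4x overlap (hop = frame_len / 4).
    """
    n_frames = frames_spec.shape[0]
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i in range(n_frames):
        frame = np.fft.irfft(frames_spec[i])[:frame_len]  # drop zero-pad tail
        out[i * hop:i * hop + frame_len] += frame
    return out * (hop / frame_len)  # undo the constant 4x overlap factor
```

With a smooth synthesis window the normalisation would follow the window's overlap sum instead of the constant factor used here; this sketch only illustrates the IDFT-plus-overlap-add structure of the final step.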

The operation of the algorithm is also described in Figure 2.
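The adaptive threshold computation of Equations (24)–(26), used in step (3) above, reduces to a few lines. The helper below is an illustrative sketch using the paper's symbols and constants; it is not the authors' code:

```python
import numpy as np

def adaptive_threshold(wiener_gain, K_A, tau_prev,
                       b1=20.0, b2=25.0, b3=0.5,
                       tau_min=0.1, tau_max=0.3, alpha_a=0.8):
    """Adaptive residual coherence threshold of Equations (24)-(26).

    wiener_gain: per-bin Wiener gains G(k) of the current frame;
    K_A: average active-set size from sparse coding of the noise estimate;
    tau_prev: smoothed threshold tau_a of the previous frame.
    """
    h_i = float(np.mean(wiener_gain))                        # Eq. (24)
    tau_n = min(max(max((K_A - b1) / b2, 0.0) + b3 * h_i,
                    tau_min), tau_max)                       # Eq. (25)
    return alpha_a * tau_prev + (1.0 - alpha_a) * tau_n      # Eq. (26)
```

Note how the clamp to [τmin, τmax] keeps the stopping criterion usable even when KA or the mean Wiener gain takes an extreme value.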

5. Simulations and results

In this section, the performance of the proposed algorithm is presented and compared to the performance of state-of-the-art speech enhancement methods. To start with, an example is used to illustrate the importance of the proposed algorithm working on the log power spectra instead of the normal power spectra. Figure 3 shows a segment of a typical noisy speech periodogram compared to its original clean speech periodogram, the enhanced speech periodogram using the proposed speech enhancement algorithm on the PS and the proposed speech enhancement algorithm on the LPS. It can be seen that both the reconstructed periodograms (PS and LPS) can restore the spectral peaks of the original speech. However, the LPS method restores the spectral valleys much better than the PS method. This example demonstrates that it is more appropriate to apply the sparse reconstruction method to log-spectra for speech enhancement.

A comparison of the spectrograms of enhanced speech generated using different algorithms with coloured noise at the input is shown in Figure 4.

Figure 3. Effect of the noise reduction with sparse reconstruction and dictionary learning on a voiced frame with pink noise and an input segSNR of −3.69 dB. (Magnitude (dB) versus frequency (kHz).) Note: Original = the original speech spectrum; Noisy = the noisy speech spectrum; PS = the reconstruction on the power spectrum; LPS = the reconstruction on the log power spectrum.

Table 2 gives a summary of the algorithms that are compared. The speech sampling rate is 16 kHz. The simulation details are as follows: the frame size is 512 samples (∼32 ms), the fast Fourier transform (FFT) size is 1024 samples (each frame is zero-padded with 512 samples) and the window shift step size is 128 samples (75% overlap). The LPS dictionary of clean speech is trained by the approximate K-SVD MATLAB Toolbox [15]. The sparsity target K as mentioned in Equation (15) is set to 10 and the number of iterations is set to 30. The dictionary Dlog is learned using 100 sentences extracted from the training portion of the Texas Instruments (TI) and Massachusetts Institute of Technology (MIT) database,

Figure 4. Spectrograms (frequency in kHz versus time in seconds) of the speech sample with Leopard noise and an input segSNR of −5.79 dB: (a) original; (b) noisy; (c) MMSE-LSA; (d) HRNR; (e) MMSE-LSA CEM; and (f) the proposed SRLPS-TSL.


Table 2. Summary of the algorithms compared in the simulations.

Method          Description
MMSE-LSA        Minimum mean-square error log-spectral amplitude estimator [3]
HRNR            Harmonic regeneration noise reduction algorithm [4]
MMSE-LSA CEM    EM-based cepstrum smoothing using a MMSE-LSA filter [7]
SRLPS-TSL       The proposed algorithm, based on sparse reconstruction on log power spectra with two-step MMSE-LSA filtering

which is different from the testing portion. No sentence appears in both the training and testing portions. The size of the dictionary Dlog is 513 × 1024 and it is initialised by randomly taking the training data. For all algorithms, the noise PS is estimated by first using the initial frames that are assumed to have no speech energy, and then updating whenever a frame is detected as having no speech energy by using VAD [23]. Figure 4(b) shows what happens when the military vehicle

Table 3. Composite measurement comparison of different algorithms. (Input SNR ranging from 0 dB to +20 dB; higher scores of Csig, Cbak and Covl indicate better performance.)

Noise / Method     Csig (SNR 0/5/10/15/20 dB)     Cbak (SNR 0/5/10/15/20 dB)     Covl (SNR 0/5/10/15/20 dB)

White
  Noisy       1.34 1.92 2.61 3.26 3.87    1.72 2.17 2.64 3.12 3.59    1.35 1.87 2.42 2.95 3.45
  M-L         1.84 2.54 3.16 3.67 4.05    2.18 2.59 2.97 3.28 3.51    1.80 2.37 2.86 3.26 3.55
  TSNR        1.48 2.10 2.56 3.00 3.52    2.11 2.58 3.02 3.45 3.90    1.57 2.12 2.55 2.98 3.44
  HRNR        1.94 2.50 3.06 3.62 4.17    2.28 2.72 3.17 3.63 4.08    1.83 2.33 2.84 3.35 3.85
  M-LTS       2.11 2.91 3.50 3.96 4.33    2.41 2.87 3.29 3.69 4.08    2.05 2.70 3.21 3.62 3.98
  M-LC        2.47 3.17 3.71 4.17 4.56    2.62 3.05 3.43 3.82 4.23    2.35 2.94 3.38 3.79 4.17
  Proposed    2.43 3.20 3.79 4.25 4.64    2.57 3.03 3.44 3.84 4.24    2.30 2.94 3.44 3.85 4.22

Speech babble
  Noisy       2.52 3.09 3.63 4.10 4.50    1.68 2.15 2.64 3.13 3.61    1.96 2.47 2.95 3.39 3.78
  M-L         2.53 3.11 3.60 3.97 4.22    1.87 2.35 2.79 3.16 3.44    2.04 2.56 3.01 3.36 3.59
  TSNR        1.67 2.51 3.23 3.82 4.32    1.46 2.06 2.64 3.19 3.72    1.37 2.09 2.72 3.27 3.74
  HRNR        1.86 2.73 3.46 4.06 4.52    1.46 2.09 2.70 3.26 3.78    1.48 2.23 2.88 3.43 3.88
  M-LTS       2.21 2.94 3.56 4.07 4.49    1.86 2.39 2.90 3.39 3.85    1.84 2.47 3.03 3.50 3.91
  M-LC        2.41 3.09 3.69 4.19 4.59    1.95 2.47 2.96 3.46 3.92    2.00 2.59 3.12 3.59 3.99
  Proposed    2.58 3.24 3.81 4.29 4.67    2.11 2.60 3.07 3.54 3.98    2.15 2.73 3.23 3.69 4.06

Destroyer engine room
  Noisy       2.00 2.59 3.19 3.75 4.25    1.52 1.98 2.46 2.96 3.47    1.65 2.14 2.64 3.12 3.57
  M-L         2.45 3.02 3.51 3.89 4.15    1.96 2.39 2.79 3.14 3.41    2.04 2.53 2.95 3.29 3.55
  TSNR        1.89 2.48 3.05 3.64 4.19    2.03 2.48 2.89 3.34 3.81    1.75 2.28 2.76 3.26 3.74
  HRNR        2.20 2.92 3.56 4.12 4.60    2.09 2.60 3.05 3.50 3.96    1.90 2.54 3.10 3.59 4.03
  M-LTS       2.57 3.20 3.67 4.09 4.49    2.19 2.71 3.16 3.59 4.00    2.19 2.78 3.24 3.64 4.01
  M-LC        3.00 3.55 3.97 4.36 4.72    2.49 2.94 3.35 3.74 4.16    2.57 3.08 3.48 3.85 4.20
  Proposed    3.10 3.66 4.09 4.47 4.81    2.50 2.97 3.40 3.81 4.21    2.63 3.15 3.58 3.95 4.28

F16 cockpit
  Noisy       1.95 2.58 3.20 3.77 4.28    1.57 2.05 2.54 3.05 3.55    1.63 2.17 2.70 3.20 3.66
  M-L         2.45 3.05 3.56 3.94 4.20    2.03 2.47 2.87 3.21 3.46    2.09 2.60 3.04 3.37 3.61
  TSNR        1.67 2.31 2.92 3.53 4.13    1.90 2.42 2.92 3.39 3.87    1.56 2.16 2.71 3.23 3.75
  HRNR        2.04 2.79 3.48 4.10 4.59    2.00 2.54 3.06 3.54 3.99    1.75 2.44 3.07 3.61 4.05
  M-LTS       2.56 3.18 3.67 4.07 4.48    2.21 2.72 3.18 3.61 4.04    2.21 2.79 3.26 3.65 4.03
  M-LC        2.89 3.45 3.88 4.30 4.68    2.43 2.89 3.32 3.74 4.16    2.50 3.01 3.43 3.82 4.19
  Proposed    2.99 3.56 4.00 4.39 4.75    2.46 2.93 3.35 3.77 4.18    2.57 3.09 3.52 3.90 4.25

Leopard (military vehicle)
  Noisy       3.39 3.83 4.22 4.56 4.86    1.99 2.42 2.87 3.32 3.80    2.66 3.08 3.45 3.80 4.14
  M-L         3.41 3.67 3.88 4.03 4.13    2.63 2.93 3.18 3.37 3.52    2.88 3.16 3.38 3.55 3.66
  TSNR        3.55 4.07 4.48 4.80 4.98    2.76 3.21 3.63 4.01 4.38    3.03 3.52 3.91 4.23 4.49
  HRNR        3.72 4.25 4.66 4.94 5.00    2.84 3.29 3.71 4.09 4.43    3.15 3.63 4.03 4.35 4.59
  M-LTS       3.82 4.29 4.66 4.92 4.99    2.90 3.33 3.73 4.08 4.42    3.25 3.70 4.07 4.35 4.56
  M-LC        3.92 4.36 4.71 4.94 5.00    2.89 3.33 3.73 4.10 4.46    3.29 3.73 4.09 4.38 4.60
  Proposed    4.16 4.52 4.81 4.97 5.00    3.13 3.50 3.85 4.17 4.49    3.53 3.90 4.20 4.44 4.63

M109 (tank)
  Noisy       2.66 3.24 3.78 4.26 4.68    1.85 2.31 2.78 3.26 3.75    2.23 2.73 3.20 3.63 4.03
  M-L         3.05 3.46 3.77 3.98 4.11    2.49 2.85 3.15 3.38 3.53    2.66 3.03 3.31 3.51 3.63
  TSNR        2.36 3.21 3.91 4.44 4.83    2.42 2.95 3.47 3.95 4.37    2.26 2.94 3.52 3.97 4.33
  HRNR        2.74 3.57 4.22 4.69 4.95    2.54 3.06 3.54 3.97 4.35    2.49 3.18 3.73 4.14 4.45
  M-LTS       3.13 3.71 4.23 4.66 4.94    2.62 3.10 3.57 4.02 4.42    2.77 3.30 3.76 4.15 4.45
  M-LC        3.37 3.93 4.41 4.80 4.98    2.76 3.22 3.67 4.10 4.49    2.95 3.45 3.89 4.26 4.54
  Proposed    3.55 4.08 4.53 4.87 4.99    2.85 3.30 3.73 4.14 4.52    3.09 3.57 3.99 4.32 4.58

Mixed F16 cockpit and babble noise
  Noisy       2.24 2.84 3.42 3.94 4.40    1.63 2.10 2.60 3.10 3.59    1.80 2.32 2.83 3.30 3.73
  M-L         2.26 2.90 3.37 3.70 3.93    1.96 2.43 2.83 3.15 3.37    1.96 2.52 2.95 3.25 3.46
  TSNR        1.26 1.92 2.58 3.24 3.90    1.56 2.11 2.65 3.16 3.69    1.20 1.81 2.39 2.97 3.53
  HRNR        1.49 2.36 3.16 3.84 4.40    1.63 2.23 2.81 3.33 3.83    1.33 2.09 2.77 3.36 3.86
  M-LTS       2.32 2.90 3.35 3.78 4.23    2.05 2.54 2.99 3.43 3.87    2.00 2.55 2.99 3.39 3.80
  M-LC        2.48 3.04 3.50 3.95 4.40    2.18 2.65 3.09 3.53 3.98    2.16 2.66 3.09 3.51 3.93
  Proposed    2.57 3.13 3.58 4.01 4.44    2.22 2.68 3.13 3.56 4.00    2.22 2.73 3.17 3.58 3.97

Notes: HRNR = harmonic regeneration noise reduction; M-L = minimum mean-square error log-spectral amplitude (MMSE-LSA) estimator; M-LC = MMSE-LSA with EM-based cepstrum smoothing (CEM); M-LTS = MMSE-LSA with temporal cepstrum smoothing (TCS) and speech presence probability (SPP); Proposed = the proposed algorithm based on sparse reconstruction on log power spectra with the two-step MMSE-LSA filtering (SRLPS-TSL); TSNR = two-step noise reduction. Figures in bold fonts represent the highest scores.


(Leopard) noise (from the NOISEX-92 database [24]) is added to the speech with an input segmental signal-to-noise ratio (segSNR) of around −5.79 dB. Figure 4(c) depicts the spectrogram using the traditional MMSE-LSA method. It can be seen that although some of the background noise is suppressed, the speech signal is distorted. Figure 4(d) shows the spectrogram using the harmonic regeneration noise reduction (HRNR) algorithm [4]. Although much of the speech content is recovered, the noise control is not sufficient and strong residual background noise remains in the enhanced speech. Figure 4(e) shows the spectrogram given by the MMSE-LSA expectation-maximisation-based cepstrum smoothing (CEM) algorithm [7]. Much residual noise remains in the low-frequency part of the spectrum, which may be due to the strong spectral coefficients of the Leopard noise at low frequencies, where strong speech spectral coefficients can also be found. These are then mixed up in the MMSE-LSA CEM algorithm. Figure 4(f) shows the spectrogram using the proposed SRLPS-TSL algorithm. It can be seen that there is very good background noise control (as indicated in the circled areas) and the speech content is also preserved. It can also be seen that many of the low-frequency noise coefficients are suppressed, while those of the speech remain intact. This result demonstrates the ability of the proposed sparse coding method to differentiate speech and noise coefficients.
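The VAD-gated noise tracking used by all compared algorithms (an initial estimate from leading speech-free frames, then updates whenever a frame is flagged as noise-only) can be sketched as follows; the smoothing factor is an illustrative choice, not a value taken from the paper:

```python
import numpy as np

def update_noise_ps(noise_ps, frame_ps, is_speech, alpha=0.9):
    """Recursive noise power-spectrum tracking gated by a VAD decision.

    noise_ps: current estimate (None before the first frame);
    frame_ps: periodogram of the incoming frame;
    is_speech: VAD decision for the frame (True = speech present).
    """
    if noise_ps is None:
        return frame_ps.copy()  # initialise from a speech-free frame
    if not is_speech:
        # noise-only frame: smooth the new periodogram into the estimate
        return alpha * noise_ps + (1.0 - alpha) * frame_ps
    return noise_ps             # speech present: hold the estimate
```

Holding the estimate during speech avoids leaking speech energy into the noise PS, which would otherwise bias the a-priori SNR and the residual coherence threshold.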

The performance of the proposed SRLPS-TSL algorithm is further evaluated using standard objective evaluation measures and compared with the following approaches: MMSE-LSA [3], TSNR [4], HRNR [4], MMSE-LSA plus speech presence probability (SPP) using a TCS method [6], and MMSE-LSA CEM [7].

In the simulation, 40 male and 40 female speech samples were arbitrarily selected from the TI and MIT database [25]. The noise signals were adopted from the NOISEX-92 database [24] and added to the speech samples with input SNRs ranging from about 0 dB to +20 dB. To predict the quality of the enhanced speech samples, three composite objective metrics [1] were used: (a) Csig, signal distortion (SIG), formed by linearly combining the log-likelihood ratio (LLR), PESQ and weighted-slope spectral distance (WSS) measures; (b) Cbak, noise distortion (BAK), formed by linearly combining the segSNR, PESQ and WSS measures; and (c) Covl, overall quality (OVL), formed by linearly combining the PESQ, LLR and WSS measures.

Table 3 lists the performance of six different algorithms for noisy speech signals contaminated by six different kinds of noise, both stationary and non-stationary. In addition, the background noise also includes a mixed noise which is a combination of F16 cockpit noise and speech babble noise (one second of F16 noise followed by one second of babble noise per period) in order to show the adaptive ability of the evaluated approaches in situations where the background

noise is rapidly changing. As can be seen from Table 3, the proposed SRLPS-TSL algorithm always outperforms the other five, and in many cases the improvement is significant. Its performance is consistent for speech samples originating from people of different genders and at different noise levels. These results thus demonstrate the robustness of the proposed algorithm.

6. Conclusion

In this paper, an improved speech enhancement algorithm based on sparse coding on log-spectra is proposed. The proposed algorithm starts by using the traditional cepstrum smoothing method, which gives an initial guess of the log periodogram of the clean speech. The sparse coding is carried out using a LARC algorithm with a speech dictionary trained using a K-SVD method. The LARC algorithm is improved by introducing a noise-adaptive residual coherence threshold so that the stopping criterion adapts to the noise type and the input SNR. The improved LARC algorithm gives a good estimate of the sparse coding coefficients of the clean speech's LPS. Combining this with the TSL method, enhanced speech is obtained. The proposed algorithm not only fully exploits the sparsity of speech through the use of the sparse speech dictionary, but also reduces the confusion in the sparse coding process due to the coherence between the noise and the speech. As a result, the proposed algorithm works particularly well when the input speech is contaminated by noise that contains structure (which includes most coloured noises). The proposed algorithm also has very good control of the residual background noise, allowing it to outperform traditional methods. The simulation results verify that the proposed algorithm shows a better performance compared to traditional speech enhancement methods under a variety of testing conditions using several evaluation measures. Thus, the robustness of the proposed algorithm for use in general speech enhancement applications is clearly shown.

Funding

This work was supported by The Hong Kong Polytechnic University [grant number G-YBAR].

Notes on contributors

Dr Tak Wai Shen received his BEng (Hons) degree in Electronic Engineering from Oxford Brookes University, the UK in 1992, and his MPhil and PhD degrees in Electronic and Information Engineering from The Hong Kong Polytechnic University in 1998 and 2015, respectively. He has served as an Engineer and Manager in several listed companies in Hong Kong. He also won the Best Paper Award


at the 2015 Institute of Electrical and Electronics Engineers (IEEE) International Conference on Digital Signal Processing (DSP 2015). Currently, he is a Senior Engineer at the Hong Kong Applied Science and Technology Research Institute (ASTRI). His research interests include digital signal and image processing, fast algorithms and speech enhancement. He is a member of the US's IEEE and the UK's Institution of Engineering and Technology (IET).

Ir Dr Daniel P K Lun received his BSc (Hons) degree from the University of Essex, the UK in 1988 and his PhD degree from The Hong Kong Polytechnic University in 1991. He is currently an Associate Professor and Acting Head of the Department of Electronic and Information Engineering at The Hong Kong Polytechnic University.

He has published more than 120 international journal and conference papers. His research interests include speech and image enhancement, wavelet theories and applications, and multimedia technologies. He was the Chairman of the IEEE Hong Kong Chapter of Signal Processing in 1999/2000, the Finance Chair of the 2003 IEEE International Conference on Acoustics, Speech and Signal Processing, the General Chair of the 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing (ISIMP2004), and the Finance Chair of the 2010 International Conference on Image Processing (ICIP2010). He is a member of the DSP and Visual Signal Processing and Communications Technical Committees of the IEEE Circuits and Systems Society. He is a registered engineer, a corporate member of the IET and The Hong Kong Institution of Engineers (HKIE), and a senior member of the IEEE.

ORCiD

Tak Wai Shen http://orcid.org/0000-0002-6151-0245
Daniel P K Lun http://orcid.org/0000-0003-3891-1363

References

[1] Loizou PC. Speech enhancement: theory and practice. Boca Raton, FL: CRC Press; 2007.

[2] Ephraim Y, Malah D. Speech enhancement using a minimum mean square error short time spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process. 1984;32(6):1109–1121.

[3] Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process. 1985;33(2):443–445.

[4] Plapous C, Marro C, Scalart P. Improved signal-to-noise ratio estimation for speech enhancement. IEEE Trans Audio Speech Lang Process. 2006;14(6):2098–2108.

[5] Lun DPK, Shen TW, Hsung TC, Ho KC. Wavelet based speech presence probability estimator for speech enhancement. Digit Signal Process. 2012;22(6):1161–1173.

[6] Breithaupt C, Gerkmann T, Martin R. A novel a-priori SNR estimation approach based on selective cepstro-temporal smoothing. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP); 2008 Apr. p. 4897–4900.

[7] Lun DPK, Shen TW, Ho KC. A novel expectation-maximization framework for speech enhancement in non-stationary noise environments. IEEE Trans Audio Speech Lang Process. 2014;22(2):335–346.

[8] Gemmeke JF, Virtanen T, Hurmalainen A. Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans Audio Speech Lang Process. 2011;19(7):2067–2080.

[9] Deng SW, Han JQ. Statistical voice activity detection based on sparse representation over learned dictionary. Digit Signal Process. 2013;23(4):1228–1232.

[10] Naseem I, Togneri R, Bennamoun M. Sparse representation for speaker identification. Proc. IEEE Int. Conf. Pattern Recognition; 2010 Aug. p. 4460–4463.

[11] Schmidt MN, Larsen J, Hsiao FT. Wind noise reduction using non-negative sparse coding. Proc. IEEE Workshop on Machine Learning for Signal Processing; 2007. p. 431–436.

[12] Wilson K, Raj B, Smaragdis P, Divakaran A. Speech denoising using non-negative matrix factorization with priors. Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP); 2008 Mar. p. 4029–4032.

[13] Sigg C, Dick T, Buhmann J. Speech enhancement using generative dictionary learning. IEEE Trans Audio Speech Lang Process. 2012;20(6):1698–1712.

[14] Zhao Y, Zhao X, Wang B. A speech enhancement method based on sparse reconstruction of power spectral density. Comput Electr Eng. 2014;40(4):1080–1089.

[15] Rubinstein R, Zibulevsky M, Elad M. Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. Technical report. Haifa: Technion Computer Science Department; 2008.

[16] Bogert BP, Healy MJR, Tukey JW. The quefrency analysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking. In: Rosenblatt M, editor. Time series analysis, Ch. 15. New York: Wiley; 1963. p. 209–243.

[17] Rubinstein R, Zibulevsky M, Elad M. Double sparsity: learning sparse dictionaries for sparse signal approximation. IEEE Trans Signal Process. 2010;58(3):1553–1564.

[18] Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Statist. 2004;32:407–499.

[19] Tosic I, Frossard P. Dictionary learning. IEEE Signal Process Mag. 2011;28(2):27–38.

[20] Aharon M, Elad M, Bruckstein AM. K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans Signal Process. 2006;54(11):4311–4322.

[21] Rauhut H, Schnass K, Vandergheynst P. Compressed sensing and redundant dictionaries. IEEE Trans Inf Theory. 2008;54(5):2210–2219.

[22] Moulin P. Wavelet thresholding techniques for power spectrum estimation. IEEE Trans Signal Process. 1994;42:3126–3136.

[23] Davis A, Nordholm S, Togneri R. Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold. IEEE Trans Audio Speech Lang Process. 2006;14(2):412–424.

[24] Varga A, Steeneken HJM. Assessment for automatic speech recognition II: NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 1993;12(3):247–251.

[25] Garofolo JS. Getting started with the DARPA TIMIT CD-ROM: an acoustic phonetic continuous speech database. Gaithersburg, MD: National Institute of Standards and Technology (NIST); prototype as of 1988 Dec.
