ARCHIVES OF ACOUSTICS

Vol. 42, No. 2, pp. 287–295 (2017)

Copyright © 2017 by PAN – IPPT

DOI: 10.1515/aoa-2017-0031

Frequency Selection Based Separation of Speech Signals with Reduced Computational Time Using Sparse NMF

Yash Vardhan VARSHNEY, Zia Ahmad ABBASI, Musiur Raza ABIDI, Omar FAROOQ

Department of Electronics, Aligarh Muslim University

Aligarh, India
e-mail: {yashvarshneyy, omarfarooq70}@gmail.com, [email protected], [email protected]

(received September 2, 2016; accepted December 21, 2016)

Application of wavelet decomposition is described to speed up mixed speech signal separation with the help of non-negative matrix factorisation (NMF). It is assumed that the basis vectors of the training data of the individual speakers have been recorded beforehand. In this paper, the spectrogram magnitude of a mixed signal is factorised with the help of NMF, taking the sparseness of speech signals into consideration. The high-frequency components of a signal contain a very small amount of the signal energy. By rejecting the high-frequency components, the size of the input signal is reduced, which reduces the computational time of the matrix factorisation. The lower-energy signal is separated out using wavelet decomposition. The present work is done for wideband microphone speech signals and standard audio signals from digital video equipment. The proposed model shows an improvement in separation capability over an existing one in terms of the correlation between separated and original signals. The obtained signal to distortion ratio (SDR) and signal to interference ratio (SIR) are also larger than those of the existing model. The proposed model additionally shows a reduction in computational time, which results in faster operation.

Keywords: sparse NMF; mixed speech recognition; machine learning.

1. Introduction

The problem of mixed signal separation has been attracting researchers for a long time. Non-negative matrix factorisation (NMF) has emerged as a tool for source separation (Lee, Seung, 1999; Paatero, Tapper, 1994). Initially used for mathematical computation, NMF has now found application in various kinds of source separation. NMF factorises a two-dimensional matrix into its components and their weights. For speech signals, the spectrogram magnitude can be considered as the primary two-dimensional matrix. As speech signals are sparse in nature, sparse NMF is applied to factorise it (Benetos et al., 2006; Demir et al., 2013; Schmidt, Olsson, 2006).

Proper factorisation of a matrix is a time-consuming process because it needs hundreds of iterations. The number of computations in a single iteration depends upon the number of input samples. By ignoring the data which carries negligible signal energy, however, the number of samples to be processed can be reduced. Although an audio signal has frequencies in the range of 20–20 000 Hz, most of the information in a speech signal is contained in the lower frequencies. In practice, higher frequencies are mostly affected by random noise (generated by the recording camera and the environment); therefore, by processing only the lower-frequency samples, faster operation may be achieved without any noticeable degradation in the quality of separation.

Initially, sparse NMF was used to separate mixed images. Hoyer (2004) reported a successful implementation of NMF with sparseness constraints for image separation. The use of NMF for speech applications and its advantages over Independent Component Analysis (ICA) are reported in (Cho et al., 2003). Benetos et al. (2006) used sparse NMF (SNMF) for the classification of individual musical instruments from a mixed sound. Demir et al. (2013) have shown the de-mixing of music material made of a jingle catalogue and speech using NMF. Single-channel speech separation using sparse NMF was proposed by Schmidt and Olsson (2006). Recently, Wang et al. (2014) have shown that the performance may be improved with a suitable choice of basis vectors. In the present work, SNMF is used along with rejection of data containing negligible energy to separate two individual signals from a single-channel mixed signal with low processing time.

Correlation between the mixed speech signal and the original signals, together with the SDR, signal to artifacts ratio (SAR), and SIR between separated signals, is computed in order to compare the results of the existing and proposed algorithms. It is found that the correlation between separated and original signals, as well as the SDR and SIR of the separated signals, increases as desired. Hence, the proposed model separates the mixed signal better than the existing algorithm on these parameters. The proposed model also operates faster, which makes it applicable to real-time applications.

The paper is organised as follows. In Sec. 2, a brief review of NMF is given. In Sec. 3, sparse NMF with the criterion for choosing the sparse parameter is elaborated. The design of the model with the help of NMF and frequency selection using the wavelet transform is discussed in Sec. 4, followed by results and conclusions.

2. Non-negative matrix factorisation

Consider two individual speech sources generating signals s1(t) and s2(t). A microphone captures a signal which is a mixture of the individual signals:

$$s(t) = p \cdot s_1(t) + q \cdot s_2(t), \tag{1}$$

where p and q are the scaling factors applied to the individual speech signals; they depend upon the distance of each speaker from the microphone. NMF factorises the spectrogram magnitude (X) of the mixed speech signal, which has N column vectors of length M. M depends on the frequency resolution chosen for the Short Time Fourier Transform (STFT), whereas N depends upon the length of the speech signal. All elements of the matrix X are non-negative. The NMF factorises X as follows:

$$X = W H, \tag{2}$$

Fig. 1. Non-negative matrix factorisation of a speech signal represented by three basis vectors.

where the basis matrix W is of order M × K and the weight matrix H is of order K × N. K is the number of basis vectors for X and should be less than or equal to min(M, N).

To find the closest factorisation, a cost function between U and V (U is the original spectrogram magnitude matrix and V is the reconstructed spectrogram magnitude matrix) can be defined in terms of the Euclidean distance, i.e., $\sum_{i,j}(U_{i,j} - V_{i,j})^2$; however, for speech/audio applications the Itakura-Saito or Kullback-Leibler (K-L) divergence is often found to be more suitable (Nasersharif, Abdali, 2015; Fevotte et al., 2009; Lee, Seung, 2000). The K-L divergence is defined as:

$$D(U \| V) = \sum_{i,j}\left(U_{ij}\log\frac{U_{ij}}{V_{ij}} - U_{ij} + V_{ij}\right). \tag{3}$$

As the roles of U and V are not symmetric, D(U‖V) is not termed a distance; like the Euclidean distance, however, it is lower-bounded by zero, with equality at U = V. For a better approximation of the factorisation of the signal X (Eq. (2)), D(X‖WH) should be minimised (Zhu et al., 2013). Initially, random values of dimensions M × K are assigned to W, and H is calculated accordingly. This random initialisation affects the quality of the factorisation; a better initialisation of the basis vectors may lead to a better approximation. The divergence may be minimised by updating either the basis vectors W, or their weights H, or both. For finding the basis vectors of the individual training signals, updating both W and H is required; this provides the most appropriate basis vectors. Consider the spectrogram magnitudes of the signals s1(t) and s2(t) as X1 and X2. Then X1 and X2 may be written as:

$$X_1 = W_1 H_1, \qquad X_2 = W_2 H_2. \tag{4}$$
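For concreteness, below is a minimal sketch of this training factorisation with the K-L divergence of Eq. (3), using the classical multiplicative updates of Lee and Seung (2000). The function name, iteration count, and random initialisation scheme are illustrative choices, not taken from the paper.

```python
import numpy as np

def nmf_kl(X, K, n_iter=200, eps=1e-10, seed=0):
    """Factorise a non-negative matrix X (M x N) as X ~ W @ H by
    minimising the K-L divergence of Eq. (3) with the multiplicative
    updates of Lee and Seung (2000)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, K)) + eps          # random initialisation (Sec. 2)
    H = rng.random((K, N)) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        H *= (W.T @ (X / (W @ H + eps))) / (W.T @ ones + eps)  # weight update
        W *= ((X / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)  # basis update
    return W, H
```

Applying such a routine to the training spectrogram magnitudes X1 and X2 would yield the speaker bases W1 and W2 of Eq. (4).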


By concatenating the basis vectors of both signals, a basis matrix for the mixed signal is formed as:

$$W = [W_1 \; W_2]. \tag{5}$$

Using this basis matrix of the mixed signal, the weight matrix is calculated with SNMF. Source separation from the extracted W and H matrices is done by splitting the reconstruction into its two parts:

$$X = [W_1 \; W_2]\begin{bmatrix} H_1 \\ H_2 \end{bmatrix} = X_1 + X_2. \tag{6}$$

The number of basis vectors K also affects the performance of NMF. A small value of K results in a greater error because a limited number of basis vectors may not be able to represent the original signal. For large values of K, the signal extraction does not improve much, while the computational time increases drastically. Wang and Sha (2014) reported the effect of the number of basis vectors used for representing a signal.
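The separation step of Eqs. (5)–(6) can be sketched as follows: the concatenated basis W = [W1 W2] is held fixed and only the weights H are updated on the mixture spectrogram. For brevity the plain K-L update is used here (the paper applies the sparse update of Sec. 3), and the final soft-masking line is a common refinement assumed here, not a step stated in the paper.

```python
import numpy as np

def separate(X_mix, W1, W2, n_iter=200, eps=1e-10, seed=0):
    """Split the mixture magnitude X_mix into two layers using the
    fixed, concatenated speaker bases of Eq. (5)."""
    rng = np.random.default_rng(seed)
    W = np.hstack([W1, W2])               # Eq. (5): W = [W1 W2]
    K1 = W1.shape[1]
    H = rng.random((W.shape[1], X_mix.shape[1])) + eps
    ones = np.ones_like(X_mix)
    for _ in range(n_iter):               # only H is updated; W stays fixed
        H *= (W.T @ (X_mix / (W @ H + eps))) / (W.T @ ones + eps)
    X1, X2 = W1 @ H[:K1], W2 @ H[K1:]     # Eq. (6): X = X1 + X2
    total = X1 + X2 + eps                 # assumed soft-mask refinement
    return X_mix * X1 / total, X_mix * X2 / total
```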

3. Sparse NMF

One major drawback of conventional NMF is its inability to exploit the sparseness of the different speech signals: conventional NMF does not account for the sparseness of the individual signals, which reduces the quality of separation. As speech signals have sparse characteristics, they admit a sparse representation of the data. By adding a sparseness constraint to NMF, control over the sparse representation of the output can be extended (Wang, Sha, 2014; Kim, Park, 2008). Sparsity can be imposed on the weight matrix with the help of a sparse parameter, here denoted by γ. The divergence formulation stated in Eq. (3) is modified into Eq. (7) as:

$$\min_{W,H} D(X\|WH) = \min_{W,H}\; \|X - WH\|_F^2 + \gamma\sum_{i,j}H_{i,j}, \qquad W, H \ge 0, \tag{7}$$

where ‖·‖F denotes the Frobenius norm. The sparse parameter affects the weight matrix H and the basis vectors W through the update rules of Eqs. (8) and (9), respectively:

$$H_{i,j} \leftarrow H_{i,j}\,\frac{X_i^T W_j}{[WH]_i^T W_j + \gamma}, \tag{8}$$

$$W_j \leftarrow W_j\,\frac{\sum_i H_{i,j}\left[X_i + \left([WH]_i^T W_j\right)W_j\right]}{\sum_i H_{i,j}\left[\,[WH]_i + \left(X_i^T W_j\right)W_j\right]}. \tag{9}$$

A larger sparse parameter γ is chosen when stronger sparsity exists, but it leads to a relatively poor approximation; smaller values of γ give a more accurate approximation, but the number of iterations needed to reach the minimum of the cost function increases. The time taken for processing the signal can therefore also be managed by choosing a proper sparse parameter (Hoyer, 2004; Kim, Park, 2008).
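A minimal sketch of sparsity-penalised NMF in the spirit of Eqs. (7)–(9) is given below, written for the Frobenius form of Eq. (7) with unit-norm basis columns (Eggert-Körner-style updates); the exact normalisation may differ in detail from the authors' implementation.

```python
import numpy as np

def snmf(X, K, gamma=0.5, n_iter=200, eps=1e-10, seed=0):
    """Sparse NMF for Eq. (7): min ||X - W H||_F^2 + gamma * sum(H),
    W, H >= 0, with multiplicative updates and unit-norm bases."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        W /= np.linalg.norm(W, axis=0, keepdims=True) + eps
        # gamma in the denominator shrinks H towards sparsity (cf. Eq. (8))
        H *= (W.T @ X) / (W.T @ W @ H + gamma + eps)
        # rescaled update that keeps the bases normalised (cf. Eq. (9))
        XH, WHH = X @ H.T, W @ (H @ H.T)
        W *= (XH + W * np.sum(WHH * W, axis=0)) / \
             (WHH + W * np.sum(XH * W, axis=0) + eps)
    return W, H
```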

4. Frequency selection based speech separation

After analysing approximately 240 speech signals of 2 to 3 seconds from 30 different speakers (both males and females), sampled at 16 kHz and 48 kHz (wideband microphone speech signals and standard audio signals from digital video equipment), it has been found that more than 95% of the signal energy is contained in the lower 50% of the signal's frequency band. For reference, a speech signal of the English digits 'one' 'two' 'three' 'four' sampled at 16 kHz and 48 kHz and its wavelet decompositions are shown in Fig. 2.

Fig. 2. Speech signal plots for digits 'one' 'two' 'three' 'four' sampled at 16 and 48 kHz, respectively: a) input signals, b) signals containing low frequencies from 0–4 kHz and 0–12 kHz, c) signals containing high frequencies from 4–8 kHz and 12–24 kHz, respectively.

Figure 2a shows the original signals sampled at 16 kHz and 48 kHz in the 1st and 2nd rows. The signal of the lower half of frequencies, denoted by 'A', and the signal of the higher half of frequencies, 'D', are shown in Fig. 2b and Fig. 2c, respectively.

The proposed speech separation model is a cascaded structure. The first stage of the model is a system based on wavelet decomposition; the second stage separates the speech signal using SNMF. The frequency components containing the higher amount of signal energy are sent for further processing, whereas the remaining signal components are kept for the reconstruction of the signal at the output, as shown in Fig. 3.

Fig. 3. Proposed model for speech separation.

To reduce the size of the signal for separation, signals are filtered by a wavelet decomposition consisting of a low-pass and a high-pass filter followed by a downsampler. The signal coming from the lower 50% of the frequency band is sent for factorisation; the high-frequency signal is considered as noise and replaced by zeros of the same length. Spectrograms of the low-frequency signals are obtained by the short-time Fourier transform (STFT). SNMF factorises the spectrogram magnitude matrix X into X1 and X2. After factorisation, speech signals are obtained from X1 and X2 by the inverse STFT (ISTFT). The separated speech signals s1 and s2 are then reconstructed by the inverse wavelet transform.

In this paper, high-frequency signal separation is also tried in place of zero replacement for the high-frequency speech signals, as shown in Fig. 4. The separated high-frequency signals are recombined with the low-frequency separated signals using the IDWT.

Fig. 4. High frequency signal separation model.
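A minimal sketch of this wavelet stage, assuming the PyWavelets (pywt) package; the Haar mother wavelet is an illustrative choice, as the paper does not name the wavelet used. One decomposition level halves the band, keeping 0–4 kHz at 16 kHz sampling (0–12 kHz at 48 kHz).

```python
import numpy as np
import pywt

def split_bands(s, wavelet="haar"):
    """One-level DWT: the approximation A holds the lower half-band
    (sent on to SNMF); the detail D holds the upper half-band."""
    A, D = pywt.dwt(s, wavelet)
    return A, D

def rebuild(A_separated, D, wavelet="haar", zero_high=True):
    """Reconstruct a full-band signal from a separated low band; in the
    proposed model the high band is replaced by zeros of the same length."""
    if zero_high:
        D = np.zeros_like(D)
    return pywt.idwt(A_separated, D, wavelet)
```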

5. Evaluation parameters

5.1. Correlation value

The correlation value of the mixed and separated signals with the individual signals (1st and 2nd signals) has been calculated by using the following expression (Walpole et al., 2011):

$$r = \frac{n\sum xy - \sum x\sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}, \tag{10}$$

where r is the sample correlation coefficient, n is the sample size, x is the value of the 1st variable, and y is the value of the 2nd variable.
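Equation (10) is the standard sample (Pearson) correlation coefficient; a direct NumPy sketch:

```python
import numpy as np

def correlation(x, y):
    """Sample correlation coefficient r of Eq. (10)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = x.size
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2)
                  * (n * np.sum(y**2) - np.sum(y)**2))
    return num / den
```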

5.2. Global performance measures of source separation

The common distortion measures SDR, SIR, and SAR are described in Eqs. (11)–(13), following (Vincent et al., 2006) and computed using (Fevotte et al., 2005). The parameters are defined as:

$$\mathrm{SDR} \overset{\Delta}{=} 10\log_{10}\frac{\|s_t\|^2}{\|e_{\mathrm{interf}} + e_{\mathrm{artif}}\|^2}, \tag{11}$$

$$\mathrm{SIR} \overset{\Delta}{=} 10\log_{10}\frac{\|s_t\|^2}{\|e_{\mathrm{interf}}\|^2}, \tag{12}$$

$$\mathrm{SAR} \overset{\Delta}{=} 10\log_{10}\frac{\|s_t + e_{\mathrm{interf}}\|^2}{\|e_{\mathrm{artif}}\|^2}, \tag{13}$$

where $\hat{s}_i = s_t + e_{\mathrm{interf}} + e_{\mathrm{artif}}$ is the estimated/reconstructed signal, $s_t \overset{\Delta}{=} \langle \hat{s}_i, s_i\rangle\, s_i$ is the targeted source, $e_{\mathrm{interf}} \overset{\Delta}{=} \langle \hat{s}_i, s_{i'}\rangle\, s_{i'}$ is the interference error, $e_{\mathrm{artif}} \overset{\Delta}{=} \hat{s}_i - (s_t + \langle \hat{s}_i, s_{i'}\rangle\, s_{i'})$ is the artifacts error, $s_{i'}$ are the input signals other than $s_i$, and $\langle a, b\rangle := \sum_{t=0}^{T-1} a(t)\,\overline{b}(t)$ is the inner product between two signals $a$ and $b$ of the same length, $\overline{b}$ being the complex conjugate of $b$.
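The authors computed these measures with the BSS EVAL toolbox (Fevotte et al., 2005); purely as an illustration of Eqs. (11)–(13) for the two-source case, the decomposition can be sketched as follows, with the projections normalised by the reference energies.

```python
import numpy as np

def bss_metrics(s_est, s_ref, s_other):
    """SDR, SIR, SAR of Eqs. (11)-(13) for an estimate s_est of the
    reference s_ref, with s_other the competing source (two sources)."""
    def proj(a, b):                        # projection of a onto b
        return (np.dot(a, b) / np.dot(b, b)) * b
    s_t = proj(s_est, s_ref)               # targeted source
    e_interf = proj(s_est, s_other)        # interference error
    e_artif = s_est - s_t - e_interf       # artifacts error
    db = lambda num, den: 10 * np.log10(np.sum(num**2) / np.sum(den**2))
    return (db(s_t, e_interf + e_artif),   # SDR, Eq. (11)
            db(s_t, e_interf),             # SIR, Eq. (12)
            db(s_t + e_interf, e_artif))   # SAR, Eq. (13)
```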

5.3. Computational time

The implementation of the proposed method was carried out in MATLAB 2011a on a 64-bit Windows 7 operating system running on a 1.9 GHz Intel i7 processor with 4 GB RAM. The separation time for the mixed speech data is measured. The time consumed in finding the basis vectors from the training signals is not counted, because it is assumed that these vectors have been recorded in advance, before any source separation technique is applied.

6. Simulations and results

To evaluate the performance of the proposed approach, simulations have been performed on speech signals from the TIMIT database, which contains wideband microphone speech signals sampled at 16 kHz.

Table 1. Performance of the existing model and proposed model for separation of mixed signal for 16 kHz sampled signal.

| Gender | Separation time [s] | Corr 1st mix | Corr 2nd mix | Corr 1st x1 | Corr 2nd x2 | SDR [dB] | SIR [dB] | SAR [dB] |
|--------|---------------------|--------------|--------------|-------------|-------------|----------|----------|----------|
| Existing Model (without using Wavelet Decomposition) | | | | | | | | |
| MM | 0.89416 | 0.65814 | 0.72128 | 0.71074 | 0.766438 | 1.912557 | 3.654271 | 9.003129 |
| FF | 0.95930 | 0.69155 | 0.70432 | 0.76841 | 0.783969 | 2.867325 | 5.249127 | 8.435939 |
| MF | 0.92917 | 0.62508 | 0.75723 | 0.82236 | 0.870187 | 5.266586 | 8.906659 | 8.672519 |
| Proposed Model (using Wavelet Decomposition) | | | | | | | | |
| MM | 0.47996 | 0.65814 | 0.72128 | 0.72100 | 0.784089 | 2.464281 | 5.316304 | 7.535073 |
| FF | 0.49931 | 0.69155 | 0.70432 | 0.76401 | 0.768281 | 2.9944 | 6.211138 | 7.537197 |
| MF | 0.48005 | 0.62508 | 0.75723 | 0.83347 | 0.866885 | 5.792528 | 10.80475 | 8.193429 |
| Signal Separation of Both Lower and Higher Frequency Components | | | | | | | | |
| MM | 1.28554 | 0.65814 | 0.72128 | 0.70301 | 0.75841 | 1.890813 | 5.233115 | 7.620994 |
| FF | 0.90763 | 0.69156 | 0.70433 | 0.76826 | 0.78443 | 3.058805 | 5.978104 | 7.848247 |
| MF | 0.86949 | 0.62509 | 0.75723 | 0.83382 | 0.87869 | 5.79779 | 10.46752 | 8.383128 |

Further, the speech signals of digital video equipment sampled at 48 kHz are taken for simulation from the Aligarh Muslim University Audio Data Library (AMUADLib) (Upadhyaya et al., 2013). AMUADLib contains 2 common and 8 different sentences by 100 speakers of both genders. For the 16 kHz and 48 kHz sampled sentences, respectively, 400 and 2368 combinations of male-female, 180 and 1856 combinations of female-female, and 180 and 4228 combinations of male-male speakers are taken for the analysis of the models. The average length of the individual speech signals taken for the experiment is 2.1 seconds from the TIMIT data and 2.55 seconds from AMUADLib. As each sentence has a different length, zero padding is applied before the two signals are added.

From AMUADLib, Hindi and English digit data ('Ek' 'Do' 'Teen' 'Char' 'Paanch' 'Chai' 'Saat' 'Aath' 'Now' 'Dus' and 'one' 'two' 'three' 'four' 'five' 'six' 'seven' 'eight' 'nine' 'ten') have been taken from each speaker for training. From the TIMIT database, eight sentences have been taken from each speaker for training. The average length of a training signal is about 8.8 seconds. A 512-point Fast Fourier Transform with 50% overlap and a window size of 10 milliseconds was used to compute the STFT (a sketch of this front end is given below). Based on these training signals, basis vectors are found using SNMF as described in Sec. 3; the number of basis vectors taken here is K = 100. The sparseness parameter is fixed at 0.5 for our simulations, which is enough to separate sparse signals with a moderate computational time.

Speech signal mixtures of male-male (MM), male-female (MF), and female-female (FF) speakers have been separated with and without using the frequency-based selection (wavelet decomposition). Table 1 compares the overall performance of the existing model with the proposed model at 16 kHz. Table 1 also contains the results for the model in which the high-frequency signals are separated and the individual signals reconstructed using the IDWT together with the separated low-frequency signals.
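As an illustration of the stated STFT settings (512-point FFT, 50% overlap, 10 ms window), a front-end sketch using scipy is shown below; interpreting the 10 ms window as 160 samples at 16 kHz, zero-padded to the 512-point FFT, is an assumption.

```python
import numpy as np
from scipy.signal import stft

def spectrogram_magnitude(s, fs):
    """STFT front end: 10 ms window, 50% overlap, 512-point FFT.
    Returns the magnitude (fed to SNMF) and the phase (reused at ISTFT)."""
    win = int(0.010 * fs)                  # 160 samples at fs = 16 kHz
    f, t, Z = stft(s, fs=fs, nperseg=win, noverlap=win // 2, nfft=512)
    return np.abs(Z), np.angle(Z)
```

After separation, the mixture phase is typically reused with the inverse STFT to resynthesise each source.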


Separation time is the time, in seconds, required to separate the mixed signals using the given algorithm. Corr 1st mix and Corr 2nd mix are the correlation coefficients of the first speaker's signal with the mixed speech signal and of the second speaker's signal with the mixed speech signal, respectively. After separation of the signals at the output end, Corr 1st x1 and Corr 2nd x2 give the correlation coefficients of the first speaker's input signal with the first separated signal and of the second speaker's input signal with the second separated signal, respectively. All results are the average of 5 random initialisations. The histogram in Fig. 5 shows the average correlation improvement in percentage using the existing and proposed algorithms.

Fig. 5. Correlation improvement in percentage of individual speech signals with mixed signal and separated signals using the existing and proposed models.

For all cases, a 13.76% and 14.27% average improvement is found in the correlation between the original and separated signals using the existing and proposed models, respectively. The correlation improvement with the high-frequency signal separation model is 13.97%, which is better than the existing algorithm but not better than the proposed model.

Figure 6 shows the improvement in SDR and SIR using the proposed model in comparison with the existing algorithm. The results matched our expectations in terms of SDR and SIR; however, the proposed algorithm introduces some artifacts, resulting in a lower SAR. The higher SIR and lower SAR are due to the lower projection 〈si, si′〉 of the reconstructed signal onto the signals other than the original required signal. A lower projection leads to low interference but a higher artifacts error.

SDR after separation is 2.38 dB for the same gender and 5.27 dB for the opposite gender using the existing model, whereas the proposed model shows 2.72 dB SDR for the same gender and 5.79 dB for the opposite gender. SIR is also improved, from 4.43 dB to 5.76 dB for the same gender and from 8.90 dB to 10.80 dB for the opposite gender. However, SAR is reduced from 8.72 dB to 7.53 dB for the same gender and from 8.67 dB to 8.19 dB for the opposite gender. For the high-frequency separation model, SDR for the same and opposite genders is 2.40 dB and 5.80 dB, respectively; SIR for the same and opposite genders is 5.50 dB and 10.46 dB, respectively; and SAR for the same and opposite genders is 7.70 dB and 8.38 dB, respectively.

Fig. 6. Comparative performance evaluation of the existing and proposed models using: a) signal to distortion ratio, b) signal to interference ratio, c) signal to artifacts ratio.

As the basis vectors of the training signals are considered part of the data set, the time taken for extraction of the basis vectors is not included in the calculation. All the above results may change slightly in every experiment, as they are highly dependent upon the initialisation of the basis vectors taken by the system.

As mentioned earlier, the test signals have an average duration of 2.55 seconds. The existing model takes about 0.92 seconds to separate two signals using the given system configuration, whereas the proposed model takes around 0.49 seconds and the high-frequency signal separation model around 0.91 seconds, as shown in Fig. 7. The proposed model reduces the time required to separate the signals by approximately 46.73%.

Fig. 7. Computational time to separate signals from the mixed signal using the existing and proposed models.

All results show that the high-frequency signal separation model is better than the existing model, but the improvement is not remarkable and the reduction in the computational burden is negligible. Therefore, further simulations for the mixed speech signal sampled at 48 kHz are performed on the existing and proposed models only. Table 2 compares the overall performance of the existing model with the proposed one at 48 kHz. Figure 8 shows the average correlation improvement in percentage using the existing and proposed algorithms for speech signals sampled at 48 kHz.

Fig. 8. Correlation improvement in percentage of individual speech signals with mixed signal and separated signals using the existing and proposed models.

For all cases, an 8.96% and 10.23% average improvement is found in the correlation between the original and separated signals using the existing and proposed models, respectively. It can be said that the proposed algorithm performs 1.23% better than the existing one.

Table 2. Performance of existing model and proposed model for separation of mixed signal for 48 kHz sampled signal.

| Gender | Separation time [s] | Corr 1st mix | Corr 2nd mix | Corr 1st x1 | Corr 2nd x2 | SDR [dB] | SIR [dB] | SAR [dB] |
|--------|---------------------|--------------|--------------|-------------|-------------|----------|----------|----------|
| Existing Model (without using Wavelet Decomposition) | | | | | | | | |
| MM | 3.24725 | 0.70144 | 0.68516 | 0.75158 | 0.73708 | 1.68318 | 2.88217 | 10.0829 |
| FF | 3.26976 | 0.70009 | 0.70061 | 0.75468 | 0.75515 | 1.11444 | 2.41860 | 8.73423 |
| MF | 3.47564 | 0.70729 | 0.68841 | 0.79254 | 0.76728 | 2.44976 | 4.09292 | 9.39623 |
| Proposed Model (using Wavelet Decomposition) | | | | | | | | |
| MM | 1.61918 | 0.70144 | 0.68516 | 0.75821 | 0.73992 | 2.11094 | 3.90159 | 8.73423 |
| FF | 1.63695 | 0.70009 | 0.70061 | 0.75364 | 0.75872 | 1.23419 | 3.06655 | 7.80943 |
| MF | 1.74793 | 0.70729 | 0.68841 | 0.81369 | 0.77920 | 3.10664 | 5.52209 | 8.25576 |

Figure 9 shows the improvement in SDR and SIR using the proposed model in comparison with the existing algorithm.

Fig. 9. Comparative performance evaluation of the existing and proposed models using: a) signal to distortion ratio, b) signal to interference ratio, c) signal to artifacts ratio.


SDR after separation is 1.4 dB for the same gender and 2.44 dB for the opposite gender using the existing model, whereas the proposed model shows 1.67 dB SDR for the same gender and 3.10 dB for the opposite gender. SIR is also improved, from 2.65 dB to 3.48 dB for the same gender and from 4.09 dB to 5.52 dB for the opposite gender. However, SAR is reduced from 9.4 dB to 8.26 dB.

As mentioned earlier, the test signals have an average duration of 2.55 seconds. The existing model takes about 3.33 seconds to separate two signals using the given system configuration, whereas the proposed model takes around 1.67 seconds, as shown in Fig. 10. The proposed model reduces the time required to separate the signals by approximately 49.89%. This result may lead to real-time speech separation using small packets of the mixed speech signal.

Fig. 10. Computational time to separate signals from the mixed signal using the existing and proposed models.

Performance of the algorithm also depends upon the intensity (loudness) of the speech signals and the difference in the formant frequencies of the speakers. If the intensities of the speech signals of the two speakers are very different, there may be a problem of masking, which results in poor separation of the speech signals. Moreover, if the formant frequencies of the speech signals are different, the quality of speech signal separation will be better.

As the energy content of a speech signal is concentrated in the 1st and 2nd formant frequencies (Reetz, Jongman, 2011), the relationship between the differences of the 1st and 2nd formant frequencies (fd1 and fd2) of two individual speech signals and the average improvement in terms of correlation of the individual signals from the experiment is reported in Fig. 11, where fd1 and fd2 are calculated as:

fd1 = 1st formant frequency of one speech signal − 1st formant frequency of the other speech signal,

fd2 = 2nd formant frequency of one speech signal − 2nd formant frequency of the other speech signal.
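The paper does not state how the formant frequencies were measured. Purely as an illustration, fd1 and fd2 could be computed from LPC-based formant estimates (assuming the librosa package for the LPC fit); this is a crude estimator without bandwidth or continuity checks.

```python
import numpy as np
import librosa

def first_two_formants(s, fs, order=12):
    """Illustrative formant estimate: angles of the complex LPC roots,
    sorted in frequency; the lowest two are taken as F1 and F2."""
    a = librosa.lpc(s.astype(float), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    return freqs[0], freqs[1]

def formant_differences(s1, s2, fs):
    """fd1 and fd2 as defined above for two individual speech signals."""
    f1a, f2a = first_two_formants(s1, fs)
    f1b, f2b = first_two_formants(s2, fs)
    return f1a - f1b, f2a - f2b
```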

Fig. 11. Difference of (a) 1st and (b) 2nd formant frequencies of individual signals vs. correlation improvement after separation.

Figure 11 shows that the average improvement in correlation between the separated and original signals increases with the increase in the difference of the 1st and 2nd formant frequencies of the two individual original signals, respectively.

7. Conclusion

Noise generated by the microphone at the time of recording contains high-frequency components. By removing them using wavelet decomposition, separation of the mixed speech signal is achieved with an improvement in performance over the existing algorithm in terms of correlation, SDR, and SIR. This also results in a lower execution time of the algorithm. The effect of formant frequencies on the separation capability of the proposed model is also shown. It is also reported that, for better separation of the mixed signal, the intensities of the speakers should be nearly equal and the formant frequencies should differ sufficiently.

However, considerable improvement can still be made to the performance and speed of separation. Better initialisation leads to earlier convergence, which reduces the number of iterations needed to optimise the cost function, and to better separation. The selection of the proper sparse parameter according to the speech signals which are mixed together remains an open issue; it may also play an important role in proper and fast computation of the weight matrix.

References

1. Benetos E., Kotti M., Kotropoulos C. (2006), Musical instrument classification using non-negative matrix factorization algorithms and subset feature selection, IEEE International Conference on Acoustics, Speech and Signal Processing, 5, 221–224.

2. Cho Y-C., Choi S., Bang S-Y. (2003), Non-negative component parts of sound for classification, 3rd IEEE International Symposium on Signal Processing and Information Technology, 633–636.

3. Demir C., Saraclar M., Cemgil A.T. (2013), Single-channel speech-music separation for robust ASR with mixture models, IEEE Transactions on Audio, Speech, and Language Processing, 21, 4, 725–736.

4. Fevotte C., Bertin N., Durrieu J. (2009), Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis, Neural Computation, 21, 793–830.

5. Fevotte C., Gribonval R., Vincent E. (2005), BSS EVAL toolbox user guide revision 2.0, Tech. Rep. 1706, IRISA, Rennes, France.

6. Hoyer P.O. (2004), Non-negative matrix factorization with sparseness constraints, Journal of Machine Learning Research, 5, 1457–1469.

7. Kim J., Park H. (2008), Sparse nonnegative matrix factorization for clustering, Georgia Institute of Technology, GT-CSE-08-01.

8. Lee D.D., Seung H.S. (1999), Learning the parts of objects with nonnegative matrix factorization, Nature, 401, 788–791.

9. Lee D.D., Seung H.S. (2000), Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems, 13, 556–562.

10. Nasersharif B., Abdali S. (2015), Speech/music separation using non-negative matrix factorization with combination of cost functions, International Symposium on Artificial Intelligence and Signal Processing (AISP), 107–111.

11. Paatero P., Tapper U. (1994), Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 5, 111–126.

12. Reetz H., Jongman A. (2011), Phonetics: transcription, production, acoustics, and perception, Wiley-Blackwell, ISBN: 978-1-4443-5854-4, pp. 182–200.

13. Schmidt M.N., Olsson R.K. (2006), Single-channel speech separation using sparse non-negative matrix factorization, 9th International Conference on Spoken Language Processing, Pittsburgh, PA, USA.

14. Upadhyaya P., Farooq O., Varshney P., Upadhyaya A. (2013), Enhancement of VSR using low dimension visual feature, International Conference on Multimedia, Signal Processing and Communication Technologies (IMPACT), Aligarh, India, pp. 71–74.

15. Vincent E., Gribonval R., Fevotte C. (2006), Performance measurement in blind audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, 14, 1462–1469.

16. Walpole R.E., Myers R.H., Myers S.L., Ye K.E. (2011), Probability and Statistics for Engineers and Scientists, 9th ed., Pearson, ISBN: 978-0-3216-2911-1, p. 433.

17. Wang Y., Li Y., Ho K.C., Zare A., Skubic M. (2014), Sparsity promoted non-negative matrix factorization for source separation and detection, 19th International Conference on Digital Signal Processing (DSP), 640–645.

18. Wang Z., Sha F. (2014), Discriminative non-negative matrix factorization for single-channel speech separation, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, pp. 3749–3753.

19. Zhu B., Li W., Li R., Xue X. (2013), Multi-stage non-negative matrix factorization for monaural singing voice separation, IEEE Transactions on Audio, Speech, and Language Processing, 21, 10, 2096–2107.