BLIND SPEECH SEPARATION OF NON-LINEAR CONVOLUTIVE MIXTURES FOR ROBUST SPEECH RECOGNITION

Athanasios Koutras, Evangelos Dermatas and George Kokkinakis

WCL, Electrical and Computer Engineering, University of Patras

26100 Patras, HELLAS.

Abstract

In this paper we present a novel solution to the convolutive and post non-linear Blind Speech Separation

(NLBSS) problem. The non-linear separating functions are chosen to be a mixture of parametric sigmoid

functions and higher order odd polynomial functions. The estimation of the separating filter coefficients and

the parameters of the separating non-linear functions is derived using the Maximum Likelihood Estimation

principle. Extensive experiments including different types of complex non-linear mixing functions and real

room impulse responses were carried out to simulate a mixing scenario of two speakers that are talking

simultaneously in a real room environment under the effect of high non-linear distortions. Our proposed

method succeeded in separating the non-linear mixture signals and improved the phoneme recognition

accuracy of an automatic speech recognition system by more than 20% in comparison to the accuracy

measured with the non-linear mixture signals. Furthermore, this method was found to outperform standard

linear Blind Speech Separation methods by 20%, which justifies integrating non-linear separating functions

into speech recognition systems that operate in non-linear environments with multiple simultaneous speakers.

Keywords: Blind Speech separation, non-linear mixtures, speech recognition, Maximum Likelihood.

1. Introduction

The problem of Blind Source Separation (BSS) primarily consists in recovering a set of statistically

independent signals or sources given only a number of observed mixture data. The term "blind" is justified

by the fact that the only a priori knowledge we have about the signals is their statistical independence, and

no other information about the mixing model parameters and the transfer paths from the sources to the

sensors is available beforehand.

Generally, a great number of algorithms for BSS have been proposed over the last 15 years that deal with

the linear instantaneous (memoryless) mixture of sources [1-5] as well as with the situation of linear

convolutive mixtures [3-5]; some of them have been already tested in the case of multiple competing

speakers showing great performance [6-11]. The case of instantaneous mixture of speech signals is the

simplest one and can be encountered when multiple speakers are talking simultaneously in an anechoic

room. However, when dealing with real room acoustics one has to consider the echoes and the delayed

versions of the speech signals as well. This case is the most frequently encountered situation in the real world,

where speech signals from multiple speakers are received by a number of microphones located in the room.

Each microphone acquires the signals produced by every speaker, which consist of several delayed and

modified copies of the original speech sources due to the reflections on the walls and objects of the room.

Depending upon the level and the type of the room noise, the strength of the echoes and the amount of the

reverberation, the resulting signals received by the microphones may be highly distorted, reducing the

efficacy of any Automatic Speech Recognition (ASR) system. As a result, there has been a great necessity

for the use of robust BSS techniques as a front end in order to separate convolutive mixtures of speech

signals, reduce the distortion effects and improve the recognition accuracy of ASR systems in various multi-

simultaneous speaker situations [6-11].

Although a lot of work that deals with the aforementioned linear mixing models has already been done,

presentation of algorithms that deal with more realistic and practical non-linear mixing models has been very

limited [12-22]. In the case of non-linear mixing situations, linear BSS methods fail to extract the

independent sources when employed [16,18], therefore new non-linear BSS (NLBSS) methods or extensions

of existing linear BSS ones must be introduced. Such extensions can be achieved by employing non-linear

separating functions that suitably transform the mixtures so that the outputs become statistically

independent. Still, such an approach, although clear and straightforward, would be difficult, if not

impossible, to implement and run efficiently without first properly limiting the function class for the

de-mixing transforms. In its full generality, the non-linear blind separation problem is intractable since the

indeterminacies in the separating solutions are much more severe than those existing in the linear mixing

case [21]. Basically, this can be mainly attributed to the fact that if x and y are two independent random

variables, any of their non-linear transformations f(x) and g(y) still remain independent. Thus if a BSS

solution is found for the non-linear mixing model, this might be quite different from the correct solution of

the respective non-linear BSS where one would find the original sources. As a result, without any prior

knowledge about the mixing function f of the non-linear mixing model, criteria using only statistical

independence cannot be considered efficient enough for recovering the source signals without distortions

and the non-linearly mixed sources are recovered not only up to any scaling and permutation factor, but up

to any non-linear function as well. So in general, NLBSS is impossible using only the source independence

assumption without any prior knowledge about the non-linear functions. Given these facts, limiting the

function class of the de-mixing functions is equivalent to assuming some prior knowledge about the

non-linear mixing function.
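The independence argument used in this paragraph can be written out in one line. The sketch below uses generic measurable sets A and B, which are notation introduced here for illustration only:

```latex
P\big(f(x)\in A,\ g(y)\in B\big)
  = P\big(x\in f^{-1}(A),\ y\in g^{-1}(B)\big)
  = P\big(x\in f^{-1}(A)\big)\,P\big(y\in g^{-1}(B)\big)
  = P\big(f(x)\in A\big)\,P\big(g(y)\in B\big)
```

Hence f(x) and g(y) are independent whenever x and y are, and an independence criterion alone cannot distinguish the sources from component-wise distorted versions of them.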

In recent years, despite the aforementioned difficulties, a few algorithms that deal with the NLBSS problem

have been presented, some of them based on neural networks. Burel [12] has proposed a neural

network based solution to deal with the case of known nonlinearities that depend on an unknown set of

parameters. The self-organizing map (SOM) has also been used to separate sources in non-linear mixture

situations [13]. However, this method's efficacy is hindered by the exponential growth of the network's

complexity since it requires a huge number of neurons for good accuracy and is restricted to separating

sources with probability density functions with bounded supports. Additionally, the use of SOMs leads to a

high interpolation error when applied for the separation of continuous sources. Deco and Brauer [14] have

studied the very particular and restricting case of volume conserving non-linear transforms. Pajunen et al.

[15] have proposed a non-linear BSS technique based on the Maximum Likelihood. Koutras et al. [16] have

proposed a neural network topology that uses a special case of mixtures of parametric non-linearities that

deals with the NLBSS problem. Lee et al. [17] have addressed the problem of non-linear separation based on

information maximization criteria using a parametric sigmoidal non-linearity and higher order polynomials.

Yang et al. [18] have proposed a neural network topology to separate non-linear mixtures, where the inverse

of the non-linearity can be approximated by a two-layer perceptron with parametric sigmoidal non-linearities

on the network's neurons. Lappalainen [19] and Lappalainen et al. [20] have also proposed an NLBSS

method that separates non-linearly mixed signals using ensemble learning techniques. A recent survey that

describes the NLBSS problem can also be found in [21]. Finally, Taleb et al. [22] have dealt with a special

case of post non-linear mixtures (PNL) using entropy maximization criteria. The PNL mixing model is

useful in practical signal processing applications as it can adequately model the non-linear distortions that

are introduced by sensors. All these approaches deal with the case of memoryless non-linear mixing models.

To our knowledge, work that deals with convolutive non-linear mixing models with memory has not yet been

presented.

In this paper, we present a novel solution to the PNL blind speech separation problem of convolutive

mixtures that is based on a neural network topology. To remove the non-linear distortions imposed by this

model, two non-linear separating functions are implemented: a generalized form of a mixture of parametric

sigmoidal non-linearities as in [16] and higher order odd polynomial based non-linearities. These non-linear

functions fit highly complex non-linear mixing functions well, showing accurate approximation and

satisfactory separating performance [16]. Extensive

phoneme recognition experiments were performed using mixtures of two simultaneous speech signals from

the TIMIT database that were convolutively and post non-linearly mixed using real room impulse responses

and various types of non-linear functions. The separation network's learning rules are derived using the

Maximum Likelihood Estimation (MLE) criterion to estimate the separating filter coefficients and the

parameters of the non-linear separating functions, and they prove effective at separating the

simultaneous speech signals. The phoneme recognition results confirm the proposed method's efficacy:

under all types of non-linear distortion, the recognition rate improves by more than 20% in comparison to

that of the non-linearly distorted speech signals, which leads to a recognition score of 50%. In addition,

the proposed method was found to outperform typical linear BSS methods by 20%, which makes the

integration of non-linear functions necessary in speech separation systems when robust

speech recognition is needed in non-linear and multiple speaker environments.

The structure of this paper is as follows: In section 2, we present the general non-linear BSS mixing and

separating models. In section 3 we derive the learning rules for the separating filter coefficients and the

parameters of the non-linear separating functions using the MLE criterion. In section 4 our experimental

set-up and the phoneme recognition results are presented, and the last section gives some conclusions and

remarks.

2. Post Non-Linear Convolutive Mixtures

Let us have N statistically independent sources {s1, s2,..,sN} that are mixed by the post non-linear model with

memory shown in Fig. 1. A post non-linear mixture model was first introduced by Taleb and Jutten [22] for

the case of memoryless non-linear mixing models. In our approach, we extend this particular model to the

case of non-linear mixing models with memory.

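[Figure 1 diagram: each source sj(t), j = 1, ..., N, passes through the FIR filters Hij(z); the filtered signals are summed at each sensor and passed through the non-linearity fi to give the observation xi(t), i = 1, ..., M.]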

Figure 1. The convolutive non-linear mixture model

The M convolutively and non-linearly mixed signal observations {x1,x2,..,xM} are given by the following

equation:

$$x_i(t) = f_i\left(\sum_{j=1}^{N}\sum_{k=0}^{K} h_{ij}(k)\cdot s_j(t-k)\right), \quad i = 1, 2, \ldots, M \tag{1}$$

where fi is an unknown invertible and differentiable non-linear mixing function and hij(k) are the linear FIR filter

coefficients that are used to model the transfer paths between the ith sensor and the jth source. Throughout

this work, the number of the sources is considered equal to the number of the sensors and is known

beforehand (N=M).
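The mixing model in (1) can be sketched in a few lines of code. The following is a minimal illustration, assuming plain Python lists for the signals; the function name `pnl_convolutive_mix` and the toy signals and filter taps are hypothetical and are not taken from the experiments.

```python
import math

def pnl_convolutive_mix(sources, filters, nonlinearities):
    """Mix sources as in (1): x_i(t) = f_i(sum_j sum_k h_ij(k) * s_j(t - k)).

    sources:        list of N equal-length sample lists s_j
    filters:        filters[i][j] is the list of FIR taps h_ij(0..K)
    nonlinearities: list of M callables f_i
    """
    n_samples = len(sources[0])
    mixtures = []
    for i, f_i in enumerate(nonlinearities):
        x_i = []
        for t in range(n_samples):
            # Linear convolutive stage; samples before t = 0 are taken as zero.
            v = sum(h_k * s_j[t - k]
                    for j, s_j in enumerate(sources)
                    for k, h_k in enumerate(filters[i][j])
                    if t - k >= 0)
            # Post non-linear stage, applied sample by sample.
            x_i.append(f_i(v))
        mixtures.append(x_i)
    return mixtures

# Toy example with made-up signals and filters (not the paper's data).
s = [[1.0, 0.0, -1.0, 0.5], [0.0, 2.0, 0.0, -0.5]]
h = [[[1.0, 0.5], [0.3, 0.0]],
     [[0.2, 0.1], [1.0, 0.4]]]
f = [math.tanh, lambda v: v + 0.3 * v ** 3]
x = pnl_convolutive_mix(s, h, f)
```

The two stages are kept separate on purpose: the inner sum is the ordinary linear convolutive mixture, and the post non-linearity acts only on the already mixed sensor signal, which is what distinguishes the PNL model from a general non-linear mixture.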

To separate the mixed signals xi, we use the NLBSS network shown in Fig. 2 without any prior

knowledge about the convolutive mixture H, the non-linear functions fi and the original sources s(t).

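[Figure 2 diagram: each mixture xj(t) passes through the non-linearity gj to give the hidden-layer output uj(t); the signals uj(t) are filtered by the FIR filters Wij(z) and summed to give the outputs yi(t), i = 1, ..., N.]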

Figure 2. The non-linear convolutive blind signal separation network

The output yi(t) of the neural network is given by the equation:

$$y_i(t) = \sum_{j=1}^{N}\sum_{k=0}^{K} w_{ij}(k)\cdot g_j\big(x_j(t-k)\big) \tag{2}$$

or in a more compact matrix form by:

$$\mathbf{y}(t) = \sum_{k=0}^{K} \mathbf{W}_k \cdot \big[\mathbf{g}(\mathbf{x}(t-k))\big]^T \tag{3}$$

where g = [g1, g2, …, gN] denotes the non-linear transfer functions of the hidden neurons and Wk is a square NxN

matrix that contains the separating filter coefficients wij(k) for the kth filter lag.
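The forward pass in (2) can be sketched as follows. This is an illustrative example only: the function name `separate` is hypothetical, and the demonstration uses a memoryless toy mixture whose inverse is known in advance, whereas the actual network has to learn its filters and non-linearities blindly.

```python
import math

def separate(mixtures, w, g):
    """Network output as in (2): y_i(t) = sum_j sum_k w_ij(k) * g_j(x_j(t - k))."""
    # Hidden layer: u_j(t) = g_j(x_j(t)), one non-linearity per input channel.
    u = [[g_j(v) for v in x_j] for g_j, x_j in zip(g, mixtures)]
    n_samples = len(mixtures[0])
    outputs = []
    for w_i in w:                      # w_i[j] holds the taps w_ij(0..K)
        y_i = [sum(w_k * u[j][t - k]
                   for j, taps in enumerate(w_i)
                   for k, w_k in enumerate(taps)
                   if t - k >= 0)
               for t in range(n_samples)]
        outputs.append(y_i)
    return outputs

# Toy check with a known memoryless mixture x = tanh(A s) (assumed for illustration).
s = [[0.2, -0.1, 0.4], [0.3, 0.1, -0.2]]
a = [[1.0, 0.5], [0.25, 1.0]]
x = [[math.tanh(a[i][0] * s[0][t] + a[i][1] * s[1][t]) for t in range(3)]
     for i in range(2)]
det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
# Single-tap separating filters equal to the inverse of A.
w = [[[a[1][1] / det], [-a[0][1] / det]],
     [[-a[1][0] / det], [a[0][0] / det]]]
y = separate(x, w, [math.atanh, math.atanh])   # y reproduces s up to rounding
```

With g equal to the inverse of the mixing non-linearity and W equal to the inverse mixing matrix, the outputs match the sources, which is the ideal solution the learning rules aim for up to scaling and permutation.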

To approximate the inverses fi^(-1) of the non-linear mixing functions, two different types of non-linear

functions g are used:

a) a mixture of parametric sigmoidal functions in the form of:

$$g(x) = \sum_{i=1}^{P} \tanh(a_i \cdot x + b_i) \tag{4}$$

The non-linear function in (4) is a generalization of previous solutions as in [18] and presents a number

of advantages compared to the standard sigmoid functions used in neural networks. It shows a greater

flexibility towards fitting more complex non-linear mixing functions, which helps achieve a better

demixing accuracy and performance of the separation network as our experimental results show. The

parameters ai and bi control the slope and the position of each component in the mixture of the sigmoids.

b) a polynomial non-linear function of the form:

$$g(x) = c_0 + c_1 \cdot x + c_3 \cdot x^3 + \cdots + c_{2r+1} \cdot x^{2r+1} \tag{5}$$

where 2r+1 is the polynomial’s order.

All unknown parameters of the proposed NLBSS network are learnt adaptively using the proposed rules

introduced in the next section.
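As a concrete reading of (4) and (5), the two families of separating functions can be written as small helper functions. The names `sigmoid_mixture` and `odd_polynomial` and the way the coefficients are passed are choices made for this sketch only.

```python
import math

def sigmoid_mixture(x, a, b):
    """Equation (4): g(x) = sum_i tanh(a_i * x + b_i)."""
    return sum(math.tanh(a_i * x + b_i) for a_i, b_i in zip(a, b))

def odd_polynomial(x, c0, c_odd):
    """Equation (5): g(x) = c_0 + c_1 x + c_3 x^3 + ... + c_{2r+1} x^{2r+1}.

    c_odd lists the odd-power coefficients [c_1, c_3, ..., c_{2r+1}].
    """
    return c0 + sum(c * x ** (2 * m + 1) for m, c in enumerate(c_odd))

# Example values chosen only for illustration.
g_sig = sigmoid_mixture(0.0, a=[1.0, 2.0], b=[0.0, 0.0])     # both terms are tanh(0)
g_poly = odd_polynomial(2.0, c0=1.0, c_odd=[1.0, 0.5])       # 1 + 2 + 0.5 * 8
```

Both families contain the identity-like case (a single nearly linear tanh term, or c1 = 1 with all other coefficients zero), so the network can fall back to an almost linear separator when the distortion is mild.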

3. Learning rules

3.1 The Maximum Likelihood Estimator

For the unsupervised estimation of the separating filters {wij i, j=1,…, N} as well as the non-linear function

parameters {aij, bij i=1,..,N, j=1,..,P} and {cij i=1,..,N, j=1,3,…,2r+1}, we apply the Maximum Likelihood

Estimation (MLE) criterion. From (2) the pdf of the observed data is given by:

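$$p(\mathbf{x}) = \left|\det \mathbf{W}_0\right| \cdot \prod_{i=1}^{N} \left|\frac{\partial g_i(x_i)}{\partial x_i}\right| \cdot p(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta}) \tag{6}$$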

and their log-likelihood by:

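$$L = \log(p(\mathbf{x})) = \log\left|\det \mathbf{W}_0\right| + \sum_{i=1}^{N} \log\left|\frac{\partial g_i(x_i)}{\partial x_i}\right| + \log p(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta}) \tag{7}$$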

where W0 is an NxN matrix with the leading weights of each FIR separating filter wij(0), W is the matrix with

the separating FIR filters wij in its elements and θ is the vector with the parameters of the non-linear

separating functions. The FIR filter coefficients of the network’s output layer and the unknown parameters

of the separating non-linearities can be estimated using the stochastic gradient of L with respect to the Wk,

aij, bij and cij as described below.

3.2 Estimation of the separating filters

The stochastic gradient of the log-likelihood L in (7) with respect to the output layer FIR filter

coefficients Wk for the kth filter lag is estimated by:

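$$\Delta\mathbf{W}_k \propto \frac{\partial L}{\partial \mathbf{W}_k} = \begin{cases} [\mathbf{W}_0^T]^{-1} + \dfrac{p'(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta})}{p(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta})} \cdot \dfrac{\partial \mathbf{y}}{\partial \mathbf{W}_0} = [\mathbf{W}_0^T]^{-1} + \dfrac{p'(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta})}{p(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta})} \cdot \mathbf{u}^T(t), & k = 0 \\[2ex] \dfrac{p'(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta})}{p(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta})} \cdot \dfrac{\partial \mathbf{y}}{\partial \mathbf{W}_k} = \dfrac{p'(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta})}{p(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta})} \cdot \mathbf{u}^T(t-k), & k = 1, 2, \ldots, K \end{cases} \tag{8}$$

where u(t) = g(x(t)) is the output of the hidden layer (Fig. 2),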

which can be written in a more compact matrix form as:

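$$\Delta\mathbf{W}_k \propto \frac{\partial L}{\partial \mathbf{W}_k} = \begin{cases} [\mathbf{W}_0^T]^{-1} - \boldsymbol{\Phi}_1(\mathbf{y}) \cdot \mathbf{u}^T(t), & k = 0 \\[1ex] -\boldsymbol{\Phi}_1(\mathbf{y}) \cdot \mathbf{u}^T(t-k), & k = 1, 2, \ldots, K \end{cases} \tag{9}$$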

where the vector Φ1(y) is given by:

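$$\boldsymbol{\Phi}_1(\mathbf{y}) = -\left[\frac{p_1'(y_1; \mathbf{W}, \boldsymbol{\theta})}{p_1(y_1; \mathbf{W}, \boldsymbol{\theta})}, \ldots, \frac{p_N'(y_N; \mathbf{W}, \boldsymbol{\theta})}{p_N(y_N; \mathbf{W}, \boldsymbol{\theta})}\right]^T = \left[h_1(y_1), \ldots, h_N(y_N)\right]^T \tag{10}$$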

The non-linear function hi(yi) is the negative score function -pi'(yi)/pi(yi) of the assumed source density, and its

choice plays an important role in the efficacy of the signal separation network. For the case of speech signals

it has been shown that their pdf can be approximated by a gamma distribution variant or the Laplacian

density. For the latter case, as Charkani and Deville have reported [23] and various speech separation

applications have shown, the choice of hi(yi) = sign(yi) is considered the best.
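For the Laplacian case, the sign non-linearity follows directly from the definition of hi in (10). A short derivation, assuming a zero-mean Laplacian density with a scale parameter β (β is notation introduced here for illustration):

```latex
p(y) = \frac{1}{2\beta}\, e^{-|y|/\beta}
\quad\Rightarrow\quad
p'(y) = -\frac{\operatorname{sign}(y)}{\beta}\, p(y) \quad (y \neq 0)
\quad\Rightarrow\quad
h(y) = -\frac{p'(y)}{p(y)} = \frac{\operatorname{sign}(y)}{\beta}
```

The constant 1/β only rescales the update and can be absorbed into the learning rate, which leaves h(y) = sign(y).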

By closer inspection we can see that the adaptation rule in (9) presents the drawback that it requires

computationally expensive matrix inversion operations for the estimation of the leading weights for the

separating filters. To overcome this problem we use the natural gradient [24] instead of the stochastic

gradient approach for the estimation of W0, which results in:

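$$\Delta\mathbf{W}_0 \propto \frac{\partial L}{\partial \mathbf{W}_0} \cdot \mathbf{W}_0^T \mathbf{W}_0 = \left([\mathbf{W}_0^T]^{-1} + \frac{p'(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta})}{p(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta})} \cdot \mathbf{u}^T(t)\right) \cdot \mathbf{W}_0^T \mathbf{W}_0 = \left[\mathbf{I} - \boldsymbol{\Phi}_1(\mathbf{y}) \cdot (\mathbf{u}(t))^T \cdot \mathbf{W}_0^T\right] \cdot \mathbf{W}_0 \tag{11}$$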

Using (9) and (11), the separating filter coefficients adaptation can be performed using the following rule:

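$$\begin{aligned} \mathbf{W}_0^{(n+1)} &= \mathbf{W}_0^{(n)} + \mu \cdot \left[\mathbf{I} - \boldsymbol{\Phi}_1(\mathbf{y}) \cdot (\mathbf{u}(t))^T \cdot \mathbf{W}_0^{(n)T}\right] \cdot \mathbf{W}_0^{(n)} \\ \mathbf{W}_k^{(n+1)} &= \mathbf{W}_k^{(n)} - \mu \cdot \boldsymbol{\Phi}_1(\mathbf{y}) \cdot (\mathbf{u}(t-k))^T \end{aligned} \tag{12}$$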

The above learning rules operate online and update the filter coefficients every time a new sample is

presented to the non-linear BSS network.
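One online step of the filter adaptation in (12) might look as follows. This is a minimal sketch under stated assumptions: the function name `update_filters`, the list-of-lists matrix layout and the toy numbers are hypothetical, and the sign non-linearity is used for Φ1 as suggested for Laplacian sources.

```python
def sign(v):
    return (v > 0) - (v < 0)

def update_filters(w, u_hist, mu):
    """One online step of (12). w[k] is the N x N matrix W_k; u_hist[k] is u(t - k)."""
    n = len(w[0])
    # Network output y(t) = sum_k W_k u(t - k), as in (3).
    y = [sum(w[k][i][j] * u_hist[k][j] for k in range(len(w)) for j in range(n))
         for i in range(n)]
    phi = [sign(y_i) for y_i in y]     # Phi_1(y) for a Laplacian source model
    w0 = w[0]
    z = [sum(w0[i][j] * u_hist[0][j] for j in range(n)) for i in range(n)]  # W_0 u(t)
    # Natural-gradient step: W_0 <- W_0 + mu * (I - Phi_1 u^T W_0^T) W_0
    m = [[(i == j) - phi[i] * z[j] for j in range(n)] for i in range(n)]
    new_w0 = [[w0[i][j] + mu * sum(m[i][l] * w0[l][j] for l in range(n))
               for j in range(n)] for i in range(n)]
    # Stochastic-gradient step for the other lags: W_k <- W_k - mu * Phi_1 u(t-k)^T
    new_rest = [[[w[k][i][j] - mu * phi[i] * u_hist[k][j] for j in range(n)]
                 for i in range(n)] for k in range(1, len(w))]
    return [new_w0] + new_rest, y

# Toy step: two channels, two filter lags, identity initialisation (illustrative values).
w = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
u_hist = [[0.5, -1.0], [0.2, 0.1]]
new_w, y = update_filters(w, u_hist, mu=0.1)
```

Note that no matrix inverse appears anywhere in the step: multiplying the gradient by W0^T W0 turns the inverse-transpose term of (9) into W0 itself, which is the reason for using the natural gradient on the leading weights.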

3.3 Estimation of the non-linear separating function parameters

3.3.1 Mixture of sigmoidal non-linearities

Let us denote with aj = [a1j,..,aNj]T the vector containing the slope parameters of the jth mixture term of the

network’s non-linear functions in (4). By taking the stochastic gradient of L with respect to the vector aj we

find after trivial calculations and applying the chain rule of partial derivatives that:

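$$\Delta\mathbf{a}_j \propto \frac{\partial L}{\partial \mathbf{a}_j} = -\left[\boldsymbol{\Phi}_j^{(a)}(\mathbf{x}) + \mathbf{D}_j^{(a)}\, \mathbf{W}_0^T\, \boldsymbol{\Phi}_1(\mathbf{y})\right] \tag{13}$$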

where:

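$$\boldsymbol{\Phi}_j^{(a)}(\mathbf{x}) = -\left[\frac{\dfrac{\partial^2}{\partial a_{1j}\, \partial x_1} \tanh(a_{1j} x_1 + b_{1j})}{\dfrac{\partial}{\partial x_1} \sum_{k=1}^{P} \tanh(a_{1k} x_1 + b_{1k})}, \ \ldots, \ \frac{\dfrac{\partial^2}{\partial a_{Nj}\, \partial x_N} \tanh(a_{Nj} x_N + b_{Nj})}{\dfrac{\partial}{\partial x_N} \sum_{k=1}^{P} \tanh(a_{Nk} x_N + b_{Nk})}\right]^T \tag{14}$$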

and

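$$\mathbf{D}_j^{(a)}(\mathbf{x}) = \mathrm{diag}\left(\frac{\partial}{\partial a_{1j}} \tanh(a_{1j} x_1 + b_{1j}), \ \ldots, \ \frac{\partial}{\partial a_{Nj}} \tanh(a_{Nj} x_N + b_{Nj})\right) \tag{15}$$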

Using similar calculations, the adaptation equation for the biases bj of the non-linear function in (4) is:

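$$\Delta\mathbf{b}_j \propto \frac{\partial L}{\partial \mathbf{b}_j} = -\left[\boldsymbol{\Phi}_j^{(b)}(\mathbf{x}) + \mathbf{D}_j^{(b)} \cdot \mathbf{W}_0^T\, \boldsymbol{\Phi}_1(\mathbf{y})\right] \tag{16}$$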

where,

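$$\boldsymbol{\Phi}_j^{(b)}(\mathbf{x}) = -\left[\frac{\dfrac{\partial^2}{\partial b_{1j}\, \partial x_1} \tanh(a_{1j} x_1 + b_{1j})}{\dfrac{\partial}{\partial x_1} \sum_{k=1}^{P} \tanh(a_{1k} x_1 + b_{1k})}, \ \ldots, \ \frac{\dfrac{\partial^2}{\partial b_{Nj}\, \partial x_N} \tanh(a_{Nj} x_N + b_{Nj})}{\dfrac{\partial}{\partial x_N} \sum_{k=1}^{P} \tanh(a_{Nk} x_N + b_{Nk})}\right]^T \tag{17}$$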

and

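$$\mathbf{D}_j^{(b)}(\mathbf{x}) = \mathrm{diag}\left(\frac{\partial}{\partial b_{1j}} \tanh(a_{1j} x_1 + b_{1j}), \ \ldots, \ \frac{\partial}{\partial b_{Nj}} \tanh(a_{Nj} x_N + b_{Nj})\right) \tag{18}$$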

Using the above equations, the adaptation rules for the non-linear function parameters are given by:

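$$\mathbf{a}_j^{(n+1)} = \mathbf{a}_j^{(n)} - \eta \cdot \left(\boldsymbol{\Phi}_j^{(a),n}(\mathbf{x}) + \mathbf{D}_j^{(a),n}\, \mathbf{W}_0^T\, \boldsymbol{\Phi}_1(\mathbf{y})\right) \tag{19}$$

$$\mathbf{b}_j^{(n+1)} = \mathbf{b}_j^{(n)} - \eta \cdot \left(\boldsymbol{\Phi}_j^{(b),n}(\mathbf{x}) + \mathbf{D}_j^{(b),n}\, \mathbf{W}_0^T\, \boldsymbol{\Phi}_1(\mathbf{y})\right) \tag{20}$$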
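Because the Φ and D terms act element by element, the slope and bias updates can be carried out independently for each channel. The sketch below performs one such step for a single channel, with the derivatives of tanh expanded by hand; the function name `update_sigmoid_params`, the argument `back` (standing for the channel's element of W0^T Φ1(y)) and the numbers are assumptions made for illustration.

```python
import math

def update_sigmoid_params(a, b, x, back, eta):
    """One step of the slope and bias updates for a single channel i.

    a, b: lists of the P slopes a_ij and biases b_ij of g_i
    x:    current mixture sample x_i(t)
    back: the i-th element of W_0^T Phi_1(y)
    """
    t = [math.tanh(a_j * x + b_j) for a_j, b_j in zip(a, b)]
    sech2 = [1.0 - t_j ** 2 for t_j in t]                  # d tanh(z)/dz
    dg_dx = sum(a_j * s_j for a_j, s_j in zip(a, sech2))    # dg_i/dx_i, the shared denominator
    new_a, new_b = [], []
    for a_j, b_j, t_j, s_j in zip(a, b, t, sech2):
        phi_a = -(s_j - 2.0 * a_j * x * s_j * t_j) / dg_dx  # -(d2g/da dx) / (dg/dx)
        phi_b = -(-2.0 * a_j * s_j * t_j) / dg_dx           # -(d2g/db dx) / (dg/dx)
        d_a, d_b = x * s_j, s_j                             # d tanh/da and d tanh/db
        new_a.append(a_j - eta * (phi_a + d_a * back))
        new_b.append(b_j - eta * (phi_b + d_b * back))
    return new_a, new_b

# Illustrative values for a two-term mixture (P = 2).
a, b = [0.8, 0.5], [0.1, -0.2]
new_a, new_b = update_sigmoid_params(a, b, x=0.3, back=0.4, eta=0.01)
```

The step is a gradient ascent on the per-sample quantity log|dg_i/dx_i| - back * g_i(x_i), which is the part of the log-likelihood that depends on this channel's parameters when only the leading weights W0 are taken into account.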

3.3.2 Polynomial non-linear function

Using the same calculations as above, the polynomial coefficients of the second type of non-linear

function (5) can be estimated by calculating the stochastic gradient of the log-likelihood L with respect to the

coefficients cj = [c1j, …, cNj]T:

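$$\Delta\mathbf{c}_j \propto \frac{\partial L}{\partial \mathbf{c}_j} = -\left[\boldsymbol{\Phi}_j^{(c)}(\mathbf{x}) + \mathbf{D}_j^{(c)}\, \mathbf{W}_0^T\, \boldsymbol{\Phi}_1(\mathbf{y})\right] \tag{21}$$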

where,

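$$\boldsymbol{\Phi}_j^{(c)}(\mathbf{x}) = -\left[\frac{\dfrac{\partial^2}{\partial c_{1j}\, \partial x_1}\left(c_{10} + c_{11} x_1 + c_{13} x_1^3 + \cdots + c_{1(2r+1)} x_1^{2r+1}\right)}{\dfrac{\partial}{\partial x_1}\left(c_{10} + c_{11} x_1 + c_{13} x_1^3 + \cdots + c_{1(2r+1)} x_1^{2r+1}\right)}, \ \ldots, \ \frac{\dfrac{\partial^2}{\partial c_{Nj}\, \partial x_N}\left(c_{N0} + c_{N1} x_N + c_{N3} x_N^3 + \cdots + c_{N(2r+1)} x_N^{2r+1}\right)}{\dfrac{\partial}{\partial x_N}\left(c_{N0} + c_{N1} x_N + c_{N3} x_N^3 + \cdots + c_{N(2r+1)} x_N^{2r+1}\right)}\right]^T$$

$$= \begin{cases} -\left[\dfrac{j \cdot x_1^{j-1}}{c_{11} + 3 c_{13} x_1^2 + \cdots + (2r+1)\, c_{1(2r+1)} x_1^{2r}}, \ \ldots, \ \dfrac{j \cdot x_N^{j-1}}{c_{N1} + 3 c_{N3} x_N^2 + \cdots + (2r+1)\, c_{N(2r+1)} x_N^{2r}}\right]^T, & j = 1, 3, \ldots, 2r+1 \\[2ex] \left[0, \ldots, 0\right]^T, & j = 0 \end{cases} \tag{22}$$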

and

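$$\mathbf{D}_j^{(c)}(\mathbf{x}) = \mathrm{diag}\left(\frac{\partial}{\partial c_{1j}}\left(c_{10} + c_{11} x_1 + \cdots + c_{1(2r+1)} x_1^{2r+1}\right), \ \ldots, \ \frac{\partial}{\partial c_{Nj}}\left(c_{N0} + c_{N1} x_N + \cdots + c_{N(2r+1)} x_N^{2r+1}\right)\right) = \begin{cases} \mathrm{diag}\left[x_1^j, \ldots, x_N^j\right], & j = 1, 3, \ldots, 2r+1 \\[1ex] \mathrm{diag}\left[1, \ldots, 1\right], & j = 0 \end{cases} \tag{23}$$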

Similarly, the adaptation rule for the non-linear function parameters is given by:

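$$\mathbf{c}_j^{(n+1)} = \mathbf{c}_j^{(n)} - \eta \cdot \left(\boldsymbol{\Phi}_j^{(c),n}(\mathbf{x}) + \mathbf{D}_j^{(c),n}\, \mathbf{W}_0^T\, \boldsymbol{\Phi}_1(\mathbf{y})\right) \tag{24}$$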
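For the polynomial non-linearity the terms in the update have simple closed forms, so a single-channel step is short. The following sketch assumes the coefficients of one channel are stored in a dictionary keyed by the power j; the function name `update_polynomial_params`, the argument `back` (the channel's element of W0^T Φ1(y)) and the numbers are hypothetical.

```python
def update_polynomial_params(c, x, back, eta):
    """One coefficient update for a single channel.

    c maps each power j (0, 1, 3, ..., 2r+1) to the coefficient c_ij.
    """
    # dg/dx = c_1 + 3 c_3 x^2 + ... + (2r+1) c_{2r+1} x^{2r}
    dg_dx = sum(j * c_j * x ** (j - 1) for j, c_j in c.items() if j > 0)
    new_c = {}
    for j, c_j in c.items():
        phi = 0.0 if j == 0 else -j * x ** (j - 1) / dg_dx   # closed form of the Phi element
        d = x ** j                                           # closed form of the D element (1 for j = 0)
        new_c[j] = c_j - eta * (phi + d * back)
    return new_c

# Third-order example with illustrative values.
c = {0: 0.0, 1: 1.0, 3: 0.5}
new_c = update_polynomial_params(c, x=2.0, back=0.1, eta=0.1)
```

Here dg/dx = 1 + 3 * 0.5 * 4 = 7, so for j = 3 the update is 0.5 - 0.1 * (-12/7 + 8 * 0.1), which is approximately 0.5914.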

4. Experiments

To test the proposed NLBSS method for the case of convolutive non-linear mixtures we conducted several

phoneme recognition experiments, including two simultaneous speakers in a simulated real room

environment in the presence of different types of severe post non-linear distortions. The room was a

typical office room with dimensions of 6.5m x 4.5m x 2.5m, shown in Fig. 3. Two omni-directional

microphones were placed 60cm apart from each other. The first speaker was placed at a distance of 1.5m

directly in front of the first microphone. The second speaker was located at a distance of 2m in front of the

second microphone. To simulate the reverberant environment, a subset of 120 pairs of sentences was

chosen randomly from the testing set of the TIMIT database, together with four real room impulse responses

hij (i,j=1,2) with 512 taps each (K=512). The signals of two speakers were mixed according to (1), with their

Relative Energy Level (REL) ranging from -20 dB to 20 dB with respect to the first speaker, in order to

examine in detail the dependence of the proposed NLBSS method’s performance on a wide range of

different mixing conditions. For the evaluation task, we examined the degradation that non-linear distortions

cause to the recognition accuracy of ASR systems in a non-linear multi-simultaneous speaker environment,

and measured the improvement that our method achieves when used in such environments.
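Setting a given Relative Energy Level before mixing amounts to rescaling one of the two signals. The sketch below assumes that REL is the energy of the first speaker relative to the second in dB, REL = 10 log10(E1/E2); this reading is an assumption made here, as is the function name `scale_to_rel`.

```python
import math

def scale_to_rel(s1, s2, rel_db):
    """Rescale s2 so that 10*log10(E1/E2) equals rel_db, where E is the signal energy."""
    e1 = sum(v * v for v in s1)
    e2 = sum(v * v for v in s2)
    gain = math.sqrt(e1 / (e2 * 10 ** (rel_db / 10)))
    return [gain * v for v in s2]

def rel_db(s1, s2):
    """Relative energy level of s1 with respect to s2 in dB (assumed definition)."""
    return 10 * math.log10(sum(v * v for v in s1) / sum(v * v for v in s2))

# Toy signals; the five levels match the REL values used in the experiments.
s1 = [1.0, -1.0, 2.0]
s2 = [0.5, 0.5, 1.0]
levels = [rel_db(s1, scale_to_rel(s1, s2, r)) for r in (-20, -10, 0, 10, 20)]
```

Under this definition a REL of -20 dB makes the second speaker dominant, which is the high-interference condition for the first output channel.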

[Figure 3 diagram: room layout showing Speaker 1, Speaker 2, Mic 1 and Mic 2, with labeled distances of 60 cm, 1.5 m, 2.0 m, 1.5 m and 3 m.]

Figure 3. The speakers / microphones topology used in the experiments

The functions that were used (Fig. 4) for the post non-linear mixing step are:

(a) $f_1(x) = \tanh(0.5x + 0.3)$, $f_2(x) = \tanh(0.5x + 0.3)$

(b) $f_1(x) = 0.1x + \tanh(0.5x + 1.2)$, $f_2(x) = x + 0.3x^3$ (25)

(c) $f_1(x) = \tanh(0.3x + 2x^3)$, $f_2(x) = \tanh(0.1x + 2x^3)$

and range from nearly linear (case a) to highly non-linear (case c), covering a wide area of non-linear

mixing distortions.
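The three pairs of mixing functions in (25) translate directly into code. The dictionary layout and the helper `is_monotone` below are choices made for this sketch; the helper only checks on a grid that each function is non-decreasing over the plotted range, which is a numerical hint of the invertibility assumed in section 2 rather than a proof.

```python
import math

# The post non-linear mixing functions (f1, f2) for cases (a), (b) and (c) of (25).
MIXING_FUNCTIONS = {
    "a": (lambda x: math.tanh(0.5 * x + 0.3),
          lambda x: math.tanh(0.5 * x + 0.3)),
    "b": (lambda x: 0.1 * x + math.tanh(0.5 * x + 1.2),
          lambda x: x + 0.3 * x ** 3),
    "c": (lambda x: math.tanh(0.3 * x + 2 * x ** 3),
          lambda x: math.tanh(0.1 * x + 2 * x ** 3)),
}

def is_monotone(f, lo=-2.0, hi=2.0, steps=400):
    """Check on a grid that f is non-decreasing over [lo, hi]."""
    grid = [lo + (hi - lo) * n / steps for n in range(steps + 1)]
    values = [f(v) for v in grid]
    return all(later >= earlier for earlier, later in zip(values, values[1:]))

all_monotone = all(is_monotone(f) for pair in MIXING_FUNCTIONS.values() for f in pair)
```

Case (c) saturates very quickly because of the cubic term inside the tanh, which is consistent with its description as the most difficult mixing case.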

[Figure 4 plots: six panels showing the mixing functions f1 and f2 for cases (a), (b) and (c), each over the input range -2 to 2.]

Figure 4. The non-linear mixing functions used in the experiments

4.1 The Phoneme Recognition System

The feature extraction module of the automatic speech recognizer that was used throughout our experiments

extracted the 12 Mel-cepstrum coefficients plus the energy parameter from the speech signals. The mean

value of the Mel-cepstrum coefficients was subtracted from each coefficient and the first- and second- order

differences were formed to capture the dynamic evolution of speech signals, resulting in a total number of

39 parameters. The phoneme recognition decoder was based on Continuous Density Hidden Markov Models

(CDHMM). Each phoneme-unit HMM was a five-state left-to-right CDHMM with no state skip. The output

distribution probabilities were modeled by means of a Gaussian component with diagonal covariance matrix.

The classification was achieved by reaching the maximum probability of the observation sequence for each

phoneme model. In the training process the segmental K-Means algorithm was used to estimate each

CDHMM's parameters from multiple observations. The complete training set from the TIMIT database was

used for the training of the recognition system, while for the testing we used the 120 pairs of sentences taken

from the testing set. For the recognition experiments, a set of 39 different phoneme categories was

employed.
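The way 13 static features grow into 39 parameters can be sketched as follows. This is an illustration only: the frame layout (12 cepstral coefficients followed by the energy), the function names and the simple central-difference formula are assumptions, since the exact difference formula used by the recognizer is not stated.

```python
def subtract_cepstral_mean(frames, n_cepstra=12):
    """Remove the per-utterance mean from the first n_cepstra coefficients of every frame."""
    means = [sum(f[i] for f in frames) / len(frames) for i in range(n_cepstra)]
    return [[v - means[i] if i < n_cepstra else v for i, v in enumerate(f)]
            for f in frames]

def add_dynamic_features(frames):
    """Append first- and second-order differences to each static feature vector."""
    def diff(seq):
        # Central difference with the edge frames repeated (an assumed, simple formula).
        padded = [seq[0]] + seq + [seq[-1]]
        return [[(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
                for t in range(len(seq))]
    delta = diff(frames)
    delta2 = diff(delta)
    return [s + d + dd for s, d, dd in zip(frames, delta, delta2)]

# Five toy frames of 13 static features each (12 cepstra plus energy).
static = [[float(t)] * 13 for t in range(5)]
features = add_dynamic_features(subtract_cepstral_mean(static))
```

Each output vector holds 13 static values, 13 first-order differences and 13 second-order differences, which gives the 39 parameters per frame mentioned above.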

4.2 Experimental Results

Extensive phoneme recognition experiments using the aforementioned speech recognition system were

carried out by processing the speakers' clean recordings (CLEAR) from the TIMIT speech database, the non-

linear convolutive speech mixtures acquired by the two microphones (MIXED), the resulting separated

speech signals estimated by common linear BSS methods [6] (LBSS), the resulting separated speech signals

estimated by our proposed NLBSS algorithms (12), (19), (20) (NLBSS1), and the separated speech signals

resulting from (24) (NLBSS2).

[Figure 5 data: recognition score (%) versus REL (dB)]

REL (dB)    -20     -10     0       10      20
CLEAR       60.72   60.72   60.72   60.72   60.72
MIXED       25.12   28.41   32.15   38.17   45.12
LBSS        30.25   34.05   39.65   43.22   46.57
NLBSS(1)    49.56   52.56   56.65   58.72   59.68
NLBSS(2)    48.52   50.91   56.91   58.63   59.77

Figure 5. Percent phoneme recognition results for the first channel and the nonlinear mixing function (25a)

[Figure 6 data: recognition score (%) versus REL (dB)]

REL (dB)    -20     -10     0       10      20
CLEAR       58.16   58.16   58.16   58.16   58.16
MIXED       42.15   38.54   31.56   26.44   22.12
LBSS        43.25   42.14   35.58   32.05   27.98
NLBSS(1)    56.12   55.05   52.14   49.85   46.35
NLBSS(2)    55.62   54.93   51.97   49.14   46.03

Figure 6. Percent phoneme recognition results for the second channel and the nonlinear mixing function (25a)

The phoneme recognition results for the first and easiest non-linear mixing scenario are shown in

Fig. 5 and Fig. 6 for both output channels. It can be observed that the proposed NLBSS networks perform

well and manage not only to separate the competing speakers' speech signals at every Relative Energy Level,

but also to approximate well the inverse of the non-linear mixing functions, reaching an average

recognition score of 55% and 52% for the first and second output channels respectively, an improvement of

more than 20 percentage points in comparison to the non-linearly distorted signals. In contrast, LBSS

performs poorly, reaching a phoneme recognition score of 38% and 36%, an improvement of only about 5

percentage points. This justifies integrating non-linear separating functions into speech separation systems

when robust speech recognition is needed in a non-linear, competing-speaker environment. For the approximation of the inverse of the

non-linear mixing functions we have tried successively a mixture of one to ten components of parametric

sigmoid functions. However, further increase of the terms of the non-linear function beyond six did not

contribute significantly to improving the phoneme recognition score of the ASR system. Similar

experiments were conducted for the polynomial separating function that showed that the best results were

achieved when using a seventh order odd polynomial. Finally, it is noticeable that the best improvement of the

recognition accuracy was measured in the case of high interference (REL = -20 dB and 20 dB for the first and the

second speaker respectively), where our method achieved a phoneme recognition rate of 49% and 46% for the

first and second output channels, an improvement of more than 24 percentage points.
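The averages quoted for the first output channel can be reproduced from the Fig. 5 scores with a few lines of arithmetic. The variable names below are chosen for this sketch; the numbers are the first-channel scores for REL = -20, -10, 0, 10 and 20 dB.

```python
# First-channel recognition scores (%) from Fig. 5, ordered by REL = -20 ... 20 dB.
FIG5 = {
    "MIXED": [25.12, 28.41, 32.15, 38.17, 45.12],
    "LBSS": [30.25, 34.05, 39.65, 43.22, 46.57],
    "NLBSS(1)": [49.56, 52.56, 56.65, 58.72, 59.68],
}

def mean(values):
    return sum(values) / len(values)

averages = {name: mean(scores) for name, scores in FIG5.items()}
gain_nlbss = averages["NLBSS(1)"] - averages["MIXED"]   # about 21.6 percentage points
gain_lbss = averages["LBSS"] - averages["MIXED"]        # about 5.0 percentage points
gain_at_minus_20 = FIG5["NLBSS(1)"][0] - FIG5["MIXED"][0]   # about 24.4 percentage points
```

The averages come out at roughly 33.8% for the mixed signals, 38.7% for LBSS and 55.4% for NLBSS(1), in line with the figures discussed above; the improvements are differences in recognition score, that is, percentage points.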

[Figure 7 data: recognition score (%) versus REL (dB)]

REL (dB)    -20     -10     0       10      20
CLEAR       60.72   60.72   60.72   60.72   60.72
MIXED       23.25   25.47   28.02   31.25   36.84
LBSS        25.45   27.88   29.65   31.44   34.01
NLBSS(1)    45.12   48.56   50.54   53.86   58.98
NLBSS(2)    44.28   46.93   49.12   52.16   56.43

Figure 7. Percent phoneme recognition results for the first channel and the nonlinear mixing function (25b)

[Figure 8 data: recognition score (%) versus REL (dB)]

REL (dB)    -20     -10     0       10      20
CLEAR       58.16   58.16   58.16   58.16   58.16
MIXED       32.54   30.56   29.12   24.56   20.56
LBSS        30.12   28.45   25.36   24.58   23.64
NLBSS(1)    54.25   53.12   49.86   45.85   43.52
NLBSS(2)    53.76   52.18   48.96   44.91   42.38

Figure 8. Percent phoneme recognition results for the second channel and the nonlinear mixing function (25b)

For the second non-linear mixing case, the phoneme recognition results are shown in Figs. 7 and 8. Again, our

methods manage to separate the signals acquired by both microphones and remove their non-linear distortion,

reaching an average recognition score of 51.4% and 49.7% with the two non-linear separating

functions for the first output channel and 49% and 48.4% for the second output channel respectively, while the

phoneme recognition improvement was measured at 22.45% and 20.76% for the first and 21.8% and 21.07%

for the second output channel compared to the recognition accuracy obtained from the microphone signals.

However, the most important improvement in recognition accuracy is observed in the case of high

interference, where our methods achieved an average phoneme recognition rate of 44% and 42% for the

two output channels compared to the corresponding 23% and 20% from the mixed signals, a

significant improvement of more than 20%. It is also evident that the LBSS algorithm performs

disappointingly, improving the recognition accuracy by only 1% for both output

channels.

[Figure 9 data: recognition score (%) versus REL (dB)]

REL (dB)    -20     -10     0       10      20
CLEAR       60.72   60.72   60.72   60.72   60.72
MIXED       20.58   22.12   25.47   30.31   33.55
LBSS        27.48   28.05   29.68   30.30   30.78
NLBSS(1)    43.15   45.58   49.68   53.78   53.81
NLBSS(2)    42.18   43.54   46.12   49.91   51.63

Figure 9. Percent phoneme recognition results for the first channel and the nonlinear mixing function (25c)

[Figure 10 data: recognition score (%) versus REL (dB)]

REL (dB)    -20     -10     0       10      20
CLEAR       58.16   58.16   58.16   58.16   58.16
MIXED       29.54   26.35   23.13   20.14   18.95
LBSS        28.54   26.59   26.41   22.57   20.69
NLBSS(1)    50.14   49.62   47.83   43.63   40.29
NLBSS(2)    49.16   47.33   45.68   41.22   38.91

Figure 10. Percent phoneme recognition results for the second channel and the nonlinear mixing function (25c)

Finally, the corresponding experimental results are shown in Figs. 9 and 10 for the last and most difficult case of

non-linear mixture of the speech signals. Both separating methods NLBSS1 and NLBSS2 work well and

achieve a recognition accuracy of 49% and 46% for the first output channel and 46% and 44% for the

second channel, with an average improvement of 22% and 21% respectively compared to the poor 26.4% and

23.6% of the highly distorted mixed signals. Both NLBSS networks outperform the linear BSS network,

resulting in a gain of more than 18% in phoneme recognition accuracy, using a mixture of 7 parametric sigmoidal terms

and a 13th order odd-polynomial function.

The significance of the proposed NLBSS methods for the case of non-linear distortions can also be

seen by closer examination of our experimental results in the case of low interference. In

such conditions the speech signals are distorted mainly by the non-linear mixing functions, while the

distortion imposed by the convolutive mixture of speech signals is, as expected, of lower significance. In

these conditions the LBSS method did not succeed in improving the phoneme recognition accuracy of the ASR

system, and in some cases degraded it, in contrast to the proposed NLBSS techniques introduced

in this paper, which removed various kinds of non-linear distortions successfully, reaching a recognition rate

improvement of more than 20% for both output channels, especially when difficult non-linearities were

implemented (Fig. 4b, 4c).

Fig. 11 illustrates our algorithms' performance: the joint distribution of the

separated speech signals using both non-linear function approximations matches the joint distribution of the

clean signals, indicating statistical independence and good separation quality, in contrast to the poor separation

efficacy of the linear BSS method under non-linear distortions.

Figure 11. Scatter plots of the joint pdf for the case of (a) clean, (b) non-linearly and convolutively mixed

signals using (25b) (c) separated speech signals using the LBSS method (d) separated speech signals using

mixture of sigmoid separating functions and (e) separated speech signals using odd-polynomial separating

functions.

5. Conclusions

In this paper we presented a novel solution to the Blind Speech Separation problem of convolutive and post

non-linear mixtures of speech signals based on a neural network topology that uses a mixture of parametric

sigmoid and higher order odd polynomial non-linear functions. The proposed learning rules were derived

using the Maximum Likelihood Estimation criterion. Extensive speech recognition experiments in various

mixing situations with high non-linear distortions have confirmed our method's efficacy: it improved the

phoneme recognition accuracy of the separated speech signals significantly. Furthermore, the proposed

methods were found to outperform standard linear BSS techniques, which justifies integrating non-linear

functions into speech separation systems when robust speech recognition is needed in non-linear

and multiple speaker environments.

References

[1] S. Amari, & A. Cichocki, Adaptive Blind Signal Processing-Neural Network Approaches, IEEE

Proceedings, 86(10), 1998, 2026-2048.

[2] J. Cardoso, Blind Signal Separation: Statistical Principles, IEEE Proceedings, 86(10), 1998, 2009-

2025.

[3] M. Girolami, Self-Organising Neural Networks. Independent Component Analysis and Blind Source

Separation (Springer-Verlag, 1999).

[4] S. Haykin, Unsupervised Adaptive Filtering, Vol. I Blind Source Separation (J. Wiley & Sons, 2000).

[5] T. W. Lee, Independent Component Analysis (Kluwer Academic Publishers, 1998).

[6] A. Koutras, E. Dermatas, & G. Kokkinakis, Blind separation of speakers in noisy reverberant

environments: A neural network approach, Neural Network World Journal, 10(4), 2000, 619-630.

[7] A. Koutras, E. Dermatas, & G. Kokkinakis, Blind speech separation of moving speakers in real

reverberant environments, Proc. IEEE Conf. On Acoustic Speech and Signal Processing, Istanbul,

Turkey, 2000, Vol. 2, 1133-1136.

[8] K. Yen, J. Huang, & Y. Zhao, Co-channel speech separation in the presence of correlated and

uncorrelated noises, Proc. EuroSpeech, Budapest, Hungary, 1999, Vol. 6, 2587-2590.

[9] K. Yen, & Y. Zhao, Co-channel speech separation for robust Automatic Speech recognition, Proc.

IEEE Conf. On Acoustic Speech and Signal Processing, Munich, Germany, 1997, Vol. 2, 859-862.

[10] T. W. Lee, A. Bell, & R. Lambert, Blind separation of delayed and convolved sources, in Advances in

Neural Information Processing Systems, 9 (MIT Press, Cambridge MA, 1997), 758-764.

[11] S. Choi, & A. Cichocki, Adaptive blind separation of speech signals: Cocktail Party Problem, Proc.

International Conference of Speech Processing, Seoul, Korea, 1997, 617-622.

[12] G. Burel, Blind separation of sources: A nonlinear neural algorithm, Neural Networks, 5(6), 1992,

937-947.

[13] P. Pajunen, A. Hyvarinen, & J. Karhunen, Nonlinear blind source separation by self-organizing maps,

Proc. 3rd International Conference on Neural Information Processing, Hong Kong, 1996, Vol 2,

1207-1210.

[14] G. Deco, & W. Brauer, Nonlinear higher-order statistical decorrelation by volume-conserving

architectures, Neural Networks, 8(4), 1995, 525-535.

[15] P. Pajunen, & J. Karhunen, A maximum likelihood approach to non-linear blind separation, Proc.

International Conference on Artificial Neural Networks, Lausanne, Switzerland, 1997, 541-546.

[16] A. Koutras, E. Dermatas, & G. Kokkinakis, Neural Network approach to non-linear Independent

Component Analysis, to be published in Lecture notes in Computer Science, (Springer-Verlag, 2001).

[17] T. W. Lee, B. Koehler, & R. Orglmeister, Blind source separation of nonlinear mixing models, Proc.

IEEE International Workshop on Neural Network for Signal Processing, Florida, USA, 1997, 406-415.

[18] H. Yang, S. Amari, & A. Cichocki, Information theoretic approach to blind separation of sources in

non-linear mixture, Signal Processing, 64(3), 1998, 291-300.

[19] H. Valpola, Nonlinear Independent component Analysis using ensemble learning: Theory, Proc. 2nd

Int. Workshop on Independent Component Analysis and Blind Signal Separation, Espoo, Finland,

2000, 251-256.

[20] H. Valpola, X. Giannakopoulos, A. Honkela, & J. Karhunen, Nonlinear Independent component

Analysis using ensemble learning: Experiments and discussion, Proc. 2nd Int. Workshop on

Independent Component Analysis and Blind Signal Separation, Espoo, Finland, 2000, 351-356.

[21] J. Karhunen, Nonlinear Independent Component Analysis, in Roberts & Everson (Eds.), Independent

Component Analysis: Principles and Practice, (Cambridge University Press, 2001).

[22] A. Taleb, & C. Jutten, Nonlinear source separation: the post-nonlinear mixtures, Proc. European

Symposium of Artificial Neural Networks, Bruges, Belgium, 1997, 279-284.

[23] N. Charkani, & Y. Deville, Optimization of the asymptotic performance of time-domain convolutive

source separation algorithms, Proc. European Symposium of Artificial Neural Networks, Bruges,

Belgium, 1997, 273-278.

[24] S. Amari, Natural Gradient works efficiently in learning, Neural Computation, 10(2), 1998, 251-276.
