BLIND SPEECH SEPARATION OF NON-LINEAR CONVOLUTIVE
MIXTURES FOR ROBUST SPEECH RECOGNITION
Athanasios Koutras, Evangelos Dermatas and George Kokkinakis
WCL, Electrical and Computer Engineering, University of Patras
26100 Patras, HELLAS.
koutras@giapi.wcl2.ee.upatras.gr
Abstract
In this paper we present a novel solution to the convolutive and post non-linear Blind Speech Separation
(NLBSS) problem. The non-linear separating functions are chosen to be a mixture of parametric sigmoid
functions and higher order odd polynomial functions. The estimation of the separating filter coefficients and
the parameters of the separating non-linear functions is derived using the Maximum Likelihood Estimation
principle. Extensive experiments including different types of complex non-linear mixing functions and real
room impulse responses were carried out to simulate a mixing scenario of two speakers that are talking
simultaneously in a real room environment under the effect of high non-linear distortions. Our proposed
method succeeded in separating the non-linear mixture signals and improved the phoneme recognition
accuracy of an automatic speech recognition system by more than 20% in comparison to the accuracy
measured with the non-linear mixture signals. Furthermore, the method was found to outperform standard
linear Blind Speech Separation methods by 20%, justifying the integration of non-linear separating
functions in speech recognition systems operating in non-linear, multi-simultaneous-speaker environments.
Keywords: Blind Speech separation, non-linear mixtures, speech recognition, Maximum Likelihood.
1. Introduction
The problem of Blind Source Separation (BSS) primarily consists in recovering a set of statistically
independent signals or sources given only a number of observed mixture data. The term ''blind'' is justified
by the fact that the only a-priori knowledge that we have for the signals is their statistical independence and
no other information about the mixing model parameters and the transfer paths from the sources to the
sensors is available beforehand.
Generally, a great number of algorithms for BSS have been proposed over the last 15 years that deal with
the linear instantaneous (memoryless) mixture of sources [1-5] as well as with the situation of linear
convolutive mixtures [3-5]; some of them have been already tested in the case of multiple competing
speakers showing great performance [6-11]. The case of instantaneous mixture of speech signals is the
simplest one and can be encountered when multiple speakers are talking simultaneously in an anechoic
room. However, when dealing with real room acoustics one has to consider the echoes and the delayed
versions of the speech signals as well. This case is the most frequently encountered situation in real world,
where speech signals from multiple speakers are received by a number of microphones located in the room.
Each microphone acquires the signals produced by every speaker, which consist of several delayed and
modified copies of the original speech sources due to reflections on the walls and objects of the room.
Depending upon the level and the type of the room noise, the strength of the echoes and the amount of the
reverberation, the resulting signals received by the microphones may be highly distorted minimizing the
efficacy of any Automatic Speech Recognition (ASR) system. As a result, there has been a great necessity
for the use of robust BSS techniques as a front end in order to separate convolutive mixtures of speech
signals, reduce the distortion effects and improve the recognition accuracy of ASR systems in various multi-
simultaneous speaker situations [6-11].
Although a lot of work on the aforementioned linear mixing models has already been done, algorithms that
deal with the more realistic and practical non-linear mixing models remain very limited [12-22]. In
non-linear mixing situations, linear BSS methods fail to extract the
independent sources when employed [16,18], therefore new non-linear BSS (NLBSS) methods or extensions
of existing linear BSS ones must be introduced. Such extensions can be achieved by employing non-linear
separating functions that suitably transform the mixtures so that the outputs become statistically
independent. Still, such an approach, although clear and straightforward, would be difficult, if not
impossible, to implement and make work efficiently without first properly limiting the function class of the
de-mixing transforms. In its full generality, the non-linear blind separation problem is intractable, since the
indeterminacies in the separating solutions are much more severe than those existing in the linear mixing
case [21]. This can be attributed mainly to the fact that if x and y are two independent random
variables, any non-linear transformations f(x) and g(y) of them still remain independent. Thus if a BSS
solution is found for the non-linear mixing model, this might be quite different from the correct solution of
the respective non-linear BSS where one would find the original sources. As a result, without any prior
knowledge about the mixing function f of the non-linear mixing model, criteria using only statistical
independence cannot be considered efficient enough for recovering the source signals without distortions
and the non-linearly mixed sources are recovered not only up to any scaling and permutation factor, but up
to any non-linear function as well. So, in general, NLBSS is impossible using only the source independence
assumption, without any prior knowledge about the non-linear functions. Under these facts, limiting the
function class of the de-mixing functions is equivalent to assuming some prior knowledge about the
non-linear mixing function.
Despite the aforementioned difficulties, in recent years a few algorithms that deal with the NLBSS problem
have been presented, some of them based on neural networks. Burel [12] has proposed a neural
network based solution to deal with the case of known nonlinearities that depend on an unknown set of
parameters. The self-organizing map (SOM) has been also used to separate sources in non-linear mixture
situations [13]. However this method’s efficacy is hindered by the exponential growth of the network’s
complexity since it requires a huge number of neurons for good accuracy and is restricted to separating
sources with probability density functions with bounded supports. Additionally, the use of SOMs leads to a
high interpolation error when applied for the separation of continuous sources. Deco and Brauer [14] have
studied the very particular and restricting case of volume conserving non-linear transforms. Pajunen et al.
[15] have proposed a non-linear BSS technique based on the Maximum Likelihood. Koutras et al. [16] have
proposed a neural network topology that uses a special case of mixtures of parametric non-linearities that
deals with the NLBSS problem. Lee et al. [17] have addressed the problem of non-linear separation based on
information maximization criteria using a parametric sigmoidal non-linearity and higher order polynomials.
Yang et al. [18] have proposed a neural network topology to separate non-linear mixtures, where the inverse
of the non-linearity can be approximated by a two-layer perceptron with parametric sigmoidal non-linearities
on the network’s neurons. Lappalainen [19] and Lappalainen et al. [20] have also proposed a NLBSS
method that separates non-linearly mixed signals using ensemble learning techniques. A recent survey that
describes the NLBSS problem can also be found in [21]. Finally, Taleb et al. [22] have dealt with a special
case of post non-linear mixtures (PNL) using entropy maximization criteria. The PNL mixing model is
useful in practical signal processing applications as it can adequately model the non-linear distortions that
are introduced by sensors. All these approaches deal with the case of memoryless non-linear mixing models.
Work that deals with the convolutive non-linear mixing models with memory has not been yet presented to
our knowledge.
In this paper, we present a novel solution to the PNL blind speech separation problem of convolutive
mixtures that is based on a neural network topology. To extract the non-linear distortions imposed by this
model, two non-linear separating functions are implemented: a generalized form of a mixture of parametric
sigmoidal non-linearities, as in [16], and higher-order odd-polynomial non-linearities. These non-linear
functions have been shown to fit better in situations that involve highly complex non-linear mixing
functions, offering accurate approximation and satisfactory separating performance [16]. Extensive
phoneme recognition experiments were performed using mixtures of two simultaneous speech signals from
the TIMIT database that were convolutively and post non-linearly mixed using real room impulse responses
and various types of non-linear functions. The separation network’s learning rules are derived using the
Maximum Likelihood Estimation (MLE) criterion to estimate the separating filters coefficients and the
parameters of the non-linear separating functions and show great efficacy when used to separate the
simultaneous speech signals. Phoneme recognition results have proved the proposed method's efficacy,
achieving an improvement of more than 20% in the recognition rate in comparison to the accuracy on the
non-linearly distorted speech signals, under all types of non-linear distortions, leading to a recognition
score of 50%. In addition, the proposed method was found to outperform typical linear BSS methods by
20%, making the integration of non-linear functions in speech separation systems necessary when robust
speech recognition is needed in non-linear and multiple-speaker environments.
The structure of this paper is as follows: In section 2, we present the general non-linear BSS mixing and
separating models. In section 3 we derive the learning rules for the separating filter coefficients and the non-
linear separation functions parameters using the MLE criterion. In section 4 our experimental set-up and the
phoneme recognition results are presented and finally in the last section some conclusions and remarks are
given.
2. Post Non-Linear Convolutive Mixtures
Let us have N statistically independent sources {s1, s2,..,sN} that are mixed by the post non-linear model with
memory shown in Fig. 1. A post non-linear mixture model was first introduced by Taleb and Jutten [22] for
the case of memoryless non-linear mixing models. In our approach, we extend this particular model to the
case of non-linear mixing models with memory.
[Figure 1: the sources s1(t)…sN(t) pass through the FIR mixing filters Hij(z), are summed at each sensor, and then pass through the non-linear functions f1…fM to give the observations x1(t)…xM(t).]
Figure 1. The convolutive non-linear mixture model
The M convolutively and non-linearly mixed signal observations {x1,x2,..,xM} are given by the following
equation:
x_i(t) = f_i\left( \sum_{j=1}^{N} \sum_{k=0}^{K} h_{ij}(k)\, s_j(t-k) \right), \quad i = 1, 2, \ldots, M \qquad (1)
where fi is an unknown invertible and derivable non-linear mixing function and hij(k) are the linear FIR filter
coefficients that are used to model the transfer paths between the ith sensor and the jth source. Throughout
this work, the number of the sources is considered equal to the number of the sensors and is known
beforehand (N=M).
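To make the mixing model concrete, the following is a minimal numpy sketch of Eq. (1). The filter lengths, the Laplacian surrogate sources, and the tanh sensor functions are illustrative choices only, not the paper's experimental setup:

```python
import numpy as np

def pnl_convolutive_mix(s, H, f):
    """Mix N sources into M observations following Eq. (1):
    x_i(t) = f_i( sum_j sum_k h_ij(k) * s_j(t - k) ).

    s : (N, T) array of source signals
    H : (M, N, K+1) array of FIR mixing filters h_ij(k)
    f : list of M callables, the post non-linear sensor functions
    """
    M, N, _ = H.shape
    T = s.shape[1]
    x = np.zeros((M, T))
    for i in range(M):
        lin = np.zeros(T)
        for j in range(N):
            # linear convolutive part h_ij * s_j, truncated to T samples
            lin += np.convolve(s[j], H[i, j])[:T]
        x[i] = f[i](lin)  # post non-linear sensor distortion
    return x

# illustrative 2x2 setup: Laplacian (speech-like) sources, short random filters
rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))
H = rng.normal(scale=0.3, size=(2, 2, 8))
H[0, 0, 0] = 1.0
H[1, 1, 0] = 1.0  # unit direct paths
f = [lambda v: np.tanh(0.5 * v + 0.3), lambda v: np.tanh(0.5 * v + 0.3)]
x = pnl_convolutive_mix(s, H, f)
```

Because the sensor functions here are tanh, the observations are bounded in (-1, 1), which is the kind of saturation distortion the PNL model is meant to capture.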
To separate the mixed signals xi, we use the NLBSS network shown in Fig. 2 without any prior
knowledge about the convolutive mixture H, the non-linear functions fi and the original sources s(t).
[Figure 2: the observations x1(t)…xN(t) pass through the non-linearities g1…gN, giving u1(t)…uN(t), which are filtered by the FIR separating filters Wij(z) and summed to give the outputs y1(t)…yN(t).]
Figure 2. The non-linear convolutive blind signal separation network
The output yi(t) of the neural network is given by the equation:
y_i(t) = \sum_{j=1}^{N} \sum_{k=0}^{K} w_{ij}(k)\, g_j\big(x_j(t-k)\big) \qquad (2)
or in a more compact matrix form by:
\mathbf{y}(t) = \sum_{k=0}^{K} \mathbf{W}_k^T\, \mathbf{g}\big(\mathbf{x}(t-k)\big) \qquad (3)
where g = [g1, g2, …, gN] denotes the hidden neurons' non-linear transfer functions and Wk is a square NxN
matrix that contains the separating filter coefficients wij(k) for the kth filter lag.
To approximate the inverses f_i^{-1} of the non-linear mixing functions, two different types of non-linear
functions g are used:
a) a mixture of parametric sigmoidal functions in the form of:
g(x) = \sum_{i=1}^{P} a_i \tanh(x + b_i) \qquad (4)
The non-linear function in (4) is a generalization of previous solutions such as [18] and presents a number
of advantages compared to the standard sigmoid functions used in neural networks. It shows greater
flexibility in fitting more complex non-linear mixing functions, which helps achieve better demixing
accuracy and performance of the separation network, as our experimental results show. The parameters ai
and bi control the slope and the position of each component in the mixture of sigmoids.
b) a polynomial non-linear function of the form:
g(x) = c_0 + c_1 x + c_3 x^3 + \cdots + c_{2r+1} x^{2r+1} \qquad (5)
where 2r+1 is the polynomial’s order.
All unknown parameters of the proposed NLBSS network are learnt adaptively using the proposed rules
introduced in the next section.
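The two separating function families in (4) and (5) can be sketched as follows. Note that the exact parameterization of the sigmoid mixture is our reading of the (damaged) original equation, so treat it as an assumption:

```python
import numpy as np

def g_sigmoid_mixture(x, a, b):
    """Mixture of P parametric sigmoids as in Eq. (4):
    g(x) = sum_p a_p * tanh(x + b_p)."""
    x = np.asarray(x, dtype=float)
    return sum(ap * np.tanh(x + bp) for ap, bp in zip(a, b))

def g_odd_polynomial(x, c):
    """Odd polynomial of order 2r+1 as in Eq. (5):
    g(x) = c0 + c1*x + c3*x**3 + ... + c_{2r+1}*x**(2r+1).
    `c` holds only the constant and odd-power coefficients, in order."""
    x = np.asarray(x, dtype=float)
    # exponents 0, 1, 3, 5, ..., 2r+1 matching the entries of c
    powers = [0] + list(range(1, 2 * len(c) - 2, 2))
    return sum(ck * x ** p for ck, p in zip(c, powers))
```

For example, `g_odd_polynomial(x, [c0, c1, c3])` evaluates c0 + c1*x + c3*x**3, a third-order (r = 1) separating function.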
3. Learning rules
3.1 The Maximum Likelihood Estimator
For the unsupervised estimation of the separating filters {wij, i, j=1,…, N} as well as the non-linear function
parameters {aij, bij i=1,..,N, j=1,..,P} and {cij i=1,..,N, j=1,3,…,2r+1}, we apply the Maximum Likelihood
Estimation (MLE) criterion. From (2) the pdf of the observed data is given by:
p(\mathbf{x}) = |\det \mathbf{W}_0| \cdot \prod_{i=1}^{N} \left| \frac{\partial g_i(x_i)}{\partial x_i} \right| \cdot p(\mathbf{y}; \mathbf{W}, \boldsymbol{\theta}) \qquad (6)
and their log-likelihood by:
L = \log p(\mathbf{x}) = \log |\det \mathbf{W}_0| + \sum_{i=1}^{N} \left( \log \left| \frac{\partial g_i(x_i)}{\partial x_i} \right| + \log p(y_i; \mathbf{W}, \boldsymbol{\theta}) \right) \qquad (7)
where W0 is a NxN matrix with the leading weights of each FIR separating filter wij(0), W is the matrix with
the separating FIR filters wij in its elements and θ is the vector with the parameters of the non-linear
separating functions. The FIR filter coefficients of the network’s output layer and the unknown parameters
of the separating non-linearities can be estimated using the stochastic gradient of L with respect to the Wk,
aij, bij and cij as described below.
3.2 Estimation of the separating filters
The stochastic gradient of the log-likelihood L in (7) with respect to the output layer FIR filter
coefficients Wk for the kth filter lag is estimated by:
\Delta \mathbf{W}_k = \frac{\partial L}{\partial \mathbf{W}_k} =
\begin{cases}
\left[\mathbf{W}_0^T\right]^{-1} + \dfrac{p'(\mathbf{y};\mathbf{W},\boldsymbol{\theta})}{p(\mathbf{y};\mathbf{W},\boldsymbol{\theta})}\,\mathbf{u}^T(t), & k = 0 \\[2mm]
\dfrac{p'(\mathbf{y};\mathbf{W},\boldsymbol{\theta})}{p(\mathbf{y};\mathbf{W},\boldsymbol{\theta})}\,\mathbf{u}^T(t-k), & k = 1, 2, \ldots, K
\end{cases} \qquad (8)
where u(t) = g(x(t)) is the vector of the non-linearly transformed observations. In a more compact matrix
form, this can be written as:
\Delta \mathbf{W}_k = \frac{\partial L}{\partial \mathbf{W}_k} =
\begin{cases}
\left[\mathbf{W}_0^T\right]^{-1} - \boldsymbol{\Phi}_1(\mathbf{y})\,\mathbf{u}^T(t), & k = 0 \\
-\,\boldsymbol{\Phi}_1(\mathbf{y})\,\mathbf{u}^T(t-k), & k = 1, 2, \ldots, K
\end{cases} \qquad (9)
where the matrix Φ1(y) is given by:
\boldsymbol{\Phi}_1(\mathbf{y}) = \left[ -\frac{p'(y_1;\mathbf{W},\boldsymbol{\theta})}{p(y_1;\mathbf{W},\boldsymbol{\theta})}, \ldots, -\frac{p'(y_N;\mathbf{W},\boldsymbol{\theta})}{p(y_N;\mathbf{W},\boldsymbol{\theta})} \right]^T = \left[ h_1(y_1), \ldots, h_N(y_N) \right]^T \qquad (10)
The non-linear function hi(yi) is equal to the cumulative density function of the source signals si, and its
choice plays an important role in the efficacy of the signal separation network. For speech signals, it has
been shown that the pdf can be approximated by a variant of the gamma distribution or by the Laplacian
density. For the latter case, as Charkani and Deville have reported [23] and as various speech separation
applications have shown, the choice hi(yi) = sign(yi) is considered the best.
By closer inspection we can see that the adaptation rule in (9) presents the drawback that it requires
computationally expensive matrix inversion operations for the estimation of the leading weights for the
separating filters. To overcome this problem we use the natural gradient [24] instead of the stochastic
gradient approach for the estimation of W0, which results in:
\Delta \mathbf{W}_0 = \frac{\partial L}{\partial \mathbf{W}_0}\,\mathbf{W}_0^T \mathbf{W}_0 = \left( \left[\mathbf{W}_0^T\right]^{-1} - \boldsymbol{\Phi}_1(\mathbf{y})\,\mathbf{u}^T(t) \right) \mathbf{W}_0^T \mathbf{W}_0 = \left[ \mathbf{I} - \boldsymbol{\Phi}_1(\mathbf{y})\,\mathbf{u}^T(t)\,\mathbf{W}_0^T \right] \mathbf{W}_0 \qquad (11)
Using (9) and (11), the separating filter coefficients adaptation can be performed using the following rule:
\mathbf{W}_0^{(n+1)} = \mathbf{W}_0^{(n)} + \mu \left[ \mathbf{I} - \boldsymbol{\Phi}_1(\mathbf{y}(t))\,\mathbf{u}^T(t)\,\big(\mathbf{W}_0^{(n)}\big)^T \right] \mathbf{W}_0^{(n)}
\mathbf{W}_k^{(n+1)} = \mathbf{W}_k^{(n)} - \mu\, \boldsymbol{\Phi}_1(\mathbf{y}(t))\,\mathbf{u}^T(t-k), \quad k = 1, \ldots, K \qquad (12)
The above learning rules perform online and update the filter coefficients every time a new sample is
presented to the non-linear BSS network.
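As a rough illustration, one online step of the adaptation rule (12) can be sketched as follows, taking Φ1(y) = sign(y) as suggested above for Laplacian (speech-like) sources. The buffer layout and the learning rate are illustrative assumptions:

```python
import numpy as np

def update_filters(W, u_hist, mu=1e-3):
    """One online step of the adaptation rule (12) for the separating filters.

    W      : (K+1, N, N) stack of filter matrices; W[0] holds the leading weights
    u_hist : (K+1, N) buffer of past transformed inputs, u_hist[k] = g(x(t - k))
    Returns the updated stack and the current output y(t) of Eq. (3).
    """
    K1, N, _ = W.shape
    y = sum(W[k].T @ u_hist[k] for k in range(K1))  # Eq. (3)
    phi = np.sign(y)                                # Phi_1(y) for Laplacian sources
    W = W.copy()
    # natural-gradient update of the leading weights W_0 (no matrix inversion)
    W[0] += mu * (np.eye(N) - np.outer(phi, u_hist[0]) @ W[0].T) @ W[0]
    # stochastic-gradient update of the remaining lags
    for k in range(1, K1):
        W[k] -= mu * np.outer(phi, u_hist[k])
    return W, y
```

In a streaming implementation, `u_hist` would be refreshed with each incoming sample before the update is applied, exactly as the text describes for the online operation of the network.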
3.3 Estimation of the non-linear separating function parameters
3.3.1 Mixture of sigmoidal non-linearities
Let us denote with aj = [a1j,..,aNj]T the vector containing the slope parameters of the jth mixture term of the
network’s non-linear functions in (4). By taking the stochastic gradient of L with respect to the vector aj we
find after trivial calculations and applying the chain rule of partial derivatives that:
\Delta \mathbf{a}_j = \frac{\partial L}{\partial \mathbf{a}_j} = -\left[ \boldsymbol{\Phi}_j^{(a)}(\mathbf{x}) + \mathbf{D}_j^{(a)}(\mathbf{x})\,\mathbf{W}_0^T\,\boldsymbol{\Phi}_1(\mathbf{y}) \right] \qquad (13)
where:
\boldsymbol{\Phi}_j^{(a)}(\mathbf{x}) = -\left[ \frac{\partial^2 g_1(x_1)/\partial a_{1j}\,\partial x_1}{\partial g_1(x_1)/\partial x_1}, \;\ldots,\; \frac{\partial^2 g_N(x_N)/\partial a_{Nj}\,\partial x_N}{\partial g_N(x_N)/\partial x_N} \right]^T \qquad (14)
and
\mathbf{D}_j^{(a)}(\mathbf{x}) = \mathrm{diag}\left( \frac{\partial g_1(x_1)}{\partial a_{1j}}, \ldots, \frac{\partial g_N(x_N)}{\partial a_{Nj}} \right) = \mathrm{diag}\big( \tanh(x_1 + b_{1j}), \ldots, \tanh(x_N + b_{Nj}) \big) \qquad (15)
Using similar calculations, the adaptation equation for the biases bj of the non-linear function in (4) is:
\Delta \mathbf{b}_j = \frac{\partial L}{\partial \mathbf{b}_j} = -\left[ \boldsymbol{\Phi}_j^{(b)}(\mathbf{x}) + \mathbf{D}_j^{(b)}(\mathbf{x})\,\mathbf{W}_0^T\,\boldsymbol{\Phi}_1(\mathbf{y}) \right] \qquad (16)
where,
\boldsymbol{\Phi}_j^{(b)}(\mathbf{x}) = -\left[ \frac{\partial^2 g_1(x_1)/\partial b_{1j}\,\partial x_1}{\partial g_1(x_1)/\partial x_1}, \;\ldots,\; \frac{\partial^2 g_N(x_N)/\partial b_{Nj}\,\partial x_N}{\partial g_N(x_N)/\partial x_N} \right]^T \qquad (17)
and
( ) ( )
+
∂
∂+
∂
∂= )tanh(,,)tanh()( 111
1
)(NjNNj
Njjj
j
jb bxa
bbxa
bdiag �xD (18)
Using the above equations, the adaptation rules for the non-linear function parameters are given by:
\mathbf{a}_j^{(n+1)} = \mathbf{a}_j^{(n)} - \mu \left( \boldsymbol{\Phi}_j^{(a),(n)}(\mathbf{x}) + \mathbf{D}_j^{(a),(n)}(\mathbf{x})\,\mathbf{W}_0^T\,\boldsymbol{\Phi}_1(\mathbf{y}) \right) \qquad (19)
\mathbf{b}_j^{(n+1)} = \mathbf{b}_j^{(n)} - \mu \left( \boldsymbol{\Phi}_j^{(b),(n)}(\mathbf{x}) + \mathbf{D}_j^{(b),(n)}(\mathbf{x})\,\mathbf{W}_0^T\,\boldsymbol{\Phi}_1(\mathbf{y}) \right) \qquad (20)
3.3.2 Polynomial non-linear function
Using the same calculations as above, the polynomial coefficients of the second type of non-linear
function (5) can be estimated by calculating the stochastic gradient of the log-likelihood L with respect to
the coefficients cj = [c1j, …, cNj]T:
\Delta \mathbf{c}_j = \frac{\partial L}{\partial \mathbf{c}_j} = -\left[ \boldsymbol{\Phi}_j^{(c)}(\mathbf{x}) + \mathbf{D}_j^{(c)}(\mathbf{x})\,\mathbf{W}_0^T\,\boldsymbol{\Phi}_1(\mathbf{y}) \right] \qquad (21)
where,
\boldsymbol{\Phi}_j^{(c)}(\mathbf{x}) =
\begin{cases}
-\left[ \dfrac{j\,x_1^{\,j-1}}{c_{11} + 3c_{13}x_1^2 + \cdots + (2r{+}1)c_{1(2r+1)}x_1^{2r}}, \;\ldots,\; \dfrac{j\,x_N^{\,j-1}}{c_{N1} + 3c_{N3}x_N^2 + \cdots + (2r{+}1)c_{N(2r+1)}x_N^{2r}} \right]^T, & j = 1, 3, \ldots, 2r{+}1 \\[2mm]
[\,0, \ldots, 0\,]^T, & j = 0
\end{cases} \qquad (22)
and
\mathbf{D}_j^{(c)}(\mathbf{x}) =
\begin{cases}
\mathrm{diag}\big( x_1^{\,j}, \ldots, x_N^{\,j} \big), & j = 1, 3, \ldots, 2r{+}1 \\
\mathrm{diag}(1, \ldots, 1), & j = 0
\end{cases} \qquad (23)
Similarly, the adaptation rule for the non-linear function parameters is given by:
\mathbf{c}_j^{(n+1)} = \mathbf{c}_j^{(n)} - \mu \left( \boldsymbol{\Phi}_j^{(c),(n)}(\mathbf{x}) + \mathbf{D}_j^{(c),(n)}(\mathbf{x})\,\mathbf{W}_0^T\,\boldsymbol{\Phi}_1(\mathbf{y}) \right) \qquad (24)
4. Experiments
To test the proposed NLBSS method for the case of convolutive non-linear mixtures we conducted several
phoneme recognition experiments, including two simultaneous speakers in a simulated real room
environment under the existence of different types of severe post non-linear distortions. The room was a
typical office room with dimensions of 6.5m x 4.5m x 2.5m, shown in Fig. 3. Two omni-directional
microphones were placed 60cm apart from each other. The first speaker was placed at a distance of 1.5m
directly in front of the first microphone. The second speaker was located at a distance of 2m in front of the
second microphone. To simulate the reverberant environment, a subset of 120 pairs of sentences was
chosen randomly from the testing set of the TIMIT database, together with four real room impulse responses
hij (i,j=1,2) with 512 taps each (K=512). The signals of two speakers were mixed according to (1), with their
Relative Energy Level (REL) ranging from -20 dB to 20 dB with respect to the first speaker, in order to
examine in detail the dependence of the proposed NLBSS method’s performance on a wide range of
different mixing conditions. For the evaluation task, we examined the degradation that non-linear distortions
cause to the recognition accuracy of ASR systems in a non-linear multi-simultaneous speaker environment,
and measured the improvement that our method achieves when used in such environments.
Figure 3. The speakers / microphones topology used in the experiments (Mic 1 and Mic 2 placed 60cm apart; Speaker 1 at 1.5m in front of Mic 1, Speaker 2 at 2.0m in front of Mic 2)
The functions that were used (Fig. 4) for the post non-linear mixing step are:
(a) f1(x) = tanh(0.5x+0.3), f2(x) = tanh(0.5x+0.3)
(b) f1(x) = 0.1x + tanh(0.5x+1.2), f2(x) = x + 0.3 x3 (25)
(c) f1(x) = tanh(0.3x+2x3), f2(x) = tanh(0.1x+2x3)
and range from nearly linear (case a) to highly non-linear (case c), covering a broad spectrum of non-linear
mixing distortions.
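The three mixing cases of (25) translate directly into code; this dictionary is merely a convenience for reproducing the distortions:

```python
import numpy as np

# The three post non-linear mixing cases of Eq. (25).
MIXING_CASES = {
    "a": (lambda x: np.tanh(0.5 * x + 0.3),
          lambda x: np.tanh(0.5 * x + 0.3)),
    "b": (lambda x: 0.1 * x + np.tanh(0.5 * x + 1.2),
          lambda x: x + 0.3 * x ** 3),
    "c": (lambda x: np.tanh(0.3 * x + 2 * x ** 3),
          lambda x: np.tanh(0.1 * x + 2 * x ** 3)),
}
```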
Figure 4. The non-linear mixing functions used in the experiments (panels (a), (b) and (c) plot the function pairs of (25) over the input range [-2, 2])
4.1 The Phoneme Recognition System
The feature extraction module of the automatic speech recognizer that was used throughout our experiments
extracted the 12 Mel-cepstrum coefficients plus the energy parameter from the speech signals. The mean
value of the Mel-cepstrum coefficients was subtracted from each coefficient, and the first- and second-order
differences were formed to capture the dynamic evolution of the speech signals, resulting in a total of
39 parameters. The phoneme recognition decoder was based on Continuous Density Hidden Markov Models
(CDHMM). Each phoneme-unit HMM was a five-state left-to-right CDHMM with no state skip. The output
distribution probabilities were modeled by a Gaussian component with a diagonal covariance matrix.
The classification was achieved by reaching the maximum probability of the observation sequence for each
phoneme model. In the training process, the segmental K-Means algorithm was used to estimate each
CDHMM's parameters from multiple observations. The complete training set of the TIMIT database was
used for the training of the recognition system, while for the testing we used the 120 pairs of sentences taken
from the testing set. For the recognition experiments, a set of 39 different phoneme categories was
employed.
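A simplified sketch of the 13-to-39 feature expansion described above follows. The symmetric-difference delta used here is an assumption (the paper does not state its exact delta formula), and for brevity the mean subtraction is applied to all 13 parameters rather than to the cepstral coefficients only:

```python
import numpy as np

def add_deltas(static, width=2):
    """Append first- and second-order time differences to a static feature
    matrix: 13 per-frame parameters (12 mel-cepstra + energy) -> 39.

    static : (T, 13) array of per-frame features
    width  : half-width of the symmetric difference, in frames
    """
    static = static - static.mean(axis=0)  # cepstral mean subtraction
    # first-order delta: symmetric difference over `width` frames
    padded = np.pad(static, ((width, width), (0, 0)), mode="edge")
    delta = (padded[2 * width:] - padded[:-2 * width]) / (2 * width)
    # second-order delta: same operator applied to the first-order deltas
    padded_d = np.pad(delta, ((width, width), (0, 0)), mode="edge")
    delta2 = (padded_d[2 * width:] - padded_d[:-2 * width]) / (2 * width)
    return np.hstack([static, delta, delta2])
```

Applied to a (T, 13) matrix of static features, this returns a (T, 39) matrix, matching the 39-dimensional observation vectors fed to the CDHMM decoder.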
4.2 Experimental Results
Extensive phoneme recognition experiments using the aforementioned speech recognition system were
carried out by processing the speakers' clean recordings (CLEAR) from the TIMIT speech database, the non-
linear convolutive speech mixtures acquired by the two microphones (MIXED), the resulting separated
speech signals estimated by common linear BSS methods [6] (LBSS), the resulting separated speech signals
estimated by our proposed NLBSS algorithms (12), (19), (20) (NLBSS1), and the separated speech signals
resulting from (24) (NLBSS2).
REL (dB)    -20     -10       0      10      20
CLEAR      60.72   60.72   60.72   60.72   60.72
MIXED      25.12   28.41   32.15   38.17   45.12
LBSS       30.25   34.05   39.65   43.22   46.57
NLBSS(1)   49.56   52.56   56.65   58.72   59.68
NLBSS(2)   48.52   50.91   56.91   58.63   59.77

Figure 5. Percent phoneme recognition results (recognition score, %) for the first channel and the non-linear mixing function (25a)
REL (dB)    -20     -10       0      10      20
CLEAR      58.16   58.16   58.16   58.16   58.16
MIXED      42.15   38.54   31.56   26.44   22.12
LBSS       43.25   42.14   35.58   32.05   27.98
NLBSS(1)   56.12   55.05   52.14   49.85   46.35
NLBSS(2)   55.62   54.93   51.97   49.14   46.03

Figure 6. Percent phoneme recognition results (recognition score, %) for the second channel and the non-linear mixing function (25a)
The phoneme recognition results for the first and easiest non-linear mixing scenario are shown in
Fig. 5 and Fig. 6 for both output channels. It can be observed that the proposed NLBSS networks perform
well and manage not only to separate the competing speakers' speech signals at every Relative Energy Level,
but also to approximate well the inverses of the non-linear mixing functions, reaching an average
recognition score of 55% and 52% for the two output channels and an improvement of more than 20% in
comparison to the non-linearly distorted signals. On the other hand, LBSS performs poorly, reaching
phoneme recognition scores of 38% and 36%, an improvement of only 5%, thus justifying the integration of
non-linear separation functions in speech separation systems when robust speech recognition is needed in a
non-linear and competing-speaker environment. For the approximation of the inverses of the
non-linear mixing functions we successively tried mixtures of one to ten parametric sigmoid components.
However, increasing the number of terms of the non-linear function beyond six did not contribute
significantly to the phoneme recognition score of the ASR system. Similar
experiments were conducted for the polynomial separating function that showed that the best results were
achieved when using a seventh-order odd polynomial. Finally, it is noticeable that the best improvement of
the recognition accuracy was measured in the case of high interference (REL = -20 dB and 20 dB for the first
and the second speaker respectively), where our method achieved phoneme recognition rates of 49% and 46%
for the two output channels, an improvement of more than 24%.
REL (dB)    -20     -10       0      10      20
CLEAR      60.72   60.72   60.72   60.72   60.72
MIXED      23.25   25.47   28.02   31.25   36.84
LBSS       25.45   27.88   29.65   31.44   34.01
NLBSS(1)   45.12   48.56   50.54   53.86   58.98
NLBSS(2)   44.28   46.93   49.12   52.16   56.43

Figure 7. Percent phoneme recognition results (recognition score, %) for the first channel and the non-linear mixing function (25b)
REL (dB)    -20     -10       0      10      20
CLEAR      58.16   58.16   58.16   58.16   58.16
MIXED      32.54   30.56   29.12   24.56   20.56
LBSS       30.12   28.45   25.36   24.58   23.64
NLBSS(1)   54.25   53.12   49.86   45.85   43.52
NLBSS(2)   53.76   52.18   48.96   44.91   42.38

Figure 8. Percent phoneme recognition results (recognition score, %) for the second channel and the non-linear mixing function (25b)
For the second non-linear mixing case, the phoneme recognition results are shown in Figs. 7 and 8. Again,
our methods manage to separate and remove the non-linear distortion of the signals acquired by both
microphones, reaching average recognition scores of 51.4% and 49.7% for the two non-linear separating
functions on the first output channel, and 49% and 48.4% on the second output channel respectively, while
the phoneme recognition improvement was measured at 22.45% and 20.76% for the first and 21.8% and 21.07%
for the second output channel, compared to the recognition accuracy obtained with the microphone signals.
However, the most important improvement of the recognition accuracy is observed in the case of high
interference, where our methods achieved average phoneme recognition rates of 44% and 42% for the two
output channels, compared to the corresponding 23% and 20% for the mixed signals, a significant
improvement of more than 20%. It is also evident that the LBSS algorithm performs disappointingly,
improving the recognition accuracy by only about 1% for both output channels.
REL (dB)    -20     -10       0      10      20
CLEAR      60.72   60.72   60.72   60.72   60.72
MIXED      20.58   22.12   25.47   30.31   33.55
LBSS       27.48   28.05   29.68   30.30   30.78
NLBSS(1)   43.15   45.58   49.68   53.78   53.81
NLBSS(2)   42.18   43.54   46.12   49.91   51.63

Figure 9. Percent phoneme recognition results (recognition score, %) for the first channel and the non-linear mixing function (25c)
REL (dB)    -20     -10       0      10      20
CLEAR      58.16   58.16   58.16   58.16   58.16
MIXED      29.54   26.35   23.13   20.14   18.95
LBSS       28.54   26.59   26.41   22.57   20.69
NLBSS(1)   50.14   49.62   47.83   43.63   40.29
NLBSS(2)   49.16   47.33   45.68   41.22   38.91

Figure 10. Percent phoneme recognition results (recognition score, %) for the second channel and the non-linear mixing function (25c)
Finally, similar results can be observed in Figs. 9 and 10 for the last and most difficult case of
non-linear mixture of the speech signals. Both separating methods, NLBSS1 and NLBSS2, work well and
achieve recognition accuracies of 49% and 46% for the first output channel and 46% and 44% for the
second channel, with average improvements of 22% and 21% respectively, compared to the poor 26.4% and
23.6% of the highly distorted mixed signals. Both NLBSS networks outperform the linear BSS network by
more than 18% in phoneme recognition accuracy, using a mixture of 7 parametric sigmoidal terms and a
13th-order odd-polynomial function.
The significance of the proposed NLBSS methods in the case of non-linear distortions can also be
demonstrated by closer examination of our experimental results in the case of low interference. In such
conditions the speech signals are distorted mainly by the non-linear mixing functions, while the distortion
imposed by the convolutive mixture of the speech signals is, as expected, of lower significance. Under these
conditions the LBSS method did not succeed in improving the phoneme recognition accuracy of the ASR
system, and in some cases even degraded it, in contrast to the proposed NLBSS techniques introduced in this
paper, which successfully removed the various kinds of non-linear distortions, reaching a recognition rate
improvement of more than 20% for both output channels, especially when the difficult non-linearities were
used (Fig. 4b, 4c).
In Fig. 11 we illustrate our algorithms' performance: the joint distribution of the speech signals separated
using either non-linear function approximation matches the joint distribution of the clean signals,
indicating statistical independence and good separation quality, in contrast to the poor separation efficacy
of the linear BSS method under non-linear distortions.
Figure 11. Scatter plots of the joint pdf for the case of (a) clean signals, (b) non-linearly and convolutively
mixed signals using (25b), (c) separated speech signals using the LBSS method, (d) separated speech signals
using the mixture of sigmoid separating functions, and (e) separated speech signals using odd-polynomial
separating functions.
5. Conclusions
In this paper we presented a novel solution to the Blind Speech Separation problem of convolutive and post
non-linear mixtures of speech signals based on a neural network topology that uses a mixture of parametric
sigmoid and higher order odd polynomial non-linear functions. The proposed learning rules were derived
using the Maximum Likelihood Estimation criterion. Extensive speech recognition experiments in various
mixing situations with high non-linear distortions have proved our method’s efficacy and improved the
phoneme recognition accuracy of the separated speech signals significantly. Furthermore, the proposed
methods were found to outperform standard linear BSS techniques, justifying the integration of non-linear
functions in speech separation systems when robust speech recognition is needed in non-linear and
multiple-speaker environments.
References
[1] S. Amari, & A. Cichocki, Adaptive Blind Signal Processing-Neural Network Approaches, IEEE
Proceedings, 86(10), 1998, 2026-2048.
[2] J. Cardoso, Blind Signal Separation: Statistical Principles, IEEE Proceedings, 86(10), 1998, 2009-
2025.
[3] M. Girolami, Self-Organising Neural Networks. Independent Component Analysis and Blind Source
Separation (Springer-Verlag, 1999).
[4] S. Haykin, Unsupervised Adaptive Filtering, Vol. I Blind Source Separation (J. Wiley & Sons, 2000).
[5] T. W. Lee, Independent Component Analysis (Kluwer Academic Publishers, 1998).
[6] A. Koutras, E. Dermatas, & G. Kokkinakis, Blind separation of speakers in noisy reverberant
environments: A neural network approach, Neural Network World Journal, 10(4), 2000, 619-630.
[7] A. Koutras, E. Dermatas, & G. Kokkinakis, Blind speech separation of moving speakers in real
reverberant environments, Proc. IEEE Conf. On Acoustic Speech and Signal Processing, Istanbul,
Turkey, 2000, Vol. 2, 1133-1136.
[8] K. Yen, J. Huang, & Y. Zhao, Co-channel speech separation in the presence of correlated and
uncorrelated noises, Proc. EuroSpeech, Budapest, Hungary, 1999, Vol. 6, 2587-2590.
[9] K. Yen, & Y. Zhao, Co-channel speech separation for robust Automatic Speech recognition, Proc.
IEEE Conf. On Acoustic Speech and Signal Processing, Munich, Germany, 1997, Vol. 2, 859-862.
[10] T. W. Lee, A. Bell, & R. Lambert, Blind separation of delayed and convolved sources, in Advances in
Neural Information Processing Systems, 9 (MIT Press, Cambridge MA, 1997), 758-764.
[11] S. Choi, & A. Cichocki, Adaptive blind separation of speech signals: Cocktail Party Problem, Proc.
International Conference of Speech Processing, Seoul, Korea, 1997, 617-622.
[12] G. Burel, Blind separation of sources: A nonlinear neural algorithm. Neural Networks, 5(6), 1992,
937-947.
[13] P. Pajunen, A. Hyvarinen, & J. Karhunen, Nonlinear blind source separation by self-organizing maps.
Proc. 3rd International Conference on Neural Information Processing, Hong Kong, 1996, Vol 2,
1207-1210.
[14] G. Deco, & W. Brauer, Nonlinear higher-order statistical decorrelation by volume-conserving
architectures, Neural Networks, 8(4), 1995, 525-535.
[15] P. Pajunen, & J. Karhunen, A maximum likelihood approach to nonlinear blind separation, Proc.
International Conference on Artificial Neural Networks, Lausanne, Switzerland, 1997, 541-546.
[16] A. Koutras, E. Dermatas, & G. Kokkinakis, Neural Network approach to non-linear Independent
Component Analysis, to be published in Lecture notes in Computer Science, (Springer-Verlag, 2001).
[17] T. W. Lee, B. Koehler, & R. Orglmeister, Blind source separation of nonlinear mixing models, Proc.
IEEE International Workshop on Neural Network for Signal Processing, Florida, USA, 1997, 406-415.
[18] H. Yang, S. Amari, & A. Cichocki, Information theoretic approach to blind separation of sources in
non-linear mixture. Signal Processing, 64(3), 1998, 291-300.
[19] H. Valpola, Nonlinear Independent component Analysis using ensemble learning: Theory, Proc. 2nd
Int. Workshop on Independent Component Analysis and Blind Signal Separation, Espoo, Finland,
2000, 251-256.
[20] H. Valpola, X. Giannakopoulos, A. Honkela, & J. Karhunen, Nonlinear Independent component
Analysis using ensemble learning: Experiments and discussion, Proc. 2nd Int. Workshop on
Independent Component Analysis and Blind Signal Separation, Espoo, Finland, 2000, 351-356.
[21] J. Karhunen, Nonlinear Independent Component Analysis, in Roberts & Everson (Eds.), Independent
Component Analysis: Principles and Practice, (Cambridge University Press, 2001).
[22] A. Taleb, & C. Jutten, Nonlinear source separation: the post-nonlinear mixtures, Proc. European
Symposium of Artificial Neural Networks, Bruges, Belgium, 1997, 279-284.
[23] N. Charkani, & Y. Deville, Optimization of the asymptotic performance of time-domain convolutive
source separation algorithms, Proc. European Symposium of Artificial Neural Networks, Bruges,
Belgium, 1997, 273-278.
[24] S. Amari, Natural Gradient works efficiently in learning, Neural Computation, 10(2), 1998, 251-276.