Effects of Syllable Language Model on
Distinctive Phonetic Features (DPFs) based
Phoneme Recognition Performance
Mohammad Nurul Huda Department of CSE, United International University, Dhaka, Bangladesh
Email: mnh@cse.uiu.ac.bd
Manoj Banik Department of CSE, Ahsanullah University of Science and Technology, Dhaka, Bangladesh
Email: manojbanik@aust.edu
Ghulam Muhammad College of Computer & Information Science, King Saud University, Riyadh, Saudi Arabia
Email: ghulam@ksu.edu.sa
Mashud Kabir Department of EEE, United International University, Dhaka, Bangladesh
Email: mkabir@eee.uiu.ac.bd
Bernd J. Kroger Dept. of Phoniatrics and Communication Disorders, University Hospital Aachen, Aachen, Germany
Email: bkroeger@ukaachen.de
Abstract—This paper presents a distinctive phonetic
features (DPFs) based phoneme recognition method by
incorporating syllable language models (LMs). The method
comprises three stages. The first stage extracts three DPF
vectors of 15 dimensions each from local features (LFs) of
an input speech signal using three multilayer neural
networks (MLNs). The second stage incorporates an
Inhibition/Enhancement (In/En) network to obtain more
categorical DPF movement and decorrelates the DPF
vectors using the Gram-Schmidt orthogonalization
procedure. Then, the third stage embeds acoustic models
(AMs) and LMs of syllable-based subwords to output more
precise phoneme strings. From the experiments, it is
observed that the proposed method provides a higher
phoneme correct rate as well as a tremendous improvement
of phoneme accuracy. Moreover, it shows higher phoneme
recognition performance at fewer mixture components in
hidden Markov models (HMMs).
Index Terms—distinctive phonetic features, language
models, syllable based subword, inhibition/enhancement
network, hidden Markov models
I. INTRODUCTION
Conventional automatic speech recognition (ASR)
systems use stochastic pattern matching techniques in
which word candidates are matched against word
templates represented by hidden Markov models
(HMMs). Although these techniques perform adequately
Figure 1. An ASR system with OOV detection.
in certain limited applications, they always reject a new
vocabulary item, a so-called out-of-vocabulary (OOV) word.
Therefore, an accurate phonetic typewriter, or phoneme
recognizer, is expected to assist next-generation ASR
systems in resolving this OOV-word problem [1, 2] via a
short interaction (talk-back) by automatically adding the
word to the word lexicon from the phoneme string of an
input utterance (see Fig. 1).
Though various distinctive phonetic features (DPFs)
based methods, which can handle coarticulatory
phenomena [3, 4], were proposed to obtain a more
accurate phoneme recognizer [5, 6, 7, 8], they provide a
higher phoneme correct rate (PCR) with a poor phoneme
accuracy rate (PAR). The lower phoneme recognition
accuracy results from violations of phonotactic
constraints at different places in the output phoneme
string of an input utterance. Therefore, it is necessary
to incorporate syllable-based subword items,
JOURNAL OF MULTIMEDIA, VOL. 5, NO. 6, DECEMBER 2010, pp. 543-550
© 2010 ACADEMY PUBLISHER. doi:10.4304/jmm.5.6.543-550
TABLE I. JAPANESE BALANCED DPF-SET FOR CLASSIFYING ATR PHONEMES

[Rows list the 15 DPFs: vocalic, high, low, nil (high/low), anterior, back, nil (anterior/back), coronal, plosive, affricative, continuant, voiced, unvoiced, nasal, and semi-vowel; columns list the ATR phonemes a, i, u, e, o, N, w, y, j, my, ky, dy, by, gy, ny, hy, ry, py, p, t, k, ts, ch, b, d, g, z, m, n, s, sh, h, f, and r. The "+"/"-" entries of the original table are not recoverable here.]
which impose grammatical constraints at the recognition
stage, to obtain a phoneme recognizer with higher
accuracy.
In this paper, we propose a more accurate phoneme
recognition method by incorporating syllable language
models (LMs), where the method comprises three stages.
The first stage extracts three DPF vectors of 15
dimensions each from local features (LFs) of an input
speech signal using three multilayer neural networks
(MLNs). The second stage incorporates an
Inhibition/Enhancement (In/En) network to obtain more
categorical DPF movement and decorrelates the DPF
vectors using the Gram-Schmidt (GS) algorithm. Then,
the third stage embeds acoustic models (AMs) and LMs
of syllables at the recognition stage of an HMM-based
classifier. The objective of this study is to incorporate
rule-based LMs (syllable constraints) to reject phoneme
strings that violate syllable rules and, consequently,
to obtain a higher PAR.
The paper is organized as follows: Section II discusses
the DPFs and Section III describes a hybrid neural
network (Recurrent Neural Network (RNN) and MLN)
based phoneme recognition method. Section IV explains
the system configuration of the proposed phoneme
recognition method. Experimental database and setup are
provided in Section V, while experimental results are
analyzed in Section VI. Finally, in Section VII, some
conclusions are drawn.
II. DISTINCTIVE PHONETIC FEATURES
A phoneme can easily be identified by using its unique
DPF set [9, 10]. The Japanese balanced DPF set [11, 12]
for classifying phonemes has 15 elements: vocalic, high,
low, the intermediate expression between high and low
<nil>, anterior, back, the intermediate expression
between anterior and back <nil>, coronal, plosive,
affricative, continuant, voiced, unvoiced, nasal, and
semi-vowel. Table I shows this balanced DPF set. Here,
present and absent elements of the DPFs are indicated by
“+” and “-” signs, respectively.
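As an illustration, a balanced DPF set entry can be represented as a 15-dimensional vector of present/absent values. The feature names below follow the list above; the example phoneme's feature values are assumptions for illustration, not entries copied from Table I.

```python
# The 15 DPFs of the Japanese balanced set (names from the text above).
DPF_NAMES = [
    "vocalic", "high", "low", "nil(high/low)", "anterior", "back",
    "nil(anterior/back)", "coronal", "plosive", "affricative",
    "continuant", "voiced", "unvoiced", "nasal", "semi-vowel",
]

def encode_dpf(present):
    """Map a set of 'present' ("+") feature names to a 15-dim +1/-1 vector."""
    return [1.0 if name in present else -1.0 for name in DPF_NAMES]

# Hypothetical example: /m/ marked as a voiced nasal (assumed values).
vec_m = encode_dpf({"voiced", "nasal"})
assert len(vec_m) == 15
```

A real extractor would look such target vectors up per phoneme from Table I when preparing MLN training labels.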
III. HYBRID NEURAL NETWORK-BASED PHONEME
RECOGNITION METHOD
At the first stage, this system [13] extracts a 45-
dimensional DPF feature vector using a hybrid neural
network (HNN), which consists of an RNN and an MLN,
and then these features are decorrelated using the GS
algorithm [11]. Here, the RNN is used for handling a
longer context window [14]. Fig. 2 shows a block
diagram of the system using the HNN. The input acoustic
vector is formed by taking the preceding (t-3)-th and
succeeding (t+3)-th frames together with the current t-th
frame. Each input frame consists of 25 LF values. The
RNN generates 45 DPF values, of which the first 15 are
for the preceding frame, the middle 15 for the current
frame, and the rest for the succeeding frame. The MLN
generates 45 DPF values and reduces phoneme
misclassification at phoneme boundaries by incorporating
a seven-frame context window (from t-3 to t+3) as input.
At the second stage, the system incorporates HMM-based
AMs, which are obtained using the orthogonalized feature
vectors of the first stage, and LMs of syllable-based
subwords, which are explained in Section IV.C, to output
more precise phoneme strings.
IV. PROPOSED PHONEME RECOGNITION METHOD
Fig. 3 shows the proposed phoneme recognition
method that comprises three stages. The first stage
extracts a 45-dimensional DPF vector from the LFs of an
input speech signal using three MLNs. The second stage
incorporates In/En functionalities to obtain modified DPF
patterns and decorrelates the DPF vector using the GS
orthogonalization [11] before connecting it with an
HMM-based classifier.
A. DPF Extractor
In this method, three MLNs instead of two MLNs [5]
are used to construct the DPF extractor. The first MLN,
Figure 2. Hybrid Neural Network based Phoneme Recognition Method.
Figure 3. Proposed Phoneme Recognition Method.
MLNLF-DPF, outputs DPFs [11, 12] for the input acoustic
features, LFs [15]; the second MLN, MLNcntxt, reduces
misclassification at phoneme boundaries by taking a
seven-frame context (from t-3 to t+3) as input; and the
third MLN, MLNDyn, restricts the DPF dynamics by
incorporating dynamic parameters (ΔDPF and ΔΔDPF) into
its input. Here, the MLNLF-DPF, which is trained using
the standard back-propagation learning algorithm, has two
hidden layers of 256 and 96 units, respectively, and
takes three input vectors (t-3, t, t+3) of LFs of 25
dimensions each. The 45-dimensional context-dependent
DPF vector provided by the MLNLF-DPF at time t is fed
into the MLNcntxt, which consists of five layers,
including three hidden layers of 90, 180, and 90 units,
respectively, and generates a 45-dimensional DPF vector
with fewer errors at phoneme boundaries by incorporating
a context window of a particular size. This 45-
dimensional DPF vector and its corresponding ΔDPF and
ΔΔDPF vectors, calculated by three-point linear
regression (LR), are fed into the subsequent MLNDyn,
which consists of four layers, including two hidden
layers of 300 and 100 units, respectively, and outputs a
45-dimensional DPF vector with reduced fluctuations and
dynamics. Both the MLNcntxt and MLNDyn are also trained
using the standard back-propagation algorithm. For the
MLNLF-DPF and MLNcntxt, the context window <t-3, t, t+3>
is selected instead of <t-3, t-2, t-1, t, t+1, t+2, t+3>
because both yield approximately the same performance
while <t-3, t, t+3> requires less computation. The number
of units in each hidden layer of each MLN is selected by
a tuning process.
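As a shape-level sketch of the three-MLN cascade described above, the following uses the layer sizes given in the text; the weights are random stand-ins rather than trained parameters, and the context and delta inputs are simplified for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mln(sizes, rng):
    """Random (untrained) weights for an MLN with the given layer sizes."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(net, x):
    for W, b in net:
        x = sigmoid(x @ W + b)
    return x

rng = np.random.default_rng(0)
# Layer sizes from Section IV.A; weights here are placeholders.
mln_lf_dpf = mln([75, 256, 96, 45], rng)       # 3 x 25-dim LF frames -> 45 DPFs
mln_cntxt = mln([135, 90, 180, 90, 45], rng)   # DPF(t-3), DPF(t), DPF(t+3)
mln_dyn = mln([135, 300, 100, 45], rng)        # DPF + dDPF + ddDPF

lf = rng.standard_normal(75)                   # one (t-3, t, t+3) LF input
dpf = forward(mln_lf_dpf, lf)
# The real system feeds DPF vectors at t-3, t, t+3; the same frame is
# repeated here purely to show the 135-dim input shape.
dpf_ctx = forward(mln_cntxt, np.concatenate([dpf, dpf, dpf]))
delta = np.zeros(45)                           # stand-in for 3-point LR dDPF
ddelta = np.zeros(45)                          # stand-in for ddDPF
dpf_out = forward(mln_dyn, np.concatenate([dpf_ctx, delta, ddelta]))
assert dpf_out.shape == (45,)
```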
B. Inhibition/Enhancement Network
The DPF extractor provides 45 DPF patterns (15 for the
preceding context, 15 for the current context, and 15 for
the following context) for each input vector along the
time axis. These patterns may not match the pattern of
the input phoneme string for some phonemes, and
consequently some phonemes are incorrectly recognized by
the HMM classifier. This phoneme misclassification
sometimes occurs when the values of DPF peaks and DPF
dips are close to each other. Therefore, a mechanism
called the In/En network [5, 6, 7] is needed to obtain clearly
separable DPF peaks and dips. An algorithm for this
network is given below:
Step 1: For each element of the DPF vectors, find the
acceleration (ΔΔ) parameter by using three-point LR.
Step 2: Check whether ΔΔ is positive (concave pattern),
negative (convex pattern), or zero (steady state).
Step 3: Calculate f(ΔΔ):

if the pattern is convex,
f(ΔΔ) = C1 / (1 + exp(β·ΔΔ))                (1)

if the pattern is concave,
f(ΔΔ) = C2 + (1 - C2) / (1 + exp(β·ΔΔ))     (2)

if steady state,
f(ΔΔ) = 1.0                                  (3)

Step 4: Find the modified DPF patterns by multiplying the
DPF patterns by f(ΔΔ).
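The steps above can be sketched as follows; the coefficient values C1 = 4.0, C2 = 0.25, and β = 80 come from Section V.B, while the exact sigmoid form of f(ΔΔ) is an assumption of this sketch, chosen to be consistent with the stated coefficient range (enhancement toward C1 for peaks, inhibition toward C2 for dips).

```python
import numpy as np

C1, C2, BETA = 4.0, 0.25, 80.0  # enhancement, inhibitory, steepness (Sec. V.B)

def accel(y):
    """Three-point acceleration (second difference) of a DPF trajectory."""
    a = np.zeros_like(y)
    a[1:-1] = y[2:] - 2.0 * y[1:-1] + y[:-2]
    return a

def in_en_gain(dd):
    """Gain f(dd): enhance convex (peak) regions, inhibit concave (dip)
    regions, leave steady states unchanged (assumed sigmoid form)."""
    if dd < 0:                       # convex pattern -> enhancement
        return C1 / (1.0 + np.exp(BETA * dd))
    if dd > 0:                       # concave pattern -> inhibition
        return C2 + (1.0 - C2) / (1.0 + np.exp(BETA * dd))
    return 1.0                       # steady state

def in_en(y):
    """Modified DPF pattern: element-wise product of pattern and gain."""
    return np.array([g * v for g, v in zip(map(in_en_gain, accel(y)), y)])
```

With the steep β = 80, the gain behaves almost like a step function: clear peaks are scaled toward 4.0 and clear dips toward 0.25, which separates DPF peaks and dips as described.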
C. Syllable-Based Language Model
The phoneme classification inaccuracy at the acoustic
phonetic level is a major weakness in most speech
recognition systems. Moreover, this inaccuracy often
violates phonotactic constraints at the acoustic phonetic
level. Usually, three types of phoneme classification
errors appear: (i) phoneme insertions, (ii) phoneme
deletions, and (iii) phoneme substitutions. A better
performance is expected if a language model is adopted
in a recognition system for post-processing phoneme
estimates and for making corrections with regard to the
grammatical constraints. Words, syllables, phonological
rules, etc. of a language provide constraints on phoneme
sequence generation, but a language has a large number of
words and knowledge-based rules. However, since the
number of syllables in the Japanese language is limited,
they can be used to enhance the phoneme recognition
performance. For example, a
phoneme recognizer generates the phoneme string /k/ /a/
/w/ /k/ /e/ /y/ /m/ /a/ /s/ /u/ for an input speech
/kakemasu/, which means "write" in English. The actual
phoneme string for the utterance is /k/ /a/ /k/ /e/ /m/
/a/ /s/ /u/; therefore, the phoneme recognizer produces
two insertions. After adopting LMs that support only CV,
CVV, CV1V2, CVN, and CVQ syllable structures, we can
correct the phoneme string /k/ /a/ /w/ /k/ /e/ /y/ /m/
/a/ /s/ /u/ by rejecting the incorrectly generated
sequences /k//a//w/ and /k//e//y/.
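The syllable constraint can be illustrated with a toy parser that accepts only the CV, CVV/CV1V2, CVN, and CVQ structures named above. The phoneme inventory and the greedy parse are simplifications for illustration, not the actual Julius trigram LM.

```python
# Sketch of the syllable constraint: a phoneme string is acceptable only
# if it parses into C V (V) (N|Q) syllables (CV, CVV, CV1V2, CVN, CVQ).
VOWELS = {"a", "i", "u", "e", "o"}

def is_syllabifiable(phones):
    """True if the phoneme list parses greedily into the allowed syllables."""
    i, n = 0, len(phones)
    while i < n:
        if phones[i] in VOWELS or phones[i] in {"N", "Q"}:
            return False          # each syllable must start with a consonant
        i += 1                    # consonant onset
        if i >= n or phones[i] not in VOWELS:
            return False          # onset must be followed by a vowel
        i += 1
        if i < n and phones[i] in VOWELS:
            i += 1                # CVV / CV1V2
        if i < n and phones[i] in {"N", "Q"}:
            i += 1                # CVN / CVQ
    return True

# The erroneous /k a w k e y m a s u/ is rejected, while the corrected
# /k a k e m a s u/ parses as ka-ke-ma-su.
print(is_syllabifiable("k a w k e y m a s u".split()))  # False
print(is_syllabifiable("k a k e m a s u".split()))      # True
```

In the actual system this constraint is applied through syllable-based subword AMs and LMs during decoding rather than as a post-hoc filter.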
V. EXPERIMENTS
A. Speech Databases
If the same data set were used for all the classifiers in
the experiments, the test data could be biased toward the
training data at each stage. Because of this, an open
test for each stage classifier (neural network-based and
HMM-based) is done by using different training and test
data sets. The following five clean data sets are used in
our experiments.
D1. Training data set for MLNLF-DPF
A subset of the Acoustical Society of Japan (ASJ)
Continuous Speech Database [16] comprising 4503
sentences uttered by 30 different male speakers (16 kHz,
16 bit) is used.
D2. Training data set for MLNcntxt
This data set contains 5000 sentences that are taken
from Japanese Newspaper Article Sentences (JNAS) [17]
Continuous Speech Database; the sentences have been
uttered by 33 different male speakers (16 kHz, 16 bit).
D3. Training data set for MLNDyn
This data set contains 5000 JNAS [17] sentences
uttered by 33 different male speakers (16 kHz, 16 bit).
Speakers of this data set are different from those of the
D2 data set.
D4. Training data set for HMM classifier
This data set takes 5000 JNAS [17] sentences uttered
by 33 different male speakers (16 kHz, 16 bit). Speakers
of this data set are different from those of the D2 and D3 data sets.
D5. Test data set
This test data set comprises 2379 JNAS [17] sentences
uttered by 16 different male speakers (16 kHz, 16 bit).
B. Experimental Setup
The frame length and frame rate (frame shift between
two consecutive frames) are set to 25 ms and 10 ms,
respectively, to obtain acoustic features from an input
speech signal. The LFs form a 25-dimensional vector
consisting of 12 delta coefficients along the time axis,
12 delta coefficients along the frequency axis, and the
delta coefficient of the log power of the raw speech signal [15].
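A shape-compatible sketch of such 25-dimensional local features is given below; the actual derivation in [15] uses orthogonal acoustic-feature planes and differs in detail, so this only illustrates the 12 + 12 + 1 layout.

```python
import numpy as np

def local_features(logspec, logpower):
    """Simplified 25-dim local features per frame: central differences of
    a 12-band log spectrum along time and along frequency, plus a delta
    of log power. A sketch of the layout only, not the method of [15]."""
    T, B = logspec.shape
    assert B == 12
    dt = np.zeros_like(logspec)
    dt[1:-1] = (logspec[2:] - logspec[:-2]) / 2.0            # delta along time
    df = np.zeros_like(logspec)
    df[:, 1:-1] = (logspec[:, 2:] - logspec[:, :-2]) / 2.0   # delta along freq
    dp = np.zeros(T)
    dp[1:-1] = (logpower[2:] - logpower[:-2]) / 2.0          # delta log power
    return np.concatenate([dt, df, dp[:, None]], axis=1)     # shape (T, 25)

feats = local_features(
    np.random.default_rng(0).standard_normal((100, 12)), np.zeros(100))
assert feats.shape == (100, 25)
```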
Since our goal is to design a more accurate phoneme
recognizer, the PCR and PAR for the D5 data set are
evaluated using an HMM-based classifier. The D4 data set
is used in the proposed method to design 38 Japanese
monophone HMMs with five states, three loops, and left-
to-right topology. Input features for the classifier are
DPFs. On the other hand, in the HNN-based phoneme
recognition method, the D1 data set is used to design the
HMMs, and orthogonalized DPFs are used as input features.
In the HMMs, the output probabilities are represented in
the form of Gaussian mixtures with diagonal covariance
matrices. The number of components per mixture is set to
1, 2, 4, 8, and 16.
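As an illustration of this output-probability form, the following computes a diagonal-covariance Gaussian-mixture log-likelihood for one feature vector; the mixture parameters are random placeholders, not trained HMM states.

```python
import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """log p(x) under sum_k w_k N(x; mu_k, diag(var_k)),
    computed stably via log-sum-exp."""
    d = x.shape[0]
    logps = (np.log(weights)
             - 0.5 * (d * np.log(2.0 * np.pi)
                      + np.sum(np.log(variances), axis=1)
                      + np.sum((x - means) ** 2 / variances, axis=1)))
    m = logps.max()
    return m + np.log(np.sum(np.exp(logps - m)))

rng = np.random.default_rng(1)
K, D = 16, 45                        # e.g. 16 components, 45-dim DPF features
weights = np.full(K, 1.0 / K)        # placeholder mixture weights
means = rng.standard_normal((K, D))
variances = np.ones((K, D))
ll = diag_gmm_loglik(rng.standard_normal(D), weights, means, variances)
```

Diagonal covariances keep the per-component cost linear in the feature dimension, which is why they pair well with decorrelated (orthogonalized) DPF inputs.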
In our experiments with the three-MLN system and the HNN,
the non-linear activation function for the hidden and
output layers is a sigmoid ranging from 0 to 1, 1/(1+exp(-x)).
For the In/En network, the value of the enhancement
coefficient, C1, is set to 4.0 after evaluating the
proposed method, DPF (3-MLN+In/En+GS, dim:45), for
different values of C1, such as 2, 4, and 6, and the
value of the steepness coefficient, β, is set to 80. The
value of the inhibitory coefficient, C2, is fixed at 0.25
after observing the DPF data patterns, to keep the values
of f(ΔΔ) between 0.25 and 1.0.
For the experiments incorporating LMs, the Julius
3.5.3 version [18] is used to embed acoustic models and
trigram subword models (short and long syllables) as
LMs. Julius is a two-pass system: the first pass applies
a bigram and the second pass a trigram to set more
constraints on the output phoneme sequence. The syllable
structures are CV, CVV, CV1V2, CVN, and CVQ. The LM
weight/insertion penalties are 5.0/-1.0 and 6.0/0.0 for
the first and second passes, respectively.
To investigate the effect of syllable-based subword
LM, we have designed the following phoneme
recognition tests, where the DPF extractor inputs a 45-
dimensional orthogonalized DPF vector to the HMM.
(a) 3-MLN+In/En+GS [7]
(b) 3-MLN+In/En+GS+LM [Proposed]
For further investigation, we have designed the
following experiments for testing the phoneme
recognition performance with and without incorporating
LM.
(c) Hybrid+GS
(d) Hybrid+GS+LM [13]
To compare the method of Huda et al. [5] with the
proposed one, we have designed the following two
experiments for evaluating PCR and PAR.
(e) 2-MLN+In/En+GS [5]
(b) 3-MLN+In/En+GS+LM [Proposed]
To compare the context window size <t-3, t, t+3> of
method (e) with the window <t-3, t-2, t-1, t, t+1, t+2,
t+3> of method (f), we have evaluated PCR and PAR by
designing the following two experiments.
(f) Cont.2-MLN+In/En+GS
(e) 2-MLN+In/En+GS [5]
VI. EXPERIMENTAL RESULTS AND DISCUSSION
Fig. 4 shows the PCR of the proposed method before
and after adding syllable-based subword LM. For all the
mixture components investigated, the method with LM
provides higher PCR. For example, at mixture component
16, the method without LM shows a PCR of 84.50%, while
the corresponding value for the method with LM is
87.40%.
On the other hand, the method with LM also outperformed
the method without LM in phoneme accuracy, as shown in
Fig. 5. With mixture components 1 and 2, the syllable
knowledge corrects the largest number of phonemes and
yields the greatest improvement. Moreover, the addition
of syllable knowledge gives the highest accuracy (80.54%)
at mixture component 16. Therefore, language constraints
resist phoneme misclassification and, consequently,
provide higher phoneme recognition performance in a clean
acoustic environment.
Figs. 6 and 7 depict the PCR and PAR, respectively, for
the HNN-based method with and without the syllable-based
LM. Incorporating the LM improves both the PCR and the
PAR for all the mixture components investigated. It is
observed from Fig. 6 that the language model improves the
PCR most (by 6.92%) at mixture component eight, where the
PAR is improved by 6.98%, as shown in Fig. 7.
Figure 4. Comparison of phoneme correct rate for the proposed
method before and after adding LM.
Figure 5. Comparison of phoneme accuracy for the proposed method
before and after adding LM.
Figure 6. Comparison of phoneme correct rate for hybrid neural
network based method before and after adding LM.
These results indicate that syllable-based subwords
increase both PCR and PAR, with a particularly large
improvement in PAR. PAR improves because syllables set up
rules on phoneme generation; these rules suppress
phonemes that violate the constraints set by the
syllables.
Figure 7. Comparison of phoneme accuracy for hybrid neural network
based method before and after adding LM.
Figure 8. Segmentation of utterance, /jiNkoese/ using hybrid neural
network.
Figure 9. Segmentation of utterance, /jiNkoese/ using 3-MLN.
Segmentation for a clean /jiNkoese/ utterance is shown
in Figs. 8 and 9 for a balanced-DPF set [11] using an
HNN and 3-MLN, respectively. In both the figures,
"Solid thin line" and "Solid bold line" represent "ideal
segmentation" and "output segmentation", respectively;
"nasal", "nil(high/low)", and "high" of phoneme /N/, and
"unvoiced", "coronal", and "anterior" of phoneme /s/ are
denoted by (1), (2), (3), (4), (5), and (6),
respectively. From these marked places, we can see that
the 3-MLN exhibits more precise segmentation (less
deviation from the ideal boundary) than the HNN, reduces
some fluctuations, and provides smoother DPF curves;
hence, it misclassifies fewer phonemes.
Consequently, the 3-MLN based proposed method
provides higher phoneme recognition performance than
the HNN based method.
Tables II and III show the PCR and PAR, respectively, for
the method of [5] and the one proposed in this study.
From both tables, it is observed that incorporating one
additional MLN and the LMs increases PCR and PAR for all
mixture components investigated, with a far greater
improvement in PAR than in PCR. For example, at mixture
component 16, an improvement of approximately 4% in PCR
is observed (see Table II), while the corresponding value
for PAR is approximately 14% (see Table III).
Tables IV and V show the PCR and PAR, respectively, for
Cont.2-MLN+In/En+GS with the continuous window <t-3, t-2,
t-1, t, t+1, t+2, t+3> and [5] with the discrete window
<t-3, t, t+3>. Both tables show that the two methods
provide almost the same performance at all higher mixture
components investigated. Moreover, the higher input
dimensionality of the continuous seven-frame method
requires more computation when training the neural
networks and consequently slows down both the training
and test phases. Therefore, the discrete window <t-3, t,
t+3> is selected for the experiments.
Table VI shows the effect of different training data
sets, for the same test data (D5), on phoneme recognition
performance. From the table, it is observed that the
method with a different training data set for each
classifier (first data row of the table) and the method
with the same training data set for all classifiers
(second data row) show almost the same performance for
all the mixture components investigated.
TABLE II. PHONEME CORRECT RATE OF METHOD [5] AND PROPOSED

Phoneme Correct Rate (%)
Methods                  1 mix   2 mix   4 mix   8 mix   16 mix
2-MLN+In/En+GS [5]       82.18   82.80   83.26   83.30   83.33
Proposed                 84.10   86.55   86.89   87.16   87.40
TABLE III. PHONEME ACCURACY RATE OF METHOD [5] AND PROPOSED

Phoneme Accuracy (%)
Methods                  1 mix   2 mix   4 mix   8 mix   16 mix
2-MLN+In/En+GS [5]       56.53   59.00   62.54   63.79   66.47
Proposed                 76.55   78.98   79.69   79.88   80.54
TABLE IV. PHONEME CORRECT RATE FOR CONT.2-MLN+IN/EN+GS WITH CONTINUOUS <T-3, T-2, T-1, T, T+1, T+2, T+3> AND [5] WITH DISCRETE <T-3, T, T+3>

Phoneme Correct Rate (%)
Methods                  1 mix   2 mix   4 mix   8 mix   16 mix
2-MLN+In/En+GS [5]       82.18   82.80   83.26   83.30   83.33
Cont.2-MLN+In/En+GS      82.84   83.11   83.11   83.27   83.35
TABLE V. PHONEME ACCURACY RATE FOR CONT.2-MLN+IN/EN+GS WITH CONTINUOUS <T-3, T-2, T-1, T, T+1, T+2, T+3> AND [5] WITH DISCRETE <T-3, T, T+3>

Phoneme Accuracy (%)
Methods                  1 mix   2 mix   4 mix   8 mix   16 mix
2-MLN+In/En+GS [5]       56.53   59.00   62.54   63.79   66.47
Cont.2-MLN+In/En+GS      60.47   60.47   62.38   64.04   65.43
TABLE VI. EFFECTS OF TRAINING DATA SETS ON PHONEME RECOGNITION PERFORMANCE

Phoneme Correct Rate (%)
Training and Test Data Set for each Classifier   1 mix   2 mix   4 mix   8 mix   16 mix
Train: MLN D1, MLN D2, HMM D4; Test: D5          81.59   81.36   82.28   82.60   83.14
Train: MLN D1, MLN D1, HMM D1; Test: D5          81.76   82.26   82.46   82.26   82.87
VII. CONCLUSION
In this paper, a DPF-based phoneme recognition
method is presented by incorporating Japanese syllable-
based LMs. From the experiments using the JNAS database,
the following conclusions are drawn.
i) The proposed method with LMs increases
phoneme correct rate by approximately 2% to
3% in comparison with the method [7] for all the
mixture components investigated (see Fig. 4),
while this improvement value is 2% to 4%
compared to the method [5] (see Table II).
ii) It improves phoneme recognition accuracy by
9% to 18% in comparison with the method [7]
(see Fig. 5). On the other hand, 14% to 20%
improvement is observed by the proposed
method compared to the method [5] (see Table
III).
iii) It requires fewer mixture components to obtain a
higher recognition performance.
In future research, we will evaluate Bengali phoneme
recognition accuracy using the method proposed in this
paper.
REFERENCES
[1] O. Scharenborg and S. Seneff, "A two-pass strategy
for handling OOVs in a large vocabulary recognition
task," Proc. Interspeech, 2005.
[2] I. Bazzi and J. R. Glass, "Modeling OOV words for ASR,"
Proc. ICSLP, Beijing, China, pp. 401-404, 2000.
[3] K. Kirchhoff, et al., "Combining acoustic and articulatory
feature information for robust speech recognition," Speech
Commun., vol. 37, pp. 303-319, 2002.
[4] K. Kirchhoff, "Robust Speech Recognition Using
Articulatory Information," Ph.D. thesis, University of
Bielefeld, Germany, July 1999.
[5] M. N. Huda, H. Kawashima, and T. Nitta, “Distinctive
Phonetic Feature (DPF) extraction based on MLNs and
Inhibition/Enhancement Network,” IEICE Trans. Inf. &
Syst., Vol.E92-D, No.4, April 2009.
[6] M. N. Huda, H. Kawashima, K. Katsurada, and T. Nitta,
“Distinctive phonetic feature (DPF) based phoneme
recognition using MLNs and Inhibition/Enhancement
network for noise robust ASR,” Proc NCSP’09, Honolulu,
Hawaii, USA, March 2009.
[7] M. N. Huda, H. Kawashima, and T. Nitta, “Distinctive
phonetic feature extraction based on 3-stage MLNs and
Inhibition/Enhancement network,” Proc Technical Report
of IEICE, SP08, December 2008.
[8] Huda Mohammad Nurul, Muhammad Ghulam, Junsei
Horikawa and Tsuneo Nitta, "Distinctive Phonetic Feature
(DPF) Based Phone Segmentation using Hybrid Neural
Networks," Proc. INTERSPEECH’07, pp. 94-97, Aug.
2007.
[9] S. King and P. Taylor, "Detection of Phonological Features
in Continuous Speech using Neural Networks," Computer
Speech and Language 14(4), pp. 333-345, 2000.
[10] E. Eide, "Distinctive Features for Use in an Automatic
Speech Recognition System," Proc. Eurospeech 2001,
vol.III, pp.1613-1616, 2001.
[11] T. Fukuda and T. Nitta, "Orthogonalized Distinctive
Phonetic Feature Extraction for Noise-Robust Automatic
Speech Recognition," The Institute of Electronics,
Information and Communication Engineers (IEICE)
Transactions on Information and Systems, Vol. E87-D,
No.5, pp. 1110-1118, 2004.
[12] T. Fukuda, W. Yamamoto, and T. Nitta, "Distinctive
Phonetic feature Extraction for robust speech recognition,"
Proc. ICASSP'03, vol.II, pp.25-28, 2003.
[13] M. N. Huda, et al., "DPF-based phoneme recognition for
OOV words detection," Acoustical Society of Japan (ASJ)
meeting, March 2008.
[14] T. Robinson, “An application of Recurrent Nets to Phone
Probability Estimation,” IEEE Trans. Neural Networks,
Volume 5, Number 3, 1994.
[15] T. Nitta, "Feature extraction for speech recognition based
on orthogonal acoustic-feature planes and LDA," Proc.
ICASSP’99, pp.421-424, 1999.
[16] T. Kobayashi, et al., "ASJ Continuous Speech Corpus for
Research," Journal of the Acoustical Society of Japan,
vol. 48, no. 12, pp. 888-893, 1992.
[17] JNAS: Japanese Newspaper Article Sentences.
http://www.milab.is.tsukuba.ac.jp/jnas/instruct.htm.
[18] Julius 3.5.3, released in December 2006.
http://julius.sourceforge.jp/en_index.php.
Mohammad Nurul Huda was born in
Lakshmipur, Bangladesh in 1973. He
received his B. Sc. and M. Sc. in
Computer Science and Engineering
degrees from Bangladesh University of
Engineering & Technology (BUET),
Dhaka in 1997 and 2004, respectively.
He also completed his Ph. D from the
Department of Electronics and Information Engineering,
Toyohashi University of Technology, Aichi, Japan. Now, he is
working as an Associate Professor in United International
University, Dhaka, Bangladesh. His research fields include
Phonetics, Automatic Speech Recognition, Neural Networks,
Artificial Intelligence and Algorithms. He is a member of
International Speech Communication Association (ISCA).
Manoj Banik was born in Habiganj,
Bangladesh in 1972. He completed his
B.Sc. in Computer Science and
Engineering Degree from BUET, Dhaka,
Bangladesh. Now, he is working as an
Assistant Professor in the Department of
Computer Science and Engineering of
Ahsanullah University of Science and
Technology, Dhaka, Bangladesh. His
research interests include Speech Recognition, Phonetics and
Algorithms. He is a member of Bangladesh Engineers Institute
(IEB).
Ghulam Muhammad was born in
Rajshahi, Bangladesh in 1973. He
received his B. Sc. in Computer Science
and Engineering degree from Bangladesh
University of Engineering & Technology
(BUET), Dhaka in 1997. He also
completed his M.E and Ph. D from the
Department of Electronics and
Information Engineering, Toyohashi
University of Technology, Aichi, Japan in 2003 and 2006,
respectively. Now, he is working as an Assistant Professor in
King Saud University, Riyadh, Saudi Arabia. His research
interest includes Automatic Speech Recognition and human-
computer interface. He is a member of IEEE.
Mashud Kabir was born in
Narayanganj, Bangladesh in 1976. He
has completed his Bachelor of Science
(BSc) in Electrical & Electronic
Engineering from Bangladesh University
of Engineering & Technology (BUET) in
2000. He earned his Master of Science
(MSc) in Communication Engineering
from University of Stuttgart, Germany in
2003. He worked at the Mercedes-Benz Technology Center,
Germany, from 2003 to 2005 as a PhD student. He has more
than five years of work experience in research and
development projects for BMW, Land Rover, and Audi in the
area of automotive infotainment networks and electronics.
He received his PhD degree from the University of Tuebingen, Germany, in
2008. He is working as an Assistant Professor in United
International University, Dhaka, Bangladesh.
Bernd J. Kröger received his Master's
degree in Physics, University of
Münster in 1985. He also received his
PhD degree in Phonetics, University of
Cologne, Germany in 1989. Now, he is
working as a senior scientific staff
member at Department of Phoniatrics,
Pedaudiology, and Communication
Disorders, UKA and RWTH Aachen University. His research
fields are Articulatory feature–based Automatic Speech
Recognition and Speech Synthesis.