
Effects of Syllable Language Model on Distinctive Phonetic Features (DPFs) based Phoneme Recognition Performance

Mohammad Nurul Huda Department of CSE, United International University, Dhaka, Bangladesh

Email: [email protected]

Manoj Banik Department of CSE, Ahsanullah University of Science and Technology, Dhaka, Bangladesh

Email: [email protected]

Ghulam Muhammad College of Computer & Information Science, King Saud University, Riyadh, Saudi Arabia

Email: [email protected]

Mashud Kabir Department of EEE, United International University, Dhaka, Bangladesh

Email: [email protected]

Bernd J. Kröger Dept. of Phoniatrics and Communication Disorders, University Hospital Aachen, Aachen, Germany

Email: [email protected]

Abstract—This paper presents a distinctive phonetic features (DPFs) based phoneme recognition method that incorporates syllable language models (LMs). The method comprises three stages. The first stage extracts three DPF vectors of 15 dimensions each from local features (LFs) of an input speech signal using three multilayer neural networks (MLNs). The second stage incorporates an Inhibition/Enhancement (In/En) network to obtain more categorical DPF movement and decorrelates the DPF vectors using the Gram-Schmidt orthogonalization procedure. The third stage embeds acoustic models (AMs) and LMs of syllable-based subwords to output more precise phoneme strings. The experiments show that the proposed method provides a higher phoneme correct rate as well as a substantial improvement in phoneme accuracy. Moreover, it achieves higher phoneme recognition performance with fewer mixture components in the hidden Markov models (HMMs).

Index Terms—distinctive phonetic features, language models, syllable-based subword, inhibition/enhancement network, hidden Markov models

I. INTRODUCTION

Conventional automatic speech recognition (ASR) systems use stochastic pattern matching techniques in which word candidates are matched against word templates represented by hidden Markov models (HMMs). Although these techniques perform adequately in certain limited applications, they always reject a new vocabulary item, a so-called out-of-vocabulary (OOV) word. Therefore, an accurate phonetic typewriter, or phoneme recognizer, is expected to assist next-generation ASR systems in resolving this OOV-word problem [1, 2] via a short interaction (talk-back) that automatically adds the word to the word lexicon from the phoneme string of the input utterance (see Fig. 1).

Figure 1. An ASR system with OOV detection. (Components: feature extractor, word recognizer with word-list and AMs, OOV detector, phoneme recognizer with phone-list, and sound-to-letter converter; the input utterance yields word strings, while a detected OOV utterance yields OOV phonemes and the OOV word, which is added to the word-list.)

Though various distinctive phonetic features (DPFs) based methods, which can resolve coarticulatory phenomena [3, 4], have been proposed to obtain a more accurate phoneme recognizer [5, 6, 7, 8], they provide a higher phoneme correct rate (PCR) but a poor phoneme accuracy rate (PAR). The lower phoneme recognition accuracy is caused by violations of phonotactic constraints at different places in the output phoneme string of an input utterance. Therefore, syllable-based subword items, which impose grammatical constraints at the recognition stage, need to be incorporated to obtain a phoneme recognizer with higher accuracy.

TABLE I. JAPANESE BALANCED DPF-SET FOR CLASSIFYING ATR PHONEMES. (Rows: the 15 DPF elements vocalic, high, low, nil, anterior, back, nil, coronal, plosive, affricative, continuant, voiced, unvoiced, nasal, and semi-vowel; columns: the ATR phonemes a, i, u, e, o, N, w, y, j, my, ky, dy, by, gy, ny, hy, ry, py, p, t, k, ts, ch, b, d, g, z, m, n, s, sh, h, f, r; each cell contains a "+" or "-" mark.)

In this paper, we propose a more accurate phoneme recognition method by incorporating syllable language models (LMs). The method comprises three stages. The first stage extracts three DPF vectors of 15 dimensions each from local features (LFs) of an input speech signal using three multilayer neural networks (MLNs). The second stage incorporates an Inhibition/Enhancement (In/En) network to obtain more categorical DPF movement and decorrelates the DPF vectors using the Gram-Schmidt (GS) algorithm. The third stage embeds acoustic models (AMs) and LMs of syllables at the recognition stage of an HMM-based classifier. The objective of this study is to incorporate rule-based (syllable-constraint) LMs that block phoneme strings violating syllable rules and, consequently, to obtain a higher PAR.
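The phoneme correct rate and phoneme accuracy rate used throughout this paper follow the standard string-alignment measures of ASR evaluation; the short sketch below summarizes them for reference (this is an illustrative addition, not part of the proposed method, and the helper names are hypothetical):

    # Standard definitions: align the recognized phoneme string against the
    # reference and count substitutions (S), deletions (D), and insertions (I)
    # over N reference phonemes. PCR ignores insertions; PAR penalizes them.
    def phoneme_correct_rate(n_ref, n_sub, n_del):
        return 100.0 * (n_ref - n_sub - n_del) / n_ref

    def phoneme_accuracy_rate(n_ref, n_sub, n_del, n_ins):
        return 100.0 * (n_ref - n_sub - n_del - n_ins) / n_ref

    # Example: /kakemasu/ has 8 reference phonemes; a hypothesis with two
    # insertions and no other errors gives PCR = 100.0 but PAR = 75.0.
    print(phoneme_correct_rate(8, 0, 0), phoneme_accuracy_rate(8, 0, 0, 2))

This is why a method can report a high PCR while its PAR remains poor: insertions are invisible to the correct rate.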

The paper is organized as follows. Section II discusses the DPFs, and Section III describes a hybrid neural network (Recurrent Neural Network (RNN) and MLN) based phoneme recognition method. Section IV explains the system configuration of the proposed phoneme recognition method. The experimental database and setup are provided in Section V, while the experimental results are analyzed in Section VI. Finally, some conclusions are drawn in Section VII.

II. DISTINCTIVE PHONETIC FEATURES

A phoneme can easily be identified by its unique DPF set [9, 10]. The Japanese balanced DPF set [11, 12] for classifying phonemes has 15 elements: vocalic, high, low, the intermediate expression between high and low <nil>, anterior, back, the intermediate expression between anterior and back <nil>, coronal, plosive, affricate, continuant, voiced, unvoiced, nasal, and semivowel. Table I shows this balanced DPF set. Present and absent elements of the DPFs are indicated by "+" and "-" signs, respectively.
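As a concrete illustration of this representation, a phoneme can be stored as a 15-element present/absent vector over the DPF elements listed above. The sketch below uses the element order from the text; the example values for the vowel /a/ are standard phonological assumptions given for illustration only and are not copied from Table I:

    # The 15 DPF elements in the order given in the text.
    DPF_ELEMENTS = ["vocalic", "high", "low", "nil(high/low)", "anterior",
                    "back", "nil(anterior/back)", "coronal", "plosive",
                    "affricative", "continuant", "voiced", "unvoiced",
                    "nasal", "semi-vowel"]

    # A phoneme maps to a 15-dimensional present(1)/absent(0) vector.
    # Illustrative (assumed) values for the vowel /a/:
    dpf_a = dict(zip(DPF_ELEMENTS,
                     [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]))
    assert len(dpf_a) == 15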

III. HYBRID NEURAL NETWORK-BASED PHONEME RECOGNITION METHOD

In the first stage, this system [13] extracts a 45-dimensional DPF vector using a hybrid neural network (HNN), which consists of an RNN and an MLN, and then decorrelates these features using the GS algorithm [11]. Here, the RNN is used for handling a longer context window [14]. Fig. 2 shows a block diagram of the system using the HNN. The input acoustic vector is formed by taking the preceding (t-3)-th and succeeding (t+3)-th frames together with the current t-th frame, where each input frame consists of 25 LF values. The RNN generates 45 DPF values, of which the first 15 correspond to the preceding frame, the middle 15 to the current frame, and the remaining 15 to the succeeding frame. The MLN also generates 45 DPF values and reduces phoneme misclassification at phoneme boundaries by incorporating a seven-frame context window (from t-3 to t+3) as input. In the second stage, the system incorporates HMM-based AMs, which are trained on the orthogonalized feature vectors of the first stage, and LMs of syllable-based subwords, explained in Section IV.C, to output more precise phoneme strings.
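As a concrete illustration of the input framing described above, a minimal sketch of stacking the <t-3, t, t+3> local-feature frames into one 75-dimensional network input, and of splitting a 45-dimensional DPF output into its preceding/current/succeeding parts, is given below (edge handling by clamping is an assumption):

    import numpy as np

    # lf: (num_frames, 25) array of local features.
    def stack_context(lf, t, offsets=(-3, 0, 3)):
        T = lf.shape[0]
        frames = [lf[min(max(t + o, 0), T - 1)] for o in offsets]  # clamp at the edges
        return np.concatenate(frames)            # 3 x 25 = 75-dimensional input

    # y45: 45-dimensional DPF output of the network at frame t.
    def split_dpf(y45):
        return y45[:15], y45[15:30], y45[30:]    # preceding, current, succeeding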

IV. PROPOSED PHONEME RECOGNITION METHOD

Fig. 3 shows the proposed phoneme recognition method, which comprises three stages. The first stage extracts a 45-dimensional DPF vector from the LFs of an input speech signal using three MLNs. The second stage incorporates the In/En functionality to obtain modified DPF patterns and decorrelates the DPF vector using GS orthogonalization [11] before connecting it to an HMM-based classifier. The third stage embeds the AMs and the syllable-based LMs in the HMM-based classifier to output the phoneme string.
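The Gram-Schmidt decorrelation of the second stage can be sketched as follows; this is a minimal illustration, assuming the transform is learned from mean-removed training DPF vectors (the exact basis construction of [11] is not reproduced here). QR factorization is used as the matrix form of Gram-Schmidt orthogonalization of the 45 feature dimensions:

    import numpy as np

    def fit_gs_decorrelation(dpf_train):           # dpf_train: (num_frames, 45)
        mean = dpf_train.mean(axis=0)
        # Orthogonalizing the 45 columns of the mean-removed data (Gram-Schmidt,
        # here via QR) yields a triangular transform R; applying R^-1 makes the
        # transformed dimensions mutually orthogonal over the training data.
        _, R = np.linalg.qr(dpf_train - mean)
        return mean, np.linalg.inv(R)

    def apply_gs_decorrelation(dpf, mean, R_inv):  # dpf: (..., 45)
        return (dpf - mean) @ R_inv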

A. DPF Extractor

In this method, three MLNs instead of two MLNs [5] are used to construct the DPF extractor.


Figure 2. Hybrid Neural Network based Phoneme Recognition Method. (Block diagram: speech signal -> Local Feature Extraction giving xt-3, xt, xt+3 of 25 LFs each -> RNN (output/hidden layers with delayed feedback, external input) and MLN -> DPF outputs yt-3, yt, yt+3 of 15 DPFs each -> Gram-Schmidt Orthogonalization -> HMM-based ASR with AMs and syllable LMs -> phoneme strings.)

Figure 3. Proposed Phoneme Recognition Method. (Block diagram: speech signal -> Local Feature Extraction -> MLN input of 25 dimensions x 3 frames (xt-3, xt, xt+3) -> MLN_LF-DPF (mapping LF to DPF, 45) -> MLN_cntxt (context inputs DPFt-3, DPFt, DPFt+3, 45) -> 45 DPF with delta-DPF and delta-delta-DPF -> MLN_Dyn (restricting DPF dynamics, 45) -> Inhibition/Enhancement Network -> Gram-Schmidt Orthogonalization -> HMM AMs with syllable LM -> phoneme string.)

The first MLN, MLN_LF-DPF, outputs DPFs [11, 12] for the input acoustic features, the LFs [15]; the second MLN, MLN_cntxt, reduces misclassification at phoneme boundaries by taking a seven-frame context (from t-3 to t+3) as input; and the third MLN, MLN_Dyn, restricts the DPF dynamics by incorporating dynamic parameters (ΔDPF and ΔΔDPF) into its input. The MLN_LF-DPF, which is trained using the standard back-propagation learning algorithm, has two hidden layers of 256 and 96 units, respectively, and takes three input vectors (t-3, t, t+3) of LFs of 25 dimensions each. The 45-dimensional context-dependent DPF vector provided by the MLN_LF-DPF at time t is passed to the MLN_cntxt, which consists of five layers including three hidden layers of 90, 180, and 90 units, respectively, and generates a 45-dimensional DPF vector with a small number of errors at phoneme boundaries by incorporating a context window of a particular size. This 45-dimensional DPF vector and its corresponding ΔDPF and ΔΔDPF vectors, calculated by three-point linear regression (LR), are passed to the subsequent MLN_Dyn, which consists of four layers including two hidden layers of 300 and 100 units, respectively, and outputs a 45-dimensional DPF vector with reduced fluctuations and dynamics. Both the MLN_cntxt and MLN_Dyn are also trained using the standard back-propagation algorithm. For the MLN_LF-DPF and MLN_cntxt, the context window <t-3, t, t+3> is selected instead of <t-3, t-2, t-1, t, t+1, t+2, t+3> because both yield approximately the same performance while <t-3, t, t+3> requires a lower computational cost. The number of units in each hidden layer of each MLN is selected by a tuning process.
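A compact sketch of the three-MLN cascade described above is given below. The hidden-layer sizes are those stated in the text and the sigmoid nonlinearity is the one given in Section V.B; the input dimensionality of MLN_cntxt (three 45-dimensional DPF vectors) is an assumption, and back-propagation training is omitted:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class MLN:
        # Plain feed-forward MLN with sigmoid hidden and output units
        # (forward pass only; training by back-propagation is not shown).
        def __init__(self, layer_sizes, seed=0):
            rng = np.random.default_rng(seed)
            self.W = [rng.standard_normal((m, n)) * 0.1
                      for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
            self.b = [np.zeros(n) for n in layer_sizes[1:]]
        def forward(self, x):
            for W, b in zip(self.W, self.b):
                x = sigmoid(x @ W + b)
            return x

    mln_lf_dpf = MLN([75, 256, 96, 45])       # LFs at t-3, t, t+3 (3 x 25) -> 45 DPFs
    mln_cntxt  = MLN([135, 90, 180, 90, 45])  # DPF(t-3), DPF(t), DPF(t+3) -> 45 (135-dim input assumed)
    mln_dyn    = MLN([135, 300, 100, 45])     # DPF + delta-DPF + delta-delta-DPF -> 45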

B. Inhibition/Enhancement Network

The DPF extractor provides 45 DPF patterns (15 for the preceding context, 15 for the current context, and 15 for the following context) for each input vector along the time axis. For some phonemes, these patterns may not match the patterns of the input phoneme string, and consequently some phonemes are incorrectly recognized by the HMM classifier. This phoneme misclassification sometimes occurs when the values of DPF peaks and DPF dips are close to each other. Therefore, a mechanism called the In/En network [5, 6, 7] is needed to obtain clearly separable DPF peaks and dips. An algorithm for this network is given below:

Step 1: For each element of the DPF vectors, find the acceleration (ΔΔ) parameter by using three-point LR.

Step 2: Check whether ΔΔ is positive (concave pattern), negative (convex pattern), or zero (steady state).

Step 3: Calculate f(ΔΔ):

if the pattern is convex, f(ΔΔ) = c1 / {1 + (c1 - 1) e^(β·ΔΔ)} (1)

if the pattern is concave, f(ΔΔ) = c2 + (1 - c2) / (1 + e^(β·ΔΔ)) (2)

if steady state, f(ΔΔ) = 1.0 (3)

Step 4: Find the modified DPF patterns by multiplying the DPF patterns by f(ΔΔ).
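A minimal sketch of this In/En post-processing is shown below, using the reconstructed forms of Eqs. (1)-(3) and the coefficient values reported in Section V.B (c1 = 4.0, c2 = 0.25, β = 80); the second difference is used here as a stand-in for the three-point LR acceleration, so the details are assumptions rather than the authors' exact implementation:

    import numpy as np

    def in_en(dpf, c1=4.0, c2=0.25, beta=80.0):
        # dpf: (num_frames, 45) DPF trajectories from the MLN stage.
        acc = np.zeros_like(dpf)
        acc[1:-1] = dpf[2:] - 2.0 * dpf[1:-1] + dpf[:-2]   # acceleration (second difference)
        f = np.ones_like(dpf)                              # steady state: f = 1.0
        convex, concave = acc < 0, acc > 0
        # Enhancement of peaks (convex) and inhibition of dips (concave).
        f[convex] = c1 / (1.0 + (c1 - 1.0) * np.exp(beta * acc[convex]))
        f[concave] = c2 + (1.0 - c2) / (1.0 + np.exp(beta * acc[concave]))
        return dpf * f                                     # modified DPF patterns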

C. Syllable-Based Language Model

Phoneme classification inaccuracy at the acoustic-phonetic level is a major weakness of most speech recognition systems, and this inaccuracy often violates phonotactic constraints. Usually, three types of phoneme classification errors appear: (i) phoneme insertions, (ii) phoneme deletions, and (iii) phoneme substitutions. Better performance is expected if a language model is adopted in the recognition system for post-processing phoneme estimates and for making corrections with regard to grammatical constraints. Words, syllables, phonological rules, and other properties of a language impose constraints on phoneme sequence generation, but a language has a very large number of words and knowledge-based rules. In contrast, since the number of syllables in Japanese is limited, syllables can be used to enhance phoneme recognition performance. For example, a phoneme recognizer may generate the phoneme string /k/ /a/ /w/ /k/ /e/ /y/ /m/ /a/ /s/ /u/ for the input utterance /kakemasu/, which means "write" in English. The actual phoneme string for the utterance is /k/ /a/ /k/ /e/ /m/ /a/ /s/ /u/, so the phoneme recognizer produces two insertions. After adopting LMs that support only the CV, CVV, CV1V2, CVN, and CVQ syllable structures, we can correct the phoneme string /k/ /a/ /w/ /k/ /e/ /y/ /m/ /a/ /s/ /u/ by blocking the generation of the invalid sequences /k/ /a/ /w/ and /k/ /e/ /y/.
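To make the effect of these constraints concrete, the toy sketch below checks whether a phoneme string can be parsed into the allowed syllable structures (CV, CVV, CV1V2, CVN, CVQ); the consonant and vowel sets are illustrative assumptions rather than the full ATR inventory, and a real LM would score hypotheses rather than reject them outright:

    VOWELS = {"a", "i", "u", "e", "o"}
    CONSONANTS = {"k", "s", "sh", "t", "ch", "ts", "h", "f", "m", "n", "y",
                  "r", "w", "g", "z", "j", "d", "b", "p"}

    def is_valid_syllable(syl):
        # CV, CVV, CV1V2, CVN, CVQ: a consonant followed by one or two vowels,
        # optionally closed by the moraic nasal /N/ or obstruent /Q/.
        if len(syl) < 2 or syl[0] not in CONSONANTS:
            return False
        body = syl[1:]
        if body[-1] in ("N", "Q"):
            body = body[:-1]
        return 1 <= len(body) <= 2 and all(p in VOWELS for p in body)

    def parses_into_syllables(phonemes):
        # Back-tracking parse of the phoneme string into valid syllables.
        if not phonemes:
            return True
        return any(is_valid_syllable(phonemes[:n]) and parses_into_syllables(phonemes[n:])
                   for n in (4, 3, 2))

    print(parses_into_syllables(["k", "a", "k", "e", "m", "a", "s", "u"]))            # True
    print(parses_into_syllables(["k", "a", "w", "k", "e", "y", "m", "a", "s", "u"]))  # False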

V. EXPERIMENTS

A. Speech Databases

If the same data set were used for all the classifiers in the experiments, the test data could be biased toward the training data at each stage. Therefore, an open test is performed for each stage classifier (neural network-based and HMM-based) by using different training and test data sets. The following five clean data sets are used in our experiments.

D1. Training data set for MLN_LF-DPF: a subset of the Acoustical Society of Japan (ASJ) Continuous Speech Database [16] comprising 4503 sentences uttered by 30 different male speakers (16 kHz, 16 bit).

D2. Training data set for MLN_cntxt: 5000 sentences taken from the Japanese Newspaper Article Sentences (JNAS) Continuous Speech Database [17], uttered by 33 different male speakers (16 kHz, 16 bit).

D3. Training data set for MLN_Dyn: 5000 JNAS [17] sentences uttered by 33 different male speakers (16 kHz, 16 bit). The speakers of this data set are different from those of D2.

D4. Training data set for the HMM classifier: 5000 JNAS [17] sentences uttered by 33 different male speakers (16 kHz, 16 bit). The speakers of this data set are different from those of D2 and D3.

D5. Test data set: 2379 JNAS [17] sentences uttered by 16 different male speakers (16 kHz, 16 bit).

B. Experimental Setup

The frame length and frame rate (frame shift between two consecutive frames) are set to 25 ms and 10 ms, respectively, to obtain acoustic features from an input speech signal. The LF is a 25-dimensional vector consisting of 12 delta coefficients along the time axis, 12 delta coefficients along the frequency axis, and the delta coefficient of the log power of the raw speech signal [15].
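A rough sketch of how such a local feature vector could be assembled from a time-frequency representation is given below; the mel filter-bank front end, the three-point derivative, and the truncation to 12 coefficients are all assumptions standing in for the actual LF extraction of [15]:

    import numpy as np

    # log_mel: (num_frames, num_bands >= 12) log mel spectrogram,
    # log_power: (num_frames,) log power per frame.
    def local_features(log_mel, log_power):
        d_time = np.zeros_like(log_mel)
        d_time[1:-1] = (log_mel[2:] - log_mel[:-2]) / 2.0            # delta along time
        d_freq = np.zeros_like(log_mel)
        d_freq[:, 1:-1] = (log_mel[:, 2:] - log_mel[:, :-2]) / 2.0   # delta along frequency
        d_pow = np.zeros_like(log_power)
        d_pow[1:-1] = (log_power[2:] - log_power[:-2]) / 2.0         # delta log power
        # 12 + 12 + 1 = 25 dimensions per frame; truncation to 12 is a stand-in
        # for the dimensionality reduction used in [15].
        return np.concatenate([d_time[:, :12], d_freq[:, :12], d_pow[:, None]], axis=1)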

Since our goal is to design a more accurate phoneme recognizer, the PCR and PAR for the D5 data set are evaluated using an HMM-based classifier. For the proposed method, the D4 data set is used to design 38 Japanese monophone HMMs with five states, three loops, and a left-to-right topology; the input features for the classifier are DPFs. On the other hand, in the HNN-based phoneme recognition method, the D1 data set is used to design the HMMs, and orthogonalized DPFs are used as input features. In the HMMs, the output probabilities are represented by Gaussian mixtures with diagonal covariance matrices, and the number of components per mixture is set to 1, 2, 4, 8, and 16.
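For concreteness, the five-state, three-loop, left-to-right topology can be pictured as the transition structure sketched below; treating states 0 and 4 as non-emitting entry/exit states and the self-loop value are assumptions, and the Gaussian-mixture output densities are only indicated by a comment:

    import numpy as np

    def left_to_right_transitions(self_loop=0.6):
        A = np.zeros((5, 5))
        A[0, 1] = 1.0                      # non-emitting entry state -> first emitting state
        for s in (1, 2, 3):                # three emitting states, each with a self-loop
            A[s, s] = self_loop
            A[s, s + 1] = 1.0 - self_loop
        return A                           # state 4 is the non-emitting exit state

    # Each emitting state carries a Gaussian mixture (1, 2, 4, 8, or 16 components)
    # with a diagonal covariance matrix over the 45-dimensional DPF features.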

In our experiments with the three MLNs and the HNN, the nonlinear function for the hidden and output layers is a sigmoid ranging from 0 to 1, 1/(1 + exp(-x)).

For the In/En network, the value of the enhancement coefficient, c1, is set to 4.0 after evaluating the proposed method, DPF (3-MLN+In/En+GS, dim: 45), for different values of c1 such as 2, 4, and 6, and the value of the steepness coefficient, β, is set to 80. The value of the inhibitory coefficient, c2, is fixed to 0.25 after observing the DPF data patterns, so as to keep the values of f(ΔΔ) between 0.25 and 1.0.

For the experiments incorporating LMs, Julius version 3.5.3 [18] is used to embed the acoustic models and the trigram subword models (short and long syllables) as LMs. Julius is a two-pass system, where the first pass uses a bigram and the second pass uses a trigram to impose more constraints on the output phoneme sequence. The syllable structures are CV, CVV, CV1V2, CVN, and CVQ. The LM weight/insertion penalties are 5.0/-1.0 and 6.0/0.0 for the first and second passes, respectively.
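As an illustration of the syllable-level n-gram statistics embedded as LMs, a toy maximum-likelihood trigram over syllable sequences might be estimated as in the sketch below; a real LM for Julius would be an ARPA-format bigram/trigram with discounting and back-off, which is omitted here:

    from collections import Counter

    def train_syllable_trigram(corpus):
        # corpus: list of syllable sequences, e.g. [["ka", "ke", "ma", "su"], ...]
        tri, bi = Counter(), Counter()
        for syls in corpus:
            padded = ["<s>", "<s>"] + syls + ["</s>"]
            for i in range(2, len(padded)):
                tri[tuple(padded[i - 2:i + 1])] += 1
                bi[tuple(padded[i - 2:i])] += 1
        # P(w3 | w1, w2) by maximum likelihood (no smoothing).
        return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

    p = train_syllable_trigram([["ka", "ke", "ma", "su"], ["ka", "ki", "ma", "su"]])
    print(p("ka", "ke", "ma"))   # 1.0 in this toy corpus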

To investigate the effect of the syllable-based subword LM, we have designed the following phoneme recognition tests, where the DPF extractor inputs a 45-dimensional orthogonalized DPF vector to the HMM.

(a) 3-MLN+In/En+GS [7]
(b) 3-MLN+In/En+GS+LM [Proposed]

For further investigation, we have designed the following experiments for testing the phoneme recognition performance with and without incorporating the LM.

(c) Hybrid+GS
(d) Hybrid+GS+LM [13]

To compare the method of [5] with the proposed one, we have designed the following two experiments for evaluating PCR and PAR.

(e) 2-MLN+In/En+GS [5]
(b) 3-MLN+In/En+GS+LM [Proposed]

To compare the context window size <t-3, t, t+3> of method (e) with <t-3, t-2, t-1, t, t+1, t+2, t+3> of method (f), we have evaluated PCR and PAR in the following two experiments.

(f) Cont.2-MLN+In/En+GS
(e) 2-MLN+In/En+GS [5]

VI. EXPERIMENTAL RESULTS AND DISCUSSION

Fig. 4 shows the PCR of the proposed method before and after adding the syllable-based subword LM. For all the mixture components investigated, the method with the LM provides a higher PCR. For example, at mixture component 16, the method without the LM shows a PCR of 84.50%, while the corresponding value for the method with the LM is 87.40%.

The method with the LM also outperforms the method without the LM in phoneme accuracy, as shown in Fig. 5. With mixture components 1 and 2, the syllable knowledge corrects the largest number of phonemes and shows the greatest improvement. Moreover, the addition of syllable knowledge yields the highest accuracy (80.54%) at mixture component 16. Therefore, the language constraints resist misclassification of phonemes and, consequently, provide higher phoneme recognition performance in a clean acoustic environment.

Again, Figs. 6 and 7 depict the PCR and PAR, respectively, for the HNN-based method with and without the syllable-based LM. Incorporation of the LM improves the PCR and PAR for all the mixture components investigated. It is observed from Fig. 6 that the language model improves the PCR most (by 6.92%) at mixture component eight, while the PAR is improved by 6.98%, as shown in Fig. 7.

Figure 4. Comparison of phoneme correct rate for the proposed method before and after adding the LM. (Phoneme correct rate (%), 70-90, versus number of mixture components 1, 2, 4, 8, 16; curves: 3-MLN+In/En+GS and 3-MLN+In/En+GS+LM; clean condition.)

Figure 5. Comparison of phoneme accuracy for the proposed method before and after adding the LM. (Phoneme accuracy (%), 50-90, versus number of mixture components 1, 2, 4, 8, 16; curves: 3-MLN+In/En+GS and 3-MLN+In/En+GS+LM; clean condition.)

Figure 6. Comparison of phoneme correct rate for the hybrid neural network based method before and after adding the LM. (Phoneme correct rate (%), 70-90, versus number of mixture components 1, 2, 4, 8, 16; curves: Hybrid+GS and Hybrid+GS+LM; clean condition.)

Thus, the syllable-based subword LM increases both PCR and PAR, and the improvement in PAR is especially large. The reason for the PAR improvement is that the syllables set up rules on phoneme generation, and these rules block phonemes that violate the constraints imposed by the syllables.


Figure 7. Comparison of phoneme accuracy for the hybrid neural network based method before and after adding the LM. (Phoneme accuracy (%), 30-70, versus number of mixture components 1, 2, 4, 8, 16; curves: Hybrid+GS and Hybrid+GS+LM; clean condition.)

Figure 8. Segmentation of the utterance /jiNkoese/ using the hybrid neural network.

Figure 9. Segmentation of the utterance /jiNkoese/ using the 3-MLN.

Segmentation of a clean /jiNkoese/ utterance is shown in Figs. 8 and 9 for the balanced DPF set [11] using the HNN and the 3-MLN, respectively. In both figures, the solid thin line and the solid bold line represent the ideal segmentation and the output segmentation, respectively; "nasal", "nil (high/low)", and "high" of phoneme /N/, and "unvoiced", "coronal", and "anterior" of phoneme /s/ are denoted by (1), (2), (3), (4), (5), and (6), respectively. From these marked places, it can be seen that the 3-MLN exhibits more precise segmentation (less deviation from the ideal boundary) than the HNN, reduces some fluctuations, and provides smoother DPF curves; hence, it misclassifies fewer phonemes. Consequently, the proposed 3-MLN based method provides higher phoneme recognition performance than the HNN based method.

Tables II and III show the PCR and PAR, respectively, for the method of [5] and the method proposed in this study. From both tables, it is observed that the incorporation of one additional MLN and the LMs increases PCR and PAR for all mixture components investigated, and the PAR improvement is much larger than that of PCR. For example, at mixture component 16, approximately a 4.00% improvement in PCR is observed (see Table II), while the corresponding improvement in PAR is approximately 14.00% (see Table III).

Again, Tables IV and V show the PCR and PAR, respectively, for the method Cont.2-MLN+In/En+GS with the continuous window <t-3, t-2, t-1, t, t+1, t+2, t+3> and the method of [5] with the discrete window <t-3, t, t+3>. Both tables show that the two methods provide almost the same performance for all the higher mixture components investigated. Moreover, the higher input dimensionality of the continuous seven-frame window requires more computation in the training process of the neural networks and consequently slows down both the training and test phases. Therefore, the discrete window <t-3, t, t+3> is selected for the experiments.

Table VI shows the effect of different training data sets, for the same test data (D5), on phoneme recognition performance. From the table, it is observed that the method with the same training data set for all classifiers (first data row of the table) and the method with a different training data set for each classifier (second data row) show almost the same performance for all the mixture components investigated.

TABLE II. PHONEME CORRECT RATE OF METHOD [5] AND THE PROPOSED METHOD

Phoneme Correct Rate (%)
Method                   1 mix   2 mix   4 mix   8 mix   16 mix
2-MLN+In/En+GS [5]       82.18   82.80   83.26   83.30   83.33
Proposed                 84.10   86.55   86.89   87.16   87.40

TABLE III. PHONEME ACCURACY RATE OF METHOD [5] AND THE PROPOSED METHOD

Phoneme Accuracy (%)
Method                   1 mix   2 mix   4 mix   8 mix   16 mix
2-MLN+In/En+GS [5]       56.53   59.00   62.54   63.79   66.47
Proposed                 76.55   78.98   79.69   79.88   80.54


TABLE IV. PHONEME CORRECT RATE FOR THE METHODS CONT.2-MLN+IN/EN+GS WITH CONTINUOUS <T-3, T-2, T-1, T, T+1, T+2, T+3> AND [5] WITH DISCRETE <T-3, T, T+3>

Phoneme Correct Rate (%)
Method                   1 mix   2 mix   4 mix   8 mix   16 mix
2-MLN+In/En+GS [5]       82.18   82.80   83.26   83.30   83.33
Cont.2-MLN+In/En+GS      82.84   83.11   83.11   83.27   83.35

TABLE V. PHONEME ACCURACY RATE FOR THE METHODS CONT.2-MLN+IN/EN+GS WITH CONTINUOUS <T-3, T-2, T-1, T, T+1, T+2, T+3> AND [5] WITH DISCRETE <T-3, T, T+3>

Phoneme Accuracy (%)
Method                   1 mix   2 mix   4 mix   8 mix   16 mix
2-MLN+In/En+GS [5]       56.53   59.00   62.54   63.79   66.47
Cont.2-MLN+In/En+GS      60.47   60.47   62.38   64.04   65.43

TABLE VI. EFFECTS OF TRAINING DATA SETS ON PHONEME RECOGNITION PERFORMANCE

Phoneme Correct Rate (%)
Training and test data set for each classifier     1 mix   2 mix   4 mix   8 mix   16 mix
Train: MLN D1, MLN D1, HMM D1; Test: D5             81.76   82.26   82.46   82.26   82.87
Train: MLN D1, MLN D2, HMM D4; Test: D5             81.59   81.36   82.28   82.60   83.14

VII. CONCLUSION

In this paper, a DPF-based phoneme recognition method has been presented that incorporates Japanese syllable-based LMs. From the experiments using the JNAS database, the following conclusions are drawn.

i) The proposed method with LMs increases the phoneme correct rate by approximately 2% to 3% in comparison with the method of [7] for all the mixture components investigated (see Fig. 4), while the improvement is 2% to 4% compared with the method of [5] (see Table II).

ii) It improves phoneme recognition accuracy by 9% to 18% in comparison with the method of [7] (see Fig. 5), and by 14% to 20% compared with the method of [5] (see Table III).

iii) It requires fewer mixture components to obtain a higher recognition performance.

In future research, the method proposed in this paper will be evaluated on Bengali phoneme recognition.

REFERENCES

[1] O. Scharenborg and S. Seneff, "A two-pass strategy for handling OOVs in a large vocabulary recognition task," Proc. Interspeech, 2005.
[2] I. Bazzi and J. R. Glass, "Modeling OOV words for ASR," Proc. ICSLP, Beijing, China, pp. 401-404, 2000.
[3] K. Kirchhoff et al., "Combining acoustic and articulatory feature information for robust speech recognition," Speech Commun., vol. 37, pp. 303-319, 2002.
[4] K. Kirchhoff, "Robust Speech Recognition Using Articulatory Information," Ph.D. thesis, University of Bielefeld, Germany, July 1999.
[5] M. N. Huda, H. Kawashima, and T. Nitta, "Distinctive Phonetic Feature (DPF) extraction based on MLNs and Inhibition/Enhancement Network," IEICE Trans. Inf. & Syst., vol. E92-D, no. 4, April 2009.
[6] M. N. Huda, H. Kawashima, K. Katsurada, and T. Nitta, "Distinctive phonetic feature (DPF) based phoneme recognition using MLNs and Inhibition/Enhancement network for noise robust ASR," Proc. NCSP'09, Honolulu, Hawaii, USA, March 2009.
[7] M. N. Huda, H. Kawashima, and T. Nitta, "Distinctive phonetic feature extraction based on 3-stage MLNs and Inhibition/Enhancement network," Technical Report of IEICE, SP08, December 2008.
[8] M. N. Huda, G. Muhammad, J. Horikawa, and T. Nitta, "Distinctive Phonetic Feature (DPF) Based Phone Segmentation using Hybrid Neural Networks," Proc. INTERSPEECH'07, pp. 94-97, Aug. 2007.
[9] S. King and P. Taylor, "Detection of Phonological Features in Continuous Speech using Neural Networks," Computer Speech and Language, vol. 14, no. 4, pp. 333-345, 2000.
[10] E. Eide, "Distinctive Features for Use in an Automatic Speech Recognition System," Proc. Eurospeech 2001, vol. III, pp. 1613-1616, 2001.
[11] T. Fukuda and T. Nitta, "Orthogonalized Distinctive Phonetic Feature Extraction for Noise-Robust Automatic Speech Recognition," IEICE Trans. Inf. & Syst., vol. E87-D, no. 5, pp. 1110-1118, 2004.
[12] T. Fukuda, W. Yamamoto, and T. Nitta, "Distinctive Phonetic Feature Extraction for Robust Speech Recognition," Proc. ICASSP'03, vol. II, pp. 25-28, 2003.
[13] M. N. Huda et al., "DPF-based phoneme recognition for OOV words detection," Acoustical Society of Japan (ASJ) meeting, March 2008.
[14] T. Robinson, "An Application of Recurrent Nets to Phone Probability Estimation," IEEE Trans. Neural Networks, vol. 5, no. 3, 1994.
[15] T. Nitta, "Feature extraction for speech recognition based on orthogonal acoustic-feature planes and LDA," Proc. ICASSP'99, pp. 421-424, 1999.
[16] T. Kobayashi et al., "ASJ Continuous Speech Corpus for Research," Trans. Acoustical Society of Japan, vol. 48, no. 12, pp. 888-893, 1992.
[17] JNAS: Japanese Newspaper Article Sentences. http://www.milab.is.tsukuba.ac.jp/jnas/instruct.htm
[18] Julius 3.5.3, released December 2006. http://julius.sourceforge.jp/en_index.php

Mohammad Nurul Huda was born in Lakshmipur, Bangladesh in 1973. He received his B.Sc. and M.Sc. degrees in Computer Science and Engineering from Bangladesh University of Engineering & Technology (BUET), Dhaka, in 1997 and 2004, respectively. He completed his Ph.D. at the Department of Electronics and Information Engineering, Toyohashi University of Technology, Aichi, Japan. He is now working as an Associate Professor at United International University, Dhaka, Bangladesh. His research fields include Phonetics, Automatic Speech Recognition, Neural Networks, Artificial Intelligence, and Algorithms. He is a member of the International Speech Communication Association (ISCA).

Manoj Banik was born in Habiganj, Bangladesh in 1972. He completed his B.Sc. in Computer Science and Engineering at BUET, Dhaka, Bangladesh. He is now working as an Assistant Professor in the Department of Computer Science and Engineering at Ahsanullah University of Science and Technology, Dhaka, Bangladesh. His research interests include Speech Recognition, Phonetics, and Algorithms. He is a member of the Bangladesh Engineers Institute (IEB).

Ghulam Muhammad was born in Rajshahi, Bangladesh in 1973. He received his B.Sc. in Computer Science and Engineering from Bangladesh University of Engineering & Technology (BUET), Dhaka, in 1997. He completed his M.E. and Ph.D. at the Department of Electronics and Information Engineering, Toyohashi University of Technology, Aichi, Japan, in 2003 and 2006, respectively. He is now working as an Assistant Professor at King Saud University, Riyadh, Saudi Arabia. His research interests include Automatic Speech Recognition and human-computer interfaces. He is a member of IEEE.

Mashud Kabir was born in Narayanganj, Bangladesh in 1976. He completed his Bachelor of Science (B.Sc.) in Electrical & Electronic Engineering at Bangladesh University of Engineering & Technology (BUET) in 2000 and earned his Master of Science (M.Sc.) in Communication Engineering from the University of Stuttgart, Germany in 2003. He worked at the Mercedes-Benz Technology Center, Germany, from 2003 to 2005 as a PhD student, and has more than five years of work experience in research and development projects for BMW, Land Rover, and Audi in the automotive infotainment network and electronics area. He received his PhD degree from the University of Tuebingen, Germany in 2008. He is working as an Assistant Professor at United International University, Dhaka, Bangladesh.

Bernd J. Kröger received his Master's degree in Physics from the University of Münster in 1985 and his PhD degree in Phonetics from the University of Cologne, Germany, in 1989. He is now working as a senior scientific staff member at the Department of Phoniatrics, Pedaudiology, and Communication Disorders, UKA and RWTH Aachen University. His research fields are articulatory feature based Automatic Speech Recognition and Speech Synthesis.
