Tone nucleus modeling for Chinese lexical tone recognition
Speech Communication 42 (2004) 447–466
Jinsong Zhang *,1, Keikichi Hirose
Department of Information and Communication Engineering, School of Engineering, University of Tokyo,
Bunkyo-ku, Tokyo 113-8656, Japan
Received 20 October 2003; received in revised form 20 October 2003; accepted 8 January 2004
Abstract
This paper presents a new scheme to deal with variations in fundamental frequency (F0) contours for lexical tone
recognition in continuous Chinese speech. We divide the F0 contour of a syllable into a tone nucleus and adjacent articulatory transitions, and use only the acoustic features of the tone nucleus for tone recognition. The tone nucleus of a syllable is assumed to realize the target F0 of the associated lexical tone, and usually conforms more closely to the standard tone pattern than the articulatory transitions do. A tone nucleus can be detected from a syllable F0 contour by a two-step algorithm. First, the syllable F0 contour is segmented into several linear F0 loci that serve as tone-nucleus candidates, using the segmental K-means segmentation algorithm. Then, the tone nucleus is chosen from the candidates by a predictor based on linear discriminant analysis. Speaker-dependent tone recognition experiments using tonal HMMs showed that our new approach achieved an improvement of up to 6% in tone recognition rate over a conventional one. This indicates not only that the tone nucleus keeps important discriminant information for the lexical tones, but also that our tone-nucleus-based tone recognition algorithm works properly.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Tone recognition; Underlying-target; Articulatory transition; Tone-nucleus
1. Introduction
Chinese (Mandarin or Standard Chinese) is a
well-known tonal language in which pitch tones
play important phonemic roles. Each syllable in
Chinese corresponds to a morpheme (ideographic
character) and is associated with a pitch tone
*Corresponding author. Present address: ATR Spoken
Language Translation Laboratories, 2-2-2 Kansai Science City,
Kyoto 619-0288, Japan. Tel.: +81-774-95-1314; fax: +81-774-
95-1308.
E-mail address: [email protected] (J. Zhang).
1 This work was done when the author was affiliated with the University of Tokyo.
0167-6393/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2004.01.001
(usually referred to as lexical tone). Phonemically,
a syllable is divided into two parts: an Initial and a
Final. An Initial can be a consonant or none. A
Final may be a vowel, a diphthong, or a triph-
thong and with an optional nasal ending. There
are four basic lexical tones (referred to as Tones 1,
2, 3, 4, respectively) and a neutral tone. The four
basic lexical tones are characterized by their perceptually distinctive pitch patterns, which are conventionally termed by linguists: high-level,
high-rising, low-dipping and high-falling tones
(Chao, 1968, pp. 25–26). The neutral tone,
according to (Chao, 1968), does not have any
specific pitch pattern, and is highly dependent on
the preceding tone and usually perceived to be
temporally short and with zero pitch range [p. 35].
There are only a very small number of morphemes,
such as suffixes and particles, which are always
uttered in the neutral tone [p. 36]. Fundamental frequency (henceforth F0) contours are the main
acoustic manifestations of pitch tones, and there
seem to be distinctive F0 patterns associated with
the four basic lexical tones as illustrated in Fig. 1.
There are a number of reasons for recognizing
lexical tones using F0 contours. First, while the
total number of phonetically differentiated sylla-
bles in Chinese is about 1282 (Cheng and Wang, 1990, p. 45), the number of base syllables, where
only differences in segmental features are con-
tained ignoring those of lexical tones, is only 412.
This means there exist a number of homophone
morphemes in Chinese. Tone recognition offers a
method to discriminate homophone words. Sec-
ond, automatic detection of intonation structure
such as prosodic phrase boundaries, focus, stress locations, etc., has been attracting more and more attention recently (Wightman and Ostendorf, 1994;
Niemann et al., 1998; Hirose and Iwano, 1999;
Stolcke et al., 1999). This is because it probably
offers an answer to problems residing in sponta-
neous speech like ungrammatical utterances, dis-
fluencies, syntax ambiguities, etc. Similar to the
lexical tones, the intonation structure is also acoustically manifested mainly through F0 con-
tours (Fujisaki, 1997). The interplay of these effects results in such complex sentential F0 variations that the F0 cues for intonation structure cannot be clarified unless those due to the lexical tones are known
Fig. 1. Standard distinctive F0 patterns of the four basic lexical tones.
(Xu, 1997a). Therefore study on tone recognition
from F0 contours is the first and unavoidable step
in incorporating intonation information process-
ing into Chinese spoken dialogue systems. Third, tone recognition is helpful for developing auto-
matic prosodic labeling system for Chinese speech.
A prosodically labeled database is useful for
prosody analyses and development of data-driven
text-to-speech (TTS) systems (Wightman and Os-
tendorf, 1994). Since syllable tones often change
from their original types in the dictionary due to
various factors such as tone neutralization, morphophonemic tone sandhi (Chao, 1968), speaker's dialect, grammatical structure (Shih, 1990),
etc., it is necessary to recognize and label the syl-
lable tones according to their real tonalities. An
automatic tone recognizer is preferred because
human labeling is time consuming and expensive.
Fourth, tone recognition is crucial to set up com-
puter-aided language learning (CALL) systems for foreigners learning Chinese. Pronouncing lexical
tones is rather difficult for people of non-tonal
languages, where pitch tones usually express dif-
ferent stress or intonation patterns instead of dif-
ferentiation of lexical meanings. Tone recognizer
may offer a diagnosis of a learner's pronunciation, and show the learner the way to correct his or her pitch features. Finally, besides the above-mentioned reasons, there are still other motivations for tone recognition, such as improving digit recognition performance (Wang and Seneff, 1998). In
this paper, we propose a new tone recognition
method, which not only offers a systematic way of
dealing with F0 variations for tone recognition,
but may also help to detect sentential intonation
structure.
Previous studies have shown that it is easy to
recognize lexical tones in isolated speech (Yang
et al., 1988; Le et al., 1993), but rather difficult to
recognize them from F0 contours of continuous
speech (Wang et al., 1997; Liu et al., 1999). This difference in performance can be ascribed to the fact that the lexical tones show consistent tonal F0 patterns when uttered in isolation, but show complex variations in continuous speech. The
complex F0 variations originate from the mechanical and physiological realization of the compound
intonation functions including the lexical tones
and sentential intonation structure (Xu and Wang,
2001). On the one hand, confounded nature of
information in F0 contours may obscure the F0
variation cueing lexical tones. For example, focus is usually related to F0-range-expansion of focused
words that are not in the final position of an
utterance and F0-range-suppression of post-focus
words. When a Tone 1 is associated with a focus, its F0 contour, which usually shows a high and flat pattern, may change into a rising or falling shape (Xu,
1999). On the other hand, F0 contours, which re-
flect the periods of the successive human vocal cords' vibrations, are bound to vary due to articulatory constraints. For example, voiced or unvoiced syl-
lable initial segments may lead to quite different
syllable F0 contours even when the lexical tones
are the same (Howie, 1974). The inertial charac-
teristic of the bio-mechanical vibrations makes the
neighboring tones interfere with each other exten-
sively, so that the early portion of a syllable F0 contour tends to vary according to the carryover effect of
the preceding tone (Xu, 1999).
In order to cope with F0 variations in tone
recognition, some researchers attempted to devel-
op prosody models for intonation functions and
articulatory constraints. However, these attempts
usually modeled only a few specific F0 variations, and thus lacked generality. Wang et al. (1990) proposed to use the accent command coefficients
of a well-known F0 contour generation model
(Fujisaki and Hirose, 1984) to recognize lexical
tones, assuming prosodic phrasing effects could be
modeled by the phrase command coefficients of the
model. In (Wang and Chen, 1994), Wang and
Chen utilized prosodic phrasing information ex-
tracted from text transcripts to aid tone recognition. A disadvantage of these approaches is that
the assumed prosody information is usually un-
available in a speech recognition system. It is ra-
ther difficult to decompose a Chinese sentential F0
contour into the phrase and accent components of
the generation model without human interference.
Moreover, the text transcripts are usually the target of recognition rather than a given input, except in a database labeling system. In (Chen and Wang, 1995), Chen and Wang investigated modeling the sentential into-
nation patterns like declarative or interrogative
utterances by five different states of a neural net.
Wang et al. (1997) adopted context-dependent
HMM models to model coarticulation effects be-
tween neighboring lexical tones. Problems associ-
ated with these attempts are that intonationpatterns and coarticulation effects in continuous
speech are usually much more complex than that
the proposals can model. Thus the proposals did
not show to be very efficient, and performance
improvement was limited: only about 1–2% in-
crease of tone recognition rates. Furthermore, due
to a lack of systematic linguistic considerations,
the above approaches proved difficult to further improve or extend to detect sentential
intonation structure.
Our study reported in this paper is quite dif-
ferent from the conventional ones mentioned
above. The basic difference lies in that we classify a
syllable F0 contour into segments of underlying
target and articulatory transitions, and only use
acoustic features of the segment of underlying target for tone recognition, whereas nearly all the
conventional approaches used acoustic features of
a whole syllable for tone recognition. The proposal
is also different from the main vowel based ap-
proach by us in (Zhang and Hirose, 1998) and by
Chen in (Chen et al., 2001), in that it does not
assume a consistent alignment between the segmental phones and tonalities. It thus offers a systematic approach to deal with F0 variations
resulting from various factors, such as voiced/un-
voiced syllable initials, coarticulation effects be-
tween neighboring tones, and sentential intonation
structure. Although the extracted F0 segments
have potential importance for detection of sen-
tential intonation structure, this paper only pre-
sents their application to tone recognition, leaving other issues for further studies.
The rest of this paper is organized as follows.
Section 2 introduces the representation of syllable
F0 variations by a proposed F0 segmental struc-
ture model (referred to as tone-nucleus model), in
which each syllable F0 contour is divided into tone
nucleus and articulatory transitions. The model is
grounded in linguistic studies and consistent with diverse linguistic findings. Section
3 gives a general description about the tone rec-
ognition system and the speech database used in
the study. Section 4 addresses the statistical means
to extract the tone nucleus from a syllable F0
contour, based on segmental K-means segmenta-
tion algorithm (Rabiner and Juang, 1993, p. 382) and linear discriminant analysis (LDA). Section 5 evaluates the application of the proposals through
tone recognition experiments on the continuous
speech database. Finally we conclude in Section 6.
2. Segmental representation of syllable F0 variations
A consistent account of F0 variations in a syl-
lable F0 contour in various phonetic contexts
is desired for the development of a systematic
method to deal with F0 variations for tone rec-
ognition. Based on a survey on linguistic findings,
we proposed an F0 segmental structure model which divides a syllable F0 contour into underlying target and articulatory transitions, and assumes no direct relation between the F0 segments and the internal structure of a syllable. This pro-
posal results from the continuing convergence of
observations on one basic issue:
• Lexical tone is not evenly distributed in a sylla-
ble.
2.1. Underlying target and articulatory transition
By observation and comparison of isolated tonal F0 contours of syllables with different seg-
mental structures, especially between those with
voiced and unvoiced syllable initials, Howie (1974)
concluded that the domain of a syllable tone in
Chinese does not spread to the entire voiced part
or the entire vocalic part of the syllable, but rather
is confined to the rhyming part of the syllable
including the vowel and its succeeding voiced segment [p. 147]. In (Rose, 1988), Rose suggested
that the perceptually valid F0 contour should not
include the period corresponding to the articula-
tory transition between two syllables. Shih (1988)
proposed that different tones have different align-
ments with the syllable: some starting from the
rhyme onset (Tones 1 and 3), but others starting
later, even from the middle of the rhyme (Tone 2). In (Whalen and Xu, 1992), the authors tried to
examine the critical information for tone percep-
tion by extracting small segments from a syllable,
and the experimental results led the authors to
conclude that not every portion of the F0 contour indicates the original tone [p. 25]. Also through
perceptual experiments using stimuli sliced from a
number of isolated syllables, Lin (1995) assumed
that a Chinese tone in isolation is mainly related to
the syllabic vowel and its adjacent transitions,
whereas neither initial consonants, glides, nor
final nasals play any tone-carrying roles. Based on
statistical results, Xu and Wang (2001) revealed that there should be no direct relation between
the segmental structure of a syllable and the tonal
contour, but suggested that the early portion of an
F0 contour always varies with the ending F0 of the
preceding syllable, whereas the later portion con-
verges to the contour that seems to conform to the
purported underlying pitch values.
Although there still exist some differences in the above-mentioned views, those studies, together
with a number of ones unmentioned here, seem to
converge to one point: a portion of a syllable F0
contour appears to conform to the underlying pitch
better than other portions. If we ignore the issue of
alignment of F0 contour with segmental units,
which received rather diverse views in the litera-
ture, we can reach a general agreement on the fact that tone information is not evenly distributed in a
syllable F0 contour: some segment of a syllable F0
contour may carry more tone information than
other ones. Therefore, we can classify a syllable F0
contour into underlying target and articulatory
transitions:
• Underlying target represents the F0 target and serves as the main acoustic cue for pitch percep-
tion.
• Articulatory transitions are the F0 variations
occurring as the transitions to or from the pitch
targets.
2.2. Tone nucleus model
After arriving at a view that a syllable F0
contour could be divided into underlying target
and articulatory transitions, we proposed an F0 segmental structure model of Chinese syllable F0
Table 1
Pitch target features of the four lexical tones

Targets    Tone 1    Tone 2    Tone 3    Tone 4
Onset      H         L         L         H
Offset     H         H         L         L

"H" and "L" depict high and low targets, respectively.
contours (referred to as tone nucleus model) which
gave a systematic view of variations in a syllable
F0 contour.
As illustrated in Fig. 2, a syllable F0 contour may be divided into three segments: onset course,
tone nucleus and offset course. The tone nucleus represents the underlying pitch targets and is obligatory; the onset and offset courses are articulatory transitions and are optional. Each F0
segment usually exhibits an asymptotically linear
curve (Xu and Wang, 2001), and one syllable F0
should contain no more than three F0 segments with quite different slopes. Based on previous
studies (Xu, 1998), we assume that there should be
no consistent relations between the F0 segmental
structure and the syllable internal structure, but
tone nucleus should reside in the Final portion of a
Chinese syllable.
• Tone nucleus: a portion of F0 contour that represents the pitch targets of the lexical tone. It is the
segment containing the most critical informa-
tion for tonality perception, thus called the
tone-critical segment.
• Onset course: the asymptotic F0 transition locus to the tone-onset target from a preceding vocal cords' vibration state.
• Offset course: the F0 transition locus from the tone-offset target to a succeeding vocal cords' vibration state.
Tone-onset target and tone-offset target indi-
cate the pitch values, each taking either a low (L) or
high (H) value, at the tone onset and offset,
respectively. These pitch values serve as distinctive
Fig. 2. Illustrations of the proposed F0 segmental structure model of Chinese syllable F0 contours: (onset course) – tone nucleus – (offset course). F0 segments in parentheses are optional; only the tone nucleus is obligatory.
features characterizing the four basic lexical tones (Table 1) (Xu, 1997b).
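The distinctive features in Table 1 can be read as a small lookup structure. The sketch below is our illustrative code, not part of the paper's system; it shows that the four basic tones are uniquely identified by their (onset, offset) target pair.

```python
# Pitch targets of the four basic lexical tones, after Table 1.
# "H" = high target, "L" = low target.
TONE_TARGETS = {
    "Tone 1": ("H", "H"),  # high-level
    "Tone 2": ("L", "H"),  # high-rising
    "Tone 3": ("L", "L"),  # low-dipping
    "Tone 4": ("H", "L"),  # high-falling
}

def tone_from_targets(onset, offset):
    """Recover the tone from its (onset, offset) target pair."""
    for tone, pair in TONE_TARGETS.items():
        if pair == (onset, offset):
            return tone
    raise ValueError("unknown target pair")
```

Because all four pairs are distinct, the mapping is invertible, which is exactly why the onset and offset targets can serve as distinctive features.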
Fig. 3 illustrates some frequently observed tonal
F0 variations in continuous speech. Although they
show great deviations from the standard F0 pat-
terns in Fig. 1, we can see that, from the view of
the tone-nucleus model, the tone nuclei delimited by the vertical bars still show patterns consistent with the underlying pitch targets, and F0 variations in the beginning and ending portions of a syllable
belong to articulatory transitions.
Among the four basic lexical tones, Tone 3 is
different from the other three tones, in that the
others are associated with rather stable F0 pat-
terns, i.e. Tone 1 with high and flat F0, Tone 2 with
rising F0 and Tone 4 with falling F0 (hence their contouricity) (Rose, 1988), whereas Tone 3 is found with a rather wide variety of F0 patterns. Tone 3 was
conventionally associated with a dipping F0 con-
tour (Chao, 1968, p. 26), but other studies have
shown that the final rise seen is usually absent in
Fig. 3. Illustration of tonal F0 contours with possible articulatory transitions for Tones 1, 2 and 4. The left and right vertical bars in each contour correspond to the possible tone onset and offset locations, and the middle F0 segment delimited by the tone onset and offset in each contour represents the tone nucleus of the tone. c1, c2 and c3 depict the onset courses, the tone nuclei and the offset courses.
Fig. 4. Illustrations of F0 contours of continua of "Tone 2 Tone 2", with a contrast of voiced initial segment (a) and unvoiced initial segment (b) in the second syllables. The thin vertical lines indicate the syllable boundaries.
non-prepausal positions (Xu, 1997b). Therefore,
the view of the tone nucleus of Tone 3 may be flexible: either the whole syllable F0 contour or the falling segment may be considered the tone nucleus.
2.3. Systematic framework to deal with F0 variations
Through F0 observations it was found that an
articulatory F0 transition may occupy a large portion of a syllable F0 contour: 30% in (Howie,
1974), 50–100 ms after the release of the initial
consonant in (Rose, 1988, p. 76) (accounting for
about 40% proportion in the given example [p.
65]), and more than 50% when the rising tone
starts from the middle of the rhyme (Shih, 1988).
Although the substantial F0 variations in articu-
latory transitions exert little influence on human pitch perception, they may confuse an
automatic tone recognizer.
The tone-nucleus model offers a possible systematic
framework to deal with F0 variations that result
from both articulatory constraints and con-
founded intonation functions, for tone recognition
and intonation function detection. Here, we illus-
trate briefly the possible ways to deal with the substantial F0 variations due to the following
factors 2 for tone recognition:
(1) Voiced/unvoiced syllable initial segments.
(2) Prosodic phrase boundary.
(3) Focus.
2 There are other factors, such as fast speaking rate, that may lead to significant F0 variations.
2.3.1. Voiced/unvoiced syllable initial segments
Fig. 4 illustrates F0 contours of two continua of "Tone 2 Tone 2". Due to the voicing (or not) of
the initial segment of the second syllables, F0
contours of the second syllables in the two con-
tinua differ greatly. Since conventional tone rec-
ognizers based on hidden Markov models (HMMs)
or neural nets usually take whole-syllable F0 contours as important acoustic cues for the lexical tones (Wang et al., 1997; Chen and Wang, 1995), such a difference due to voiced and unvoiced syl-
lable initial segments may not only affect the
training of tonal acoustic models, but also easily
lead to recognition errors. The F0 contour of the
second syllable in (a) of Fig. 4 has a dipping shape
like that of Tone 3, and tends to be mis-recognized
as Tone 3. However, according to the tone-nucleus
model, F0 loci "BC" in (a) and (b) of Fig. 4 are the tone nuclei, whereas the F0 locus "AB" in (a) is an
articulatory transition. If the articulatory transi-
tion is ignored in tone recognition, the above
problems may be avoided.
2.3.2. Prosodic phrase boundary
Prosodic phrasing refers to the perceived
groupings of words in speech, and it was suggested
(Zhang and Kawanami, 1999) to have influences on the laryngeal coarticulatory overlap during tone production. Although this issue still lacks
a systematic study, it was found that when there is
a prosodic boundary between two neighboring
tones, F0 contour of the second syllable is usually
free from coarticulation effect from the first sylla-
ble. Given a continuum of Tones 4 and 1, ‘‘H’’
targets in Tone 1 are usually rather lower than the "H" target of the preceding Tone 4, mainly
Fig. 6. Illustration of a typical F0 contour for a continuum "Tone 1 Tone 1 Tone 1", with a focus placed on the second Tone 1 syllable.
due to the carryover lowering effect from the ‘‘L’’
offset target of Tone 4 (Xu and Wang, 2001) when
the two tones are in one word. But when there
exists a prosodic boundary (such as a phrase boundary) between the two tones, the carryover
lowering effect may be stopped and the ‘‘H’’ tar-
gets of Tone 1 are raised to a higher position, as
illustrated in Fig. 5. This often results in a sub-
stantial rising transition F0 locus ‘‘CD’’ to the
‘‘H’’ targets of Tone 1. However, based on the
tone-nucleus model, the rising ‘‘CD’’ F0 locus is an
articulatory transition, and the tone information is mainly carried by the tone nucleus "DE" whose
shape still conforms to the underlying pitch tar-
gets. Furthermore, if the tone nuclei ‘‘AB’’ and
‘‘DE’’ of the two tones can be detected, then the
F0 range reset h of Tone 1 will be an important cue for detecting the prosodic phrasing boundary,
which is very useful in developing spontaneous
speech recognition and understanding systems (Niemann et al., 1998).
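As a minimal sketch of this cue (the function names and the 30 Hz threshold are our assumptions, not values from the paper), the F0 range reset h could be measured as the gap between the offset F0 of the preceding tone nucleus and the onset F0 of the succeeding one:

```python
def f0_reset(prev_nucleus, next_nucleus):
    """F0 range reset h: gap between the preceding tone-nucleus
    offset F0 (its last frame) and the succeeding tone-nucleus
    onset F0 (its first frame), both in Hz."""
    return next_nucleus[0] - prev_nucleus[-1]

def looks_like_phrase_boundary(prev_nucleus, next_nucleus, threshold=30.0):
    """Flag a possible prosodic phrase boundary when the upward
    reset exceeds a (hypothetical) threshold, here 30 Hz."""
    return f0_reset(prev_nucleus, next_nucleus) > threshold
```

For a Tone 4 nucleus falling to a low offset followed by a Tone 1 nucleus with a raised high onset, the reset is large and a boundary is flagged; without the boundary, carryover lowering keeps the reset small.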
2.3.3. Focus
Sentential focus may also bring about sub-
stantial F0 variations to syllable F0 contours.
According to Xu (1999), focus is related to three distinct pitch ranges: expanded range in non-sen-
tence-final focused words, suppressed in post-focus
words, and neutral in all other words. Fig. 6
illustrates this idea through a continuum of three
Tone 1s with a focus on the second Tone 1. We can
see that the focus leads to substantial rising and falling F0 loci in the early portions of the F0 contours of the second and the third syllables, respectively. These kinds of F0 loci are regarded as articulatory
transitions according to the tone-nucleus model,
whereas the segments ‘‘AB’’, ‘‘CD’’ and ‘‘EF’’ are
Fig. 5. Illustration of an F0 contour of a continuum of "Tone 4 Tone 1", with a prosodic phrase boundary between them.
the tone nuclei of the three Tone 1s. Compared
with the whole syllable F0 contours, the tone
nuclei show more consistent patterns with the
underlying pitch targets. Also, if the three tone
nuclei can be detected, the range differences rep-
resented by h1 and h2 are estimated as the gaps between the preceding tone offsets and succeeding
tone onsets of neighboring tone nuclei. These
range differences may serve as acoustic cues for
focus detection, which is important for interpret-
ing the speaker's intention in developing spontaneous dialogue systems.
2.3.4. Tone recognition based on tone nuclei
After the above discussions, we can see that
tone recognition can avoid being affected by the articulatory transitions by taking only the tone nuclei
into account. Obviously, segmenting a syllable F0
contour into possible tone nucleus and articula-
tory transitions is the prerequisite for this tone
recognition scheme. Since the tone nucleus is suggested to reside in a syllable Final, with the
segmentation of Initial and Final assumed avail-
able from a phoneme recognition process, we
focused on the Final portion to detect the tone
nucleus.
3. System overview and speech database
Having explored the linguistic foundations of
our tone recognition algorithm, we now give an
overview of the basic architecture of the tone rec-
ognition system, as illustrated in Fig. 7. With the
aid of the phonetic segmentation of Initial consonants and Finals, which is assumed available from a
Table 2
Token numbers of each tone in the training and testing data sets

Data set    Tone 1    Tone 2    Tone 3    Tone 4    Neutral tone    Total
Training    1138      1137      1004      2954      186             6419
Testing     473       430       391       1115      158             2567
Fig. 7. Block diagram for the tone recognition system: phonetic segmentation and F0/energy features feed Finals' F0 extraction, followed by tone nucleus detection (using the discrimination coefficients) and acoustic matching against the tonal HMMs to output the recognized lexical tones.
phoneme recognition process, only F0 contours
corresponding to the Finals are kept for the tone
recognition from the original F0s. Panels (b) and (c) of Fig. 11 illustrate the original and the extracted F0 contours of Finals. Then, a tone nucleus is detected for each Final's F0 contour. Finally, tone recognition is carried out by matching the acoustic features of a tone nucleus against the tonal HMMs.
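The processing chain of Fig. 7 can be summarized as a composition of stages. The function below is a hypothetical sketch (the names and signatures are ours): the nucleus detector and HMM scorer are passed in as stand-ins for the components described in the following sections.

```python
def recognize_tone(syllable_f0, final_span, detect_nucleus, score_with_hmms):
    """Tone recognition following the Fig. 7 pipeline:
    1. keep only the F0 frames of the Final,
    2. detect the tone nucleus within the Final's F0 contour,
    3. match the nucleus features against the tonal HMMs."""
    start, end = final_span            # frame span of the Final
    final_f0 = syllable_f0[start:end]
    nucleus = detect_nucleus(final_f0)
    scores = score_with_hmms(nucleus)  # {tone label: likelihood}
    return max(scores, key=scores.get)
```

Keeping the two model-dependent stages as parameters makes the articulatory-transition handling (Section 4) independent of the HMM matching that follows it.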
Data of a female speaker (0f) in the speech
corpus HKU96, published by Hong Kong Uni-
versity, was used in the following study. There are
two reasons for selecting the data: the first one is
that the problem dealt with here is more one of intra-speaker variation than of inter-speaker variation.
Thus testing on a speaker-dependent database is necessary as a first step to show the validity of our
approach. The second one is that the speaker 0f is
a phonetician who was involved in developing the
data corpus HKU96: her speech was much more
natural than that of other speakers whose speech
often sounded more like ‘‘citation’’ style. We took
500 utterances (6419 syllables) labeled from
cs0f0001 to cs0f0500 for tonal acoustic model training, and 200 utterances labeled from cs0f0501 to cs0f0700 as the testing set (2567 syllables) for tone
recognition. Table 2 gives the details of the data-
base. Average utterance length is 12.8 syllables in
both the training and testing set. The utterances
have average speaking rates ranging from 3.8 to
4.9 syllables per second, and sometimes more than
5 syllables per second.
The corpus offers phoneme, syllable (in Pinyin) and lexical tone labels together with orthographic
transcriptions. All the labels were manually
checked to correct possible errors: spectrograms,
and sometimes together with F0 contours, were
referred to for segmental alignment. A few tone
labels were modified according to their real
tonalities: for example, according to the Tone 3–
Tone 3 tone sandhi rule, the first one was changed to Tone 2. And according to some other morphophonemic tone sandhi rules (Chao, 1968), syllables like "yi" and "bu" have tonal alternations depending on their following tones. F0 was
extracted by using the integrated F0 tracking
algorithm (IFTA) (Secrest and Doddington, 1983),
and manual corrections were applied to any sig-
nificant F0 tracking errors such as half and doublefrequency. The window size for log-energy analysis
was 20 ms and frame shift for F0 and log-energy
extraction was set at every 10 ms.
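As an illustration of these analysis settings (the sampling rate below is our assumption; the paper does not state it here), per-frame log-energy with a 20 ms window and a 10 ms shift could be computed as:

```python
import math

def log_energy_frames(samples, sr=16000, win_ms=20, shift_ms=10):
    """Per-frame log-energy with a 20 ms window and 10 ms frame shift,
    matching the analysis settings described in the text."""
    win = int(sr * win_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    frames = []
    for start in range(0, len(samples) - win + 1, shift):
        energy = sum(x * x for x in samples[start:start + win])
        frames.append(math.log(energy + 1e-10))  # floor avoids log(0)
    return frames
```

The F0 contour is sampled on the same 10 ms frame grid, so the energy and F0 features stay time-aligned.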
4. Tone-nucleus detection
As the relation between tone nucleus and the
syllable internal segmental structure was not found
to be tight (Xu, 1998), we developed a standard
pattern recognition scheme to detect tone nuclei
from syllable Finals' F0 contours, instead of the heuristic approach of using vowel positions to seg-
ment syllable F0 contours as in (Zhang and Hi-rose, 1998; Chen et al., 2001). The scheme, as
illustrated in Fig. 8, mainly includes two steps:
extracting prosodic features of candidate F0 loci
for tone nucleus, and then discriminating the tone-
nucleus from other candidate F0 loci. However,
the candidate F0 loci are not available beforehand;
they need to be created through automatic seg-
mentation of a Final F0 contour into a concate-nation of asymptotically linear F0 loci.
Fig. 8. The scheme for tone-nucleus detection (blocks: F0 contour segmentation with phonetic segmentation and F0/energy features as input; test "number of segments = 2?"; prosodic feature extraction; tone-nucleus discrimination using the discrimination coefficients; tone-nucleus smoothing).
Some difficulties exist for the segmentation and
discrimination: one is how to robustly segment the
syllable F0 contours into successive linear F0 loci.
A segmentation method based on detecting turning
points like peak and valley points in F0 contours,
as used in (Hsie et al., 1988; Garding and Zhang,
1997), is not a good choice since there are possible extra local peak and valley points in F0 contours
due to their temporal fluctuations. Also it is diffi-
cult to find an appropriate threshold to segment an
F0 contour like that of Tone 1 in Fig. 3 since the
slope variation is gradual. Furthermore, the number of linear F0 loci in a syllable F0 contour is uncertain; it may be one, two or three. The second
difficulty is that we do not have adequate knowl-edge of acoustic features to discriminate tone nu-
cleus from other F0 loci.
Hence, we proposed to segment F0 contours by
segmental K-means segmentation algorithm (Ra-
biner and Juang, 1993, p. 382), and to amalgamate neighboring segments according to whether they pass a hypothesis test on equal means of F0 slope ratios. For the F0 contours that have two segments, we collected a number of acoustic features, such as
time, energy and F0 related variables, to do vari-
ance analyses (ANOVA) with respect to the
position of tone nucleus in order to get an image
about feature distributions of tone nuclei. Then we
exploited the method of linear discriminant anal-ysis (LDA) to design a discriminant function to
predict the position of tone nucleus.
4.1. F0 contour segmentation via segmental K-means segmentation algorithm
The segmental K-means segmentation algorithm is a variant of the well-known K-means iterative procedure for data clustering. The variation lies in that the re-assignment of samples to the clusters is achieved by finding the optimum segmental state sequence via the Viterbi algorithm, and then backtracking along the optimal path. The procedure is illustrated in Fig. 9.
Let the observation sequence O = (o_1 o_2 ... o_N) represent a Final's F0 contour, divided into I (1 ≤ I ≤ 3) successive segments. The observation vector o_j is a two-component vector (log F0_j, Δlog F0_j). The centroid of the ith (1 ≤ i ≤ I) segment is assumed to have the p.d.f. of the multivariate Gaussian p(o|U_i), where the parameter vector U_i includes the mean vector μ_i and the covariance matrix Σ_i, which are obtained from the n_i observation points of the ith segment by the maximum likelihood estimates:

  μ̂_i = (1/n_i) Σ_{k=1}^{n_i} o_k                                      (1)

  Σ̂_i = (1/n_i) Σ_{k=1}^{n_i} (o_k − μ̂_i)(o_k − μ̂_i)^t                 (2)
In the re-segmentation step, the likelihood

  p(o_j|U_i) = (1 / (2π |Σ̂_i|^{1/2})) exp( −(1/2)(o_j − μ̂_i)^t Σ̂_i^{−1} (o_j − μ̂_i) )    (3)

can be used in the Viterbi search to decide which segment the point o_j belongs to. When a segmentation of a Final's F0 contour becomes available, a check is made whether two successive segments can be amalgamated or not, based on the following two principles:
[Figure 9 flowchart: segmentation initialization (n_1 = n_2 = n_3); maximum likelihood estimation of the centroid parameters; re-segmentation through the Viterbi search algorithm, backtracking along the optimal path; convergence check; if I = 3 and n_2 < 5 frames, set I = I − 1 and repeat; T-test on neighboring segments and amalgamation until no further amalgamation. n_1, n_2, n_3: segment lengths in frames; I: the number of segments.]

Fig. 9. The segmental K-means segmentation procedure used to segment a syllable F0 contour into a possible number of linear F0 loci.
• Phonetic rule: a tone nucleus should be longer than 50 ms, because pitch perception studies have revealed a duration threshold of 40–60 ms for F0 contour perception (Rose, 1988, p. 70).
• Statistical rule: if there is no significant statistical evidence for distributional differences between two neighboring F0 loci, they are merged.
When a Final's F0 contour is divided into 3 segments, the medium segment should correspond to the tone nucleus according to the tone nucleus model, and its length is required to be longer than 50 ms, i.e. n_2 ≥ 5 according to the phonetic rule. Otherwise the number of segments I will be reduced to 2 and the segmentation repeated once more. Next, two successive segments of the F0 contour are checked whether to amalgamate into one or not, based on a T-test on the slope ratio K of the linear regression of log F0 on time t. For the observation points (t_j, log F0_j) in the segment c_i:

  log F0_j = K_i t_j + C_i,  where C_i is a constant

K_i may be estimated by taking the average of Δlog F0_j over the n_i points in c_i:

  K̂_i = (1/n_i) Σ_{j=1}^{n_i} Δlog F0_j

We test the two hypotheses on whether two neighboring segments have the same slope ratio or not,

  H_0: K_i = K_{i+1}
  H_1: K_i ≠ K_{i+1}
by using a test statistic T_{i,i+1},

  T_{i,i+1} = (K̂_i − K̂_{i+1}) / √( S (1/n_i + 1/n_{i+1}) )              (4)

  S = max{ [Σ_{i=1}^{I} Σ_{j=1}^{n_i} (Δlog F0_{ij} − K̂_i)²] / (N − I),  S_0 }

where S_0 is the prescribed minimum allowance for micro-fluctuations, calculated from the average variance of Δlog F0 over the whole utterance; N is the number of voicing points in the Final's F0 contour; and I is the number of F0 segments.
T_{i,i+1} has a Student t distribution with N − I degrees of freedom. The critical point for the two-tailed test at a level of significance α is

  cp = t_{N−I, 1−α/2} √( S (1/n_i + 1/n_{i+1}) )                        (5)

where the point t_{N−I, 1−α/2} satisfies

  P[ −t_{N−I, 1−α/2} ≤ T_{N−I} ≤ t_{N−I, 1−α/2} ] = 1 − α
We reject H_0 whenever |K̂_i − K̂_{i+1}| exceeds the computed critical point. If H_0 cannot be rejected at level α, the two neighboring F0 loci have the same slope ratio, and the two segments are merged. A Final's F0 contour may thus be divided into one, two or three segments, from which one is decided as the tone nucleus. For the one-segment case, the whole Final's F0 contour is the tone nucleus. For the three-segment case, the medium segment is the tone nucleus. For the two-segment case, the tone nucleus is chosen by a predictor based on a linear discriminant function, explained in the following section.

Fig. 10. Illustration of prosodic feature extraction for syllable w2 in continuous speech. The upper and lower panels represent the energy and F0 contours of syllable w2 and its neighbors. Vertical lines indicate the phonetic segmentation and the F0 segmentation result: thin lines indicate syllable boundaries between syllables w1, w2 and w3, broken lines indicate C–V boundaries, and dotted lines indicate the F0 segmentation of syllable w2. "A", "B" and "C" indicate the segmentation points which divide the F0 contour of syllable w2 into two segments c1 and c2. t0, t1 and t2 are the frames of the syllable onset, C–V boundary and syllable offset of syllable w2.
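Under the definitions of Eqs. (4) and (5), the amalgamation decision can be sketched as follows. This is a hypothetical helper, not the paper's code: the critical value t_crit is passed in externally (e.g. from Student-t tables with N − I degrees of freedom), and s0 stands for the micro-fluctuation allowance S_0.

```python
import numpy as np

def slope_t_statistic(dlf0_segs, s0):
    """Test statistic T_{i,i+1} of Eq. (4) for neighboring F0 loci.
    dlf0_segs: list of per-segment arrays of delta-log-F0 values
    (here the two segments under test). s0: minimum allowance S_0."""
    k_hat = [seg.mean() for seg in dlf0_segs]          # slope ratios K^_i
    n = [len(seg) for seg in dlf0_segs]
    N, I = sum(n), len(dlf0_segs)
    # pooled residual variance of the slopes, floored by s0
    resid = sum(((seg - k) ** 2).sum() for seg, k in zip(dlf0_segs, k_hat))
    s = max(resid / (N - I), s0)
    t = (k_hat[0] - k_hat[1]) / np.sqrt(s * (1.0 / n[0] + 1.0 / n[1]))
    return t, N - I                                    # statistic, d.o.f.

def should_merge(dlf0_segs, s0, t_crit):
    """Merge two neighboring segments when H0: K_i = K_{i+1} cannot be
    rejected at the two-tailed critical value t_crit."""
    t, _ = slope_t_statistic(dlf0_segs, s0)
    return bool(abs(t) <= t_crit)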
4.2. Tone-nucleus discrimination for two-segment
F0 contour
For the two segments of a divided Final F0 contour, we assume that only one of them is the tone nucleus: either the first half or the second half is the tone nucleus of the syllable. We use Groups −1 and 1 to represent them. We collected a number of prosodic features, including duration and energy related ones, to study whether they have different distributions for the two groups. Then, we use linear discriminant analysis (LDA) to select a subset of the prosodic features, and form a linear discriminant function which can do the group classification.
4.2.1. Prosodic features
As illustrated in Fig. 10, they are as follows:
4.2.1.1. Duration related features
• Normalized temporal location of points A and B with respect to the syllable onset:

  tA0 = (tA − t0) / (t2 − t0)
  tB0 = (tB − t0) / (t2 − t0)
• Normalized temporal location of points A, B and C with respect to the Consonant–Vowel (C–V) boundary:

  tA1 = (tA − t1) / (t2 − t1)
  tB1 = (tB − t1) / (t2 − t1)
  tC1 = (tC − t1) / (t2 − t1)

• Duration of c1 in frame numbers: n1.
• Duration of c2 in frame numbers: n2.
• Normalized duration of c1: dur1 = tB1 − tA1.
• Normalized duration of c2: dur2 = tC1 − tB1.
4.2.1.2. Energy related features
• Energy slope ratios of F0 segments c1 and c2:

  ΔP_c1 = (P_tB − P_tA) / (n1 − 1)
  ΔP_c2 = (P_tC − P_tB) / (n2 − 1)

where P_j, j = tA, tB, tC, denote the respective frame log energies.
• Segmental energy-sum ratio:

  ξ = (energy sum of c2) / (energy sum of c1) = Σ_{j=tB}^{tC} P_j / Σ_{k=tA}^{tB} P_k
4.2.2. Statistical distributional analysis
Let x represent any individual prosodic feature mentioned above, and μ_{−1}, μ_1 the mean values of x for Groups −1 and 1. We use ANOVA (Milton and Arnold, 1992, pp. 467–524) to test the hypothesis:

  H_0: μ_{−1} = μ_1

Given samples (x_{−1,1}, ..., x_{−1,N_{−1}}) and (x_{1,1}, ..., x_{1,N_1}) for the two groups from the training data, the following sample means can be calculated:

  μ̂_i = (1/N_i) Σ_{j=1}^{N_i} x_{i,j},  where i = −1, 1                 (6)

  μ̂ = (1/N) Σ_{i=−1,1} Σ_{j=1}^{N_i} x_{i,j},  where N = N_{−1} + N_1    (7)
The treatment sum of squares S_t and the residual sum of squares S_e are calculated as:

  S_t = Σ_{i=−1,1} N_i (μ̂_i − μ̂)²                                       (8)

  S_e = Σ_{i=−1,1} Σ_{j=1}^{N_i} (x_{i,j} − μ̂_i)²                        (9)
The test statistic f equals

  f = S_t / (S_e / (N − 2))

and is known to have an F distribution with 1 and N − 2 degrees of freedom, denoted F(1, N−2). A larger f value indicates more significant statistical evidence that the hypothesis H_0 is not correct.

In the experiment, we took 250 training utterances for the statistical analysis of tone nuclei. After F0 contour segmentation, we got 664 two-segment-divided Finals' F0 contours. We then manually labeled each of them as either Group −1 or 1. Among them, 289 samples belonged to Group −1, and the other 375 to Group 1. Table 3 gives the ANOVA statistics of each prosodic feature for the tone nuclei with respect to the two groups.
From Table 3, we can see that all the features except tA0 and tA1 have different distributions for the two groups "−1" and "1" above the 0.999 significance level. We can also see that the temporal location of point B is of the highest importance for discriminating the two groups, and that tB1, the temporal location normalized to the C–V boundary, is a better discriminating feature than tB0, which is normalized to the syllable onset. It is also interesting to note that when the former F0 segment (c1) is the tone nucleus, it tends to have a descending energy slope (ΔP_c1 = −0.046), whereas when the latter F0 segment (c2) is the tone nucleus, the former segment (c1) tends to have a rising energy slope (ΔP_c1 = 0.069).
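Equations (6)–(9) amount to a standard one-way ANOVA with two groups; a minimal sketch (function name our own):

```python
import numpy as np

def anova_f(x_neg, x_pos):
    """One-way ANOVA statistic of Eqs. (6)-(9) for Groups -1 and 1.
    Returns f, which follows F(1, N-2) under H0: equal group means."""
    groups = [np.asarray(x_neg, float), np.asarray(x_pos, float)]
    N = sum(len(g) for g in groups)
    mu = sum(g.sum() for g in groups) / N                     # grand mean, Eq. (7)
    st = sum(len(g) * (g.mean() - mu) ** 2 for g in groups)   # treatment SS, Eq. (8)
    se = sum(((g - g.mean()) ** 2).sum() for g in groups)     # residual SS, Eq. (9)
    return st / (se / (N - 2))
```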
Table 3
ANOVA statistics for groups "−1" and "1" based on 664 samples in the analysis data

Features   Mean (−1)   Mean (1)   Std dev (−1)   Std dev (1)   F(1,662)   Significance level
tA0        0.350       0.315      0.121          0.132         12.0       0.001
tA1        0.011       0.038      0.133          0.187         4.3        0.039
tB0        0.756       0.487      0.092          0.150         723.3      0.000
tB1        0.617       0.230      0.141          0.185         873.5      0.000
tC1        0.834       0.765      0.120          0.207         25.2       0.000
n1         10.2        5.0        3.354          1.813         647.0      0.000
n2         3.0         8.5        1.670          3.168         718.9      0.000
dur1       0.705       0.336      0.161          0.166         825.3      0.000
dur2       0.291       0.602      0.118          0.200         549.9      0.000
ΔP_c1      −0.046      0.069      0.077          0.134         171.0      0.000
ΔP_c2      −0.304      −0.085     0.185          0.101         373.5      0.000
ξ          0.272       2.365      0.241          1.547         519.4      0.000
4.2.3. Linear discriminant function based tone-nucleus predictor

We use a linear discriminant function as the automatic classifier to predict the tone nucleus for the two-segment divided Final F0 contours. Let x be the vector consisting of a number of the prosodic features mentioned above; the function is

  y = w^T x + w_0                                                       (10)

  x ∈ Group −1 if y < 0;  x ∈ Group 1 if y > 0                          (11)
With the training samples (x_{−1,1}, ..., x_{−1,N_{−1}}) of Group −1 and (x_{1,1}, ..., x_{1,N_1}) of Group 1, we use the Fisher ratio J_F (Webb, 1999, pp. 104–105) to find the w of Eq. (10):

  J_F = |w^T (μ_{−1} − μ_1)|² / (w^T S_w w)                             (12)

where

  μ_i = (1/N_i) Σ_{j=1}^{N_i} x_{i,j},  i = −1, 1

  S_w = (1/(N − 2)) (N_{−1} Σ̂_{−1} + N_1 Σ̂_1),  N = N_{−1} + N_1        (13)

and Σ̂_{−1} and Σ̂_1 represent the maximum likelihood estimates of the covariance matrices of Groups −1 and 1, respectively. By maximizing the ratio J_F, the optimal weight w is calculated as

  w = S_w^{−1} (μ_{−1} − μ_1)
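A minimal sketch of the Fisher criterion of Eqs. (12) and (13). The paper specifies only the direction w; the midpoint threshold used below for classification is our own convention, chosen so that samples projecting on the Group −1 side of the midpoint between the projected group means are assigned to Group −1.

```python
import numpy as np

def fisher_lda(X_neg, X_pos):
    """Fisher discriminant direction w = S_w^{-1}(mu_-1 - mu_1), which
    maximizes J_F of Eq. (12). Returns (w, midpoint threshold)."""
    X_neg, X_pos = np.asarray(X_neg, float), np.asarray(X_pos, float)
    n_neg, n_pos = len(X_neg), len(X_pos)
    mu_neg, mu_pos = X_neg.mean(axis=0), X_pos.mean(axis=0)
    # pooled within-class covariance S_w of Eq. (13), from ML covariances
    cov = lambda X, mu: (X - mu).T @ (X - mu) / len(X)
    Sw = (n_neg * cov(X_neg, mu_neg)
          + n_pos * cov(X_pos, mu_pos)) / (n_neg + n_pos - 2)
    w = np.linalg.solve(Sw, mu_neg - mu_pos)   # w = S_w^{-1}(mu_-1 - mu_1)
    thresh = 0.5 * w @ (mu_neg + mu_pos)       # midpoint of projected means
    return w, thresh

def classify(x, w, thresh):
    # with this w, Group -1 samples project above the midpoint threshold
    return -1 if w @ x > thresh else 1
```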
Furthermore, through feature selection algorithms such as the sub-optimal sequential forward selection (SFS) (Webb, 1999, pp. 222–223), a subset of prosodic features can be chosen to form a discriminant function which achieves good discrimination performance. The SFS works as follows:

(1) Initialization: let X, the objective feature set, be null.
(2) J_F estimation: for each x_i in the candidate feature set, form a new feature set (X, x_i) and calculate the Fisher ratio J_F(X, x_i) in Eq. (12).
(3) Best candidate: the feature x* = argmax_{x_i} J_F(X, x_i).
(4) Significance check: (J_F(X, x*) − J_F(X)) is checked to see if the improvement is statistically significant. If yes, include x* in X, delete x* from the candidate feature set, and go back to (2). Otherwise, output X.
(5) End.
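The SFS loop can be sketched as follows, with the statistical significance check of step (4) simplified to a fixed minimum gain in J_F (a stand-in assumption; the paper does not specify its test in detail):

```python
def sfs(candidates, fisher_ratio, min_gain=1e-3):
    """Sequential forward selection over feature names.
    fisher_ratio(subset) -> J_F for that feature subset."""
    selected, best_j = [], 0.0
    pool = list(candidates)
    while pool:
        # try adding each remaining feature, keep the best
        scored = [(fisher_ratio(selected + [f]), f) for f in pool]
        j_new, f_best = max(scored)
        if j_new - best_j <= min_gain:       # no significant improvement
            break
        selected.append(f_best)
        pool.remove(f_best)
        best_j = j_new
    return selected
```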
In our experiment, we used the manually labeled 664 samples (375 of Group −1 and 289 of Group 1) to do the feature selection analysis and design the linear discriminant function. The final feature set we got includes only 5 of the original 12 prosodic features: dur1, dur2, ΔP_c1, ΔP_c2, ξ. The estimated discriminant function
Table 4
Detection performance of tone nuclei for an open set of 50 utterances

                                   1-segment F0   2-segment F0   3-segment F0   Total
Number of tones                    247            161            263            671
Number of correct tone nuclei      247            152            255            654
Correct rate of tone nuclei (%)    100            94.4           97.0           97.5
Fig. 11. Illustration of the extraction of tone nuclei. Panel (a) depicts the speech waveform, with phonetic segmentations of Initials/Finals and syllables (sil er2 zhi2 yuan2 sil yu2 ying1 xun4 shi2 biao3 shi4 sil). Panel (b) depicts the original F0 contours, with the tonality and the Finals' boundaries labeled. Panel (c) illustrates the segmentation process for the tone nuclei from the Finals' F0 contours, where "Seg1" depicts the results of segmental K-means segmentation, "Seg2" the results of segment merging, and "Seg3" the tone nuclei. Panel (d) depicts the extracted F0 contours of the tone nuclei.
had an accuracy of 95.2% for discriminating the
two groups in the training data.
4.3. Tone nucleus detection performance
To evaluate the performance of the tone nu-
cleus detection method, we manually checked the
extracted tone nuclei of an open set consisting of
50 utterances (cs0f0251–cs0f0300) with 671 syllables. We regard the extracted tone nuclei as errors only when they obviously missed the targets. Thus, if the whole F0 contour of a Final is chosen as the tone nucleus, it is regarded as correct no matter whether it includes transitory courses or not. Table 4 shows the number of samples in the different segmentation groups and the correct rate of the extracted tone nuclei.
4.4. F0 smoothing of tone nucleus

After the tone nucleus is detected from a syllable F0 contour, the other F0 segments are set to zero for tone recognition. To be exact, in our approach small portions of F0 at the onset/offset neighboring the tone nucleus are kept to compute delta F0s. Fig. 11 illustrates the main steps of tone nucleus extraction; panel (d) of Fig. 11 shows the extracted tone nuclei.
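A minimal sketch of this masking step, with the amount of kept context as a hypothetical parameter (the paper does not specify how many neighboring frames are retained for the delta computation):

```python
import numpy as np

def mask_to_nucleus(log_f0, nuc_start, nuc_end, context=1):
    """Keep only the tone-nucleus F0 (frames nuc_start..nuc_end-1), plus
    `context` neighboring frames on each side so that delta-F0 can still
    be computed at the nucleus edges; zero everything else."""
    out = np.zeros_like(log_f0)
    lo = max(0, nuc_start - context)
    hi = min(len(log_f0), nuc_end + context)
    out[lo:hi] = log_f0[lo:hi]
    return out
```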
5. Tone recognition

Hidden Markov models (HMMs) were found to be an efficient method for pitch tone recognition in a number of previous studies (Ljolje and Fallside, 1987; Yang et al., 1988; Hirose and Hu, 1995; Wang et al., 1997). We also used HMMs as the tonal acoustic models in the tone recognition system. The difference of our approach from the others is that we use only the acoustic features corresponding to the tone nucleus of a syllable in the training and recognition stages, whereas the conventional ones all used acoustic features of the whole syllable. Training of the tonal HMMs was done by the well-known Baum–Welch re-estimation procedure, and acoustic decoding in the recognition period was done through the Viterbi search algorithm.
5.1. Tonal HMM structure

The HMM structure for the four basic lexical tones (Tones 1, 2, 3 and 4) has a left-to-right configuration with 5 states. The beginning and ending states have 2 Gaussian mixture components each, and the 3 middle states, assumed to represent tone nuclei, have 6 mixture components each. Since the neutral tone is regarded to be of short duration, its HMM has fewer states (3) and fewer mixture components in the middle state than those of the basic tones. The frame acoustic vector consists of 6 elements:

• log F0_j,
• Δlog F0_j = log F0_j − log F0_{j−1},
• ΔΔlog F0_j = log F0_{j+1} + log F0_{j−1} − 2 log F0_j,
• P_j − P_ave, where P_ave is the average log power value of P_j at the utterance level,
• ΔP_j = P_j − P_{j−1},
• ΔΔP_j = P_{j+1} + P_{j−1} − 2 P_j.
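The six-element frame vector can be computed as follows, assuming voiced interior frames only (the edge handling is our own simplification; the function name is not from the paper):

```python
import numpy as np

def frame_features(log_f0, log_pow):
    """Six-element frame vectors of Section 5.1: log F0 with its delta
    and delta-delta, and utterance-normalized log power with its delta
    and delta-delta. Computed for interior frames j = 1 .. N-2 only."""
    p = log_pow - log_pow.mean()                 # P_j - P_ave
    feats = []
    for j in range(1, len(log_f0) - 1):
        feats.append([
            log_f0[j],
            log_f0[j] - log_f0[j - 1],                       # delta
            log_f0[j + 1] + log_f0[j - 1] - 2 * log_f0[j],   # delta-delta
            p[j],
            p[j] - p[j - 1],
            p[j + 1] + p[j - 1] - 2 * p[j],
        ])
    return np.array(feats)
```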
No speaker normalization was conducted since
the system was designed for a speaker dependent
task. To extend it to a speaker independent task,
some considerations for speaker normalization
must be taken into account.
5.2. Tone recognition experiment I

The first tone recognition experiment was done using context-independent (CI) tonal HMMs. The number of HMMs is 5, i.e. one HMM for each of Tones 1, 2, 3, 4 and the neutral tone. The conventional method observing full-syllable features acted as the baseline system. Recognition results of the two approaches for the test set are shown in Fig. 12 as tone recognition rates in percentage for the four basic tones and as the total average including the neutral tone.
From Fig. 12 we can see that the new approach increased the absolute average rate by about 6% compared with the conventional one. This indicates that observing features of tone nuclei only yields a better performance than observing full-syllable features. When the performances of individual tones were viewed, we found one interesting phenomenon: the recognition rate improved for
Fig. 12. Tone recognition rates for the four basic tones and the absolute average in the two approaches using CI tonal HMMs. "Full" indicates the conventional approach observing full-syllable features and "Nucleus" indicates the proposed tone-nucleus based method.
Tones 1, 2 and 4, but for Tone 3 it even decreased a little (about 2%). We thus turn to the confusion matrices in Table 5 (for the conventional approach observing full-syllable features) and Table 6 (for the tone-nucleus approach) to find a reasonable explanation.

By comparing Tables 5 and 6, we can see how the use of the tone nucleus improved the tone recognition rates. Taking Tone 1 recognition as an example, the error rate of mis-recognizing Tone 1 as Tone 4 was reduced from 12.8% to 3.9%, and that as Tone 2 from 16.2% to 10.5%. However, the rates of mis-recognition as Tone 3 or the neutral tone were nearly unchanged. Since both the onset and offset pitch targets of Tone 1 are H (high), in continuous speech a rising onset course and/or a falling offset course may appear in the F0 contour of a Tone 1 syllable, as shown in Fig. 3. A Tone 1's F0 contour with a rising onset course may easily lead to an error of Tone 2, while
a Tone 1's F0 contour with a falling offset course may easily lead to an error of Tone 4. But in the tone-nucleus approach, these two kinds of mistakes can be avoided.

Table 5
Confusion matrix for the conventional approach based on CI tonal HMMs

Input tone         Recognition rates (%)
                   Tone 1   Tone 2   Tone 3   Tone 4   Neutral tone
Tone 1             69.2     16.2     1.4      12.8     0.5
Tone 2             8.5      76.4     8.7      4.2      2.1
Tone 3             0        10.4     70.0     17.2     2.4
Tone 4             4.5      3.5      5.7      85.3     1.0
The neutral tone   2.4      11.0     22.6     33.5     30.5

Similar reasons can also be
given to explain the improvements in the recognition rates of Tones 2 and 4.

The performance degradation of Tone 3 can also be clarified from the above tables. The mis-recognition of Tone 3 as the neutral tone increased significantly, from 2.4% to 11.6%. A close check of the data showed that when the original F0 contours of Tone 3 have lowering-rising dipping shapes, the extracted tone nuclei usually had only either the lowering or the rising F0 segments. The loss of the dipping shape, or contouricity, may result in more recognition errors for Tone 3 in the tone-nucleus approach than in the original method. Furthermore, as both Tone 3 and the neutral tone are associated with low pitch values, the confusion between Tone 3 and the neutral tone increased in the tone-nucleus approach. On the other hand, the mis-recognitions of Tone 3 as Tone 2 and Tone 4 were still reduced (from 10.4% to 4.7% as Tone 2 and from 17.2% to 15.7% as Tone 4). Future studies should be conducted to improve the
Table 6
Confusion matrix for the tone-nucleus approach based on CI tonal HMMs

Input tone         Recognition rates (%)
                   Tone 1   Tone 2   Tone 3   Tone 4   Neutral tone
Tone 1             83.6     10.5     1.6      3.9      0.4
Tone 2             7.7      84.5     3.8      1.8      2.2
Tone 3             0        4.7      68.0     15.7     11.6
Tone 4             4.1      0.5      4.2      90.7     0.4
The neutral tone   4.9      9.1      12.8     41.5     31.7
discrimination between Tone 3 and the neutral tone.³

The overall recognition performance for the neutral tone is notably low: 31.7% in the tone-nucleus approach and 30.5% in the conventional approach. This may be ascribed to the fact that F0 and energy are not enough to characterize the neutral tone. Other studies have shown that the neutral tone can be better discriminated based on duration features (Hirose and Hu, 1995; Chen and Wang, 1995). This problem should be dealt with in the future.
5.3. Tone recognition experiment II

The second tone recognition experiment used context-dependent tonal HMMs, which model the F0 variations due to coarticulation with neighboring tones, to improve tone recognition performance (Chen and Wang, 1995; Wang et al., 1997). For instance, a tone sequence "T1 T2 T3 T4" is acoustically modeled by context-independent (CI) or context-dependent (CD) models like:

  CI:  T1    T2    T3    T4
  CD:  T1-(T2)    (T1)-T2-(T3)    (T2)-T3-(T4)    (T3)-T4
Therefore, coarticulation effects on each tone due to its neighboring tones can be modeled by allotone models (CD models): bi-tone models used to model tones at the beginning and end of an utterance, and tri-tone models used to model tones in the middle of an utterance. In the recognition period, since neighboring tones are not known in advance, tone recognition for all syllables in an input utterance must be realized through a dynamic search to find a legal path with the maximum probability across the whole candidate tone array, where a legal path means that if the path contains an allotone candidate, it must include the context tones of that allotone as its preceding and succeeding candidates.

³ In our later study (Zhang and Hirose, 2000), Tone 3 was found to be characterized by relatively low pitch values, and its recognition rate could be improved significantly based on an anchoring-based normalization method.

The number of CD tonal HMMs is 175: 4 × 5 for tones at the beginning of an utterance (the neutral tone does not appear at the beginning of an utterance), 5 × 5 × 5 for tones in the middle of an utterance, 5 × 5 for tones at the end of an utterance, and another 5 for isolated tones. Due to insufficient training data for the context-dependent case, the CD HMMs have tied transition matrices. They are first copied from the CI HMMs, and then re-estimated according to the training data with tri-tone labels. Table 7 lists the results of the CD experiments based on the acoustic features of either the full syllables or only the tone nuclei, together with the results of the previous CI experiments.
Although there exists an insufficient training
data problem, the slight increase in the recognition
Table 7
Absolute average tone recognition rates for the training and test data in the four experiments

Data set   CI full   CI nucleus   CD full   CD nucleus
Training   76.2      80.8         86.5      91.8
Test       75.3      81.5         76.2      83.1
rate for the test data from the CI models to the CD models coincides with results reported in other studies (Wang et al., 1997) (there, CD tonal HMMs brought a 1.8% improvement compared with CI HMMs). We should note that the tone recognition rate of the tone-nucleus CI approach (81.5%) still outperformed that of the conventional CD approach (76.2%).
6. Conclusions

In summary, we have described a new approach to dealing with F0 variations for recognizing lexical tones in Chinese continuous speech. The approach uses a tone-nucleus model to deal with F0 variations systematically. A robust algorithm for detecting tone nuclei from speech signals is proposed and tested. The algorithm contains three components: (1) the segmental K-means segmentation algorithm, which offers a robust method to collect candidates for tone-nucleus detection; (2) ANOVA, which provides statistical knowledge about the features of tone nuclei; and (3) a tone-nucleus detector based on a linear discriminant function. The HMM-based tone recognizer was tested on natural speech data of one female speaker in the HKU96 database. The performance shows that tone recognition can be substantially improved compared with the conventional methods. The performance improvement also confirms that the tone nucleus detection worked appropriately in the task. Since the tone nucleus has also been suggested to play a very important role in Chinese intonation structure, the developed tone nucleus detection algorithm is applicable to other applications.
However, the best tone recognition rate is only 83.1%, even though the task has been simplified to speaker-dependent read speech. On the one hand, this indicates that there exists high variability in the F0 contours of continuous speech. On the other hand, there is much room to improve our tone recognition methods. We consider the following points to further our study. First, the approach assumes that a tone nucleus should conform to the underlying pitch values. Phonetic studies have revealed that this assumption may not be fully correct: the pitch values of a tone nucleus can differ from the underlying target pitch values when the duration is very short, and a tone with a much varied tone nucleus can still be perceived as the purported tonality as long as the tone context is present (Xu, 1994). Hence, the proposal in this paper has difficulties in recognizing some highly variable tone nuclei. Second, the algorithms for tone nucleus detection and recognition can be further improved by incorporating more appropriate methods; a decision-tree based prediction algorithm may be a promising choice. Third, the tonal acoustic models may be further improved by optimizing the acoustic features used and the HMM topology for each tone. Fourth, this study experimented only on a speaker-dependent task. There are some problems in applying the proposed approach to a speaker-independent task. One problem is that the approach lacks a means of handling the F0 range variabilities of different speakers. Another problem is that the tone nucleus detection rate may decrease significantly in a speaker-independent task compared with the speaker-dependent task, which in turn significantly affects the tone recognition. As future study directions, the methods of speaker F0 range normalization and soft tone nucleus detection, where an F0 segment is assigned a probability of being a tone nucleus rather than the yes/no decision in this study, may be studied to deal with these problems.
Acknowledgements

The authors acknowledge the helpful discussions with Dr. Jinfu Ni. The first author would also like to thank Dr. Yi Xu and Dr. Frank Soong for their advice on the earliest draft of this paper. The valuable comments of three anonymous reviewers, which considerably improved the quality of this paper, are appreciated as well.
References
Chao, Y.-R., 1968. A Grammar of Spoken Chinese. University
of California Press, Berkeley.
Chen, S.-H., Wang, Y.-R., 1995. Tone recognition of contin-
uous Mandarin speech based on neural networks. IEEE
Trans. SAP 3 (2), 146–150.
Chen, C.-J. et al., 2001. Recognize tone languages using pitch
information on the main vowel of each syllable. In: Proc. of
ICASSP.
Cheng, Y.-B., Wang, R.-H., 1990. Speech Signal Processing.
University of Science and Technology of China Press (in
Chinese).
Fujisaki, H., 1997. Prosody, models, and spontaneous speech.
In: Sagisaka, Y., Campbell, N., Higuchi, N. (Eds.), Com-
puting Prosody: Computational Models for Processing
Spontaneous Speech. Springer-Verlag, New York, pp. 27–
42.
Fujisaki, H., Hirose, K., 1984. Analysis of voice fundamental
frequency contours for declarative sentences of Japanese. J.
Acoust. Soc. Jpn. (E) 5 (4), 233–242.
Garding, E., Zhang, J.-L., 1997. Tempo effects in Chinese
prosodic patterns. In: ESCA workshop on Intonation:
Theory, Models and Applications, Athens Greece, Septem-
ber 1997, pp. 145–148.
Hirose, K., Hu, X.-H., 1995. HMM-based tone recognition of
Chinese trisyllables using double codebooks on fundamen-
tal frequency and waveform power. In: Proc. 4th European Conf. on Speech Communication and Technology, Vol. 1, Mopm2A.5, Madrid, pp. 31–34.
Hirose, K., Iwano, K., 1999. Detecting prosodic word bound-
aries using statistical models of moraic transition and its use
for continuous speech recognition. In: ICASSP99.
Howie, J.M., 1974. On the domain of tone in Mandarin.
Phonetica 30, 129–148.
Hsie, Ch.-T., Furuichi, Ch., Imai, S., 1988. Recognition of four
tones in continuous Chinese speech. J. IEICE, D J71-D (4),
661–668 (in Japanese).
Lee, L.-Sh., Tseng, Ch.-Y., Gu, H.-Y., Liu, F.-H., Chang,
Ch.-H., Lin, Y.-H., Lee, Y.-M., Tu, Sh.-L., Hsieh, Sh.-H.,
Chen, Ch.-H., 1993. Golden Mandarin(I)––A real-time
Mandarin speech dictation machine for Chinese language
with very large vocabulary. IEEE Trans. SAP 1 (2), 158–
179.
Lin, M.-C., 1995. A perceptual study on the domain of tones in
Beijing Mandarin. China Acta Acustica 20 (6), 437–
445.
Liu, J., He, X.-D., Mo, F.-Y., Yu, T.-Ch., 1999. Study on tone
classification of Chinese continuous speech in speech
recognition system. In: Proc. Eurospeech 99, Vol. 2,
Budapest, Hungary, pp. 891–894.
Ljolje, A., Fallside, F., 1987. Recognition of isolated prosodic
patterns using hidden Markov models. Comput. Speech
Lang. 2, 27–33.
Milton, J.S., Arnold, J.C., 1992. Introduction to Probability
and Statistics. McGraw-Hill.
Niemann, H. et al., 1998. Using prosodic cues in spoken dialog systems. In: Kosarev, Y. (Ed.), Proc. of SPECOM'98 Workshop, St. Petersburg, pp. 17–28.
Rabiner, L., Juang, B.-H., 1993. Fundamentals of speech
recognition. Prentice-Hall PTR.
Rose, P.J., 1988. On the non-equivalence of Fundamental
frequency and pitch in tonal description. In: Bradley, D.,
Henderson, E.J.A., Mazaudon, M., (Eds.), Prosodic Anal-
ysis and Asian Linguistics: To honour R.K. Sprigg, Pacific
Linguistics, C-104, pp. 55–82.
Secrest, B.G., Doddington, G.R., 1983. An integrated pitch
tracking algorithm for speech systems. In: Proc. of ICASSP,
pp. 1352–1355.
Shih, Ch.-L., 1988. Tone and Intonation in Mandarin, Vol. 3.
Working Papers, Cornell Phonetics Laboratory, pp. 83–109.
Shih, Ch.-L., 1990. Mandarin third tone sandhi and prosodic
structure. In: Hoekstra, T., van der Hulst, H. (Eds.), Studies
in Chinese Phonology. Mouton de Gruyter Press, pp. 81–
123.
Stolcke, A. et al., 1999. Modeling the prosody of hidden events
for improved word recognition. In: Eurospeech 99, Buda-
pest, Hungary, September 1999.
Wang, Y.-R., Chen, S.-H., 1994. Tone recognition of contin-
uous Mandarin speech assisted with prosodic information.
J. Acoust. Soc. Am. Pt. 1 96 (5), 2637–2645.
Wang, Ch., Seneff, S., 1998. A study of tones and tempo in
continuous Mandarin digit strings and their application in
telephone quality speech recognition. In: Proc. of ICSLP98,
Sydney, Australia, December 1998.
Wang, Ch.-F., Fujisaki, H., Hirose, K., 1990. Chinese four tone
recognition based on the model for process of generating F0
contours of sentences. In: Proc. of ICSLP90, pp. 221–224.
Wang, H.M., Ho, T.-H., Yang, R.-Ch., Shen, J.-L., Bai, B.-R.,
Hong, J.-Ch., Chen, W.-P., Yu, T.-L., Lee, L.-sh., 1997.
Complete recognition of continuous Mandarin speech for
Chinese language with very large vocabulary using limited
training data. IEEE Trans. Speech Audio Process. 5 (2),
195–200.
Webb, A., 1999. Statistical Pattern Recognition. Arnold Press,
London.
Whalen, D.H., Xu, Y., 1992. Information for Mandarin tones
in the amplitude contour and in brief segments. Phonetica
49, 25–47.
Wightman, C.W., Ostendorf, M., 1994. Automatic labeling of
prosodic patterns. IEEE Trans. SAP 2 (4), 469–481.
Xu, Y., 1994. Production and perception of coarticulated tones.
JASA (4), 2240–2253.
Xu, Y., 1997a. What can tone studies tell us about intonation?
In: Proc. from the ESCA Workshop on Intonation: Theory,
Models and Applications, Athens Greece, pp. 337–340.
Xu, Y., 1997b. Contextual tonal variations in Mandarin. J.
Phonetics 25, 61–83.
Xu, Y., 1998. Consistency of tone-syllable alignment across
different syllable structures and speaking rates. Phonetica
55, 179–203.
Xu, Y., 1999. Effects of tone and focus on the formation and
alignment of F0 contours. J. Phonetics 27 (1), 55–105.
Xu, Y., Wang, Q.E., 2001. Pitch targets and their realization:
evidence from Mandarin Chinese. Speech Commun. 33,
319–337.
Yang, W.-J., Lee, J.-Ch., Chang, Y.-Ch., Wang, H.-Ch., 1988.
Hidden Markov model for Mandarin lexical tone recogni-
tion. IEEE Trans. ASSP 36 (7), 988–992.
Zhang, J.-S., Hirose, K., 1998. A robust tone recognition method of Chinese based on subsyllabic F0 contours. In: Proc. of ICSLP98, Sydney, pp. 703–706.
Zhang, J.-S., Hirose, K., 2000. Anchoring hypothesis and its
application to tone recognition of Chinese continu-
ous speech. In: ICASSP2000, Istanbul, Turkey, June
2000.
Zhang, J.-S., Kawanami, H., 1999. Modeling carryover and anticipation effects for Chinese tone recognition. In: Proc. of Eurospeech'99, Budapest, Hungary, September 1999, pp. 747–750.