COMPUTER ANIMATION AND VIRTUAL WORLDS
Comp. Anim. Virtual Worlds 2004; 15: 485–500
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cav.11
Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data
By Jiyong Ma*, Ronald Cole, Bryan Pellom, Wayne Ward and Barbara Wise
We present a technique for accurate automatic visible speech synthesis from textual input.
When provided with a speech waveform and the text of a spoken sentence, the system
produces accurate visible speech synchronized with the audio signal. To develop the system,
we collected motion capture data from a speaker’s face during production of a set of words
containing all diviseme sequences in English. The motion capture points from the speaker’s
face are retargeted to the vertices of the polygons of a 3D face model. When synthesizing a
new utterance, the system locates the required sequence of divisemes, shrinks or expands
each diviseme based on the desired phoneme segment durations in the target utterance, then
moves the polygons in the regions of the lips and lower face to correspond to the spatial
coordinates of the motion capture data. The motion mapping is realized by a key-shape
mapping function learned by a set of viseme examples in the source and target faces. A
well-posed numerical algorithm estimates the shape blending coefficients. Time warping and
motion vector blending at the juncture of two divisemes and the algorithm to search the
optimal concatenated visible speech are also developed to provide the final concatenative
motion sequence. Copyright © 2004 John Wiley & Sons, Ltd.
Received: 9 April 2003; Revised: 2 December 2003
KEY WORDS: visible speech; visual speech synthesis; animated speech; coarticulation modelling; speech animation; face animation
Introduction
Animating accurate visible speech is one of the most
important research areas in face animation because of its
many practical applications ranging from language
training for the hearing impaired, to films and game
productions, animated agents for human–computer
interaction, virtual avatars, model-based image coding
in MPEG4 and electronic commerce.
Motion capture technologies have been successfully
used in character body animation. By recording motions
directly from real actors and mapping them to character
models, realistic motions can be generated efficiently.
Although techniques based on motion capture for 3D
face animation have been proposed in the last decade,
accurate visible speech synthesis for any text is still a
challenge owing to complex facial muscles in lip regions
and significant coarticulation effects in visible speech.
Most techniques proposed in previous work require a
generic 3D face model to be adapted to the subject’s face
with captured facial motions. These techniques cannot be
applied when the subject's face differs significantly
from the 3D face model. The emphasis of our
work is on the issues relating to automatic visible speech
synthesis for 3D face models with different meshes.
The work described in this paper explores a new
approach to accurate visible speech synthesis by record-
ing facial movements from real actors and mapping
*Correspondence to: Jiyong Ma, Center for Spoken Language Research, University of Colorado at Boulder, Campus Box 594, Boulder, Colorado 80309-0594, USA. E-mail: [email protected]
Contract/grant sponsor: NSF CARE; contract/grant number: EIA-9996075. Contract/grant sponsor: NSF/IIR; contract/grant numbers: IIS-0086107; REC-0115419. Contract/grant sponsor: NSF/IERI; contract/grant numbers: EIA-0121201; 1R01HD-44276.01.
them to 3D face models. In the proposed approach,
several tasks are implemented. These include: motion
capture, motion mapping, and motion concatenation.
In motion capture, a set of colour 3D markers is glued
onto a human face. The subject then produces a set of
words that cover important lip transition motions from
one viseme to another. Sixteen visemes are used in our
experiment. The motion capture system we designed
consists of two mirrors and a camcorder. The camcorder
records the video and audio synchronously. The audio
signal is used to segment video clips; thus the motion
image sequence for each diviseme is segmented. Com-
puter vision techniques such as camera calibration, 2D
facial marker tracking, and head pose estimation algo-
rithms are also implemented. The head pose is applied
to eliminate the influence of head motions on the facial
markers’ movement so that the reconstructed 3D facial
marker positions are invariant to the head pose.
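This pose compensation step can be sketched as follows. Assuming the per-frame head pose (R, t) maps head-fixed coordinates into camera coordinates, the rigid head motion is removed by applying the inverse transform. This is a minimal numpy sketch; the function name and conventions are illustrative, not taken from the paper.

```python
import numpy as np

def remove_head_motion(markers_cam, R, t):
    """Map camera-frame marker positions back into a head-fixed frame.

    markers_cam : (N, 3) array of 3D marker positions in camera coordinates
    R           : (3, 3) head rotation for this frame (head -> camera)
    t           : (3,)   head translation for this frame

    Inverting p_cam = R @ p_head + t gives p_head = R.T @ (p_cam - t),
    so the returned positions are invariant to the head pose.
    """
    # For row vectors v, v @ R equals R.T applied to v as a column vector
    return (markers_cam - t) @ R
```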
Motion mapping is required because the source face
is different from the target face. A mapping function is
learned from a set of training examples of visemes
selected from the source face and designed for the target
face. Visemes for the source face are subjectively sele-
cted from the recorded images, while the visemes for the
target 3D face are manually designed according to their
appearances in the source face. They should visually
resemble those for the source face. For instance, a
viseme that models the /aa/ sound for the source face
should be very similar visually to the same viseme for
the target 3D face.
After the motions are mapped from the source face to
the target face, a motion concatenation technique is
applied to synthesize natural visible speech. Although
the concatenation approach has already been used in several
image-based approaches1–5 to synthesize visible speech,
the concatenated objects are 2D images or image fea-
tures such as optical flows, etc. The concatenated objects
discussed in this paper are 3D trajectories of lip motions.
The overall approach proposed in this paper can be
applied to any 3D face model, either photo-realistic or
cartoon-like. In addition, to get relevant phonetic and
timing information of input text, we have integrated the
Festival speech synthesis system into our animation
engine; the system converts text to speech. At the
same time, our system uses the SONIC speech recogni-
tion engine developed by Pellom and Hacioglu6 to
force-align and segment pre-recorded speech (i.e., to
provide timing between the input speech and associated
text and/or phoneme sequence). The speech synthesizer
and forced-alignment system allow us to perform ex-
periments with various input text and speech wave files.
In the remainder of this article, the next section
presents a review of previous work. The third section
describes the system architecture of our work. The
fourth section demonstrates the approach to accurate
visible speech. The fifth section discusses modelling
coarticulation in 3D tongue movement; the final section
provides a summary.
Related Work
Spoken language is bimodal in nature: auditory and
visual. Between them, visual speech can complement
auditory speech understanding in noisy conditions. For
instance, most hearing-impaired people and foreign
language learners heavily rely on visual cues to enhance
speech understanding. In addition, facial expressions and
lip motions are also essential to sign language understanding;
without facial information, comprehension of sign
language becomes very poor. Therefore, creating a
3D character that can automatically produce accurate
visual speech synchronized with auditory speech will
be beneficial to language understanding7 when
direct face-to-face communication is impossible.
Research in the past three decades has shown that
visual cues in spoken language can augment auditory
speech understanding, especially in a noisy environment.
However, automatically producing accurate visible
speech and realistic facial expressions for 3D computer
characters seems to be a non-trivial task. The reasons
include: 3D lip motions are not easy to control and the
coarticulation in visible speech is difficult to model.
Since F. Parke8 created the first 3D face animation
using the key-frame technique, researchers have de-
voted considerable effort to creating convincing 3D
face animation. The approaches include parametric-
based,8 physics-based,9 image-based,1–5 performance-
driven approach,10 and multi-target morphing.11
Although these approaches have enriched 3D face ani-
mation theory and practice, creating convincing visible
speech is still a time-consuming task. To create even a
short sequence of 3D facial animation for film, a skilled
animator may spend several hours repeatedly modifying
animation parameters to obtain the desired effect.
Although some 3D design authoring
tools such as 3Ds MAX or MAYA are available for
animators, they cannot automatically generate accurate
visible speech; they require repeated, tedious adjustment
and testing to arrive at good animation parameters.
Therefore, research that leads to automatic
production of natural visible speech and facial expres-
sions will have great theoretical and practical value.
The approaches most closely related to our approach
to visible speech animation and facial animation are the
image-based approach and the performance-driven ap-
proach. A third approach, the physics-based approach,
has the following disadvantages. In the physics-based
approach, a muscle is usually connected to a group of
vertices. This requires animators to manually define
which vertex is associated with which muscle and to
manually put muscles under the skin surface. Muscle
parameters are manually modified by trial and error.
These tasks are tedious and time-consuming. It seems
that no unique parameterization approach has proven
to be sufficient to create face expressions and viseme
targets with simple and intuitive controls. In addition, it
is difficult to map muscle parameters estimated from the
motion capture data to a 3D face model. To simplify the
physics-based approach, Magnenat Thalmann et al.12
proposed the concept of an abstract muscle procedure.
One challenging problem in physics-based approaches is
how to automatically obtain muscle parameters. Inverse
dynamics approaches that use advanced measurement
equipment may provide a scientific solution to the pro-
blem of obtaining facial muscle parameters.
The image-based approach aims at learning face
models from a set of 2D images instead of directly
modelling 3D face models. One typical image-based
animation system called Video Rewrite was proposed
by Bregler et al.1 In this approach, a set of triphone
segments is used to model the coarticulation in visible
speech. For speech animation, the phonetic information
in the audio signal provides cues to locate its corre-
sponding video clip. In the approach, the visible speech
is constructed by concatenating the appropriate visual
triphone sequences from a database. Cosatto and Graf4
and Graf et al.5 also proposed an approach analogous to
speech synthesis, in which the visible speech synthesis
is performed by searching a best path in the triphone
database using the Viterbi algorithm. However, experi-
mental results show that when the lip space is not
populated densely the animations produced may be
jerky. Recently, Ezzat and co-workers2,3 adopted ma-
chine learning and computer vision techniques to
synthesize visible speech from recorded video. In that
system, a visual speech model is learned from the video
data that is capable of synthesizing the human subject’s
lip motion not recorded in the original speech. The
system can produce intelligible visible speech. The
approach has two limitations: (1) the face model is not
3D; (2) the face appearance cannot be changed.
In a performance-driven approach, a motion capture
system is employed to record motions of a subject’s face.
The captured data from the subject are retargeted to a
3D face model. The captured data may be 2D or 3D
positions of feature points on the subject’s face. Most
previous research on performance-driven facial anima-
tion requires the face shape of the subject to closely
resemble the target 3D face model. When the target 3D
face model is sufficiently different from that of the
captured face, face adaptation is required to retarget
the motions. In order to map motions, global and local
face parameter adaptation can be applied. Before mo-
tion mapping, the correspondences between key ver-
tices in the 3D face model and the subject’s face are
manually labelled.13 Moreover, local adaptation is re-
quired for the eye, nose, and mouth zones. However,
this approach is not sufficient to describe complex facial
expressions and lip motions. Chuang et al.14 proposed
an approach to creating facial animation using motion
capture data and shape-blending interpolation. Here,
computer vision is utilized to track the facial features in
2D, while shape-blending interpolation is proposed to
retarget the source motion. Noh and Neumann13 pro-
posed an approach to transferring vertex motion from a
source face model to a target model. It is claimed that
with the aid of an automatic heuristic correspondence
search the approach requires a user to select fewer than
10 points in the model. In addition, Guenter et al.15
created a system for capturing both the 3D geometry
and colour shading information for human facial ex-
pression. Kshirsagar et al.16 utilized motion capture
techniques to obtain facial description parameters and
facial animation parameters defined in the MPEG4 face
animation standard. Recently, Bregler et al.17 developed
a technique to track the motion from animated cartoons
and retarget it on 3D models.
System Architecture of our Work
The research mentioned above was conducted either for
facial expression animation or for a specific 3D/2D
model. No detailed report discusses how to concatenate
3D facial points for visible speech synthesis based on
motion concatenation. In this paper, we propose a novel
approach for visible speech synthesis based on motion
capture techniques and concatenating lip motions. In
the system, motion capture is used to get the trajectories
of the 3D facial feature points on a subject’s face while
the subject is speaking. Then, the trajectories of the 3D
facial feature points are mapped to make the target 3D
face imitate the lip motion.
Our system differs from the prior work using image-
based methods. Instead, we capture motions of 3D facial
feature points, map them onto a 3D face model, and
concatenate motions to get natural visible speech. Com-
pared with image-based methods, an advantage of the
proposed approach is that the motion mapping is
applicable to any 2D/3D character model. The second
novelty of the proposed approach is that we use a
concatenation approach to synthesize accurate visible
speech. Figure 1 illustrates the proposed system archi-
tecture for accurate visible speech synthesis.
The corpus consists of a set of primitive motion
trajectories of 3D facial markers reconstructed by a
motion capture system. The concept of a viseme and a
diviseme is presented in the next section. A set of viseme
images in the source face is subjectively selected, and
their corresponding 3D facial marker positions consti-
tute the viseme models in the source face. The viseme
models in the target 3D face are designed manually to
enable each viseme model in the target face to resemble
that in the source face. The mapping functions are
learned by the viseme examples in the source and target
faces. For each diviseme, its motion trajectory is com-
puted with the motion capture data and the viseme
models for the target face. When text is input to the
system, a phonetic transcription of the words is gener-
ated by a speech synthesizer, which also produces a
speech waveform corresponding to the text. If the text
is spoken by a human voice, a speech recognition
system is used in forced-alignment mode to provide
the time-aligned phonetic transcription. Time warping
is then applied to the diviseme motion trajectories so
that their time information conforms to the time require-
ments of the generated phonetic information. The Vi-
terbi algorithm is applied to find the best concatenation
path in the space of the diviseme instances. The output
is the optimal visible speech synchronized with the
auditory speech signal.
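The best-path search over diviseme instances can be sketched as a small dynamic program. This is an illustrative reconstruction, not the paper's implementation: the join cost here is simply the Euclidean distance between the last marker frame of one candidate instance and the first frame of the next, and the cheapest sequence of instances is recovered Viterbi-style by backtracking.

```python
import numpy as np

def best_concatenation_path(instances):
    """Pick one recorded instance per diviseme so consecutive
    instances join smoothly.

    instances : list over divisemes; instances[k] is a list of
                (T_k, M, 3) candidate motion trajectories.
    Returns the list of chosen instance indices.
    """
    K = len(instances)
    # cost[k][j]: best total join cost ending with instance j of diviseme k
    cost = [np.zeros(len(instances[0]))]
    back = []
    for k in range(1, K):
        prev_ends = np.array([inst[-1].ravel() for inst in instances[k - 1]])
        starts = np.array([inst[0].ravel() for inst in instances[k]])
        # pairwise join costs between every previous end and every start
        join = np.linalg.norm(prev_ends[:, None, :] - starts[None, :, :], axis=2)
        total = cost[-1][:, None] + join
        back.append(np.argmin(total, axis=0))
        cost.append(np.min(total, axis=0))
    # backtrack the cheapest path
    path = [int(np.argmin(cost[-1]))]
    for k in range(K - 2, -1, -1):
        path.append(int(back[k][path[-1]]))
    return path[::-1]
```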
Accurate Visible Speech Synthesis
Visible Speech
Visible speech refers to the movements of the lips,
tongue, and lower face during speech production by
humans. In terms of acoustic similarity, a
phoneme is the smallest identifiable
unit in speech, while a viseme is a particular configura-
tion of the lips, tongue, and lower face for a group of
phonemes with similar visual outcomes. A viseme is an
identifiable unit in visible speech. In English, there
are many phonemes with visual ambiguity. For exam-
ple, phonemes /p/,/b/,/m/ appear visually the same.
These phonemes are grouped into the same viseme class.
Figure 1. System overview.
Each viseme is usually identified with a within-class
recognition rate of 70–75%.18,19 Phonemes /p/, /b/, /m/
and /th/, /dh/ are universally recognized visemes; the
remaining phonemes are not universally recognized
across languages owing to variations of lip shape in
different individuals. From a statistical point of view, a
viseme is a random vector, because a viseme observed at
different times or under different phonetic contexts may
vary in appearance.
The basic underlying assumption of our visible
speech synthesis approach is that the complete set of
mouth shapes associated with human speech can be
reasonably approximated by a linear combination of a
set of visemes. In practice, this assumption has
proven acceptable in most authoring tools for 3D
face animation. These systems use shape blending, a
special example of linear combination, to synthesize
visible speech. In this work, we chose 16 visemes from
images of the human subject. Each viseme image was
chosen at the point at which the mouth shape was
judged to be at its extreme shape. Phonemes that look
alike visually fall into the same viseme category. This
was done in a subjective manner, by comparing the
viseme images visually to assess their similarity. The 3D
feature points for each viseme are reconstructed by
our motion capture system. When synthesizing visible
speech from text, we map each phoneme to a viseme to
produce the visible speech. This ensures a unique
viseme target is associated with each phoneme. We
recorded sequences of nonsense words that contain all
possible motion transitions from one viseme to another.
After the whole corpus was recorded and digitized, the
3D facial feature points were reconstructed. Moreover,
the motion trajectory of each recorded diviseme was
stored as an instance of that diviseme. It should be noted that
diphthongs need to be specially treated. Since a
diphthong, such as /ay/ in ‘pie’, consists of two vowels
with a transition between them, i.e., /aa/, /iy/, the
diphthong transition is visually simulated by a diviseme
corresponding to the two vowels.
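Converting a phoneme string into the diviseme sequence to fetch from the corpus, with diphthongs expanded into their component vowels, can be sketched as below. The function and table names are illustrative; only the /ay/ to /aa/, /iy/ decomposition is taken from the text.

```python
# Hypothetical diphthong decomposition table; the paper names
# /ay/ -> /aa/, /iy/ as one example.
DIPHTHONGS = {"ay": ("aa", "iy")}

def to_divisemes(phonemes):
    """Expand diphthongs into their two component vowels, then pair
    adjacent units: each pair is one diviseme to fetch from the corpus."""
    units = []
    for p in phonemes:
        units.extend(DIPHTHONGS.get(p, (p,)))
    return list(zip(units, units[1:]))
```

For the word 'cool' (/k uw l/) this yields the two divisemes /k-uw/ and /uw-l/ named later in the text.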
The mapping from phonemes to visemes is many-to-one:
phonemes such as /p/, /b/ and /m/ are visually
identical and differ only in sound. It is also
important to note that the mapping from
visemes to phonemes is also one-to-many. One pho-
neme may have different mouth shapes due to the
coarticulation effect. In visible speech synthesis, a key
challenge is to model coarticulation. Coarticulation20
relates to the observation that a speech segment is
influenced by its neighbouring speech segments during
speech production. The coarticulation effect from a
phoneme’s adjacent two phonemes is referred to as
the primary coarticulation effect of the phoneme. The
coarticulation effect from a phoneme’s two second
nearest neighbour phonemes is called the secondary
coarticulation effect. Coarticulation enables people to
pronounce speech in a smooth, rapid, and relatively
effortless manner. Coarticulation modeling is a very
important research topic for natural visible speech ani-
mation and speech production. Some methods for mod-
elling coarticulation in visible speech have been
proposed to increase the realism of animated speech.
When considering the contribution of a phoneme to
visible speech perception,21 there are three phoneme
classes. They are invisible phonemes, protected pho-
nemes, and ‘normal’ phonemes. An invisible phoneme
is a phoneme in which the corresponding mouth shape
is dominated by its following vowel, such as the first
segment in ‘car’, ‘golf’, ‘two’, ‘tea’. The invisible pho-
nemes include the phonemes /t/, /d/, /g/, /h/ and /k/.
In our implementation, lip shapes of all invisible pho-
nemes are directly modelled by motion capture data;
therefore, this type of primary coarticulation from the
adjacent two phonemes is very well modelled. The pro-
tected phonemes are those whose mouth shape must be
preserved in visible speech synthesis to ensure accurate
lip motion. These phonemes include /m/, /b/ and /p/
and /f/, /v/, as in ‘man’, ‘ban’, ‘pan’, ‘fan’ and ‘van’.
Pelachaud et al.22 tackle coarticulation using a look-
ahead model23 that considers articulatory adjustment on
a sequence of consonants followed by or preceded by a
vowel. On the other hand, Cohen and Massaro24 ex-
tended the Lofqvist25 gestural production model for
visible speech animation. In their model, dominance
functions are used to model the effects of coarticulation.
For each segment of speech, there is a dominance
function to control its effect on coarticulation. The
dominance function is defined as a negative exponent
function that contains five parameters to control the
dominance function.
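A commonly cited simplified form of such a dominance function is a negative exponential in the time offset from the segment centre, with separate rise and fall rates. The sketch below is illustrative: the full model has additional parameters, and the parameter names and default values here are assumptions, not values from the paper.

```python
import math

def dominance(tau, alpha=1.0, theta_rise=0.1, theta_fall=0.15, c=1.0):
    """Negative-exponential dominance of a speech segment at time
    offset tau (e.g. ms) from the segment centre.

    alpha controls the magnitude, theta_rise/theta_fall the decay
    rates before and after the centre, and c the shape of the decay.
    """
    theta = theta_rise if tau <= 0 else theta_fall
    return alpha * math.exp(-theta * abs(tau) ** c)
```

At any time instant, the blended lip target is then the dominance-weighted average of the overlapping segments' targets.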
The approach we propose in this paper is to directly
concatenate motions of 3D facial feature points for
English diphones/divisemes. For instance, as shown
in Figure 2 for the word ‘cool’, its phonetic transcription
is /kuwl/. The divisemes in the word are: /k-uw/,
/uw-l/. In order to synthesize the visible speech of the
word, the synthesis system will concatenate the two
motion sequences in motion capture data. In Figure 2,
the top picture depicts three visemes in the word ‘cool’,
while the bottom picture depicts the actual three key
frames of lip shapes mapping from the source face in
one motion capture sequence.
Compared with Cohen and Massaro’s model, our
approach models the visual transition from one
phoneme to another directly from motion capture data,
which is encoded for diphones as parameterized trajec-
tories. Therefore, a phoneme’s primary coarticulation
effect can be well modelled. Finally, because the tongue
movement cannot be directly measured in our motion
capture system, a special method is developed to tackle
the coarticulation effect of the tongue.
Motion Capture
Most motion capture systems for face animation are
based on optical capture. These systems are similar to
the optical systems used to capture body motions.
Reflective dots need to be glued onto the human face
at key feature positions, such as eyebrows, the outer
contour of the lips, the cheeks and chin. Our motion
capture system consists of a camcorder, two mirrors
and 31 facial markers in green and blue, as shown in
Figure 3. The video format is NTSC with a frame rate of
29.97 frames per second. Each video clip was recorded
and synchronized with the audio signal.
A facial marker tracking system was designed to
automatically track the markers’ motions. A camera
calibration system was also designed. The 2D trajectories
from the two views are used to reconstruct the 3D
positions of the facial markers as shown in Figure 4(a).
Moreover, a head pose estimation algorithm was de-
signed to estimate head poses at different times.
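Reconstruction from the two mirror views can be illustrated by the classic midpoint method: each view contributes a viewing ray, and the 3D marker is taken as the midpoint of the shortest segment between the two rays. This is a simplified sketch assuming already-calibrated rays, not the paper's full calibration pipeline.

```python
import numpy as np

def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays p = o1 + s*d1
    and p = o2 + u*d2 (directions need not be unit length)."""
    o1 = np.asarray(o1, float)
    d1 = np.asarray(d1, float)
    o2 = np.asarray(o2, float)
    d2 = np.asarray(d2, float)
    # Normal equations for minimising |(o1 + s d1) - (o2 + u d2)|^2
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([(o2 - o1) @ d1, (o2 - o1) @ d2])
    s, u = np.linalg.solve(A, b)
    return 0.5 * ((o1 + s * d1) + (o2 + u * d2))
```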
Figure 2. Two divisemes k-uw and uw-l in the word ‘cool’
represent the transition motions from viseme /k/ to viseme
/uw/ and from viseme /uw/ to /l/. The top picture shows three
visemes in the word ‘cool’. The bottom picture shows three key
frames of lip shapes in one motion sequence from one instance
of motion capture data.
Figure 3. Image captured by the mirror and camcorder-based
motion capture system.
Figure 4. (a) Thirty-one 3D dots constructed by the motion capture system; (b) Gurney’s 3D face mesh.
A visual corpus of a subject speaking a set of nonsense
words was recorded. Each word was chosen so that it
visually instantiated motion transition from one viseme
to another in American English. A total of 16 visemes
were modelled. The mapping table from phonemes to
visemes is shown in Table 1. Theoretically speaking,
increasing the number of modelled visemes should lead
to more accurate synthetic visible speech. The motions
of a diviseme represent the motion transition from the
approximate midpoint of one viseme to the approxi-
mate midpoint of an adjacent viseme, as shown in
Figure 2. A speech recognition system operating in
forced-alignment mode was used to segment the speech
for each diviseme (i.e., the recognizer determines the
time boundaries between phonemes and thus the
diviseme timing information). Each
segmented video clip contained a sequence of images
spanning the duration of the two complete phonemes
corresponding to one diviseme. The diviseme text cor-
pus is listed in the Appendix. It is a subset of the text
corpus used for speech synthesis based on diphone
modelling.
Linear Viseme Space
One important problem is how to map the motions of a
set of 3D facial feature points reconstructed from a
motion capture system onto a 3D face model. As shown
in Figure 4, the reconstructed facial feature points are
sparse, while the vertices in the 3D mesh of a face model
are dense. Therefore, many vertices in the 3D face model
have no corresponding points in the set of the recon-
structed 3D facial feature points. However, the move-
ments of vertices in the 3D face model have some type of
special correlations due to the physical constraints of
facial muscles. In order to estimate the movement
correlation among the vertices in the 3D face model, a
set of viseme targets is manually designed for the 3D
face model to provide learning examples. The set of
viseme targets is used as training examples to learn a
mapping from the set of 3D facial feature points in the
source face to the set of vertices in the target 3D face
model. For instance, as shown in Figures 5 and 6, there
Table 1. Mapping from phonemes to visemes

1. /i:/ week; /I_x/ roses
2. /I/ visual; /&/ above
3. /9r/ read; /&r/ butter; /3r/ bird
4. /U/ book; /oU/ boat
5. /ei/ stable; /@/ bat; /^/ above; /E/ bet
6. /A/ father; />/ caught; /aU/ about; />i/ boy
7. /ai/ tiger
8. /T/ think; /D/ thy
9. /S/ she; /tS/ church; /dZ/ judge; /Z/ azure
10. /w/ wish; /u/ boot
11. /s/ sat; /z/ resign
12. /k/ can; /g/ gap; /h/ high; /N/ sing; /j/ yes
13. /d/ debt
14. /v/ vice; /f/ five
15. /l/ like; /n/ knee
16. /m/ map; /b/ bet; /p/ pat
17. /sil/ neutral expression
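In code, the many-to-one phoneme-to-viseme mapping is just a table lookup. The sketch below covers only a few of the viseme classes, using the phoneme symbols that appear elsewhere in the paper; the partial coverage and the names are illustrative.

```python
# Illustrative subset of the phoneme-to-viseme table
# (classes 8, 12, 16 and the silence class 17).
PHONEME_TO_VISEME = {
    "th": 8, "dh": 8,
    "k": 12, "g": 12, "h": 12, "ng": 12, "y": 12,
    "m": 16, "b": 16, "p": 16,
    "sil": 17,
}

def viseme_class(phoneme):
    """Look up the viseme class for a phoneme (many-to-one)."""
    return PHONEME_TO_VISEME[phoneme]
```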
Figure 5. Sixteen visemes based on captured images. The phonetic symbols are defined in the following words: /iy/ week; /ih/
visual; /uw/ spoon; /r/ read; /uh/ book; /ey/ stable; /aa/ father; /ay/ tiger; /th/ think; /sh/ she.
are 16 viseme targets for the source face and target face
respectively. Each mouth shape in the source face as
shown in Figure 5 should be mapped to its correspond-
ing mouth shape in the target face shown in Figure 6.
This problem can be seen as scattered data
interpolation.26 In this paper, an approach based on the
inverse problem of shape blending is proposed to solve
this scattered data interpolation problem.
Linear combinations of a set of image or graph prototypes at different poses or views can efficiently approximate complex objects. This approach has been widely used in the computer vision and computer animation communities.16,27–31 Viseme blending is a special case of such linear shape combination, and it has proven to be a very efficient way to synthesize visible speech in most 3D animation authoring tools, such as 3ds Max and Maya. While this approach is widely used in animation authoring tools, the shape-blending coefficients, or animation curves, must be defined manually by animators, and this process can be quite time-consuming: for a given input text and speech, the animated visible speech is made accurate by repeatedly adjusting the shape-blending coefficients. The question is how to automatically obtain the linear coefficients of a set of visemes that approximate the mouth shape at each point of a lip motion trajectory. The model of a linear combination of a set of prototypes is much like a linearly deformable shape model capable of approximating a wide range of shape variations.
Let $G_i$ ($i = 0, 1, 2, \ldots, V-1$) be $S_i$ or $T_i$, where $S_i$ and $T_i$ represent viseme targets for the source face and the target face respectively. $\{G_i\}$ spans a linear subspace $\{G \mid G = \sum_{i=0}^{V-1} w_i G_i\}$: for the source face, $S = \sum_{i=0}^{V-1} w_i S_i$; for the target face, $T = \sum_{i=0}^{V-1} w_i T_i$, where $\{w_i\}$ are the linear combination, or shape-blending, coefficients. We wish to find a mapping function $f(S)$ that maps $S_i$ to $T_i$, i.e. $f(S_i) = T_i$, such that any observation vector $S$ provided by the motion capture system is mapped to a $T$ in the target face that is visually similar to $S$. Once the coefficients are estimated from the observation data in the source face, the observed vector $S$ is mapped to $T$. The simplest form of mapping function is linear in $S$. In this case, if $S = \sum_{i=0}^{V-1} w_i S_i$ then $f(S) = T$, because the following holds when $f$ is linear:

$$f(S) = f\!\left(\sum_{i=0}^{V-1} w_i S_i\right) = \sum_{i=0}^{V-1} w_i f(S_i) = \sum_{i=0}^{V-1} w_i T_i = T$$
Suppose that there are $N$ frames of observation vectors $S(t)$, $t = 1, 2, \ldots, N$ in one observed motion capture sequence, and that the shape-blending coefficients corresponding to the $t$th frame are $w_i(t)$, $i = 0, 1, \ldots, V-1$. Robust shape-blending coefficients can be estimated by minimizing the following fitting error:

$$\min_{w} \sum_{t=1}^{N} \left( \left\| S(t) - \sum_{i=0}^{V-1} w_i(t) S_i \right\|^2 + \lambda \sum_{i=0}^{V-1} w_i^2(t) + \gamma \sum_{i=0}^{V-1} \left( w_i(t+1) - 2 w_i(t) + w_i(t-1) \right)^2 \right)$$

with constraints $l_i \le w_i(t) \le h_i$, $\sum_{i=0}^{V-1} w_i(t) = 1$, $w_i(0) = 2 w_i(1) - w_i(2)$, and $w_i(N+1) = 2 w_i(N) - w_i(N-1)$, where $w = \{w(t)\}_{t=1}^{N}$ and $w(t) = \{w_i(t)\}_{i=0}^{V-1}$; $l_i = -\varepsilon_i$ and $h_i = 1 + \delta_i$, with $\varepsilon_i \ge 0$ and $\delta_i \ge 0$ small positive parameters chosen so that more robust and more accurate shape-blending coefficients can be estimated by solving the above optimization problem; $\lambda$ is a positive regularization parameter that controls the amplitude of the shape-blending coefficients, and $\gamma$ is a positive regularization parameter that controls the smoothness of the trajectory of the shape-blending coefficients. The constraint that the shape-blending coefficients sum to one minimizes expansion or shrinkage of the polygon meshes when the mapping function is applied.

Figure 6. Sixteen visemes designed for Gurney's model.
The above optimization problem involves convex
quadratic programming in which the objective function
is a convex quadratic function and the constraints are
linear. This problem can be solved by the primal–dual
interior-point algorithm.32
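As a concrete illustration, the per-frame core of this estimation can be sketched with an off-the-shelf solver. The sketch below is ours, not the paper's implementation: it uses SciPy's SLSQP routine rather than a primal–dual interior-point method, drops the temporal smoothness term for brevity, and all variable names and toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def fit_blend_weights(S_obs, S_vis, lam=1e-3):
    """Estimate shape-blending weights w for one observed frame.

    S_obs : (D,) observed marker vector S(t).
    S_vis : (V, D) source-face viseme targets S_i (rows).
    Minimizes ||S_obs - sum_i w_i S_i||^2 + lam * sum_i w_i^2
    subject to sum_i w_i = 1 and 0 <= w_i <= 1.
    """
    V = S_vis.shape[0]

    def objective(w):
        resid = S_obs - w @ S_vis
        return resid @ resid + lam * (w @ w)

    cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    res = minimize(objective, np.full(V, 1.0 / V),
                   bounds=[(0.0, 1.0)] * V, constraints=cons, method="SLSQP")
    return res.x

# Toy example: three viseme targets in a 2D "marker" space; the observation
# lies halfway between the first two targets.
S_vis = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
S_obs = np.array([0.5, 0.0])
w = fit_blend_weights(S_obs, S_vis)
```

In the full formulation, all $N$ frames are solved jointly so that the $\gamma$ term can couple consecutive frames; a general-purpose QP solver handles that case directly.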
In order to reduce the computational load in determining the mapping function, PCA (principal component analysis)33 can be applied. PCA is a statistical model that decomposes high-dimensional data into a set of orthogonal vectors; in this way, a compact representation of high-dimensional data using lower-dimensional parameters can be estimated. Denote $B = (\Delta T_1, \Delta T_2, \ldots, \Delta T_{V-1})$, $\Sigma = B B^t$, $\Delta T_i = T_i - T_0$, and $\Delta T = T - T_0$, where $T_0$ denotes the neutral expression target. The eigenvectors of $\Sigma$ are $E = (\epsilon_0, \epsilon_1, \ldots, \epsilon_{3U-1})$ with $\|\epsilon_i\| = 1$, where $U$ is the total number of vertices in the 3D face model. The projection of $T$ using $M$ main components to approximate it is

$$\Delta T \approx \sum_{j=0}^{M-1} \beta_j \epsilon_j$$

where the linear combination coefficients are $\beta_j = \epsilon_j^t \Delta T$. Usually $M$ is less than $V$ after discarding the last principal components. Each viseme target $\Delta T_i$ is decomposed as the following linear combination by PCA:

$$\Delta T_i \approx \sum_{j=0}^{M-1} \beta_{ij} \epsilon_j$$

where $\beta_{ij} = \epsilon_j^t \Delta T_i$. Note that the coordinates of $\Delta T_i$ under the orthogonal basis $\{\epsilon_j\}_{j=0}^{M-1}$ are $(\beta_{i0}, \beta_{i1}, \ldots, \beta_{i,M-1})^t$. From the two equations above, we have

$$\Delta T = \sum_{i=0}^{V-1} w_i \Delta T_i \approx \sum_{j=0}^{M-1} \sum_{i=0}^{V-1} w_i \beta_{ij} \epsilon_j \approx \sum_{j=0}^{M-1} \bar{\beta}_j \epsilon_j, \qquad \bar{\beta}_j = \sum_{i=0}^{V-1} w_i \beta_{ij} \quad (1)$$
After the shape blending coefficients are estimated,
the mapping function is obtained. Thus, the motions of
3D trajectories of facial markers are mapped onto the
motions of vertices in the 3D face model.
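The compact representation of equation (1) can be sketched as follows. The data here are random stand-ins for real viseme displacement targets, and the dimensions are hypothetical; the point is that the blended coefficients $\bar{\beta}_j$ live entirely in the $M$-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: V = 5 viseme displacement targets over 3U = 12 coordinates.
V, D = 5, 12
dT = rng.normal(size=(V, D))            # rows are Delta T_i = T_i - T_0

# Orthonormal eigenvectors of Sigma = B B^t, obtained via SVD of B = dT^t.
U_full, s, _ = np.linalg.svd(dT.T, full_matrices=False)
M = 3
E = U_full[:, :M]                       # basis (epsilon_0, ..., epsilon_{M-1})

beta = dT @ E                           # beta_ij = epsilon_j^t Delta T_i

# For blend weights w, the compact coefficients of Delta T = sum_i w_i Delta T_i
# follow equation (1): beta_bar_j = sum_i w_i beta_ij, computed without ever
# touching the full vertex dimension at animation time.
w = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
beta_bar = w @ beta
```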
Time Warping
To blend motions at the juncture of two divisemes, the
time-scale of the original motion capture data must be
warped onto the time-scale of the target speech used to
drive the animation. Suppose that the duration of a phoneme in the target speech stream ranges over $[\tau_0, \tau_1]$, and the time interval of its corresponding diviseme in the motion capture data ranges over $[t_0, t_1]$. Furthermore, the motion trajectory vector is denoted as $m(t)$, whose elements are the $\bar{\beta}_j$ defined in equation (1). The time interval should be transformed into $[\tau_0, \tau_1]$ so that the motion trajectory defined on $[t_0, t_1]$ is embedded within $[\tau_0, \tau_1]$. Therefore we define the time-warping function as

$$t(\tau) = t_0 + \frac{\tau - \tau_0}{\tau_1 - \tau_0} \, (t_1 - t_0) \quad (2)$$

In this way, the motion vector $m(t)$ is transformed into the final time-warped motion vector $n(\tau)$ as follows:

$$n(\tau) = m(t(\tau)) \quad (3)$$
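Equations (2) and (3) amount to a linear rescaling of the time axis; a minimal sketch (function and variable names are our own):

```python
def warp_time(tau, tau0, tau1, t0, t1):
    """Equation (2): map a target-speech time tau in [tau0, tau1]
    onto the motion-capture time axis [t0, t1]."""
    return t0 + (tau - tau0) / (tau1 - tau0) * (t1 - t0)

def warped_motion(m, tau, tau0, tau1, t0, t1):
    """Equation (3): time-warped motion vector n(tau) = m(t(tau)),
    where m is the motion trajectory sampled on the capture time axis."""
    return m(warp_time(tau, tau0, tau1, t0, t1))
```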
Motion Vector Blending

The goal in blending the juncture of two adjacent divisemes in a target utterance is to concatenate the two divisemes smoothly. Let two divisemes be denoted by $V_i = (p_{i,0}, p_{i,1})$ and $V_{i+1} = (p_{i+1,0}, p_{i+1,1})$ respectively, where $p_{i,0}, p_{i,1}$ are the two visemes in $V_i$. The visemes $p_{i,1}$ and $p_{i+1,0}$ are different instances of the same viseme; they form the juncture of divisemes $V_i$ and $V_{i+1}$. Suppose that in a speech segment the durations of the two visemes $p_{i,1}, p_{i+1,0}$ should be embedded into the interval $[\tau_0, \tau_1]$. The time-warping functions discussed above are used to transform the time intervals of the two visemes into $[\tau_0, \tau_1]$; their transformed motion vectors are denoted by $n_{i,1}(\tau) = m_{i,1}(t(\tau))$ and $n_{i+1,0}(\tau) = m_{i+1,0}(t(\tau))$, so the time domains of the two time-warped motion vectors are now the same. The juncture of the two divisemes is derived by blending the two time-aligned motion vectors as follows:

$$h_i(\tau) = f_i(\tau)\, n_{i,1}(\tau) + \left(1 - f_i(\tau)\right) n_{i+1,0}(\tau) \quad (4)$$

where the blending functions $f_i(\tau)$ are chosen as parametric rational $G^n$ continuous blending functions:34

$$b_{n,\mu}(t) = \frac{\mu (1-t)^{n+1}}{\mu (1-t)^{n+1} + (1-\mu)\, t^{n+1}}, \qquad t \in [0,1],\ \mu \in (0,1),\ n \ge 0 \quad (5)$$

Other types of blending functions, such as polynomial blending functions, may also be used. For example, $p(t) = 1 - 3t^2 + 2t^3$ is a $C^1$ blending function, while $p(t) = 1 - (6t^5 - 15t^4 + 10t^3)$ is a $C^2$ blending function. The blending function acts like a low-pass filter to smoothly concatenate the two divisemes. The blending functions are defined as

$$f_i(\tau) = b_{n,\mu}\!\left(\frac{\tau - \tau_0}{\tau_1 - \tau_0}\right) \quad (6)$$
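A small sketch of equations (4)–(6), assuming scalar motion values for simplicity; the function names are illustrative. Note that $b_{n,\mu}(0) = 1$ and $b_{n,\mu}(1) = 0$, so the blend starts on the left diviseme and ends on the right one.

```python
def rational_blend(t, n=1, mu=0.5):
    """Equation (5): rational G^n blending function b_{n,mu}(t), t in [0, 1]."""
    num = mu * (1.0 - t) ** (n + 1)
    den = num + (1.0 - mu) * t ** (n + 1)
    return num / den

def blend_juncture(n_left, n_right, tau, tau0, tau1, n=1, mu=0.5):
    """Equations (4) and (6): blend two time-aligned motion trajectories
    across the juncture interval [tau0, tau1]."""
    f = rational_blend((tau - tau0) / (tau1 - tau0), n, mu)
    return f * n_left(tau) + (1.0 - f) * n_right(tau)
```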
Trajectory Synthesis as a Graph Search
We assume that there is a set of diviseme motion
sequences for each diviseme, i.e., there are multiple
instances of lip motions for each diviseme. In the
following, we will discuss how best to concatenate these
sequences for any input text.
Lip Motion Graph. As shown in Figure 7, the collection of diviseme motion sequences can be represented
as a directed graph. Each diviseme motion example is
denoted as a node in the graph. The edge represents the
transition from one diviseme to another. In this way, we
wish to search the optimal path in the graph that
constitutes the optimally concatenated visible speech.
To achieve this goal, we need to define the optimal
objective functions. The objective function is defined to
measure the degree of smoothness of the synthetic
visible speech. Theoretically, the objective function can be defined to minimize the roughness of the motion trajectory:

$$\min_{\text{path}} \int_{t_0}^{t_L} \left\| V^{(2)}(t) \right\|^2 dt \quad (7)$$

where $V(t)$ is the concatenated lip motion for the input text.
However, the above optimization problem (equation 7) is mathematically difficult to solve directly, so a simplification is needed. As in concatenative speech synthesis, we need to define two cost functions: a target
cost and a concatenation cost. The target cost is a
measure of distance between a given candidate’s fea-
tures and the desired target features. For example, if
observation data about lip motion is provided, the target
features can be lip height, lip width, lip protrusion and
speech features, etc. The target cost corresponds to the
node cost in the graph, while the concatenation cost
corresponds to the edge cost. The concatenation cost
represents the cost of the transition from one diviseme
to another. After the two costs have been defined, a
Viterbi algorithm can be used to compute the optimal
path. Because the basic unit used in this work is the
diviseme, the primary coarticulation effect can be mod-
elled very well. For an input text, its corresponding
phonetic information is known. When no observation
data of lip motion is provided for the target specifica-
tion, the target cost is defined to be zero. The other
reason we let the target cost be zero is that spectral
information extracted from the speech signal may not
provide sufficient information to determine a realistic
synthetic visible speech sequence. For example, the
acoustic features of the speech segments /s/ and /p/ in an utterance of the word 'spoon' are quite different from those of the phoneme /u/, whereas the lip shapes of /s/ and /p/ in this utterance are very similar to that of /u/.
The concatenation cost is defined as the degree of smoothness of the visual features at the juncture of the two divisemes. Consider a diviseme sequence $V_i = (p_{i,0}, p_{i,1})$, $i = 1, 2, \ldots, N$. The concatenation cost of the two units $V_i = (p_{i,0}, p_{i,1})$ and $V_{i+1} = (p_{i+1,0}, p_{i+1,1})$ is defined as

$$C_{V_i, V_{i+1}} = \int \left\| h_i^{(2)}(\tau) \right\|^2 d\tau, \qquad i = 1, 2, \ldots, N-1 \quad (8)$$

where $V_i$ is a diviseme lip motion instance, $V_i \in E_i$, and $E_i$ is the set of diviseme lip motion instances. The function $h_i(\tau)$ is defined in equation (4); the integral of the squared second derivative $h_i^{(2)}(\tau)$ measures the degree of smoothness at the juncture of the two divisemes. The total cost is

$$C = \sum_{i=1}^{N-1} C_{V_i, V_{i+1}} \quad (9)$$
Figure 7. Graph for concatenating visible speech units by the Viterbi search algorithm. Each state represents one diviseme motion sequence. The edge from one state to another represents one possible concatenation path. For the English word 'county' there are four divisemes, /kau/, /aun/, /nt/, /ti:/; each node in the same rectangle is an instance of the same diviseme.
The visible speech unit concatenation becomes the following optimization problem:

$$\min_{(V_1, V_2, \ldots, V_N)} C = \sum_{i=1}^{N-1} C_{V_i, V_{i+1}} \quad (10)$$

subject to the constraints $V_i \in E_i$.
Viterbi Search. The above problem can be solved by
searching the shortest path from the first diviseme to the
last diviseme, in which each node is a diviseme motion
instance. Cosatto and Graf4 proposed an approach to
visible speech synthesis based on Viterbi search; how-
ever, the cost definition proposed in this paper is
different from that proposed in Cosatto and Graf4 and
Graf et al.5
The distance between two nodes is the concatenation cost, and dynamic programming offers an efficient solution.35 Let $V_i \in E_i$ denote a node in stage $i$ and let $d(V_i)$ be the shortest distance from node $V_i \in E_i$ to the destination $V_N$, with $d(V_N) = 0$. We obtain

$$d(V_i) = \min_{V_{i+1} \in E_{i+1}} \left[ C_{V_i, V_{i+1}} + d(V_{i+1}) \right], \qquad i = N-1, N-2, \ldots, 1 \quad (11)$$

$$V_{i+1}^{*} = \arg\min_{V_{i+1} \in E_{i+1}} \left[ C_{V_i, V_{i+1}} + d(V_{i+1}) \right] \quad (12)$$

where $C_{V_i, V_{i+1}}$ denotes the concatenation cost from node $V_i$ to $V_{i+1}$. These are the recursive equations needed to solve the problem. The best concatenated diviseme sequence is $\{V_1^{*}, V_2^{*}, \ldots, V_N^{*}\}$.
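The backward recursion of equations (11) and (12) can be sketched as follows for a staged graph with precomputed concatenation costs; stage indices are 0-based here, and the cost matrices are hypothetical inputs.

```python
import numpy as np

def best_diviseme_path(costs):
    """Backward dynamic programming over the lip motion graph
    (equations 11 and 12), with stages indexed from 0.

    costs[i] : (len(E_i), len(E_{i+1})) array of concatenation costs
               between instances of adjacent divisemes.
    Returns the list of chosen instance indices, one per stage.
    """
    N = len(costs) + 1
    d = [None] * N                     # d[i][k]: shortest cost from instance k to the end
    arg = [None] * (N - 1)
    d[N - 1] = np.zeros(costs[-1].shape[1])
    for i in range(N - 2, -1, -1):
        total = costs[i] + d[i + 1]    # C_{V_i, V_{i+1}} + d(V_{i+1}), broadcast per column
        d[i] = total.min(axis=1)
        arg[i] = total.argmin(axis=1)
    path = [int(np.argmin(d[0]))]      # best starting instance
    for i in range(N - 1):
        path.append(int(arg[i][path[-1]]))
    return path

# Three divisemes with two recorded instances each (hypothetical costs).
costs = [np.array([[1.0, 5.0], [4.0, 2.0]]),
         np.array([[0.5, 9.0], [9.0, 0.5]])]
path = best_diviseme_path(costs)
```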
Smoothing Algorithm
The concatenated trajectory still needs to be smoothed to
achieve accurate visible speech. Ezzat et al.2 proposed an
approach to making the trajectory smooth for visible
speech synthesis. The smoothed trajectory is found by
minimizing the sum of errors between target values and
the values on the trajectory and a smoothness term
computed from the derivative of the trajectory. In the
following, we propose a trajectory-smoothing technique
based on the spline function. Denote the synthetic
trajectory of one component of a parameter vector as
$f(t)$. The trajectory is obtained by the concatenation approach discussed in the previous section. Suppose that its samples are denoted by $f_i = f(t_i)$, with $t_0 < t_1 < \cdots < t_L$. The objective is to find a smooth curve $g(t)$ that best fits the data, i.e. that minimizes the following objective function:

$$\sum_{i=0}^{L} \alpha_i (g_i - f_i)^2 + \int_{t_0}^{t_L} \left[ g^{(2)}(t) \right]^2 dt \quad (13)$$
where $\alpha_i$ is a weighting factor that controls how closely $g_i = g(t_i)$ follows each target $f_i$. The solution of the above problem36 is

$$g = \left( I + P^{-1} C^t A^{-1} C \right)^{-1} f \quad (14)$$

where $I$ is the identity matrix, $f = (f_0, f_1, \ldots, f_L)^t$, and

$$A = \frac{1}{6} \begin{pmatrix} 4 & 1 & & 0 \\ 1 & 4 & \ddots & \\ & \ddots & \ddots & 1 \\ 0 & & 1 & 4 \end{pmatrix}, \qquad C = \begin{pmatrix} 1 & -2 & 1 & & & 0 \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ 0 & & & 1 & -2 & 1 \end{pmatrix}, \qquad P = \mathrm{diag}(\alpha_0, \alpha_1, \ldots, \alpha_L)$$
Note that the control parameter $\alpha_i$ depends on the phonetic information: a large value implies that the smoothed curve tends to stay near the value $f(t_i)$ at time $t_i$, and vice versa. For labial or labiodental phonemes, such as /p/, /b/, /m/, /f/, /v/, the value $\alpha_i$ should be set large; otherwise the smoothed target value $g_i$ may fall far from the actual target value $f_i$, preventing the lips from closing completely.
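Equation (14) can be sketched directly with dense linear algebra; a practical implementation would exploit the banded structure of $A$ and $C$. The matrix sizes, weights, and sample trajectory below are illustrative only.

```python
import numpy as np

def smooth_trajectory(f, alpha):
    """Solve equation (14): g = (I + P^{-1} C^t A^{-1} C)^{-1} f.
    alpha[i] weights how closely g_i must track f_i (equation 13)."""
    n = len(f)                                    # n = L + 1 samples
    # A = (1/6) * tridiag(1, 4, 1), size (L-1) x (L-1)
    A = (np.diag(np.full(n - 2, 4.0)) +
         np.diag(np.ones(n - 3), 1) +
         np.diag(np.ones(n - 3), -1)) / 6.0
    # C: second-difference matrix, size (L-1) x (L+1)
    C = np.zeros((n - 2, n))
    for i in range(n - 2):
        C[i, i:i + 3] = (1.0, -2.0, 1.0)
    P_inv = np.diag(1.0 / np.asarray(alpha, dtype=float))
    return np.linalg.solve(np.eye(n) + P_inv @ C.T @ np.linalg.inv(A) @ C,
                           np.asarray(f, dtype=float))

# A concatenated trajectory with a rough bump; large weights pin the endpoints
# (as one would for a bilabial closure), small weights allow smoothing.
f = np.array([0.0, 1.0, 5.0, 3.0, 4.0])
g = smooth_trajectory(f, alpha=[100.0, 1.0, 1.0, 1.0, 100.0])
```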
Audiovisual Synchronization
We used the Festival text-to-speech (TTS)37 system or the SONIC speech recognizer in forced-alignment mode6 to derive time-aligned transcriptions of synthesized or recorded speech, respectively. Note that Festival is a diphone-based concatenative speech synthesizer, so its operation is closely analogous to our approach of concatenating divisemes: diphones are represented by short speech wave files covering the transition from the middle of one phonetic segment to the middle of another. In order to produce a visible speech stream synchronized with the speech stream, our animation engine first extracts the duration of each diphone computed by the TTS system or the SONIC aligner.
The animation engine then creates a diviseme stream.
The diviseme stream consists of the concatenated divi-
semes corresponding to the diphones. The animation
engine loads the appropriate divisemes into the divi-
seme stream by looking for the corresponding diphones.
Usually the duration of a diviseme should be warped to
the duration of its corresponding diphone, because the
speech signal is used to control the synchronization
process. Suppose that the expected animation frame rate is F frames per second and the total duration of the audio stream is T milliseconds. The total number of frames will be about ⌊F·T/1000⌋ + 1, and the duration between two frames is C = 1000/F milliseconds. There are two approaches to synchronizing the visible speech with the auditory speech: synchronization at a fixed frame rate, and synchronization at the maximal frame rate the computer can sustain.
The synchronization algorithm for a fixed frame rate
is as follows and is shown in Figure 9(a):
1. Play the speech signal and render a frame of the image simultaneously; record the system time t0 at which speech playback starts. When the rendering of the frame is finished, record the time stamp t1.

2. If t1 − t0 < C, wait for C − (t1 − t0), then go to step 1 and repeat the process; otherwise (t1 − t0 ≥ C) go to step 1 directly.
The synchronization algorithm with maximal frame rate
for variable frame rate is as follows and is shown in
Figure 9(b):
1. Play the speech signal and render a frame of the image simultaneously; record the system time t0 at which speech playback starts. When the rendering of the frame is finished, record the time stamp t1.

2. Get the animation parameters at frame position v = (t1 − t0)/C; go to step 1 and repeat the process.
One disadvantage of this approach is that the anima-
tion engine is greedy, i.e., it uses most of the CPU cycles.
Nevertheless, it can produce a higher animation rate.
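The fixed-frame-rate algorithm can be sketched as below. This is our variant, which schedules each frame against the absolute playback start time rather than testing t1 − t0 per cycle, so small timing errors do not accumulate as drift; the function names are illustrative and audio playback is stubbed out.

```python
import time

def play_fixed_rate(render_frame, num_frames, fps=30.0):
    """Fixed-frame-rate synchronization loop.  render_frame(i) draws frame i;
    audio playback is assumed to start together with the first frame."""
    C = 1.0 / fps                         # expected duration between frames (seconds)
    t0 = time.monotonic()                 # start time of speech playback
    for i in range(num_frames):
        render_frame(i)
        t1 = time.monotonic()
        ahead = (t0 + (i + 1) * C) - t1   # time left in this frame's slot
        if ahead > 0:                     # rendering finished early: wait it out
            time.sleep(ahead)
        # otherwise we are behind schedule and start the next frame at once
```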
Coarticulation Modelling of Tongue Movement
The tongue also plays an important role in visible
speech perception and production. Some phonemes
not distinguished by their corresponding lip shapes might be differentiated by tongue positions, such as /f/ and /th/. Another role of the 3D tongue model is to show the positions of different articulators for different phonemes from different orientations, using a semi-transparent face, to help people learn pronunciation.
Even though only a small part of the tongue is visible during speech production, this visible information can increase visible speech intelligibility.
In addition, the tongue is highly mobile and deform-
able. For each phoneme, a tongue target38 was designed.
Tongue posture control consists of 24 parameters
manipulated by sliders in a dialogue box. One 3D ton-
gue model is shown in Figures 10(a) and 10(b). Because
motion capture data were not available for tongue
movement, an approach combining data-smoothing
techniques with heuristic coarticulation rules is pro-
posed to simulate tongue movement. The coarticulation
effects of tongue movement are quite different from
those of lip movements. The tongue targets correspond-
ing to some phonemes should be completely reached.
For example, the tongue targets of the following phonemes should be completely reached: (1) tongue up and down: /t/, /d/, /n/, /l/; (2) tongue between teeth: /T/ in thank, /D/ in bathe; (3) lips forward: /S/ in ship, /Z/ in measure, /tS/ in chain, /dZ/ in Jane; (4) tongue back: /k/, /g/, /N/, /h/. The tongue targets of other phonemes need not be completely reached. Therefore, all phonemes are categorized into two classes according to whether the tongue target corresponding to the phoneme must be completely reached. For the two categories, different smoothing parameters are applied to simulate the tongue movement.

Figure 8. Synchronization between the audio signal and the video signal.

Figure 9. (a) Synchronization for a fixed animation frame rate. The dashed segment represents the time spent waiting to achieve the expected duration C between two frames. (b) Synchronization with maximal frame rate. There is no waiting time in each processing cycle.
We use the kernel smoothing approach proposed in
our previous study38 to model tongue movement. The
approach is described as follows. Suppose there is an observation sequence $y_i = \phi(x_i)$ to be smoothed, where $\{x_i\}_{i=0}^{n}$ satisfies the condition

$$0 = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = 1$$

The weighted average of the observation sequence is used as the estimator of $\phi(x)$:

$$\hat{\phi}(x) = \sum_{i=0}^{n} y_i w_i(x) \quad (15)$$

where $\sum_{i=0}^{n} w_i(x) = 1$, $w_i(x) = K_\sigma(x - x_i)/M_0$, and $M_0 = \sum_{i=0}^{n} K_\sigma(x - x_i)$. $\hat{\phi}(x)$ is known as the Nadaraya–Watson estimator.38
For tongue movement modelling, the relationship between time $t$ and $x$ is expressed as

$$x = (t - t_0)/(t_n - t_0), \qquad x_i = (t_i - t_0)/(t_n - t_0)$$

where the interval $[t_0, t_n]$ represents a local window at frame or time $t$, and $n$ is the size of the window. When the sampling points $\{x_i\}_{i=0}^{n}$ are from one speech segment, i.e. all values of $\{y_i\}_{i=0}^{n}$ are equal, the morph target
can be completely reached. When the sampling points
are not from the same speech segment, the smoothed
target value is the weighted average of the sampling
points from different speech segments. Therefore, the
target value at the boundary of two speech segments is
smoothed according to the distributions of sampling
points in the two speech segments. The principles for choosing the kernel function, the number of sampling points in different phoneme segments, and the window size n can be found in Ma and Cole.38 A tongue movement sequence generated by this approach is shown in Figure 11.
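A sketch of the Nadaraya–Watson estimator of equation (15) with a Gaussian kernel; the bandwidth and the two-segment example data are our own illustration of how the estimate smooths across a segment boundary while staying at the target value deep inside each segment.

```python
import numpy as np

def nadaraya_watson(x, xi, yi, sigma=0.1):
    """Equation (15): weighted average with normalized Gaussian kernel
    weights w_i(x) = K_sigma(x - x_i) / M_0.  sigma is a hypothetical bandwidth."""
    K = np.exp(-0.5 * ((x - xi) / sigma) ** 2)    # K_sigma(x - x_i)
    return (K / K.sum()) @ yi                     # weights sum to one

# Sampling points straddling the boundary between two speech segments whose
# tongue target values are 0 and 1 respectively.
xi = np.linspace(0.0, 1.0, 11)
yi = np.where(xi < 0.5, 0.0, 1.0)
```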
Conclusion
We have demonstrated a new technique to produce
highly realistic visible speech using motion capture,
motion mapping and motion concatenation. This was
done by recording primitive motions of a set of divi-
semes and concatenating them in an optimal way
through graph search. The system has been integrated
into reading tutors39 and the CSLU toolkit, which can be
downloaded via the following website: http://cslr.colorado.edu/toolkit/main.html.
The new contributions of our work are:
1. Motion mapping. In motion mapping, we proposed an
approach to estimating a mapping function from the
visemes in the source to the visemes in the target.
Thus, the motion vectors in the source can be mapped
onto the motion vectors in the target. PCA is applied to
obtain a compact representation of motion parameters.
2. Motion concatenation. We propose a new approach to
motion vector blending and a new approach to
defining the concatenation cost and to searching the
optimal concatenated visible speech.
3. Smoothing algorithm. The spline smoothing algo-
rithm is proposed to smooth the motion trajectory
of the synthetic visible speech according to phonetic
knowledge.
4. Modelling coarticulation of tongue movement. An ap-
proach that combines kernel-smoothing techniques
with heuristic coarticulation rules is proposed to
model coarticulation during tongue movement.
Figure 10. (a) Side view of a 3D tongue model. (b) Top view
of the 3D tongue model.
Figure 11. An example of tongue movements in side view.
We believe this work is significant for its technical
contribution to visible speech synthesis. Observation of
the resulting visible speech synthesis reveals that it is
more natural looking than morphing between targets
with simple smoothing techniques. While the initial
results of the approach are encouraging, there is much
room for improvement. For example, the visible speech
corpus should be larger so that smoother concatenated visible speech can be obtained. Furthermore,
prosodic features of speech, including accent, stress,
tone and intonation, should be used to control visible
speech synthesis and facial expressions.
ACKNOWLEDGEMENTS
This work was supported in part by NSF CARE grant EIA-
9996075; NSF/ITR grant IIS-0086107; NSF/ITR Grant REC-
0115419; NSF/IERI (Interagency Education Research Initiative)
Grant EIA-0121201 and NSF/IERI Grant 1R01HD-44276.01. The
findings and opinions expressed in this article do not necessa-
rily represent those of the granting agencies.
References
1. Bregler C, Covell M, Slaney M. Video rewrite: driving visual speech with audio. In Proceedings of ACM SIGGRAPH Computer Graphics. ACM Press: New York, 1997; 353–360.
2. Ezzat T, Geiger G, Poggio T. Trainable videorealistic speech animation. In Proceedings of ACM SIGGRAPH Computer Graphics, San Antonio, TX, 2002; pp 388–398.
3. Ezzat T, Poggio T. MikeTalk: a talking facial display based on morphing visemes. In Proceedings of Computer Animation, Philadelphia, PA, 1998; pp 96–102.
4. Cosatto E, Graf HP. Sample-based synthesis of photo-realistic talking heads. In Proceedings of Computer Animation, Philadelphia, PA, 1998; pp 103–110.
5. Graf HP, Cosatto E, Ezzat T. Face analysis for the synthesis of photo-realistic talking heads. In Proceedings of Face and Gesture Recognition, 2000; pp 189–194.
6. Pellom B, Hacioglu K. Recent improvements in the SONIC ASR system for noisy speech: the SPINE task. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, Hong Kong, 6–10 April 2003; pp 4–7.
7. Cole R, Massaro DW, de Villiers J, Rundle B, Shobaki K, Wouters J, Cohen M, Beskow J, Stone P, Connors P, Tarachow A, Solcher D. New tools for interactive speech and language training: using animated conversational agents in the classrooms of profoundly deaf children. In Proceedings of ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education, 1999.
8. Parke F. Computer generated animation of faces. In Proceedings of ACM National Conference, Boston, MA, 1972; pp 451–457.
9. Terzopoulos D, Waters K. Physically-based facial modeling, analysis, and animation. Journal of Visualization and Computer Animation 1990; 1(4): 73–80.
10. Williams L. Performance-driven facial animation. In Proceedings of ACM SIGGRAPH Computer Graphics 1990; 24(4): 235–242.
11. Kleiser J. A fast, efficient, accurate way to represent the human face. State of the Art in Facial Animation. ACM SIGGRAPH Computer Graphics Tutorials 1989; 22: 37–40.
12. Magnenat Thalmann N, Primeau E, Thalmann D. Abstract muscle action procedures for human face animation. Visual Computer 1988; 3(5): 290–297.
13. Noh JY, Neumann U. Expression cloning. In Proceedings of ACM SIGGRAPH Computer Graphics, Los Angeles, CA, 2001; pp 277–288.
14. Chuang E, Deshpande H, Bregler C. Facial expression space learning. In Proceedings of Pacific Graphics, Beijing, 2002; pp 68–76.
15. Guenter B, Grimm C, Wood D, Malvar H, Pighin F. Making faces. In Proceedings of SIGGRAPH Computer Graphics 1998; pp 55–66.
16. Kshirsagar S, Molet T, Magnenat-Thalmann N. Principal components of expressive speech animation. In Proceedings of Computer Graphics International 2001; pp 38–44.
17. Bregler C, Loeb L, Chuang E, Deshpande H. Turning to the masters: motion capturing cartoons. In Proceedings of SIGGRAPH Computer Graphics 2002; pp 399–407.
18. Jackson PL. The theoretical minimal unit for visual speech perception: visemes and coarticulation. Volta Review 1988; 90(5): 99–115.
19. Woodward MF, Barber CG. Phoneme perception in lipreading. Journal of Speech and Hearing Research 1960; 3: 212–220.
20. Hardcastle WJ, Hewlett N. Coarticulation: Theory, Data and Techniques. Cambridge University Press: Cambridge, UK, 1999.
21. Woods JC. Lip-Reading: A Guide for Beginners (2nd edn). Royal National Institute for Deaf People: London, 1994.
22. Pelachaud C, Badler N, Steedman M. Linguistic issues in facial animation. In Proceedings of Computer Animation, Magnenat-Thalmann N, Thalmann D (eds). Springer: Berlin, 1991; 15–30.
23. Kent RD, Minifie FD. Coarticulation in recent speech production models. Journal of Phonetics 1977; 5: 115–135.
24. Cohen MM, Massaro DW. Modeling coarticulation in synthetic visual speech. In Proceedings of Computer Animation, Thalmann NM, Thalmann D (eds). Springer: Tokyo, 1993; 139–156.
25. Lofqvist A. Speech as audible gestures. In Speech Production and Speech Modeling, Hardcastle WJ, Marchal A (eds). Kluwer: Dordrecht, 1990; 289–322.
26. Nielson G. Scattered data modeling. IEEE Computer Graphics and Applications 1993; 13(1): 60–70.
27. Vetter T, Poggio T. Linear object classes and image synthesis from a single example image. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997; 19(7): 733–742.
28. Blanz V, Vetter T. A morphable model for the synthesis of 3D faces. In Proceedings of SIGGRAPH Computer Graphics, Los Angeles, 1999; pp 187–194.
29. Kalberer GA, Gool LV. Face animation based on observed 3D speech dynamics. In Proceedings of Computer Animation, 2001; pp 20–27.
30. Pighin F, Szeliski R, Salesin DH. Modeling and animating realistic faces from images. International Journal of Computer Vision 2002; 50(2): 143–169.
31. Blanz V, Basso C, Poggio T, Vetter T. Reanimating faces in images and video. Computer Graphics Forum 2003; 22(3): 641–650.
32. Gertz EM, Wright SJ. Object-oriented software for quadratic programming. ACM Transactions on Mathematical Software 2003; 29: 58–81.
33. Bai ZJ, Demmel J, Dongarra J, Ruhe A, van der Vorst H. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics: Philadelphia, PA, 2000.
34. Hartmann E. Parametric Gn blending of curves and surfaces. Visual Computer 2001; 17(1): 1–13.
35. Bellman R. Dynamic Programming. Princeton University Press: Princeton, NJ, 1957.
36. Feng G. Data smoothing by cubic spline filters. IEEE Transactions on Signal Processing 1998; 46(10): 2790–2796.
37. University of Edinburgh. The Festival speech synthesis system: http://www.cstr.ed.ac.uk/projects/festival/ [2003].
38. Ma JY, Cole R. Animating visible speech and facial expressions. Visual Computer 2004 (to appear).
39. Cole R, Van Vuuren S, Pellom B, Hacioglu K, Ma JY, Movellan J, Schwartz S, Wade-Stein D, Ward W, Yan J. Perceptive animated interfaces: first steps toward a new paradigm for human–computer interaction. Proceedings of the IEEE 2003; 91(9): 1391–1405.
Appendix: Diviseme Text Corpus

0. i:-w(dee-wet) 1. i:-9r(dee-rada) 2. i:-U(bee-ood) 3. i:-ei(bee-ady) 4. i:-A(bee-ody) 5. i:-aI(bee-idy) 6. i:-T(deeth) 7. i:-S(deesh) 8. i:-k(deeck) 9. i:-l(deela) 10. i:-s(reset) 11. i:-d(deed) 12. i:-I(bee-id) 13. i:-v(deev) 14. i:-m(deem)
15. w-i:(weed) 16. w-9r(duw-rud) 17. u-U(boo-ood) 18. w-ei(wady) 19. u-A(boo-ody) 20. w-aI(widy) 21. u-T(dooth) 22. u-S(doosh) 23. u-k(doock) 24. u-l(doola) 25. u-s(doos) 26. u-d(doo-de) 27. u-I(boo-id) 28. u-v(doov) 29. u-m(doom)
30. 9r-i:(far-eed) 31. 9r-u(far-oodles) 32. 9r-U(far-ood) 33. 9r-ei(far-ady) 34. 9r-A(far-ody) 35. 9r-aI(far-idy) 36. 9r-T(dur-thud) 37. 9r-S(dur-shud) 38. 9r-k(dur-kud) 39. 9r-l(dur-lud) 40. 9r-s(dur-sud) 41. 9r-d(dur-dud) 42. 9r-I(far-id) 43. 9r-v(dur-vud) 44. 9r-m(dur-mud)
45. U-i(boo-eat) 46. U-w(boo-wet) 47. U-9r(boor) 48. U-ei(boo-able) 49. U-a(boo-art) 50. U-aI(boo-eye) 51. U-T(booth) 52. U-S(bushes) 53. U-k(book) 54. U-l(pulley) 55. U-s(pussy) 56. U-d(wooded) 57. U-I(boo-it) 58. U-v(booves) 59. U-m(woman)
60. ei-i:(bay-eed) 61. ei-w(day-wet) 62. ei-9r(dayrada) 63. ei-U(bay-ood) 64. ei-A(bay-ody) 65. ei-aI(bay-idy) 66. ei-T(dayth) 67. ei-S(daysh) 68. ei-k(dayck) 69. ei-l(dayla) 70. ei-s(days) 71. ei-d(dayd) 72. ei-I(bay-id) 73. ei-v(dayv) 74. ei-m(daym)
75. A-i:(bay-idy) 76. A-w(da-wet) 77. A-9r(da-rada) 78. A-U(ba-ood) 79. A-ei(ba-ady) 80. A-aI(ba-idy) 81. A-T(ba-the) 82. A-S(dosh) 83. A-k(dock) 84. A-l(dola) 85. A-s(velocity) 86. A-d(dod) 87. A-I(ba-id) 88. A-v(dov) 89. A-m(dom)
90. aI-i:(buy-eed) 91. aI-w(die-wet) 92. aI-9r(die-rada) 93. aI-U(buy-ood) 94. aI-ei(buy-ady) 95. aI-A(buy-ody) 96. aI-T(die-thagain) 97. aI-S(die-shagain) 98. aI-k(die-kagain) 99. aI-l(die-la) 100. aI-s(die-sagain) 101. aI-d(die-dagain) 102. aI-I(buy-id) 103. aI-v(die-vagain) 104. aI-m(die-magain)
105. T-i:(theed) 106. T-w(duth-wud) 107. T-9r(duth-rud) 108. T-U(thook) 109. T-ei(thady) 110. T-A(thody) 111. T-aI(thidy) 112. T-S(duth-shud) 113. T-k(duth-kud) 114. T-l(duth-lud) 115. T-s(duth-sud) 116. T-d(duth-dud) 117. T-I(thid) 118. T-v(duth-vud) 119. T-m(duth-mud)
120. S-i:(sheed) 121. S-w(dush-wud) 122. S-9r(dush-rud) 123. S-U(shook) 124. S-ei(shady) 125. S-A(shody) 126. S-aI(shidy) 127. S-T(dush-thud) 128. S-k(dush-kud) 129. S-l(dush-lud) 130. S-s(dush-sud) 131. S-d(dush-dud) 132. S-I(shid) 133. S-v(dush-vud) 134. S-m(dush-mud)
135. k-i:(keed) 136. k-w(duk-wud) 137. k-9r(duk-rud) 138. k-U(kook) 139. k-ei(back-ady) 140. k-A(kody) 141. k-aI(kidy) 142. k-T(duk-thud) 143. k-S(duk-shud) 144. k-l(duk-lud) 145. k-s(duk-sud) 146. k-d(duk-dud) 147. k-I(kid) 148. k-v(duk-vud) 149. k-m(duk-mud)
150. l-i:(leed) 151. l-w(dul-wud) 152. l-9r(dul-rud) 153. l-U(fall-ood) 154. l-ei(fall-ady) 155. l-A(fall-ody) 156. l-aI(fall-idy) 157. l-T(dul-thud) 158. l-S(dul-shud) 159. l-k(dul-kud) 160. l-s(dul-sud) 161. l-d(dul-dud) 162. l-I(fall-id) 163. l-v(dul-vud) 164. l-m(dul-mud)
165. s-i:(seed) 166. s-w(dus-wud) 167. s-9r(dus-rud) 168. s-U(sook) 169. s-ei(sady) 170. s-A(sody) 171. s-aI(sidy) 172. s-T(dus-thud) 173. s-S(dus-shud) 174. s-k(dus-kud) 175. s-l(dus-lud) 176. s-d(dus-dud) 177. s-I(sid) 178. s-v(dus-vud) 179. s-m(dus-mud)
180. d-i:(deed) 181. d-w(dud-wud) 182. d-9r(dud-rud) 183. d-U(dook) 184. d-ei(dady) 185. d-A(dody) 186. d-aI(didy) 187. d-T(dud-thud) 188. d-S(dud-shud) 189. d-k(dud-kud) 190. d-l(dud-lud) 191. d-s(dud-sud) 192. d-I(did) 193. d-v(dud-vud) 194. d-m(dud-mud)
195. I-i:(ci-eed) 196. I-w(ci-wet) 197. I-9r(ci-rada) 198. I-U(ci-ood) 199. I-ei(ci-ady) 200. I-A(ci-ody) 201. I-aI(ci-idy) 202. I-T(dith) 203. I-S(dish) 204. I-k(dick) 205. I-l(dill) 206. I-s(dis) 207. I-d(did) 208. I-v(div) 209. I-m(dim)
210. v-i:(veed) 211. v-w(duv-wud) 212. v-9r(duv-rud) 213. v-U(vook) 214. v-ei(vady) 215. v-A(vody) 216. v-aI(vidy) 217. v-T(duv-thud) 218. v-S(duv-shud) 219. v-k(duv-kud) 220. v-l(duv-lud) 221. v-s(duv-sud) 222. v-d(duv-dud) 223. v-I(vid) 224. v-m(duv-mud)
225. m-i:(meed) 226. m-w(dum-wud) 227. m-9r(dum-rud) 228. m-U(mook) 229. m-ei(mady) 230. m-A(monic) 231. m-aI(midy) 232. m-T(dum-thud) 233. m-S(dum-shud) 234. m-k(dum-kud) 235. m-l(dum-lud) 236. m-s(dum-sud) 237. m-d(dum-dud) 238. m-I(mid) 239. m-v(dum-vud).
The phonetic symbols in the list are defined by the following words: /i:/ week; /I/ visual; /9r/ read; /U/ book; /ei/ stable; /A/ father; /aI/ tiger; /T/ think; /S/ she; /w/ wish. Videos and utterances using our technique can be viewed at the following website: http://cslr.colorado.edu/~jiyong/corpus.html.
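Each corpus entry above follows the pattern `index. first-second(carrier word)`: a running index, a diviseme written as two phoneme symbols joined by a hyphen, and the carrier word in which it was recorded. As an illustration only (the paper ships no parsing code, so the function name and record layout below are our own assumptions), the listing could be read into structured records like this:

```python
import re

# An entry looks like "12. i:-I(bee-id)": index, diviseme, carrier word.
ENTRY = re.compile(r"(\d+)\.\s*([^\s(]+)\(([^)]+)\)")

def parse_corpus(text):
    """Return a list of (index, (phoneme1, phoneme2), carrier_word) records."""
    records = []
    for idx, diviseme, word in ENTRY.findall(text):
        # Split only on the first "-" so multi-character symbols such as
        # "9r" or "i:" survive intact; "i:-w" -> ("i:", "w").
        first, _, second = diviseme.partition("-")
        records.append((int(idx), (first, second), word))
    return records
```

Splitting on the first hyphen only matters because the carrier words themselves contain hyphens; the regex keeps those safely inside the parenthesized group.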
Authors' biographies:
Jiyong Ma received a PhD degree in Computer Science from Harbin Institute of Technology, China, in 1999 and his BS degree in Computational Mathematics from Heilongjiang University in 1984. Prior to joining the CSLR, he was a post-doctoral researcher at the Institute of Computing Technology in the Chinese Academy of Sciences from March 1999 to February 2001. His research interests include computer animation, computer vision, speech and speaker recognition, pattern recognition algorithms, and applied and computational mathematics. He has published more than 60 scientific papers.
Ronald Cole has studied speech recognition by human
and machine for the past 35 years, and has published
over 150 articles in scientific journals and published
conference proceedings. In 1990, Ron founded the
Center for Spoken Language Understanding (CSLU) at
the Oregon Graduate Institute. In 1998, Ron founded
the Center for Spoken Language Research (CSLR) at the
University of Colorado, Boulder.
Bryan Pellom received the BSc degree in Computer and Electrical Engineering from Purdue University, West Lafayette, IN, in 1994 and the MSc and PhD degrees in electrical engineering from Duke University in 1996 and 1998, respectively. From 1999 to 2002, he was a Research Associate with the Center for Spoken Language Research (CSLR), University of Colorado, Boulder. His research activities were focused on automatic speech recognition, concatenative speech synthesis and spoken dialogue systems. Since 2002, he has been a Research Assistant Professor in the Department of Computer Science and with the CSLR. His current research is focused in the area of large vocabulary speech recognition.
Wayne Ward is a full-time research faculty member in the Center for Spoken Language Research. He works in the area of Spoken Language Processing and Dialogue Modeling for conversational computer systems, and information retrieval in question-answering systems.
Barbara Wise has a BA in Psychology with honors from Stanford, and an MA and PhD in Developmental Psychology from the University of Colorado in Boulder. Dr Wise developed the Linguistic Remedies program and classes, where she shares her own and others' research knowledge and teaching ideas with professionals and parents concerned with children with reading difficulties. She has conducted teacher training workshops in many towns in Colorado, as well as in Texas, California, Connecticut, Pennsylvania and Washington.