COMPUTER ANIMATION AND VIRTUAL WORLDS
Comp. Anim. Virtual Worlds 2004; 15: 485–500
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cav.11
Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data
By Jiyong Ma*, Ronald Cole, Bryan Pellom, Wayne Ward and Barbara Wise
We present a technique for accurate automatic visible speech synthesis from textual input.
When provided with a speech waveform and the text of a spoken sentence, the system
produces accurate visible speech synchronized with the audio signal. To develop the system,
we collected motion capture data from a speaker’s face during production of a set of words
containing all diviseme sequences in English. The motion capture points from the speaker’s
face are retargeted to the vertices of the polygons of a 3D face model. When synthesizing a
new utterance, the system locates the required sequence of divisemes, shrinks or expands
each diviseme based on the desired phoneme segment durations in the target utterance, then
moves the polygons in the regions of the lips and lower face to correspond to the spatial
coordinates of the motion capture data. The motion mapping is realized by a key-shape
mapping function learned by a set of viseme examples in the source and target faces. A
well-posed numerical algorithm estimates the shape blending coefficients. Time warping and
motion vector blending at the juncture of two divisemes and the algorithm to search the
optimal concatenated visible speech are also developed to provide the final concatenative
motion sequence. Copyright © 2004 John Wiley & Sons, Ltd.
Received: 9 April 2003; Revised: 2 December 2003
KEY WORDS: visible speech; visual speech synthesis; animated speech; coarticulation modelling; speech animation; face animation
Introduction
Animating accurate visible speech is one of the most
important research areas in face animation because of its
many practical applications ranging from language
training for the hearing impaired, to films and game
productions, animated agents for human–computer
interaction, virtual avatars, model-based image coding
in MPEG4 and electronic commerce.
Motion capture technologies have been successfully
used in character body animation. By recording motions
directly from real actors and mapping them to character
models, realistic motions can be generated efficiently.
Although techniques based on motion capture for 3D
face animation have been proposed in the last decade,
accurate visible speech synthesis for any text is still a
challenge owing to complex facial muscles in lip regions
and significant coarticulation effects in visible speech.
Most techniques proposed in previous work require a
generic 3D face model to be adapted to the subject’s face
with captured facial motions. These techniques cannot be
applied when the subject's face differs significantly
from the 3D face model. The emphasis of our
work is on the issues relating to automatic visible speech
synthesis for 3D face models with different meshes.
The work described in this paper explores a new
approach to accurate visible speech synthesis by record-
ing facial movements from real actors and mapping
*Correspondence to: Jiyong Ma, Center for Spoken Language Research, University of Colorado at Boulder, Campus Box 594, Boulder, Colorado 80309-0594, USA. E-mail: [email protected]
Contract/grant sponsor: NSF CARE; contract/grant number: EIA-9996075. Contract/grant sponsor: NSF/IIR; contract/grant numbers: IIS-0086107; REC-0115419. Contract/grant sponsor: NSF/IERI; contract/grant numbers: EIA-0121201; 1R01HD-44276.01.
them to 3D face models. In the proposed approach,
several tasks are implemented. These include: motion
capture, motion mapping, and motion concatenation.
In motion capture, a set of colour 3D markers is glued
onto a human face. The subject then produces a set of
words that cover important lip transition motions from
one viseme to another. Sixteen visemes are used in our
experiment. The motion capture system we designed
consists of two mirrors and a camcorder. The camcorder
records the video and audio synchronously. The audio
signal is used to segment video clips; thus the motion
image sequence for each diviseme is segmented. Com-
puter vision techniques such as camera calibration, 2D
facial marker tracking, and head pose estimation algo-
rithms are also implemented. The head pose is applied
to eliminate the influence of head motions on the facial
markers’ movement so that the reconstructed 3D facial
marker positions are invariant to the head pose.
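This pose compensation step can be sketched as follows. Assuming the per-frame head pose (R, t) maps head-fixed coordinates into camera coordinates, the rigid head motion is removed by applying the inverse transform. This is a minimal numpy sketch; the function name and conventions are illustrative, not taken from the paper.

```python
import numpy as np

def remove_head_motion(markers_cam, R, t):
    """Map camera-frame marker positions back into a head-fixed frame.

    markers_cam : (N, 3) array of 3D marker positions in camera coordinates
    R           : (3, 3) head rotation for this frame (head -> camera)
    t           : (3,)   head translation for this frame

    Inverting p_cam = R @ p_head + t gives p_head = R.T @ (p_cam - t),
    so the returned positions are invariant to the head pose.
    """
    # For row vectors v, v @ R equals R.T applied to v as a column vector
    return (markers_cam - t) @ R
```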
Motion mapping is required because the source face
is different from the target face. A mapping function is
learned from a set of training examples of visemes
selected from the source face and designed for the target
face. Visemes for the source face are subjectively sele-
cted from the recorded images, while the visemes for the
target 3D face are manually designed according to their
appearances in the source face. They should visually
resemble those for the source face. For instance, a
viseme that models the /aa/ sound for the source face
should be very similar visually to the same viseme for
the target 3D face.
After the motions are mapped from the source face to
the target face, a motion concatenation technique is
applied to synthesize natural visible speech. Although
the concatenation approach has already been used in several
image-based approaches1–5 to synthesize visible speech,
the concatenated objects are 2D images or image fea-
tures such as optical flows, etc. The concatenated objects
discussed in this paper are 3D trajectories of lip motions.
The overall approach proposed in this paper can be
applied to any 3D face model, either photo-realistic or
cartoon-like. In addition, to get relevant phonetic and
timing information of input text, we have integrated the
Festival speech synthesis system into our animation
engine; the system converts text to speech. At the
same time, our system uses the SONIC speech recogni-
tion engine developed by Pellom and Hacioglu6 to
force-align and segment pre-recorded speech (i.e., to
provide timing between the input speech and associated
text and/or phoneme sequence). The speech synthesizer
and forced-alignment system allow us to perform ex-
periments with various input text and speech wave files.
In the remainder of this article, the next section
presents a review of previous work. The third section
describes the system architecture of our work. The
fourth section demonstrates the approach to accurate
visible speech. The fifth section discusses modelling
coarticulation in 3D tongue movement; the final section
provides a summary.
Related Work
Spoken language is bimodal in nature: auditory and
visual. Between them, visual speech can complement
auditory speech understanding in noisy conditions. For
instance, most hearing-impaired people and foreign
language learners heavily rely on visual cues to enhance
speech understanding. In addition, facial expressions and
lip motions are also essential to sign language understanding;
without facial information, comprehension of sign
language becomes very poor. Therefore, creating a
3D character that can automatically produce accurate
visual speech synchronized with auditory speech will
be beneficial to language understanding7 when
direct face-to-face communication is impossible.
Research in the past three decades has shown that
visual cues in spoken language can augment auditory
speech understanding, especially in a noisy environment.
However, automatically producing accurate visible
speech and realistic facial expressions for 3D computer
characters seems to be a non-trivial task. The reasons
include: 3D lip motions are not easy to control and the
coarticulation in visible speech is difficult to model.
Since F. Parke8 created the first 3D face animation
using the key-frame technique, researchers have de-
voted considerable effort to creating convincing 3D
face animation. The approaches include parametric-
based,8 physics-based,9 image-based,1–5 performance-
driven approach,10 and multi-target morphing.11
Although these approaches have enriched 3D face ani-
mation theory and practice, creating convincing visible
speech is still a time-consuming task. To create even a
short sequence of 3D facial animation for film, a skilled
animator may spend several hours repeatedly modifying
animation parameters to obtain the desired effect.
Although some 3D design authoring
tools such as 3Ds MAX or MAYA are available for
animators, they cannot automatically generate accurate
visible speech; they require repeated, tedious adjustment
and testing to arrive at good animation parameters.
Therefore, research that leads to automatic
production of natural visible speech and facial expres-
sions will have great theoretical and practical value.
The approaches most closely related to our approach
to visible speech animation and facial animation are the
image-based approach and the performance-driven ap-
proach. A third approach, the physics-based approach,
has the following disadvantages. In the physics-based
approach, a muscle is usually connected to a group of
vertices. This requires animators to manually define
which vertex is associated with which muscle and to
manually put muscles under the skin surface. Muscle
parameters are manually modified by trial and error.
These tasks are tedious and time-consuming. It seems
that no unique parameterization approach has proven
to be sufficient to create face expressions and viseme
targets with simple and intuitive controls. In addition, it
is difficult to map muscle parameters estimated from the
motion capture data to a 3D face model. To simplify the
physics-based approach, Magnenat Thalmann et al.12
proposed the concept of an abstract muscle procedure.
One challenging problem in physics-based approaches is
how to automatically obtain muscle parameters. Inverse
dynamics approaches that use advanced measurement
equipment may provide a scientific solution to the pro-
blem of obtaining facial muscle parameters.
The image-based approach aims at learning face
models from a set of 2D images instead of directly
modelling 3D face models. One typical image-based
animation system called Video Rewrite was proposed
by Bregler et al.1 In this approach, a set of triphone
segments is used to model the coarticulation in visible
speech. For speech animation, the phonetic information
in the audio signal provides cues to locate its corre-
sponding video clip. In the approach, the visible speech
is constructed by concatenating the appropriate visual
triphone sequences from a database. Cosatto and Graf4
and Graf et al.5 also proposed an approach analogous to
speech synthesis, in which the visible speech synthesis
is performed by searching a best path in the triphone
database using the Viterbi algorithm. However, experi-
mental results show that when the lip space is not
populated densely the animations produced may be
jerky. Recently, Ezzat and co-workers2,3 adopted ma-
chine learning and computer vision techniques to
synthesize visible speech from recorded video. In that
system, a visual speech model is learned from the video
data that is capable of synthesizing the human subject’s
lip motion not recorded in the original speech. The
system can produce intelligible visible speech. The
approach has two limitations: (1) the face model is not
3D; (2) the face appearance cannot be changed.
In a performance-driven approach, a motion capture
system is employed to record motions of a subject’s face.
The captured data from the subject are retargeted to a
3D face model. The captured data may be 2D or 3D
positions of feature points on the subject’s face. Most
previous research on performance-driven facial anima-
tion requires the face shape of the subject to closely
resemble the target 3D face model. When the target 3D
face model is sufficiently different from that of the
captured face, face adaptation is required to retarget
the motions. In order to map motions, global and local
face parameter adaptation can be applied. Before mo-
tion mapping, the correspondences between key ver-
tices in the 3D face model and the subject’s face are
manually labelled.13 Moreover, local adaptation is re-
quired for the eye, nose, and mouth zones. However,
this approach is not sufficient to describe complex facial
expressions and lip motions. Chuang et al.14 proposed
an approach to creating facial animation using motion
capture data and shape-blending interpolation. Here,
computer vision is utilized to track the facial features in
2D, while shape-blending interpolation is proposed to
retarget the source motion. Noh and Neumann13 pro-
posed an approach to transferring vertex motion from a
source face model to a target model. It is claimed that
with the aid of an automatic heuristic correspondence
search the approach requires a user to select fewer than
10 points in the model. In addition, Guenter et al.15
created a system for capturing both the 3D geometry
and colour shading information for human facial ex-
pression. Kshirsagar et al.16 utilized motion capture
techniques to obtain facial description parameters and
facial animation parameters defined in the MPEG4 face
animation standard. Recently, Bregler et al.17 developed
a technique to track the motion from animated cartoons
and retarget it on 3D models.
System Architecture of our Work
The research mentioned above was conducted either for
facial expression animation or for a specific 3D/2D
model. No detailed report discusses how to concatenate
3D facial points for visible speech synthesis based on
motion concatenation. In this paper, we propose a novel
approach for visible speech synthesis based on motion
capture techniques and concatenating lip motions. In
the system, motion capture is used to get the trajectories
of the 3D facial feature points on a subject’s face while
the subject is speaking. Then, the trajectories of the 3D
facial feature points are mapped to make the target 3D
face imitate the lip motion.
Our system differs from the prior work using image-
based methods. Instead, we capture motions of 3D facial
feature points, map them onto a 3D face model, and
concatenate motions to get natural visible speech. Com-
pared with image-based methods, an advantage of the
proposed approach is that the motion mapping is
applicable to any 2D/3D character model. The second
novelty of the proposed approach is that we use a
concatenation approach to synthesize accurate visible
speech. Figure 1 illustrates the proposed system archi-
tecture for accurate visible speech synthesis.
The corpus consists of a set of primitive motion
trajectories of 3D facial markers reconstructed by a
motion capture system. The concept of a viseme and a
diviseme is presented in the next section. A set of viseme
images in the source face is subjectively selected, and
their corresponding 3D facial marker positions consti-
tute the viseme models in the source face. The viseme
models in the target 3D face are designed manually to
enable each viseme model in the target face to resemble
that in the source face. The mapping functions are
learned by the viseme examples in the source and target
faces. For each diviseme, its motion trajectory is com-
puted with the motion capture data and the viseme
models for the target face. When text is input to the
system, a phonetic transcription of the words is gener-
ated by a speech synthesizer, which also produces a
speech waveform corresponding to the text. If the text
is spoken by a human voice, a speech recognition
system is used in forced-alignment mode to provide
the time-aligned phonetic transcription. Time warping
is then applied to the diviseme motion trajectories so
that their time information conforms to the time require-
ments of the generated phonetic information. The Vi-
terbi algorithm is applied to find the best concatenation
path in the space of the diviseme instances. The output
is the optimal visible speech synchronized with the
auditory speech signal.
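The best-path search over diviseme instances can be sketched as a small dynamic program. This is an illustrative reconstruction, not the paper's implementation: the join cost here is simply the Euclidean distance between the last marker frame of one candidate instance and the first frame of the next, and the cheapest sequence of instances is recovered Viterbi-style by backtracking.

```python
import numpy as np

def best_concatenation_path(instances):
    """Pick one recorded instance per diviseme so consecutive
    instances join smoothly.

    instances : list over divisemes; instances[k] is a list of
                (T_k, M, 3) candidate motion trajectories.
    Returns the list of chosen instance indices.
    """
    K = len(instances)
    # cost[k][j]: best total join cost ending with instance j of diviseme k
    cost = [np.zeros(len(instances[0]))]
    back = []
    for k in range(1, K):
        prev_ends = np.array([inst[-1].ravel() for inst in instances[k - 1]])
        starts = np.array([inst[0].ravel() for inst in instances[k]])
        # pairwise join costs between every previous end and every start
        join = np.linalg.norm(prev_ends[:, None, :] - starts[None, :, :], axis=2)
        total = cost[-1][:, None] + join
        back.append(np.argmin(total, axis=0))
        cost.append(np.min(total, axis=0))
    # backtrack the cheapest path
    path = [int(np.argmin(cost[-1]))]
    for k in range(K - 2, -1, -1):
        path.append(int(back[k][path[-1]]))
    return path[::-1]
```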
Accurate Visible Speech Synthesis
Visible Speech
Visible speech refers to the movements of the lips,
tongue, and lower face during speech production by
humans. In terms of acoustic similarity, a
phoneme is the smallest identifiable
unit in speech, while a viseme is a particular configura-
tion of the lips, tongue, and lower face for a group of
phonemes with similar visual outcomes. A viseme is an
identifiable unit in visible speech. In English, there
are many phonemes with visual ambiguity. For exam-
ple, phonemes /p/,/b/,/m/ appear visually the same.
These phonemes are grouped into the same viseme class.
Figure 1. System overview.
Each viseme is usually identified with a within-class
recognition rate of 70–75%.18,19 Phonemes /p/, /b/, /m/
and /th/, /dh/ are universally recognized visemes; the
remaining phonemes are not universally recognized
across languages owing to variations of lip shape in
different individuals. From a statistical point of view, a
viseme is a random vector, because a viseme observed at
different times or under different phonetic contexts may
vary in appearance.
The basic underlying assumption of our visible
speech synthesis approach is that the complete set of
mouth shapes associated with human speech can be
reasonably approximated by a linear combination of a
set of visemes. In practice, this assumption has
proven acceptable in most authoring tools for 3D
face animation. These systems use shape blending, a
special example of linear combination, to synthesize
visible speech. In this work, we chose 16 visemes from
images of the human subject. Each viseme image was
chosen at the point at which the mouth shape was
judged to be at its extreme shape. Phonemes that look
alike visually fall into the same viseme category. This
was done in a subjective manner, by comparing the
viseme images visually to assess their similarity. The 3D
feature points for each viseme are reconstructed by
our motion capture system. When synthesizing visible
speech from text, we map each phoneme to a viseme to
produce the visible speech. This ensures a unique
viseme target is associated with each phoneme. We
recorded sequences of nonsense words that contain all
possible motion transitions from one viseme to another.
After the whole corpus was recorded and digitized, the
3D facial feature points were reconstructed. Moreover,
the motion trajectory of each recorded diviseme was
stored as an instance of that diviseme. It should be noted that
diphthongs need to be specially treated. Since a
diphthong, such as /ay/ in ‘pie’, consists of two vowels
with a transition between them, i.e., /aa/, /iy/, the
diphthong transition is visually simulated by a diviseme
corresponding to the two vowels.
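Converting a phoneme string into the diviseme sequence to fetch from the corpus, with diphthongs expanded into their component vowels, can be sketched as below. The function and table names are illustrative; only the /ay/ to /aa/, /iy/ decomposition is taken from the text.

```python
# Hypothetical diphthong decomposition table; the paper names
# /ay/ -> /aa/, /iy/ as one example.
DIPHTHONGS = {"ay": ("aa", "iy")}

def to_divisemes(phonemes):
    """Expand diphthongs into their two component vowels, then pair
    adjacent units: each pair is one diviseme to fetch from the corpus."""
    units = []
    for p in phonemes:
        units.extend(DIPHTHONGS.get(p, (p,)))
    return list(zip(units, units[1:]))
```

For the word 'cool' (/k uw l/) this yields the two divisemes /k-uw/ and /uw-l/ named later in the text.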
The mapping from phonemes to visemes is many-to-one:
phonemes such as /p/, /b/ and /m/ are visually
identical and differ only in sound. It is also
important to note that the mapping from
visemes to phonemes is also one-to-many. One pho-
neme may have different mouth shapes due to the
coarticulation effect. In visible speech synthesis, a key
challenge is to model coarticulation. Coarticulation20
relates to the observation that a speech segment is
influenced by its neighbouring speech segments during
speech production. The coarticulation effect from a
phoneme’s adjacent two phonemes is referred to as
the primary coarticulation effect of the phoneme. The
coarticulation effect from a phoneme’s two second
nearest neighbour phonemes is called the secondary
coarticulation effect. Coarticulation enables people to
pronounce speech in a smooth, rapid, and relatively
effortless manner. Coarticulation modeling is a very
important research topic for natural visible speech ani-
mation and speech production. Some methods for mod-
elling coarticulation in visible speech have been
proposed to increase the realism of animated speech.
When considering the contribution of a phoneme to
visible speech perception,21 there are three phoneme
classes. They are invisible phonemes, protected pho-
nemes, and ‘normal’ phonemes. An invisible phoneme
is a phoneme in which the corresponding mouth shape
is dominated by its following vowel, such as the first
segment in ‘car’, ‘golf’, ‘two’, ‘tea’. The invisible pho-
nemes include the phonemes /t/, /d/, /g/, /h/ and /k/.
In our implementation, lip shapes of all invisible pho-
nemes are directly modelled by motion capture data;
therefore, this type of primary coarticulation from the
adjacent two phonemes is very well modelled. The pro-
tected phonemes are those whose mouth shape must be
preserved in visible speech synthesis to ensure accurate
lip motion. These phonemes include /m/, /b/ and /p/
and /f/, /v/, as in ‘man’, ‘ban’, ‘pan’, ‘fan’ and ‘van’.
Pelachaud et al.22 tackle coarticulation using a look-
ahead model23 that considers articulatory adjustment on
a sequence of consonants followed by or preceded by a
vowel. On the other hand, Cohen and Massaro24 ex-
tended the Lofqvist25 gestural production model for
visible speech animation. In their model, dominance
functions are used to model the effects of coarticulation.
For each segment of speech, there is a dominance
function to control its effect on coarticulation. The
dominance function is defined as a negative exponent
function that contains five parameters to control the
dominance function.
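A commonly cited simplified form of such a dominance function is a negative exponential in the time offset from the segment centre, with separate rise and fall rates. The sketch below is illustrative: the full model has additional parameters, and the parameter names and default values here are assumptions, not values from the paper.

```python
import math

def dominance(tau, alpha=1.0, theta_rise=0.1, theta_fall=0.15, c=1.0):
    """Negative-exponential dominance of a speech segment at time
    offset tau (e.g. ms) from the segment centre.

    alpha controls the magnitude, theta_rise/theta_fall the decay
    rates before and after the centre, and c the shape of the decay.
    """
    theta = theta_rise if tau <= 0 else theta_fall
    return alpha * math.exp(-theta * abs(tau) ** c)
```

At any time instant, the blended lip target is then the dominance-weighted average of the overlapping segments' targets.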
The approach we propose in this paper is to directly
concatenate motions of 3D facial feature points for
English diphones/divisemes. For instance, as shown
in Figure 2 for the word ‘cool’, its phonetic transcription
is /kuwl/. The divisemes in the word are: /k-uw/,
/uw-l/. In order to synthesize the visible speech of the
word, the synthesis system will concatenate the two
motion sequences in motion capture data. In Figure 2,
the top picture depicts three visemes in the word ‘cool’,
while the bottom picture depicts the actual three key
frames of lip shapes mapping from the source face in
one motion capture sequence.
Compared with Cohen and Massaro’s model, our
approach models the visual transition from one
phoneme to another directly from motion capture data,
which is encoded for diphones as parameterized trajec-
tories. Therefore, a phoneme’s primary coarticulation
effect can be well modelled. Finally, because the tongue
movement cannot be directly measured in our motion
capture system, a special method is developed to tackle
the coarticulation effect of the tongue.
Motion Capture
Most motion capture systems for face animation are
based on optical capture. These systems are similar to
the optical systems used to capture body motions.
Reflective dots need to be glued onto the human face
at key feature positions, such as eyebrows, the outer
contour of the lips, the cheeks and chin. Our motion
capture system consists of a camcorder, two mirrors
and 31 facial markers in green and blue, as shown in
Figure 3. The video format is NTSC with a frame rate of
29.97 frames per second. Each video clip was recorded
and synchronized with the audio signal.
A facial marker tracking system was designed to
automatically track the markers’ motions. A camera
calibration system was also designed. The 2D trajectories
from the two views are used to reconstruct the 3D
positions of the facial markers as shown in Figure 4(a).
Moreover, a head pose estimation algorithm was de-
signed to estimate head poses at different times.
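Reconstruction from the two mirror views can be illustrated by the classic midpoint method: each view contributes a viewing ray, and the 3D marker is taken as the midpoint of the shortest segment between the two rays. This is a simplified sketch assuming already-calibrated rays, not the paper's full calibration pipeline.

```python
import numpy as np

def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays p = o1 + s*d1
    and p = o2 + u*d2 (directions need not be unit length)."""
    o1 = np.asarray(o1, float)
    d1 = np.asarray(d1, float)
    o2 = np.asarray(o2, float)
    d2 = np.asarray(d2, float)
    # Normal equations for minimising |(o1 + s d1) - (o2 + u d2)|^2
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([(o2 - o1) @ d1, (o2 - o1) @ d2])
    s, u = np.linalg.solve(A, b)
    return 0.5 * ((o1 + s * d1) + (o2 + u * d2))
```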
Figure 2. Two divisemes k-uw and uw-l in the word ‘cool’
represent the transition motions from viseme /k/ to viseme
/uw/ and from viseme /uw/ to /l/. The top picture shows three
visemes in the word ‘cool’. The bottom picture shows three key
frames of lip shapes in one motion sequence from one instance
of motion capture data.
Figure 3. Image captured by the mirror and camcorder-based
motion capture system.
Figure 4. (a) Thirty-one 3D dots constructed by the motion capture system; (b) Gurney’s 3D face mesh.
A visual corpus of a subject speaking a set of nonsense
words was recorded. Each word was chosen so that it
visually instantiated motion transition from one viseme
to another in American English. A total of 16 visemes
were modelled. The mapping table from phonemes to
visemes is shown in Table 1. Theoretically speaking,
increasing the number of modelled visemes should lead
to more accurate synthetic visible speech. The motions
of a diviseme represent the motion transition from the
approximate midpoint of one viseme to the approxi-
mate midpoint of an adjacent viseme, as shown in
Figure 2. A speech recognition system operating in
forced-alignment mode was used to segment the speech
for each diviseme (i.e., the recognizer determines the
time boundaries between phonemes and thus the
diviseme timing information). Each
segmented video clip contained a sequence of images
spanning the duration of the two complete phonemes
corresponding to one diviseme. The diviseme text cor-
pus is listed in the Appendix. It is a subset of the text
corpus used for speech synthesis based on diphone
modelling.
Linear Viseme Space
One important problem is how to map the motions of a
set of 3D facial feature points reconstructed from a
motion capture system onto a 3D face model. As shown
in Figure 4, the reconstructed facial feature points are
sparse, while the vertices in the 3D mesh of a face model
are dense. Therefore, many vertices in the 3D face model
have no corresponding points in the set of the recon-
structed 3D facial feature points. However, the move-
ments of vertices in the 3D face model have some type of
special correlations due to the physical constraints of
facial muscles. In order to estimate the movement
correlation among the vertices in the 3D face model, a
set of viseme targets is manually designed for the 3D
face model to provide learning examples. The set of
viseme targets is used as training examples to learn a
mapping from the set of 3D facial feature points in the
source face to the set of vertices in the target 3D face
model. For instance, as shown in Figures 5 and 6, there
Table 1. Mapping from phonemes to visemes

1. /i:/ week; /I_x/ roses
2. /I/ visual; /&/ above
3. /9r/ read; /&r/ butter; /3r/ bird
4. /U/ book; /oU/ boat
5. /ei/ stable; /@/ bat; /^/ above; /E/ bet
6. /A/ father; />/ caught; /aU/ about; />i/ boy
7. /ai/ tiger
8. /T/ think; /D/ thy
9. /S/ she; /tS/ church; /dZ/ judge; /Z/ azure
10. /w/ wish; /u/ boot
11. /s/ sat; /z/ resign
12. /k/ can; /g/ gap; /h/ high; /N/ sing; /j/ yes
13. /d/ debt
14. /v/ vice; /f/ five
15. /l/ like; /n/ knee
16. /m/ map; /b/ bet; /p/ pat
17. /sil/ neutral expression
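In code, the many-to-one phoneme-to-viseme mapping is just a table lookup. The sketch below covers only a few of the viseme classes, using the phoneme symbols that appear elsewhere in the paper; the partial coverage and the names are illustrative.

```python
# Illustrative subset of the phoneme-to-viseme table
# (classes 8, 12, 16 and the silence class 17).
PHONEME_TO_VISEME = {
    "th": 8, "dh": 8,
    "k": 12, "g": 12, "h": 12, "ng": 12, "y": 12,
    "m": 16, "b": 16, "p": 16,
    "sil": 17,
}

def viseme_class(phoneme):
    """Look up the viseme class for a phoneme (many-to-one)."""
    return PHONEME_TO_VISEME[phoneme]
```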
Figure 5. Sixteen visemes based on captured images. The phonetic symbols are defined in the following words: /iy/ week; /ih/
visual; /uw/ spoon; /r/ read; /uh/ book; /ey/ stable; /aa/ father; /ay/ tiger; /th/ think; /sh/ she.
are 16 viseme targets for the source face and target face
respectively. Each mouth shape in the source face as
shown in Figure 5 should be mapped to its correspond-
ing mouth shape in the target face shown in Figure 6.
This problem can be seen as scattered data
interpolation.26 In this paper, an approach based on the
inverse problem of shape blending is proposed to solve
this scattered data interpolation problem.
Linear combinations of a set of image or graph prototypes at different poses or views can efficiently approximate complex objects. This approach has been widely used in the computer vision and computer animation communities.16,27–31 Viseme blending is a special case of such linear shape combination, and it has proven to be a very efficient way to synthesize visible speech in most 3D animation authoring tools, such as 3ds Max and Maya. While this approach is widely used in animation authoring tools, the shape-blending coefficients, or animation curves, must be defined manually by animators, and this process can be quite time-consuming: for a given input text and speech, the animated visible speech is made accurate by repeatedly adjusting the shape-blending coefficients. The question is how to automatically obtain the linear coefficients of a set of visemes that approximate the mouth shape at each point of a lip motion trajectory. The model of a linear combination of a set of prototypes is much like a linearly deformable shape model capable of approximating a wide range of shape variations.
Let $G_i$ ($i = 0, 1, 2, \ldots, V-1$) be $S_i$ or $T_i$, where $S_i$ and $T_i$ represent viseme targets for the source face and the target face respectively. $\{G_i\}$ spans a linear subspace $\{G \mid G = \sum_{i=0}^{V-1} w_i G_i\}$: for the source face, $S = \sum_{i=0}^{V-1} w_i S_i$; for the target face, $T = \sum_{i=0}^{V-1} w_i T_i$, where $\{w_i\}$ are the linear combination, or shape-blending, coefficients. We wish to find a mapping function $f(S)$ that maps $S_i$ to $T_i$, i.e. $f(S_i) = T_i$, such that any observation vector $S$ provided by the motion capture system is mapped to a $T$ in the target face that is visually similar to $S$. Once the coefficients are estimated from the observation data in the source face, the observed vector $S$ is mapped to $T$. The simplest form of mapping function is linear in $S$. In this case, if $S = \sum_{i=0}^{V-1} w_i S_i$ then $f(S) = T$, because the following holds when $f$ is linear:

$$f(S) = f\!\left(\sum_{i=0}^{V-1} w_i S_i\right) = \sum_{i=0}^{V-1} w_i f(S_i) = \sum_{i=0}^{V-1} w_i T_i = T$$
Suppose that there are $N$ frames of observation vectors $S(t)$, $t = 1, 2, \ldots, N$ in one observed motion capture sequence, and that the shape-blending coefficients corresponding to the $t$th frame are $w_i(t)$, $i = 0, 1, \ldots, V-1$. Robust shape-blending coefficients can be estimated by minimizing the following fitting error:

$$\min_{w} \sum_{t=1}^{N} \left( \left\| S(t) - \sum_{i=0}^{V-1} w_i(t) S_i \right\|^2 + \lambda \sum_{i=0}^{V-1} w_i^2(t) + \gamma \sum_{i=0}^{V-1} \left( w_i(t+1) - 2 w_i(t) + w_i(t-1) \right)^2 \right)$$

with constraints $l_i \le w_i(t) \le h_i$, $\sum_{i=0}^{V-1} w_i(t) = 1$, $w_i(0) = 2 w_i(1) - w_i(2)$, and $w_i(N+1) = 2 w_i(N) - w_i(N-1)$, where $w = \{w(t)\}_{t=1}^{N}$ and $w(t) = \{w_i(t)\}_{i=0}^{V-1}$; $l_i = -\varepsilon_i$ and $h_i = 1 + \delta_i$, with $\varepsilon_i \ge 0$ and $\delta_i \ge 0$ small positive parameters chosen so that more robust and more accurate shape-blending coefficients can be estimated by solving the above optimization problem; $\lambda$ is a positive regularization parameter that controls the amplitude of the shape-blending coefficients, and $\gamma$ is a positive regularization parameter that controls the smoothness of the trajectory of the shape-blending coefficients. The constraint that the shape-blending coefficients sum to one minimizes expansion or shrinkage of the polygon meshes when the mapping function is applied.

Figure 6. Sixteen visemes designed for Gurney's model.
The above optimization problem involves convex
quadratic programming in which the objective function
is a convex quadratic function and the constraints are
linear. This problem can be solved by the primal–dual
interior-point algorithm.32
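As a concrete illustration, the per-frame core of this estimation can be sketched with an off-the-shelf solver. The sketch below is ours, not the paper's implementation: it uses SciPy's SLSQP routine rather than a primal–dual interior-point method, drops the temporal smoothness term for brevity, and all variable names and toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def fit_blend_weights(S_obs, S_vis, lam=1e-3):
    """Estimate shape-blending weights w for one observed frame.

    S_obs : (D,) observed marker vector S(t).
    S_vis : (V, D) source-face viseme targets S_i (rows).
    Minimizes ||S_obs - sum_i w_i S_i||^2 + lam * sum_i w_i^2
    subject to sum_i w_i = 1 and 0 <= w_i <= 1.
    """
    V = S_vis.shape[0]

    def objective(w):
        resid = S_obs - w @ S_vis
        return resid @ resid + lam * (w @ w)

    cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    res = minimize(objective, np.full(V, 1.0 / V),
                   bounds=[(0.0, 1.0)] * V, constraints=cons, method="SLSQP")
    return res.x

# Toy example: three viseme targets in a 2D "marker" space; the observation
# lies halfway between the first two targets.
S_vis = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
S_obs = np.array([0.5, 0.0])
w = fit_blend_weights(S_obs, S_vis)
```

In the full formulation, all $N$ frames are solved jointly so that the $\gamma$ term can couple consecutive frames; a general-purpose QP solver handles that case directly.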
In order to reduce the computational load in determining the mapping function, PCA (principal component analysis)33 can be applied. PCA is a statistical model that decomposes high-dimensional data into a set of orthogonal vectors; in this way, a compact representation of high-dimensional data using lower-dimensional parameters can be estimated. Denote $B = (\Delta T_1, \Delta T_2, \ldots, \Delta T_{V-1})$, $\Sigma = B B^t$, $\Delta T_i = T_i - T_0$, and $\Delta T = T - T_0$, where $T_0$ denotes the neutral expression target. The eigenvectors of $\Sigma$ are $E = (\epsilon_0, \epsilon_1, \ldots, \epsilon_{3U-1})$ with $\|\epsilon_i\| = 1$, where $U$ is the total number of vertices in the 3D face model. The projection of $T$ using $M$ main components to approximate it is

$$\Delta T \approx \sum_{j=0}^{M-1} \beta_j \epsilon_j$$

where the linear combination coefficients are $\beta_j = \epsilon_j^t \Delta T$. Usually $M$ is less than $V$ after discarding the last principal components. Each viseme target $\Delta T_i$ is decomposed as the following linear combination by PCA:

$$\Delta T_i \approx \sum_{j=0}^{M-1} \beta_{ij} \epsilon_j$$

where $\beta_{ij} = \epsilon_j^t \Delta T_i$. Note that the coordinates of $\Delta T_i$ under the orthogonal basis $\{\epsilon_j\}_{j=0}^{M-1}$ are $(\beta_{i0}, \beta_{i1}, \ldots, \beta_{i,M-1})^t$. From the two equations above, we have

$$\Delta T = \sum_{i=0}^{V-1} w_i \Delta T_i \approx \sum_{j=0}^{M-1} \sum_{i=0}^{V-1} w_i \beta_{ij} \epsilon_j \approx \sum_{j=0}^{M-1} \bar{\beta}_j \epsilon_j, \qquad \bar{\beta}_j = \sum_{i=0}^{V-1} w_i \beta_{ij} \quad (1)$$
After the shape blending coefficients are estimated,
the mapping function is obtained. Thus, the motions of
3D trajectories of facial markers are mapped onto the
motions of vertices in the 3D face model.
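The compact representation of equation (1) can be sketched as follows. The data here are random stand-ins for real viseme displacement targets, and the dimensions are hypothetical; the point is that the blended coefficients $\bar{\beta}_j$ live entirely in the $M$-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: V = 5 viseme displacement targets over 3U = 12 coordinates.
V, D = 5, 12
dT = rng.normal(size=(V, D))            # rows are Delta T_i = T_i - T_0

# Orthonormal eigenvectors of Sigma = B B^t, obtained via SVD of B = dT^t.
U_full, s, _ = np.linalg.svd(dT.T, full_matrices=False)
M = 3
E = U_full[:, :M]                       # basis (epsilon_0, ..., epsilon_{M-1})

beta = dT @ E                           # beta_ij = epsilon_j^t Delta T_i

# For blend weights w, the compact coefficients of Delta T = sum_i w_i Delta T_i
# follow equation (1): beta_bar_j = sum_i w_i beta_ij, computed without ever
# touching the full vertex dimension at animation time.
w = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
beta_bar = w @ beta
```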
Time Warping
To blend motions at the juncture of two divisemes, the
time-scale of the original motion capture data must be
warped onto the time-scale of the target speech used to
drive the animation. Suppose that the duration of a phoneme in the target speech stream ranges over $[\tau_0, \tau_1]$, and the time interval of its corresponding diviseme in the motion capture data ranges over $[t_0, t_1]$. Furthermore, the motion trajectory vector is denoted as $m(t)$, whose elements are the $\bar{\beta}_j$ defined in equation (1). The time interval should be transformed into $[\tau_0, \tau_1]$ so that the motion trajectory defined on $[t_0, t_1]$ is embedded within $[\tau_0, \tau_1]$. Therefore we define the time-warping function as

$$t(\tau) = t_0 + \frac{\tau - \tau_0}{\tau_1 - \tau_0} \, (t_1 - t_0) \quad (2)$$

In this way, the motion vector $m(t)$ is transformed into the final time-warped motion vector $n(\tau)$ as follows:

$$n(\tau) = m(t(\tau)) \quad (3)$$
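Equations (2) and (3) amount to a linear rescaling of the time axis; a minimal sketch (function and variable names are our own):

```python
def warp_time(tau, tau0, tau1, t0, t1):
    """Equation (2): map a target-speech time tau in [tau0, tau1]
    onto the motion-capture time axis [t0, t1]."""
    return t0 + (tau - tau0) / (tau1 - tau0) * (t1 - t0)

def warped_motion(m, tau, tau0, tau1, t0, t1):
    """Equation (3): time-warped motion vector n(tau) = m(t(tau)),
    where m is the motion trajectory sampled on the capture time axis."""
    return m(warp_time(tau, tau0, tau1, t0, t1))
```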
Motion Vector Blending

The goal in blending the juncture of two adjacent divisemes in a target utterance is to concatenate the two divisemes smoothly. Let two divisemes be denoted by $V_i = (p_{i,0}, p_{i,1})$ and $V_{i+1} = (p_{i+1,0}, p_{i+1,1})$ respectively, where $p_{i,0}, p_{i,1}$ are the two visemes in $V_i$. The visemes $p_{i,1}$ and $p_{i+1,0}$ are different instances of the same viseme; they form the juncture of divisemes $V_i$ and $V_{i+1}$. Suppose that in a speech segment the durations of the two visemes $p_{i,1}, p_{i+1,0}$ should be embedded into the interval $[\tau_0, \tau_1]$. The time-warping functions discussed above are used to transform the time intervals of the two visemes into $[\tau_0, \tau_1]$; their transformed motion vectors are denoted by $n_{i,1}(\tau) = m_{i,1}(t(\tau))$ and $n_{i+1,0}(\tau) = m_{i+1,0}(t(\tau))$, so the time domains of the two time-warped motion vectors are now the same. The juncture of the two divisemes is derived by blending the two time-aligned motion vectors as follows:

$$h_i(\tau) = f_i(\tau)\, n_{i,1}(\tau) + \left(1 - f_i(\tau)\right) n_{i+1,0}(\tau) \quad (4)$$

where the blending functions $f_i(\tau)$ are chosen as parametric rational $G^n$ continuous blending functions:34

$$b_{n,\mu}(t) = \frac{\mu (1-t)^{n+1}}{\mu (1-t)^{n+1} + (1-\mu)\, t^{n+1}}, \qquad t \in [0,1],\ \mu \in (0,1),\ n \ge 0 \quad (5)$$

Other types of blending functions, such as polynomial blending functions, may also be used. For example, $p(t) = 1 - 3t^2 + 2t^3$ is a $C^1$ blending function, while $p(t) = 1 - (6t^5 - 15t^4 + 10t^3)$ is a $C^2$ blending function. The blending function acts like a low-pass filter to smoothly concatenate the two divisemes. The blending functions are defined as

$$f_i(\tau) = b_{n,\mu}\!\left(\frac{\tau - \tau_0}{\tau_1 - \tau_0}\right) \quad (6)$$
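A small sketch of equations (4)–(6), assuming scalar motion values for simplicity; the function names are illustrative. Note that $b_{n,\mu}(0) = 1$ and $b_{n,\mu}(1) = 0$, so the blend starts on the left diviseme and ends on the right one.

```python
def rational_blend(t, n=1, mu=0.5):
    """Equation (5): rational G^n blending function b_{n,mu}(t), t in [0, 1]."""
    num = mu * (1.0 - t) ** (n + 1)
    den = num + (1.0 - mu) * t ** (n + 1)
    return num / den

def blend_juncture(n_left, n_right, tau, tau0, tau1, n=1, mu=0.5):
    """Equations (4) and (6): blend two time-aligned motion trajectories
    across the juncture interval [tau0, tau1]."""
    f = rational_blend((tau - tau0) / (tau1 - tau0), n, mu)
    return f * n_left(tau) + (1.0 - f) * n_right(tau)
```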
Trajectory Synthesis as a Graph Search
We assume that there is a set of diviseme motion
sequences for each diviseme, i.e., there are multiple
instances of lip motions for each diviseme. In the
following, we will discuss how best to concatenate these
sequences for any input text.
Lip Motion Graph. As shown in Figure 7, the collection of diviseme motion sequences can be represented
as a directed graph. Each diviseme motion example is
denoted as a node in the graph. The edge represents the
transition from one diviseme to another. In this way, we
wish to search the optimal path in the graph that
constitutes the optimally concatenated visible speech.
To achieve this goal, we need to define the optimal
objective functions. The objective function is defined to
measure the degree of smoothness of the synthetic
visible speech. Theoretically, the objective function can be defined to minimize the roughness of the motion trajectory:

$$\min_{\text{path}} \int_{t_0}^{t_L} \left\| V^{(2)}(t) \right\|^2 dt \quad (7)$$

where $V(t)$ is the concatenated lip motion for the input text.
However, the above optimization problem (equation 7) is mathematically difficult to solve directly, so a simplification is needed. As in concatenative speech synthesis, we need to define two cost functions: a target
cost and a concatenation cost. The target cost is a
measure of distance between a given candidate’s fea-
tures and the desired target features. For example, if
observation data about lip motion is provided, the target
features can be lip height, lip width, lip protrusion and
speech features, etc. The target cost corresponds to the
node cost in the graph, while the concatenation cost
corresponds to the edge cost. The concatenation cost
represents the cost of the transition from one diviseme
to another. After the two costs have been defined, a
Viterbi algorithm can be used to compute the optimal
path. Because the basic unit used in this work is the
diviseme, the primary coarticulation effect can be mod-
elled very well. For an input text, its corresponding
phonetic information is known. When no observation
data of lip motion is provided for the target specifica-
tion, the target cost is defined to be zero. The other
reason we let the target cost be zero is that spectral
information extracted from the speech signal may not
provide sufficient information to determine a realistic
synthetic visible speech sequence. For example, the
acoustic features of the speech segments /s/ and /p/ in an utterance of the word 'spoon' are quite different from those of the phoneme /u/, whereas the lip shapes of /s/ and /p/ in this utterance are very similar to that of /u/.
The concatenation cost is defined as the degree of smoothness of the visual features at the juncture of the two divisemes. Consider a diviseme sequence $V_i = (p_{i,0}, p_{i,1})$, $i = 1, 2, \ldots, N$. The concatenation cost of the two units $V_i = (p_{i,0}, p_{i,1})$ and $V_{i+1} = (p_{i+1,0}, p_{i+1,1})$ is defined as

$$C_{V_i, V_{i+1}} = \int \left\| h_i^{(2)}(\tau) \right\|^2 d\tau, \qquad i = 1, 2, \ldots, N-1 \quad (8)$$

where $V_i$ is a diviseme lip motion instance, $V_i \in E_i$, and $E_i$ is the set of diviseme lip motion instances. The function $h_i(\tau)$ is defined in equation (4); the integral of the squared second derivative $h_i^{(2)}(\tau)$ measures the degree of smoothness at the juncture of the two divisemes. The total cost is

$$C = \sum_{i=1}^{N-1} C_{V_i, V_{i+1}} \quad (9)$$
Figure 7. Graph for concatenating visible speech units by the Viterbi search algorithm. Each state represents one diviseme motion sequence. The edge from one state to another represents one possible concatenation path. For the English word 'county' there are four divisemes, /kau/, /aun/, /nt/, /ti:/; each node in the same rectangle is an instance of the same diviseme.
The visible speech unit concatenation becomes the following optimization problem:

$$\min_{(V_1, V_2, \ldots, V_N)} C = \sum_{i=1}^{N-1} C_{V_i, V_{i+1}} \quad (10)$$

subject to the constraints $V_i \in E_i$.
Viterbi Search. The above problem can be solved by
searching the shortest path from the first diviseme to the
last diviseme, in which each node is a diviseme motion
instance. Cosatto and Graf4 proposed an approach to
visible speech synthesis based on Viterbi search; how-
ever, the cost definition proposed in this paper is
different from that proposed in Cosatto and Graf4 and
Graf et al.5
The distance between two nodes is the concatenation cost, and dynamic programming offers an efficient solution.35 Let $V_i \in E_i$ denote a node in stage $i$ and let $d(V_i)$ be the shortest distance from node $V_i \in E_i$ to the destination $V_N$, with $d(V_N) = 0$. We obtain

$$d(V_i) = \min_{V_{i+1} \in E_{i+1}} \left[ C_{V_i, V_{i+1}} + d(V_{i+1}) \right], \qquad i = N-1, N-2, \ldots, 1 \quad (11)$$

$$V_{i+1}^{*} = \arg\min_{V_{i+1} \in E_{i+1}} \left[ C_{V_i, V_{i+1}} + d(V_{i+1}) \right] \quad (12)$$

where $C_{V_i, V_{i+1}}$ denotes the concatenation cost from node $V_i$ to $V_{i+1}$. These are the recursive equations needed to solve the problem. The best concatenated diviseme sequence is $\{V_1^{*}, V_2^{*}, \ldots, V_N^{*}\}$.
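The backward recursion of equations (11) and (12) can be sketched as follows for a staged graph with precomputed concatenation costs; stage indices are 0-based here, and the cost matrices are hypothetical inputs.

```python
import numpy as np

def best_diviseme_path(costs):
    """Backward dynamic programming over the lip motion graph
    (equations 11 and 12), with stages indexed from 0.

    costs[i] : (len(E_i), len(E_{i+1})) array of concatenation costs
               between instances of adjacent divisemes.
    Returns the list of chosen instance indices, one per stage.
    """
    N = len(costs) + 1
    d = [None] * N                     # d[i][k]: shortest cost from instance k to the end
    arg = [None] * (N - 1)
    d[N - 1] = np.zeros(costs[-1].shape[1])
    for i in range(N - 2, -1, -1):
        total = costs[i] + d[i + 1]    # C_{V_i, V_{i+1}} + d(V_{i+1}), broadcast per column
        d[i] = total.min(axis=1)
        arg[i] = total.argmin(axis=1)
    path = [int(np.argmin(d[0]))]      # best starting instance
    for i in range(N - 1):
        path.append(int(arg[i][path[-1]]))
    return path

# Three divisemes with two recorded instances each (hypothetical costs).
costs = [np.array([[1.0, 5.0], [4.0, 2.0]]),
         np.array([[0.5, 9.0], [9.0, 0.5]])]
path = best_diviseme_path(costs)
```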
Smoothing Algorithm
The concatenated trajectory still needs to be smoothed to
achieve accurate visible speech. Ezzat et al.2 proposed an
approach to making the trajectory smooth for visible
speech synthesis. The smoothed trajectory is found by
minimizing the sum of errors between target values and
the values on the trajectory and a smoothness term
computed from the derivative of the trajectory. In the
following, we propose a trajectory-smoothing technique
based on the spline function. Denote the synthetic
trajectory of one component of a parameter vector as
$f(t)$. The trajectory is obtained by the concatenation approach discussed in the previous section. Suppose that its samples are denoted by $f_i = f(t_i)$, with $t_0 < t_1 < \cdots < t_L$. The objective is to find a smooth curve $g(t)$ that best fits the data, i.e. that minimizes the following objective function:

$$\sum_{i=0}^{L} \alpha_i (g_i - f_i)^2 + \int_{t_0}^{t_L} \left[ g^{(2)}(t) \right]^2 dt \quad (13)$$
where $\alpha_i$ is a weighting factor that controls how closely $g_i = g(t_i)$ follows each target $f_i$. The solution of the above problem36 is

$$g = \left( I + P^{-1} C^t A^{-1} C \right)^{-1} f \quad (14)$$

where $I$ is the identity matrix, $f = (f_0, f_1, \ldots, f_L)^t$, and

$$A = \frac{1}{6} \begin{pmatrix} 4 & 1 & & 0 \\ 1 & 4 & \ddots & \\ & \ddots & \ddots & 1 \\ 0 & & 1 & 4 \end{pmatrix}, \qquad C = \begin{pmatrix} 1 & -2 & 1 & & & 0 \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ 0 & & & 1 & -2 & 1 \end{pmatrix}, \qquad P = \mathrm{diag}(\alpha_0, \alpha_1, \ldots, \alpha_L)$$
Note that the control parameter $\alpha_i$ depends on the phonetic information: a large value implies that the smoothed curve tends to stay near the value $f(t_i)$ at time $t_i$, and vice versa. For labial or labiodental phonemes, such as /p/, /b/, /m/, /f/, /v/, the value $\alpha_i$ should be set large; otherwise the smoothed target value $g_i$ may fall far from the actual target value $f_i$, preventing the lips from closing completely.
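Equation (14) can be sketched directly with dense linear algebra; a practical implementation would exploit the banded structure of $A$ and $C$. The matrix sizes, weights, and sample trajectory below are illustrative only.

```python
import numpy as np

def smooth_trajectory(f, alpha):
    """Solve equation (14): g = (I + P^{-1} C^t A^{-1} C)^{-1} f.
    alpha[i] weights how closely g_i must track f_i (equation 13)."""
    n = len(f)                                    # n = L + 1 samples
    # A = (1/6) * tridiag(1, 4, 1), size (L-1) x (L-1)
    A = (np.diag(np.full(n - 2, 4.0)) +
         np.diag(np.ones(n - 3), 1) +
         np.diag(np.ones(n - 3), -1)) / 6.0
    # C: second-difference matrix, size (L-1) x (L+1)
    C = np.zeros((n - 2, n))
    for i in range(n - 2):
        C[i, i:i + 3] = (1.0, -2.0, 1.0)
    P_inv = np.diag(1.0 / np.asarray(alpha, dtype=float))
    return np.linalg.solve(np.eye(n) + P_inv @ C.T @ np.linalg.inv(A) @ C,
                           np.asarray(f, dtype=float))

# A concatenated trajectory with a rough bump; large weights pin the endpoints
# (as one would for a bilabial closure), small weights allow smoothing.
f = np.array([0.0, 1.0, 5.0, 3.0, 4.0])
g = smooth_trajectory(f, alpha=[100.0, 1.0, 1.0, 1.0, 100.0])
```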
Audiovisual Synchronization
We used the Festival text-to-speech (TTS)37 system or the SONIC speech recognizer in forced-alignment mode6 to derive time-aligned transcriptions of synthesized or recorded speech, respectively. Note that Festival is a diphone-based concatenative speech synthesizer, so its operation is closely analogous to our approach of concatenating divisemes: diphones are represented by short speech wave files covering the transition from the middle of one phonetic segment to the middle of another. In order to produce a visible speech stream synchronized with the speech stream, our animation engine first extracts the duration of each diphone computed by the TTS system or the SONIC aligner.
The animation engine then creates a diviseme stream.
The diviseme stream consists of the concatenated divi-
semes corresponding to the diphones. The animation
engine loads the appropriate divisemes into the divi-
seme stream by looking for the corresponding diphones.
Usually the duration of a diviseme should be warped to
the duration of its corresponding diphone, because the
speech signal is used to control the synchronization
process. Suppose that the expected animation frame rate is F frames per second and the total duration of the audio stream is T milliseconds. The total number of frames will be about ⌊F·T/1000⌋ + 1, and the duration between two frames is C = 1000/F milliseconds. There are two approaches to synchronizing the visible speech with the auditory speech: synchronization at a fixed frame rate, and synchronization at the maximal frame rate the computer can sustain.
The synchronization algorithm for a fixed frame rate
is as follows and is shown in Figure 9(a):
1. Play the speech signal and render a frame of the image simultaneously; record the system time t0 at which speech playback starts. When the rendering of the frame is finished, record the time stamp t1.

2. If t1 − t0 < C, wait for C − (t1 − t0), then go to step 1 and repeat the process; otherwise (t1 − t0 ≥ C) go to step 1 directly.
The synchronization algorithm with maximal frame rate
for variable frame rate is as follows and is shown in
Figure 9(b):
1. Play the speech signal and render a frame of the image simultaneously; record the system time t0 at which speech playback starts. When the rendering of the frame is finished, record the time stamp t1.

2. Get the animation parameters at frame position v = (t1 − t0)/C; go to step 1 and repeat the process.
One disadvantage of this approach is that the anima-
tion engine is greedy, i.e., it uses most of the CPU cycles.
Nevertheless, it can produce a higher animation rate.
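The fixed-frame-rate algorithm can be sketched as below. This is our variant, which schedules each frame against the absolute playback start time rather than testing t1 − t0 per cycle, so small timing errors do not accumulate as drift; the function names are illustrative and audio playback is stubbed out.

```python
import time

def play_fixed_rate(render_frame, num_frames, fps=30.0):
    """Fixed-frame-rate synchronization loop.  render_frame(i) draws frame i;
    audio playback is assumed to start together with the first frame."""
    C = 1.0 / fps                         # expected duration between frames (seconds)
    t0 = time.monotonic()                 # start time of speech playback
    for i in range(num_frames):
        render_frame(i)
        t1 = time.monotonic()
        ahead = (t0 + (i + 1) * C) - t1   # time left in this frame's slot
        if ahead > 0:                     # rendering finished early: wait it out
            time.sleep(ahead)
        # otherwise we are behind schedule and start the next frame at once
```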
Coarticulation Modelling of Tongue Movement
The tongue also plays an important role in visible
speech perception and production. Some phonemes
not distinguished by their corresponding lip shapes might be differentiated by tongue positions, such as /f/ and /th/. Another role of the 3D tongue model is to show the positions of different articulators for different phonemes from different orientations, using a semi-transparent face, to help people learn pronunciation.
Even though only a small part of the tongue is visible during speech production, this visible information can increase visible speech intelligibility.
In addition, the tongue is highly mobile and deform-
able. For each phoneme, a tongue target38 was designed.
Tongue posture control consists of 24 parameters
manipulated by sliders in a dialogue box. One 3D ton-
gue model is shown in Figures 10(a) and 10(b). Because
motion capture data were not available for tongue
movement, an approach combining data-smoothing
techniques with heuristic coarticulation rules is pro-
posed to simulate tongue movement. The coarticulation
effects of tongue movement are quite different from
those of lip movements. The tongue targets correspond-
ing to some phonemes should be completely reached.
For example, the tongue targets of the following phonemes should be completely reached: (1) tongue up and down: /t/, /d/, /n/, /l/; (2) tongue between teeth: /T/ in thank, /D/ in bathe; (3) lips forward: /S/ in ship, /Z/ in measure, /tS/ in chain, /dZ/ in Jane; (4) tongue back: /k/, /g/, /N/, /h/. The tongue targets of other phonemes need not be completely reached. Therefore, all phonemes are categorized into two classes according to whether the tongue target corresponding to the phoneme must be completely reached. For the two categories, different smoothing parameters are applied to simulate the tongue movement.

Figure 8. Synchronization between the audio signal and the video signal.

Figure 9. (a) Synchronization for a fixed animation frame rate. The dashed segment represents the time spent waiting to achieve the expected duration C between two frames. (b) Synchronization with maximal frame rate. There is no waiting time in each processing cycle.
We use the kernel smoothing approach proposed in
our previous study38 to model tongue movement. The
approach is described as follows. Suppose there is an observation sequence $y_i = \phi(x_i)$ to be smoothed, where $\{x_i\}_{i=0}^{n}$ satisfies the condition

$$0 = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = 1$$

The weighted average of the observation sequence is used as the estimator of $\phi(x)$:

$$\hat{\phi}(x) = \sum_{i=0}^{n} y_i w_i(x) \quad (15)$$

where $\sum_{i=0}^{n} w_i(x) = 1$, $w_i(x) = K_\sigma(x - x_i)/M_0$, and $M_0 = \sum_{i=0}^{n} K_\sigma(x - x_i)$. $\hat{\phi}(x)$ is known as the Nadaraya–Watson estimator.38
For tongue movement modelling, the relationship between time $t$ and $x$ is expressed as

$$x = (t - t_0)/(t_n - t_0), \qquad x_i = (t_i - t_0)/(t_n - t_0)$$

where the interval $[t_0, t_n]$ represents a local window at frame or time $t$, and $n$ is the size of the window. When the sampling points $\{x_i\}_{i=0}^{n}$ are from one speech segment, i.e. all values of $\{y_i\}_{i=0}^{n}$ are equal, the morph target
can be completely reached. When the sampling points
are not from the same speech segment, the smoothed
target value is the weighted average of the sampling
points from different speech segments. Therefore, the
target value at the boundary of two speech segments is
smoothed according to the distributions of sampling
points in the two speech segments. The principles for choosing the kernel function, the number of sampling points in different phoneme segments, and the window size n can be found in Ma and Cole.38 A tongue movement sequence generated by this approach is shown in Figure 11.
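A sketch of the Nadaraya–Watson estimator of equation (15) with a Gaussian kernel; the bandwidth and the two-segment example data are our own illustration of how the estimate smooths across a segment boundary while staying at the target value deep inside each segment.

```python
import numpy as np

def nadaraya_watson(x, xi, yi, sigma=0.1):
    """Equation (15): weighted average with normalized Gaussian kernel
    weights w_i(x) = K_sigma(x - x_i) / M_0.  sigma is a hypothetical bandwidth."""
    K = np.exp(-0.5 * ((x - xi) / sigma) ** 2)    # K_sigma(x - x_i)
    return (K / K.sum()) @ yi                     # weights sum to one

# Sampling points straddling the boundary between two speech segments whose
# tongue target values are 0 and 1 respectively.
xi = np.linspace(0.0, 1.0, 11)
yi = np.where(xi < 0.5, 0.0, 1.0)
```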
Conclusion
We have demonstrated a new technique to produce
highly realistic visible speech using motion capture,
motion mapping and motion concatenation. This was
done by recording primitive motions of a set of divi-
semes and concatenating them in an optimal way
through graph search. The system has been integrated
into reading tutors39 and the CSLU toolkit, which can be
downloaded via the following website: http://cslr.colorado.edu/toolkit/main.html.
The new contributions of our work are:
1. Motion mapping. In motion mapping, we proposed an
approach to estimating a mapping function from the
visemes in the source to the visemes in the target.
Thus, the motion vectors in the source can be mapped
onto the motion vectors in the target. PCA is applied to
obtain a compact representation of motion parameters.
2. Motion concatenation. We propose a new approach to
motion vector blending and a new approach to
defining the concatenation cost and to searching the
optimal concatenated visible speech.
3. Smoothing algorithm. The spline smoothing algo-
rithm is proposed to smooth the motion trajectory
of the synthetic visible speech according to phonetic
knowledge.
4. Modelling coarticulation of tongue movement. An ap-
proach that combines kernel-smoothing techniques
with heuristic coarticulation rules is proposed to
model coarticulation during tongue movement.
Figure 10. (a) Side view of a 3D tongue model. (b) Top view
of the 3D tongue model.
Figure 11. An example of tongue movements in side view.
We believe this work is significant for its technical
contribution to visible speech synthesis. Observation of
the resulting visible speech synthesis reveals that it is
more natural looking than morphing between targets
with simple smoothing techniques. While the initial
results of the approach are encouraging, there is much
room for improvement. For example, the visible speech
corpus should be larger so that smoother concatenated visible speech can be obtained. Furthermore,
prosodic features of speech, including accent, stress,
tone and intonation, should be used to control visible
speech synthesis and facial expressions.
ACKNOWLEDGEMENTS
This work was supported in part by NSF CARE grant EIA-
9996075; NSF/ITR grant IIS-0086107; NSF/ITR Grant REC-
0115419; NSF/IERI (Interagency Education Research Initiative)
Grant EIA-0121201 and NSF/IERI Grant 1R01HD-44276.01. The
findings and opinions expressed in this article do not necessa-
rily represent those of the granting agencies.
References
1. Bregler C, Covell M, Slaney M. Video rewrite: driving visual speech with audio. In Proceedings of ACM SIGGRAPH Computer Graphics. ACM Press: New York, 1997; 353–360.
2. Ezzat T, Geiger G, Poggio T. Trainable videorealistic speech animation. In Proceedings of ACM SIGGRAPH Computer Graphics, San Antonio, TX, 2002; pp 388–398.
3. Ezzat T, Poggio T. MikeTalk: a talking facial display based on morphing visemes. In Proceedings of Computer Animation, Philadelphia, PA, 1998; pp 96–102.
4. Cosatto E, Graf HP. Sample-based synthesis of photo-realistic talking heads. In Proceedings of Computer Animation, Philadelphia, PA, 1998; pp 103–110.
5. Graf HP, Cosatto E, Ezzat T. Face analysis for the synthesis of photo-realistic talking heads. In Proceedings of Face and Gesture Recognition, 2000; pp 189–194.
6. Pellom B, Hacioglu K. Recent improvements in the SONIC ASR system for noisy speech: the SPINE task. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, Hong Kong, 6–10 April 2003; pp 4–7.
7. Cole R, Massaro DW, de Villiers J, Rundle B, Shobaki K, Wouters J, Cohen M, Beskow J, Stone P, Connors P, Tarachow A, Solcher D. New tools for interactive speech and language training: using animated conversational agents in the classrooms of profoundly deaf children. In Proceedings of ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education, 1999.
8. Parke F. Computer generated animation of faces. In Proceedings of ACM National Conference, Boston, MA, 1972; pp 451–457.
9. Terzopoulos D, Waters K. Physically-based facial modeling, analysis, and animation. Journal of Visualization and Computer Animation 1990; 1(4): 73–80.
10. Williams L. Performance-driven facial animation. In Proceedings of ACM SIGGRAPH Computer Graphics 1990; 24(4): 235–242.
11. Kleiser J. A fast, efficient, accurate way to represent the human face. State of the Art in Facial Animation. ACM SIGGRAPH Computer Graphics Tutorials 1989; 22: 37–40.
12. Magnenat Thalmann N, Primeau E, Thalmann D. Abstract muscle action procedures for human face animation. Visual Computer 1988; 3(5): 290–297.
13. Noh JY, Neumann U. Expression cloning. In Proceedings of ACM SIGGRAPH Computer Graphics, Los Angeles, CA, 2001; pp 277–288.
14. Chuang E, Deshpande H, Bregler C. Facial expression space learning. In Proceedings of Pacific Graphics, Beijing, 2002; pp 68–76.
15. Guenter B, Grimm C, Wood D, Malvar H, Pighin F. Making faces. In Proceedings of SIGGRAPH Computer Graphics 1998; pp 55–66.
16. Kshirsagar S, Molet T, Magnenat-Thalmann N. Principal components of expressive speech animation. In Proceedings of Computer Graphics International 2001; pp 38–44.
17. Bregler C, Loeb L, Chuang E, Deshpande H. Turning to the masters: motion capturing cartoons. In Proceedings of SIGGRAPH Computer Graphics 2002; pp 399–407.
18. Jackson PL. The theoretical minimal unit for visual speech perception: visemes and coarticulation. Volta Review 1988; 90(5): 99–115.
19. Woodward MF, Barber CG. Phoneme perception in lipreading. Journal of Speech and Hearing Research 1960; 3: 212–220.
20. Hardcastle WJ, Hewlett N. Coarticulation: Theory, Data and Techniques. Cambridge University Press: Cambridge, UK, 1999.
21. Woods JC. Lip-Reading: A Guide for Beginners (2nd edn). Royal National Institute for Deaf People: London, 1994.
22. Pelachaud C, Badler N, Steedman M. Linguistic issues in facial animation. In Proceedings of Computer Animation, Magnenat-Thalmann N, Thalmann D (eds). Springer: Berlin, 1991; 15–30.
23. Kent RD, Minifie FD. Coarticulation in recent speech production models. Journal of Phonetics 1977; 5: 115–135.
24. Cohen MM, Massaro DW. Modeling coarticulation in synthetic visual speech. In Proceedings of Computer Animation, Thalmann NM, Thalmann D (eds). Springer: Tokyo, 1993; 139–156.
25. Lofqvist A. Speech as audible gestures. In Speech Production and Speech Modeling, Hardcastle WJ, Marchal A (eds). Kluwer: Dordrecht, 1990; 289–322.
26. Nielson G. Scattered data modeling. IEEE Computer Graphics and Applications 1993; 13(1): 60–70.
27. Vetter T, Poggio T. Linear object classes and image synthesis from a single example image. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997; 19(7): 733–742.
28. Blanz V, Vetter T. A morphable model for the synthesis of 3D faces. In Proceedings of SIGGRAPH Computer Graphics, Los Angeles, 1999; pp 187–194.
29. Kalberer GA, Gool LV. Face animation based on observed 3D speech dynamics. In Proceedings of Computer Animation, 2001; pp 20–27.
30. Pighin F, Szeliski R, Salesin DH. Modeling and animating realistic faces from images. International Journal of Computer Vision 2002; 50(2): 143–169.
31. Blanz V, Basso C, Poggio T, Vetter T. Reanimating faces in images and video. Computer Graphics Forum 2003; 22(3): 641–650.
32. Gertz EM, Wright SJ. Object-oriented software for quadratic programming. ACM Transactions on Mathematical Software 2003; 29: 58–81.
33. Bai ZJ, Demmel J, Dongarra J, Ruhe A, van der Vorst H. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics: Philadelphia, PA, 2000.
34. Hartmann E. Parametric Gn blending of curves and surfaces. Visual Computer 2001; 17(1): 1–13.
35. Bellman R. Dynamic Programming. Princeton University Press: Princeton, NJ, 1957.
36. Feng G. Data smoothing by cubic spline filters. IEEE Transactions on Signal Processing 1998; 46(10): 2790–2796.
37. University of Edinburgh. The Festival speech synthesis system: http://www.cstr.ed.ac.uk/projects/festival/ [2003].
38. Ma JY, Cole R. Animating visible speech and facial expressions. Visual Computer 2004 (to appear).
39. Cole R, Van Vuuren S, Pellom B, Hacioglu K, Ma JY, Movellan J, Schwartz S, Wade-Stein D, Ward W, Yan J. Perceptive animated interfaces: first steps toward a new paradigm for human–computer interaction. Proceedings of the IEEE 2003; 91(9): 1391–1405.
Appendix: Diviseme Text Corpus

0. i:-w(dee-wet) 1. i:-9r(dee-rada) 2. i:-U(bee-ood) 3. i:-ei(bee-ady) 4. i:-A(bee-ody) 5. i:-aI(bee-idy) 6. i:-T(deeth) 7. i:-S(deesh) 8. i:-k(deeck) 9. i:-l(deela) 10. i:-s(reset) 11. i:-d(deed) 12. i:-I(bee-id) 13. i:-v(deev) 14. i:-m(deem)
15. w-i:(weed) 16. w-9r(duw-rud) 17. u-U(boo-ood) 18. w-ei(wady) 19. u-A(boo-ody) 20. w-aI(widy) 21. u-T(dooth) 22. u-S(doosh) 23. u-k(doock) 24. u-l(doola) 25. u-s(doos) 26. u-d(doo-de) 27. u-I(boo-id) 28. u-v(doov) 29. u-m(doom)
30. 9r-i:(far-eed) 31. 9r-u(far-oodles) 32. 9r-U(far-ood) 33. 9r-ei(far-ady) 34. 9r-A(far-ody) 35. 9r-aI(far-idy) 36. 9r-T(dur-thud) 37. 9r-S(dur-shud) 38. 9r-k(dur-kud) 39. 9r-l(dur-lud) 40. 9r-s(dur-sud) 41. 9r-d(dur-dud) 42. 9r-I(far-id) 43. 9r-v(dur-vud) 44. 9r-m(dur-mud)
45. U-i(boo-eat) 46. U-w(boo-wet) 47. U-9r(boor) 48. U-ei(boo-able) 49. U-a(boo-art) 50. U-aI(boo-eye) 51. U-T(booth) 52. U-S(bushes) 53. U-k(book) 54. U-l(pulley) 55. U-s(pussy) 56. U-d(wooded) 57. U-I(boo-it) 58. U-v(booves) 59. U-m(woman)
60. ei-i:(bay-eed) 61. ei-w(day-wet) 62. ei-9r(dayrada) 63. ei-U(bay-ood) 64. ei-A(bay-ody) 65. ei-aI(bay-idy) 66. ei-T(dayth) 67. ei-S(daysh) 68. ei-k(dayck) 69. ei-l(dayla) 70. ei-s(days) 71. ei-d(dayd) 72. ei-I(bay-id) 73. ei-v(dayv) 74. ei-m(daym)
75. A-i:(bay-idy) 76. A-w(da-wet) 77. A-9r(da-rada) 78. A-U(ba-ood) 79. A-ei(ba-ady) 80. A-aI(ba-idy) 81. A-T(ba-the) 82. A-S(dosh) 83. A-k(dock) 84. A-l(dola) 85. A-s(velocity) 86. A-d(dod) 87. A-I(ba-id) 88. A-v(dov) 89. A-m(dom)
90. aI-i:(buy-eed) 91. aI-w(die-wet) 92. aI-9r(die-rada) 93. aI-U(buy-ood) 94. aI-ei(buy-ady) 95. aI-A(buy-ody) 96. aI-T(die-thagain) 97. aI-S(die-shagain) 98. aI-k(die-kagain) 99. aI-l(die-la) 100. aI-s(die-sagain) 101. aI-d(die-dagain) 102. aI-I(buy-id) 103. aI-v(die-vagain) 104. aI-m(die-magain)
105. T-i:(theed) 106. T-w(duth-wud) 107. T-9r(duth-rud) 108. T-U(thook) 109. T-ei(thady) 110. T-A(thody) 111. T-aI(thidy) 112. T-S(duth-shud) 113. T-k(duth-kud) 114. T-l(duth-lud) 115. T-s(duth-sud) 116. T-d(duth-dud) 117. T-I(thid) 118. T-v(duth-vud) 119. T-m(duth-mud)
120. S-i:(sheed) 121. S-w(dush-wud) 122. S-9r(dush-rud) 123. S-U(shook) 124. S-ei(shady) 125. S-A(shody) 126. S-aI(shidy) 127. S-T(dush-thud) 128. S-k(dush-kud) 129. S-l(dush-lud) 130. S-s(dush-sud) 131. S-d(dush-dud) 132. S-I(shid) 133. S-v(dush-vud) 134. S-m(dush-mud)
135. k-i:(keed) 136. k-w(duk-wud) 137. k-9r(duk-rud) 138. k-U(kook) 139. k-ei(back-ady) 140. k-A(kody) 141. k-aI(kidy) 142. k-T(duk-thud) 143. k-S(duk-shud) 144. k-l(duk-lud) 145. k-s(duk-sud) 146. k-d(duk-dud) 147. k-I(kid) 148. k-v(duk-vud) 149. k-m(duk-mud)
150. l-i:(leed) 151. l-w(dul-wud) 152. l-9r(dul-rud) 153. l-U(fall-ood) 154. l-ei(fall-ady) 155. l-A(fall-ody) 156. l-aI(fall-idy) 157. l-T(dul-thud) 158. l-S(dul-shud) 159. l-k(dul-kud) 160. l-s(dul-sud) 161. l-d(dul-dud) 162. l-I(fall-id) 163. l-v(dul-vud) 164. l-m(dul-mud)
165. s-i:(seed) 166. s-w(dus-wud) 167. s-9r(dus-rud) 168. s-U(sook) 169. s-ei(sady) 170. s-A(sody) 171. s-aI(sidy) 172. s-T(dus-thud) 173. s-S(dus-shud) 174. s-k(dus-kud) 175. s-l(dus-lud) 176. s-d(dus-dud) 177. s-I(sid) 178. s-v(dus-vud) 179. s-m(dus-mud)
180. d-i:(deed) 181. d-w(dud-wud) 182. d-9r(dud-rud) 183. d-U(dook) 184. d-ei(dady) 185. d-A(dody) 186. d-aI(didy) 187. d-T(dud-thud) 188. d-S(dud-shud) 189. d-k(dud-kud) 190. d-l(dud-lud) 191. d-s(dud-sud) 192. d-I(did) 193. d-v(dud-vud) 194. d-m(dud-mud)
195. I-i:(ci-eed) 196. I-w(ci-wet) 197. I-9r(ci-rada) 198. I-U(ci-ood) 199. I-ei(ci-ady) 200. I-A(ci-ody) 201. I-aI(ci-idy) 202. I-T(dith) 203. I-S(dish) 204. I-k(dick) 205. I-l(dill) 206. I-s(dis) 207. I-d(did) 208. I-v(div) 209. I-m(dim)
210. v-i:(veed) 211. v-w(duv-wud) 212. v-9r(duv-rud) 213. v-U(vook) 214. v-ei(vady) 215. v-A(vody) 216. v-aI(vidy) 217. v-T(duv-thud) 218. v-S(duv-shud) 219. v-k(duv-kud) 220. v-l(duv-lud) 221. v-s(duv-sud) 222. v-d(duv-dud) 223. v-I(vid) 224. v-m(duv-mud)
225. m-i:(meed) 226. m-w(dum-wud) 227. m-9r(dum-rud) 228. m-U(mook) 229. m-ei(mady) 230. m-A(monic) 231. m-aI(midy) 232. m-T(dum-thud) 233. m-S(dum-shud) 234. m-k(dum-kud) 235. m-l(dum-lud) 236. m-s(dum-sud) 237. m-d(dum-dud) 238. m-I(mid) 239. m-v(dum-vud).
The phonetic symbols in the list are defined by the following words: /i:/ week; /I/ visual; /9r/ read; /U/ book; /ei/ stable; /A/ father; /aI/ tiger; /T/ think; /S/ she; /w/ wish. Videos and utterances using our technique can be viewed at the following website: http://cslr.colorado.edu/~jiyong/corpus.html.
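Each corpus entry above follows the pattern `index. first-second(carrier word)`: a running index, a diviseme written as two phoneme symbols joined by a hyphen, and the carrier word in which it was recorded. As an illustration only (the paper ships no parsing code, so the function name and record layout below are our own assumptions), the listing could be read into structured records like this:

```python
import re

# An entry looks like "12. i:-I(bee-id)": index, diviseme, carrier word.
ENTRY = re.compile(r"(\d+)\.\s*([^\s(]+)\(([^)]+)\)")

def parse_corpus(text):
    """Return a list of (index, (phoneme1, phoneme2), carrier_word) records."""
    records = []
    for idx, diviseme, word in ENTRY.findall(text):
        # Split only on the first "-" so multi-character symbols such as
        # "9r" or "i:" survive intact; "i:-w" -> ("i:", "w").
        first, _, second = diviseme.partition("-")
        records.append((int(idx), (first, second), word))
    return records
```

Splitting on the first hyphen only matters because the carrier words themselves contain hyphens; the regex keeps those safely inside the parenthesized group.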
Authors' biographies:
Jiyong Ma received a PhD degree in Computer Science from Harbin Institute of Technology, China, in 1999 and his BS degree in Computational Mathematics from Heilongjiang University in 1984. Prior to joining the CSLR, he was a post-doctoral researcher at the Institute of Computing Technology in the Chinese Academy of Sciences from March 1999 to February 2001. His research interests include computer animation, computer vision, speech and speaker recognition, pattern recognition algorithms, and applied and computational mathematics. He has published more than 60 scientific papers.
Ronald Cole has studied speech recognition by human
and machine for the past 35 years, and has published
over 150 articles in scientific journals and published
conference proceedings. In 1990, Ron founded the
Center for Spoken Language Understanding (CSLU) at
the Oregon Graduate Institute. In 1998, Ron founded
the Center for Spoken Language Research (CSLR) at the
University of Colorado, Boulder.
Bryan Pellom received the BSc degree in Computer and Electrical Engineering from Purdue University, West Lafayette, IN, in 1994 and the MSc and PhD degrees in electrical engineering from Duke University in 1996 and 1998, respectively. From 1999 to 2002, he was a Research Associate with the Center for Spoken Language Research (CSLR), University of Colorado, Boulder. His research activities were focused on automatic speech recognition, concatenative speech synthesis and spoken dialogue systems. Since 2002, he has been a Research Assistant Professor in the Department of Computer Science and with the CSLR. His current research is focused in the area of large vocabulary speech recognition.
Wayne Ward is a full-time research faculty member in the Center for Spoken Language Research. He works in the area of Spoken Language Processing and Dialogue Modeling for conversational computer systems, and information retrieval in question-answering systems.
Barbara Wise has a BA in Psychology with honors from Stanford, and an MA and PhD in Developmental Psychology from the University of Colorado in Boulder. Dr Wise developed the Linguistic Remedies program and classes, where she shares her own and others' research knowledge and teaching ideas with professionals and parents concerned with children with reading difficulties. She has conducted teacher training workshops in many towns in Colorado, as well as in Texas, California, Connecticut, Pennsylvania and Washington.