HMM-based Synthesis of Emotional Facial Expressions during Speech in Synthetic Talking Heads

Nadia Mana and Fabio Pianesi ITC-irst

via Sommarive, 18 38050 Povo (Trento), Italy

{mana, pianesi}@itc.it

ABSTRACT One of the research goals in the human-computer interaction community is to build believable Embodied Conversational Agents (ECAs), that is, agents able to communicate complex information with human-like expressiveness and naturalness. Since emotions play a crucial role in human communication and most of them are expressed through the face, making ECAs more believable implies giving them the ability to display emotional facial expressions.

This paper presents a system based on Hidden Markov Models (HMMs) for the synthesis of emotional facial expressions during speech. The HMMs were trained on a set of emotion examples in which a professional actor uttered Italian non-sense words, acting various emotional facial expressions with different intensities.

The evaluation of the experimental results, performed by comparing the “synthetic examples” (generated by the system) with a reference “natural example” (one of the actor’s examples) in three different ways, shows that HMMs for emotional facial expression synthesis have some limitations but are suitable for making a synthetic Talking Head more expressive and realistic.

Categories and Subject Descriptors G.3 [Mathematics of Computing]: Probability and Statistics – Markov Processes; I.2.6 [Artificial Intelligence]: Learning – Parameter Learning; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism – Animation; I.4.8 [Image Processing and Computer Vision]: Scene Analysis - Time-varying imagery.

General Terms Human Factors, Measurement, Performance, Experimentation.

Keywords Emotional Facial Expression Modeling, Face Synthesis, MPEG4 Facial Animation, Talking Heads, Hidden Markov Models.

1. INTRODUCTION In recent years there has been growing interest in the human-computer interaction (HCI) community in “Embodied Conversational Agents” (ECAs), software agents with a humanoid appearance that are able to interact with users [6].

ECAs, including talking heads and virtual characters (or, more generally, human-like embodied interfaces), are becoming popular as front ends to web sites, and as part of many computer applications (e.g. virtual training environments, tutoring systems, storytelling systems, portable personal guides, edutainment systems, etc.).

One of the research goals in HCI is making these conversational agents more believable, that is, able to communicate complex information through the combination and tight synchronization of verbal and non-verbal signals, with human-like expressiveness and naturalness. In other words, the final goal is having agents that are not only able to perceive and understand what the user is saying, but also to answer, verbally and non-verbally, in an appropriate manner. These agents should be able to interact with users as in a human-human conversation. Since emotions play a crucial role in human communication and most of them are expressed through the face, one of the steps needed to obtain more believable ECAs is to give them the ability to display emotional facial expressions.

Moreover, there is substantial evidence that resorting to well-designed human-like systems, capable of modeling communicative functions in a suitable manner, may improve the usability of such systems (see e.g. [6] and [22]). At the same time, many studies show that talking heads without a natural face and emotional expressions are actually less preferred than disembodied voices, whereas video-captured natural human faces are preferred over disembodied voices ([17], [18] and [19]). All this makes it even more evident that emotional expressions are one of the key factors for achieving user acceptance of synthetic characters.

In the last few years several works, both rule-based and statistical, have obtained important results in modelling emotional facial expressions for use in synthetic talking heads. Nevertheless, building models for realistic facial animation remains difficult, and the synthesis of expressive facial expressions is still an open problem. Most rule-based systems suffer from static generation, due to the fact that the set of rules and their combinations is limited. Stochastic systems are able to overcome this limitation by using a generative, dynamic paradigm. However, most work on synthesis using statistical/machine learning approaches is restricted to speech and lip movements.

This paper presents a system based on Hidden Markov Models (HMMs) for the synthesis of emotional facial expressions during speech. HMMs were trained on a set of examples in which a professional actor uttered Italian non-sense words, acting various emotional facial expressions with different intensities. The dynamics of every facial expression was captured by means of an opto-electronic system using reflective markers placed on the actor’s face.

The performance of the model has been evaluated by making a qualitative, quantitative and visual comparison between a reference “natural example” (one of the actor’s examples) and “synthetic examples” (generated by the system).

The rest of the paper is organized as follows. Section 2 is a short survey of some of the major rule-based and HMM-based works on modelling and synthesis of emotional facial expressions. Section 3 introduces the proposed modelling, describing the training data set, the defined HMMs, and the resulting synthesis. Section 4 describes the experimental results and their evaluation. Section 5 points out possible future work, while Section 6 summarises the present work and draws some final remarks.

2. RELATED WORK Many works have addressed the modeling of emotional facial expressions through symbolic (rule-based) approaches. Among these is the system presented by Cassell et al. [7], a rule-based automatic system that generates expressions and speech for multiple conversational agents. The system is based on FACS (Facial Action Coding System) [13], a system for measuring visually discernible facial movements by decomposing facial expressions into component actions (called action units). Such coding is used to denote static facial expressions and to generate new facial expressions as combinations of basic action units. The system rules, inferred from a large database of examples, guide the facial expression generation. The resulting facial expressions look good, but the generation is static: the system always produces the same expressions in any context.

This drawback is partially overcome by [9]. As in the previous work, the modelling is based on rules and FACS. However, here the rule set is significantly enlarged and enriched with more details. Furthermore, the system models variable emotion intensity: when applied, it chooses different emotion-generation rules according to the communication context and creates the corresponding facial expressions. Another strength of the system is its capability of generating new expressions by combining different emotion expressions. For example, the “worried” facial display is given by a non-uniform combination of the “surprise” and “sadness” facial displays.

A similar modelling is proposed by Bui et al. [5], who use a fuzzy rule-based system to map representations of the emotional state of an animated agent onto muscle contraction values for specific facial expressions. Their implementation firstly takes into account the continuous changes in emotion expressions depending on the intensity with which the expressions are felt. Secondly, as in [9], they found a way to specify combinations of expressions (i.e. blends), according to the FACS system. The fuzzy rule-based approach yields better results because the resulting expressions are smoother and therefore more natural.

Few works have addressed the modelling of emotional facial expressions with statistical/machine learning approaches, and in particular with HMMs. These models are widely used in speech synthesis but are not so common in facial expression synthesis.

One of the first HMM-based models for facial expression synthesis was proposed by Brand [4]. It is based on an entropy-minimization algorithm applied to training audio/video data for speech synthesis. Novel facial motions related to speech production are synthesized by sampling from this generative model.

Cohen, Garg, and Huang [8] conducted a study on emotion recognition using Multilevel Hidden Markov Models (MHMMs), which both automatically segment and correctly categorize emotions from live video footage. They follow the idea that emotions can be recognized by exploiting the fact that sequential changes in facial expression give rise to a unique temporal sequence for each emotion.

3. MODELING AND SYNTHESIS The dynamics of facial expressions over time can be formally described as a discrete-time sequence of random feature vectors, drawn from a probability distribution in a properly defined feature space. The modelling of such dynamics has been treated as a random-process modelling problem, in which statistical inference is carried out over a corpus of training data samples. As in [4], HMMs have been used.

3.1 Training Data A portion of an Italian database of emotional facial expressions during speech [17] has been used as the learning set. This set consisted of 882 examples of emotional facial expressions, played by a young professional actor. In the examples the actor uttered Italian non-sense words, containing the seven basic visemes of Italian [16] (namely /b/, /d/, /L/, /dZ/, /l/, /n/ and /v/), with different emotional expressions and intensities.

To ease the detection of the starting and ending points of the emotional expression, the actor uttered two additional Italian words, “chiudo” (lit. “I close”) and “punto” (“point”), respectively before and after the short word played emotionally. Each word was acted with seven emotional states, corresponding to Ekman’s set [12] – Anger, Disgust, Fear, Happiness, Sadness, and Surprise – plus the additional state ‘Neutral’. An example of each emotional state is shown in Figure 1. Apart from ‘Neutral’, each emotion was acted with three different intensity levels (Low, Medium, High).

Figure 1. Ekman’s set of emotions, plus the neutral expression (“anger”, “disgust”, “neutral”, “happiness”, “fear”, “sadness”, “surprise”)


The dynamics of every facial expression was captured with high precision by means of an opto-electronic system (the Elite system [14]), using 28 reflective markers glued on the actor’s face. The system tracked the movement of the 28 markers over time in 3D space through the xyz coordinates of each marker, frame by frame (as depicted in Figure 2).

Figure 2. Output of the tracking system

3.2 Modeling and Learning Formally, an HMM is a pair of stochastic processes: a hidden Markov chain and an observable process that is a probabilistic function of the states of the former. This means that observable events in the real world (xyz marker coordinates in our case) are modeled with (possibly continuous) probability distributions, the observable part of the model, associated with the individual states of a discrete-time, first-order Markovian process. The semantics of the model (the conceptual correspondence with physical phenomena) is usually encapsulated in the hidden part. More precisely, an HMM is defined by:

1. A set S of Q states, S = {S1, …, SQ}, which are the distinct values that the discrete, hidden stochastic process can take.

2. An initial state probability distribution, i.e. π = {Pr(Si | t = 0), Si ∈ S}, where t is a discrete time index.

3. A probability distribution that characterizes the allowed transitions between states, that is aij = {Pr(Sj at time t | Si at time t-1), Si, Sj ∈ S}, where the transition probabilities are assumed to be independent of time t.

4. An observation or feature space F, which is a discrete or continuous universe of all possible observable events (usually a subset of ℜd, where d is the dimensionality of the observations).

5. A set of probability distributions (referred to as emission or output probabilities) that describes the statistical properties of the observations for each state of the model: B = {bi(x) = Pr(x | Si), Si ∈ S, x ∈ F}.

In our modeling, different families of HMMs were defined for each individual emotional state. Then, within each such family, different left-to-right HMMs were introduced to model individual visemes. The modelling uses long left-to-right hidden Markov chains of 21 states. The input observations are the vectors of XYZ marker coordinates (i.e. 28 markers × 3 coordinates). Mixtures of Gaussian probability density functions (pdfs) were used to model the emission probabilities; in particular, mixtures of eight diagonal-covariance Gaussians. Training samples, clustered by emotional state (without any distinction of intensity), were used, along with the Baum-Welch algorithm [21], to estimate the model parameters. Due to the continuous nature of the feature space (namely, the space of XYZ vectors), a continuous-density HMM (CDHMM) [15] was preferred.
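To make the training step concrete, the following minimal sketch (not the authors’ implementation) shows how one such left-to-right CDHMM could be estimated with Baum-Welch using the hmmlearn library; the `sequences` variable, the helper name and all settings other than those stated in the text (21 states, eight diagonal-covariance Gaussians) are assumptions.

    # Sketch: train one left-to-right CDHMM (21 states, 8 diagonal-covariance
    # Gaussians per state) for a given emotion/viseme pair with Baum-Welch.
    # `sequences` is assumed to be a list of (T_i x 84) numpy arrays
    # (28 markers x 3 coordinates per frame).
    import numpy as np
    from hmmlearn import hmm

    def train_emotion_viseme_hmm(sequences, n_states=21, n_mix=8):
        X = np.concatenate(sequences)           # stack all frames
        lengths = [len(s) for s in sequences]   # frame count per example

        model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix,
                           covariance_type="diag", n_iter=30,
                           init_params="mcw",   # EM initializes mixtures only
                           params="stmcw")      # everything is re-estimated
        # Left-to-right topology: start in state 0, allow only self-loops
        # and transitions to the next state (zeros stay zero during EM).
        model.startprob_ = np.zeros(n_states)
        model.startprob_[0] = 1.0
        transmat = np.zeros((n_states, n_states))
        for i in range(n_states):
            transmat[i, i] = 0.5
            transmat[i, min(i + 1, n_states - 1)] += 0.5
        model.transmat_ = transmat

        model.fit(X, lengths)                   # Baum-Welch (EM) training
        return model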

3.3 Synthesis For synthesis purposes the defined HMMs have been used according to a generative approach. That means that, once trained, a model is used to generate sequences of feature vectors according to the probability laws described by its parameters (see Figure 3).
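A corresponding generative-use sketch, again hypothetical and assuming hmmlearn’s sample() method and a model trained as in the previous sketch, could look like this:

    # Sketch: generate a "synthetic example" by sampling feature vectors from a
    # trained model (hypothetical `model` from the training sketch above).
    def synthesize_marker_trajectory(model, n_frames, seed=0):
        # hmmlearn models expose sample(); it returns the observations and the
        # hidden state sequence that produced them.
        coords, states = model.sample(n_frames, random_state=seed)
        # Reshape each 84-dim frame back into 28 markers x 3 (x, y, z).
        return coords.reshape(n_frames, 28, 3), states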

Figure 3. System Overview


The generated XYZ coordinates are then converted into FAPs (Facial Animation Parameters), according to the MPEG-4 standard [11], so that they can be animated by any MPEG-4 compliant Talking Head.

In the MPEG-4 standard two sets of parameters describe and animate the 3D facial model: the facial animation parameters (FAPs) and the facial definition parameters (FDPs). FAPs define the facial actions, while FDPs define the shape of the model.

FAPs have to be calibrated prior to using them on a specific face model. For this reason, FAPs are expressed in normalized units called FAPUs (Facial Animation Parameter Units), which are defined as fractions of distances between key facial features (e.g. eye-nose separation). Only the FAPUs are specific to the actual 3D face model in use, while the FAPs are model-independent: they can drive different face models, regardless of geometry. As a result, by coding a face model using the MPEG-4 standard, developers can freely exchange face models without calibration and parameterization problems for animation, and the same FAPs can be used for different 3D facial models.
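As a rough illustration only, the sketch below converts a marker displacement in millimetres into a FAP value, assuming the common MPEG-4 convention that a distance-based FAPU is the corresponding facial distance divided by 1024; the function name and the example numbers are hypothetical.

    # Sketch: express a marker displacement (in mm) as an MPEG-4 FAP value,
    # normalized by a FAPU measured on the target face. Assumes the usual
    # convention that a distance-based FAPU is the facial distance / 1024.
    def mm_to_fap(displacement_mm, fapu_distance_mm):
        fapu = fapu_distance_mm / 1024.0   # e.g. MNS0 (mouth-nose separation)
        return int(round(displacement_mm / fapu))

    # Example (hypothetical numbers): a 6 mm lip-corner displacement on a face
    # whose mouth-nose separation is 35 mm gives about 176 FAPU units.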

Once the model has been characterized with its FDPs (namely, the model shape has been defined), the animation is obtained by specifying the FAP-stream, i.e. the values of the FAPs frame by frame.


FAPs are based on the study of minimal perceptible actions and are closely related to muscle actions. MPEG-4 defines 68 of them: the first 2 are high-level parameters representing visemes and emotions, while the remaining 66 are low-level FAPs dealing with specific regions of the face (e.g. bottom of chin, left lip corner, right corner of left eyebrow, etc.).

Figure 4. An example of FAP-stream

As shown in Figure 4, in a FAP-stream the relevant information for animation is distributed over two lines: the first line indicates which FAPs are active at that moment (activation or not is expressed by 1 and 0), while the second contains the target values, expressed as differences from the target values of the previous frame. When a FAP is activated (i.e. when its value is not null), the feature point on which the FAP acts is moved in the direction indicated by the FAP itself (up, down, left, right, etc.). Therefore, the 68 FAP values, specified in the FAP-stream frame by frame, produce the facial animation by defining the deformation between two consecutive frames.
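The toy sketch below writes one frame in the two-line layout just described (activation flags, then per-FAP differences); it follows the description in the text rather than the exact MPEG-4 FAP file syntax, so the field layout should be read as an assumption.

    # Sketch: emit one FAP-stream frame as described in the text: a first line
    # with the 68 activation flags (0/1) and a second line with the values of
    # the activated FAPs, expressed as differences from the previous frame.
    def write_fap_frame(out, fap_values, prev_values):
        mask = [1 if v != prev else 0
                for v, prev in zip(fap_values, prev_values)]
        out.write(" ".join(str(m) for m in mask) + "\n")
        deltas = [v - prev
                  for v, prev, m in zip(fap_values, prev_values, mask)
                  if m == 1]
        out.write(" ".join(str(d) for d in deltas) + "\n")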

4. EXPERIMENTAL RESULTS The performance of the model has been evaluated by comparing “synthetic examples” (generated by the system) with a reference “natural example” (one of the actor’s examples) on the following grounds: (i) qualitative, through 2D graphs representing the evolution of the spatial position of the markers over time; (ii) quantitative, via similarity measures (using a Dynamic Time Warping function); and (iii) visual, by means of actual facial animation on synthetic talking heads.

4.1 Qualitative Evaluation In order to represent graphically the dynamics of a facial expression, i.e. the marker motions frame by frame, we converted the XYZ coordinate values of each marker into a single value. For this purpose we used the 3D vector modulus. Since x, y and z represent the distances along the three Cartesian axes of a marker with respect to its rest position, the 3D vector modulus is given by:

$|V| = \sqrt{x^2 + y^2 + z^2}$     (1)
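Equation (1), applied to every marker in every frame, can be rendered with a one-liner such as the following sketch (the array shape is an assumption):

    # Sketch: collapse each marker's (x, y, z) displacement from rest into the
    # 3D vector modulus of Equation (1). `xyz` is assumed to have shape
    # (n_frames, 28, 3); the result has shape (n_frames, 28).
    import numpy as np

    def marker_modulus(xyz):
        return np.sqrt((xyz ** 2).sum(axis=-1))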

Applying this equation to all the marker coordinates, we obtain a visual representation of the marker behaviour over time, i.e. of their dynamics. For the sake of clarity we do not show the dynamics of all 28 markers (the graphs would otherwise be unreadable) and focus our attention on only three of them: marker 2 (m2), marker 18 (m18) and marker 24 (m24), corresponding respectively to the markers placed on the middle of the left eyebrow, on the left corner of the mouth and on the middle of the lower lip (as depicted in Figure 5). Empirically, we have found these three markers to be the most significant in capturing facial expression changes and differences across emotional expressions.

Figure 5. Displacement of markers 2, 18 and 24

By using this graphical representation, the “prototypical behaviour” of each emotional expression becomes evident.

In Figure 6 we can see, for example, that the prototypical behaviour of the “surprise” expression (raised eyebrows and open mouth) is well captured by marker 2 and marker 24, each having a peak corresponding to the maximum of the surprise expression (approximately at frame 316 for m2 and at frame 406 for m24).

Figure 6. Dynamics of “Surprise”

More generally, the graphs also make it possible to analyze how emotions affect speech production. This becomes particularly evident when comparing the dynamics of emotional expressions for a specific viseme with the dynamics of the same viseme uttered without emotion (neutral state).

Figure 7 shows, for example, the dynamics of marker 2 for the “aba” viseme in the neutral and surprise conditions. As is evident from the graph, the facial expression dynamics are strongly affected by the emotion (note, in the example, the prominence of the central peak, at approximately frame 300): the emotional expression is more intense than the neutral one and tends to last longer.

Figure 7. Dynamics of Marker 2 (Surprise vs Neutral)

A first evaluation of the experimental results was carried out from a qualitative point of view using this 2D graphical representation. The evaluation was done by comparing the graphs of the generated examples (“synthetic examples”) with those of the actor (“natural examples”), randomly choosing one of the actor’s examples for each viseme-emotion pair from the training set.

The graphs show that the HMMs have been able to learn and reproduce the marker behaviour typical of each emotional expression. An example is shown in Figure 8, where the HMMs’ performance in the case of “surprise” is illustrated by markers 2, 18 and 24.

[Figure 8 plot: “HMM - SURPRISE”; x-axis: frames; y-axis: millimeters; series: HMM_m2, HMM_m18, HMM_m24.]

Figure 8. “Synthetic” surprise expression

Focusing, for example, on marker 2 and comparing the HMM output with an actor’s example (see Figure 9), we note that the synthetic expression presents the peak typical of the raised eyebrow in surprise, although with less intensity (it is slightly lower).

However: (a) the actor and HMM curves are not aligned (the HMM curve looks slightly contracted); (b) the HMM curve is distinctly piecewise.

[Figure 9 plot: “SURPRISE - ACTOR vs HMMs”; x-axis: frames; y-axis: millimeters; series: ACT-SUR_m2, HMM-SUR_m2.]

Figure 9. Comparison of “Natural” and “Synthetic” surprise expression (by marker 2)

4.2 Quantitative Evaluation In addition to the qualitative evaluation based on graphical representations of the marker behaviours, we also attempted a quantitative evaluation, i.e. an objective measurement of the quality of the generated examples. To obtain this measurement we had to compare a “synthetic sequence”, generated by the HMMs, with the corresponding “natural sequence”, i.e. one of the sequences produced by the actor.

If two sequences (X and Y) are aligned and have the same length, their comparison is simply done by calculating their distance point by point, frame by frame (see Figure 10).

Figure 10. Comparison of two aligned sequences

Yet two time series do not necessarily have the same length, nor are they necessarily aligned. In our scenario the sequences may differ in duration and shape (as is particularly evident in Figure 9). In these cases the comparison shown in Figure 10 is not possible, and a more sophisticated comparison (see Figure 11) is needed.

Figure 11. A more sophisticated comparison of two aligned sequences

To accomplish this kind of comparison, we resorted to a Dynamic Programming technique, and in particular to a Dynamic Time Warping (DTW) algorithm [23] able to calculate matching distances by recovering optimal alignments between the two sequences (as depicted in Figure 12).

[Figure 11 schematic: sequences X and Y over time. Figure 7 plot: “NEUTRAL vs SURPRISE (marker 2)”; x-axis: frames; y-axis: millimeters; series: NEU_m2, SUR-M_m2.]


Figure 12. Comparison of two sequences by DTW

In particular, the DTW-distance between two time series X1 … XM and Y1 … YN is equal to D(M,N), calculated by dynamic programming as follows:

$D(i,j) = \min\{D(i-1,j),\ D(i,j-1),\ D(i-1,j-1)\} + d(x_i, y_j)$     (2)
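A direct dynamic-programming rendering of Equation (2) is sketched below; the absolute difference of the scalar marker moduli is used as the local distance d, which is an assumption, since the paper does not state the local distance explicitly.

    # Sketch: DTW distance D(M, N) between two 1-D sequences (e.g. the modulus
    # of one marker over time), computed with the recursion of Equation (2).
    import numpy as np

    def dtw_distance(x, y):
        M, N = len(x), len(y)
        D = np.full((M + 1, N + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, M + 1):
            for j in range(1, N + 1):
                d = abs(x[i - 1] - y[j - 1])          # local distance d(x_i, y_j)
                D[i, j] = d + min(D[i - 1, j],        # insertion
                                  D[i, j - 1],        # deletion
                                  D[i - 1, j - 1])    # match
        return D[M, N]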

Still using the 2D representation of the marker dynamics over time given by (1), we measured the similarity between natural and synthetic examples by calculating the distance between the two sequences. A preliminary study was conducted, in which the evaluation was limited to examples of the “aba” viseme with the Neutral, Happiness and Surprise expressions. We randomly took an example of each emotional expression from the set of the actor’s examples and used it as the reference for the comparison. In the same way, we randomly took 5 synthetic examples generated by the system for each emotional expression and compared each one with the reference example. Table 1 shows the average results of this comparison for markers 2, 18 and 24.
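The averaging protocol behind Table 1 can be sketched as follows, assuming the dtw_distance helper from the previous sketch and hypothetical lists of per-marker modulus sequences:

    # Sketch: average DTW distance of a few synthetic examples against one
    # reference actor example for a given marker (the protocol behind Table 1).
    def mean_dtw(reference_seq, synthetic_seqs):
        return sum(dtw_distance(reference_seq, s)
                   for s in synthetic_seqs) / len(synthetic_seqs)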

Table 1. Similarity measures by DTW

          HMM                          ACTOR
       m2       m18      m24       m2        m18      m24
NEU    46.34    55.377   53.6075   9.8722    18.659   19.532
HAP    54.924   58.613   61.2587   11.683    24.621   23.953
SUR    48.56    43.98    48.165    13.7106   22.806   21.2948

Note that the HMMs’ performance is quite uniform: there are no significant differences in behaviour between one emotional state and another, nor between markers m2, m18 and m24 within a specific emotional state. Finally, to show that there is a certain variability even among examples from the same actor, we also compared the reference examples with other examples from the actor (see the ACTOR columns on the right).

4.3 Visual Evaluation Another possible evaluation of the experimental results can be accomplished through facial animation. Indeed, by animating the generated marker motions on a synthetic Talking Head it is possible to evaluate the quality of the generated expressions with respect to the prototypical expressions, that is, to assess whether the animated expression significantly reproduces the corresponding human emotional expression. In other words, through animation we can check whether the synthetic expression captures the salient characteristics of the prototypical expression and whether the animation confirms the emotional facial expression curves seen graphically.

For this task we used Alice, a textured 3D model of a female face developed at ITC-irst using the Xface toolkit [2]. Exploiting Alice’s capability of performing different facial expressions (see Figure 13), we could test our results directly.

Figure 13. Different emotional facial expressions by Alice

By animating the generated examples it is possible to obtain an immediate evaluation of the quality of the results, using emotional expressions on the human face as a reference (see Figure 14).

Figure 14. Surprise expression on the actor and Alice’s face

Overall, we noted that the piecewise behaviour is not always perceptible during animation. Although the HMM-based examples may seem not completely correct because graphically they are not smooth (Figure 8), their animation is definitely acceptable. This is probably due to the fact that the facial animation runs at 25 frames per second, so the human eye cannot perceive certain details. For the same reason, examples that graphically looked misaligned or contracted with respect to the actor’s ones (see Figure 9) are, from the animation viewpoint, definitely believable.

5. DISCUSSION AND FUTURE WORK The adoption of a statistical paradigm provides us with an adaptive model, capable of learning the natural features of facial expressions (as a function of emotions and their dynamics) from examples collected in the field, instead of requiring a hard-coded, fixed set of designer-provided rules. As a result, the generated synthetic sequences turn out to be more natural, and the corresponding animation more realistic, than with a rule-based system. The phenomenon of “piecewise behaviour”, visible in the graphs, depends on the fact that the generated sequences obey the probabilistic structure (and assumptions) of the HMM: small random fluctuations around the mean of the Gaussian emission distributions are generated within individual states of the Markov chain (as a consequence of the stationarity assumption, encapsulated in the adoption of a static pdf to model the state emissions), while sharper changes in the output values occur whenever a transition between a pair of adjacent states takes place.



From the facial animation point of view, this means that the talking head displays nearly flat intervals filled with small, chaotic oscillations, alternating with sudden changes in the facial expression, which then remains flat for another interval, and so on. The resulting animation is expected to be jerky, but this phenomenon is not always perceivable by the human eye. Therefore, the final animation turns out to be quite realistic and believable.

Another positive aspect is that this modelling is easily extendable. By defining HMM families for each viseme, emotion and intensity (Low, Medium and High), it is possible to extend the modelling to emotions with different intensities, while by concatenating “basic HMMs” (i.e. HMMs modelling single emotions) it is possible to model the transitions from one emotional state to another during speech (e.g. from “surprise” to “happiness”).

Furthermore, the modelled expressions involve the whole face and are not limited to the lips or eyebrows, as in most synthesis systems. Moreover, the modeling is dynamic: each time, the synthesis system produces a different expression for a given emotional state. As with humans, no two emotional expressions are ever identical2.

Future work will follow several directions. First of all, the modeling could be extended by adding new basic emotions. The extension could also involve more examples of visemes in different vowel contexts, thus reaching a wider coverage of the Italian language.

Secondly, an extensive evaluation based on the three types of evaluation (qualitative, quantitative and visual) will be carried out on all examples of visemes, emotions and intensities. Furthermore, the evaluation work will be completed with a subjective evaluation. More specifically, we would like to replicate the experiments conducted by Costantini et al. in 2004 ([9] and [3]), in which the adequacy of facial displays in the expression of some basic emotional states was measured through a recognition task. In those experiments, different facial expressions were displayed to a number of subjects by means of talking heads. The synthetic faces animated “natural examples”, i.e. parameters related to emotional facial expressions extracted from the actor’s examples. In a new evaluation framework, we could compare “natural examples” and “synthetic examples” across several synthetic talking heads.

Finally, we are going to apply this modeling to a talking head and use it as a synthetic character able to communicate in an expressive and realistic way. That means developing a multimodal interface in which audio and video synthesis are combined. In this perspective, an emotional text-to-speech synthesizer could be integrated with an MPEG-4 synthetic face, so that the synthetic character can take any text as input (suitably marked up with information about the emotional expression to be associated with it) and produce the corresponding audio and visual emotional expressions.

2 This dynamic aspect distinguishes this work from [4]: Brand’s system is audio-driven and, given the same audio, it will always produce the same expression.

6. CONCLUSION In this paper we presented an HMM-based approach to the synthesis of emotional facial expressions during speech, to be used in talking heads.

We aimed at investigating a novel, viable solution to this problem through a statistical approach, and at overcoming intrinsic limitations of the more traditional rule-based approaches.

In the light of the three-fold evaluation, we can argue that HMMs for emotional facial expression synthesis have some limitations: as shown by the graphs, they produce “synthetic examples” with piecewise and contracted motions. Nevertheless, these examples are suitable for a talking head application. As shown by the similarity measures, the generated examples reproduce well the prototypical marker trajectories of specific expressions. Furthermore, during animation the piecewise and contracted movements are generally not perceptible by humans. We can therefore conclude that the proposed approach is suitable for emotional facial expression synthesis and promising from an application-oriented perspective.

7. REFERENCES [1] Bowman, B., Debray, S. K., and Peterson, L. L. Reasoning about naming systems. ACM Trans. Program. Lang. Syst., 15, 5 (Nov. 1993), 795-825.

[2] K. Balci. Xface: Open Source Toolkit for Creating 3D Faces of an Embodied Conversational Agent. In Proceedings of Smart Graphics, 2005.

[3] J. Beskow, L. Cerrato, P. Cosi, E. Costantini, M. Nordstrand, F. Pianesi, M. Prete, and G. Svanfeldt. Preliminary Cross-cultural Evaluation of Expressiveness in Synthetic Faces. In E. Andrè, L. Dybkiaer, W. Minker, and P. Heisterkamp, editors, Affective Dialogue Systems ADS '04, Springer Verlag, 2004.

[4] M. Brand. Voice Puppetry. In Proceedings of ACM SIGGRAPH’99, pp. 21-28, 1999.

[5] T.D. Bui, D. Heylen, M. Poel, and A. Nijholt. Generation of Facial Expressions from Emotion Using a Fuzzy Rule Based System. In Proceedings of the 14th Australian Joint Conference on Artificial Intelligence (AI 2001), Adelaide, Australia, December 2001.

[6] Cassell, J., Bickmore, T., Campbell, L., Vilhjalmsson, H., and Yan, H. Human Conversation as a System Framework: Designing Embodied Conversational Agents. In J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, editors, Embodied Conversational Agents, MIT Press, 22-63, 2000.

[7] J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, and M. Stone. ANIMATED CONVERSATION: Rule-based Generation of Facial Expression, Gesture & Spoken Intonation for Multiple Conversational Agents. In Proceedings of SIGGRAPH’94, 1994.

[8] I. Cohen, A. Garg, and T. Huang. Emotion recognition from facial expressions using multilevel HMM, 2000.


[9] E. Costantini, F. Pianesi, and P. Cosi. Evaluation of Synthetic Faces: Human Recognition of Emotional Facial Displays. In E. Andrè, L. Dybkiaer, W. Minker, and P. Heisterkamp, editors, Affective Dialogue Systems ADS '04, Springer-Verlag, 2004.

[10] F. deRosis, C. Pelachaud, I. Poggi, V. Carofiglio, and B.D. Carolis. From Greta’s Mind to her Face: Modelling the Dynamics of Affective States in a Conversational Embodied Agent. International Journal of Human-Computer Studies. Special Issue on Applications of Affective Computing in HCI, 59 (1) pp. 81-118, 2003.

[11] P. Doenges, F. Lavagetto, J. Ostermann, I.S. Pandzic, and E. Petajan. MPEG-4: Audio/Video and Synthetic Graphics/Audio for Mixed Media. In Image Communications Journal, 5(4), May 1997.

[12] P. Ekman. An Argument for Basic Emotions. In N.L. Stein, and K. Oatley, editors, Basic Emotions, pp 169-200, 1992.

[13] P. Ekman, and W. Friesen. Manual for the Facial Action Coding System. Consulting Psychologists Press, 1978.

[14] G. Ferrigno, and A. Pedotti. ELITE: A Digital Dedicated Hardware System for Movement Analysis via Real-Time TV Signal Processing. In IEEE Transactions on Biomedical Engineering, BME-32, pp 943-950, 1985.

[15] X. D. Huang, Y. Ariki, and M. Jack. Hidden Markov Models for Speech Recognition. Edinburgh University Press, Edinburgh, 1990.

[16] E. Magno Caldognetto, C. Zmarich, P. Cosi and F. Ferrero. Italian Consonantal Visemes: Relationships Between Spatial/temporal Articulatory Characteristics and Coproduced Acoustic Signal. In Proceedings of AVSP-97, Tutorial & Research Workshop on Audio-Visual Speech Processing: Computational & Cognitive Science Approaches, Rhodes (Greece), pp. 5-8, 26-27 September 1997.

[17] N. Mana, P. Cosi, G. Tisato, F. Cavicchio, E. Magno and F. Pianesi. An Italian Database of Emotional Speech and Facial Expressions. In Proceedings of the Workshop on Emotion: Corpora for Research on Emotion and Affect, in association with the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, May 2006.

[18] H. McBreen and M. Jack. Evaluating Humanoid Synthetic Agents in e-retail Applications. IEEE Transactions on Systems, Man and Cybernetics, 31(5), 2001.

[19] H. M. McBreen, P. Shade, M. A. Jack, and P. J. Wyard. Experimental Assessment of the Effectiveness of Synthetic Personae for Multi-Modal E-Retail Applications. In Proceedings of the Fourth International Conference on Autonomous Agents, Barcelona, Spain, 2000.

[20] S. Narayanan and A. Alwan. Text to Speech Synthesis: New Paradigms and Advances. Prentice Hall Press, 2004.

[21] L.R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In Proceedings of the IEEE, 77(2), pp. 257-286, 1989.

[22] R. Rickenberg and B. Reeves. The Effects of Animated Characters on Anxiety, Task Performance, and Evaluations of User Interfaces. In Proceedings of CHI 2000, 2000.

[23] D. Sankoff and J.B. Kruskal. Time warps, string edits, and macromolecules: The theory and practice of sequence comparison. Addison-Wesley Publishing Company, Reading, MA, 1983.
