
Noname manuscript No. (will be inserted by the editor)

Representing Affective Facial Expressions for Robots and Embodied Conversational Agents by Facial Landmarks

Caixia Liu · Jaap Ham · Eric Postma · Cees Midden · Bart Joosten · Martijn Goudbeek

Received: date / Accepted: date

Abstract Affective robots and embodied conversational agents require convincing facial expressions to make them socially acceptable. To be able to virtually generate facial expressions, we need to investigate the relationship between technology and human perception of affective and social signals. Facial landmarks, the locations of the crucial parts of a face, are important for perception of the affective and social signals conveyed by facial expressions. The goal of our study is to determine to what extent facial landmarks are sufficient to represent facial expressions of emotions. If facial landmarks contain sufficient information, it should be possible to recognize and generate emotional expressions based on the landmarks only. This study focuses on the human recognition of emotional expressions from landmark sequences. Using face-analysis software, we digitally extracted landmark sequences from full-face videos of acted emotions. The landmark sequences were presented to sixteen participants who were instructed to classify the sequences according to the emotion represented. The results of the experiment revealed that participants were able to recognize the emotion reasonably well from the landmark sequences, suggesting that landmarks contain information about the expressed emotions. The implications of our findings for the virtual generation of facial expressions in robots and embodied conversational agents are discussed. We conclude by stating that landmarks provide a sufficient basis for the virtual generation of emotions in humanoid agents.

Caixia Liu · Jaap Ham · Cees Midden
Human-Technology Interaction Group, Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands
E-mail: {c.liu, j.r.c.ham, c.j.h.midden}@tue.nl

Caixia Liu · Eric Postma · Bart Joosten · Martijn Goudbeek
Tilburg center for Cognition and Communication, Tilburg University, Tilburg, The Netherlands
E-mail: {e.o.postma, b.joosten, m.b.goudbeek}@tilburguniversity.edu


Keywords Robots · Embodied Conversational Agents · Emotion · Facial expression · Facial landmarks · FaceTracker · Perception

1 Introduction

Socially-aware robots and embodied conversational agents (ECAs) require accurate recognition and generation of facial expressions. The automatic recognition of facial expressions is an active field of research in affective computing and social signal processing [1]. The virtual generation of facial expressions is still in its infancy.

Human faces have only a limited range of movements. Facial expressions may consist of fairly subtle changes in the proportions and relative positions of the facial features. Emotional facial expressions convey social signals, which are dominant units of social communication among humans. For example, a smile can indicate approval or happiness, while a frown can signal disapproval or unhappiness. The six basic human emotions are anger, disgust, fear, happiness, sadness, and surprise. Carroll and Russell [2] defined two emotional dimensions that can be read from faces: arousal and valence. The dimensions of arousal and valence are orthogonal, and emotions are organized in a circular pattern forming the circumplex model. The distance between two emotional expressions on the circle defines their dissimilarity. The six basic emotions can be characterised in terms of levels of arousal and valence. For instance, the facial expression of sadness is associated with a low level of arousal and a negative valence. Figure 1 displays the arousal-valence space and the locations of the six basic emotions on the circumplex.
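To make the circumplex representation concrete, the following minimal sketch (in Python) places the six basic emotions at illustrative angles on the unit circle of valence and arousal and computes dissimilarity as the distance between their positions. The angles are assumptions chosen for illustration, not values taken from Carroll and Russell [2].

```python
import math

# Illustrative circumplex angles (degrees) for the six basic emotions.
# These placements are assumptions for this sketch, not values from [2]:
# positive valence lies at 0 degrees, high arousal at 90 degrees.
EMOTION_ANGLES = {
    "happiness": 10,
    "surprise": 80,
    "fear": 110,
    "anger": 135,
    "disgust": 165,
    "sadness": 230,
}

def circumplex_position(emotion):
    """Return (valence, arousal) coordinates of an emotion on the unit circle."""
    theta = math.radians(EMOTION_ANGLES[emotion])
    return math.cos(theta), math.sin(theta)

def dissimilarity(emotion_a, emotion_b):
    """Dissimilarity as the chord distance between two emotions on the circumplex."""
    (va, aa), (vb, ab) = circumplex_position(emotion_a), circumplex_position(emotion_b)
    return math.hypot(va - vb, aa - ab)

# Sadness (low arousal, negative valence) lies far from happiness,
# whereas fear and anger (both high arousal, negative valence) lie close together.
print(round(dissimilarity("sadness", "happiness"), 2))
print(round(dissimilarity("fear", "anger"), 2))
```

Distances computed this way reproduce the qualitative structure of Figure 1: emotions with similar arousal and valence lie close together on the circle.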

Fig. 1: Schematic representation of the circumplex model (redrawn after [2]).

Representing Affective Facial Expressions by Facial Landmarks 3

1.1 Facial expressions in robots and ECAs

The virtual generation of facial expressions in robots and ECAs is a rapidly developing field of research. In the domain of robotics, a notable example is Kismet [3], a socially-aware robot created by Breazeal at MIT. Kismet is equipped with facial features (ears, eyes, eyebrows, and lips) that are controlled by software to generate facial expressions. The circumplex model guided the development of the control algorithms. Figure 2 gives an impression of Kismet's appearance. Kismet was shown to be successful in establishing social contact with human observers. Due to the limited degrees of freedom of Kismet's facial features, however, its facial expressions are not as complex as those of humans.

In ECAs, richer facial expressions may be generated by cloning the expressions of actors by means of motion capture (especially in the movie industry) or by means of real-time digital cloning of human facial expressions (from video) onto the face of an avatar [4]. Recent work on the generation of realistic facial expressions relies on three-dimensional models of human heads and on models of the muscles controlling the face. An alternative and more perceptually motivated approach is to define a representation space of facial expressions that can be extracted from perceivable features of faces.

Fig. 2: Impression of Kismet [3].

1.2 Representing facial expressions

In a famous study by Johansson [5], participants watched videos of lights attached to the joints of walking people against an otherwise black background. Participants were able to identify familiar persons from their gait as reflected in the dynamics of the lights, as well as properties such as their gender and the nature of the task in which they were engaged. Apparently, the dynamics of light sources at informative locations (the joints) contain sufficient information to represent gait. Similarly, tiny light sources attached to the informative locations of the face may be sufficient to represent facial expressions. This idea was put to the test by Bassili [6].


Table 1: Bassili's emotion recognition results for full faces/dot-patterned faces, expressed as percentages of correct recognition (reproduced from [6])

                              Reported emotion
                   Happiness  Sadness  Fear    Surprise  Anger   Disgust
Displayed  Happiness  31/31     6/13    0/6     38/31     19/6     6/13
emotion    Sadness    13/0     56/25    0/25     0/13      0/0    31/37
           Fear        0/0      0/19   69/6     13/25      6/13   12/37
           Surprise    0/0      0/6     6/0     94/75      0/6     0/13
           Anger       6/0      0/13    0/0      0/6      50/6    44/75
           Disgust     0/6      0/6     0/19     0/19      6/6    88/57

In his study, the faces of some participants (the actors) were painted black, with about 100 small white dots of paint superimposed, forming an approximately uniform grid of white dots covering the entire face. The actors were instructed to sequentially express the six basic emotions (happiness, sadness, surprise, fear, anger and disgust). The remaining participants (the raters) were instructed to identify the emotions. The experimental setting ensured that the raters could only see the dots and not the texture of the surrounding black skin. The results showed that the raters could identify the six emotional expressions reasonably well. Table 1 contains a reproduction of the results obtained by Bassili. The rows list the emotions displayed by the actors and the columns the emotions reported by the raters. The entries in the table represent the percentages of correct recognition of the emotions for the fully displayed faces (left of the slash) and their white-dot representations (right of the slash).

1.3 Goals of our study

In the current study, we used FaceTracker software [7] to extract facial-landmark sequences from full-face videos. The extracted landmarks were visualized as image sequences of white dots against a dark background. The image sequences were converted to videos. With the videos so obtained, we investigate the accuracy of perception of expressed emotions from the dynamics of the landmarks. Our main research question is how well participants recognize the emotions of facial expressions in the facial-landmark videos as compared to their recognition of emotions in their full-face equivalents.

The goal of our study is twofold. One goal is to replicate and extend Bassili's study by using landmark-extraction software rather than painted dots. The other goal is to determine whether the facial landmarks generated by face-tracking software contain sufficient information to be able to recognize emotions and, in future work, to generate emotions in robots or in ECAs. In addition, we aim at relating our results to the circumplex model by investigating and comparing the valence judgments and arousal judgments that participants make about full-face videos and facial-landmark videos.


Fig. 3: Digital extraction of landmarks by FaceTracker. The left image shows the face of one of the authors with a grid of landmarks superimposed. The right image is an illustration of the extracted landmarks as they appear in the landmark video.

1.4 Four hypotheses

In line with Bassili's [6] study, we formulate the following four hypotheses. First, we hypothesize that people can relatively accurately identify emotional facial expressions based on original full-face videos (H1). Second, we hypothesize that people can also relatively accurately identify emotional facial expressions based on facial-landmark videos (H2). Third, we hypothesize that people can identify emotional facial expressions better based on full-face videos than based on facial-landmark videos (H3). Fourth, we expect that participants' judgments about the valence levels and arousal levels of emotional facial expressions displayed by full-face videos will show a strong correlation with those displayed by facial-landmark videos (H4).

2 Digital extraction of facial landmarks

To be able to present participants with facial-landmark videos representing facial expressions, we need information about the locations and movements of the crucial parts of a face while that face is displaying facial expressions. These locations might be calculated based on models of human faces and facial expressions [8,9], but they can also be extracted from real human facial expressions [4].

As stated in the previous section, we used FaceTracker software [7,10] (see Figure 3) to extract this information from full-face videos of actors displaying basic emotions (happiness, sadness, fear, disgust and anger).

FaceTracker operates in real time and returns a mesh of interconnected landmarks that are located at the contours of the eyes, at the nose, at the mouth and at other facial parts. In our experiment, 66 landmarks were used, covering the prominent locations of the face (see Figure 3).
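As an illustration of this pipeline, the sketch below renders tracked landmarks as white dots on a black background, frame by frame, to produce a facial-landmark video. It uses OpenCV for video input and output; `track_landmarks` is a hypothetical stand-in for the FaceTracker output (an array of 66 (x, y) positions per frame), since FaceTracker itself is a C++ library with its own API.

```python
import cv2
import numpy as np

def render_landmark_video(in_path, out_path, track_landmarks):
    """Render a landmark video: white dots on a black background.

    `track_landmarks` is a hypothetical callable standing in for FaceTracker:
    given a BGR frame it returns an (N, 2) array of (x, y) landmark positions,
    or None when tracking fails on that frame.
    """
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        canvas = np.zeros((height, width, 3), dtype=np.uint8)  # black background
        points = track_landmarks(frame)
        if points is not None:
            for x, y in points:
                # Draw each landmark as a small filled white dot.
                cv2.circle(canvas, (int(x), int(y)), 2, (255, 255, 255), -1)
        writer.write(canvas)

    cap.release()
    writer.release()
```

This keeps the output video the same size and frame rate as the input, so the dynamics of the landmarks match the original full-face recording.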


3 Methods

3.1 Participants and design

Sixteen healthy adults participated in the study. None of them were familiar with the purpose of the study. All participants were students at Eindhoven University of Technology with Dutch as their native language. Their average age was 25.6 years (SD = 10.46). Each participant was presented with ten full-face videos (two actors, one male and one female, each expressing five emotions in five separate videos), and with ten facial-landmark videos (based on two different actors, one male and one female, each expressing five emotions in five separate videos). The two actors in the videos from which the landmark videos were extracted were different from the two actors displayed in the full-face videos, to prevent recognition of a previously-seen expression or actor.

The block of full-face videos and the block of landmark videos were presented one after the other. Both the order of presentation of the blocks and the gender of the actors featuring in the videos were counterbalanced. The design was a within-subject design (each participant was confronted with both conditions, namely full-face videos and facial-landmark videos) and the dependent variable was recognition accuracy or classification performance.

Thus, even though each participant saw different actors in the full-face videos and the facial-landmark videos, across participants the same set of actors appeared in both conditions. As a consequence, the final results could not be influenced by differences in the acting skills of the individual actors.
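A minimal sketch of such a counterbalancing scheme is given below; the participant count follows the experiment, but the actor labels and condition names are illustrative assumptions.

```python
from itertools import cycle, product

PARTICIPANTS = range(1, 17)                     # sixteen participants
ACTOR_PAIRS = [("F1", "M1"), ("F2", "M2")]      # hypothetical labels: one female, one male per pair
BLOCK_ORDERS = ["full_face_first", "landmarks_first"]

# Cycle through the four combinations of block order and actor-pair assignment,
# so each combination occurs equally often across the sixteen participants.
conditions = cycle(product(BLOCK_ORDERS, ACTOR_PAIRS))

assignment = {}
for participant in PARTICIPANTS:
    block_order, full_face_actors = next(conditions)
    # The other actor pair supplies the facial-landmark videos, so no participant
    # sees the same actor in both conditions.
    landmark_actors = ACTOR_PAIRS[1 - ACTOR_PAIRS.index(full_face_actors)]
    assignment[participant] = {
        "block_order": block_order,
        "full_face_actors": full_face_actors,
        "landmark_actors": landmark_actors,
    }

print(assignment[1])
```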

3.2 Stimulus materials

The stimulus materials were based on the GEMEP corpus [11,12], featuring full-face videos of actors exhibiting emotional expressions. For the full-face videos, we used four actors (two male and two female). Of each actor, we used five short videos (average length of two seconds), each showing the face and upper torso of the actor, while the actor acted as if he or she experienced an assigned emotion. That is, for each actor, we used five videos representing five different emotions: happiness, sadness, fear, disgust, and anger.

Based on these full-face videos, we constructed the facial-landmark videos by applying the FaceTracker software to extract 66 landmarks, located at the eyebrows, eyes, nose, mouth and the face profile. From these landmarks we constructed facial-landmark videos showing white points on a black background. Each facial-landmark video was based on one full-face video. In total, we employed four (different actors) times five (different emotions) full-face videos of actors expressing emotions, and four times five facial-landmark videos of the same emotions.


Within each video (full-face video or facial-landmark video), we arranged the video segments such that the emotion expression was displayed three times. Each participant was shown full-face videos of two actors, with five videos per actor (expressing the five basic emotions), and facial-landmark videos based on the other two actors in our set of four actors, again with five videos per actor (to prevent interference as a result of identification of the actor).

3.3 Procedure

Participants participated individually, in a cubicle that contained a desktop computer and a display. All instructions and stimulus materials were shown on the computer display, and the experiment was completely computer controlled. Each participant was instructed that he or she would be shown twenty short videos of faces expressing emotions, and that sometimes it would be a full-face video and sometimes a facial-landmark video. Each video was shown to the participants three times. Also, participants were instructed that after each video they would be asked three questions about the emotion expressed by the face in the video. Each of these three questions was explained. Each participant was presented with ten full-face videos and, on different screens, with ten facial-landmark videos (see Figure 4). Half of the observers were presented with the full-face videos first and the facial-landmark videos second. The other observers were presented with the facial-landmark videos first and the full-face videos second. Within each set of five emotions, the five videos were presented in a different random order.

3.4 Questions

For each of the videos, participants were first shown the video and then, on the next page, asked the three questions. In the first question, the participant was asked to identify the emotion expressed in the video by selecting one of six options ('happiness', 'sadness', 'fear', 'disgust', 'anger' and 'don't know'). In the second and third questions, the participant was asked to rate the valence level of the expressed emotion (1 = negative, to 7 = positive, or 'don't know') and the arousal level of the expressed emotion (1 = low arousal, to 7 = high arousal, or 'don't know'). The questions could only be answered one at a time, and a participant could not return to an earlier question.
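For completeness, a minimal sketch of how each trial's answers could be recorded is shown below; the field and label names are illustrative, but the response options and 1-7 scales follow the questions described above.

```python
from dataclasses import dataclass
from typing import Optional

EMOTION_OPTIONS = ["happiness", "sadness", "fear", "disgust", "anger", "don't know"]

@dataclass
class TrialResponse:
    """Responses to the three questions asked after each video."""
    participant_id: int
    video_id: str                 # e.g. "actor_F1_fear_landmarks" (hypothetical label)
    reported_emotion: str         # one of EMOTION_OPTIONS
    valence: Optional[int]        # 1 (negative) .. 7 (positive), None for "don't know"
    arousal: Optional[int]        # 1 (low) .. 7 (high), None for "don't know"

    def __post_init__(self):
        # Basic sanity checks on the recorded answers.
        assert self.reported_emotion in EMOTION_OPTIONS
        assert self.valence is None or 1 <= self.valence <= 7
        assert self.arousal is None or 1 <= self.arousal <= 7

# Example record for a single trial.
response = TrialResponse(1, "actor_F1_fear_landmarks", "fear", 2, 6)
```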


Fig. 4: An example of a frame of a full-face video and a frame of a facial-landmark video

Table 2: Participants' identification of the emotion displayed in full-face videos or facial-landmark videos.

                              Reported emotion
                   Happiness  Sadness  Fear    Disgust  Anger     Don't know
Displayed  Happiness  91/53     0/3     3/6      0/9      0/0        6/28
emotion    Sadness     0/3     66/38    6/16    13/6      0/9       16/28
           Fear        0/22    13/9    78/9      0/3      9/16       0/41
           Disgust     0/9     29/16    3/9     65/13     0/16       3/38
           Anger       0/16     0/0     0/9      0/6    100/56       0/13

Fig. 5: Recognition accuracies (percent correct) for full-face videos. Blue bars: Bassili's results; red bars: our results.

4 Results

The results of our experiment are listed in Table 2 and visualized in Figures 5 and 6. Table 2 has the same format as Table 1. The numbers represent the percentages of correct recognition for full-face videos (left of the slash) and landmark videos (right of the slash). Each row represents the responses of 16 participants in each of the two conditions.

Figures 5 and 6 show the same results. Figure 5 is a bar plot showing the percentages of correct recognition for each emotion in Bassili's experiment (blue bars) and in our experiment (red bars). Figure 6 shows the same results for the landmark videos.

Fig. 6: Recognition accuracies (percent correct) for landmark videos. Blue bars: Bassili's results; red bars: our results.

Overall, better recognition accuracies were obtained in our experiment than in Bassili's experiment, except for the emotional expression of disgust. Presumably, these differences are due to the different ways in which the faces were represented and to the use of professional actors in our data set.

The experimental hypotheses were primarily targeted at simple effects, which are emphasized below. The participants' identification of the emotion displayed in the full-face videos and the facial-landmark videos is shown in Table 2.

4.1 Discussion of Hypothesis 1

Our first hypothesis (H1) stated that people can relatively accurately identify emotional facial expressions based on original full-face videos. The average percentage of correct classification across the five emotions was 80%, and ranged from 65% (disgust) to 100% (anger), where 16.7% accuracy would be expected by chance. The average accuracy level differed significantly from that expected by chance, χ2(1) = 39.75, p < .0001. These results suggest that participants could relatively accurately identify the facial expressions shown in the full-face videos of human actors, supporting our first hypothesis.

4.2 Discussion of Hypothesis 2

The second hypothesis (H2) stated that people can also relatively accurately identify emotional facial expressions based on facial-landmark videos. The mean accuracy level for the recognition of emotions displayed in the facial-landmark videos was 33.8% and ranged from 9% (fear) to 56% (anger). This average accuracy level was greater than expected by chance, χ2(1) = 3.81, p < .05, hence supporting our second hypothesis.
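A sketch of the kind of goodness-of-fit test used for H1 and H2 is given below, assuming 16 participants x 10 videos = 160 trials per condition and a chance level of 1/6 (five emotions plus 'don't know'). The counts are reconstructed from the reported percentages, and this simple per-trial test is not expected to reproduce the exact reported χ2 values, which presumably take the repeated-measures structure of the data into account.

```python
from scipy.stats import chisquare

def chance_test(accuracy, n_trials, n_options=6):
    """Test observed correct/incorrect counts against chance-level responding."""
    observed = [accuracy * n_trials, (1 - accuracy) * n_trials]
    expected = [n_trials / n_options, n_trials * (n_options - 1) / n_options]
    return chisquare(f_obs=observed, f_exp=expected)

# Illustrative counts: 16 participants x 10 videos = 160 trials per condition.
print(chance_test(0.80, 160))    # full-face condition, 80% correct on average
print(chance_test(0.338, 160))   # facial-landmark condition, 33.8% correct on average
```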


Table 3: Participants' evaluation of the valence and arousal levels of the facial emotional expressions.

                       Reported valence            Reported arousal
                       (negative - positive)       (low - high)
Displayed  Happiness        6.4/4.9                    5.8/4.6
emotion    Sadness          2.4/3.2                    3.2/2.8
           Fear             2.0/3.8                    5.7/4.1
           Disgust          1.9/3.6                    5.7/4.3
           Anger             1.3/2.8                    6.6/5.8

4.3 Discussion of Hypothesis 3

The third hypothesis (H3) was that people can identify emotional facial expressions better based on full-face videos than based on facial-landmark videos. In line with Bassili, but now using computer-generated facial-landmark videos, the results supported our third hypothesis. That is, a comparison of the overall accuracy rates on the full-face videos and the facial-landmark videos revealed that the former was significantly higher than the latter, χ2(1) = 22.20, p < .001. These results suggest that facial expressions represented by landmarks can be useful in the differentiation of emotions, but that the addition of other kinds of information might provide considerable help in the task.

Furthermore, because the emotions displayed by our actors may not have been optimally recognizable, we ascertained whether the structure of the errors in the full-face condition and the facial-landmark condition was similar. The correlation between the responses given for full-face videos and facial-landmark videos (as shown in each cell of Table 2) was r = .77, p < .001. This indicates that participants' errors were far from random, and that the errors they made in the facial-landmark videos were similar to those they made in the full-face videos.
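The error-structure comparison can be approximated directly from the cells of Table 2, as in the sketch below; depending on how the cells are aggregated, the resulting coefficient may differ somewhat from the reported r = .77.

```python
from scipy.stats import pearsonr

# Cells of Table 2 (percentages), row by row: happiness, sadness, fear, disgust, anger;
# columns: happiness, sadness, fear, disgust, anger, don't know.
full_face = [91, 0, 3, 0, 0, 6,
             0, 66, 6, 13, 0, 16,
             0, 13, 78, 0, 9, 0,
             0, 29, 3, 65, 0, 3,
             0, 0, 0, 0, 100, 0]
landmarks = [53, 3, 6, 9, 0, 28,
             3, 38, 16, 6, 9, 28,
             22, 9, 9, 3, 16, 41,
             9, 16, 9, 13, 16, 38,
             16, 0, 9, 6, 56, 13]

r, p = pearsonr(full_face, landmarks)
print(f"r = {r:.2f}, p = {p:.4f}")
```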

4.4 Discussion of Hypothesis 4

Our fourth and final hypothesis (H4) stated that participants' judgments about the valence levels and arousal levels of emotional facial expressions displayed by full-face videos would show a strong correlation with those displayed by facial-landmark videos. Our results also supported this hypothesis. That is, the correlation between the valence evaluations given for full-face videos and facial-landmark videos (as shown in Table 3) was r = .91, p < .05, and the correlation between participants' arousal evaluations of the full-face videos and the facial-landmark videos was r = .93, p < .05.

As can be seen in Table 3, the numbers represent the average levels of participants' valence and arousal evaluations when shown videos of the emotion described by the row label; numbers to the left of the slash are responses to the full-face videos, and those to the right are responses to the facial-landmark videos. Each row represents the responses of sixteen participants in each of the two conditions.
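The valence and arousal correlations reported for H4 can be checked from the means in Table 3, as sketched below.

```python
from scipy.stats import pearsonr

# Mean ratings from Table 3, ordered: happiness, sadness, fear, disgust, anger.
valence_full      = [6.4, 2.4, 2.0, 1.9, 1.3]
valence_landmarks = [4.9, 3.2, 3.8, 3.6, 2.8]
arousal_full      = [5.8, 3.2, 5.7, 5.7, 6.6]
arousal_landmarks = [4.6, 2.8, 4.1, 4.3, 5.8]

r_val, p_val = pearsonr(valence_full, valence_landmarks)
r_aro, p_aro = pearsonr(arousal_full, arousal_landmarks)
print(f"valence: r = {r_val:.2f} (p = {p_val:.3f})")  # reported as r = .91
print(f"arousal: r = {r_aro:.2f} (p = {p_aro:.3f})")  # reported as r = .93
```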

The two plots in Figure 7 illustrate the location of each emotional facial expression in a two-dimensional space of valence (x-axis) and arousal (y-axis). The top plot represents the results for the full-face videos and the bottom plot the results for the facial-landmark videos. Figure 7 suggests that the emotional facial expressions of full-face videos and facial-landmark videos are judged to lie in the same region of this two-dimensional space of valence and arousal; the valence and arousal levels for full-face videos and facial-landmark videos are similar. The results shown in Figure 7 may be compared to the “ideal” of the circumplex model shown in Figure 1. Clearly, there is a better match for the full-face videos (top of Figure 7) than for the facial-landmark videos (bottom of Figure 7).

Fig. 7: The location of each emotional expression evaluated in the full-face videos (top) and facial-landmark videos (bottom), represented in a two-dimensional space of valence (x-axis) and arousal (y-axis).

5 Discussion and conclusion

We found that participants were able to recognize emotional facial expressions from severely reduced landmark representations. At the same time, emotion recognition for full-face videos was better than for the reduced landmark videos. Apparently, the landmarks form a sufficient but not complete representation of emotional facial expressions. This may be due to the limited number of landmarks, but a more likely explanation is the lack of texture and color information in the landmark videos. It is known that eyebrows and color are important for face recognition [13]. The same may apply to the recognition of emotional expressions.

In line with earlier findings, our results suggest that participants could accurately identify the emotional facial expressions in facial-landmark videos (though less accurately than those expressed in full-face videos). This leads to the conclusion that digitally-extracted facial landmarks form an appropriate representation for generating realistic (recognizable) facial expressions of emotions in robots and ECAs.

Acknowledgements We wish to express our gratitude to Ruud Mattheij, Peter Ruijten, and the Persuasive Technology Lab Group at TU/e for the fruitful discussions about this work.

References

1. Vinciarelli, A., Pantic, M., & Bourlard, H.: Social Signal Processing: Survey of an Emerging Domain. Image and Vision Computing. 27(12), 1743-1759. (2009).

2. Carroll, J., & Russell, J. A.: Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology. 70, 205-218. (1996).

3. Breazeal, C.: Designing Sociable Robots. MIT Press, Cambridge, MA. (2002).

4. Saragih, J. M., Lucey, S., Cohn, J. F., & Court, T.: Real-time avatar animation from a single image. IEEE International Conference on Automatic Face and Gesture Recognition (FG). (2011).


5. Johansson, G.: Visual motion perception. Scientific American. 232, 76-88. (1975).

6. Bassili, J. N.: Facial motion in the perception of faces and of emotional expression. Journal of Experimental Psychology: Human Perception and Performance. 4, 373-379. (1978).

7. Saragih, J., Lucey, S., & Cohn, J.: Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision. 91, 200-215. (2011).

8. Alexander, O., Rogers, M., Lambeth, W., Chiang, M., & Debevec, P.: Creating a photoreal digital actor: the Digital Emily project. 2009 Conference for Visual Media Production. 176-187. (2009).

9. Yang, C., & Chiang, W.: An interactive facial expression generation system. Multimedia Tools and Applications. Springer Science+Business Media. (2007).

10. Lucey, P., Lucey, S., & Cohn, J. F.: Registration invariant representations for expression detection. 2010 International Conference on Digital Image Computing: Techniques and Applications. 255-261. (2010).

11. Banziger, T., Mortillaro, M., & Scherer, K. R.: Introducing the Geneva Multimodal Expression corpus for experimental research on emotion perception. Emotion. Advance online publication. doi:10.1037/a0025827. (2011).

12. Banziger, T., & Scherer, K. R.: Introducing the Geneva Multimodal Emotion Portrayal (GEMEP) corpus. In K. R. Scherer, T. Banziger, & E. B. Roesch (Eds.), Blueprint for Affective Computing: A Sourcebook (pp. 271-294). Oxford, England: Oxford University Press. (2010).

13. Sinha, P., Balas, B., Ostrovsky, Y., & Russell, R.: Face Recognition by Humans: Nineteen Results All Computer Vision Researchers Should Know About. Proceedings of the IEEE. 94(11), 1948-1962. (2006).

14. Lucey, S., Wang, Y., Saragih, J., & Cohn, J.: Non-rigid face tracking with enforced convexity and local appearance consistency constraint. Image and Vision Computing. 28, 781-789. (2010).

15. Saragih, J., Lucey, S., & Cohn, J.: Face alignment through subspace constrained mean-shifts. IEEE International Conference on Computer Vision. 1034-1041. (2009).

16. Saragih, J., Lucey, S., & Cohn, J.: Deformable model fitting with a mixture of local experts. International Conference on Computer Vision. (2009).

17. Fong, T., Nourbakhsh, I., & Dautenhahn, K.: A survey of socially interactive robots. Robotics and Autonomous Systems. 42, 143-166. (2003).

18. Xiao, J., Chai, J., & Kanade, T.: A closed-form solution to non-rigid shape and motion recovery. International Journal of Computer Vision. 67(2), 233-246. (2006).

19. Russell, J. A.: Reading emotions from and into faces: Resurrecting a dimensional-contextual perspective. In J. A. Russell & J. M. Fernandez-Dols (Eds.), The Psychology of Facial Expressions (pp. 295-320). New York, NY: Cambridge University Press. (1997).

20. Aviezer, H., Hassin, R. R., Bentin, S., & Trope, Y.: Putting Facial Expressions Back in Context. In N. Ambady & J. J. Skowronski (Eds.), First Impressions (pp. 255-286). New York, NY: The Guilford Press. (2008).