
Neurocomputing 74 (2011) 1272–1282


Differential optical flow applied to automatic facial expression recognition

A. Sánchez, J.V. Ruiz, A.B. Moreno, A.S. Montemayor, J. Hernández, J.J. Pantrigo

Departamento de Ciencias de la Computación, Universidad Rey Juan Carlos, c/ Tulipán, s/n, 28933 Móstoles-Madrid, Spain

Article info

Available online 19 October 2010

Keywords:

Facial expression analysis

Optical flow methods

Facial feature points

Expression motion extraction

Support vector machines


Abstract

This work systematically compares two optical flow-based facial expression recognition methods. The first one is featural and selects a reduced set of highly discriminant facial points, while the second one is holistic and uses many more points that are uniformly distributed on the central face region. The two approaches are referred to as feature point tracking and holistic face dense flow tracking, respectively. They compute the displacements of different sets of points along the sequence of frames describing each facial expression (i.e. from neutral to apex). First, we evaluate our algorithms on the Cohn–Kanade database for the six prototypic expressions under two different spatial frame resolutions (original and 40%-reduced). Later, our methods were also tested on the MMI database, which presents higher variability than the Cohn–Kanade one. The results on the first database show that the dense flow tracking method at the original resolution slightly outperformed, on average, the recognition rates of the feature point tracking method (95.45% against 92.42%), but it requires 68.24% more time to track the points. For the patterns of the MMI database, using dense flow tracking at the original resolution, we achieved very similar average success rates.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Automatic facial expression analysis (AFEA) is becoming an increasingly important research field, derived from automatic face recognition, due to its multiple applications: human–computer intelligent interfaces (HCII), video games, human emotion analysis, talking heads or educational software, among others [1,2]. Due to the practical interest of this area, many papers have been published on facial expression recognition in the last 15 years [3–6]. Different survey works describing the state of the art in expression recognition from the computer vision perspective have appeared periodically [2,7–10]. AFEA systems typically deal with the recognition and classification of facial expression data, which are given by one of these two kinds of patterns: facial action units and emotion expressions.

Facial action units (AUs): correspond to subtle changes in local facial features related to specific facial muscles (e.g. lip corner depressor, inner brow raiser). They form the facial action coding system (FACS), which is a classification of the possible facial movements or deformations without being associated with specific emotions. Descriptions of the AUs were initially presented in Ekman and Friesen [11]. These AUs can appear individually or in combination with others.

Emotion expressions: are facial image feature configurations of the most prototypic basic emotions (disgust, fear, joy, surprise, sadness and anger), which are universal across different races and cultures. Each emotion corresponds to a given set of prototypic facial expression features [11].

Many papers in the AFEA literature perform the analysis or recognition of expressions by considering both types of patterns [2,10]. Examples of works related to the recognition of some AUs are Lien et al. [12] or Tian et al. [13]. Examples of works which deal with emotion expression feature analysis are Lyons and Budynek [14] or Wen and Huang [15]. There are also papers which consider the classification of facial expressions using both AUs and emotion expressions [16]. Our work only uses the emotion expression feature patterns extracted from facial expressions in video sequences for their classification.

In order to train and evaluate the proposed AFEA systems, the development of algorithms that are robust to individual differences in expressions requires databases containing variability of subjects with respect to race, age, gender, illumination conditions, pose, partial occlusions of the face, etc. Some examples of relevant facial expression databases for this task are: the Cohn–Kanade database [17], the Japanese Female Facial Expression (JAFFE) database [4], the MMI database [18] and the Belfast Naturalistic Emotional database [19]. Different survey papers [1,2,7,20] summarize the main facial expression analysis techniques and algorithms.

Fig. 1. Flow diagram of the proposed facial expression recognition approach.


In general, many different approaches in this field consider three typical stages in an AFEA system [20]: (a) face detection and normalization, (b) feature extraction and (c) expression classification.

A work by Tian et al. [20] classifies the facial features that model the facial changes into: (a) deformation features and (b) movement features. Deformation features do not take pixel movement information into account, and they can be obtained from static images. Movement features focus on facial movements and are applied to video sequences. The most relevant techniques using these features are: (a) movement models [21], (b) difference images [20], (c) marker trackers [22], (d) feature point tracking [12,23] and (e) dense optical flow [24]. The last two methods are extensively evaluated and compared in this work. The integration of optical flow with movement models increases the system stability and improves the facial movement interpretation and the related facial expression analysis.

Dense optical flow computed in rectangular regions was used in Mase [25] to estimate the activity of 12 facial muscles. Among the feature point tracking based methods, a representative work is Tian et al. [13], where lip, eye, eyebrow and cheek models were proposed and feature point tracking was performed to match the model contours to the facial features. A spatio-temporal description which integrated dense optical flow, feature point tracking and high gradient component analysis in the same hybrid system was proposed in Lien et al. [12], where hidden Markov models (HMMs) were used for the recognition of 15 AUs. They also needed a manual detection of points in the neutral face.

Also related to our approach, different papers have recently proposed hybrid techniques to increase the robustness of the optical flow tracking of facial points in video sequences. Tai and Huang [23] combined cross-correlation with optical flow, applied between successive frames on neighboring regions of the tracked points, with the goal of achieving an improved estimation of the feature point positions. Shin and Chun [26] used the facial point displacements as input parameters of HMMs to effectively recognize the basic facial expressions in real time. Two other closely related applications of tracking the movement of subsets of facial points are the imitation of facial expressions [27] and video-based automatic speechreading [28].

The rest of the paper is organized as follows. Section 2 describes the two proposed optical flow-based expression recognition systems. Section 3 presents the different experiments performed using two facial expression databases (Cohn–Kanade and MMI, respectively) for testing and comparing the proposed video-based expression classification methods. Finally, Section 4 states the conclusions of this work.

2. Proposed facial expression recognition system

The proposed recognition system uses two tracking strategies based on differential optical flow: the first one is feature-based and considers the movement of 15 feature points, while the second one is holistic and uses the displacements of facial points that are densely and uniformly placed on a grid centered on the central face region. Our system consists of four modules: pre-processing, feature point tracking, dense flow point tracking and facial expression classification. Fig. 1 represents the components of our facial expression recognition system. The following subsections detail the stages involved in each module.

2.1. Pre-processing

The pre-processing module is different for the two methods. In the feature point tracking method, we first manually select the two inner eye corner points, which are used to define a reference vector $\vec{n}_v$ for normalizing the modules of the global point-displacement vectors and for correcting the orientation angles of these vectors. The module of $\vec{n}_v$ and the angle between this vector and the horizontal axis $\vec{w} = (1,0)^T$ are, respectively, computed as

$$|\vec{n}_v| = \sqrt{n_{v_x}^2 + n_{v_y}^2}, \qquad \theta_{\vec{n}_v,\vec{w}} = \cos^{-1}\left(\frac{\vec{n}_v \cdot \vec{w}}{|\vec{n}_v|\,|\vec{w}|}\right) \quad (1)$$

Finally, the global displacement of any feature point $\vec{p}$ is normalized by dividing $|\vec{p}|$ by $|\vec{n}_v|$. Its normalized angle is computed by adding to $\theta_{\vec{p}}$ (i.e. the orientation of vector $\vec{p}$) the positive or negative value corresponding to $\theta_{\vec{n}_v,\vec{w}}$. In this way, two normalized features are obtained for each of the tracked feature points. With this pre-processing, the influence of small variations in scale and pose is reduced.
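To make this normalization step concrete, the following Python sketch (an illustration under our own assumptions, not the authors' MATLAB implementation; function names are hypothetical) computes the reference module and tilt angle of Eq. (1) from the two inner eye corners and applies them to a point's global displacement.

```python
import numpy as np

def normalization_reference(right_eye_inner, left_eye_inner):
    """Module of the reference vector nv joining the two inner eye corners and
    its signed angle with the horizontal axis w = (1, 0)^T, as in Eq. (1)."""
    nv = np.asarray(left_eye_inner, float) - np.asarray(right_eye_inner, float)
    nv_module = np.hypot(nv[0], nv[1])
    w = np.array([1.0, 0.0])
    angle = np.arccos(np.dot(nv, w) / (nv_module * np.linalg.norm(w)))
    return nv_module, np.copysign(angle, nv[1])   # sign follows the vertical tilt

def normalize_displacement(p, nv_module, tilt_angle):
    """Scale-normalized module and tilt-corrected angle of a displacement p."""
    p = np.asarray(p, float)
    return np.hypot(p[0], p[1]) / nv_module, np.arctan2(p[1], p[0]) + tilt_angle
```

A call such as `normalize_displacement(dp, *normalization_reference(rei, lei))` would then yield the two normalized features of one tracked point.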

In our feature point tracking approach, the 15 tracked feature points (the two eye inner corners are also marked, but they are only used for face normalization purposes) are extracted from the first frame of the sequence, and no face region segmentation stage is required. This set of feature points is represented in Fig. 2, where the points are located in the following facial regions: four in the eye region (two points on each eye, placed on the upper and lower eyelids), four on the eyebrow regions (two points per eyebrow, the innermost and outermost ones), four on the mouth (corresponding to both corners and the upper- and lower-lip middle points), and three other detectable points (one placed on the chin and the other two symmetrically placed, one on each cheek). Table 1 describes the acronyms of the points in Fig. 2. The selection of this specific subset of points is based both on their significant displacements when the different expressions are produced and on their high intensity contrast with respect to their surrounding regions, which makes them more distinguishable. These points are a subset of the MPEG-4 feature point standard [29], which defines a face model in its neutral state and a set of 84 feature points whose movements are used to recognize facial expressions and to animate the face model. Similar subsets of points have been selected in other works like [1,23,30] or


[31]. Note that the eye inner-corner points (REI and LEI, respectively) are only used for face normalization in the feature point tracking (FPT) method. These points are marked with the computer mouse in the first frame of the expression and are then automatically tracked along the rest of the frames of the video sequence describing the expression using the Lucas–Kanade optical flow algorithm [32].

Fig. 2. The 15 selected facial feature points.

Table 1. Acronyms for feature points in Fig. 2.

Acronym Point description

RBO Right eyeBrow Outer corner

RBI Right eyeBrow Inner corner

LBO Left eyeBrow Outer corner

LBI Left eyeBrow Inner corner

RET Right Eye Top point

REB Right Eye Bottom point

LET Left Eye Top point

LEB Left Eye Bottom point

RC Right Cheek point

LC Left Cheek point

MC Middle Chin point

RCM Right Corner of the Mouth

ULC Upper-Lip Center

LCM Left Corner of the Mouth

LLC Lower-Lip Center

Fig. 3. Dense flow tracking pre-processing.


In the dense flow tracking method, face normalization requires manually locating five facial points placed near the face symmetry axis, computing the face angle normalization, and obtaining a rectangular bounding box containing the central face using a heuristic procedure (additionally, only for the MMI database, if a relevant global head motion is present in the video sequence, the nose tip point is used to detect and subtract the effect of head motion). This bounding box is then split into three regions considering standard facial proportions. The dense flow tracking pre-processing procedure is illustrated in Fig. 3, where both eye inner corners are also used for pre-processing.
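The heuristic used to build this bounding box is not fully specified above, so the sketch below only illustrates the general idea under assumptions of our own: the frame is rotated so that a line through two of the marked points becomes horizontal, and a central-face box is cropped using proportions of the inter-point distance (the 0.6/1.0/1.6 factors are illustrative placeholders, not values from the paper).

```python
import numpy as np
import cv2

def align_and_crop_face(frame, right_eye_inner, left_eye_inner):
    """Rotate the frame so the line joining the two marked points is horizontal,
    then crop a central-face box sized as a multiple of their distance.
    The proportion factors below are illustrative assumptions."""
    (rx, ry), (lx, ly) = right_eye_inner, left_eye_inner
    angle = np.degrees(np.arctan2(ly - ry, lx - rx))      # in-plane head tilt
    center = ((rx + lx) / 2.0, (ry + ly) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    h, w = frame.shape[:2]
    upright = cv2.warpAffine(frame, M, (w, h))

    d = np.hypot(lx - rx, ly - ry)                        # inter-point distance
    x0, x1 = int(center[0] - 1.0 * d), int(center[0] + 1.0 * d)
    y0, y1 = int(center[1] - 0.6 * d), int(center[1] + 1.6 * d)
    return upright[max(y0, 0):y1, max(x0, 0):x1]
```

In the paper the box is further split into three regions by standard facial proportions; that split is omitted here because its exact values are not given.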

2.2. Feature point tracking

Expressions are recognized in the presented methods using the sum of displacement vectors for each of the considered facial feature points from each frame to the next one in any expression video sequence. This task can be organized into two sub-stages: feature point location at each frame and optical flow application to these feature points.

The Lucas–Kanade algorithm [32] for computing the optical flow has been used to estimate the displacement of facial points. This algorithm is one of the most popular differential (or gradient-based) methods for motion estimation in a video sequence. It approximates the motion at every pixel position between two frames taken at times $t$ and $t + dt$, assuming brightness constancy. The global displacement vector over the video sequence for each considered point in any type of facial expression is computed by applying the Lucas–Kanade algorithm between pairs of consecutive frames. The corresponding inter-frame displacement vectors are then added to obtain the global displacement vector of each point along the expression (i.e. from neutral face to apex). More formally, let $(p_1, \dots, p_N)$ be the respective spatial image positions of a generic feature point $p$ at frames $1, \dots, N$. The global displacement vector $\vec{v}_p$ of point $p$ is

$$\vec{v}_p = \sum_{i=1}^{N-1} \vec{v}_{p_i} = \sum_{i=1}^{N-1} (p_{i+1} - p_i) \quad (2)$$

whose module $|\vec{v}_p|$ (in pixels) and angle $\theta_{\vec{v}_p}$ are

$$|\vec{p}| = \sqrt{p_x^2 + p_y^2}, \qquad \theta_{\vec{p}} = \tan^{-1}\left(\frac{p_y}{p_x}\right) \quad (3)$$


Fig. 4. (Left) Result of applying feature optical flow on the considered subset of 15 points, and (right) application of the dense flow method.


where $p_x$ and $p_y$ are the respective x- and y-components of $\vec{v}_p$. Next, this module and angle are both normalized for each tracked point (as explained in the pre-processing subsection) using Eq. (1). In this way, two normalized features are extracted for each of the 15 facial points considered. Once this task is completed, a global feature vector $\vec{v}$ of 31 components is created for each facial expression sequence, containing the pairs of displacement features of the points; a natural number $T$ ($T \in \{0, \dots, 5\}$) is also added to the end of the vector to code the label of the corresponding expression:

$$\vec{v} = [\,|\vec{p}_1|, \theta_{\vec{p}_1}, |\vec{p}_2|, \theta_{\vec{p}_2}, \dots, |\vec{p}_N|, \theta_{\vec{p}_N}, T\,] \quad (4)$$

In this vector, $|\vec{p}_i|$ represents the normalized module and $\theta_{\vec{p}_i}$ the normalized angle of the accumulated displacement vector $\vec{p}_i$ corresponding to a given feature point. The whole set of these feature vectors, corresponding to all considered facial video sequences, is properly partitioned into two disjoint files (training and test files), which are used by the SVM to classify the considered types of expressions.

The left image of Fig. 4 corresponds to the last frame of a surprise expression and shows the trajectories (i.e. the bright lines, with the initial point positions in dark and the corresponding final ones in a brighter color) followed by each tracked feature point for a sample face of the Cohn–Kanade database.

2.3. Dense flow point tracking

The accurate extraction of the considered feature points may become difficult and exhausting when these points are marked manually on the first frame of many video sequences (each of these points can be considered correctly selected if it is placed inside a reduced specific region around its "perfect" position). That is why we have also experimented with the tracking of a grid of uniformly distributed points around the central facial region. This region of interest is automatically extracted using the proposed dense flow pre-processing. Since the images in the considered databases have a certain spatial resolution (640×480 in Cohn–Kanade and 720×576 in MMI, respectively) and the neighboring face pixels along consecutive frames present a high temporal correlation, applying the Lucas–Kanade algorithm to every point of the considered facial region becomes computationally very expensive and not useful at all. Therefore, we first applied a two-level Gaussian pyramid to reduce the total number of points to 1/16. Next, we placed a 30×20 grid on the central face region, whose vertices are the face points to be tracked. With these two reductions, the optical flow algorithm is computed on only 600 points instead of on approximately the initial 60,000 points per face. In this way, the facial movement vectors between frames become smoother and equally representative, while the workload is reduced significantly.

Now, the global displacement vector of each analyzed point is computed, and the corresponding modules and angles are obtained in the same way as in the feature point tracking method (in this case, no individual normalization of the point displacements is required after applying the dense flow tracking pre-processing explained previously). Consequently, a global displacement vector of 1201 features (i.e. the module and angle of the 600 tracked points, plus the expression label) is extracted as the pattern for each expression video sequence.
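A minimal sketch of this dense variant, under the same hedges as before (OpenCV as a stand-in, a hypothetical `face_box` given in full-resolution coordinates), could look as follows.

```python
import numpy as np
import cv2

def dense_grid_displacements(frames, face_box, grid=(30, 20)):
    """Dense-flow variant sketch: two pyrDown steps keep ~1/16 of the pixels,
    a 30x20 grid is placed on the central face region, and the grid vertices
    are tracked with Lucas-Kanade. `face_box` = (x0, y0, x1, y1) at full resolution."""
    small = [cv2.pyrDown(cv2.pyrDown(f)) for f in frames]   # two-level Gaussian pyramid
    x0, y0, x1, y1 = (c / 4.0 for c in face_box)            # box coords at reduced scale
    xs, ys = np.linspace(x0, x1, grid[0]), np.linspace(y0, y1, grid[1])
    pts = np.array([(x, y) for y in ys for x in xs], np.float32).reshape(-1, 1, 2)

    total = np.zeros((pts.shape[0], 2), np.float32)
    prev_pts = pts.copy()
    for prev, curr in zip(small[:-1], small[1:]):
        next_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev, curr, prev_pts, None)
        ok = status.ravel() == 1
        total[ok] += (next_pts - prev_pts).reshape(-1, 2)[ok]
        prev_pts = next_pts
    return np.hypot(total[:, 0], total[:, 1]), np.arctan2(total[:, 1], total[:, 0])
```

Each sequence then yields 600 (module, angle) pairs which, together with the expression label, form the 1201-component pattern described above.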

The right image of Fig. 4 corresponds to the last frame of an anger expression and shows the trajectories followed by all tracked points for a sample face of the MMI database.

2.4. Facial expression classification

Support vector machines (SVMs) have been used as a classification method in many face recognition systems [33,34]. Moreover, they have been successfully applied to facial expression classification [35–37]. We therefore adopted them as a good alternative for the considered pattern recognition problem. The SVM is a supervised learning technique used for both classification and regression problems, derived from statistical learning theory. Some interesting properties of the SVM as a classifier are: (1) its ability to work with high-dimensional data and (2) a high generalization performance without the need for a priori knowledge, even when the dimension of the input space is very large. The problem that the SVM tries to solve is finding the maximal-margin separating hyperplane with respect to the training set that correctly classifies the data points, using specific kernel functions, by separating the points of the two classes as much as possible. SVMs have also been generalized to find the set of optimal separating hyperplanes for a multi-class problem [38]. In SVM classification, it is possible to choose the specific kernel mapping suited to the application. However, in many works some common types of kernel functions are used, in particular the Gaussian, linear or polynomial ones. Excellent introductions to this powerful machine learning technique can be found in Cristianini and Shawe-Taylor [39] and Vapnik [40].

We used the SVMTorch package [41] for our facial expression classification experiments. As usual, SVMTorch requires a training and a testing stage. During training, the SVM parameters (i.e. those related to the type of kernel used by the classifier) were systematically adjusted. Once the SVM has been trained, we use the test set of facial expression sequences to compare the performance of the two considered facial expression recognition approaches: feature point tracking and dense flow tracking, respectively. We adopted the one-versus-all multi-class classification technique implemented in SVMTorch. In this approach, binary classifiers are trained to distinguish each type of expression from the other ones, and the binary classifier with the highest output determines to which class the expression is assigned. The SVM parameters were systematically adjusted by line search for linear, polynomial and Gaussian kernels, and the parameters producing the best classification results were selected. In general, we used approximately 75% of the image sequences to train the SVM and the remaining 25% for testing purposes. A k-fold cross-validation stage was also used in the classification system.
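The paper relies on SVMTorch; as an illustrative stand-in only, the sketch below sets up an analogous one-versus-all SVM with a Gaussian (RBF) kernel in scikit-learn. The mapping gamma = 1/(2·std²) and the default std and c values are our assumptions, not SVMTorch settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

def train_expression_classifier(features, labels, std=3000.0, c=100.0):
    """One-versus-all SVM with a Gaussian kernel on the displacement feature
    vectors; roughly 75% of the sequences train the model and 25% test it."""
    X, y = np.asarray(features, float), np.asarray(labels, int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)
    clf = OneVsRestClassifier(SVC(kernel="rbf", gamma=1.0 / (2.0 * std ** 2), C=c))
    clf.fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)   # classifier and test recognition rate
```

The one-versus-rest wrapper mirrors the "binary classifier with the highest output" rule described above.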

3. Experimental results

This section describes and analyzes the facial expression recognition experiments on the two considered databases.

3.1. Facial expression databases and experimental framework

An important aspect to take into account when designing any recognition system (in our case, a facial expression classification


one) is the right choice of the database on which to test it. It is desirable to find standard databases (i.e. commonly used by many researchers in the field), and these datasets should also allow handling the current difficulties of the considered application. Some of the actual difficulties in face recognition systems are described in a recent survey by Kumar [2]. They include the possibility of working with people of different skin colors, the ability to handle a moderate presence of occlusions, robustness against different lighting conditions, and the recognition of spontaneous expressions, among others. The Cohn–Kanade database is currently the most widely used for recognizing facial expressions [10], and it provides variability in the skin color of the subjects as well as small changes in illumination conditions. The MMI database [18] is a newer database that contains both static images and videos of both posed and spontaneous expressions (including possible head movements). Quoting some of its developers: "MMI database is, to our knowledge, the most comprehensive dataset of facial behavior recordings to date" [10]. It is mainly for these reasons that we have used these two databases in our experiments.

In the Cohn–Kanade facial expression database [17], the image data consist of approximately 500 frame sequences from about 100 different subjects. The ages of the included subjects range from 18 to 30 years; 65% of them were female, 15% were African–American and 3% Asian or Latino.

Fig. 5. Two sample expression sequences from the Cohn–Kanade database.

Fig. 6. Two sample expression sequences from the MMI database.

Sequences of frontal images representing posed facial expressions always start with the neutral face and finish with the expression at its highest intensity or apex (the corresponding frames are incrementally numbered). These sequences were captured with a video camera at 12 frames/s in VGA resolution (640×480), with 8-bit grayscale precision, and stored in JPEG format.

To demonstrate the robustness of our methods, we have also tested them on a more complex dataset, the MMI database [18]. It contains videos of facial behavior recordings including both posed and spontaneous expressions (with possible head movements). The frames have natural lighting, but some samples present a variable background with non-controlled illumination. This database is composed of 52 different subjects of both sexes (48% female), aged from 19 to 62 years, with either a European, Asian or South American ethnicity. It includes more pronounced subject-to-subject variations than the Cohn–Kanade database (e.g., people wearing glasses versus those not wearing glasses). Participants were trained to display 79 series of facial expressions, which include the six prototypic emotions. Image sequences have neutral faces at the beginning and the end (for our experiments, we extracted the corresponding subsequence from the neutral face to the expression apex), and they are digitized at 720×576 spatial resolution. As the original data in this dataset are color images, we converted them to 8-bit grayscale images for our experiments. The average number of frames per sequence was 18.

Figs. 5 and 6, respectively, show some frames corresponding to two different sequences from each of the considered databases.



Fig. 7. Software tool visual interface.


As can be noticed, the second database presents larger variability regarding partial face occlusions (e.g., the presence of glasses, beards or caps).

Most of the components of the proposed expression recognition system were programmed in MATLAB on a Pentium 4 at 2.2 GHz with 1 GB of RAM. Fig. 7 presents the interface of the developed tool. The application has three main components. In the left column, we can select the point tracking method, follow the corresponding steps for the specific optical flow application, and assign the corresponding expression label to the displayed face. The central column shows the analyzed video sequence (both reduced and magnified) and also displays the displacements of the tracked points for both optical flow methods. The right part of the interface provides some auxiliary functionality.

Next, we describe the experiments carried out on the Cohn–Kanade database and then those on the MMI database.

3.2. Experimentation on the Cohn–Kanade database

In the Cohn–Kanade database, there is a considerable number of subjects for which the number of expression sequences is very reduced (i.e. only two expressions per subject). When some of the classes are represented by significantly fewer instances than the others, the learning performance of classification algorithms is hindered. This situation has been described as the class imbalance problem in Ertekin et al. [42]. Due to this, in our experiments we have considered the same number of samples per class, both in the training and in the test sets. A total of 246 expression sequences were selected from the database. These sequences came from 41 random subjects with the six emotions per subject.
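A minimal sketch of this balancing step (a hypothetical helper of our own, not code from the paper) could select an equal number of random sequences per expression class as follows.

```python
import numpy as np

def balanced_subset(labels, per_class, seed=0):
    """Return indices selecting `per_class` random samples for each expression
    label, so every class is equally represented (class-imbalance mitigation)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        keep.extend(rng.choice(idx, size=per_class, replace=False))
    return np.sort(np.array(keep))
```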

To compare the feature point tracking (FPT) and the dense flow point tracking (DFT) methods, several types of experiments were carried out. First, a global comparison between the methods at the considered image resolutions (1:1 and 1:0.4, respectively) is presented in Experiment 1. For this, we also applied a k-fold cross-validation procedure and report the corresponding results. Next, in Experiment 2 we study the discriminant power of the FPT method to recognize the different expressions using only the points belonging to one facial region (or a subset of regions). The global success recognition rates for each of the six prototypic expressions are shown in Experiment 3. Finally, we also include a comparison with other related works using optical flow-based approaches on this same database.

3.2.1. Experiment 1: global comparison of the four proposed optical flow-based methods

By considering the two optical flow point tracking methods and the two different spatial image resolutions, four methods have been designed:

• Method 1: Feature point tracking using the original image resolution of the Cohn–Kanade database (FPT 1:1).
• Method 2: Feature point tracking reducing the original image resolution to 40% (FPT 1:0.4).
• Method 3: Dense flow tracking using the original resolution (DFT 1:1).
• Method 4: Dense flow tracking reducing the original image resolution to 40% (DFT 1:0.4).

As pointed out above, a total of 246 image sequences of facial expressions were used for experimentation. The SVMTorch tool was trained using three common types of kernels (polynomial, Gaussian and sigmoid, respectively), and we adjusted their corresponding parameters by a line search procedure for better performance. The average expression recognition success rates for the whole set of experiments using these kernels were: 57.20% with the polynomial kernel, 77.65% with the Gaussian kernel and 41.67% with the sigmoid kernel, respectively. Consequently, as the best


results were achieved using the Gaussian function, the results presented in this paper on both the Cohn–Kanade and MMI datasets are all obtained with this type of kernel (using different values of the parameter std).

Next, we compare the best recognition results for the four approaches (FPT 1:1, FPT 1:0.4, DFT 1:1 and DFT 1:0.4, respectively) using a test set of 66 random expression sequences. Table 2 presents the best recognition rates for each type of basic emotion using the four methods. The best average recognition results for the six types of facial expressions are presented in the last row of this table. The values of the SVMTorch parameters for the Gaussian kernel, std (standard deviation) and c (trade-off between training error and margin), are also shown for the four methods in Table 2.

The best average facial expression recognition results were achieved with the dense flow tracking method at the original frame resolution (95.45% success). The difference between using this method at the original video resolution and reducing it to 40% of its original size is negligible (95.20% correct recognition). However, applying the Lucas–Kanade algorithm in this second case reduces its computation time by 85% on average (209.29 s for DFT 1:1 and 31.55 s for DFT 1:0.4, respectively). Using the FPT method, the average success recognition rate at the original resolution is 92.42%, compared to 87.37% when reducing the frame resolution to 40%. In the FPT case, reducing the spatial resolution improves the Lucas–Kanade computation time by 60% (124.4 s for FPT 1:1 and 49.77 s for FPT 1:0.4, respectively). The best recognized expression for the DFT 1:1 method is "sadness" with a 100% success rate, and the worst recognized one is "fear" with 90.91% success on our test set. Fig. 8 summarizes the average comparative times of the four methods.


Table 2. Best recognition results (%) obtained by the four methods for each type of expression, with the Gaussian kernel parameters used by each method.

Facial expression   FPT 1:1 (std=300, c=10)   FPT 1:0.4 (std=700)   DFT 1:1 (std=7000, c=100)   DFT 1:0.4 (std=2500, c=100)
Joy                 96.97                     90.91                 93.94                       95.45
Surprise            98.48                     92.42                 96.97                       98.48
Sadness             90.91                     87.88                 100.00                      92.42
Anger               86.36                     78.79                 95.45                       93.94
Disgust             92.42                     90.91                 95.45                       96.97
Fear                89.39                     83.33                 90.91                       93.94
Average             92.42                     87.37                 95.45                       95.20

Fig. 8. Average execution times of the considered optical flow-based methods.

In general, it seems obvious that the proposed dense flow tracking approach is superior in accuracy to the corresponding feature point tracking one, since the number of tracked points is much higher and consequently it uses more data. For a fairer comparison between these two methods, the trade-off between correct expression classification and computational complexity is a better criterion to consider. In that sense, our experimental results show that, for this database, the dense flow tracking method at the original resolution slightly outperforms on average the recognition rates of the feature point tracking method (by only 3%, that is, 95.45% versus 92.42%), but it requires 68.24% more time to track the points of an expression video sequence.

Since the number of samples (i.e. video sequences describing facial expressions) available to train and test our method was not very large, we used k-fold cross-validation [43] to estimate the performance of the model learned from the available data (i.e. to measure the generalization capability of our algorithm). Table 3, which has the same layout as Table 2, presents the average fourfold cross-validation results for our optical flow configurations and each type of expression. It can be observed that the average expression recognition rate decreases slightly (by about 2.6%, except for the FPT 1:0.4 case, for which it remains similar).
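A hedged sketch of such an estimate, again with scikit-learn standing in for SVMTorch and with placeholder kernel parameters, is shown below.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def fourfold_recognition_rate(X, y, std=3000.0, c=100.0):
    """Average recognition rate over a stratified fourfold cross-validation,
    analogous to the estimates reported in Table 3 (illustrative only)."""
    clf = OneVsRestClassifier(SVC(kernel="rbf", gamma=1.0 / (2.0 * std ** 2), C=c))
    folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    return float(np.mean(cross_val_score(clf, X, y, cv=folds)))
```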

3.2.2. Experiment 2: discriminant power of the different facial landmarks

We have analyzed, for the FPT method at 1:1 resolution, how discriminative the different facial regions are. Again, a properly tuned Gaussian kernel for the SVM classifier was used. The aim is to determine how well the six prototypic expressions are recognized using only the points belonging to one facial region or to subsets of several regions.

Fig. 9 presents the global classification results for the considered subsets of points of the facial regions. Of the four groups of facial points belonging to only one face region, the mouth was the one providing the best expression recognition results (on average, 71.21%). This is probably caused by the more differentiated movements of the mouth muscles (and consequently of the corresponding facial points) for each facial expression, as illustrated in Fig. 10. On the other hand, the displacements of the eye points are smaller, and these are poorly discriminated (on average, only 43.94%). In our experiments, the combination of mouth, eyebrow and cheek points provided the best recognition results (an average rate of 81.82%).

Table 3. Results achieved after applying fourfold cross-validation.

Expression   FPT 1:1   FPT 1:0.4   DFT 1:1   DFT 1:0.4
Joy          86.36     84.09       97.73     95.45
Surprise     95.46     90.91       97.73     88.64
Sadness      95.45     90.91       100.00    97.73
Anger        86.36     88.64       88.64     88.64
Disgust      95.45     90.91       95.45     93.18
Fear         81.82     77.28       77.28     79.55
Average      89.77     87.12       92.81     90.53

Table 4. Confusion matrix for the DFT 1:1 method.

Actual \ Predicted   Joy   Surprise   Sadness   Anger   Disgust   Fear

Joy 11 0 0 0 0 0

Surprise 0 11 0 0 0 0

Sadness 0 0 11 0 0 0

Anger 0 1 0 8 2 0

Disgust 0 0 0 0 11 0

Fear 4 1 0 0 1 5

Fig. 9. Discriminant ability of facial regions containing the considered feature points.

Fig. 10. Positions of the four considered mouth feature points for each of the six expressions.

Fig. 11. Discriminant power of respective subsets of the six basic expressions using the DFT 1:1 method.

Table 5. Success rates produced by our FPT 1:1 method compared to two other related feature point-tracking approaches.

Expression   Seyedarabi et al.   Tai and Huang   Our method
Joy          100.00              100.00          96.97
Surprise     96.20               96.66           98.48
Sadness      87.10               89.42           90.91
Anger        89.60               90.06           86.36
Disgust      95.70               96.70           92.42
Fear         81.20               84.00           89.39
Average      91.60               92.80           92.42


3.2.3. Experiment 3: discriminant power of the six prototypic facial expressions

Table 4 shows the corresponding confusion matrix of the six types of expressions using the 66 test expression sequences for the DFT 1:1 method. By analyzing these results according to the type of expression being recognized, we observe that for four emotions ("joy", "surprise", "sadness" and "disgust") the system correctly detected all 11 corresponding test sequences. The "anger" expression is correctly recognized in 8 of the 11 test sequences (72.73%), while the worst results correspond to the "fear" expression, achieving only 45.45% correct recognition in this database (see Fig. 11).

3.2.4. Comparison to other related works on Cohn–Kanade database

In general, it is difficult to fairly compare the results of our facial expression recognition methods with other related works. To delimit the problem, we state as comparative framework those

works using optical flow-based approaches which work on sequences of images (describing the considered six basic expressions) and which use this same database. In this context, a work by Lien et al. [12] also applied feature and dense flow tracking, but their recognition approach was based on the facial action coding system (FACS) to recognize action units (AUs), and they considered the point displacements of the upper face region (above both eyebrows). Their recognition was performed using hidden Markov models (HMMs) as the classification tool. They reported only the average expression recognition rates for feature tracking (85%) and for dense flow tracking (93%), respectively.

More recently, Seyedarabi et al. [30] and Tai and Huang [23] have presented optical flow algorithms for tracking facial feature points on the Cohn–Kanade database. In the work by Seyedarabi et al., a set of 21 facial points was tracked using cross-correlation optical flow, and the extracted feature vectors were used to classify the expressions with both RBF neural networks and fuzzy inference systems (FIS). The best results were achieved with the RBF classifiers, which produced a 91.6% success rate. The method presented by Tai and Huang combines the application of optical flow to 17 facial points with mathematical models that establish distance relations between groups of tracked points. Expressions were classified with an Elman neural network, producing an average expression recognition rate of 92.8%. Table 5 compares the results for each type of basic expression reported by the two mentioned papers with our results (presented in the last column of Table 5). As can be noticed, our average expression recognition result is quite similar to the corresponding ones reported by the other two methods (although we tracked two points fewer than Tai and Huang [23] and six fewer than Seyedarabi et al. [30]). Besides, our approach produced the best recognition rates for three of the six expressions ("surprise", "sadness" and "fear", respectively).

Table 6. Recognition results produced by DFT 1:1 for each basic expression, with and without fourfold cross-validation.

Expression   DFT 1:1 (with cross-validation)   DFT 1:1 (without cross-validation)
Joy          97.73                             100.00
Surprise     78.79                             100.00
Sadness      95.45                             100.00
Anger        70.46                             100.00
Disgust      91.67                             90.91
Fear         62.12                             81.82
Average      82.70                             95.45


3.3. Experimentation on the MMI database

We noticed that some subjects in the MMI database moved their heads when producing the corresponding facial expressions (e.g., some people raised their heads in the "surprise" expression). This spurious head movement affects the correct classification of expressions. When optical flow approaches are applied directly to these MMI sequences, the head motion is also included in the facial point displacements caused by the expression. As stated in Lien et al. [44], completely removing the effects of head movement from the video sequence would be very difficult. If this movement is small and more or less uniform, a solution is to apply an affine or a perspective transformation which aligns the images so that face position and orientation are kept relatively constant across the subjects, and these factors do not significantly affect the feature extraction of the expressions. We applied the perspective transformation proposed in Lien et al. [44] for this purpose. In a related context, Chaudhry [45] has also considered, for this purpose, several points from different facial regions which are invariant to expressions in order to reduce the head motion. Once this effect is diminished, the normalized local motions of the considered points produced by the facial expressions (as explained in Section 2) are used as features for classification.
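The sketch below only approximates that idea: each frame is warped so that a few (approximately) expression-invariant points, e.g. the eye inner corners and the nose tip, return to their positions in the neutral first frame. OpenCV's homography estimation is used here as a stand-in for the specific perspective transformation of Lien et al. [44].

```python
import numpy as np
import cv2

def compensate_head_motion(frame, stable_pts_neutral, stable_pts_current):
    """Warp `frame` so that expression-invariant points (eye inner corners,
    nose tip, ...) return to their neutral-frame positions, reducing the
    contribution of global head motion to the optical-flow displacements."""
    ref = np.asarray(stable_pts_neutral, np.float32)
    cur = np.asarray(stable_pts_current, np.float32)
    if len(ref) == 4:
        M = cv2.getPerspectiveTransform(cur, ref)                  # exact with 4 points
    else:
        M, _mask = cv2.findHomography(cur, ref, cv2.RANSAC, 3.0)   # least-squares fit
    h, w = frame.shape[:2]
    return cv2.warpPerspective(frame, M, (w, h))
```

After this compensation, the displacement features of Section 2 can be extracted from the warped frames as usual.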

For the sake of brevity, in this subsection we only present some results for the dense flow tracking method at the original resolution (DFT 1:1), since it produced the highest average recognition rates for the set of expressions. Again, since the number of considered instances was not very large, a k-fold cross-validation approach was also applied (good generalization results were achieved for k=4). In our experiments on the MMI database, a total of 96 expression sequences were selected. These sequences come from 16 random subjects with 6 emotions per subject. We also considered the same number of patterns per class, due to the class imbalance problem [42]. As for the Cohn–Kanade database, the criterion followed to select the instances for training and testing was to choose random sequences (which could be labeled as one of the 6 basic emotions) while keeping approximately the same number of patterns per class in both the training and testing subsets. The recognition results on the MMI dataset produced by the DFT 1:1 method with and without cross-validation are presented in Table 6.

We also tested on the MMI database how the parameter std of the Gaussian kernel influenced the average expression recognition error for the DFT method (without cross-validation). Fig. 12 shows that for values of std in the range [3000, 7000], the average expression recognition error was reduced to 4.65% (Table 6).

Fig. 12. Influence of the parameter std on the average expression recognition error for DFT 1:1 (without cross-validation).

4. Conclusion and future work

We have developed and analyzed a semi-automatic facial expression classification system that uses sequences of frames describing the dynamics of the expressions. Two types of optical flow point-tracking approaches were compared: the first is featural and is based on the tracking of a reduced set of distinguishable facial points, while the second is holistic and is based on the tracking of a more numerous set of points that are regularly distributed in the central region of the face. These approaches have been called feature point tracking and dense flow tracking, respectively, and they were tested at two different frame resolutions (the original one and the 40%-reduced one). Experiments were carried out on two of the most relevant databases for facial expression analysis: the Cohn–Kanade and the MMI databases. From our tests, we can conclude that the dense optical flow method using an SVM as classifier produced recognition results on the Cohn–Kanade database only around 3% better than the equivalent feature point tracking approach (95.45% versus 92.42%). An identical recognition result was achieved using DFT at the original frame resolution on the MMI database but, in this case, it was necessary to include an additional pre-processing step for those expression sequences where the global head motion was more pronounced. In general, the DFT method also offered two additional advantages: similar recognition results for the two considered frame resolutions and fewer points to be marked manually. In particular, in the DFT method we only need to mark 5 points for the pre-processing stage (or 6 when the global head movement needs to be corrected), while the FPT method requires marking 17 points per face: 2 for pre-processing (both eye inner corners) and the 15 points to be tracked. However, a natural disadvantage of the DFT method is its much higher processing time, especially when working at the original frame resolution (68.24% more average time to track the points of an expression video sequence for the Cohn–Kanade database).

As future work, a foreseen improvement towards a fully automatic video-based facial expression recognition system is the automatic detection of the considered pre-processing and feature points in the first frame of the sequence. Another research line is to adapt our system to classify additional types of expressions (e.g., "worry", "boredom", etc.) as well as different degrees of intensity of each basic expression.

Acknowledgements

This research has been partially supported by the Spanish projects TIN2008-06890-C02-02 and URJC-CM-2008-CET-3625.


The authors wish to thank J.F. Cohn for supplying us with the Cohn–Kanade Facial Expression database.

References

[1] B. Fasel, J. Luttin, Automatic facial expression analysis: a survey, Pattern Recognition 36 (1) (2003) 259–275.
[2] B.V. Kumar, Face expression recognition and analysis: the state of the art, Internal Report, Columbia University, USA, 2009.
[3] Y. Yacoob, L.S. Davis, Recognizing human facial expression from long image sequences using optical flow, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (6) (1996) 636–642.
[4] M. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, Coding facial expressions with Gabor wavelets, in: Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (FG'98), 1998, pp. 200–205.
[5] Y. Zhu, L.C. de Silva, C.C. Ko, Using moment invariants and HMM in facial expression recognition, Pattern Recognition Letters 23 (1–3) (2002) 83–91.
[6] G. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6) (2007) 915–928.
[7] M. Pantic, L.M. Rothkrantz, Automatic analysis of facial expressions: the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1424–1445.
[8] C.C. Chibelushi, F. Bourel, Facial expression recognition: a brief tutorial overview, CVonline: On-Line Compendium of Computer Vision, 2003.
[9] S. Krinidis, I. Buciu, I. Pitas, Facial expression analysis and synthesis: a survey, in: Proceedings of HCI'03, vol. 4, 2003, pp. 1432–1436.
[10] Z. Zeng, M. Pantic, G.I. Roisman, T.S. Huang, A survey of affect recognition methods: audio, visual, and spontaneous expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (1) (2009) 39–58.
[11] P. Ekman, W. Friesen, The Facial Action Coding System: A Technique for the Measurement of Facial Movement, Consulting Psychologists Press, 1978.
[12] J.J. Lien, T. Kanade, J.F. Cohn, C.C. Li, Automated facial expression recognition based on FACS action units, in: Proceedings of the Third IEEE International Conference on Face and Gesture Recognition, 1998, pp. 390–395.
[13] Y.L. Tian, T. Kanade, J. Cohn, Recognizing action units for facial expression analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2) (2001) 1–19.
[14] M. Lyons, J. Budynek, S. Akamatsu, Automatic classification of single facial images, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (12) (1999).
[15] Z. Wen, T. Huang, Capturing subtle facial motions in 3D face tracking, in: Proceedings of the ICCV, 2003.
[16] G. Littlewort, M.S. Bartlett, I. Fasel, J. Susskind, J. Movellan, Dynamics of facial expression extracted automatically from video, Image and Vision Computing 24 (2006) 615–625.
[17] T. Kanade, J.F. Cohn, Y. Tian, Comprehensive database for facial expression analysis, in: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), 2000, pp. 46–53.
[18] M. Pantic, M.F. Valstar, R. Rademaker, L. Maat, Web-based database for facial expression analysis, in: Proceedings of the 13th ACM International Conference on Multimedia, 2005, pp. 317–321.
[19] E. Douglas-Cowie, R. Cowie, M. Schroder, The description of naturally occurring emotional speech, in: Proceedings of the 15th International Conference of Phonetic Sciences, 2003, pp. 2877–2880.
[20] Y. Tian, T. Kanade, J.F. Cohn, Facial expression analysis, in: S.Z. Li, A.K. Jain (Eds.), Handbook of Face Recognition, Springer, 2004, pp. 247–275.
[21] Y. Zhang, Q. Ji, Facial expression understanding in image sequences using dynamic and active visual information fusion, in: Proceedings of the International Conference on Computer Vision, 2006, pp. 113–118.
[22] J.P. de Knecht, M. van Breukelen, Real time marker tracking, Technical Report, University of Delft, The Netherlands, 2009.
[23] S. Tai, H. Huang, Facial expression recognition in video sequences, in: Lecture Notes in Computer Science, vol. 5553, Springer, 2009, pp. 1026–1033.
[24] I. Essa, A. Pentland, Coding, analysis, interpretation, recognition of facial expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1999) 757–763.
[25] K. Mase, Recognition of facial expression from optical flow, IEICE Transactions 74 (10) (1991) 3474–3483.
[26] G. Shin, J. Chun, Spatio-temporal facial expression recognition using optical flow and HMM, in: R. Lee (Ed.), Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, SCI, vol. 149, Springer, 2008, pp. 27–38.
[27] X. Cao, B. Guo, Real-time tracking and imitation of facial expression, in: Proceedings of the Second International Conference on Image and Graphics, SPIE, vol. 4875, 2002, pp. 910–918.
[28] J. Chen, Y. Laprie, M.O. Berger, A robust lip tracking system for the acoustic to articulatory inversion, in: Proceedings of the 6th IASTED International Conference on Signal and Image Processing, 2004.
[29] MPEG (Moving Pictures Expert Group), International Standard on Coding of Audio-Visual Objects: The MPEG-4 Overview, Technical Report, ISO/IEC 14496, 2002. Available at: http://mpeg.chiariglione.org/standards/mpeg-4/mpeg-4.htm.
[30] H. Seyedarabi, A. Aghagolzadeh, S. Khanmohammadi, Recognition of six basic facial expressions by feature-points tracking using RBF neural network and fuzzy inference system, in: Proceedings of the 2004 IEEE International Conference on Multimedia and Expo, 2004, pp. 1219–1222.
[31] D. Vukadinovic, M. Pantic, Fully automatic facial feature point detection using Gabor feature boosted classifiers, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2005, pp. 1692–1698.
[32] B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the 7th International Joint Conference on Artificial Intelligence, 1981, pp. 674–679.
[33] B. Heisele, P. Ho, T. Poggio, Face recognition with support vector machines: global versus component-based approach, in: Proceedings of the Eighth IEEE International Conference on Computer Vision, vol. 2, 2001, pp. 688–694.
[34] G. Shakhnarovich, B. Moghaddam, Face recognition in subspaces, in: S.Z. Li, A.K. Jain (Eds.), Handbook of Face Recognition, Springer, 2004, pp. 141–168.
[35] C. Shan, S. Gong, P.W. McOwan, Facial expression recognition based on local binary patterns: a comprehensive study, Image and Vision Computing 27 (2009) 803–816.
[36] M. Valstar, M. Pantic, Fully automatic facial action unit detection and temporal analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2006.
[37] M.S. Bartlett, et al., Recognizing facial expressions: machine learning and application to spontaneous behavior, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, vol. 2, 2005, pp. 568–573.
[38] Y. Lee, Y. Lin, G. Wahba, C.C. Li, Multicategory support vector machines, Technical Report, University of Wisconsin, USA, 2001.
[39] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
[40] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[41] R. Collobert, S. Bengio, SVMTorch: support vector machines for large-scale regression problems, Journal of Machine Learning Research 1 (2001) 143–160.
[42] S. Ertekin, J. Huang, C.L. Giles, Active learning for class imbalance problem, in: Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval, 2007, pp. 823–824.
[43] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., John Wiley & Sons, 2001.
[44] J.J. Lien, T. Kanade, J.F. Cohn, C.C. Li, Detection, tracking, and classification of action units in facial expression, Robotics and Autonomous Systems 31 (2000) 131–146.
[45] F.A. Chaudhry, Improvement of Facial Expression Recognition through the Evaluation of Dynamic and Static Features in Video Sequences, Master Thesis, Otto-von-Guericke University Magdeburg, Germany, 2009.

A. Sánchez received his B.E. (1986) and Ph.D. (1990) degrees, both in Computer Science, from the Technical University of Madrid, Spain. He is an Associate Professor in the Department of Computing at University Rey Juan Carlos, Madrid. His current research interests involve Computer Vision, Metaheuristics, Handwriting Recognition and Face Recognition. He is a member of the IEEE Computer Society and the Spanish Association for Pattern Recognition and Image Analysis.

J.V. Ruiz received his M.Sc. in Computer Science in 2007 from Rey Juan Carlos University in Madrid (Spain). He currently works at IE Business School in Madrid. His research interests are Computer Vision and Pattern Recognition.

A.B. Moreno is currently an Associate Professor in the Department of Computing at University Rey Juan Carlos, Madrid (Spain). She received her Physics degree, with a speciality in Automatic Computing, from the Complutense University of Madrid (1993), and her Ph.D. degree from the Polytechnic University of Madrid, Spain (2004). She is a Research Fellow in the Artificial Vision and Biometry Group (GAVAB) of the University Rey Juan Carlos (Spain). Her major research interests include computer vision in general and face recognition, three-dimensional vision and facial expression analysis in particular.


A.S. Montemayor was born in 1975, in Madrid, Spain.

He received his M.S. degree in Applied Physics from Universidad Autónoma de Madrid in 1999 and his Ph.D. degree from Universidad Rey Juan Carlos in 2006. He is currently an Associate Professor at Universidad Rey Juan Carlos and the main leader of the CAPO research line of the GAVAB group. His research interests include soft computing, computer vision, image and video processing and real-time implementations.

J. Hernández was born in 1984 in Bogotá, Colombia. He received his degree in Computer Science in 2008 from Rey Juan Carlos University, where he is currently an Assistant Professor. His research interests include visual tracking problems and computer vision.

J.J. Pantrigo was born in 1975 in Cáceres, Spain. He is

currently an Associate Professor at Universidad Rey Juan Carlos. He received his M.S. degree in Fundamental Physics from Universidad de Extremadura in 1998 and his Ph.D. from Universidad Rey Juan Carlos in 2005. From 1998 to 2002 he worked in the Biomechanics Group at Universidad de Extremadura. His research interests include high-dimensional state tracking problems, computer vision, metaheuristics and hybrid approaches.