Visual learning of texture descriptors for facial expression recognition in thermal imagery


Computer Vision and Image Understanding 106 (2007) 258–269 · doi:10.1016/j.cviu.2006.08.012 · www.elsevier.com/locate/cviu


Benjamín Hernández c, Gustavo Olague a,d,*, Riad Hammoud b, Leonardo Trujillo a, Eva Romero a

a Departamento de Ciencias de la Computación, División de Física Aplicada, Centro de Investigación Científica y de Educación Superior de Ensenada, Ensenada B.C., Mexico
b Delphi Electronics and Safety, Kokomo, IN, USA
c Instituto de Astronomía, Universidad Nacional Autónoma de México, Ensenada B.C., Mexico
d Departamento de Informática, Universidad de Extremadura en Mérida, España

* Corresponding author. Fax: +52 646 1750593. E-mail addresses: [email protected] (G. Olague), [email protected] (R. Hammoud).

Received 6 December 2005; accepted 15 August 2006; available online 4 January 2007

Communicated by James Davis and Riad Hammoud

Abstract

Facial expression recognition is an active research area that finds a potential application in human emotion analysis. This work presents an illumination-independent approach to facial expression recognition based on long wave infrared imagery. In general, facial expression recognition systems are designed for the visible spectrum, which makes the recognition process not robust enough to be deployed in poorly illuminated environments. Common approaches to facial expression recognition from static images are designed around three main parts: (1) region of interest selection, (2) feature extraction, and (3) image classification. Most published articles propose methodologies that solve each of these tasks in a decoupled way. We propose a visual learning approach based on evolutionary computation that solves the first two tasks simultaneously within a single evolving process. The first task consists of selecting a set of suitable regions where feature extraction is performed. The second task consists of tuning the parameters that define the extraction of the gray level co-occurrence matrix used to compute region descriptors, as well as selecting the best subsets of descriptors. The output of these two tasks is used for classification by an SVM committee. A dataset of thermal images with three different expression classes is used to validate the performance. Experimental results show effective classification when compared to a human observer, as well as to a PCA-SVM approach. This paper concludes that: (1) thermal imagery provides relevant information for facial expression recognition (FER), and (2) the developed methodology can serve as an efficient learning mechanism for different types of pattern recognition problems. © 2006 Elsevier Inc. All rights reserved.

Keywords: Facial expression recognition; Evolutionary computation; Co-occurrence matrix; Support vector machine

1. Introduction

Determining what human beings think and feel by simply analyzing their facial gestures is not a trivial task. Human beings require years of constant interaction with many people to build efficient discriminative mental models of how we express our emotions. We are then able to compare our learned models with what we perceive from another person, and infer his or her emotional state. Our models are constructed from a varied set of contextually relevant information. This allows us to build rich mental representations. Nevertheless, generalizing the representations that we build is extremely difficult, leading us to fine-tune our models to the people we interact with the most. The difficult nature of this problem is closely related to the high level of complexity of a human being's psyche and emotional state.

Fig. 1. Flow diagram of the commonly used FER approach.


For us, simply expressing our emotional state with words can lead us to describe a complex blend of different human emotions. Extrapolating this process to human-computer interactive systems, where a man-made system needs to determine a person's emotional state, is an extremely hard and unsolved problem, still beyond current state-of-the-art systems. Furthermore, a comprehensive solution to this problem would have to consider different information channels that provide context-dependent multi-modal signals. Possible useful sources of such information include voice, physiological signal patterns, hand and body movements, and facial expressions. Judging from the previous list, it is evident that visual cues in general, and facial gestures in particular, provide important information for inferring a person's emotional state. This implies that an artificial system will greatly benefit from machine vision techniques that extract relevant visual information from images of a person's face. This problem is known as the facial expression recognition (FER) problem in the computer vision literature.

Current research published on FER systems has focused on the use of a single information channel, specifically still images or video sequences taken with visual spectrum sensors. However, as noted by many researchers [1,2], facial analysis systems relying on visual spectrum information are highly illumination and pose dependent. Recent works [3–6] have shown the possible usefulness of studying FER beyond the visual spectrum. These references suggest that, given the difficulty of the FER problem, new research in this area should look for new and imaginative ways of applying FER systems based on previously unused sources of information, such as infrared signals. Nevertheless, we suggest that an interesting problem arises for human designers who attempt to replicate any human process, e.g., emotion detection, in an AI system. Since the goal of such a system amounts to solving an instance of a more general human process, human designers could be biased in their exploration of the possible solution paths open to them. They might be eager to simulate this process as it is theorized to be carried out by humans. They can also be influenced by their own individual intuitions derived from personal knowledge and experience. This can lead researchers to emphasize the use of information commonly understood to be relevant to us humans. Furthermore, when the final output of the AI system is of primary importance, it should be possible to use any type of information available to it, even if it is strongly suggested that humans do not use this type of information, e.g., thermal signals. On the other hand, most state-of-the-art techniques for FER follow the same basic information processing approach used by many facial analysis systems [7], see Fig. 1. This approach can be divided into three main parts:

• Region of interest selection.
• Feature extraction.
• Image classification.

First, a region of interest (ROI) is established, where feature extraction will be performed. State-of-the-art techniques have used holistic methods, where the ROI is the entire face [8], and modular or facial feature based approaches, where information is extracted from specific facial regions [9–11]. Second, dimensionality reduction of the selected ROIs is done by a feature extraction procedure. Common feature extraction techniques include principal component analysis (PCA) [12,11], Gabor filter analysis [11], and hybrid methods [8,10]. Finally, a trained classifier uses the extracted feature vectors from each ROI to assign the images to one of the training classes. Popular classifiers include neural networks, hidden Markov models and support vector machines. This basic approach has led researchers to concentrate on fine-tuning each of these steps in a decoupled manner, ignoring possibly exploitable dependencies between them. Essentially, a human designer must answer each of the following questions:

• What could be the best set of facial areas that should be analyzed? For example, some researchers have used holistic and facial feature regions to perform feature extraction. However, some questions arise: Are the facial regions that are normally used by researchers sufficient and non-redundant? Furthermore, could other facial regions help in providing additional emotional information?

• How do we determine which features are relevant for emotional classification?

• How should classification be performed on an image after extracting a set of descriptive feature vectors?

In the case of working with visual spectrum images, the answers to these questions might seem evident, and highly efficient solutions have been proposed [7]. On the other hand, infrared signals could be unnatural to humans, which increases the difficulty of correlating this type of information with our own mental process. Answering these questions might not be straightforward. Consequently, the appropriate way in which the infrared information could be applied may not be as evident. We believe that when confronted with such a task, an appropriate and coherent approach is to use modern metaheuristics such as evolutionary computation (EC). EC techniques help guide the emergence of novel and interesting ways of solving non-linear optimization and search problems [13,14]. Our proposed method performs visual learning on two main tasks of the FER problem: the first task is the extraction of relevant facial regions, and the second defines the feature extraction process. Visual learning is the process [15] in which an artificial system autonomously acquires knowledge from training images for solving a given visual task. In this work, visual learning is implemented using the EC paradigm as a learning engine. We believe that using visual learning to develop a solution for the FER problem emulates the spirit of the human learning process. Using EC as a learning engine, we can search a wider space of possible solutions in order to appropriately answer the above-mentioned questions. The system is able to try different ways of implementing the first two steps of the classic approach shown in Fig. 1. Also, due to the reliance on thermal imagery, we believe that visual learning will help in acquiring a deeper understanding of how this information can be utilized.

The remainder of this paper is organized as follows: Section 2 presents a brief discussion of relevant research in the areas of FER and visual learning algorithms. Section 3 gives a general overview of our approach and highlights its main contributions. Section 4 is a brief overview of relevant basic theory. Section 5 gives a technical discussion of the implementation details of our approach, as well as a description of the OTCBVS database [16]. Section 6 describes the experimental results, and finally Section 7 presents conclusions and suggestions for possible future work.

2. Related work

Human facial analysis has been the subject of many research projects in computer vision, primarily due to its implications for the development of real human-computer interactive systems. Much work has been done on developing face detection [17,18] and face recognition techniques [19]. Kong et al. [20] provide a comprehensive and recent review of various facial analysis techniques.

The FER problem has also received much research attention, as can be noted from the survey by Pantic and Rothkrantz [21], and more recently that of Fasel and Luettin [7]. FER is centered around extracting emotional content from visual patterns on a person's face. Techniques for both still images and image sequences have been developed. Facial expression recognition systems that use image sequences or video as input offer great insight into the way in which the human face generates different kinds of gestures. Important work in this area has recently been published, including [22–24]. However, video sequences are not always available to be used as input for the FER system. Consequently, FER systems that use still images are a reliable alternative. A wide variety of different techniques for FER systems that use still images have been proposed [8–11,21].

Despite the high interest in FER research, minimal attention has been given to studying it beyond the visual spectrum. Because of the success of infrared images in other facial analysis problems [1,2,20], their applicability to FER is now gaining interest, as we explain next. Sugimoto et al. [5,6] use predefined ROIs, corresponding to areas surrounding the nose, mouth, cheek and eye regions of the face, using simulated annealing and template matching for appropriate localization. Features are computed by generating differential images between the average "neutral" face and a given test image, followed by a discrete cosine transform. Classification is performed using a backpropagation-trained neural network. Pavlidis et al. [4] use a variant of the classic FER problem: they use high definition thermal imagery to estimate the changes in regional facial blood flow. Their system was used as an anxiety or lie detecting mechanism, to be used similarly to a polygraph without the subject being aware of the test. Facial regions were also hand picked by the system designer. Trujillo et al. [3] propose a local and global automatic feature localization procedure to perform FER in thermal images, using interest point clustering to estimate facial feature localization and PCA for dimensionality reduction. The authors use an SVM committee for image classification. In contrast to the approach proposed in this paper, the above mentioned techniques use a priori assumptions of useful facial regions common in most FER systems. We propose to modify the traditional method due to the limited amount of knowledge on how thermal imagery can be used in the FER problem. Also, due to the limited number of related works, combined with the minimal amount of testing results and the use of different image datasets, direct comparison between reported methods is not straightforward. For example, the work presented by Pavlidis et al. is not comparable due to the kind of information being extracted. Only our previous work uses the same dataset. Therefore, we decided to compare against our PCA-SVM approach, as well as against classification performed by a group of human observers.

As previously mentioned, our method relies on a visual learning approach using EC as a learning and search engine. Recently, applying EC-based learning algorithms to pattern recognition problems has produced promising and competitive results. EC techniques, when compared to hand coded techniques, show an unpredictable structure that would be difficult for a human mind to design. Approaching a visual task in this way can provide new insights and a greater understanding of the problem domain. These techniques are commonly used as a feature extraction process. Feature extraction involves defining, constructing or synthesizing an operator to extract low-dimensional image features, and selecting a subset of these features for classification. At the feature selection level, these methods start with a set of possibly useful features that define the search space for the learning algorithm. The algorithm tries out different combinations of these features, and its performance is evaluated by classification accuracy. For example, Bala et al. [25] use a genetic algorithm to perform feature selection for recognizing visual concepts. Sun et al. [17] perform PCA on two different object classes, and use a GA to select the best subset of eigenimages for object recognition. Interestingly, the subset chosen by the learning algorithm shows that a high corresponding eigenvalue is neither necessary nor sufficient for a given eigenimage to be useful for classification. Viola and Jones [18] use AdaBoost learning to find the best subset of their proposed integral image features to perform face detection.

A machine learning algorithm is also useful for constructing or defining an operator that extracts relevant image features specific to classification problems. Howard et al. [26] use genetic programming (GP) to generate image features for target detection in SAR images. Lin and Bhanu [27] also do object detection in SAR images using GP in a cooperative coevolutionary framework. By starting from domain knowledge based features, what they call primitive features, as a terminal set, the GP can synthesize novel features that yield a high detection rate. Zhang et al. [28] use GP to perform multiclass detection of small objects present in large images. This work uses domain independent pixel statistics as the GP terminal set, and shows how a single evolved program solves both the object detection and localization problems.

Visual learning has also influenced the field of facial analysis. Teller and Veloso [29] study face recognition using a variant of GP that they call the PADO learning architecture. A very important work in learning for facial analysis is that of Viola and Jones [18]: their approach shows an outstanding face detection rate using simple-to-compute features and an efficient classification scheme. Silapachote et al. [30] also use AdaBoost learning to select appropriate derivative and Gabor filter based features for FER. Yu and Bhanu [31] use Gabor filter responses as a primitive set to evolve a genetic program that performs FER.

Fig. 2. Sample images from the OTCBVS dataset of thermal images. Each row corresponds to one of the facial expression classes in the data set.

3. Outline of our approach

This paper introduces a novel visual learning approach to FER, using thermal images as input signals for classification. Our approach works within the same basic framework as state-of-the-art FER systems, see Fig. 1. We apply EC to automatically learn the first two levels of this framework: ROI selection and feature extraction. Feature extraction includes both feature construction and feature selection. Classification is done using a support vector machine (SVM) committee. The features used by our approach are second order statistics computed from the gray level co-occurrence matrix (GLCM). Since our approach is based on these domain independent features, we ensure a portable approach that is useful in other pattern recognition problems.

This work was realized with a set of thermal images of three different facial expressions taken from the OTCBVS dataset [16], see Fig. 2. The basic outline of our approach is depicted in Fig. 3, and proceeds as follows:

(1) Use a GA to perform the search for ROI selection and feature extraction optimization. The GA selects a set $X$ of $n$ facial ROIs.

(2) Simultaneously, the GLCM parameter set $p_{x_i}$ is optimized $\forall x_i \in X$, where $i = 1, \ldots, n$. In this step, the GA performs feature construction by tuning the GLCM parameters.

(3) Feature selection creates a vector $\vec{c}_{x_i} = (b_1, \ldots, b_m)$ of $m$ different descriptors $\forall x_i \in X$, where $\{b_1, \ldots, b_m\} \subseteq W$ and $W$ is the set of all available GLCM-based descriptors. Each $b_j \in \vec{c}_{x_i}$ represents the mean value of the $j$th descriptor within region $x_i$. The compound set $C = \{\vec{c}_{x_i}\}$ is then used for classification. By evolving both ROI selection and feature extraction within the same GA, we formulate a coupled solution to the first two levels of the basic FER approach.

(4) A support vector machine $\phi_i$ is trained for each $x_i$, and k-fold cross-validation is performed to estimate the classifier's accuracy. The set $\Phi = \{\phi_i\}$ of all trained SVMs represents a committee that performs FER using a voting scheme. The total accuracy of the SVM committee (SVMC) is used as the fitness function for the GA.


Fig. 3. Flowchart diagram of our approach to visual learning: each individual in a population of N individuals encodes ROI selection, GLCM parameters and descriptor selection; feature extraction feeds an SVM committee, and the committee's N-fold cross-validation accuracy is the individual's fitness.


3.1. Research contributions

The main contributions of our approach are summarized as follows:

(1) We extended the study of FER beyond the visual spectrum using images in the thermal spectrum.

(2) This paper presents a novel representation of the FER problem, using closed-loop visual learning with evolutionary computation.

(3) This article presents a coupled approach to ROI selection and feature extraction.

(4) Our approach uses domain independent second order statistics, establishing a portable solution applicable to different pattern recognition problems.

4. Basic theory

This section gives a brief introduction to some relevant theory. First, evolutionary computation (EC) is discussed, focusing on its two main applications: search and optimization. Next, we present an overview of the GLCM and its role in texture analysis. Finally, a brief introduction to SVM classification is given.

4.1. Evolutionary computation

Early in the 1960s, some researchers came to the conclusion that classical AI approaches were an inadequate tool for comprehending complex and adaptive systems. Artificial systems based on large knowledge bases and predicate logic are extremely hard to manage and have a very rigid structure. This is one of the reasons why researchers looked elsewhere for solutions to complex non-linear problems that were difficult to decompose, understand and even define. The area of soft computing emerged as a paradigm to provide solutions in areas of AI where classic techniques had failed. EC is a major field in this relatively new area of research. EC is a paradigm for developing emergent solutions to different types of problems. EC-based algorithms emulate the adaptive and emergent properties found in naturally evolving systems. These types of algorithms have a surprising ability to search, optimize and emulate learning in computer-based systems using concepts such as populations, individuals, fitness and reproduction. The field of EC owes much of its current status as a viable tool for building artificial solutions to the pioneering work of John Holland [13] and his students. Holland presented the first major type of EC algorithm, the genetic plan, later to be known as the genetic algorithm (GA). A GA is a simplified model of the way in which populations of living organisms are guided by the pressure of natural selection to find optimal peaks within the species' fitness landscape [14]. In a GA, a population is formed by a set of possible solutions to a given problem. Each individual in the population is represented by a coded string or chromosome, which represents the individual's genotype. Using genetic operations that simulate natural reproduction and mutation, a new generation of child solutions is generated. The performance of each new solution is evaluated using a problem dependent fitness function, which places each individual on the problem's fitness landscape. The GA selects the best individuals for survival across successive generations using the fitness evaluation. When the evolutionary process terminates, the best individual found according to the fitness measure is returned as the solution.

4.2. Texture analysis and the gray level co-occurrence matrix

Image texture analysis has been a major research area in the field of computer vision since the 1970s. Researchers have developed different techniques and operators that describe image texture. Those approaches were driven by the hope of automating the human visual ability that seamlessly identifies and recognizes texture information in a scene. Popular techniques for describing texture information include filter banks [32], random fields [33], primitive texture elements known as textons [34], and more recently texture representations based on sparse local affine regions [35]. Historically, the most commonly used methods for describing texture information are the statistical approaches. First order statistical methods use the probability distribution of image intensities, approximated by the image histogram. With such statistics, it is possible to extract descriptors that summarize image information; first order descriptors include entropy, kurtosis and energy, to name but a few. Second order statistical methods represent the joint probability density of the intensity values (gray levels) of two pixels separated by a given vector $\vec{V}$. This information is coded in the gray level co-occurrence matrix (GLCM) $M(i,j)$ [36]. Statistical information derived from the GLCM has shown reliable performance in tasks such as image classification [37] and content based image retrieval [38,39].

Formally, the GLCM $M_{i,j}(p)$ defines a joint probability density function $f(i, j \mid \vec{V}, p)$, where $i$ and $j$ are the gray levels of two pixels separated by a vector $\vec{V}$, and $p = \{\vec{V}, R\}$ is the parameter set for $M_{i,j}(p)$. The GLCM identifies how often two pixels that define a vector $\vec{V}(d, \theta)$, and differ by a certain intensity value $\Delta = i - j$, appear in a region $R$ of a given image $I$, where $\vec{V}$ defines the distance $d$ and orientation $\theta$ between the two pixels. The direction of $\vec{V}$ may or may not be taken into account when computing the GLCM.
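To make the construction concrete, the following minimal NumPy sketch builds a GLCM for a rectangular region given $(d, \theta)$. The quantization to a fixed number of gray levels and the `symmetric` option (which discards the direction of $\vec{V}$) are illustrative assumptions, not a prescription taken from the paper.

```python
import numpy as np

def glcm(region, d=1, theta=0.0, levels=32, symmetric=True):
    """Count pairs of gray levels (i, j) separated by V(d, theta)
    inside a quantized image region (a 2D uint8 array)."""
    q = np.clip((region.astype(np.float64) / 256.0 * levels).astype(int),
                0, levels - 1)               # quantize to `levels` gray values
    dy = int(round(-d * np.sin(theta)))      # displacement implied by V(d, theta)
    dx = int(round(d * np.cos(theta)))
    M = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                M[q[y, x], q[y2, x2]] += 1   # one more co-occurrence
    if symmetric:                            # ignore the direction of V
        M = M + M.T
    return M                                 # un-normalized counts; Nc = M.sum()
```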

One drawback of the GLCM is that as the number of distinct gray levels in region $R$ increases, the dimensions of the GLCM make it difficult to handle or use directly. Fortunately, the information encoded in the GLCM can be expressed by a varied set of statistically relevant numerical descriptors. This reduces the dimensionality of the information that is extracted from the image using the GLCM. Extracting each descriptor from an image effectively maps the intensity values of each pixel to a new dimension. In this work, the set $W$ of available descriptors [36] extracted from $M(i,j)$ is the following:

• Entropy. A term more commonly found in thermodynamics or statistical mechanics, entropy is a measure of the level of disorder in a system. Images of highly homogeneous scenes have a high associated entropy, while inhomogeneous scenes pose a low entropy measure. The GLCM entropy is obtained with the following expression:

$$H = 1 - \frac{1}{N_c \ln(N_c)} \sum_i \sum_j M(i,j)\,\ln(M(i,j)) \cdot 1_{M(i,j)} \qquad (1)$$

where $1_{M(i,j)} = 0$ when $M(i,j) = 0$ and 1 otherwise.

• Contrast. A measure of the difference between the intensity values of neighboring pixels; it favors contributions from pixels located away from the diagonal of $M(i,j)$.

$$C = \frac{1}{N_c (L-1)^2} \sum_{k=1}^{L-1} k^2 \sum_{|i-j|=k} M(i,j) \qquad (2)$$

where $N_c$ is the number of occurrences and $L$ is the number of gray levels.

• Homogeneity. This gives a measure of how uniformly a given region is structured, with respect to its gray level variations.

$$H_o = \frac{1}{N_c^2} \sum_i \sum_j M(i,j)^2 \qquad (3)$$

• Local homogeneity. This measure provides the homogeneity of the image using a weighting factor that gives small values for non-homogeneous images when $i \neq j$.

$$G = \frac{1}{N_c} \sum_i \sum_j \frac{M(i,j)}{1 + (i-j)^2} \qquad (4)$$

• Directivity. This measure provides a larger value when two pixel regions with the same gray values are separated by a translation.

$$D = \frac{1}{N_c} \sum_i M(i,i) \qquad (5)$$

• Uniformity. This is a measure of the uniformity of each gray level.

$$U = \frac{1}{N_c^2} \sum_i M(i,i)^2 \qquad (6)$$

• Moments. Moments express common statistical information, such as the variance, which corresponds to the second moment. This descriptor increases when the majority of the values of $M(i,j)$ are not on the diagonal.

$$Mom_k = \sum_i \sum_j (i-j)^k M(i,j) \qquad (7)$$

• Inverse moments. This produces the opposite effect compared to the previous descriptor.

$$Mom_k^{-1} = \sum_i \sum_j \frac{M(i,j)}{(i-j)^k}, \quad i \neq j \qquad (8)$$

• Maximum probability. Considering the GLCM as an approximation of the joint probability density between pixels, this operator extracts the most probable difference between pixel gray scale values.

$$\max_{i,j}(M(i,j)) \qquad (9)$$

• Correlation. This is a measure of gray scale linear dependencies between pixels at the specified positions relative to each other.

$$S = \frac{1}{N_c\,\sigma_x \sigma_y} \left| \sum_i \sum_j (i - m_x)(j - m_y)\,M(i,j) \right| \qquad (10)$$

where

$$m_x = \frac{1}{N_c} \sum_i \sum_j i\,M(i,j), \qquad m_y = \frac{1}{N_c} \sum_i \sum_j j\,M(i,j),$$

$$\sigma_x^2 = \frac{1}{N_c} \sum_i \sum_j (i - m_x)^2\,M(i,j), \qquad \sigma_y^2 = \frac{1}{N_c} \sum_i \sum_j (j - m_y)^2\,M(i,j).$$
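As a sanity check on the formulas above, the following sketch evaluates several of these descriptors from an un-normalized co-occurrence matrix; it pairs with the `glcm` helper sketched earlier in this section, and the subset of descriptors shown is our own illustrative choice.

```python
import numpy as np

def glcm_descriptors(M, L):
    """Evaluate a few of the GLCM descriptors of Eqs. (1)-(10).
    M: un-normalized co-occurrence counts over L gray levels."""
    Nc = M.sum()                       # total number of occurrences
    i, j = np.indices(M.shape)
    nz = M > 0                         # indicator 1_{M(i,j)}
    return {
        # Eq. (1): normalized entropy
        "entropy": 1.0 - (M[nz] * np.log(M[nz])).sum() / (Nc * np.log(Nc)),
        # Eq. (2): the sum over k of k^2 reduces to an |i-j|^2 weighting
        "contrast": ((i - j) ** 2 * M).sum() / (Nc * (L - 1) ** 2),
        # Eqs. (3) and (4): homogeneity and local homogeneity
        "homogeneity": (M ** 2).sum() / Nc ** 2,
        "local_homogeneity": (M / (1.0 + (i - j) ** 2)).sum() / Nc,
        # Eqs. (5) and (6): diagonal (same gray level) terms
        "directivity": np.trace(M) / Nc,
        "uniformity": (np.diag(M) ** 2).sum() / Nc ** 2,
        # Eq. (9): maximum probability
        "max_probability": M.max(),
    }
```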

4.3. Support vector machines

A machine learning algorithm [40] for classification is faced with the task of learning the mapping $x_i \to y_i$ of data vectors $x_i$ to classes $y_i$. The machine is actually defined by a set of possible mappings $x_i \to f(x, \alpha)$, where a particular choice of $\alpha$ generates a particular trained machine. The simplest way to introduce the concept of the SVM is the case of a two class classifier. In this case, an SVM finds the hyperplane that best separates elements from both classes while maximizing the distance from each class to the hyperplane. There are both linear and non-linear approaches to SVM classification. Suppose we have a set of labeled training data $\{x_i, y_i\}$, $i = 1, \ldots, l$, $y_i \in \{-1, +1\}$, $x_i \in \mathbb{R}^d$; then a non-linear SVM defines the discriminative hyperplane by

$$f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \qquad (11)$$

where the $x_i$ are the support vectors, $y_i$ is the corresponding class membership, and $K(x_i, x)$ is the "kernel function". The sign of the output of $f(x)$ indicates the class membership of $x$. Finding this optimal hyperplane implies solving a constrained optimization problem using quadratic programming, where the optimization criterion is the width of the margin between the classes. Extending the basic SVM concept to an $N$-class problem is a straightforward process: train $N$ one-versus-rest classifiers (say, "one" positive, "rest" negative) for the $N$-class case, and take the class of a test point to be the one corresponding to the largest positive distance [40].
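A minimal sketch of the one-versus-rest scheme, using scikit-learn's `SVC` (which wraps libSVM, the library used later in Section 5.1.6); the random data standing in for three expression classes is purely hypothetical:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(92, 35))        # stand-in feature vectors
y = rng.integers(0, 3, size=92)      # stand-in labels for 3 classes

# Train N one-versus-rest RBF-kernel SVMs; the predicted class is the one
# whose classifier reports the largest positive signed distance, as in [40].
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale"))
ovr.fit(X, y)
print(ovr.decision_function(X[:2]))  # one signed distance per class
print(ovr.predict(X[:2]))
```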

5. Technical approach

This section gives a clear and detailed description of our visual learning approach to FER in thermal images.

5.1. Genetic algorithm for visual learning

The input images for the GA are the facial regions extracted using the same face localization procedure described in [3]. Our learning approach accomplishes a combined search and optimization procedure in a single step. The GA searches for the best set $X$ of facial ROIs in each image, and optimizes the feature extraction procedure by tuning the GLCM parameter set $p_i\ \forall x_i \in X$ and selecting the best subset $\{b_1, \ldots, b_m\}$ of mean descriptor values from the set of all possible descriptors $W$, to form a feature vector $\vec{c}_i = (b_1, \ldots, b_m)$ for each $x_i \in X$. With this representation, we tightly couple the ROI selection step with the feature extraction process; in this way, the GA learns the best overall structure for the FER system in a single closed-loop learning scheme. Our approach eliminates the need for a human designer, who would normally combine the ROI selection and feature extraction steps: this is now left up to the learning mechanism. Each possible solution is coded into a single binary string, whose graphical representation is shown in Fig. 4. The entire chromosome consists of 94 binary coded variables, each represented by a binary string of 1 to 5 bits, depending on its use. The chromosome can be better understood by logically dividing it into two main sections: the first encodes the variables for searching the ROIs in the image, and the second is concerned with setting the GLCM parameters and choosing appropriate descriptors for each ROI.

5.1.1. ROI selection

The first part of the chromosome encodes ROI selection. The GA has a hierarchical structure that includes both control and parametric variables. The section of structural or control genes $c_i$ determines the state (on/off) of the corresponding ROI definition blocks $x_i$: each structural gene activates or deactivates one ROI in the image. Each $x_i$ establishes the position, size and dimensions of the corresponding ROI. Each ROI is defined with four degrees of freedom around a rectangular region: height, width, and two coordinates indicating the central pixel. The choice of rectangular regions is not tied in any way to our visual learning algorithm; it is possible to use other types of regions, e.g., elliptical regions, and keep the same overall structure of the GA. The complete structure of this part of the chromosome is coded as follows:

(1) Five structural variables $\{c_1, \ldots, c_5\}$, represented by a single bit each. Each one controls the activation of one ROI definition block; these variables control which ROIs are used in the feature extraction process.

(2) Five ROI definition blocks $x_1, \ldots, x_5$. Each block $x_i$ contains four parametric variables $x_i = \{x_{x_i}, y_{x_i}, h_{x_i}, w_{x_i}\}$, coded into four-bit strings each. These variables define the ROI's center $(x_{x_i}, y_{x_i})$, height $h_{x_i}$ and width $w_{x_i}$. In essence, each $x_i$ establishes the position and dimensions of a particular ROI.

Fig. 4. Visual learning GA problem representation.

5.1.2. Feature extraction

The second part of the solution representation encodes the feature extraction variables for the visual learning algorithm. The first group is defined by the parameter set $p_i$ of the GLCM computed at each image ROI $x_i \in X$. The second group is defined as a string of ten decision variables that activate or deactivate the use of a particular descriptor $b_j \in W$ for each ROI. Since each of these parametric variables is associated with a particular ROI, they are also dependent on the state of the structural variables $c_i$: they only take effect when their corresponding ROI is active (set to 1). Our approach uses domain independent feature extraction based on texture information extracted from the GLCM. This makes the approach unbiased and useful for different visual learning problems [28]. Realistically, this approach can easily be ported to different kinds of pattern recognition problems where relevant image regions and features need to be extracted. The complete structure of this part of the chromosome is as follows:

(1) A parameter set $p_{x_i}$ is coded $\forall x_i \in X$, using three parametric variables. Each $p_{x_i} = \{R_{x_i}, d_{x_i}, \theta_{x_i}\}$ describes the size of the region $R$, the distance $d$ and the direction $\theta$ parameters of the GLCM computed at each $x_i$. Note that $R$ is a GLCM parameter, not to be confused with the ROI definition block $x_i$; see Section 4.2.

(2) Ten decision variables, coded using a single bit each, activate or deactivate a descriptor $b_{j,x_i} \in W$ at a given ROI. These decision variables determine the size of the feature vector $\vec{c}_i$ extracted at each ROI, in order to search for the best combination of GLCM descriptors. In this representation, each $b_{j,x_i}$ represents the mean value of the $j$th descriptor computed at ROI $x_i$.
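Decoding such a chromosome might look as follows. The exact field widths (4 bits per geometry value; 3 + 3 + 2 bits for the GLCM parameter set) are illustrative assumptions consistent with the 1-to-5-bit range stated in Section 5.1, not the paper's exact layout:

```python
import numpy as np

def bits_to_int(bits):
    return int("".join(str(int(b)) for b in bits), 2)

def decode(chromosome):
    """Decode a flat 0/1 array into five ROI blocks: one control bit,
    rectangle geometry, GLCM parameters and a descriptor mask each."""
    blocks, pos = [], 0
    for _ in range(5):
        active = bool(chromosome[pos]); pos += 1            # structural gene c_i
        x, y, h, w = (bits_to_int(chromosome[pos + 4 * k:pos + 4 * (k + 1)])
                      for k in range(4))                    # ROI center and size
        pos += 16
        R = bits_to_int(chromosome[pos:pos + 3]); pos += 3  # GLCM region size
        d = bits_to_int(chromosome[pos:pos + 3]) + 1; pos += 3  # pixel distance d
        t = bits_to_int(chromosome[pos:pos + 2]); pos += 2  # orientation index
        mask = list(chromosome[pos:pos + 10]); pos += 10    # ten descriptor bits
        blocks.append(dict(active=active, roi=(x, y, h, w),
                           glcm=(R, d, t * np.pi / 4), mask=mask))
    return blocks

chrom = np.random.default_rng(1).integers(0, 2, size=5 * 35)
print(decode(chrom)[0])
```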

5.1.3. Classification

Since our problem amounts to classifying every extracted region $x_i$, we implement an SVM committee that uses a voting scheme for classification. The SVM committee $\Phi$ is formed by the set of all trained SVMs $\{\phi_i\}$, one for each $x_i$. The compound feature set $C = \{\vec{c}_{x_i}\}$ is fed to the SVM committee $\Phi$, where each $\vec{c}_{x_i}$ is the input to the corresponding $\phi_i$. The SVM committee uses voting to determine the class of the corresponding image.
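The voting step reduces to a majority count over the per-ROI classifiers. A minimal sketch, assuming classifier objects with a scikit-learn-style `predict` method:

```python
from collections import Counter

def committee_predict(svms, feature_vectors):
    """Majority vote of the SVM committee: svms[i] classifies the mean
    descriptor vector of ROI i; the most common label wins."""
    votes = [svm.predict([fv])[0] for svm, fv in zip(svms, feature_vectors)]
    return Counter(votes).most_common(1)[0][0]
```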

5.1.4. Fitness evaluation

The goal of our approach is to find the best possible ROIs and feature extraction process with which to perform proper FER. Because of this, the fitness function is mostly biased towards the classification accuracy of each extracted ROI. Nevertheless, if two different solutions have the same classification accuracy, it is preferable to select the one that uses the minimum number of descriptors. This promotes compactness in our representation and improves computational performance. The fitness function is therefore defined similarly to [17]:

$$fitness = 10^2 \cdot Accuracy + 0.25 \cdot Zeros \qquad (12)$$

where $Accuracy$ is the average accuracy of all SVMs in $\Phi$ for a given individual; in other words, $Accuracy = \frac{1}{|\Phi|} \sum_{\phi_x \in \Phi} Acc_{\phi_x}$, where $Acc_{\phi_x}$ is the accuracy of the SVM $\phi_x$. $Zeros$ is the total number of inactive descriptors in the chromosome, $Zeros = \sum_i \sum_j \bar{b}_{j,x_i}$, where $\bar{b}_{j,x_i} = 1$ when descriptor bit $b_{j,x_i}$ is inactive, with $i = 1, \ldots, 5$ and $j = 1, \ldots, 10$. This formula is based on the work of Sun et al. [17].
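Once the per-ROI cross-validation accuracies are known, Eq. (12) is a direct transcription; the example numbers below are made up:

```python
def fitness(cv_accuracies, descriptor_masks):
    """Eq. (12): accuracy dominates; unused descriptors break ties.
    cv_accuracies: per-ROI SVM cross-validation accuracies in [0, 1].
    descriptor_masks: one list of ten 0/1 descriptor bits per ROI."""
    accuracy = sum(cv_accuracies) / len(cv_accuracies)
    zeros = sum(bit == 0 for mask in descriptor_masks for bit in mask)
    return 100.0 * accuracy + 0.25 * zeros

# Two active ROIs with hypothetical accuracies and descriptor masks:
print(fitness([0.80, 0.74],
              [[1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
               [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]))  # 77.0 + 0.25 * 16 = 81.0
```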

5.1.5. GA runtime parameters

The rest of the fundamental parameters, as well as the GA settings, are defined as follows:

• Population size and initialization: We use random initialization of the initial population. The initial population size was set to 100 individuals.


• Survival method: For population survival we use an elitist strategy: the N parents of generation t − 1 and their offspring are combined, and the best N individuals are selected to form the parent population of generation t.

• Genetic operators: Since we use a binary coded string, we use simple one-point crossover and single-bit binary mutation. The crossover probability was set to 0.66 and the mutation probability to 0.05. Parents were selected with tournament selection, with the tournament size set to 7.
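Wired together, these settings yield the GA loop sketched below. The chromosome length and the stand-in objective are placeholders (a real run would decode each individual and return the Eq. (12) fitness of its SVM committee), but selection, crossover, mutation and elitist survival follow the parameters just listed:

```python
import numpy as np

rng = np.random.default_rng(0)
POP, BITS, P_CROSS, P_MUT, TOURN = 100, 175, 0.66, 0.05, 7  # BITS assumed

def evaluate(ind):
    # Placeholder objective; substitute the Eq. (12) fitness of the
    # decoded individual's SVM committee here.
    return float(ind.sum())

def tournament(pop, fit):
    idx = rng.choice(len(pop), size=TOURN, replace=False)
    return pop[idx[np.argmax(fit[idx])]]

pop = rng.integers(0, 2, size=(POP, BITS))
for gen in range(50):
    fit = np.array([evaluate(ind) for ind in pop])
    kids = []
    while len(kids) < POP:
        a, b = tournament(pop, fit).copy(), tournament(pop, fit).copy()
        if rng.random() < P_CROSS:                     # one-point crossover
            cut = rng.integers(1, BITS)
            a[cut:], b[cut:] = b[cut:].copy(), a[cut:].copy()
        for child in (a, b):                           # single-bit mutation
            child[rng.random(BITS) < P_MUT] ^= 1
            kids.append(child)
    merged = np.vstack([pop, kids[:POP]])              # elitist survival:
    mfit = np.array([evaluate(ind) for ind in merged]) # keep the best N of
    pop = merged[np.argsort(mfit)[-POP:]]              # parents + offspring
print("best fitness:", evaluate(pop[-1]))
```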

5.1.6. SVM training parameters

SVM implementation was done using libSVM [41], an open source C++ library. For every $\phi \in \Phi$, the parameter setting is the same across the entire population. The SVM parameters are:

• Kernel type: A radial basis function (RBF) kernel was used, given by:

$$k(x, x_i) = \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right) \qquad (13)$$

The RBF kernel shows a greater performance rate for classifying non-linear problems than other types of kernels.

• Training set: The training set used was extracted from 92different images (see Section 3).

• Cross-validation: In order to compute the accuracy of each SVM, we perform k-fold cross-validation with k = 6. Due to the small size of our data set, the accuracy computed with cross-validation will outperform any other type of validation approach [42]. In k-fold cross-validation, the data is divided into k subsets of (approximately) equal size. The SVM is trained k times, each time leaving one of the subsets out of training and using only the omitted subset to compute the classifier's accuracy. This process is repeated until all subsets have been used for both testing and training, and the computed average accuracy is used as the performance measure for the SVM.
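With scikit-learn, this 6-fold procedure for a single committee member is a one-liner; the random features below are only a hypothetical stand-in for the 92 training/validation vectors:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(92, 35))      # stand-in feature vectors
y = rng.integers(0, 3, size=92)    # stand-in expression labels

# 6-fold cross-validation accuracy of one committee member (Section 5.1.6).
scores = cross_val_score(SVC(kernel="rbf", gamma="scale"), X, y, cv=6)
print("fold accuracies:", scores, "mean:", scores.mean())
```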

6. Experimental results

This section has two objectives. First, Section 6.1 provides a brief description of the database used to test our approach. Second, Section 6.2 gives experimental results and comparisons. To put the obtained results in context, we compare our approach with classification done by human observers, and with the results obtained in our previous work [3].

6.1. OTCBVS data set of thermal images

The OTCBVS dataset [16] contains pictures of 30 different subjects taken at the UT/IRIS Lab.¹ Each subject poses three different facial emotions; this is used as the FER classification ground truth given by the dataset designers. It is apparent that the ground truth for each image is taken from the emotion that the subject "claims" to be expressing, and not from what human observers "believe" he is feeling. This is a critical point, and could be considered a shortcoming of the dataset if subjects are only "acting" and are not sincerely "feeling" a given emotion. This fact is evident in Fig. 6, and we explore its possible consequences in the following section.

¹ http://imaging.utk.edu/.

The images were taken from 12 different viewing angles. Each image in the database is in RGB bitmap format with a size of 320 × 240 pixels. The database contains subjects wearing glasses in some of the pictures. Although our previous work [3] showed invariance to people wearing glasses, our current work is not concerned with achieving this goal. For the purposes of our work, we were interested in extracting the best frontal views to characterize each expression class. We selected from the data set 33 surprise, 26 happy and 33 angry images, bringing the total to 92 training/validation images to be used by our learning algorithm. We pre-extracted the main facial area of each image using the same technique for automatic face localization described in [3], see Fig. 5. This area was cropped and resized to a 32 × 32 gray level bitmap. Fig. 2 shows sample images before cropping, corresponding to three different subjects. Each row corresponds to a different expression class, and three different poses are shown for each facial expression.

6.2. Approach evaluation

Before describing the performance of our approach, we need to address one aspect of the reported results. Because of the limited number of images that are useful for evaluating our approach, cross-validation accuracy is computed. As mentioned in Section 5, machine learning techniques that work with a limited set of training images use cross-validation in order to avoid overfitting and to obtain statistically relevant results. This is due to the fact that when cross-validation is performed, we are assured that all images are used to train and validate the classifier. The validation step tests the classifier on a subset of the image dataset not used in the training process. We performed 10 different runs of the visual learning GA. For compactness, we describe the results for the best of the 10 completed experiments.

Fig. 5. Extracting the prominent facial region and resizing to 32 × 32 pixels.

Fig. 6. All images in the figure appear to be "Happy" but are not labeled as such in the data set (the panels are labeled ANGRY?, SURPRISED? and HAPPY?).

Table 3. Human classification of thermal images

           SURPRISE   HAPPY   ANGRY
SURPRISE   56%        33%     11%
HAPPY      23%        48%     29%
ANGRY      7%         14%     79%

Table 4. Human classification of visual spectrum images

           SURPRISE   HAPPY   ANGRY
SURPRISE   78%        21%     1%
HAPPY      16%        82%     2%
ANGRY      5%         25%     70%

Fig. 7. The best individual selects four facial regions. Each region is shown with a white bounding box.


The super-individual, or fittest individual obtained, showed a cross-validation classification accuracy of 77%. The best individual extracts four different image regions, with some overlapping areas, shown in Fig. 7. The complete confusion matrix of the cross-validation performance of our approach can be seen in Table 1. We can see how classification performance degrades when classifying the HAPPY class. This is caused by two main reasons: first, the overlap between the expressions modeled by the human subjects, see Fig. 6; and second, the questionable way in which the dataset providers labeled the images. These two points are examples of how subjects who only "act" as if they experience a given emotion, and do not sincerely "feel" it, can produce noisy data when great care is not taken by a dataset designer.

Because of the lack of training data, only 10 additional images that were not used during training are used for testing. The algorithm was able to classify 8 of the 10 images correctly. The confusion matrix is shown in Table 2. A statistical experiment was conducted with human subjects in order to validate the fact that the dataset used establishes an extremely hard classification problem. We randomly selected 30 images from the training set, and 100 people were asked to classify them. The experiment was repeated with the corresponding visual spectrum images, in order to contrast the increased difficulty of the problem when posed with thermal imagery. This type of experimentation is also reported in [5], which is evidence of how the lack of published results and comparable datasets makes system comparisons difficult. The results for the thermal and visual image experiments are shown in Tables 3 and 4, respectively.

Table 1. Confusion matrix for cross-validation

           SURPRISE   HAPPY   ANGRY
SURPRISE   77%        20%     3%
HAPPY      12%        70%     18%
ANGRY      0%         16%     84%

Table 2. Confusion matrix of our test results

           SURPRISE   HAPPY   ANGRY
SURPRISE   4          0       0
HAPPY      0          2       1
ANGRY      0          1       2

The human classification experiments establish the extreme difficulty of the FER problem posed in this paper. The proposed approach outperforms human classification of thermal images and is competitive with human classification of visual spectrum images, a very promising result. We suggest that the low classification performance of visual spectrum images by human observers could be attributed to two primary factors: (1) cultural differences between the subjects and the people classifying them, and (2) the laxness with which the dataset was designed and the ground truth was established.

Our previous work [3] presented an approach based on principal component analysis to perform feature extraction, while ROI selection was done based on facial features [3]. The main contributions of that work were the automatic process it used for facial feature localization and the way an SVM committee with a biased voting scheme classified testing images. The classification accuracy achieved in [3] for 30 testing images was 76.6%, which is comparable to the cross-validation and testing accuracy presented in this paper. Moreover, the previous method utilized a total of 50 eigenfeatures per region, a total that exceeds the reduced dimensionality of the feature vectors used in our proposed approach; in the example of Fig. 7, a total of 35 GLCM features were used for the recognition process. We have also tested our evolutionary approach on the Equinox database,² obtaining 62% recognition. This database provides images classified according to long, short, and medium wavelengths. It is important to mention that the images classified as short seem to correspond qualitatively to the OTCBVS images, which are long wave images. We believe that the lower quality of these results should be explored in the future, including the image acquisition process, in order to eliminate the errors that we report in this paper.

² http://www.equinoxsensors.com/products/HID.html.

7. Discussion and conclusions

This paper proposes a novel approach to solving the FER problem in thermal images. Visual learning is performed on a thermal image dataset that contains three different facial expression classes. The proposed technique performs visual learning with EC in order to solve two of the three main tasks found in common FER approaches: ROI selection and feature extraction. Since we use a single learning algorithm for both tasks, learning is performed in a single parallel process: we tightly couple the solution of both tasks in order to exploit the dependencies between them. This is not commonly taken into account in most published techniques in the FER literature. The final classification task is carried out using an SVM committee with a voting scheme. The SVM committee adds robustness to the classification procedure, illustrated by an increase in classification accuracy when compared to the accuracy obtained with isolated image regions.

Experimental results show that the proposed algorithm is capable of efficient facial expression recognition on an extremely hard problem. The visual learning cross-validation accuracy outperforms human classification by an average of 16%, and shows promising results on a small but characteristic testing set. Even though we present good results for the given FER problem, the small amount of available data limits the kind of statistically relevant conclusions that we aspire to achieve. We propose that the current dataset presented in [16] be expanded to include a larger group of expression samples. The main idea of this paper is to show that the information provided by infrared images is useful for solving the facial expression recognition problem. The information provided by infrared images is normally considered by researchers to be poor compared to visible images, based only on prejudging the quality of thermal images. This paper shows that it is possible to obtain results that are above chance level (around 60–80%), so that a comparison is possible.

These preliminary results offer great insight into the possibilities of the proposed method. As was stated earlier, we believe that the approach offers a simple yet efficient methodology for learning two main tasks common to many pattern recognition problems: ROI selection and feature extraction. This is due to the fact that the learning approach is independent of how the regions are defined and what type of image features are extracted.

Acknowledgments

The benchmark used in this paper is available on the website of the IEEE OTCBVS WS Series Bench, http://www.cse.ohio-state.edu/otcbvs-bench. We thank the Imaging, Robotics, and Intelligent Systems Laboratory for making the dataset available online. The dataset was collected under the DOE University Research Program in Robotics under Grant DOE-DE-FG02-86NE37968; the DOD/TACOM/NAC/ARC Program under Grant R01-1344-18; FAA/NSSA Grants R01-1344-48/49; and the Office of Naval Research under Grant #N000143010022. This research was supported by the LAFMI project. The second author gratefully acknowledges the support of the Junta de Extremadura, granted while Dr. Olague was on sabbatical leave at the Universidad de Extremadura in Mérida, Spain.

References

[1] J. Wilder, P.J. Phillips, C. Jiang, S. Wiener, Comparison of visible and infra-red imagery for face recognition, in: 2nd International Conference on Automatic Face and Gesture Recognition (FG '96), October 14–16, 1996, Killington, Vermont, USA, 1996, pp. 182–191.

[2] F. Prokoski, History, current status, and future of infrared identification, in: CVBVS '00: Proceedings of the IEEE Workshop on Computer Vision Beyond the Visible Spectrum: Methods and Applications (CVBVS 2000), IEEE Computer Society, Washington, DC, USA, 2000, p. 5.

[3] L. Trujillo, G. Olague, R. Hammoud, B. Hernandez, Automatic feature localization in thermal images for facial expression recognition, in: CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)—Workshops, IEEE Computer Society, Washington, DC, USA, 2005, p. 14.

[4] I. Pavlidis, J. Levine, P. Baukol, Thermal image analysis for anxiety detection, in: International Conference on Image Processing, vol. 2, 2001, pp. 315–318.

[5] Y. Sugimoto, Y. Yoshitomi, S. Tomita, A method for detecting transitions of emotional states using a thermal facial image based on a synthesis of facial expressions, Journal of Robotics and Autonomous Systems 31 (3) (2000) 147–160.

[6] Y. Yoshitomi, S. Kim, T. Kawano, T. Kitazoe, Effect of sensor fusion for recognition of emotional states using voice, face image and thermal image of face, in: Proceedings of ROMAN, 2000, pp. 178–183.

[7] B. Fasel, J. Luettin, Automatic facial expression analysis: a survey, Pattern Recognition 36 (1) (2003) 259–275.

[8] M.J. Lyons, J. Budynek, S. Akamatsu, Automatic classification of single facial images, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (12) (1999) 1357–1362.

[9] S. Dubuisson, F. Davoine, J.P. Cocquerez, Automatic facial feature extraction and facial expression recognition, in: AVBPA '01: Proceedings of the Third International Conference on Audio- and Video-Based Biometric Person Authentication, Springer-Verlag, London, UK, 2001, pp. 121–126.

[10] C. Padgett, G. Cottrell, Representing face images for emotion classification, Advances in Neural Information Processing Systems (1996) 894–900.

[11] G. Donato, M.S. Bartlett, J.C. Hager, P. Ekman, T.J. Sejnowski, Classifying facial actions, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (10) (1999) 974–989.

[12] M. Turk, A.P. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3 (1) (1991) 71–86.

[13] J. Holland, Adaptation in Natural and Artificial Systems, Springer, 1975.

[14] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley Professional, 1989.

[15] K. Krawiec, B. Bhanu, Visual learning by evolutionary feature synthesis, IEEE Transactions on Systems, Man and Cybernetics, Part B, Special Issue on Learning in Computer Vision and Pattern Recognition 35 (3) (2005) 409–425.

[16] IEEE OTCBVS WS Series Bench; DOE University Research Program in Robotics under Grant DOE-DE-FG02-86NE37968; DOD/TACOM/NAC/ARC Program under Grant R01-1344-18; FAA/NSSA Grant R01-1344-48/49; Office of Naval Research under Grant No. N000143010022.

[17] Z. Sun, G. Bebis, R. Miller, Object detection using feature subset selection, Pattern Recognition 37 (11) (2004) 2165–2176.

[18] P. Viola, M. Jones, Robust real-time face detection, International Journal of Computer Vision 57 (2) (2004) 137–154.

[19] Z.-Q. Zhao, D.-S. Huang, B.-Y. Sun, Human face recognition based on multi-features using neural networks committee, Pattern Recognition Letters 25 (12) (2004) 1351–1358.

[20] S.G. Kong, J. Heo, B.R. Abidi, J. Paik, M.A. Abidi, Recent advances in visual and infrared face recognition: a review, Computer Vision and Image Understanding 97 (1) (2005) 103–135.

[21] M. Pantic, L.J.M. Rothkrantz, Automatic analysis of facial expressions: the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1424–1445.

[22] F. Bourel, C.C. Chibelushi, A.A. Low, Robust facial expression recognition using a state-based model of spatially-localised facial dynamics, in: FGR, 2002, pp. 113–118.

[23] M. Pantic, L.J.M. Rothkrantz, An expert system for recognition of facial actions and their intensity, Image and Vision Computing 18 (2000) 881–905.

[24] Y.-L. Tian, T. Kanade, J.F. Cohn, Recognizing action units for facial expression analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2) (2001) 97–115.

[25] J. Bala, K. DeJong, J. Huang, H. Vafaie, H. Wechsler, Using learning to facilitate the evolution of features for recognizing visual concepts, Evolutionary Computation 4 (3) (1996) 297–311.

[26] D. Howard, S.C. Roberts, R. Brankin, Target detection in imagery by genetic programming, Advances in Engineering Software 30 (5) (1999) 303–311.

[27] Y. Lin, B. Bhanu, Evolutionary feature synthesis for object recognition, IEEE Transactions on Systems, Man and Cybernetics, Part C, Special Issue on Knowledge Extraction and Incorporation in Evolutionary Computation 35 (2) (2005) 156–171.

[28] M. Zhang, V.B. Ciesielski, P. Andreae, A domain-independent window approach to multiclass object detection using genetic programming, EURASIP Journal on Applied Signal Processing (8) (2003) 841–859 (Special Issue on genetic and evolutionary computation for signal processing and image analysis).

[29] A. Teller, M. Veloso, PADO: a new learning architecture for object recognition, in: K. Ikeuchi, M. Veloso (Eds.), Symbolic Visual Learning, Oxford University Press, 1996, pp. 81–116.

[30] P. Silapachote, D.R. Karuppiah, A. Hanson, Feature selection using AdaBoost for face expression recognition, in: Proceedings of the Fourth IASTED International Conference on Visualization, Imaging, and Image Processing, Marbella, Spain, 2004, pp. 84–89.

[31] J. Yu, B. Bhanu, Evolutionary feature synthesis for facial expression recognition, Pattern Recognition Letters 27 (11) (2006) 1289–1298.

[32] J. Malik, P. Perona, Preattentive texture discrimination with early vision mechanisms, Journal of the Optical Society of America A: Optics, Image Science, and Vision 7 (5) (1990) 923–932.

[33] J. Mao, A.K. Jain, Texture classification and segmentation using multiresolution simultaneous autoregressive models, Pattern Recognition 25 (2) (1992) 173–188.

[34] B. Julesz, J.R. Bergen, Textons, the Fundamental Elements in Preattentive Vision and Perception of Textures, Kaufmann, Los Altos, CA, 1987.

[35] S. Lazebnik, C. Schmid, J. Ponce, A sparse texture representation using local affine regions, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1265–1278.

[36] R.M. Haralick, Statistical and structural approaches to texture, Proceedings of the IEEE 67 (1979) 786–804.

[37] J. Kjell, Comparative study of noise-tolerant texture classification, in: 1994 IEEE International Conference on Systems, Man, and Cybernetics, Humans, Information and Technology, vol. 3, 1994, pp. 2431–2436.

[38] P. Howarth, S.M. Ruger, Evaluation of texture features for content-based image retrieval, in: CIVR, 2004, pp. 326–334.

[39] P.P. Ohanian, R.C. Dubes, Performance evaluation for four classes of textural features, Pattern Recognition 25 (8) (1992) 819–833.

[40] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998) 121–167.

[41] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, 2001.

[42] C. Goutte, Note on free lunches and cross-validation, Neural Computation 9 (6) (1997) 1245–1249.