HMM-based geometric signatures for compact 3D face representation and matching


U. Castellani 1, M. Cristani 1, X. Lu 2, V. Murino 1 and A. K. Jain 3

1 Dipartimento di Informatica, Strada le Grazie, 15 - 37134 Verona, Italy
2 Siemens Corporate Research, College Road East, Princeton, NJ 08540, USA
3 Michigan State University, East Lansing, MI 48824, USA

Abstract

3D face recognition systems improve on current 2D image-based approaches, but in general they must deal with larger amounts of data. A compact representation of 3D faces is therefore often crucial for manipulating the data in 3D face applications such as smart card identity verification systems. We propose a new compact 3D representation that focuses on the most significant parts of the face. We introduce a generative learning approach by adapting Hidden Markov Models (HMMs) to work on 3D meshes. The geometry of the local area around facial fiducial points is modeled by training HMMs, which provide a robust, pose-invariant point signature. This description allows matching by comparing the signatures of corresponding points under a maximum-likelihood principle. We show that our descriptor is robust for recognizing expressions and is faster than current ICP-based 3D face recognition systems while maintaining a satisfactory recognition rate. Preliminary results on a subset of the FRGC 2.0 dataset are reported, considering subjects under different expressions.

1. Introduction

Recent interest in face recognition has focused on the enhancements provided by the acquisition of 3D face data [1, 16]. Three-dimensional information improves face recognition accuracy by making it possible to merge texture and surface geometry features [3, 8, 18]. Although 3D approaches outperform current 2D image-based methods by being invariant to pose and lighting variations, open issues remain, such as finding compact representations for large amounts of data and developing methods robust to face deformation. The most advanced methods proposed recently aim mainly at building expression-invariant recognition systems [9, 2, 11]. Roughly speaking, two main strategies can be identified, based on: (i) expression-invariant measurements [9, 2], and (ii) generation of new synthetic expressions from the neutral one [11]. Among the methods in the first approach [9], an annotated face model (AFM) is fitted to the 3D mesh to generate a geometry image encoding shape information. Then, a wavelet transform is applied for 2D multiresolution matching. The method proposed in [2] assumes that facial expressions can be modeled as isometries of the facial surface, which allows the construction of expression-invariant representations of faces using the bending-invariant canonical forms approach. Regarding the second strategy, a hierarchical geodesic-based resampling strategy is applied to extract landmarks for modeling facial surface deformations [11]. Then, new deformed faces are generated from the 3D neutral models in order to compare the query image (of any expression) with the synthesized templates from the gallery.

It is worth noting that all the above-mentioned expression-invariant methods rely on proper alignment (i.e., registration) between the query and the template face using a global approach (i.e., the whole face is used as the input) based on ICP-like algorithms [11]. This implies that an exhaustive search between dense 3D corresponding points must be carried out, making the approach computationally expensive, especially when a large number of subjects are present in the gallery. Moreover, the entire 3D model of each subject must be kept in the gallery.

In this paper we propose a local approach, introducing a new compact representation for robustly describing a few facial landmarks. Recognition is performed by finding the best match among the signatures of corresponding points, without an alignment procedure. For each extracted facial landmark, multidimensional local geometric features are sampled along a 3D geodesic spiral pathway lying in a neighborhood zone. This information is modeled by a Hidden Markov Model (HMM) [15], providing a reliable model-based facial-point description. In detail, the HMM compactly captures the local surface geometry variation around the facial landmark points for recognition purposes.


Local regions around fiducial landmarks carry a large portion of the discriminative information about facial identity. We show that our HMM-based framework is effective for both expression and subject recognition. We therefore propose to deal with possibly deformed faces by adopting a new, simple scheme. A two-step procedure is introduced: (i) recognize the expression first, and then (ii) match the input query only to templates with the same expression type in the gallery. This reduces the number of comparisons, improving the speed of matching. In particular, for the expression recognition stage, a single HMM is trained from a set of corresponding landmarks of training subjects having the same expression, and used as an expression classifier. For the subject recognition stage, a per-subject HMM is trained from the landmark descriptions of the same subject. Subsequently, point matching among different expressions and subjects is performed by classifying corresponding landmark points using likelihood-based measures.

Other methods based on the extraction of local geometric features have been proposed in the literature [12, 4, 18, 1, 6]. In [6], a curvature-based segmentation of the face is carried out. Then a set of features is extracted that describes both curvature and metric properties of the face. Thus, each face becomes a point in a feature space, and nearest-neighbor matching is performed. A similar approach is adopted in [12]: a segmentation based on mean and Gaussian curvatures is performed first, and then a feature vector based on the segmented regions is created for matching. In [4, 18] the point signature descriptor is extended to 3D faces, employing a matching scheme that discards those parts of the face that deform non-rigidly. An exhaustive survey of 3D face recognition methods is reported in [1] for further reading. Finally, we highlight that very few studies propose statistical learning approaches based on HMMs, and those mainly focus on extending 2D global methodologies to the 3D domain [17]. Here, instead, we design a new HMM-based geometric paradigm that characterizes 3D shape variation by merging different local facial properties into the same representation, which is useful for both expression and subject recognition.

The rest of the paper is organized as follows. In Section 2, the HMM framework for describing facial landmarks is introduced. In Section 3, we illustrate how our proposed descriptor can be employed in an expression-driven face recognition scenario. Section 4 reports some face recognition results, and finally conclusions are drawn in Section 5.

2. HMM-based geometric signature of facial landmarks

The HMM-based signature of facial landmarks is the core of the proposed approach. The goal is to build a compact description that summarizes information related to interest points and their neighborhoods. From the face scan, organized as a 3D mesh, $I$ interest points or landmarks are selected. Let us focus on landmark $v_i$; around it, we build a clockwise spiral pathway $s(v_i)$ connecting vertices which lie at 1-ring distance [13], then at 2-ring distance and so on, until a fixed geodesic radius $r$ is reached. Connections among vertices which lie at different ring distances are rearranged in order to keep the area covered by the spiral as regular as possible, thereby obtaining a circular geodesic area around $v_i$. If holes are present on the surface (especially around the mouth), no data is collected there and the spiral jumps to the next available point, as shown in Fig. 1.

Figure 1. Facial landmark signature: on the left, two landmarks are depicted with black dots and their respective spiral pathways. On the top-right, a spiral $s(v_i)$ built in the presence of a hole in the mesh; red dotted arrows give an idea of how the different portions of the spiral are rearranged in a 1D array. On the bottom-right, a zoom on the nose point shows that the spiral pathway is limited by the geodesic radius $r$.

Along this pathway, we extract local point information [13] composed of the maximal and minimal curvatures. Experimentally, we found that other local features, such as the Gaussian curvature and the shape index, do not improve the description. Once the data on the spiral $s(v_i)$ is acquired, we observe that all its 2-dimensional entries $\{o\}_i$ form entities which, in principle, can be quantized to a few values that occur repeatedly along the spiral. For this reason, modeling the spiral as a stochastic process, in which the different entities are treated as discrete states, is a reasonable choice. The model best suited to this idea is the discrete-time Hidden Markov Model (HMM) [15].
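The paper gives no pseudocode for the spiral construction, so the following Python sketch is only an approximation under stated simplifications: ring distance is taken as BFS hop count on the mesh adjacency, and the clockwise ordering and the rearrangement around holes are not reproduced. All names (spiral_sequence, adj, curv) are ours.

```python
import numpy as np
from collections import deque

def spiral_sequence(adj, curv, seed, max_ring):
    """Collect per-vertex features ring by ring around the landmark `seed`.

    adj:      dict vertex -> list of neighboring vertices (mesh 1-ring adjacency)
    curv:     (V, 2) array of [max_curvature, min_curvature] per vertex
    seed:     landmark vertex index v_i
    max_ring: number of rings, approximating the geodesic radius r
    """
    ring = {seed: 0}          # BFS hop count stands in for ring distance
    order = [seed]
    queue = deque([seed])
    while queue:
        v = queue.popleft()
        if ring[v] == max_ring:
            continue
        for u in adj[v]:
            if u not in ring:
                ring[u] = ring[v] + 1
                order.append(u)
                queue.append(u)
    # BFS already emits vertices ring by ring, giving a 1D "spiral" array;
    # vertices missing from `adj` (holes) are simply never visited.
    return np.asarray([curv[v] for v in order])  # (T, 2) observation sequence
```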

An HMM can be viewed as a Markov model whose states are not directly observable: instead, each state is characterized by a probability distribution function modeling the observations corresponding to that state. More formally, an HMM is defined by the following entities [15]:

• $S = \{S_1, S_2, \ldots, S_N\}$, the finite set of (hidden) states; in our case each state is associated with a particular local geometric configuration that occurs along the spiral.

• the transition matrix $A = \{a_{kj}\}$, $1 \le k, j \le N$, representing the probability of moving from state $S_k$ to state $S_j$,

$$a_{kj} = P[Q_{t+1} = S_j \mid Q_t = S_k], \quad 1 \le t \le T,$$

with $a_{kj} \ge 0$ and $\sum_{j=1}^{N} a_{kj} = 1$, where $T$ is the length of the sequence and $Q_t$ denotes the state occupied by the model at site $t$. This matrix encodes how the different local configurations succeed one another along the spiral.

• the emission matrix $B = \{b(o|S_k)\}$, indicating the probability of emitting symbol $o \in V$ when the system state is $S_k$; $V$ can be a discrete alphabet or a continuous set (e.g., $V = \mathbb{R}$), in which case $b(o|S_k)$ is a probability density function. We used a 2-dimensional Gaussian HMM, i.e.,

$$b(o|S_k) = \mathcal{N}(o \mid \mu_k, \Sigma_k),$$

where $\mathcal{N}(o \mid \mu, \Sigma)$ denotes a Gaussian density with mean $\mu$ and diagonal covariance matrix $\Sigma$, evaluated at $o$, which represents an entry of the spiral pathway. This distribution codifies how the values observed on the spiral relate to each hidden state.

• $\pi = \{\pi_k\}$, the initial state probability distribution,

$$\pi_k = P[Q_1 = S_k], \quad 1 \le k \le N,$$

with $\pi_k \ge 0$ and $\sum_{k=1}^{N} \pi_k = 1$.

For convenience, we represent an HMM by a triplet of parameters $\lambda = (A, B, \pi)$.

The learning of the HMM parameters, given an observed sequence $s(v_i)$, is usually performed using the well-known Baum-Welch (BW) algorithm [15], which determines the parameters of the model $\lambda_i$ by maximizing the likelihood $P(s(v_i)|\lambda_i)$. In this way, the HMM gives a statistical encoding of the facial landmark and its neighborhood, taking into account the uncertainty in the data. In effect, each HMM state captures a geometrical aspect that is particularly evident near $v_i$. In practice, as shown in the experiments, this characterization is robust to pose, irregular sampling (for example, due to holes in the mesh), and variations in the resolution of the mesh on which the interest point lies.
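As a concrete illustration, here is a minimal training sketch assuming the Python hmmlearn package (our choice; the paper's own implementation is in Matlab). The random array is only a placeholder for a real spiral curvature sequence.

```python
import numpy as np
from hmmlearn import hmm

# Observation sequence for one landmark: T spiral samples of
# [max_curvature, min_curvature], shape (T, 2).
s_vi = np.random.randn(200, 2)  # placeholder for real spiral data

# Baum-Welch fit of a diagonal-covariance Gaussian HMM, lambda_i = (A, B, pi).
lam_i = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=25)
lam_i.fit(s_vi)

print(lam_i.score(s_vi))   # log P(s(v_i) | lambda_i)
print(lam_i.transmat_)     # A: (5, 5) transition matrix
print(lam_i.startprob_)    # pi: initial state distribution
```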

3. Expression-driven 3D face recognition

In a typical face recognition scenario (e.g., driver license, passport), the face image captured at the enrollment stage (for populating the gallery) generally has a neutral expression, but at the verification stage a user may present any expression. Although it is only at the verification stage that we want to relax the neutral-expression requirement, we propose to acquire 3D scans of a subject with multiple expressions for the gallery construction as well. Even though this approach implies the acquisition of a larger number of instances, we highlight that, in principle, only the HMM parameters have to be stored in the gallery.

The proposed application is based on two main phases: (i) gallery construction, and (ii) testing stage.

3.1. Gallery construction

During the enrollment stage, the 3D face of each subject is captured. As mentioned above, the same subject is acquired with different expressions, such as those described in [2] (i.e., neutral, surprise, sadness, and so on). In order to build the gallery, the HMM training phase must be carried out for both expression and subject recognition. For each 3D face, $I$ fiducial landmarks are extracted according to the anthropometric guidelines [5].

Let $O^{\text{EXP}}_j = [s(v_1), \ldots, s(v_I)]$ be the set of observed sequences around the $I$ landmarks extracted for subject $j$ having the expression $\text{EXP} \in \{\text{NEU}, \text{SML}, \ldots\}$ (i.e., neutral, smiling, and so on). For expression recognition, we collect this data from subjects with the same expression. Then, the face expression model $\Lambda^{\text{EXP}} = [\lambda^{\text{EXP}}_1, \ldots, \lambda^{\text{EXP}}_I]$ is generated by learning an HMM $\lambda_i$ for each of the $I$ landmarks, taking into account all the subjects $j = 1, \ldots, M$ with that expression. The process is repeated for all the different expressions, obtaining $\Lambda^{\text{NEU}}$, $\Lambda^{\text{SML}}$, and so on. Similarly, for subject recognition, the subject model $\Lambda^{\text{EXP}}_j = [\lambda^{\text{EXP}}_{1j}, \ldots, \lambda^{\text{EXP}}_{Ij}]$ is estimated for each subject $j$ having the particular expression EXP. Note that a single sample is sufficient for learning the HMM parameters of a subject model. Figure 2 shows the gallery construction scheme.

Figure 2. Gallery construction scheme. From all the subjects with the same expression, one expression model is estimated. Then, for each subject with a given expression, a subject model is computed.
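A sketch of the gallery construction, again assuming hmmlearn and reusing the training idiom above; the container layout (dicts keyed by expression and by (subject, expression)) is our own illustration, not the paper's data structure.

```python
import numpy as np
from hmmlearn import hmm

def train_hmm(seqs, n_states=5):
    """Fit one diagonal-covariance Gaussian HMM on a list of (T, 2) sequences."""
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=25)
    model.fit(np.concatenate(seqs), lengths=[len(s) for s in seqs])
    return model

def build_gallery(scans, I):
    """scans: dict (subject_id, EXP) -> list of I landmark sequences."""
    expressions = {exp for (_, exp) in scans}
    # Expression models: pool landmark i across all subjects with expression EXP.
    expr_models = {
        exp: [train_hmm([seqs[i] for (sid, e), seqs in scans.items() if e == exp])
              for i in range(I)]
        for exp in expressions
    }
    # Subject models: one HMM per landmark, from that subject's single scan.
    subj_models = {
        (sid, exp): [train_hmm([seqs[i]]) for i in range(I)]
        for (sid, exp), seqs in scans.items()
    }
    return expr_models, subj_models
```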

3.2. Testing stage

Given a test sample of a subject with an arbitrary expression, the $I$ landmarks are extracted and the respective sequences are collected, defining $O^{\text{UNK}}_{\text{test}} = [s_{\text{test}}(v_1), \ldots, s_{\text{test}}(v_I)]$ (i.e., the sequences of the $I$ landmarks of the test subject, whose expression is unknown). A two-step testing procedure is then performed. Facial expression recognition is carried out first, adopting a maximum-likelihood approach on each fiducial point. In particular, for each fiducial point $i$, the following score is computed:

$$ml^{E}_{i} = \max_{\text{EXP}} \log P(s_{\text{test}}(v_i) \mid \lambda^{\text{EXP}}_{i}). \qquad (1)$$

Therefore, each landmark "votes" for a particular expression, and a majority criterion is used to classify the expression of the test face.
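A minimal sketch of this voting step (Eq. 1), under the same hmmlearn assumption; expr_models follows the layout of the gallery sketch above.

```python
def recognize_expression(test_seqs, expr_models):
    """Majority vote over per-landmark maximum-likelihood decisions (Eq. 1).

    test_seqs:   list of I observation sequences s_test(v_i), each (T, 2)
    expr_models: dict EXP -> [lambda_1^EXP, ..., lambda_I^EXP]
    """
    votes = []
    for i, seq in enumerate(test_seqs):
        # ml_i^E: pick the expression maximizing log P(s_test(v_i) | lambda_i^EXP)
        scores = {exp: models[i].score(seq) for exp, models in expr_models.items()}
        votes.append(max(scores, key=scores.get))
    # majority criterion over the I landmark votes
    return max(set(votes), key=votes.count)
```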

Once the expression has been recognized, subject recognition is performed by computing the following landmark-similarity score:

$$lS_{ij} = \log P(s_{\text{test}}(v_i) \mid \lambda^{\text{EXP}'}_{ij}), \qquad (2)$$

where $lS_{ij}$ is the log-likelihood of landmark $i$ w.r.t. the gallery subject model $j$, with $j = 1, \ldots, M$ referring to those gallery models having the same expression $\text{EXP}'$ recognized in the previous step. Here, the so-called Borda count criterion [7] is used: for each landmark, the landmark-similarity scores of all gallery subjects are sorted; a subject then receives the sum of the corresponding ranks observed over the landmarks. Finally, the subject $j'$ with the lowest sum of ranks is chosen as the match. Figure 3 shows the scheme of the testing stage.
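The Borda-count fusion (Eq. 2) can be sketched as follows; subj_models is assumed to hold only the gallery models with the recognized expression EXP'.

```python
def recognize_subject(test_seqs, subj_models):
    """Borda-count fusion of per-landmark log-likelihood ranks (Eq. 2).

    subj_models: dict subject_id -> [lambda_1j, ..., lambda_Ij], all models
    sharing the already recognized expression EXP'.
    """
    ids = list(subj_models)
    rank_sums = {j: 0 for j in ids}
    for i, seq in enumerate(test_seqs):
        # lS_ij = log P(s_test(v_i) | lambda_ij) for every gallery subject j
        ll = {j: subj_models[j][i].score(seq) for j in ids}
        # rank 0 goes to the highest likelihood
        for rank, j in enumerate(sorted(ids, key=lambda s: -ll[s])):
            rank_sums[j] += rank
    # the subject with the lowest total rank is the match j'
    return min(rank_sums, key=rank_sums.get)
```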

4. Experimental results

The proposed approach has been tested on the 3D range images available in the FRGC dataset [14]. We selected the first 80 subjects among the scans of both the 'fall 2003' and 'spring 2004' subsets [14]; each subject may therefore be acquired with different expressions and poses, and at different time periods. In particular, our preliminary experiments focus on two expressions: neutral and smiling. Among the selected subjects, 40 are observed with both the neutral and smiling expressions, while 40 subjects are observed with only the neutral one¹. We collect 2 neutral scans for each of the 80 subjects (i.e., 80 scans for the gallery and 80 scans for testing) and 2 smiling scans for each of the 40 smiling subjects (i.e., 40 scans for the gallery and 40 scans for testing). Figure 4 shows some scans of the subjects with both expressions.

Figure 3. Testing stage. Given a test sample, local sequences $O^{\text{UNK}}_{\text{test}}$ are collected and fed to each expression model to recognize the expression. Then, the same sequences are fed to each of the subject models having the already recognized expression. The subject $j'$ satisfying the Borda count criterion is the output.

Figure 4. Example scans from subjects in the FRGC V.2.0 database. Each subject is observed with both the neutral (top) and smiling (bottom) expressions.

In each scan, $I = 9$ landmarks are extracted and the neighborhood information is collected as described in Section 2.

First, we evaluate the proposed algorithm with the 7 annotated landmarks provided in the FRGC database [14]. The two additional landmarks (i.e., the cheek points) are obtained by extracting the middle point along the geodesic path between the mouth corner and the outside eye corner. Figure 5 shows an example of a scan illustrating both the respective landmarks and spirals. Note that the lengths of the spirals are chosen so as to cover the most significant part of the face.
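One way to obtain such a cheek point is to approximate the geodesic with Dijkstra's algorithm on the mesh edge graph and take the path vertex closest to half the total length; a sketch assuming SciPy, with all function and variable names ours.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_midpoint(verts, edges, a, b):
    """Approximate middle point of the geodesic path from vertex a to vertex b.

    verts: (V, 3) vertex positions; edges: (E, 2) mesh edge indices.
    """
    # Weight each edge by its Euclidean length and run Dijkstra from a.
    w = np.linalg.norm(verts[edges[:, 0]] - verts[edges[:, 1]], axis=1)
    g = coo_matrix((w, (edges[:, 0], edges[:, 1])), shape=(len(verts),) * 2)
    dist, pred = dijkstra(g, directed=False, indices=a, return_predecessors=True)
    # Walk predecessors back from b to recover the shortest path a -> b
    # (the mesh is assumed connected, so the walk terminates at a).
    path = [b]
    while path[-1] != a:
        path.append(pred[path[-1]])
    path.reverse()
    # Return the path vertex whose cumulative length is closest to half.
    steps = [np.linalg.norm(verts[u] - verts[v]) for u, v in zip(path, path[1:])]
    cum = np.cumsum([0.0] + steps)
    return path[int(np.argmin(np.abs(cum - dist[b] / 2.0)))]
```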

¹Note that among the first 80 subjects in the FRGC, not all are acquired with different expressions.


Figure 6. The Viterbi path of the HMM models built on corresponding points. Two scans of the same subject are fed to the same neutral subject model in (a) and (b), respectively (i.e., the gallery and the testing). The point in (c) is the same as in (a), but it is fed to the neutral expression model. The corresponding point on a smiling scan is fed to the smiling expression model in (d).

Figure 5. Data collection. Given a scan (left), $I = 9$ landmarks are extracted and local features along the spiral pathway are collected (right).

For the face expression recognition stage, we randomly select 20 neutral and 20 smiling scans. For each class, the expression model is trained as described in Section 2, obtaining $\Lambda^{\text{NEU}}$ and $\Lambda^{\text{SML}}$. Then, for the subject recognition stage, the subject models are trained, obtaining $\Lambda^{\text{NEU}}_j$ (with $j = 1, \ldots, 80$) and $\Lambda^{\text{SML}}_j$ (with $j = 1, \ldots, 40$).

In both training stages, we found that setting $N = 5$ hidden states gives the best performance. According to the proposed scheme, the test evaluation is performed as follows: given a test scan, the expression is recognized first, and then subject recognition is carried out only among the gallery subjects with the recognized expression. In Figure 6, the Viterbi path of the HMMs built on corresponding points of the same subject is shown. In particular, in Figures 6(a) and (b) the same subject model $\lambda^{\text{NEU}}_k$ is used to recognize subject $k$ with neutral expression, observed in two different scans. For visual clarity, a state-identifying number is positioned on the area which mainly exhibits the presence of that state. Note that similar states lie in corresponding areas. In Figures 6(c) and (d) the Viterbi path is shown (states are indicated with letters) with respect to the expression models $\lambda^{\text{NEU}}$ and $\lambda^{\text{SML}}$, respectively, of the same subject with different expressions. Here, both the states (visible in the figure) and the transitions are different. Note further that the same point of the same scan in (a) and in (c) assumes a different meaning depending on the considered model (i.e., subject or expression model). In our experiment, the two-class facial expression recognition accuracy is 100%. Moreover, the recognition rate for neutral subjects reaches 92.50%, while for smiling subjects it is 90.00%; the overall recognition accuracy of the proposed approach is therefore 91.25%. Note that the images are quite noisy, with several holes, especially around the eyes and the mouth (e.g., see the eyebrows in Figure 5). Moreover, the subjects are acquired at different distances from the sensor, yielding models at varying resolutions (e.g., see the different spirals in Figures 6(c) and (d)). Despite this, the proposed descriptor is robust and allows the method to find correct matches. Table 1 shows the computational effort required by the main steps, where $I$ is the number of landmarks, $\eta$ is the mean size of the landmark neighborhood used in computing the local features, $N$ is the number of hidden states, $\tau$ is the mean length of the spirals associated with a landmark, and $W$ is the mean number of BW iterations [15].

It is worth noting that the main computational effort in our approach lies in the training phase. The test stage is reasonably fast for on-line applications, as required in a typical biometric scenario². Nevertheless, the training phase constructs the expression and subject models, which are the only data that need to be stored in the gallery. Each landmark needs ≈ 2 KB of memory, whereas a typical uncompressed 3D model occupies ≈ 13 MB; for instance, the model in Figure 5 has 95,520 vertices and 182,960 triangles, for a total of 13.318 MB.

Step                     Complexity           Run. time
Data collection          O(I · η · τ)         0.1
Single model training    O(I · N² · τ · W)    5.65
Single subject testing   O(I · N² · τ)        0.65

Table 1. Run times (mean sec.) of the main steps of the proposed framework.

²Currently, the code is not optimized; it will be drastically improved by switching from Matlab to C++.
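As a rough, back-of-the-envelope check of the per-landmark figure (our estimate, not taken from the paper): with $N = 5$ states and $d = 2$ features, each Gaussian HMM stores

$$\underbrace{N^2}_{A} + \underbrace{Nd}_{\mu_k} + \underbrace{Nd}_{\operatorname{diag}(\Sigma_k)} + \underbrace{N}_{\pi} = 25 + 10 + 10 + 5 = 50$$

parameters, i.e., about 400 bytes as 8-byte doubles; the reported ≈ 2 KB per landmark is thus plausible once serialization and bookkeeping overhead are included, and remains far below the ≈ 13 MB of a raw mesh.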

Finally, we test our method by extracting the landmarks automatically, on a subset of 20 subjects (with 20+20 neutral and 20+20 smiling scans). We use the automatic landmark detection method proposed in [10]. The performance is similar to the manual landmark case: the 2-class expression recognition rate is again 100%, while the subject recognition rate is 90.00% and 85.00% for neutral and smiling, respectively, for an overall rate of 87.5%.

5. Discussions

We have proposed a new statistical signature for the compact representation of a few facial landmarks, in the context of 3D face recognition applications. Our HMM-based geometric descriptor allows a drastic reduction in the amount of stored data, since only a few HMM parameters are estimated for each fiducial point. We have shown that the same learning framework is suitable for both expression and subject recognition. On this basis, we proposed a simple strategy for expression-invariant 3D face recognition. Preliminary results on a subset of the FRGC V.2.0 dataset support the effectiveness of the approach. Although the accuracy does not outperform current expression-invariant recognition methods (e.g., in [9] a rate of 97% is reached on the whole FRGC V.2.0 dataset), the obtained performance is still satisfactory. Moreover, we highlight that most of the methods providing the highest performance are based on ICP-like algorithms that involve a large number of control points; the comparison is therefore not entirely fair, since our algorithm considers only a few points. We are confident that the results can be improved by enlarging the number of facial landmarks. Finally, regarding speed, we have outlined the improvements obtained with the proposed approach by reporting both the computational complexity and the timing of the main stages.

Future work will address the extension of our method to 3D smart card identity verification systems (SCIVS). In fact, our 3D face descriptor should be particularly suitable for this kind of application due to the compactness and robustness of the representation.

References

[1] K. W. Bowyer, K. Chang, and P. Flynn. A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition. Computer Vision and Image Understanding, 101(1):1–15, 2006.

[2] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Three-dimensional face recognition. International Journal of Computer Vision, 64(1):5–30, 2005.

[3] K. I. Chang, K. W. Bowyer, and P. J. Flynn. An evaluation of multimodal 2D+3D face biometrics. IEEE Trans. Patt. Anal. Mach. Intell., 27(4):619–624, 2005.

[4] C.-S. Chua, F. Han, and Y.-K. Ho. 3D human face recognition using point signature. In Automatic Face and Gesture Recognition, 2000.

[5] L. Farkas. Anthropometry of the Head and Face. Raven Press, 2nd edition, 1994.

[6] G. Gordon. Face recognition based on depth and curvature features. In CVPR, 1992.

[7] A. Jain, K. Nandakumar, and A. Ross. Score normalization in multimodal biometric systems. Pattern Recognition, 38:2270–2285, 2005.

[8] A. K. Jain and S. Z. Li, editors. Handbook of Face Recognition. Springer-Verlag, 2005.

[9] I. A. Kakadiaris, G. Passalis, G. Toderici, M. N. Murtuza, Y. Lu, N. Karampatziakis, and T. Theoharis. Three-dimensional face recognition in the presence of facial expressions: An annotated deformable model approach. IEEE Trans. Patt. Anal. Mach. Intell., 29(4):640–649, 2007.

[10] X. Lu and A. K. Jain. Automatic feature extraction for multiview 3D face recognition. In Automatic Face and Gesture Recognition, 2006.

[11] X. Lu and A. K. Jain. Deformation modeling for robust 3D face matching. In CVPR, 2006.

[12] A. Moreno, A. Sánchez, J. Vélez, and F. Díaz. Face recognition using 3D surface-extracted descriptors. In Irish Machine Vision and Image Processing Conference, 2003.

[13] S. Petitjean. A survey of methods for recovering quadrics in triangle meshes. ACM Computing Surveys, 34(2), 2002.

[14] P. Phillips, P. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek. Overview of the face recognition grand challenge. In CVPR, 2005.

[15] L. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 1989.

[16] A. Scheenstra, A. Ruifrok, and R. C. Veltkamp. A survey of 3D face recognition methods. In Audio- and Video-Based Biometric Person Authentication, Lecture Notes in Computer Science, pages 891–899. Springer, 2005.

[17] F. Tsalakanidou, S. Malassiotis, and M. Strintzis. Integration of 2D and 3D images for enhanced face authentication. In Automatic Face and Gesture Recognition, 2004.

[18] Y. Wang and C.-S. Chua. Face recognition from 2D and 3D images using 3D Gabor filters. Image and Vision Computing, 23(11):1018–1028, 2005.