HMM-based architecture for face identification

Ferdinando Samaria*† and Steve Young*

This paper describes an approach to the problem of face identification which uses Hidden Markov Models (HMM) to represent the statistics of facial images. HMMs have previously been used with considerable success in speech recognition. Here we describe how two-dimensional face images can be converted into one-dimensional sequences to allow similar techniques to be applied. We investigate the factors that affect the choice of model type and model parameters. We show how a HMM can be used to automatically segment face images and extract features that can be used for identification. Successful results are obtained when facial expression, face details and lighting vary. Small head orientation changes are also tolerated. Experiments are described which assess the performance of the HMM-based approach and the results are compared with the well-known Eigenface method. For the given test set of 50 images, the HMM approach performs favourably. We conclude by summarizing the benefits of using HMMs in this area, and indicate future directions of work.

Keywords: face recognition, image segmentation, hidden Markov models

In recent years, substantial research effort has gone into understanding how to build a successful model for face identification. Psychologists have devoted particular attention to the problem, because of the fundamental role played by faces in human interactions. A collection of articles on the psychology of face identification may be found elsewhere1,2. One of the striking aspects of human face identification is its robustness. Humans are able to identify faces even when there is distortion (as in the case of a caricature), coarse quantization, occlusion of details (a person wearing sunglasses) and even when the facial image has been inverted3. The apparent complexity of the identification process and the high rate of success for humans have led researchers to argue that neural specialization has evolved to support a processor specific to faces4. In support of this, special neurons responsive to faces have been detected in the cerebral cortex of monkeys5.

*Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, UK
†Olivetti Research Ltd, Old Addenbrookes Site, 24a Trumpington Street, Cambridge CB2 1QA, UK
Paper received: 7 December 1993; revised paper received: 3 March 1994

In addition to the cognitive aspects, there are many applications for the automatic identification of faces by machine. For example, a robust identification system could be used for workstation6 and building security, credit and cash card verification, criminal identification and video-document retrieval. This interest in practical applications has led to a variety of algorithms and approaches. In an early paper by Goldstein et al.7, a set of 34 features was manually extracted from a collection of 255 faces and used for identification. In recent years, more automated approaches have been explored and a selection of references to these may be found elsewhere8,9.

In this paper, the problem of face identification is addressed from the perspective of statistical pattern recognition. Intuitively, a face can be divided into a number of regions such as mouth, eyes, nose, etc., and if these could be located reliably, then standard pattern matching techniques could be used on each region individually to compute an overall distance metric. However, accurate location is in practice very difficult. Furthermore, the precise demarcation of the regions is fuzzy, since it is unclear where, for example, the mouth region ends and the chin region begins.

A potential solution to the above problem is to associate facial regions with the states of a continuous density hidden Markov model10. This allows the boundaries between regions to be represented by probabilistic transitions between states and the actual image within a region to be modelled by a multivariate Gaussian distribution. In the general case, such a Markov model would need to be two-dimensional. However, since faces are broadly symmetric, a first-order approximation can be made in which facial regions are restricted to horizontal bands. In this case, simple one-dimensional HMMs can be used.

0262-8856/94/08/0537-07 © 1994 Butterworth-Heinemann Ltd

Image and Vision Computing Volume 12 Number 8 October 1994 537

This paper describes such a system of face identification based on standard one-dimensional HMMs. The paper is organized as follows. In the next section, the basic theory of HMMs is reviewed for the benefit of readers who are unfamiliar with this technique, and then their application to face recognition is described. We then present an experimental evaluation of a HMM-based face recognition system and compare the results with a system based on Eigenfaces9. Finally, our conclusions are presented and future work in this area discussed.

HMMs FOR FACE RECOGNITION

This section outlines the basic principles of HMMs, and explains how they can be used for face recognition.

HMMs are used for the stochastic modelling of non-stationary vector time-series. As such, they have an immediate and obvious application in speech processing applications, particularly recognition, where the signal of interest is naturally represented as a time-varying sequence of spectral estimates. Because of this, much of the development of HMMs in recent years has been done within the speech area. A comprehensive tutorial describing HMMs is given by Rabiner11, and a good modern treatment of HMM-based speech recognition is given by Rabiner and Juang12.

Hidden Markov Models

A HMM provides a statistical model for a set of observation sequences. Let a particular observation sequence have length T and be denoted as o1..oT. In our case, an observation ot is a block of pixels extracted from an image, and the sequence is formed by scanning the image in some order. This process of sampling the image is discussed further in the following sections. A HMM consists of a sequence of states numbered 1 to N and it is best understood as a generator of observations. The states are connected together by arcs, and each time that a state j is entered, an observation is generated according to the multivariate Gaussian distribution bj(ot) associated with that state. The arcs themselves have transition probabilities associated with them such that a transition from state i to state j has probability aij. All of the experimental work described here uses the HTK software package described by Young13, which adopts the convention that a HMM always starts in state 1 and ends in state N, and furthermore, both of these states are non-emitting states. A HMM is thus defined by a set of transition parameters A = {aij} and a set of Gaussian output functions B = {bj(.)}. For a given model λ = {A, B}, the joint likelihood of a state sequence S = s1..sT and the corresponding observation sequence O = o1..oT is given by multiplying each transition probability by each output probability at each step t as follows:

    P(O, S | λ) = a_{s_0 s_1} ∏_{t=1}^{T} b_{s_t}(o_t) a_{s_t s_{t+1}}    (1)

where s_0 = 1 and s_{T+1} = N. In practice, the state sequence is unknown, i.e. it is hidden, and so equation (1) cannot be evaluated. However, the likelihood P(O | λ) can be evaluated by summing over all possible state sequences:

    P(O | λ) = Σ_S P(O, S | λ)    (2)
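To make equations (1) and (2) concrete, the sketch below evaluates them by brute force on a tiny invented model (1D Gaussian outputs and made-up transition probabilities, not anything from the paper). It follows the HTK convention of non-emitting entry and exit states:

```python
import itertools
import math

# Invented 4-state toy model in the HTK convention: states 0 and 3 are the
# non-emitting entry/exit states; states 1 and 2 emit via 1D Gaussians.
A = [[0.0, 1.0, 0.0, 0.0],   # a_ij: transition probabilities
     [0.0, 0.6, 0.3, 0.1],
     [0.0, 0.0, 0.8, 0.2],
     [0.0, 0.0, 0.0, 0.0]]
MEAN = {1: 0.0, 2: 5.0}      # parameters of the Gaussian output densities b_j
VAR = {1: 1.0, 2: 1.0}

def b(j, o):
    """Gaussian output density b_j(o) of emitting state j."""
    return math.exp(-0.5 * (o - MEAN[j]) ** 2 / VAR[j]) / math.sqrt(2 * math.pi * VAR[j])

def joint(O, S):
    """Equation (1): P(O, S | lambda) for a known emitting-state sequence S."""
    p = A[0][S[0]] * b(S[0], O[0])           # entry transition + first output
    for t in range(1, len(O)):
        p *= A[S[t - 1]][S[t]] * b(S[t], O[t])
    return p * A[S[-1]][3]                   # exit transition

def total_likelihood(O):
    """Equation (2): P(O | lambda), summing equation (1) over all sequences."""
    return sum(joint(O, S) for S in itertools.product([1, 2], repeat=len(O)))

print(total_likelihood([0.2, 4.9, 5.1]))     # dominated by the path 1 -> 2 -> 2
```

Brute-force enumeration is exponential in T; the forward-backward recursion discussed below computes the same quantity efficiently.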


The key attraction of HMMs is that there is a simple procedure for finding the parameters λ which maximize equation (2). This procedure is usually referred to as Baum-Welch reestimation, and it depends for its operation upon the forward-backward algorithm. The latter allows the so-called forward probability P(o1..ot, st = j | λ) and the backward probability P(ot+1..oT | st = j, λ) to be found efficiently via a simple recursion. The product of these two probabilities, when normalized, gives the probability of occupying state j at step t given the observation sequence O. Using this state occupation probability, a new set of HMM parameters for each state j can be found, essentially by computing weighted averages. This process is repeated until the parameter estimates converge, and it can be shown14,15 that this will be at a local maximum of P(O | λ). All of the above has been described in terms of a single observation sequence, but it is trivial to extend this to maximize over a set of observation sequences. Thus, given one or more training observation sequences known to come from a specific face, the parameters of a HMM can be estimated to form a statistical model for that face.

To use HMMs for identification, an observation sequence is extracted from the unknown face, and then the likelihood of each HMM generating this face is computed. The HMM which has the highest likelihood then identifies the unknown face. This likelihood should strictly be the total likelihood as defined in equation (2). However, in practice, it is more convenient to find the state sequence which maximizes equation (1) and use the corresponding maximum likelihood instead. This maximization, known as the Viterbi algorithm16, is a simple dynamic programming optimization procedure. The advantage of using it instead of the full likelihood computed by the forward-backward algorithm is that it also yields the maximum likelihood state sequence as a by-product, and this can be useful in determining which regions of the observation sequences are being modelled by each state. Subsequent sections will make extensive use of this facility to give some insight into the identification processes.
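The Viterbi search can be sketched in a few lines. The model below is invented (two emitting states with discrete outputs rather than the Gaussians used in the paper), but the dynamic programming recursion and the back-pointer trace that yields the state sequence as a by-product are the same:

```python
import math

NEG_INF = float("-inf")

def viterbi(obs, states, log_a, log_b, log_init):
    """Find the state sequence maximizing the joint likelihood of equation (1).

    log_a[(i, j)] -- transition log-probabilities (missing pairs are forbidden)
    log_b(j, o)   -- output log-probability of observation o in state j
    log_init[j]   -- entry log-probabilities
    Returns (max log-likelihood, maximum likelihood state sequence)."""
    delta = {j: log_init.get(j, NEG_INF) + log_b(j, obs[0]) for j in states}
    backptrs = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for j in states:
            best = max(states, key=lambda i: prev[i] + log_a.get((i, j), NEG_INF))
            delta[j] = prev[best] + log_a.get((best, j), NEG_INF) + log_b(j, o)
            ptr[j] = best
        backptrs.append(ptr)
    last = max(delta, key=delta.get)
    path = [last]                       # trace back the by-product state sequence
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return delta[last], path[::-1]

# Invented 2-state model: state 0 favours 'x', state 1 favours 'y', and only
# 0 -> 0, 0 -> 1, 1 -> 1 transitions are allowed (a top-bottom style topology).
states = [0, 1]
log_a = {(0, 0): math.log(0.7), (0, 1): math.log(0.3), (1, 1): 0.0}
emit = {0: {"x": 0.9, "y": 0.1}, 1: {"x": 0.2, "y": 0.8}}
score, path = viterbi(["x", "x", "y", "y"], states, log_a,
                      lambda j, o: math.log(emit[j][o]), {0: 0.0})
print(path)                             # the alignment: [0, 0, 1, 1]
```

The returned path is the segmentation used throughout the paper to inspect which part of each observation sequence a state is modelling.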

The above describes the main elements of using HMMs for pattern recognition. The essential point is that each HMM represents the statistical distribution of all observation sequences associated with a particular class. Thus, in the case of faces, a number of different views and expressions of each face can be combined in a single statistical model. Furthermore, since the HMM associates states with the quasi-stationary regions of its observation sequences, it offers a way of automatically locating and utilising the regions of a face which are important for identification. In subsequent sections, these regions will be referred to as features, and the ability of the various forms of HMM to extract plausible features is considered to be an important indicator of how good the model is.

HMM topology

As noted above, HMMs have been used extensively in speech, where data is naturally one-dimensional (1D) along the time axis. Images are two-dimensional (2D) and no equivalent of 1D HMMs exists for 2D signals, where a fully connected 2D HMM would lead to an NP-complete problem26. We have experimented with various sampling techniques to convert image data into a suitable 1D sequence O.

In general, the states of a HMM can be arbitrarily connected, allowing it to represent ergodic signals. However, for pattern recognition applications, it is usually better to impose some constraints on the allowed state transitions to reflect known properties of the data. In particular, so-called left-right HMM topologies are often employed, which have the property that the state index must monotonically increase when progressing through the observation sequence. For faces, the natural order is to traverse the face from top to bottom and hence, top-bottom is a more natural designation than left-right. In this section, ergodic and top-bottom HMMs are compared.
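The constraint can be pictured as a banded transition matrix. The fragment below (with invented probabilities) contrasts an ergodic matrix, where every a_ij may be non-zero, with a top-bottom one, where each state can only loop to itself or move to the next state down:

```python
import numpy as np

N = 5  # number of emitting states

# Ergodic topology: any transition allowed (rows must just sum to one).
ergodic = np.full((N, N), 1.0 / N)

# Top-bottom topology: only the diagonal (self-loop) and the first
# superdiagonal (move to the next state down) are non-zero, so the state
# index increases monotonically through the observation sequence.
top_bottom = np.zeros((N, N))
for i in range(N - 1):
    top_bottom[i, i] = 0.6       # invented self-loop probability
    top_bottom[i, i + 1] = 0.4   # invented move-down probability
top_bottom[N - 1, N - 1] = 1.0   # last state absorbs the remaining observations

print(np.count_nonzero(top_bottom))   # 9 free transitions instead of 25
```

Fewer free transition parameters also means less training data is needed to estimate them reliably.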

A simple ergodic HMM can be built using the sampling technique shown in Figure 1.

Figure 1 Sampling technique for ergodic HMM

A total of 10 images (256 x 256 pixel, 8-bit grey level) are used for training. Sampling is carried out using a 64 x 64 pixel window sliding left to right with steps of 48 pixels. When the right margin of the image is reached, the window slides vertically down 48 pixels and starts sampling again right to left, and so on. An 8-state model is used, since approximately eight distinct regions seem to appear in the face image (eyes, mouth, forehead, hair, background, shoulders and two extra states for boundary regions). Figure 2 shows the training images and the mean of the Gaussian distribution computed for each HMM state. Some states, such as the eye, the neck and the background, are recognizable. To build this model, no use of structural information (i.e. the fact that the image contains a face) has been made.
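The scan just described might be implemented as follows. This is a hypothetical reconstruction from the description above (64 x 64 window, 48-pixel steps, alternating scan direction), not the authors' code:

```python
import numpy as np

def boustrophedon_blocks(image, win=64, step=48):
    """Slide a win x win window left-to-right across the top of the image,
    drop down `step` pixels, scan right-to-left, and so on; each window
    position yields one observation vector (the flattened pixel block)."""
    H, W = image.shape
    obs = []
    for r, top in enumerate(range(0, H - win + 1, step)):
        cols = list(range(0, W - win + 1, step))
        if r % 2 == 1:
            cols.reverse()               # alternate the scan direction
        for left in cols:
            obs.append(image[top:top + win, left:left + win].ravel())
    return np.array(obs)

face = np.zeros((256, 256), dtype=np.uint8)   # stand-in for a grey-level image
O = boustrophedon_blocks(face)
print(O.shape)                                # (25, 4096): 5 x 5 windows of 64*64 pixels
```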

been made. Different feature extraction can be obtained using the

sampling technique shown in Figure 3. In this case, an observation sequence 0 is generated from an Xx Y image using an Xx L sampling window with X x A4

pixels overlap. Each observation vector is a block of L

lines. There is an M-line overlap between successive

observations. Assuming that each face is in an upright, frontal position. features will occur in a predictable

order, i.e. forehead, then eyes, then nose. and so on. This ordering suggests the use of a top-bottom (non- ergodic) model, where only transitions between adjacent states in a top-to-bottom manner will be allowed. Figure 4 shows the model for a 5-state HMM, with the expected facial regions as shown. The features extracted

with this mode1 can be observed in Figure 5, where five images (I 84 x 224 pixels, X-bit grey level) are used to train a top-bottom HMM. Each training image is sampled by a block of 16 lines moving down in steps of four lines (i.e. with a 12-line overlap). The overlapping

allows the features to be captured in a manner which is independent of vertical position. where a disjoint partitioning of the image could result in the truncation of features occurring across block boundaries. Sampling and parameterization are discussed in more detail in the next section. Each of the five means of the distributions calculated by the HMM represent a,/kwtuw hand. These feature bands are the values of p, obtained by the Baum-Welch method to locally maximize the prob- ability of observing the training data, given the model.

We can subjectively associate the feature bands with features as understood by humans. For example. the second band seems to contain the eyes, which represent one of the salient features used for identification by humans”. The feature bands correspond to the facial regions which were intuitively predicted in Figure 4. Due to the overlap, each state leaks into the state following

Image and Vision Computing Volume 12 Number 8 October 1994

HMM-based architecture for face identification: F Samaria and S Young

Figure 3 Sampling technique for top-to-bottom HMM

Figure 4 Top-to-bottom 5-state HMM

as the first few observations of each state contain pixels already seen in the previous state.
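A sketch of this sampling scheme, written directly from the X, Y, L, M description above (hypothetical code, not the authors' implementation):

```python
import numpy as np

def sample_top_bottom(image, L=16, M=12):
    """Generate an observation sequence O from an X x Y image using full-width
    X x L line blocks with an M-line overlap, i.e. the window moves down
    L - M lines between successive observations."""
    Y, X = image.shape                   # Y lines, each X pixels wide
    step = L - M
    return np.array([image[top:top + L, :].ravel()
                     for top in range(0, Y - L + 1, step)])

face = np.zeros((224, 184), dtype=np.uint8)   # a 184 x 224 image as above
O = sample_top_bottom(face)                   # each o_t is a block of 16 lines
print(O.shape)                                # (53, 2944)
```

With L = 16 and M = 12 the window advances four lines per step, so neighbouring observations share 12 lines, which is what lets features straddle block boundaries without being truncated.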

Sampling parameters

In the previous section, the benefits of using a top-bottom model have been outlined. The results presented above make use of a model that parameterizes the data as shown in Figure 3 using X = 184, Y = 224, L = 16, M = 12. In this section, alternative parameterizations for the top-bottom model will be investigated. The segmentation of the training data which is used to generate the feature bands is shown and used to assess the parameterization.

Figure 6 summarizes the experimental results obtained with various parameters. The values of X, Y, L and M refer to the sampling technique of Figure 3. Experiments (a) to (e) do not use overlap and the window height varies from 1 to 16 lines. For small height values the data segmentation does not correspond to intuition, and the feature bands do not reveal facial features. As the height of the sampling window increases, we note that the segmentation results are closer to those predicted in Figure 4. For experiment (e), the sampling window is sufficiently large to contain distinguishable features. With no overlap, however, as the window size increases there is a higher probability of cutting across features. Experiment (f) shows the results obtained with the model described in the previous section. A 12-line overlap is allowed to avoid feature truncation. The segmentation seems accurate and the feature bands seem to match what we would intuitively expect.

EXPERIMENTAL EVALUATION

This section presents experimental results obtained using a database of facial images for 24 different subjects. The method of training each HMM is described, and the results of identification experiments on 50 test images are reported. For comparison, results using the Eigenface method9 are also given.

Training

A training set comprising five images (92 x 112 pixel, 8-bit grey levels) of 24 different subjects is used to train 24 HMMs. Each image is sampled into an observation sequence using a block of eight lines, moving vertically two lines at a time (i.e. with a six-line overlap). Using the parameters of Figure 3, we have X = 92, Y = 112, L = 8, M = 6. For each subject, five such sequences (one per training image) are generated and used to train a 5-state top-to-bottom HMM.

The training and testing are carried out using the HTK: Hidden Markov Model Toolkit V1.3 developed by the Cambridge University Engineering Department18. The training process consists of the following steps:

1. Create a prototype HMM model. The prototype serves the purpose of specifying the number of states in the HMM, the state transitions allowed and the size of the observation sequence vectors.
2. Compute iteratively a set of initial parameter values using the training data. On the first cycle, the data is uniformly segmented and matched with each model state. On successive cycles, the uniform segmentation is replaced by Viterbi alignment.
3. Re-estimate the parameters using the Baum-Welch method. The model parameters are adjusted so as to locally maximize the probability of observing the training data, given each corresponding model.
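The first-cycle initialization can be pictured as follows. This is a simplified, hypothetical version of the step, not HTK's actual code: each training sequence is uniformly segmented across the states, and each state's Gaussian mean and variance are estimated from the observations falling in its segment (later cycles would re-segment via Viterbi alignment, omitted here):

```python
import numpy as np

def uniform_init(sequences, n_states):
    """Uniformly segment each (T, dim) sequence into n_states pieces and pool
    piece j of every sequence to estimate state j's mean and diagonal variance."""
    pooled = [[] for _ in range(n_states)]
    for seq in sequences:
        bounds = np.linspace(0, len(seq), n_states + 1).astype(int)
        for j in range(n_states):
            pooled[j].extend(seq[bounds[j]:bounds[j + 1]])
    means = np.array([np.mean(p, axis=0) for p in pooled])
    variances = np.array([np.var(p, axis=0) for p in pooled])
    return means, variances

# Five invented 1D "observation sequences" standing in for one subject's images.
seqs = [np.linspace(0.0, 4.0, 20).reshape(-1, 1) for _ in range(5)]
means, variances = uniform_init(seqs, n_states=5)
print(means.ravel())    # state means increase down the sequence
```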

Figure 7 shows the training data for the 24 subjects and the feature bands obtained after training for each of them.


Figure 6 Data segmentation and feature bands for different sampling parameters

Identification experiments

In this section, the identification rates obtained with the HMM-based and the Eigenface approach are compared. Different success rates are reported in the literature and it is often difficult to compare different methods19. For the experiments reported here, a set of 50 test images that were not part of the training set are collected to test the two approaches. The images are captured at different times, and contain faces with different facial expressions, facial details (with and without glasses) and lighting. Small changes in orientation of the head are allowed.

Identification using the HMM-based approach is carried out by matching the test image against each of the 24 trained models. To do this, the test image is converted to an observation sequence O_test, and the model likelihoods P(O_test | λ^(k)) are computed for each face model λ^(k). The model with the highest likelihood reveals the identity of the unknown face. To give an impression of the computation involved in using this method, it takes approximately ten minutes to load the 24 trained models and the 50 test images from a file into memory, and to carry out the identification using a Sun Sparc II workstation.

Figure 8 shows the test images segmented using the best matching HMM. Figure 9 shows the identification results. The crossed images are the ones which are misclassified. The overall identification performance is 84%, with eight images being incorrectly identified.

Tests are carried out on the same data using eigenfaces. With this approach, the number of dimensions is reduced to a feature space with a smaller set of orthogonal basis vectors found using principal component analysis on the training images. A full account of this method can be found elsewhere9,20. Figure 10 shows the projection of the 50 test images onto the feature space. The best match is found by looking for the nearest training image in the feature space. Figure 11 shows the identification results using this method. The crossed images are the ones which are misclassified. The overall identification performance is 74%, with 13 images being incorrectly identified.
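For reference, the eigenface comparison can be sketched as below, on invented random data in place of real face images (PCA of the mean-centred training set via SVD, then nearest-neighbour matching in the projected space):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in data: 24 "training faces" of 92 * 112 = 10304 pixels each.
train = rng.normal(size=(24, 10304))
mean_face = train.mean(axis=0)

# Principal component analysis via SVD of the mean-centred training matrix;
# the rows of Vt are the orthogonal basis vectors (the "eigenfaces").
_, _, Vt = np.linalg.svd(train - mean_face, full_matrices=False)
basis = Vt[:10]                       # keep a small set of basis vectors

def project(img):
    """Coordinates of an image in the reduced feature space."""
    return basis @ (img - mean_face)

train_coords = np.array([project(t) for t in train])

def identify(test_img):
    """Best match = nearest training image in the feature space."""
    dists = np.linalg.norm(train_coords - project(test_img), axis=1)
    return int(np.argmin(dists))

# A slightly perturbed copy of training face 7 should match subject 7.
noisy = train[7] + 0.01 * rng.normal(size=10304)
print(identify(noisy))   # 7
```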

CONCLUSIONS

In this paper, an approach to the problem of face identification based on HMMs has been described. The approach provides a way of automatically segmenting face images and extracting useful features for identification. Other methods for feature extraction, such as neural networks21 and deformable templates22, have usually required considerable initial guidance. Successful identification results have been obtained using the HMM-based method with relatively few constraints on the test data. Small orientation changes, non-homogeneous lighting and local feature variation (with and without glasses, smiling and non-smiling, open and closed eyes) are accommodated satisfactorily. A comparison has been presented with the eigenface approach, and using the given test data, the HMM-based method is superior.

This initial work with HMMs on images appears to be promising, and HMM-based methods are already being used for other applications23. Future work will concentrate on investigating alternative domains of representation to improve identification. For example, Fourier transformed images will be considered, as there is evidence that frequency and frequency/space representations may lead to better data separation24. We will also investigate the possibility of making better use of 2D dependencies by using a pseudo-2D HMM structure25. This approach may yield more accurate feature location through the vertical segmentation of the various facial regions.

ACKNOWLEDGEMENTS

This work is supported by a Trinity College Internal Graduate Studentship and an Olivetti Research Limited CASE award. Their support is gratefully acknowledged. We also wish to thank many of our colleagues for ideas discussed together: Andy Hopper and Andy Harter of Olivetti Research, Gabor Megyesi of the Pure Mathematics Department, Gavin Stark of the Computer Laboratory, Tat Jen Cham of the Engineering Department, Barney Pell of NASA Ames Research Centre and Junji Yamato of NTT Research in Yokosuka. Finally, the authors wish to remember the late Professor Frank Fallside.

Figure 7 Training data and identified state means
Figure 8 Test data segmentation using HMMs
Figure 9 Test results for HMM-based approach; crosses indicate misclassified images
Figure 10 Test images projected onto feature space
Figure 11 Test results for eigenface approach; crosses indicate misclassified images

REFERENCES

1. Bruce, V (ed) Face Recognition, Lawrence Erlbaum, NJ (1991)
2. Davies, G, Ellis, H and Shepherd, J (eds) Perceiving and Remembering Faces, Academic Press, New York (1981)
3. Diamond, R and Carey, S 'Why faces are and are not special: an effect of expertise', J. Exper. Psychol. General, Vol 115 No 2 (1986) pp 107-117
4. Yin, R K 'Face recognition by brain-injured patients: a dissociable ability?', Neuropsychologia, Vol 8 (1970) pp 395-402
5. Perrett, D I, Rolls, E T and Caan, W 'Visual neurones responsive to faces in the monkey temporal cortex', Exper. Brain Res., Vol 47 (1982) pp 329-342
6. Gallery, R and Trew, T I P 'An architecture for face classification', IEEE Colloquium on 'Machine Storage and Recognition of Faces', Digest No: 1992/017, Vol 2 (1992) pp 1-5
7. Goldstein, A J, Harmon, L D and Lesk, A B 'Identification of human faces', Proc. IEEE, Vol 59 No 5 (1971) pp 748-760
8. Samal, A and Iyengar, P A 'Automatic recognition and analysis of human faces and facial expressions: a survey', Patt. Recogn., Vol 25 No 1 (1992) pp 65-77
9. Turk, M and Pentland, A 'Eigenfaces for recognition', J. Cogn. Neurosci., Vol 3 No 1 (1991) pp 71-86
10. Samaria, F 'Face segmentation for identification using hidden Markov models', Br. Machine Vision Conf., BMVA Press (1993)
11. Rabiner, L R 'A tutorial on hidden Markov models and selected applications in speech recognition', Proc. IEEE, Vol 77 No 2 (1989) pp 257-286
12. Rabiner, L R and Juang, B-H Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ (1993)
13. Young, S J The HTK Hidden Markov Model Toolkit: Design and Philosophy, Technical Report TR.152, Cambridge University Engineering Department (1993)
14. Baum, L E 'An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes', Inequalities, Vol III (1972) pp 1-8
15. Baum, L E and Sell, G R 'Growth transformations for functions on manifolds', Pacific J. Math., Vol 27 No 2 (1968) pp 211-227
16. Forney, G D 'The Viterbi algorithm', Proc. IEEE, Vol 61 No 3 (March 1973) pp 268-278
17. Shepherd, J, Davies, G and Ellis, H 'Studies of cue saliency', in G Davies, H Ellis and J Shepherd (eds), Perceiving and Remembering Faces, Academic Press, New York (1981) pp 105-131
18. Young, S J HTK: Hidden Markov Model Toolkit V1.3, Reference Manual, Cambridge University Engineering Department (1992)
19. Robertson, G and Craw, I 'Testing face recognition systems', Br. Machine Vision Conf., BMVA Press (1993)
20. Kirby, M and Sirovich, L 'Application of the Karhunen-Loeve procedure for the characterisation of human faces', IEEE Trans. PAMI, Vol 12 No 1 (1990) pp 103-108
21. Hutchinson, R A and Welsh, W J 'Comparison of neural networks and conventional techniques for feature location in facial images', IEEE Int. Conf. Artif. Neural Networks, IEEE Press, New York (1989)
22. Yuille, A L, Hallinan, P W and Cohen, D S 'Feature extraction from faces using deformable templates', Int. J. Comput. Vision, Vol 8 No 2 (1992) pp 99-111
23. Yamato, J, Ohya, J and Ishii, K 'Recognizing human action in time-sequential images using hidden Markov model', Proc. CVPR (1992) pp 379-385
24. Wechsler, H Computational Vision, Academic Press, San Diego, CA (1990)
25. Kuo, S and Agazzi, O E 'Machine vision for keyword spotting using pseudo 2D hidden Markov models', Proc. ICASSP, Vol V (1993) pp 81-84
26. Levin, E and Pieraccini, R 'Dynamic planar warping for optical character recognition', Proc. ICASSP, Vol III (1992) pp 149-152
