
Separation of texture and shape in images of faces for image coding and synthesis

Thomas Vetter and Nikolaus F. Troje

Max-Planck-Institute for Biological Cybernetics, Spemannstrasse 38, D-72076 Tübingen, Germany

Received August 1, 1996; revised manuscript received January 10, 1997; accepted February 24, 1997

Human faces differ in shape and texture. Image representations based on this separation of shape and texture information have been reported by several authors [for a review, see Science 272, 1905 (1996)]. We investigate such a representation of human faces based on a separation of texture and two-dimensional shape information. Texture and shape were separated by use of pixel-by-pixel correspondence among the various images, which was established through algorithms known from optical flow computation. We demonstrate the improvement of the proposed representation over well-established pixel-based techniques in terms of coding efficiency and in terms of the ability to generalize to new images of faces. The evaluation is performed by calculating different distance measures between the original image and its reconstruction and by measuring the time that human subjects need to discriminate them. © 1997 Optical Society of America [S0740-3232(97)00909-5]

1. INTRODUCTION

Human language is organized in terms of hierarchical categories that group similar objects of the world into object classes. A natural description of an object typically assumes a priori knowledge about the object class per se and evaluates only its particularities and deviations with respect to a prototype. This comparison contains differences in surface properties as well as in the spatial arrangement of object features.

For human faces, most attributes used in such a description are continuous. Faces can have lighter or darker skin, larger or smaller eyes, a wider or a narrower mouth. Other image attributes that do not belong directly to the face itself, such as the illumination or the viewpoint from which the face is seen, are also physically continuous and are perceived and described as such.

Some of the above attributes, e.g., the skin color and the illumination, correspond to continuous intensity changes within parts of the image. Others, such as the size of the eyes and the shape of the mouth, correspond to changes in the spatial arrangement among the parts in the image. They have to be described as image distortions rather than intensity changes.

In most cases it is easy to classify an attribute as changing either the surface properties only or the spatial arrangement of the features only. We refer to the information contained in surface properties as the texture of the face and the information contained in the spatial arrangement as the shape of the face (since we are working with images, we mean two-dimensional shape).

Image representations separating shape and texture information have been introduced by several authors (for a review, see Ref. 1). They differ in principle from approaches that do not rely on information about the feature-by-feature correspondence and consequently do not separate shape and texture information (for a review, see Ref. 2).

In this paper we argue that a representation that separates the information contained in the image of a face into its texture and shape components has several advantages for modeling and for efficient coding. We will present an algorithm that separately represents texture and shape. We will then evaluate the quality of low-dimensional reconstructions made with this representation by comparing it with the quality of low-dimensional reconstructions in a pixel-based image space.

We use the term "pixel-based representations" for representations that are based on a code in which a digitized image of size s = n × m is described by a vector of length s by simply concatenating all the intensity values in the image. Low-dimensional representations can then be achieved by performing a Karhunen–Loeve expansion and using subspaces spanned by only the first principal components.3 Such representations were introduced by Sirovich and Kirby4 and have been applied successfully to many different tasks, such as face recognition5–7 and gender classification.8 O'Toole et al.9 have used such a representation to model the "other race effect," a well-known psychophysical phenomenon.10
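As an illustration of this pixel-based coding, the following sketch (hypothetical NumPy code, not from the original paper) stacks gray-level images into vectors, computes principal components by an SVD of the centered data, which is equivalent to an eigendecomposition of the pixel covariance matrix, and reconstructs a face from its first k coefficients; all function and variable names are our own.

```python
import numpy as np

def pca_basis(images):
    """images: array of shape (num_faces, h, w) of gray values.
    Returns the mean vector and the principal components (one per row),
    obtained via SVD of the centered data matrix."""
    X = images.reshape(len(images), -1).astype(float)   # each image -> vector of length s = n*m
    mean = X.mean(axis=0)
    U, svals, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt                                     # rows of Vt are the principal components

def reconstruct(image, mean, components, k):
    """Project a single image onto the first k principal components
    and return the low-dimensional reconstruction as an image."""
    x = image.reshape(-1).astype(float) - mean
    coeffs = components[:k] @ x                         # k projection coefficients
    return (mean + coeffs @ components[:k]).reshape(image.shape)
```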

As was pointed out by several authors,1,11,12 pixel-based face representations have unpleasant properties. An important property of a linear space is the existence of an addition and a scalar multiplication defining linear combinations of existing objects. All such linear combinations are objects of the space. In a pixel-based representation, this is typically not the case. One of the simplest linear combinations, the mean of two faces, will in general not result in a single intermediate face but will appear as two superimposed images. Any linear combination of a larger set of faces will appear blurry. The set of faces is not closed under addition. One can reduce these disadvantages by carefully standardizing the faces in the images, for instance, by providing for a common position of the eyes (see, e.g., Ref. 13). However, even after such a standardization step the matching error can still be large, yielding fuzzy and imprecise images.

Aligning the eyes of two faces requires only translation and scaling operations. For an alignment of all features in a face, such linear image operations are not enough. Nonlinear deformations have to be used to change the spatial arrangement of the different features. The spatial arrangement, however, might be an important part of the character of a face, and changing it would mean changing appearance and identity. Aligning faces feature by feature leads to a shape-free11 representation, which is deprived of an important part of the information contained in the face. On the other hand, a space spanned by shape-free faces (at least if enough features are aligned) is a proper linear space, with the set of faces being convex in the sense that any point between two faces will result in a sharp face again. The mean of two shape-free faces will no longer be qualitatively distinguishable from the original shape-free faces.

The shape that has been eliminated in the shape-free representation can well be described in terms of the nonlinear deformation that maps a given face onto a common face. This common face serves as a prototype and defines the origin of the resulting shape space. The shape-free representation together with the shape itself contains all the information that had originally been in the image. A representation of the shape in terms of the deformation field with respect to a common prototype is also convex. The mean of two deformation fields is a valid deformation field.

The crucial step in defining the deformation fields between different images is to establish feature-by-feature correspondence between them. For this reason we call this kind of representation a correspondence-based representation, in contrast to pixel-based representations, which are based on the unprocessed images. As mentioned above, we use the terms texture and shape for the two parts of the correspondence-based representation.

During the past few years various researchers have worked with correspondence-based representations of human faces.1,11,12,14–17 The features used for establishing correspondence span the whole range from semantically meaningful features, such as the corners of the eyes and the mouth, to pixel-level features that are defined by the local gray-level structure of the image. Most authors have established correspondence by hand selecting a limited set of features in the faces. Beymer et al.18 solved the correspondence problem by using an adapted optical flow algorithm that established correspondence on the single-pixel level.

The purposes for using a correspondence-based face space vary. Beymer and Poggio1 used this representation to train a regularization network to learn image deformations produced by changes in expression and pose. Craw and Cameron11 concentrated on artificial face recognition and compared recognition systems, based on principal-component analysis (PCA), that used either pixel-based representations or correspondence-based representations. Their results clearly showed the advantage of a correspondence-based representation in a recognition task. Hancock et al.17 showed that principal components derived from a correspondence-based representation better reflect psychophysical ratings of distinctiveness and memorability than do principal components derived from a pixel-based representation.

In this paper we investigate the advantages of a correspondence-based representation for modeling, its coding efficiency, and its generalizability. How much do we gain in terms of reconstruction quality when using a correspondence-based representation instead of a pixel-based representation?

Coding efficiency and generalizability were evaluated in a cross-validation experiment. Faces from a test set were coded with the faces from a training set. Reconstructions were thus obtained by projecting test faces into spaces spanned by different numbers of principal components obtained from a set of training faces. As Kirby and Sirovich13 pointed out, this procedure allows us to test the coding abilities of the representation better than we could by evaluating only reconstructions of faces that were already used to calculate the principal components. In the latter case, using all the principal components will always result in a perfect reconstruction, irrespective of the number of faces used for the calculation. Only if new faces are used will we be able to use the relation between the reconstruction quality and the number of faces that were used to span the space to draw conclusions about the dimensionality of the set of faces. The quality of the reconstructions will be evaluated by calculation of different distance measures between the original image and its reconstruction and by means of a psychophysical experiment.

The paper is organized as follows. First, a PCA is performed separately on the shape and the texture information as well as on the images themselves. All technical details of the implementation used to separate shape and texture in images of human faces are described in Appendix A. Following the theoretical evaluation of the representations, we describe a psychophysical experiment that evaluates the quality of the reconstructions. Finally, the main properties and possible future extensions of the representation are discussed.

2. PRINCIPAL-COMPONENT ANALYSIS, RECONSTRUCTIONS, AND RECONSTRUCTION ERRORS

A. Separation of Texture and Shape in Images of Faces

The central part of the approach is a representation of face images that consists of a separate texture vector and two-dimensional (2D) shape vector, each with components referring to equivalent feature points. In our approach we treat any single pixel as a "feature" and establish pixel-by-pixel correspondence between a given image of a face and a reference image. All images of the training set are mapped onto a common reference face. The correspondences were computed automatically with a gradient-based optical flow technique that had already been used successfully on face images.18,19 Technical details can be found in Appendix A.

Assuming a pixel-by-pixel correspondence to a reference face, a given example image can be represented as follows: Its 2D shape is coded as the deformation field that has to be applied to the reference image to match the example image. This deformation field is defined at each single pixel. So the shape of a face image is represented by a vector S = (Δx1, Δy1, Δx2, Δy2, ..., Δxn, Δyn)^T, that is, by the Δx, Δy displacement of each pixel with respect to the corresponding pixel in the reference face. The texture is coded as a difference map between the image intensities of the exemplar face and its corresponding intensities in the reference face. This normalized texture can be written as a vector T = (ΔI1, ΔI2, ..., ΔIn)^T, which contains the image-intensity differences ΔI of the n pixels of the image (Fig. 1).
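A minimal sketch of how these two vectors could be assembled, assuming a dense flow field (dx, dy) from the reference to the example and a texture map of the example already warped into the reference frame; the function and variable names are ours, not the paper's.

```python
import numpy as np

def shape_and_texture_vectors(flow_dx, flow_dy, warped_example, reference):
    """flow_dx, flow_dy: per-pixel displacements (h, w) mapping reference pixels
    to their corresponding positions in the example image.
    warped_example: example intensities resampled at those positions (h, w).
    reference: reference face image (h, w).
    Returns the shape vector S = (dx1, dy1, ..., dxn, dyn) and the
    texture vector T = (dI1, ..., dIn) of intensity differences."""
    S = np.stack([flow_dx.ravel(), flow_dy.ravel()], axis=1).ravel()
    T = (warped_example.astype(float) - reference.astype(float)).ravel()
    return S, T
```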

B. Linear Analysis of Texture, Shape, and Pixel-Represented Images

We performed a PCA separately on both the texture and the shape part of the correspondence-based representation. In addition, we calculated the principal components from the images themselves (i.e., on the pixel-based representation). The database of images contained 100 faces. For details, see Appendix A.

Principal components were obtained by calculating the eigenvectors of the covariance matrix of the data. Figures 2 and 3 show variations of the shapes and the textures along the first four principal components of the two subspaces. To the average face we added the respective normalized principal component with weights corresponding to two, four, and six standard deviations in both directions. The center row (s = 0) in both Figs. 2 and 3 always shows the same image: the average face consisting of the average texture and the average shape.
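A sketch of how such axis walks could be generated, applied to shape or texture vectors; the scaling by the square root of the eigenvalue (the standard deviation along a component) and all names are our own assumptions, not taken from the paper.

```python
import numpy as np

def faces_along_component(vectors, component_index, sds=(-6, -4, -2, 0, 2, 4, 6)):
    """vectors: (num_faces, d) shape or texture vectors.
    Returns, for each value s in sds, the vector mean + s * sigma * e_i,
    where e_i is the i-th unit principal component and sigma is the
    standard deviation of the data along that component."""
    X = vectors.astype(float)
    mean = X.mean(axis=0)
    U, svals, Vt = np.linalg.svd(X - mean, full_matrices=False)
    sigma = svals[component_index] / np.sqrt(len(X) - 1)   # std. deviation along the axis
    e = Vt[component_index]                                 # normalized principal component
    return [mean + s * sigma * e for s in sds]
```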

Although six standard deviations far exceeds the range of naturally occurring images, even the most extreme faces still look rather normal. The first few principal components clearly mark important characteristics. The first component of the shape shows a size effect that seems to correspond to the perceived gender. The second component captures the size of the forehead. Component 3 accounts for rotations around the horizontal axis in the image plane. The fourth component shows the transition from a narrow head to a wide head and thus accounts for the aspect ratio of the face.

Fig. 1. Example of a face image (left) mapped onto the reference face (center) by pixel-by-pixel correspondence established through an optical flow algorithm. This algorithm separates the 2D shape information captured in the correspondence field (lower right) from the texture information captured in the texture mapped onto the reference face (upper right).

The first principal component of the texture clearly captures the variability in illumination that is still present in our database. The light source is moving from below to above and changes in its intensity. The second texture component accounts for facial hair. Although the database contained no people with beards, some males were better shaved than others. Extrapolating six standard deviations away from the mean recovers the beard. This correlates with the strength of the eyebrows. People with pronounced eyebrows also have a more pronounced beard growth. However, variations of the strength of the eyebrows also occur in women and in carefully shaved men. They show up in the third and fourth principal components. The higher components (not shown) are not so clearly characterizable. Many of them code for very small changes in the intensity distribution in the eyes.

Fig. 2. Images along the first four principal components of the shape. The coefficients for the respective axes have values corresponding to two, four, and six standard deviations away from the mean face in both directions. All other coefficients (including the ones coding for the texture) are set at zero.

Figure 4 illustrates the first four principal components as calculated from the images themselves. To make them comparable with the principal components derived from the correspondence-based representation, we also show the images corresponding to locations that were two, four, or six standard deviations away from the mean. The images are more blurry, in particular in the center of the space. The faces in the periphery carry strange features, such as the white ghost mouth in component 2 and the white outline in component 3.

Fig. 3. Images along the first four principal components of the texture. The coefficients for the respective axes have values corresponding to two, four, and six standard deviations away from the mean face in both directions. All other coefficients (including the ones coding for the shape) are set at zero.

The first principal component in the pixel-based representation captures mainly illumination differences. The second component captures the positions of the mouth and the ears. The third component captures the size of the face, which, as in the first component of the shape subspace, goes along with a shift in perceived gender. Note that this size change occurs here in the form of adding or subtracting a ring around the face. If too much of the ring is added, the ring becomes unnaturally white. The fourth component seems to account for left–right illumination changes. However, this impression is misleading. The database did not contain any horizontal variability in the direction of the light source. The fourth principal component rather accounts for a slight rotation around a vertical axis. In the pixel-based representation this means that a little bit of image intensity has to be added at the left side of the face if it is turned to the right side and vice versa. If too much is added (six standard deviations away from the mean), a white edge results that is then interpreted as being due to illumination.

Fig. 4. Images along the first four principal components derived from the pixel-based representation. The coefficients for the respective axes have values corresponding to two, four, and six standard deviations away from the mean face in both directions. All other coefficients are set at zero.

C. Reconstructions and Reconstruction Errors

PCA yields an orthogonal basis with the axes ordered by means of their overall variance. In Fig. 5, 1 minus the relative cumulative variance covered by the first k principal components is plotted for the three PCA's described in Subsection 2.B. The relative cumulative variances were calculated by successively summing up the first k eigenvalues λ_i and dividing them by the sum of all eigenvalues:

training error_k = 1 - (Σ_{i=1}^{k} λ_i) / (Σ_{i=1}^{n} λ_i).   (1)

Note that this term is equivalent to the expected value for the mean squared distance between the reconstruction X_k and the original image X divided by the overall variance σ²:

training error_k = 1 - (Σ_{i=1}^{k} λ_i) / (Σ_{i=1}^{n} λ_i) = [1 / (σ²(n - 1))] Σ_{n} (X_k - X)².   (2)

It is thus an appropriate measure for the reconstruction error. Since it refers to the set of faces that was used to construct the principal-component space from which the reconstructions were made, we call this kind of error the training error.

For a training error of 10% (i.e., to recover 90% of the overall variance), the first 47 principal components are needed in the pixel-based representation, 22 principal components are needed in the texture representation, and 15 are needed in the shape representation. Because the test face was contained in the set from which the principal components were derived, the training error approaches zero when all available principal components are used for the reconstruction.
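Given an eigenvalue spectrum, the number of components needed for a target training error follows directly from Eq. (1); the small helper below is our own illustration, not code from the paper.

```python
import numpy as np

def components_for_training_error(eigenvalues, target_error=0.10):
    """eigenvalues: PCA eigenvalues sorted in descending order.
    Returns the smallest k such that 1 - sum(first k) / sum(all) <= target_error,
    i.e., the number of principal components recovering (1 - target_error)
    of the overall variance."""
    lam = np.asarray(eigenvalues, dtype=float)
    cumulative = np.cumsum(lam) / lam.sum()        # relative cumulative variance
    return int(np.searchsorted(cumulative, 1.0 - target_error) + 1)
```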

To evaluate the generalizability of the representation to new faces, we performed a different computation. Using a leave-one-out procedure, we took one face out of the database and performed PCA on the remaining 99 faces, yielding 98 principal components. Then the single face was projected into various principal-component subspaces ranging from dimensionality k = 1 to k = 98 to yield the reconstruction X_k.
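A sketch of this leave-one-out evaluation under our own assumptions; in particular, the exact normalization of Eq. (3) used by the authors may differ in detail from the one shown here.

```python
import numpy as np

def loo_testing_error(vectors, k):
    """vectors: (num_faces, d) data in one representation (pixel, shape, or texture).
    For each face, builds a PCA space from the remaining faces, projects the
    held-out face onto the first k principal components, and returns the mean
    squared reconstruction error divided by the overall variance of the data."""
    X = np.asarray(vectors, dtype=float)
    total_var = X.var(axis=0, ddof=1).sum()                 # overall variance
    errors = []
    for i in range(len(X)):
        train = np.delete(X, i, axis=0)
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        coeffs = Vt[:k] @ (X[i] - mean)
        recon = mean + coeffs @ Vt[:k]
        errors.append(np.sum((recon - X[i]) ** 2))
    return np.mean(errors) / total_var
```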

Fig. 5. Training error: 1 minus the relative cumulative variance is plotted. The cumulative variance is equal to the mean of the squared Euclidian distance between the original face and reconstructions derived by truncating the principal-component expansion. The calculation was performed for the two parts of the correspondence-based representation and for the pixel-based representation.

In Fig. 6 the evaluation of this procedure is illustrated. The plot shows the generalization performance of the different representations in terms of the testing error. Much like the training error, the testing error is defined by the mean squared difference between reconstruction and original image divided by the variance σ² of the whole data set:

testing error_k = [1 / (nσ²)] Σ_{n} (X_k - X)².   (3)

The testing error when the pixel-based representation is used is never smaller than 28%, even if all 98 principal components are used for the reconstruction. A testing error of 28% is reached with only five principal components for the texture space and five principal components for the shape space. If all principal components are used, the testing error can be reduced to 6% for the shape and to 12% for the texture.

Fig. 6. Testing error (A). The relative mean-squared Euclidian distance between the original and its reconstructions. In this case the reconstruction was derived by projecting the data into spaces spanned by principal components computed from the set of remaining faces that did not contain the original face. The calculation was performed for the two parts of the correspondence-based representation and for the pixel-based representation.

Fig. 7. Testing error (B). As for the calculation of testing error A, the faces were projected into principal-component spaces derived from the remaining faces. The error for the pixel-based representation is the same as the one plotted in Fig. 6. The error corresponding to the correspondence-based representation is measured by the squared Euclidian distance in the pixel space after the reconstructed shape is combined with the reconstructed texture to yield an image (for details, see text).


A single image of a face can be used to code either one principal component in the pixel-based representation or one principal component of the shape subspace and one principal component of the texture subspace of the correspondence-based representation. Thus the information contained in five images is enough to code for 72% of the variance in a correspondence-based representation, whereas 98 images are needed in the pixel-based representation.

The reconstruction error in Figs. 5 and 6 was measured in terms of the squared Euclidian distance between reconstruction and original in the respective representation. To make the three distances comparable, we normalized them with respect to the overall variance of the database in the respective representation. The texture and the shape parts of the correspondence-based representation were treated separately.

Fig. 8. Examples of the different kinds of reconstruction used. We chose an example for which the reconstructions in the correspondence-based space are relatively poor to illustrate what kinds of error still occur. a, Original face. b, Reconstructed texture combined with the original shape. c, Reconstructed shape combined with the original texture. d, Reconstructed texture combined with reconstructed shape. e, Reconstruction with the pixel-based representation.

To compare the reconstruction qualities achieved with the pixel-based and with the correspondence-based representations directly, we combined reconstructed texture and reconstructed shape (see Appendix A) to yield a reconstructed image. The distance between this reconstruction and the corresponding original image can be measured by means of the squared Euclidian distance in the pixel-based image space, and thus in the same space and with the same metric as the reconstruction error of the pixel-based representations. Figure 7 shows the results of such a calculation. To reach a reconstruction error of 28%, the best that can be reached with 99 faces by using a pixel-based representation, only 12 principal components have to be used in the correspondence-based representation. If all principal components of the correspondence-based representation are used, a reconstruction error of 13% can be achieved.

Figure 8 shows examples of different kinds of reconstruction. All use the full set of available principal components in the respective representation. We picked out a bad example to illustrate the differences still present between the original and the different kinds of reconstruction. The majority of the faces are reconstructed so well that the slight differences could not be seen in the small reproductions of Fig. 8.

3. PSYCHOPHYSICAL EVALUATION OF THE RECONSTRUCTIONS

A. Purpose

All the distance measures described above are based on the Euclidian distance in the different face spaces used. These distances, however, might only approximately reflect the perceptual distance used by the human face recognition system. Consider, for instance, the fact that human sensitivity to differences between faces is not at all homogeneous within the whole image. Changes in the region of the eyes are more likely to be detected than changes of the same size (with respect to any of our distance measures) in the region of the ears. Since it seems to be difficult to formulate an image distance that exactly reflects human discrimination performance, we used human discrimination performance directly and evaluated the reconstruction quality by means of a psychophysical experiment.

In the experiment, subjects were presented with three images simultaneously on a computer screen. In the upper part of the screen, an original face from our database was shown. Below this target face, two further images were shown. One of them was again the same target face; the other was a reconstruction. The subject was told that one of the two lower images was identical to the upper one and was instructed to find out which one it was. We measured the time that the subjects needed for this task and the accuracy of their performance.

B. Methods

1. Stimuli

All the reconstructions used in this experiment were made by projecting faces into spaces spanned by the principal components derived from all the other faces in our database. We thus used the same leave-one-out procedure as described in the context of calculating the testing error (Subsection 2.C). Four different kinds of reconstruction were used. To investigate the reconstruction quality within the texture subspace, we combined reconstructed textures with the original shapes. Similarly, we showed images with reconstructed shape in combination with the original texture. The third kind of reconstruction was made from a combination of reconstructed shape and reconstructed texture. Finally, we made reconstructions with the principal components derived from the pixel-based representation. In any of the four reconstruction modes, reconstructions made with the first 5, with 15, and with all 98 principal components were shown. We chose these values because 5 and 15 principal components cover approximately one and two thirds, respectively, of the overall variance.

2. Design

A two-factor mixed block design was used. The first factor was a within-subject factor named QUALITY, which coded for the quality of the reconstruction. It had the levels REC05, REC15, and REC98, corresponding to reconstructions made by using 5, 15, or all 98 principal components, respectively. The second factor was a between-subjects factor named MODE, which had the four levels TEX, SHP, BTH, and PIX. TEX corresponds to trials that used images with only the texture reconstructed, SHP to trials with only the shape reconstructed, BTH to trials with reconstructed texture and shape, and PIX to trials with reconstructions in the pixel-based space (see also Fig. 8).

Twenty-four subjects were randomly divided into four groups, each assigned one of the levels of factor MODE. Each subject performed three blocks. Each block contained 100 trials that used only REC05, REC15, or REC98 reconstructions. The order of the blocks was completely counterbalanced: there are six possible permutations, and each of them was used once for one of the six subjects in each group. Each of the 100 faces was used exactly once in each block.
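A minimal sketch of how such a counterbalanced assignment could be generated (illustrative only; the authors do not describe their implementation, and all names are ours):

```python
from itertools import permutations

QUALITY_LEVELS = ["REC05", "REC15", "REC98"]   # within-subject factor
MODE_LEVELS = ["TEX", "SHP", "BTH", "PIX"]     # between-subjects factor

def block_orders():
    """The six possible orderings of the three quality blocks; each group of
    six subjects receives each ordering exactly once."""
    return list(permutations(QUALITY_LEVELS))

def assign_subjects():
    """Returns (subject, mode, block_order) triples for 24 subjects:
    four groups of six, one group per MODE level."""
    orders = block_orders()
    return [(6 * g + s, mode, orders[s])
            for g, mode in enumerate(MODE_LEVELS)
            for s in range(6)]
```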

3. Procedure

Each stimulus presentation was preceded by a fixation cross that was presented for 1 s. Then three images were simultaneously presented on the computer screen. Together they covered a visual angle of 12 deg. One of the two lower images was identical to the single upper image. The subject had to indicate which one by pressing either the left or the right arrow key on the keyboard. Subjects were instructed to respond "as accurately and as quickly as possible." The images were presented until the subject pressed one of the response keys.

C. Results

Figure 9 illustrates the results of this experiment. Accuracy is generally very high, as expressed by the low error rates (mean, 5.9%), and differences that are due to factor MODE do not reach significance [two-factor analysis of variance on the error rate, F(3, 20) = 1.49, p > 0.05]. In all four conditions of factor MODE, we found an increase in the error rate with the number of principal components used for the reconstruction [main effect of factor QUALITY: F(2, 40) = 14.05, p < 0.01].

The response times are affected strongly by both factor MODE [F(3, 20) = 10.9, p < 0.01] and factor QUALITY [F(2, 40) = 21.8, p < 0.01]. The mean response time needed for discrimination between an original image and its reconstruction in the pixel-based representation (condition PIX) was 606 ms. The mean response times in conditions TEX and SHP were 3488 and 3385 ms, respectively. In condition BTH the mean response time was 1872 ms. In all four conditions of factor MODE, response times increased with the number of principal components, although only slightly in condition PIX. Note that the time that the observer needed to distinguish the worst reconstruction in the correspondence-based representation (BTH, REC05) from the original is still almost twice the time needed for the best reconstruction in the pixel-based space (PIX, REC98).

Fig. 9. Psychophysical evaluation of the different kinds of reconstruction. (a) Error rates and (b) response times are plotted. TEX, reconstructed texture combined with original shape; SHP, reconstructed shape combined with original texture; BTH, reconstructed texture combined with reconstructed shape; PIX, reconstruction in the pixel-based space; REC05, reconstructions based on the first five principal components; REC15, reconstructions based on the first 15 principal components; REC98, reconstructions based on all 98 principal components.

4. DISCUSSION

The results clearly demonstrate an improvement in the coding efficiency and generalization to new face images of the correspondence-based image representation over pixel-based techniques previously proposed.4,5 The correspondence, here computed automatically through an optical flow algorithm, allows the separation of two-dimensional shape and texture information in images of human faces. The image of a face is represented by its projection coefficients in separate linear vector spaces for shape and texture. The improvement was demonstrated theoretically as well as in a psychophysical experiment.

The results of the different evaluations indicate the importance of the proposed representation for an efficient coding of face images. We have demonstrated the coding efficiency within a given set of images as well as the generalizability to new test images that were not in the data set from which the representations were originally obtained. In comparison with a pixel-based image representation, the number of principal components necessary for the same image quality is strongly reduced. The expected error for coding a new image is less than half of the one produced by use of a pixel-based representation.

Human observers could discriminate a reconstruction derived from the pixel-based representation from the original face much faster than a reconstruction derived from the correspondence-based representation. The image quality when a large number of basis faces are used is sufficient for recognition tasks and demonstrates the importance of the representation for image synthesis and computer graphics applications. The results from the psychophysical experiments are important, since it is well known that the Euclidian distance used to optimize the reconstructions as well as to compute the principal components does not, by itself, in general reflect perceived image distance.20

Clearly, the crucial step in the proposed technique is the computation of a dense correspondence field between images of faces seen from one viewpoint. The optical flow technique used on our data set worked well; however, for images obtained under less-controlled conditions a more sophisticated method for finding the correspondence might be necessary. New correspondence techniques based on active shape models21,22 are more robust against local occlusions and larger distortions when they are applied to a known object class. Their shape parameters are optimized actively to model the target image. This technique incorporates object class-specific knowledge directly into the correspondence computation step. Similarly, Hallinan23 demonstrated an improvement by using a low-dimensional, linear illumination model as prior knowledge.

In this paper we used a PCA for a parameterization of the space. Such a parameterization is optimal in the sense that the mean squared error introduced by truncating the expansion is at a minimum. The perceptual interpretation of single principal components, however, should not be overrated. A situation in which the direction of single axes is apparently arbitrary emerges when two or more eigenvalues have approximately the same size. But even without the occurrence of this situation, there is no reason to assume that the principal components correspond to semantically meaningful features. The benefit of PCA is rather that it provides an orthogonal basis for the face space with the axes ordered by means of their contribution to the overall variance. The variance is based on the Euclidian distance in the face space and does not necessarily reflect perceptual distance.

Additionally, it is not clear yet how much redundant information is kept in the principal components of shape and texture. The proposed image representation, separating shape and texture information, is not at all dependent on a PCA. A future reduction of this redundancy, based on a more extended example set of images, might lead to an even more efficient parameterization of images of faces.

The main result of this paper is that an image representation in terms of separated shape and texture is superior to a pixel-based image representation in every comparison that we performed. These results complement other findings in which a separate texture and shape representation of three-dimensional objects in general was used for visual learning1 and made possible the synthesis of novel views from a single image.18,19 Finally, on the basis of our psychophysical experiments, we suggest that the correspondence-based representation of faces is much closer to a human description of faces than is a pixel-by-pixel comparison of images, in which the spatial correspondence of features is ignored.

APPENDIX A

1. Images

Images of 100 Caucasian faces were available. The images were originally rendered for psychophysical experiments24 from a database of three-dimensional human head models recorded with a laser scanner (Cyberware). All faces had no makeup, accessories, or facial hair. Additionally, the head hair was removed digitally (but with manual editing) by means of a vertical cut behind the ears.

The images were rendered from a viewpoint 120 cm in front of the face and with ambient light only. There was still a little bit of shading present that resulted from the scanning procedure. The scanner uses approximately cylindrical illumination, and the resulting texture map contains some shadows below the chin. From each face, one frontal view was rendered. The resolution of the gray-level images was 256 × 256 pixels with 8 bits per pixel. The images were aligned to a common position at the tip of the nose.

2. Image Matching

The essential step in our approach is the computation of the correspondence between two images for every pixel location. That means that we have to find for every pixel in the first image, e.g., a pixel located on the nose, the corresponding pixel location on the nose in the other image. Since we controlled for illumination, and since all faces are compared in the same orientation, a strong similarity of the images can be assumed, and problems attributed to occlusions should be negligible. These conditions make feasible an automatic mechanism for comparing the images of the different faces.18 Algorithms for finding correspondence between similar images are known from optical flow computation, in which points have to be tracked from one frame of a series of images to another. We used a coarse-to-fine gradient-based optical flow algorithm25 applied to the Laplacians of the images, following an implementation described by Bergen and Hingorani.26


We computed the Laplacians of the images from the Gaussian pyramid, adopting the algorithm proposed by Burt and Adelson.27
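A brief sketch of this pyramid construction, assuming only that each Laplacian level is the difference between a Gaussian level and the upsampled next-coarser level; this is our simplified reading of the Burt and Adelson scheme, using SciPy for the filtering.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def laplacian_pyramid(image, levels=4, sigma=1.0):
    """Builds a Gaussian pyramid by blurring and downsampling by a factor of 2,
    then forms each Laplacian level as the difference between a Gaussian level
    and the upsampled, re-blurred version of the next-coarser level."""
    gaussians = [image.astype(float)]
    for _ in range(levels - 1):
        blurred = gaussian_filter(gaussians[-1], sigma)
        gaussians.append(blurred[::2, ::2])                  # downsample by 2
    laplacians = []
    for fine, coarse in zip(gaussians[:-1], gaussians[1:]):
        upsampled = zoom(coarse, 2, order=1)[: fine.shape[0], : fine.shape[1]]
        laplacians.append(fine - gaussian_filter(upsampled, sigma))
    laplacians.append(gaussians[-1])                         # coarsest residual
    return laplacians
```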

Beginning with the lowest level of a resolution pyramid, for every pixel (x_i) in the first image the error term

E = Σ_i (I_{x_i} Δx_i + I_{y_i} Δy_i - ΔI)²   (4)

was minimized for Δx and Δy, where I_x and I_y are the spatial derivatives of the Laplacians and ΔI is the difference of the Laplacians of the two images. The resulting vector field (Δx, Δy) was then smoothed, and the procedure was iterated through all levels of the resolution pyramid. The final resulting vector field was used as the correspondence pattern between the two images.
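A sketch of one level of such a gradient-based estimate, assuming the displacement is constant within a small window around each pixel and solving the resulting least-squares problem; this is a standard Lucas–Kanade-style step under our own assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def flow_step(lap1, lap2, window=5, eps=1e-6):
    """One estimation step on a single pyramid level.
    lap1, lap2: Laplacian-filtered images of the two faces (h, w).
    Minimizes sum_i (Ix*dx + Iy*dy - dI)^2 over a window around each pixel
    by solving the 2x2 normal equations; returns per-pixel (dx, dy)."""
    Iy, Ix = np.gradient(lap1)          # spatial derivatives of the Laplacian
    dI = lap2 - lap1                    # difference of the Laplacians
    # Windowed sums of the normal-equation terms.
    Sxx = uniform_filter(Ix * Ix, window)
    Sxy = uniform_filter(Ix * Iy, window)
    Syy = uniform_filter(Iy * Iy, window)
    Sxt = uniform_filter(Ix * dI, window)
    Syt = uniform_filter(Iy * dI, window)
    det = Sxx * Syy - Sxy * Sxy + eps
    dx = (Syy * Sxt - Sxy * Syt) / det
    dy = (Sxx * Syt - Sxy * Sxt) / det
    return dx, dy
```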

3. Reference Face

For a consistent representation, the correspondence fields between each face and a single reference face had to be computed. In theory, any face could be used as a reference. However, small peculiarities of a face can influence the automated matching process strongly. We thus used a synthetic face, namely, the average face of our database, as the reference face. This average could not be determined in a single step but had to be calculated in an iterative procedure: Starting from an arbitrary face as the preliminary reference, the correspondence algorithm established correspondence between all other faces and this reference. The correspondence fields were satisfactory for only some of the faces. We selected these faces, calculated the mean face out of them, and used it as the new reference. Repeating this procedure twice resulted in a face that was taken as the final reference face. Further iteration did not improve the correspondence fields.

4. Synthesis of an Image

To generate the original image from its correspondence-based representation, we had to shift each pixel in the texture map to the new location given by the shape vector. The new location generally did not coincide with the equally spaced grid of pixels on the destination image. A common solution to this problem is known as forward warping.28 For every new pixel we used the nearest three points to linearly approximate the pixel intensity. Not only can the images of the original faces be reproduced in this way, but also new, synthetic faces can be generated. New textures can be generated from any linear combination of already existing textures. The same can be done for the shapes. A new image can be generated by combining any texture with any shape. This is possible because both are given in the coordinates of the reference image.
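A simplified sketch of this synthesis step; for brevity it uses nearest-neighbor splatting rather than the three-point linear interpolation described above, and all names are ours.

```python
import numpy as np

def synthesize_image(reference, texture_vec, shape_vec):
    """reference: reference face image (h, w).
    texture_vec: intensity differences relative to the reference (length h*w).
    shape_vec: interleaved (dx, dy) displacements (length 2*h*w).
    Forward-warps the reference-frame texture to the positions given by the
    shape vector, splatting each pixel onto its nearest destination pixel."""
    h, w = reference.shape
    texture = reference.astype(float) + texture_vec.reshape(h, w)
    dx = shape_vec[0::2].reshape(h, w)
    dy = shape_vec[1::2].reshape(h, w)
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip(np.rint(xs + dx).astype(int), 0, w - 1)    # destination columns
    yt = np.clip(np.rint(ys + dy).astype(int), 0, h - 1)    # destination rows
    out = np.zeros((h, w))
    out[yt, xt] = texture                                    # nearest-neighbor splat
    return out

# New faces: any linear combination of existing texture vectors and any linear
# combination of existing shape vectors can be passed to synthesize_image.
```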

Address correspondence to Nikolaus F. Troje at Department of Psychology, Queen's University, Kingston, Ontario, Canada K7L 3N6.

REFERENCES

1. D. Beymer and T. Poggio, "Image representation for visual learning," Science 272, 1905–1909 (1996).
2. D. Valentin, H. Abdi, A. J. O'Toole, and G. W. Cottrell, "Connectionist models of face processing: a survey," Patt. Recog. 27, 1209–1230 (1994).

3. N. Ahmed and M. H. Goldstein, Orthogonal Transforms for Digital Signal Processing (Springer-Verlag, New York, 1975).

4. L. Sirovich and M. Kirby, "Low-dimensional procedure for the characterization of human faces," J. Opt. Soc. Am. A 4, 519–524 (1987).

5. M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cogn. Neurosci. 3, 71–86 (1991).

6. A. J. O'Toole, H. Abdi, K. A. Deffenbacher, and D. Valentin, "Low-dimensional representation of faces in higher dimensions of the face space," J. Opt. Soc. Am. A 10, 405–411 (1993).

7. H. Abdi, D. Valentin, B. Edelman, and A. J. O'Toole, "More about the difference between men and women: evidence from linear neural networks and the principal component approach," Perception 24, 539–562 (1995).

8. A. J. O'Toole, H. Abdi, K. A. Deffenbacher, and J. C. Bartlett, "Classifying faces by race and sex using an autoassociative memory trained for recognition," in Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society, K. J. Hammond and D. Gentner, eds. (Erlbaum, Hillsdale, N.J., 1991), pp. 847–851.

9. A. J. O'Toole, K. A. Deffenbacher, D. Valentin, and H. Abdi, "Structural aspects of face recognition and the other-race effect," Memory Cogn. 22, 208–224 (1994).

10. C. A. Feingold, "The influence of environment on identification of persons and things," J. Crim. Law Criminol. Police Sci. 5, 39–51 (1914).

11. I. Craw and P. Cameron, "Parameterizing images for recognition and reconstruction," in British Machine Vision Conference, P. Mowforth, ed. (Springer-Verlag, Berlin, 1991), pp. 367–370.

12. T. Vetter and N. F. Troje, "Separation of texture and two-dimensional shape in images of human faces," in Mustererkennung 1995, S. Posch, F. Kummert, and G. Sagerer, eds. (Springer-Verlag, Berlin, 1995), pp. 118–125.

13. M. Kirby and L. Sirovich, "Application of the Karhunen–Loeve procedure for the characterization of human faces," IEEE Trans. Pattern Anal. Mach. Intell. 12, 103–109 (1990).

14. D. I. Perrett, K. A. May, and S. Yoshikawa, "Facial shape and judgements of female attractiveness," Nature (London) 368, 239–242 (1994).

15. T. Vetter, "Synthesis of novel views from a single face image," Tech. Rep. 26 (Max-Planck-Institut für biologische Kybernetik, Tübingen, Germany, 1996).

16. N. Costen, I. Craw, G. Robertson, and S. Akamatsu, "Automatic face recognition: What representation?" in Computer Vision—ECCV'96, Lecture Notes in Computer Science 1064, B. Buxton and R. Cipolla, eds. (Springer-Verlag, Berlin, 1996), pp. 504–513.

17. P. J. B. Hancock, A. M. Burton, and V. Bruce, "Face processing: human perception and principal components," Memory Cogn. 24, 26–40 (1996).

18. D. Beymer, A. Shashua, and T. Poggio, "Example-based image analysis and synthesis," A. I. Memo 1431 (Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Mass., 1993).

19. T. Vetter and T. Poggio, "Image synthesis from a single example image," in Computer Vision—ECCV'96, Lecture Notes in Computer Science 1064, B. Buxton and R. Cipolla, eds. (Springer-Verlag, Berlin, 1996), pp. 652–659.

20. W. Xu and G. Hauske, "Picture quality evaluation based on error segmentation," Visual Commun. Image Process. 2308, 1–12 (1994).

21. T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, "Active shape models—their training and application," Comput. Vis. Image Understand. 61, 38–59 (1995).

22. M. Jones and T. Poggio, "Model-based matching of line drawings by linear combination of prototypes," in Proceedings of the 5th International Conference on Computer Vision (IEEE Computer Society Press, Los Alamitos, Calif., 1995), pp. 531–536.

23. P. W. Hallinan, "A deformable model for the recognition of human faces under arbitrary illumination," Ph.D. dissertation (Harvard University, Cambridge, Mass., 1995).

24. N. F. Troje and H. H. Bülthoff, "Face recognition under varying pose: the role of texture and shape," Vision Res. 36, 1761–1771 (1996).

25. J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani, "Hierarchical model-based motion estimation," in Second European Conference on Computer Vision, G. Sandini, ed. (Springer-Verlag, Berlin, 1992), pp. 237–252.

26. J. R. Bergen and R. Hingorani, "Hierarchical motion-based frame rate conversion," Tech. Rep. (David Sarnoff Research Center, Princeton, N.J., 1990).

27. P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code," IEEE Trans. Commun. 31, 532–540 (1983).

28. G. Wolberg, Image Warping (IEEE Computer Society Press, Los Alamitos, Calif., 1990).