A Framework for Long Distance Face Recognition
using Dense- and Sparse-Stereo Reconstruction
Ham Rara, Shireen Elhabian, Asem Ali, Travis Gault, Mike Miller, Thomas Starr,
and Aly Farag
CVIP Laboratory, University of Louisville, KY, USA {hmrara01,syelha01,amali003,travis.gault,wmmill06,tlstar01,
aafara01}@louisville.edu
Abstract. This paper introduces a framework for long-distance face recognition
using both dense- and sparse-stereo reconstruction. Two methods are used to
determine correspondences of the stereo pair: (a) dense global stereo matching
using maximum-a-posteriori Markov Random Field (MAP-MRF) algorithms and
(b) Active Appearance Model (AAM) fitting of both images of the stereo pair,
with the fitted AAM mesh vertices serving as the sparse correspondences.
Experiments are performed on different features extracted from these vertices
for face recognition, and the two approaches (a) and (b) are compared. The
cumulative match characteristic (CMC) curves generated using the proposed
framework confirm the feasibility of the proposed work for long-distance
recognition of human faces.
1 Introduction
Automatic face recognition is a challenging task that has been an attractive research
area over the past three decades (for more details, see [1]). At the outset, most efforts
were directed towards 2D facial recognition which utilizes the projection of the 3D
human face onto the 2D image plane acquired by digital cameras. The face recogni-
tion problem is then formulated as follows: given a still image, identify or verify one
or more persons in the scene using a stored database of face images. The main theme
of the solutions provided by different researchers involves detecting one or more
faces from the given image, followed by facial feature extraction which can be used
for recognition. Challenges involving 2D face recognition are well-documented in the
literature. Intra-subject variations such as illumination, expression, pose, makeup, and
aging can severely affect a face recognition system.
To address pose and illumination, researchers have recently focused on 3D face
recognition [2]. 3D face geometry can either be acquired using 3D sensing devices
such as laser scanners [3] or reconstructed from one or more images [4-6]. Although
3D sensing devices have been proven to be effective in 3D face recognition [7], their
high cost, limited availability and controlled environment settings have created the
need for methods that extract 3D information from acquired 2D face images.
Recently, there has been interest in face recognition at a distance. Yao et al. [8]
created a face video database acquired at long distances and high magnifications,
both indoors and outdoors, under uncontrolled surveillance conditions. Medioni et al.
[9] presented an approach to identify non-cooperative individuals at a distance by
inferring 3D shape from a sequence of images.
Motivated by these objectives and the current lack of existing facial stereo databases,
we constructed our own passive stereo acquisition setup [10]. The setup consists of a
stereo pair of high-resolution cameras (with telephoto lenses) and an adjustable
baseline. It is designed such that the user can remotely pan, tilt, zoom, and focus the
cameras so that the centers of the cameras' fields of view converge on the subject's
nose tip. This system is used to capture stereo pairs of 30 subjects at various distances
(3-, 15-, and 33-meter ranges).
The paper is organized as follows: Section 2 discusses stereo reconstruction me-
thods (dense and sparse), Section 3 shows the experimental results, Section 4 vali-
dates the best method in Sec. 3 using the FRGC database, and later sections deal with
discussions and limitations of the proposed approaches, conclusions and future work.
2 Stereo Matching-Based Reconstruction
Dense, Global Stereo Matching: The objective of the classical stereo problem is to
find the pair of corresponding points p and q that result from the projection of the
same scene point (X, Y, Z) to the two images of the stereo-pair. Currently, the state-of-
the-art in stereo matching is achieved by global optimization algorithms [11], where
the problem is formulated as a maximum-a-posteriori Markov Random Field (MAP-
MRF) scenario. Given the left and right images, the goal is to find the disparity map
D, where at each pixel p the disparity is $d_p = p_x - q_x$. To correctly solve this
problem, the constraints of visual correspondence should be satisfied: (a) uniqueness,
where each pixel in the left image corresponds to at most one pixel in the right image,
and (b) occlusion, where some pixels have no correspondence. To achieve these
constraints, similar to Kolmogorov's approach [12], we treat the two images
symmetrically by computing the disparity maps for both images simultaneously. The
disparity map D is computed by minimizing the energy function
$E(D) = E_{data}(D) + E_{smooth}(D) + E_{vis}(D)$, whose terms are the data penalty,
smoothness, and visibility constraint terms [13][14]. To fill the occluded regions, we
propose to interpolate between
the correctly reconstructed pixels of each scan line using the cubic splines [15] inter-
polation model. Finally, after getting a dense disparity map from which we get a set of
correspondence points, we reconstruct the 3D points of the face [10]. To remove some
artifacts of the reconstruction, an additional surface fitting step is done [16].
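As an illustration of the occlusion-filling and reconstruction steps above, a minimal NumPy sketch follows; it uses linear interpolation as a simple stand-in for the cubic-spline model of [15], and assumes a rectified pinhole stereo geometry with illustrative calibration values (not the authors' implementation):

```python
import numpy as np

def fill_occlusions(scanline_disp, occluded_mask):
    """Fill occluded pixels on one scan line by interpolating between the
    correctly matched pixels (linear interpolation here, as a simple
    stand-in for the paper's cubic-spline model)."""
    cols = np.arange(len(scanline_disp))
    visible = ~occluded_mask
    filled = scanline_disp.copy()
    filled[occluded_mask] = np.interp(cols[occluded_mask],
                                      cols[visible], scanline_disp[visible])
    return filled

def back_project(disp, focal_px, baseline, cx, cy):
    """Rectified-stereo triangulation of a dense disparity map:
    depth Z = f*B/d, with X and Y from the pinhole model. Calibration
    values passed by callers are illustrative assumptions."""
    rows, cols = np.indices(disp.shape)
    Z = focal_px * baseline / np.maximum(disp, 1e-6)
    X = (cols - cx) * Z / focal_px
    Y = (rows - cy) * Z / focal_px
    return np.stack([X, Y, Z], axis=-1)
```

A scan line with occluded (zero-disparity) pixels is first completed, and the resulting dense map is then back-projected to 3D points.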
Sparse-Stereo Reconstruction: The independent AAM version of [17] is used to
find sparse correspondences between the left and right images of the stereo pair. The
shape $s$ can be expressed as the sum of a base shape $s_0$ and a linear combination
of $n$ shape vectors $s_i$: $s = s_0 + \sum_{i=1}^{n} p_i s_i$, where the $p_i$ are the
shape parameters. Similarly, the appearance $A(\mathbf{x})$ can be expressed as the
sum of the base appearance $A_0(\mathbf{x})$ and a linear combination of basis
images $A_i(\mathbf{x})$: $A(\mathbf{x}) = A_0(\mathbf{x}) + \sum_i \lambda_i A_i(\mathbf{x})$,
where the pixels $\mathbf{x}$ lie on the base mesh $s_0$. Fitting the AAM to an input
image involves minimizing the error between the input image warped to the base
mesh and the appearance, that is,
$\sum_{\mathbf{x} \in s_0} \left[ A_0(\mathbf{x}) + \sum_i \lambda_i A_i(\mathbf{x}) - I(W(\mathbf{x}; p)) \right]^2$.
For this work, the error image is minimized using the project-out version of the
inverse compositional image alignment (ICIA) algorithm [17].
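The linear shape model above can be illustrated with a toy sketch; the base shape, single shape mode, and parameter value below are invented for illustration and are not a trained AAM:

```python
import numpy as np

def synthesize_shape(s0, shape_modes, p):
    """Linear AAM shape model: s = s0 + sum_i p_i * s_i.
    s0: (2N,) base shape; shape_modes: (n, 2N) vectors s_i; p: (n,) params."""
    return s0 + shape_modes.T @ p

# Toy model: three 2-D landmarks (flattened) and one shape mode.
s0 = np.array([0.0, 0.0, 1.0, 0.0, 0.5, 1.0])
modes = np.array([[0.0, 0.0, 0.1, 0.0, 0.0, 0.1]])
s = synthesize_shape(s0, modes, np.array([2.0]))
```

Varying the parameter vector p moves the mesh vertices along the learned shape modes.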
To facilitate a successful fitting process, the AAM mesh is initialized according to
detected face landmarks (eyes, mouth center, and nose tip). After detecting these
facial features, the AAM base mesh is warped to these points.
The detection of facial features starts with identifying the possible facial regions in
the input image, using a combination of the Viola-Jones detector [18] and the skin
detector of [19]. The face is then divided into four equal parts to establish a geometric
constraint on the face. The face landmarks are then identified using variants of
the Viola-Jones detector, i.e., the face detector is replaced with the corresponding
facial landmark detector (e.g., an eye detector) [20]. False detections are then
removed by taking into account the geometric structure of the face (i.e., expected
facial feature locations).
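The quadrant-based geometric constraint can be sketched as follows; the helper names and the specific rule (eyes must lie in the top half of the face box) are illustrative assumptions rather than the exact logic of [20]:

```python
def expected_quadrant(face_box, point):
    """Return which quarter of the face box a detected landmark falls in,
    as ('left'/'right', 'top'/'bottom'). face_box is (x, y, w, h)."""
    x, y, w, h = face_box
    px, py = point
    horiz = 'left' if px < x + w / 2 else 'right'
    vert = 'top' if py < y + h / 2 else 'bottom'
    return horiz, vert

def filter_eye_detections(face_box, candidates):
    # Discard false detections: keep only eye candidates in the top half.
    return [c for c in candidates if expected_quadrant(face_box, c)[1] == 'top']
```

A detection labeled "eye" in the lower half of the face region is rejected as geometrically implausible.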
3 Experimental Results
The 3D acquisition system in [10] is used to build a human face database of 30 sub-
jects at different ranges in controlled environments. The database consists of a gallery
at 3 meters and three different probe sets at the 3-, 15-, and 33-meter ranges. Table I
shows the system parameters at different ranges.
Table I: Stereo-based acquisition system parameters
Dense, Global Stereo 3D Face Reconstructions: The gallery is constructed by cap-
turing stereo pairs for the 30 subjects at the 3-meter range. We reconstruct the 3D face
of each subject using the approach that is described in Section 2. Fig. 1(a) illustrates a
sample from this gallery for different subjects. This figure shows the left image of
each subject and two different views for the 3D reconstruction with and without the
textures.
For the dense, global 3D reconstruction approach, only the images from the 3-meter
and 15-meter probe sets are considered. The reason is that the methodology from
Sec. 2.1 fails to determine acceptable correspondences for the stereo-pair images at
the 33-meter range, leading to unacceptable 3D reconstructions. This result led the
authors to propose the second method (sparse stereo) to deal with stereo pairs from
which dense correspondences are difficult to extract. Fig. 1(b) and 1(c) illustrate
samples from these probe sets.
Sparse-Stereo 3D Face Reconstructions: The gallery and probe sets are similar to
above, except that the 33-meter images are now included in the probe set. The train-
ing of the AAM model involves images from the gallery.
Figure 1: Dense 3D reconstructions: (a) 3-meter gallery, (b) 3-meter, and (c) 15-meter probes.
The vertices of the final AAM mesh on both left and right images can be considered as a set of corresponding
pairs of points, which can be used for stereo reconstruction. To further refine the
correspondences, a local NCC search around each point is performed, using the left
image as the reference. Fig. 2 shows stereo reconstruction
results of three subjects, visualized with the x-y, x-z, and y-z projections, after rigid
alignment to one of the subjects. Notice that in the x-y projections, the similarity (or
difference) of 2D shapes coming from the same (or different) subject is enhanced.
This is the main reason behind the use of x-y projections as features in Sec. 3 (Recog-
nition).
Figure 2: Reconstruction results. The 3D points are visualized as projections in the x-y, x-z,
and y-z planes. Circle and diamond markers belong to the same subject, while the square
markers belong to a different subject.
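The local NCC refinement of the AAM correspondences can be sketched as follows; window and search sizes are illustrative assumptions, not the paper's values:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equal-size patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def refine_match(left, right, pt, half=3, search=4):
    """Refine a right-image correspondence by maximizing NCC against the
    left-image patch centered at pt = (row, col); the left image is the
    reference, and the search is a horizontal scan over +/- `search` pixels."""
    r, c = pt
    ref = left[r - half:r + half + 1, c - half:c + half + 1]
    best, best_c = -2.0, c
    for dc in range(-search, search + 1):
        cand = right[r - half:r + half + 1, c + dc - half:c + dc + half + 1]
        if cand.shape != ref.shape:
            continue  # skip windows clipped by the image border
        score = ncc(ref, cand)
        if score > best:
            best, best_c = score, c + dc
    return r, best_c
```

For each fitted AAM vertex, the right-image location is nudged to the column whose patch correlates best with the left-image patch.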
Recognition: For face recognition, we use five approaches based on the 3D face
vertices derived from dense- and sparse-stereo (3D-AAM) reconstruction to identify
probe images against the gallery: (a) a moment-based approach for dense 3D reconstructions, (b)
feature vectors derived from Principal Component Analysis (PCA) of 3D-AAM ver-
tices, (c) goodness-of-fit criterion (Procrustes) after rigidly registering the 3D-AAM
vertices of a probe with that of a gallery subject, (d) feature vectors from PCA of x-y
plane projections of the 3D-AAM vertices, and (e) the same procedure as (c) but us-
ing the x-y plane projections of the 3D-AAM vertices of both probe and gallery, after
frontal pose normalization.
Moment-based Recognition: For the dense 3D reconstructions, feature vectors used to
compare gallery and probe sets are derived from moments [21] of the 3D vertex
coordinates. The moments are computed as $\eta_{rst} = \sum_{(X,Y,Z)} X^r Y^s Z^t$,
where the sum runs over the reconstructed vertices.
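Under this reading of the moment formula, a feature vector might be assembled as follows; the maximum order and feature layout are assumptions for illustration:

```python
import numpy as np

def moment_features(vertices, max_order=2):
    """Moments eta_rst = sum over vertices of X^r * Y^s * Z^t, collected
    for all (r, s, t) with r + s + t <= max_order. vertices: (N, 3) array."""
    X, Y, Z = vertices[:, 0], vertices[:, 1], vertices[:, 2]
    feats = []
    for r in range(max_order + 1):
        for s in range(max_order + 1 - r):
            for t in range(max_order + 1 - r - s):
                feats.append((X**r * Y**s * Z**t).sum())
    return np.array(feats)
```

The zeroth moment is simply the vertex count; higher-order moments summarize the spatial distribution of the reconstructed face.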
Principal Component Analysis (PCA): To apply PCA [22] for feature classification,
the primary step is to solve for the matrix P of principal components from a training
database, using a number of matrix operations. The feature vector is then computed
as $Y = P^T X$, where $X$ is the centered input sample. The similarity measure used
for recognition is the L2 norm.
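A minimal sketch of the PCA feature extraction and L2 matching; the eigendecomposition route and top-k truncation are standard choices, not necessarily the authors' exact procedure:

```python
import numpy as np

def pca_features(train, x, k):
    """Project a centered sample onto the top-k principal components,
    Y = P^T X; `train` holds one sample per row."""
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    P = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors
    return P.T @ (x - mean)

def match_l2(gallery_feats, probe_feat):
    """Rank gallery entries by L2 distance to the probe feature vector."""
    d = np.linalg.norm(gallery_feats - probe_feat, axis=1)
    return np.argsort(d)
```

The first index returned by `match_l2` is the rank-1 identity used in the CMC curves.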
Goodness-of-fit (Procrustes): The Procrustes distance [23] between two shapes is a
least-squares type of metric that requires one-to-one correspondence between shapes.
After preprocessing steps that compute the centroids, rescale each shape to unit size,
and align the shapes with respect to translation and rotation, the squared Procrustes
distance between two shapes $\mathbf{x}_1$ and $\mathbf{x}_2$ is the sum of squared
point distances, $P_d^2 = \|\mathbf{x}_1 - \mathbf{x}_2\|^2$.
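The distance can be sketched as follows, using the standard orthogonal-Procrustes solution for the aligning rotation; this is a sketch of the textbook procedure, not the authors' code:

```python
import numpy as np

def procrustes_distance(x1, x2):
    """Squared Procrustes distance between two (N, 2) shapes: center,
    scale to unit Frobenius norm, rotate x2 optimally onto x1 (orthogonal
    Procrustes), then sum squared point distances."""
    a = x1 - x1.mean(axis=0)
    a = a / np.linalg.norm(a)
    b = x2 - x2.mean(axis=0)
    b = b / np.linalg.norm(b)
    U, _, Vt = np.linalg.svd(b.T @ a)
    R = U @ Vt  # optimal orthogonal alignment of b onto a
    return np.sum((a - b @ R) ** 2)
```

Two shapes that differ only by translation, scale, and rotation have a distance of (numerically) zero.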
Figure 3: Cumulative match characteristic (CMC) curves of the: (a) 3-meter probe set, (b) 15-
meter probe set, and (c) 33-meter probe set. Note that only the 3-m and 15-m probe sets use
moment-based recognition.
Discussion of Results: Fig. 3 shows the cumulative match characteristic (CMC)
curves for the five feature extraction methods of the previous section (Sec. 3), using
the 3-, 15-, and 33-meter probes. We can draw four conclusions: (a) both 2D
Procrustes (i.e., the x-y projection of 3D-AAM) and 2D PCA outperform both 3D
Procrustes and 3D PCA, (b) the goodness-of-fit criterion (Procrustes) slightly
outperforms PCA in both 2D and 3D, (c) recognition degrades at increased distances,
and (d) the moment-based methods perform poorly at lower ranks but shoot up
quickly to 100% at rank-5 and rank-8 for the 3-m and 15-m probe sets, respectively.
The conclusion in (a) can be explained with the help of Fig. 4. The diagram shows the
top view of a simple stereo system. Ol and Or are centers of projection, and pl, pr, ql, qr
are points on the left and right images.
Figure 4: Simple stereo illustration. 3D reconstruction is sensitive to the correspondence prob-
lem but projection to the x-y plane minimizes the error.
Assume that the y-coordinates of the four image points are equal. The pair (pl, pr)
reconstructs P, the pair (ql, qr) reconstructs Q, and so on. Notice that a small change
in the correspondence changes the xyz reconstruction substantially, i.e., the Euclidean
distance between the reconstructed points is large. But when the points are projected
onto the x-y plane, their 2D Euclidean distances are considerably smaller. This
scenario likely occurs with the correspondences of the AAM vertices between the left
and right images.
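This argument can be made concrete with a small worked example; the focal length, baseline, and disparities below are illustrative numbers, not the system's calibration:

```python
import numpy as np

# A 1-pixel correspondence error changes depth Z = f*B/d substantially,
# but the x-y projection of the reconstructed point far less.
f, B = 1000.0, 0.1   # focal length (px) and baseline (m), illustrative
u = 50.0             # pixel offset from the principal point

def reconstruct(d):
    """Rectified-stereo back-projection of one point at disparity d."""
    Z = f * B / d
    X = u * Z / f
    Y = 0.0
    return np.array([X, Y, Z])

P = reconstruct(10.0)  # true disparity
Q = reconstruct(9.0)   # with a 1-pixel matching error
err3d = np.linalg.norm(P - Q)        # full 3D displacement
err2d = np.linalg.norm((P - Q)[:2])  # displacement after x-y projection
```

Here the 3D error is dominated by the depth change, while the x-y projection shrinks the error by more than an order of magnitude, matching the intuition of Fig. 4.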
The conclusion in (b) is related to the primary purpose of PCA, which is optimal
reconstruction, in the mean-square-error (MSE) sense, rather than discrimination.
Projection to a low-dimensional space may remove the discriminative potential of the
vectors. Rigid alignment using Procrustes involves no dimensionality change; similar
shapes are expected to have a smaller Procrustes distance after rigid alignment, and
the geometric information of faces (e.g., distance ratios between face parts) is
maintained.
Results are expected to degrade with distance since the images are captured under
less ideal conditions (although recognition using the 2D x-y projections remains
stable). The work of Medioni et al. [9] deals with identification at a distance using
dense 3D shape recovered from a sequence of captured images. Our results at the 15-
and 33-meter ranges (Fig. 3(b,c)) are comparable to (and slightly better than) their
9-meter results; however, their experimental setting may be less controlled than ours.
The authors cannot find a concrete reason behind conclusion (d), other than that the
dense reconstructed shapes may be overfitted by the gridfit procedure of [16].
This part of the framework is still a work-in-progress and more elaborate 3D recogni-
tion methods will be incorporated as future work (see Conclusions).
Sensitivity Analysis: This section performs a sensitivity analysis of the recognition
performance with respect to errors in the AAM fitting of the stereo pair of images. To
simulate errors introduced to the system, the fitted AAM vertices of the stereo pair are
randomly perturbed with additive white Gaussian noise of a certain variance. Fig. 5
shows the plot of rank-1 recognition rates versus point sigmas, for the 3-, 15-, and 33-
meter ranges. The recognition method used is the 2D Procrustes approach of Fig. 3.
These findings reinforce the fact that accurate AAM fitting is necessary to obtain
satisfactory correspondences, which in turn lead to reliable recognition.
Figure 5: Sensitivity analysis of recognition with respect to errors in the AAM fitting of the
stereo pair of images. Notice that the recognition rates are fairly stable across various values of
σ for the 3- and 15-meter probe sets. However, for the 33-meter probe set, the recognition
performance severely deteriorates after σ = 5.
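The perturbation used in this analysis can be sketched as follows; the vertex-array shape and seeded generator are assumptions made for reproducibility:

```python
import numpy as np

def perturb_vertices(vertices, sigma, seed=0):
    """Add white Gaussian noise (std. dev. sigma per coordinate) to the
    fitted AAM vertices, simulating fitting errors as in the sensitivity
    analysis; the fixed seed is for repeatable experiments."""
    rng = np.random.default_rng(seed)
    return vertices + rng.normal(0.0, sigma, size=vertices.shape)
```

Sweeping sigma and re-running recognition on the perturbed vertices yields the rank-1 rate curves of Fig. 5.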
4 Validation with FRGC Database
The main purpose of this section is to test whether the 2D (x-y projection of 3D-AAM)
Procrustes results of Fig. 3 carry over to the much larger FRGC database. Since the
FRGC database contains close-to-perfect dense 3D information, the advantage of 2D
Procrustes over its 3D equivalent can also be investigated. The section of the
FRGC database [24] with range (3D) data is used. Each range is accompanied by a
corresponding texture image. 115 subjects (three images each, 345 images in total)
were chosen from the database, the number being limited by the manual annotation
required for the AAM training data.
Fig. 6 may provide some insights regarding the better performance of 2D Pro-
crustes over 3D Procrustes. Note that the 2D+3D (range, depth) partition of the
FRGC database contains 2D video images with corresponding range values for each
pixel. After the manual/automatic fitting of AAM vertices, the corresponding range
(depth) value is extracted for each vertex. In Fig. 6, the red dots represent the ex-
tracted depth values using the 2D coordinates of the fitted vertices. Notice that some
depth values are undesirable; they do not contain the intended depth of the facial
feature points (e.g., nose area of Fig. 6(a) and face boundary of Fig. 6(b)). The 2D
coordinates of the AAM vertices (except for the face boundary) are adjusted accord-
ing to the COSMOS framework [25]; specifically, the vertices are adjusted along a
local neighborhood in the horizontal (𝑥) direction, according to some extremum val-
ues defined by [25]. The green dots represent the adjusted 3D vertices. The face
boundaries are adjusted using ad-hoc methods that investigate the most acceptable
face vertex depth of the whole face boundary. Fig. 7 shows both the 2D and 3D Pro-
crustes methods using the FRGC database. Similar to Fig. 3, the 2D approach outper-
forms the 3D version.
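A much-simplified stand-in for this vertex adjustment, which moves a vertex along a horizontal neighborhood of the range image toward an extremum depth (here the minimum, i.e., the point closest to the sensor), might look like this; the window size and min-depth rule are assumptions, not the exact COSMOS criterion of [25]:

```python
import numpy as np

def adjust_vertex_col(range_row, col, half=5):
    """Nudge a fitted vertex's column within a local horizontal window of
    one row of the range image toward the extremum depth (the minimum,
    e.g., the nose tip as the closest point). Simplified sketch only."""
    lo = max(0, col - half)
    hi = min(len(range_row), col + half + 1)
    return lo + int(np.argmin(range_row[lo:hi]))
```

Applied per vertex, this moves the red (extracted) dots of Fig. 6 toward the green (adjusted) positions along the horizontal direction.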
Figure 6: Using the original 2D coordinates of the fitted AAM vertices in the range image does
not give the desired feature locations; hence the adjustment of the vertices: movement of the
red dot to the green dot at the (a) nose location and (b) face boundary, indicated by the arrow.
The vertex movement is not noticeable in the corresponding 2D video image.
Figure 7: Cumulative match characteristic (CMC) curve of the FRGC experiment. Notice the
superiority of the 2D Procrustes approach over the 3D version. Classification is performed
using the leave-one-out methodology of [22]. The reason behind this is related to Fig. 6; the
movement from the red to green dots represents a large variation in the 3D coordinate system,
but when projected to the 2D x-y plane, the planar movement is much smaller, e.g., there is no
noticeable pixel difference in the 2D image after the vertex movement in the nose of Fig. 6(a).
5 Conclusions and Future Work
This paper described a framework for identifying faces remotely, specifically at
distances of 3, 15, and 33 meters. The best approach used AAM to get correspondences
between the left and right images of the stereo pair. This study found that recognition
using the point-to-point distance between 2D x-y projected shapes (after rigid
registration, Procrustes) outperforms the others. The advantage of the 2D Procrustes
approach over its 3D version carried over to the much larger FRGC database (with
115 chosen subjects). Using our database, we have illustrated the potential of using
these few vertices, as opposed to the whole set of points of the human face (dense
reconstruction).
The authors are aware of more elaborate methods of 3D shape classification for face
recognition (even in the presence of facial expressions), such as [7]. However, for this
application (identifying faces at far distances), a close-to-perfect dense 3D scan of
the face is difficult to obtain; therefore, this study currently deals with the sparse 3D
points given by the AAM vertices. The next step of this work is to densify the AAM
mesh to obtain a reconstruction that closely resembles a dense 3D scan but still
contains fewer vertices than conventional 3D scans. This study does not currently
consider facial expression (since the sparse 3D reconstruction can only do so much),
but expression will be addressed as future work once the densification of the AAM
vertices is in place. Additionally, the authors plan to increase the database size (for
better statistical significance) and capture images at further distances (with the help of
state-of-the art equipment).
References
1. Zhao, W., Chellappa, R., Rosenfeld, A.: Face recognition: a literature survey. ACM Com-
puting Surveys 35 (2003) 399–458
2. Kittler, J., Hilton, A., Hamouz, M., Illingworth, J.: 3D assisted face recognition: A survey
of 3D imaging, modelling and recognition approaches. In: Proc. of CVPR. (2005)
3. Pan, G., Han, S.,Wu, Z.,Wang, Y.: 3D face recognition using mapped depth images. In:
Proc. of CVPR- Workshop on Face Recognition Grand Challenge Experiments. (2005)
4. Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. IEEE Trans.
on PAMI 25 (2003) 1063–1074
5. Hu, Y., Jiang, D., Yan, S., Zhang, L., Zhang, H.: Automatic 3D reconstruction for face
recognition. In: Proc. of Sixth IEEE International Conference on Face and Gesture Recog-
nition. (2004) 843–848
6. Chowdhury, A.R., Chellappa, R., Krishnamurthy, S., Vo, T.: 3D face reconstruction from
video using a generic model. In: Proc. of IEEE International Conference on Multimedia and
Expo. (2002) 449–452
7. Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Three-dimensional face recognition. In:
Intl. Journal of Computer Vision (IJCV). (2005) 5–30
8. Yao, et al.: Improving long range and high magnification face recognition: Database acqui-
sition, evaluation, and enhancement. In: Computer Vision and Image Understanding
(CVIU). (2008)
9. Medioni, G., Jongmoo, C., Cheng-Hao, K., Choudhury, A., Li, Z., Fidaleo, D.: Noncoo-
perative persons identification at a distance with 3D face modeling. In: Proc. of IEEE Inter-
national Conference on Biometrics: Theory, Applications, and Systems (BTAS’07). (2007)
10. Rara, H., Elhabian, S., Ali, A., Miller, M., Starr, T., and Farag, A.: Face recognition at-a-
distance based on sparse-stereo reconstruction. In: Proc. of CVPR- Biometrics Workshop.
(2009)
11. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen,
M.F., Rother, C.: A comparative study of energy minimization methods for markov random
fields. In: Proc. of ECCV. (2006) 16–29
12. Kolmogorov, V., Zabih, R.: Multi-camera scene reconstruction via graph cuts. In: Proc. of
ECCV. (2002) 82–96
13. Birchfield, S., Tomasi, C.: A pixel dissimilarity measure that is insensitive to image sam-
pling. IEEE Trans. on PAMI 20 (1998) 401–406
14. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algo-
rithms for energy minimization in vision. IEEE Trans. on PAMI 26 (2004) 1124–1137
15. Knott, G.D.: Interpolating Cubic Splines. Springer-Verlag (2000)
16. D'Errico, J.: Surface Fitting using gridfit. In: Matlab Central File Exchange
(http://www.mathworks.com/matlabcentral/fileexchange/). (2006)
17. Matthews, I., Baker, S.: Active Appearance Models Revisited. In: International Conference
on Computer Vision (ICCV). (2004)
18. Viola, P., Jones, M. J.: Robust real-time face detection. In: International Journal of Com-
puter Vision (IJCV). (2004) 151–173
19. Jones, M., Rehg, J.: Statistical color models with application to skin detection. In: Interna-
tional Journal of Computer Vision (IJCV). (2002) 81–96
20. Castrillon-Santana, M., Deniz-Suarez, O., Anton-Canalis, L., Lorenzo-Navarro, J.: Face
and facial feature detection evaluation: Performance evaluation of public domain Haar
detectors for face and facial feature detection. In: VISAPP. (2008)
21. Elad, M., Tal, A., Ar, S.: Content based retrieval of VRML objects: an iterative and interactive
approach. In: Proceedings of the sixth Eurographics workshop on Multimedia. (2001) 107–
118
22. Belhumeur, P., Hespanha J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Recognition using
Class Specific Linear Projection. In: IEEE Trans. PAMI. (1998)
23. Cootes, T.F., Taylor, C.J.: Statistical Models of Appearance for Computer Vision. In:
Technical Report, University of Manchester, UK. (2004)
24. Chang, K., Bowyer, K., Flynn, P.: An Evaluation of Multimodal 2D+3D Face Biometrics.
In: IEEE Trans. PAMI. (2005)
25. Dorai, C., Jain, A.K.: COSMOS - A Representation Scheme for 3D Free-Form Objects. In:
IEEE Trans. PAMI. (1997)