
A Space-Time Depth Super-Resolution Scheme for 3D Face Scanning

Karima Ouji1, Mohsen Ardabilian1, Liming Chen1, and Faouzi Ghorbel2

1 LIRIS, Lyon Research Center for Images and Intelligent Information Systems, Ecole Centrale de Lyon. 36, av. Guy de Collongue, 69134 Ecully, France.

2 GRIFT, Groupe de Recherche en Images et Formes de Tunisie, Ecole Nationale des Sciences de l'Informatique, Tunisie.

Abstract. Current 3D imaging solutions are often based on rather specialized and complex sensors, such as structured-light camera/projector systems, and require explicit user cooperation for 3D face scanning under more or less controlled lighting conditions. In this paper, we propose a cost-effective 3D acquisition solution with a 3D space-time super-resolution scheme, using a calibrated stereo rig coupled with a non-calibrated projector device, which is particularly suited to 3D face scanning, i.e. rapid, easily movable and robust to ambient lighting conditions. The proposed solution is a hybrid stereovision and phase-shifting approach, using two shifted patterns and a texture image, which not only takes advantage of the assets of stereovision and structured light but also overcomes their weaknesses. The super-resolution scheme corrects the 3D information and completes the 3D scanned view despite the presence of small non-rigid deformations such as facial expressions. The preliminary experimental results further validate the effectiveness of the proposed approach.

1 Introduction

Real-time 3D shape measurement plays an important role in a wide range of applications, including biometry, facial animation, aesthetic and plastic surgery, surgical operation simulation and medical tracking. 3D face capture has its specific constraints, such as safety, speed and the naturally deformable behavior of the human face. Current 3D imaging solutions are often based on rather specialized and complex sensors, such as structured-light camera/projector systems, and require explicit user cooperation for 3D face scanning under more or less controlled lighting conditions [1, 4]. For instance, in projector-camera systems, depth information is recovered by decoding patterns of a projected structured light. These patterns include Gray codes, sinusoidal fringes, etc. Current solutions mostly utilize more than three phase-shifted sinusoidal patterns to recover the depth information, thus impacting the acquisition delay; they further require projector-camera calibration, whose accuracy is crucial for the phase-to-depth estimation step; and finally, they also need an unwrapping stage which is sensitive to ambient light, especially when the number of patterns decreases [2, 5].

An alternative to projector-camera systems consists of recovering depth information by stereovision using a multi-camera system, as proposed in [3, 6]. Here, a stereo-matching step finds correspondences between stereo images and retrieves the disparity information [7]. When a number of light patterns are successively projected onto the object, the algorithm assigns to each pixel in the images a special codeword which allows the stereo matching to converge, and the 3D depth information is obtained by optical triangulation [3]. Meanwhile, the model computed in this way is generally quite sparse, and one needs to resort to other techniques to densify it. Recently, researchers have looked into super-resolution techniques as a solution to upsample and denoise depth images. Kil et al. [11] were among the first to apply super-resolution to laser triangulation scanners, by regular resampling from aligned scan points with associated Gaussian location uncertainty. Super-resolution was especially proposed for time-of-flight cameras, which have very low data quality and very high random noise, by solving an energy minimization problem [12, 15].

In this paper, we propose a cost-effective 3D acquisition solution with a 3D space-time super-resolution scheme, using a calibrated stereo rig coupled with a non-calibrated projector device, which is particularly suited to 3D face scanning, i.e. rapid, easily movable and robust to ambient lighting conditions. The proposed solution is a hybrid stereovision and phase-shifting approach which not only takes advantage of the assets of stereovision and structured light but also overcomes their weaknesses. According to our method, a 3D sparse model is first estimated by stereo matching with a fringe-based resolution. To achieve that, only two π-shifted sinusoidal fringes are used to sample right and left candidates with subpixel precision. Then, the projector's vertical axis is automatically estimated. A dense 3D model is recovered by intra-fringe phase estimation, from the two sinusoidal fringe images and a texture image, independently for the right and left cameras. The left and right 3D dense models are fused to produce the final 3D model, which constitutes a spatial super-resolution. Super-resolution through the time axis is also proposed; it corrects the 3D information, completes the 3D scanned view and also accounts for the deformable aspect of the face. In contrast to conventional methods, our method is less affected by ambient light thanks to the use of stereo in the first stage of the approach, replacing the phase unwrapping stage. Also, it does not require a camera-projector offline calibration stage, which constitutes a tedious and expensive task. Moreover, our approach is applied only to the region of interest, which decreases the whole processing time.

Section 2 presents an overview of the offline and online preprocessing preceding the 3D model generation. Section 3 details the 3D sparse model generation and the projector axis estimation. In Section 4, we highlight the spatial super-resolution principle which densifies the 3D sparse model. Section 5 describes the super-resolution process through the time axis and explains how the 3D space-time super-resolution is carried out. Section 6 discusses the experimental results and Section 7 concludes the paper.


2 Offline and online preprocessing

First, an offline strong stereo calibration computes the intrinsic and extrinsic parameters of the cameras, estimates the tangential and radial distortion parameters, and provides the epipolar geometry, as proposed in [10]. In the online process, two π-shifted sinusoidal patterns and a third white pattern are projected onto the face. A set of three pairs of left and right images is captured, undistorted and rectified. The proposed model is defined by the system of equations (1). It is a variant of the mathematical model proposed in [5].

\[
\begin{aligned}
I_p(s,t) &= I_b(s,t) + I_a(s,t)\cdot\sin(\phi(s,t)),\\
I_n(s,t) &= I_b(s,t) + I_a(s,t)\cdot\sin(\phi(s,t)+\pi),\\
I_t(s,t) &= I_b(s,t) + I_a(s,t).
\end{aligned}
\tag{1}
\]

At time t, Ip(s, t) is the intensity of pixel s in the positive image, In(s, t) its intensity in the negative image, and It(s, t) its intensity in the texture image. According to our proposal, a π-shift between the first two patterns is optimal for a stereo scenario; the left and right samples used in stereo matching are located with subpixel precision. The role of the last pattern is twofold: it serves to normalize the phase information, and it is also used to texture the 3D model. This model is defined in order to decorrelate the sinusoidal signals Ia(s, t)·sin(φ(s, t)) and Ia(s, t)·sin(φ(s, t) + π), distorted on the face, from the non-sinusoidal term Ib(s, t). In fact, Ib(s, t) represents the texture information and the lighting effect, and constitutes a contamination signal for the sinusoidal component. Ia(s, t) is the intensity modulation and φ(s, t) is the local phase defined at each pixel s. Solving (1), Ib(s, t) is computed as the average intensity of Ip(s, t) and In(s, t); Ia(s, t) is then computed from the third equation of the system (1), and the phase value φ(s, t) is estimated by equation (2).

\[
\phi(s,t) = \arcsin\!\left[\frac{I_p(s,t)-I_n(s,t)}{2\,I_t(s,t)-I_p(s,t)-I_n(s,t)}\right]. \tag{2}
\]
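For concreteness, the recovery of Ib, Ia and the wrapped phase from equations (1) and (2) amounts to a few array operations. The following is a minimal NumPy sketch (the function name and the eps guard are our own, not from the paper), assuming the three captured images are given as float arrays of identical shape:

```python
import numpy as np

def decode_phase(I_p, I_n, I_t, eps=1e-8):
    """Recover I_b, I_a and the wrapped phase phi from the positive,
    negative and texture images (equations (1) and (2))."""
    I_b = 0.5 * (I_p + I_n)          # the pi-shifted sinusoids cancel out
    I_a = I_t - I_b                  # from the third equation of (1)
    ratio = (I_p - I_n) / (2.0 * I_t - I_p - I_n + eps)
    phi = np.arcsin(np.clip(ratio, -1.0, 1.0))  # local (wrapped) phase
    return I_b, I_a, phi
```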

When the sinusoidal pattern is projected by the light source onto the object, gamma distortion makes the ideal sinusoidal waveforms non-sinusoidal, and the resulting distortion is phase dependent. A gamma correction step is therefore crucial to recover a clean sinusoidal component; it is performed using a look-up table as proposed in [4], and the localization of the face is carried out as described in [7].
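As an illustration of the look-up-table idea, here is a sketch under the assumption of a simple power-law projector response; the actual LUT of [4] is built from a measured response curve, not an assumed one:

```python
import numpy as np

def build_inverse_gamma_lut(gamma=2.2, levels=256):
    """Map each 8-bit intensity through the inverse of an assumed
    power-law response, so that projected sinusoids stay sinusoidal."""
    x = np.linspace(0.0, 1.0, levels)
    return np.round((levels - 1) * np.power(x, 1.0 / gamma)).astype(np.uint8)

lut = build_inverse_gamma_lut()
# corrected = lut[image]   # 'image' would be a captured uint8 array
```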

3 3D sparse model and projector axis estimation

The sparse 3D model is generated through a stereovision scenario. It is formed by the primitives situated on the fringe change-overs, which are the intersections of the sinusoidal component of the positive image with the π-shifted sinusoidal component of the negative one [7]. Therefore, the localization has subpixel precision. Corresponding left and right primitives necessarily have the same Y-coordinate in the rectified images. Thus, the stereo matching problem is solved on each epiline separately using dynamic programming. The 3D sparse facial point cloud is then recovered by computing the intersection of the optical rays coming from each pair of matched features.
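As a sketch of the ray-intersection step (assuming OpenCV and the 3×4 projection matrices obtained from the offline stereo calibration; the function name is ours):

```python
import cv2
import numpy as np

def triangulate_sparse(P_left, P_right, pts_left, pts_right):
    """Intersect the optical rays of matched sub-pixel primitives.
    pts_left/pts_right are 2xN arrays of corresponding image points
    lying on the same epilines of the rectified pair."""
    X_h = cv2.triangulatePoints(P_left, P_right, pts_left, pts_right)
    return (X_h[:3] / X_h[3]).T      # Nx3 Euclidean sparse point cloud
```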

When projecting vertical fringes, the video projector can be considered as a set of vertically adjacent light sources. Such a consideration provides, for each epiline, a light source point OPrj situated on the corresponding epipolar plane. The sparse 3D model is a series of adjacent 3D vertical curves obtained from the fringe intersections of the positive and negative images. Each curve describes the profile of a projected vertical fringe distorted on the 3D facial surface. We propose to estimate the 3D plane containing each distorted 3D curve separately. As a result, the vertical axis of the projector's light source is defined as the intersection of all the computed 3D planes. This estimation can be performed either as an offline or an online process, unlike conventional phase-shifting approaches where the projector is calibrated offline and cannot change its position when scanning the object.
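A possible least-squares sketch of this estimation (the helper names are ours; each fringe curve is an N×3 array of 3D points on one fringe change-over):

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through one distorted 3D fringe curve;
    returns its unit normal and centroid."""
    c = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - c)
    return vt[2], c                  # smallest singular direction = normal

def projector_axis(fringe_curves):
    """Axis direction = vector most orthogonal to every fringe-plane
    normal; axis point = least-squares intersection of the planes."""
    normals, centroids = zip(*[fit_plane(c) for c in fringe_curves])
    N, C = np.array(normals), np.array(centroids)
    _, _, vt = np.linalg.svd(N)
    n_proj = vt[2]                   # minimizes the sum of (n_i . v)^2
    d = np.einsum('ij,ij->i', N, C)  # plane offsets: n_i . x = d_i
    p_proj, *_ = np.linalg.lstsq(N, d, rcond=None)
    return n_proj, p_proj
```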

4 3D spatial super-resolution

Here, the idea is to find the 3D coordinates of each pixel situated between two successive fringes in either the left or the right camera images, so that each camera contributes separately to the 3D model elaboration. We thus obtain a left 3D point cloud from the left images and a right 3D point cloud from the right images. The spatial super-resolution consists of merging the left and right 3D point clouds. The 3D coordinates of each pixel are computed using phase-shifting analysis.

Conventional phase-shifting techniques estimate the local phase in [0..2π] for each pixel of the captured image. Local phases are defined as wrapped phases; absolute phases are obtained by phase unwrapping. Phase unwrapping consists of determining the unknown integral multiple of 2π to be added at each pixel of the wrapped phase map to make it continuous. The algorithm considers a reference axis for which the absolute phase value is equal to 0.
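For reference, this conventional unwrapping can be expressed along one image row with NumPy's built-in routine (a toy illustration, not part of our pipeline, which avoids this stage):

```python
import numpy as np

absolute = np.linspace(0.0, 6 * np.pi, 400)   # ground-truth phase ramp
wrapped = np.angle(np.exp(1j * absolute))     # wrapped into (-pi, pi]
recovered = np.unwrap(wrapped)                # adds back multiples of 2*pi
assert np.allclose(recovered, absolute)
```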

In the proposed approach, the sparse model replaces the reference axis and lets us retrieve 3D intra-fringe information directly from the wrapped phases, in contrast to conventional approaches. Each point in the sparse model serves as a reference point contributing to the extraction of the intra-fringe information from both the left and right images. In fact, each point Pi of the sparse model constitutes a reference for all pixels situated between Pi and its next neighbor Pi+1 on the same epiline. For a pixel Pk situated between Pi(Xi, Yi, Zi) and Pi+1(Xi+1, Yi+1, Zi+1), we compute its local phase value φk using equation (2). We then find the depth information for Pk from the local phase directly, avoiding the expensive task of phase unwrapping. The phase value of Pi is φi = 0 and the phase value of Pi+1 is φi+1 = π.

The phase φk, which belongs to [0..π], varies monotonically if [PiPi+1] is a straight line on the 3D model. When [PiPi+1] represents a curve on the 3D model, the function φk describes the depth variation inside [PiPi+1]. The 3D coordinates (X(φk), Y(φk), Z(φk)) of the 3D point Pk corresponding to the pixel point Gk are therefore computed by a geometric reconstruction, as shown in figure 1.

Fig. 1. Intra-fringe 3D information retrieval scheme.

The 3D intra-fringe coordinate computation is carried out for each epiline i separately. An epipolar plane is defined for each epiline; it contains the optical centers OL and OR of the left and right cameras respectively, and all the 3D points situated on the current epiline i. Each 3D point Pk is characterized by its own phase value φ(Pk). The light ray coming from the light source to the 3D point Pk intersects the segment [PiPi+1] at a 3D point Ck having the same phase value φ(Ck) = φ(Pk) as Pk. To localize Ck, we need to find the distance PiCk. This distance is computed by applying the law of sines in the triangle OPrjPiCk, as described in equation (3).

\[
\frac{P_iC_k}{\sin(\theta_C)} = \frac{O_{Prj}P_i}{\sin\!\big(\pi-(\theta_C+\alpha)\big)}. \tag{3}
\]

The distance OPrjPi and the angle α between (OPrjPi) and (PiPi+1) are known. Also, the angle θ between (OPrjPi) and (OPrjPi+1) is known. Thus, the angle θC is given by equation (4). After localizing Ck, the 3D point Pk is identified as the intersection between (ORGk) and (OPrjCk). This approach provides the 3D coordinates of all pixels. Point meshing and texture mapping are then performed to obtain the final 3D face model.

\[
\theta_C = \frac{\theta}{\pi}\,\phi(C_k). \tag{4}
\]
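A geometric sketch of equations (3) and (4) (helper names are ours; O_prj, P_i and P_ip1 are 3-vectors on the current epipolar plane):

```python
import numpy as np

def angle_between(a, b):
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))

def locate_ck(O_prj, P_i, P_ip1, phi_k):
    """Place C_k on [P_i, P_ip1] from its wrapped phase phi_k in [0, pi]."""
    theta = angle_between(P_i - O_prj, P_ip1 - O_prj)  # fringe aperture
    alpha = angle_between(O_prj - P_i, P_ip1 - P_i)    # angle at P_i
    theta_c = (theta / np.pi) * phi_k                  # equation (4)
    dist = (np.linalg.norm(P_i - O_prj) * np.sin(theta_c)
            / np.sin(np.pi - (theta_c + alpha)))       # sine law, equation (3)
    u = (P_ip1 - P_i) / np.linalg.norm(P_ip1 - P_i)
    return P_i + dist * u
```

The 3D point Pk itself is then obtained by intersecting the camera ray (ORGk) with the projector ray (OPrjCk), as in the text above.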


Conventional super-resolution techniques carry out a registration step between low-resolution data, a fusion step and a deblurring step. Here, the phase-shifting analysis directly provides registered left and right point clouds, since their 3D coordinates are computed from the same 3D sparse point cloud provided by stereo matching. Moreover, the left and right point clouds contain homogeneous 3D data and only need to be merged to retrieve the high-resolution 3D point cloud.

5 3D space-time super-resolution

A 3D face model can present some artifacts caused by an expression variation, an occlusion or even the facial surface reflectance. To deal with these problems, we propose to apply a 3D super-resolution through the time axis for each couple of successive 3D point sets Mt−1 and Mt at each moment t. First, a 3D non-rigid registration is performed, formulated as a maximum-likelihood estimation problem, since the deformation between two successive 3D faces is in general non-rigid. We employ the CPD (Coherent Point Drift) algorithm proposed in [13, 14] to register the 3D point set Mt−1 with the 3D point set Mt.

The CPD algorithm considers the alignment of two point sets Msrc and Mdst as a probability density estimation problem, and fits the GMM (Gaussian Mixture Model) centroids representing Msrc to the data points of Mdst by maximizing the likelihood, as described in [14]. The source point set Msrc provides the GMM centroids and the destination point set Mdst provides the data points. Nsrc is the number of points of Msrc = {sn | n = 1, ..., Nsrc}, and Ndst is the number of points of Mdst = {dn | n = 1, ..., Ndst}. To create the GMM for Msrc, a multivariate Gaussian is centered on each point of Msrc. All Gaussians share the same isotropic covariance matrix σ2I, I being a 3×3 identity matrix and σ2 the variance in all directions [13, 15]. Hence the whole point set Msrc can be considered as a Gaussian mixture model with the density p(d) defined by equation (5).

\[
p(d) = \sum_{m=1}^{N_{src}} \frac{1}{N_{src}}\, p(d \mid m), \qquad d \mid m \sim \mathcal{N}(s_m, \sigma^2 I). \tag{5}
\]
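As an illustration, the open-source pycpd package implements this formulation; a minimal sketch, assuming pycpd is installed (the synthetic point sets here merely stand in for Mt−1 and Mt):

```python
import numpy as np
from pycpd import DeformableRegistration  # open-source CPD implementation

rng = np.random.default_rng(0)
M_prev = rng.standard_normal((500, 3))                   # stands in for M_{t-1}
M_curr = M_prev + 0.05 * rng.standard_normal((500, 3))   # small deformation

# The GMM centroids (Y) are fitted to the data points (X).
reg = DeformableRegistration(X=M_curr, Y=M_prev)
M_prev_aligned, _ = reg.register()
```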

Once registered, the 3D point sets Mt−1 and Mt, together with their corresponding 2D texture images, are used as low-resolution data to create a high-resolution 3D point set and its corresponding high-resolution 2D texture image. We apply the 2D super-resolution technique proposed in [16], which solves an optimization problem of the form:

\[
\text{minimize } E_{data}(H) + E_{regular}(H). \tag{6}
\]

The first term Edata(H) measures the agreement of the reconstruction H with the aligned low-resolution data. Eregular(H) is a regularization or prior energy term that guides the optimizer towards a plausible reconstruction H.
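A minimal gradient-descent sketch of this kind of optimization, under our own simplifications (a quadratic data term, a Laplacian smoothness prior standing in for Eregular, and the upsampling operator used as an approximate adjoint of the downsampling; [16] uses more robust terms):

```python
import numpy as np

def superresolve(lows, up, down, lam=0.1, step=0.5, iters=200):
    """lows: list of aligned low-resolution 2D maps; up/down: linear
    upsampling and downsampling operators between the two grids."""
    H = up(np.mean(lows, axis=0))                 # initial reconstruction
    for _ in range(iters):
        grad = sum(up(down(H) - L) for L in lows) / len(lows)  # data term
        lap = (np.roll(H, 1, 0) + np.roll(H, -1, 0) +
               np.roll(H, 1, 1) + np.roll(H, -1, 1) - 4.0 * H)
        H -= step * (grad - lam * lap)            # descend E_data + E_regular
    return H
```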

The 3D model Mt cannot be represented by only one 2D disparity image, since the points situated on the fringe change-over are localized with subpixel precision.


Also, the left and right pixels participate separately in the 3D model, since the 3D coordinates of each pixel are retrieved using only its phase information, as described in section 4. Thus, we propose to create three left 2D maps defined by the X, Y and Z coordinates of the 3D left points, and likewise three right 2D maps defined by the X, Y and Z coordinates of the 3D right points. The optimization algorithm and the deblurring are applied to compute high-resolution left and right images of X, Y, Z and texture, separately, from the corresponding low-resolution left and right maps. We obtain a left high-resolution 3D point cloud Lt from the left high-resolution X, Y and Z data, and a right high-resolution 3D point cloud Rt from the right high-resolution X, Y and Z data. The final high-resolution 3D point cloud is retrieved by merging Lt and Rt, which are already registered since both of them contain the 3D sparse point cloud computed by stereo matching.
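The final merge is then a simple concatenation (a sketch with our own helper name; the six H×W coordinate maps come from the previous step):

```python
import numpy as np

def merge_views(Xl, Yl, Zl, Xr, Yr, Zr):
    """Flatten the left and right high-resolution X/Y/Z maps into two
    clouds L_t and R_t and stack them; no extra alignment is needed,
    as both embed the same sparse stereo cloud."""
    L_t = np.column_stack([Xl.ravel(), Yl.ravel(), Zl.ravel()])
    R_t = np.column_stack([Xr.ravel(), Yr.ravel(), Zr.ravel()])
    return np.vstack([L_t, R_t])
```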

6 Experimental results

The stereo system hardware consists of two network cameras with a 1200×1600 pixel resolution and an LCD video projector. The projected patterns are digitally generated. The projector's vertical axis is defined by a directional 3D vector Nproj and a 3D point Pproj. They are computed by analysing each 3D plane defined by all the 3D points situated on the vertical profile of the same fringe change-over. The equations of the fringe planes are estimated by a mean-square optimization with a precision of 0.002 mm. The directional vector Nproj is then computed as the vector normal to all the normal vectors of the fringe planes; Nproj is estimated with a deviation error of 0.003 rad. Finally, the point Pproj is computed as the intersection of all the fringe planes, again using a mean-square optimization.

Fig. 2. Primitives extraction on localized left and right faces: (a) left view; (b) right view.

Figure 2 presents the primitives extracted from the left and right views of a face with a neutral expression, and figure 3 presents the reconstruction steps to retrieve the corresponding dense 3D model. Figure 3.a presents the sparse 3D face model of 4486 points. The spatial super-resolution provides a 3D dense model of 148398 points, shown in figure 3.b before gamma correction and in figure 3.c after gamma correction. Figure 3.d presents the texture mapping result.

Fig. 3. 3D spatial super-resolution results: (a) sparse point cloud; (b) before gamma correction; (c) after gamma correction; (d) textured model.

Capturing a 3D face requires about 0.6 seconds with an Intel Core 2 Duo CPU (2.20 GHz) and 2 GB of RAM. The precision of the reconstruction is estimated using a laser-scanned 3D face model of 11350 points acquired by a MINOLTA VI-300 non-contact 3D digitizer. We perform a point-to-surface variant of the 3D rigid matching algorithm ICP (Iterative Closest Point) between a 3D face model provided by our approach and the laser 3D model of the same face. The mean deviation obtained between them is 0.3146 mm. Also, a plane with a non-reflective surface is reconstructed to measure the precision. Its sparse model contains 14344 points, and the phase-shifting process provides a final dense model of 180018 points. To measure the precision, we compute the plane's theoretical equation, which constitutes the ground truth, from three manually marked points on the plane. Computing the orthogonal distance of each vertex of the 3D point cloud to the theoretical plane yields a mean deviation of 0.0092 mm.
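The plane-precision check can be reproduced in a few lines (a sketch; the three marked points and the reconstructed cloud are assumed given):

```python
import numpy as np

def plane_from_points(p1, p2, p3):
    """Theoretical ground-truth plane n.x = d from three marked points."""
    n = np.cross(p2 - p1, p3 - p1)
    n = n / np.linalg.norm(n)
    return n, float(np.dot(n, p1))

def mean_orthogonal_deviation(cloud, n, d):
    """Mean absolute point-to-plane distance over an Nx3 point cloud."""
    return float(np.abs(cloud @ n - d).mean())
```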


Fig. 4. Left and right primitives of two successive frames: (a) first left view; (b) first right view; (c) second left view; (d) second right view.

Fig. 5. 3D spatial super-resolution results for two successive stereo frames: (a) first mesh; (b) second mesh; (c) first textured model; (d) second textured model.


Figure 4 presents the primitives extracted from the localized left and right views of two successive frames of a moving face with an expression variation, and figure 5 presents the corresponding dense 3D models computed by spatial super-resolution only. Figures 5.a and 5.b show the reconstructed meshes; figures 5.c and 5.d present their texture mapping results.

Fig. 6. 3D space-time super-resolution results: (a) mesh; (b) textured model.

At time t, the left and right cameras capture two different views of the face, which leads to some occluded regions and hence to artifacts, as shown in figure 5. The occluded regions are situated on the face border. Also, the left view of the second stereo frame presents an occlusion on the nose region, where the points situated on the fringe change-over are not localized, as shown in figure 4.c; this creates an artifact on the computed 3D model, as shown in figures 5.c and 5.d. To deal with these problems, the 3D information from the first and second 3D models is merged, despite their non-rigid deformation, thanks to the super-resolution approach proposed in section 5. As shown in figure 6, our approach enhances the quality of the computed 3D model and also completes its scanned 3D view.

7 Conclusion and future work

This paper proposes a 3D acquisition solution with a 3D space-time super-resolution scheme which is particularly suited to 3D face scanning. The proposed solution is a hybrid stereovision and phase-shifting approach, using two shifted patterns and a texture image. It is low-cost, fast, easily movable and robust to ambient lighting conditions. A scanned 3D face model can present some artifacts caused by an expression variation, an occlusion or even the facial surface reflectance. Super-resolution aims to enhance the quality of the face model and to complete the 3D scanned view in the presence of small non-rigid deformations such as facial expressions.


The super-resolution through the time axis fails to correct the 3D face when two successive 3D models present severe artifacts, which can then propagate through the following 3D frames. As future work, we suggest considering more 3D frames along the time axis to enhance the 3D video quality and to complete the scanned view from several previous 3D models.

8 Acknowledgments

This research is supported in part by the ANR project FAR3D under grant ANR-07-SESU-003.

References

1. Blais, F.: Review of 20 years of range sensor development. J. Electronic Imaging. 13, 231–240 (2004)

2. Zhang, S. and Yau, S.: Absolute phase-assisted three-dimensional data registration for a dual-camera structured light system. J. Applied Optics. 47, 3134–3142 (2008)

3. Zhang, L. and Curless, B. and Seitz, S. M.: Rapid shape acquisition using color structured light and multi-pass dynamic programming. 3DPVT Conference. (2002)

4. Zhang, S. and Yau, S.: Generic nonsinusoidal phase error correction for three-dimensional shape measurement using a digital video projector. J. Applied Optics. 46, 36–43 (2007)

5. Zhang, S.: Recent progresses on real-time 3D shape measurement using digital fringe projection techniques. J. Optics and Lasers in Engineering. 48, 149–158 (2010)

6. Cox, I. and Hingorani, S. and Rao, S.: A maximum likelihood stereo algorithm. J. Computer Vision and Image Understanding. 63, 542–567 (1996)

7. Ouji, K. and Ardabilian, M. and Chen, L. and Ghorbel, F.: Pattern Analysis for an Automatic and Low-Cost 3D Face Acquisition Technique. ACIVS Conference. 666–673 (2009)

8. Lu, Z. and Tai, Y. and Ben-Ezra, M. and Brown, M. S.: A Framework for Ultra High Resolution 3D Imaging. CVPR Conference. (2010)

9. Klaudiny, M. and Hilton, A. and Edge, J.: High-detail 3D capture of facial performance. 3DPVT Conference. (2010)

10. Zhang, Z.: Flexible Camera Calibration by Viewing a Plane from Unknown Orientations. ICCV Conference. (1999)

11. Kil, Y. and Mederos, Y. and Amenta, N.: Laser scanner super-resolution. Eurographics Symposium on Point-Based Graphics. (2006)

12. Schuon, S. and Theobalt, C. and Davis, J. and Thrun, S.: LidarBoost: Depth Superresolution for ToF 3D Shape Scanning. CVPR Conference. (2009)

13. Myronenko, A. and Song, X. and Carreira-Perpinan, M. A.: Non-rigid point set registration: Coherent Point Drift. NIPS Conference. (2007)

14. Myronenko, A. and Song, X.: Point set registration: Coherent Point Drift. IEEE Trans. PAMI. 32, 2262–2275 (2010)

15. Cui, Y. and Schuon, S. and Chan, D. and Thrun, S. and Theobalt, C.: 3D Shape Scanning with a Time-of-Flight Camera. 3DPVT Conference. (2010)

16. Farsiu, S. and Robinson, D. and Elad, M. and Milanfar, P.: Fast and robust multi-frame super-resolution. IEEE Trans. Image Processing. (2004)