3-D studio production of animated actor models

A. Hilton, M. Kalkavouras and G. Collins

Abstract: A framework for construction of detailed animated models of an actor’s shape and appearance from multiple view images is presented. Multiple views of an actor are captured in a studio with controlled illumination and background. An initial low-resolution approximation of the person’s shape is reconstructed by deformation of a generic humanoid model to fit the visual hull using shape constrained optimisation to preserve the surface parameterisation for animation. Stereo reconstruction with multiple view constraints is then used to reconstruct the detailed surface shape. High-resolution shape detail from stereo is represented in a structured format for animation by displacement mapping from the low-resolution model surface. A novel integration algorithm using displacement maps is introduced to combine overlapping stereo surface measurements from multiple views into a single displacement map representation of the high-resolution surface detail. Results of 3-D actor modelling in a 14 camera studio demonstrate improved representation of detailed surface shape such as creases in clothing compared to previous model fitting approaches. Actor models can be animated and rendered from arbitrary views under different illumination to produce free-viewpoint video sequences. The proposed framework enables rapid transformation of captured multiple view images into a structured representation suitable for realistic animation.

1 Introduction

Realistic representation of real people remains a primary goal of computer graphics research. Model construction in a structured format is a major bottleneck in the widespread use of shape capture for computer animation. Currently, manual techniques are widely used in film production to build models of people suitable for realistic animation.

Previous research has introduced the use of active 3-D measurement technologies or multiple camera studios to capture the shape and appearance of a moving person [1–5]. Shape-from-silhouette and multiple view stereo are used to reconstruct sequences of 3-D shape and appearance and render them from novel viewpoints. However, the resulting sequences are unstructured and do not allow modification of the actor’s movement.

Recent research [6, 7] has introduced model fitting techniques for automatic reconstruction of structured animated models from multiple view image sequences. Structured models allow modification of a person’s movement in a standard animation pipeline and provide an efficient representation of a person’s shape, appearance and motion. These approaches fit a generic humanoid model to the multiple view images using shape-from-silhouette [6] and model-based stereo refinement [7]. However, owing to the fitting of a generic model to the captured data, these approaches do not accurately reproduce the fine detail of surface shape such as creases in clothing. There are two problems common to previous model fitting approaches: the visual hull from multiple view silhouettes only provides a coarse representation of a person’s shape without concavities; and fitting a generic (polygonal) model is limited to the degrees of freedom of the representation, resulting in loss of fine surface detail.

Displacement mapping techniques have been used previously to represent fine surface detail for reconstruction of structured animated models from high-resolution range images of objects [8–11]. In this paper we introduce novel displacement mapping techniques for representation of surface detail reconstructed using passive stereo. Stereo reconstruction results in relatively noisy estimates of surface shape, requiring robust methods for integration of overlapping surface measurements. Multiple view stereo and displacement mapping techniques are presented that allow detailed modelling of a person’s shape in a structured form suitable for animation.

2 Previous work

Advances in active sensor technology, together with research in computer vision and graphics, have resulted in systems for capturing surface models of complex 3-D objects, people and internal environments [12–14]. These approaches capture accurate and realistic 3-D models of complete objects with a level of detail not possible with previous manual techniques. However, such techniques result in object models represented as unstructured polygonal meshes consisting of millions of polygons. Conversion of such models to a structured form suitable for animation requires labour-intensive manual remeshing. Structuring captured data in a form suitable for animation is a challenging problem that has received only limited attention for character models [8, 15, 16] and people [11, 17]. Allen et al. [9] have recently used captured models of naked people in multiple static poses to animate a person’s skin deformation during movement. However, active sensors are currently limited to the acquisition of static objects.

Research in computer vision has investigated passive methods for reconstructing models of people and their movement from multiple view images. Kanade et al. [1] demonstrated the first multiple camera studio to capture a moving person’s 3-D shape and appearance. This was used for ‘Virtualised Reality’ production of dynamic events with arbitrary camera viewpoint, illumination and background scene. Subsequent research [2–5] has used shape-from-silhouette and multiple-view stereo techniques to reconstruct sequences of 3-D people. These approaches produce an unstructured sequence of 3-D surfaces, which do not support modification of the movement or efficient representation for animation. In addition, both multiple view stereo and shape-from-silhouette fail to reconstruct accurately the detailed surface shape owing to visual ambiguities, resulting in a loss of visual fidelity compared to the captured images [7].

© IEE, 2005

IEE Proceedings online no. 20045114

doi: 10.1049/ip-vis:20045114

The authors are with the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, UK

E-mail: [email protected], [email protected]

Paper first received 15th July 2004 and in revised form 28th January 2005

IEE Proc.-Vis. Image Signal Process., Vol. 152, No. 4, August 2005 481

Terzopoulos [18] introduced the reconstruction of structured ‘functional models’ instrumented for facial animation from captured 3-D face shape of real people. This approach was applied to single camera reconstruction of animated models of people by fitting a generic humanoid model to silhouette images [19]. Recent research [6, 7] has addressed the problem of reconstructing structured models of people from multiple simultaneous camera views. These approaches fit a generic humanoid animation model to silhouette and stereo reconstruction from multiple camera views of a person in an arbitrary pose, resulting in realistic models. Carranza et al. [6] automatically estimated the person’s pose and fitted a polygonal humanoid model to the visual hull reconstructed from multiple view image silhouettes. The quality of reconstructed texture mapped models is limited by the visual hull, which only reconstructs an approximate, locally convex, representation of the surface, resulting in coarse geometry and visual artefacts owing to incorrect alignment of overlapping images in texture map generation. Stereo surface reconstruction has been used to improve the recovery of surface detail [7, 20, 21]. Plaenkers and Fua [21] used implicit surface models to simultaneously reconstruct shape and movement from stereo point clouds. Results demonstrated reconstruction of complex movements with self-occlusion. Shape reconstruction is limited to a smooth surface owing to the use of an implicit surface representation based on ellipsoidal volumetric primitives (meta-balls). Model-based reconstruction using shape constrained fitting [7] overcomes the limitations of shape-from-silhouette and stereo for reconstruction in the presence of visual ambiguities, such as stereo correspondence in uniform surface regions. However, owing to the fitting of a generic model with prior assumptions of local surface shape, the reconstructed model may not accurately represent fine surface detail.

In this paper we build on previous approaches for model-based reconstruction of people [6, 7] by introducing techniques to reconstruct and represent additional high-resolution surface detail which is not represented in surface fitting. This approach combines model-based reconstruction of structured representations in the presence of visual ambiguities with accurate representation of surface detail which is not present in the generic model, such as creases in clothing or hair.

3 Reconstruction of animated actor models

This Section presents the algorithms developed to reconstruct animated models of people from multiple view images. Section 3.1 presents an overview of the framework for model reconstruction. Subsequent Sections present novel contributions of the constrained shape fitting, multiple view stereo, displacement map integration and representation algorithms.

3.1 Model reconstruction from multiple views

The pipeline for reconstructing animated models from multiple view images is illustrated in Fig. 1. The output from the system is a layered representation of the captured data, which is structured in a form suitable for realistic and efficient animation. The representation consists of three layers: articulated skeleton, control model and displacement/texture map. The skeleton and control model provide a generic structure for animation, with the control model surface non-rigidly deformed by animation of the skeleton. A displacement map is then used to represent the captured surface detail by mapping the stereo data onto the control model surface. This work introduces robust methods for displacement map representation of surface measurements from multiple passive stereo images with a relatively high noise level compared to previous methods applied to active range sensors. Texture mapping represents the detailed surface appearance reconstructed from multiple view images. This layered representation enables animation of the high-resolution captured surface detail based on deformation of the underlying control model.

Fig. 1 Overview of framework for producing animated actor models from multiple view studio capture (pipeline stages: studio capture input; generic model alignment; shape from silhouette; shape constrained fitting; multi-view stereo correspondence; displacement mapping; texture mapping; animation output)

The novel reconstruction approach combines narrow baseline stereo to estimate the detailed surface shape with wide baseline multiple view constraints for robust reconstruction. Novel displacement mapping algorithms are introduced to integrate estimates of surface shape from multiple stereo pairs and represent the surface detail. The challenge is to establish a mapping between the generic model and captured data with the correct correspondence for non-rigid deformation. Reconstruction of a layered representation from multiple view stereo data comprises five stages:

1. Model registration: Initially the generic control model is manually posed for approximate alignment with the captured data. Joint centres are approximately aligned with the observed pose and constrained optimisation is used to estimate the kinematic structure (limb lengths) and pose. Anthropometric constraints preserve the skeletal symmetry of limb lengths during fitting.
2. Shape fitting: Shape constrained surface fitting is used to deform the control model surface to approximate the visual hull reconstructed from multiple view silhouettes [7]. Shape constraints preserve the control model parameterisation required for animation.
3. Stereo reconstruction: Narrow-baseline stereo is used to reconstruct measurements of surface shape from each adjacent camera pair. Wide baseline multiple view constraints are used to dynamically define the disparity search range, adapt the correlation window size and remove outliers from the stereo correspondence estimates. Stereo provides dense reconstruction of the detailed surface shape.
4. Displacement mapping: Displacement mapping is used to integrate and represent the high-resolution estimates of surface shape from multiple view stereo. Normal-volume mapping [16] is used to establish a continuous correspondence between the generic model surface and each set of stereo data. Novel methods are introduced for integration of overlapping estimates of surface shape from stereo to reduce noise and reconstruct a single high-resolution surface model.

5. Texture mapping: Multiple view stereo reconstructs a surface that aligns surface appearance between views. An integrated texture map image is reconstructed by combining the aligned images between multiple overlapping views. A multi-resolution Gaussian pyramid is used to combine overlapping images and fill texture holes in surface regions which are not visible from the captured views.

Steps 3 and 4 are new to the approach presented in this paper compared to previous model-based reconstruction algorithms [6, 7]. This framework results in a layered representation of the person with the generic model structure and a common image-based representation of the high-resolution surface shape and appearance. The resulting model can be animated and rendered for production of image sequences with novel movement. Figure 2a–d presents an example of the reconstruction stages: the posed generic model (a), shape constrained fitting of the generic model (b), displacement mapping of high-resolution surface detail from multiple view stereo (c) and the final texture mapped model (d). It should be noted that there is significantly more geometric detail, such as clothing folds, in Fig. 2c compared to the generic model fitted to silhouette and stereo data in Fig. 2b.

3.2 Shape constrained fitting

Once the generic model is posed to match the 3-D data set, the next stage is to deform the shape of the generic control model so that it conforms closely to the 3-D surface [7]. A requirement for natural animation of the conformed control model is that the mesh topology and vertex parameterisation do not change during conformance. A shape constrained deformable model is used to preserve the prior parameterisation of the control model while fitting to the 3-D data set. The novelty of this approach lies in the formulation of a unique parameterisation for arbitrary triangular meshes, which is used as the internal energy in mesh deformation.

The deformable surface model x minimises the energy function E(x), incorporating the potential energy from data fitting, P(x), and the internal energy from the shape of the model, S(x):

E(x) = P(x) + S(x)   (1)

In previous work internal energy terms have been derived based on treating the surface as a membrane or thin-plate material under tension [18]. This yields the surface with minimum area or distortion that fits to the data. However, the thin-plate and membrane energy do not preserve the surface parameterisation, which is essential for animation.

Fig. 2 Sequence of model reconstruction: a posed control model; b control model fit to visual hull; c high-resolution model; d high-resolution model with texture

For an arbitrary triangular mesh a vertex position is not well defined in relation to the vertex neighbourhood. With an irregular number of vertices in the 1-neighbourhood it is not possible to obtain a consistent definition of a local frame to describe the position of the central vertex. We therefore consider a triangle face-based scheme as used by Kobbelt et al. [22]. The vertex positions of a triangle face can be defined by the barycentric coordinates and height offset in the local frame of the vertices on the faces edge-connected to the central face, i.e. the vertices surrounding the central triangle.

The position of a mesh vertex is therefore constrained by the local position in the triangle face frames in the 1-neighbourhood of the vertex, leading to a 2-neighbourhood support structure for a vertex position.
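The face-frame encoding described above can be illustrated with a small sketch (the function names are hypothetical and NumPy is assumed; this is not the authors' implementation): a point is encoded as barycentric coordinates in the plane of a triangle plus a signed height offset along the triangle normal, and can be decoded back exactly.

```python
import numpy as np

def encode_in_face_frame(p, a, b, c):
    """Encode point p as barycentric coordinates (alpha, beta) and a
    signed height offset h along the unit normal of triangle (a, b, c)."""
    n = np.cross(b - a, c - a)
    n /= np.linalg.norm(n)
    h = np.dot(p - a, n)          # signed distance from the triangle plane
    q = p - h * n                 # foot point projected into the plane
    # Solve q = a + alpha*(b - a) + beta*(c - a) for (alpha, beta)
    T = np.column_stack((b - a, c - a))
    alpha, beta = np.linalg.lstsq(T, q - a, rcond=None)[0]
    return alpha, beta, h

def decode_from_face_frame(alpha, beta, h, a, b, c):
    """Reconstruct the point from its (alpha, beta, h) face-frame encoding."""
    n = np.cross(b - a, c - a)
    n /= np.linalg.norm(n)
    return a + alpha * (b - a) + beta * (c - a) + h * n
```

The round trip is exact for any point, which is what makes the encoding usable as a reference shape in an internal energy term.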

We define the internal energy of a shape constrained model as the integral across the surface of the deviation in the local shape from the generic shape defined in each face-based frame [23]. This is given by the summation of the error at the mesh vertices x_i, preserving the local parameterisation and shape in the vertex positions. Equation (2) defines the internal energy, where (a⁰_if, b⁰_if, h⁰_if) are the default barycentric coordinates (a, b) and height offset h in the fth face-based frame for the ith vertex with valence N_i:

S(x) = Σ_i Σ_f (1/N_i) ‖x_i - x(a⁰_if, b⁰_if, h⁰_if)‖²   (2)

This constraint preserves the local shape of the generic humanoid model during fitting to ensure that the parameterisation of the control model M_L for animation is preserved. For fitting to the visual hull, the external energy P(x) of (1) is derived by summing a data fit error e(x) across the model surface x(u, v). We define the error metric in fitting the data as the least-squared error between the model and the 3-D data set. The potential energy function is given by (3), where x_i spans the set of I model vertices and y_j spans the set of J 3-D data points.

P(x) = Σ_i Σ_j m_ij ‖y_j - x_i‖²   (3)

Fitting to the visual hull is performed by an iterative gradient descent solution to minimise the energy function (1). Further details of the fitting procedure are presented in [7]. Figures 2a and b show an example of the control model surface before and after shape constrained fitting to the visual hull. This process preserves the generic control model animation structure and mesh parameterisation, allowing the resulting model to be animated.
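A toy sketch of the iterative gradient descent on E(x) = P(x) + S(x) follows. This is a deliberate simplification, not the paper's method: the data term of (3) is reduced to nearest-point attraction and the shape term of (2) is reduced to attraction towards default positions s0, standing in for the face-frame shape constraint. NumPy is assumed.

```python
import numpy as np

def fit_mesh(x, y, s0, lam=1.0, step=0.1, iters=200):
    """Gradient descent on a toy E(x) = P(x) + S(x).
    x  : (N, 3) model vertices
    y  : (M, 3) 3-D data points
    s0 : (N, 3) default 'generic shape' positions (stand-in for the
         face-frame shape term); lam weights the shape energy."""
    for _ in range(iters):
        # Data term gradient: pull each vertex towards its nearest data point
        d = x[:, None, :] - y[None, :, :]
        nearest = np.argmin((d ** 2).sum(-1), axis=1)
        grad_P = 2.0 * (x - y[nearest])
        # Shape term gradient: penalise deviation from the default shape
        grad_S = 2.0 * lam * (x - s0)
        x = x - step * (grad_P + grad_S)
    return x
```

With equal weighting the fixed point balances the two energies; a vertex at the default position 0 attracted to a data point at 1 converges to the midpoint 0.5.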

3.3 Stereo surface reconstruction

Stereo produces detailed surface reconstruction in regions of non-uniform appearance, and noisy or erroneous measurements in regions of uniform appearance. In this work we enhance existing stereo algorithms [24, 25] to improve reconstruction accuracy and reduce outliers for multiple camera capture in a controlled studio environment. Three novel constraints for multiple view stereo are introduced: elimination of correspondences outside the visual hull; a dynamic disparity range for correspondence inside the visual hull; and neighbourhood consistency for ambiguous correspondences. An adaptive correlation window size is also used to avoid ambiguities in regions close to the silhouette boundary. The enhanced stereo algorithm reduces the number of false correspondences which would result in outliers or noise in conventional pairwise stereo algorithms.

Dense stereo correspondence between each pair of adjacent cameras is estimated as follows:

1. Rectification: Images are rectified in order to align the epipolar lines horizontally between camera views.
2. Dynamic disparity range: The disparity range to search for correspondence between images is calculated dynamically for each pixel. The disparity range is constrained to lie inside the visual hull with a maximum distance d_max from the visual hull surface. The ray corresponding to a pixel in one image is intersected with the visual hull to evaluate the segment of the ray inside the visual hull with distance less than d_max. The corresponding section of the epipolar line in the other image is then used to constrain the correspondence search. This procedure is performed in both directions to ensure consistency of stereo correspondence. A value d_max = 0.1 m has been used throughout this work, which corresponds to the maximum expected deviation between the visual hull and true surface together with calibration error between views of ±0.01 m ≈ 1 pixel. The resulting disparity search range along the epipolar line is approximately 10 pixels. Dynamic disparity range reduces the number of incorrect stereo matches and improves computational efficiency.
3. Disparity search: Normalised cross-correlation is used to measure correspondence between an n × m pixel window in each image by searching along the epipolar line within the disparity range. Adaptive matching is used to reduce the window size in regions close to the silhouette boundary whilst maintaining a larger window size in internal regions for accurate localisation. All correlation peaks above a threshold t_c are identified and stored. In this paper a window size of 11 × 11 is used in interior regions and 5 × 5 in regions near the silhouette border. Windows that fall outside the silhouette boundary are excluded. A correlation threshold t_c = 0.7 has experimentally been found to give a good compromise between identifying the correct and spurious correlation peaks (note: pixels with multiple peaks are disambiguated in subsequent processing).
4. Neighbourhood consistency: A two-pass algorithm is introduced to estimate the correct disparity for each pixel. In the first pass only pixels with a single disparity peak above the correlation threshold t_c are accepted. The second pass uses a median filter to identify the median neighbourhood disparity peak for adjacent pixels with an unambiguous single disparity value. The correct peak for a pixel with multiple peaks is constrained to lie within ±t_d pixels of the median neighbourhood disparity. Throughout this work t_d = 2 pixels is used; this is the minimum distance between adjacent peaks and imposes a strict threshold to robustly eliminate incorrect peaks. Neighbourhood consistency results in a dense disparity map with holes in ambiguous surface regions owing to low correlation peaks or ambiguity in local appearance.
5. Noise removal: A second median filter with a window size equal to the correlation matching window size is applied to all disparity estimates to remove remaining outliers and reduce noise. Any disparity estimate that is outside ±t_d pixels of the median disparity for the corresponding correlation window is removed as noise. This reduces correspondence noise in the dense disparity map by ensuring neighbourhood consistency for smooth surfaces. The assumption of smooth surfaces is valid in the reconstruction of stereo disparity maps for people, where the visual hull constraint eliminates surface discontinuities.


6. Sub-pixel disparity estimation: A quadratic function is fitted to the correspondence estimates over a 5 pixel neighbourhood.
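Steps 3 and 6 can be sketched for a single pixel as follows (an illustrative simplification assuming NumPy, with hypothetical function names; the fixed window, thresholds and disparity conventions here are placeholders for the adaptive scheme described above): normalised cross-correlation is evaluated along the rectified scanline within a given disparity range, and the discrete peak is refined by fitting a quadratic through the three scores around it.

```python
import numpy as np

def ncc(w1, w2):
    """Normalised cross-correlation between two equally sized windows."""
    a = w1 - w1.mean()
    b = w2 - w2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def match_pixel(left, right, r, c, d_min, d_max, half=5, tc=0.7):
    """Search scanline r for the disparity maximising NCC within
    [d_min, d_max] (left pixel (r, c) matches right pixel (r, c - d));
    refine the discrete peak with a quadratic sub-pixel fit."""
    w1 = left[r - half:r + half + 1, c - half:c + half + 1]
    scores = []
    for d in range(d_min, d_max + 1):
        w2 = right[r - half:r + half + 1, c - d - half:c - d + half + 1]
        scores.append(ncc(w1, w2) if w2.shape == w1.shape else -1.0)
    scores = np.array(scores)
    k = int(scores.argmax())
    if scores[k] < tc:
        return None                      # no confident correspondence
    if 0 < k < len(scores) - 1:          # quadratic sub-pixel interpolation
        c0, c1, c2 = scores[k - 1], scores[k], scores[k + 1]
        denom = c0 - 2 * c1 + c2
        if denom != 0:
            return d_min + k + 0.5 * (c0 - c2) / denom
    return float(d_min + k)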

The stereo algorithm results in a dense disparity map with relatively few outliers owing to incorrect correspondence or noise. Owing to the redundancy of overlapping disparity maps from multiple views, elimination of outliers is preferable to recovering a surface estimate at every pixel. Figure 3 presents results of the stereo reconstruction for one camera pair. Dense stereo is reconstructed for a large proportion of the surface, but holes also occur owing to the robust outlier removal using the multiple view constraints. Figures 3c and d show the stereo reconstruction of high-resolution surface detail such as ridges in the clothing. This high-resolution surface detail is not present in the model resulting from the multiple view model fitting [7]. Integration of reconstruction from multiple stereo views is used to remove holes and obtain high-resolution data for the entire surface.

3.4 Integration of multiple view stereo

In this Section we introduce a robust method to integrate estimates of surface shape from multiple stereo pairs into a single high-resolution surface. Previous algorithms for integration of multiple view range images [12, 13, 26] assume relatively accurate surface measurements from active 3-D sensors using projected structured light. Stereo estimates of surface geometry for people in normal clothing are relatively noisy owing to the absence of local variations in surface appearance required for accurate correspondence localisation. Therefore, a robust approach to integration of surface measurements is required.

In this paper we introduce a novel approach to integration of surface shape estimates from stereo using displacement maps. Initially a displacement map image representation is reconstructed for the surface estimates from each stereo pair. Robust integration of the individual displacement map images is then performed to obtain a unified displacement map image representation of the high-resolution surface detail.

Displacement mapping using the normal-volume was previously introduced [16] to allow representation and animation of a high-resolution polygonal surface model, M_H, as displacements from a low-resolution polygonal animation control model, M_L. The normal-volume for triangles on a polygonal mesh, M_L, defines a continuous spatial mapping between the surface of the low-resolution model and the volume occupied by the high-resolution model, M_H. Previous work [11, 16] used this approach to represent efficiently high-resolution surface measurements captured using active 3-D sensors such as laser scanners. It was also shown that this parameterisation could be used to animate seamlessly the high-resolution model based on deformation of the underlying control model.

Here we use the normal-volume displacement mapping technique to represent and integrate the relatively noisy surface measurements from multiple view stereo. Displacement map representation of individual stereo shape estimates allows the correspondence between multiple noisy sets of overlapping surface measurements to be established. This overcomes limitations of previous implicit surface [12, 13] and mesh-based nearest-point [27, 28] range image integration algorithms in the presence of significant measurement noise.
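The core idea of evaluating a displacement-mapped point can be illustrated with a simplified sketch: a surface point is the barycentric position on a control triangle offset along an interpolated per-vertex normal. Note this interpolated-normal form is a common simplification, not the exact normal-volume construction of [16]; NumPy is assumed and the function name is hypothetical.

```python
import numpy as np

def displace(verts, normals, alpha, beta, d):
    """Evaluate a displacement-mapped surface point: the barycentric
    position (alpha, beta) on the control triangle, offset by displacement
    d along the normalised interpolated per-vertex normal.
    verts, normals : (3, 3) arrays of triangle vertices and vertex normals."""
    gamma = 1.0 - alpha - beta
    base = gamma * verts[0] + alpha * verts[1] + beta * verts[2]
    n = gamma * normals[0] + alpha * normals[1] + beta * normals[2]
    n /= np.linalg.norm(n)
    return base + d * n
```

Because the offset direction deforms with the control triangle's vertices and normals, displacements stored this way follow the control model under animation.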

The multiple view stereo integration algorithm is performed in the following steps:

1. Stereo reconstruction: Stereo reconstruction of a dense disparity image Iij between adjacent camera views i and j is performed using the algorithm introduced in Section 3.3.

2. Constrained triangulation: For each pixel Iij(r, s) in the disparity image for which a valid disparity has been estimated, we estimate the corresponding 3-D surface position x(r, s). A high-resolution triangulated mesh MHij is then constructed by constrained triangulation [13] of adjacent surface position estimates x(r, s), x(r+1, s), x(r, s+1), x(r+1, s+1). A triangle is added to the mesh if the Euclidean distance between adjacent surface measurements satisfies |x(r, s) - x(p, q)| < dt, where dt = 3dx and dx is the surface sampling resolution. For a capture volume of 2 m^3 with image resolution 720 x 576 we have dx ≈ 0.01 m, which is used throughout this work. Constrained triangulation has been used in previous range image integration algorithms to construct a mesh that approximates the local surface topology [12, 13]. This results in a mesh MHij that approximates the local surface geometry but does not connect across step discontinuities that occur at occlusion boundaries.

Fig. 3 Stereo results from one camera pair
a Full-body
b Close-up torso
c Close-up with no texture (wireframe)
d Close-up with no texture (shading)
Note the reconstruction of detailed creases in the clothing shown in (c) and (d)

IEE Proc.-Vis. Image Signal Process., Vol. 152, No. 4, August 2005 485

3. Displacement map representation: Normal-volume displacement mapping [16] is used to obtain a displacement image representation Dij(u, v) of the high-resolution stereo mesh MHij mapped onto the low-resolution surface ML of the generic model fitted to the visual hull in step 2 of the multiple view reconstruction algorithm. Two-dimensional texture image coordinates u = (u, v) for each vertex vL of the low-resolution model ML are used to map the normal-volume displacements to a 2-dimensional displacement image Dij(u, v) in a process analogous to texture mapping. Steps 1–3 result in a displacement map image Dij(u, v) for each stereo pair in a common 2-D coordinate frame, as shown in Fig. 4.

4. Displacement map integration: Individual displacement map images Dij(u, v) are integrated into a single displacement map D(u, v) using a robust filtering approach. For a pixel (u, v) in the integrated displacement map image D we compute the average neighbourhood displacement dave(u, v) using an n x m window:

dave(u, v) = (1/nm) Σ_{p=u-n/2}^{u+n/2} Σ_{q=v-m/2}^{v+m/2} D(p, q)

We then select, from all individual displacement maps Dij with a valid displacement value at pixel (u, v), the displacement closest to the average neighbourhood displacement: d(u, v) = dij(u, v) minimising |dij(u, v) - dave(u, v)|. If there are no valid neighbourhood displacements then we select the smallest displacement value, which lies closest to the visual hull. Throughout this work a 5 x 5 window is used to evaluate the average neighbourhood displacement at a given pixel. The result is a single displacement map image D(u, v) that represents the integrated high-resolution surface detail from multiple stereo pairs, as shown in Fig. 4d.
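The constrained triangulation of step 2 can be sketched as follows. This is a minimal illustration under stated assumptions: `points` is a grid of 3-D positions x(r, s) derived from the disparity image, `valid` is the valid-disparity mask, and the function names are ours:

```python
import numpy as np

def constrained_triangulation(points, valid, dx=0.01):
    """Sketch of constrained triangulation over a disparity-derived point grid.
    `points` is an (H, W, 3) array of surface positions x(r, s); `valid` is an
    (H, W) boolean mask of valid disparities.  Two triangles are attempted per
    2x2 pixel neighbourhood; a triangle is kept only if every pair of its
    vertices is closer than d_t = 3 * dx, so the mesh does not connect across
    step discontinuities at occlusion boundaries."""
    dt = 3.0 * dx
    H, W, _ = points.shape
    triangles = []

    def close(a, b):
        return np.linalg.norm(points[a] - points[b]) < dt

    for r in range(H - 1):
        for s in range(W - 1):
            corners = [(r, s), (r + 1, s), (r, s + 1), (r + 1, s + 1)]
            if not all(valid[c] for c in corners):
                continue
            for tri in [(corners[0], corners[1], corners[2]),
                        (corners[1], corners[3], corners[2])]:
                if close(tri[0], tri[1]) and close(tri[1], tri[2]) and close(tri[0], tri[2]):
                    triangles.append(tri)
    return triangles
```

With dx ≈ 0.01 m, adjacent measurements on a smooth surface pass the dt test, while a jump across an occlusion boundary leaves the corresponding triangles out of the mesh.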

The integration algorithm overcomes a number of limitations of previous approaches to allow integration of relatively noisy 3-D surface measurements from passive stereo. Specifically, the integration algorithm avoids: averaging of overlapping surface measurements, which may reduce reconstructed surface detail owing to misalignment; step discontinuities that occur at transitions between overlapping sets of surface measurements if no averaging is used; and inclusion of outliers that are inconsistent with the reconstructed surface in a local neighbourhood. Previous range image integration algorithms [12, 13] use surface visibility to combine measurements based on either a weighted average or a best-view criterion. In the case of relatively inaccurate stereo measurements it has been found that averaging results in a loss of surface detail. Likewise, use of a best-view criterion results in visible step discontinuities at the transition between estimates of surface shape from different stereo pairs. The closest-value approach introduced in this work overcomes these problems, resulting in reconstruction of a continuous high-resolution surface representation with preservation of surface detail from individual stereo views.
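The closest-value integration can be sketched as below. One assumption is ours: the paper selects the candidate closest to the neighbourhood average of the integrated map, and this sketch seeds that neighbourhood estimate with the per-pixel mean of the valid candidates (the paper does not specify an initialisation); all names are illustrative:

```python
import numpy as np

def integrate_displacement_maps(maps, valid, window=5):
    """Sketch of closest-value displacement integration.  `maps` is a
    (K, H, W) stack of per-stereo-pair displacement images D_ij and `valid`
    a matching boolean mask.  Per pixel, the candidate displacement closest
    to the n x m neighbourhood average d_ave is selected, which rejects
    outliers without averaging away genuine surface detail."""
    K, H, W = maps.shape
    counts = valid.sum(axis=0)
    # Seed estimate: per-pixel mean of valid candidates (our assumption)
    seed = np.where(counts > 0,
                    (maps * valid).sum(axis=0) / np.maximum(counts, 1), 0.0)

    half = window // 2
    result = np.zeros((H, W))
    for u in range(H):
        for v in range(W):
            if counts[u, v] == 0:
                continue  # no valid measurement at this pixel
            # neighbourhood average d_ave(u, v) over the seeded map
            win = seed[max(0, u - half):u + half + 1,
                       max(0, v - half):v + half + 1]
            d_ave = win.mean()
            # choose the valid candidate closest to d_ave
            cand = maps[valid[:, u, v], u, v]
            result[u, v] = cand[np.argmin(np.abs(cand - d_ave))]
    return result
```

Because a single value is selected rather than blended, an outlying stereo estimate at a pixel is discarded outright while consistent measurements pass through unaltered.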

Results from our displacement map integration algorithm can be seen in Fig. 4. Three displacement map images for individual stereo pairs (right, middle and left) are shown in Figs. 4a–c. Note the holes in individual displacement maps, which are shown in the same uniform grey as the background, indicating zero displacement. Figure 4d shows the integrated displacement map image with holes filled from the data in individual views and preservation of the detailed high-resolution structure from individual stereo pairs. Displacement map images are coded in the range ±0.02 m, indicating the difference between the low-resolution control model and the detailed surface reconstructed using stereo.

3.5 Texture map generation

The final stage in model reconstruction is integration of the multiple view images into a single texture image [7]. Stereo reconstruction between adjacent camera views gives good estimates of correspondence for surface regions with local variation in appearance. This correspondence is used to align the multiple view images and resample a single texture image for the model surface. Accurate correspondence reduces misalignment in the texture map between visible features, which simplifies the generation of a single texture map image. Texture integration is performed in the following steps:

1. Texture mapping individual images: A texture map image Tij is computed for each stereo pair using the corresponding displacement/texture map coordinates Dij for the estimated surface shape mapped onto the low-resolution model ML. The captured image Ii(r, s) is then resampled into the texture image Tij(u, v) using bilinear interpolation for each pixel (r, s) with a valid stereo correspondence. For invalid pixels with no stereo match, which occur in regions of uniform appearance, a zero displacement value is assumed and texture resampling is performed to avoid holes in the resulting texture map. This results in a set of overlapping texture map images Tij(u, v) in the 2-dimensional texture image space (u, v) defined a priori for the low-resolution model ML.
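The bilinear interpolation used to resample a captured image into texture space is standard; a minimal single-channel sketch (the function name is ours):

```python
import numpy as np

def bilinear_sample(image, r, s):
    """Bilinear interpolation of a single-channel image at continuous
    coordinates (r, s), as used when resampling a captured image I_i(r, s)
    into the texture image T_ij(u, v).  Clamps to the image border."""
    H, W = image.shape
    r0, s0 = int(np.floor(r)), int(np.floor(s))
    r1, s1 = min(r0 + 1, H - 1), min(s0 + 1, W - 1)
    fr, fs = r - r0, s - s0
    top = (1 - fs) * image[r0, s0] + fs * image[r0, s1]
    bot = (1 - fs) * image[r1, s0] + fs * image[r1, s1]
    return (1 - fr) * top + fr * bot
```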

Fig. 4 Displacement map images for multiple stereo views and integrated displacement map image
a Right view
b Middle view
c Left view
d Integrated
Displacements are colour coded from black (-0.02 m) to white (0.02 m) relative to the control model surface (see Fig. 7d); the uniform black background colour indicates zero displacement


2. Texture map integration: Integration of texture map images Tij into a single texture map is performed using a best-view criterion. Initially, for each pixel (u, v), data from the best view Tpq(u, v) with a valid displacement map estimate Dpq(u, v) are resampled into the integrated texture map image T(u, v). The best-view criterion uses the angle between the camera view and the surface normal [7]. Any missing pixels are then filled in iteratively with the next best view having a valid displacement map value. Finally, holes with no estimated displacement map are filled with the image data from the best view based on the angle between the low-resolution model ML surface normal and the viewing direction. This process resamples the multiple view images Ii(r, s) into a single texture map image T(u, v). Surface regions that are not visible in any of the captured images owing to self-occlusion will result in holes in the texture map.

3. Texture hole filling: A multi-resolution Gaussian image pyramid [29] is used to fill in any holes of the integrated texture map. Each image in the pyramid is convolved with a 5 x 5 Gaussian kernel, starting with the initial integrated texture image. The result is a low-pass filtered version of each image in which the resolution is reduced by half at each step. The missing texture pixels at each resolution are then filled, starting at the penultimate resolution of the pyramid, by bilinear interpolation of the pixel values from the preceding lower resolution image.
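The pyramid-based hole filling of step 3 can be sketched as follows. This is a simplified variant, not the paper's exact scheme: instead of the 5 x 5 Gaussian kernel of [29] and bilinear upsampling, each level here averages the valid pixels of a 2 x 2 block, and holes are filled coarse-to-fine by copying the value up from the next lower resolution:

```python
import numpy as np

def fill_holes_pyramid(tex, valid):
    """Simplified multi-resolution hole filling: build a pyramid by
    valid-pixel 2x2 averaging (halving resolution per level), then fill
    each level's holes from the coarser level below it, finest level last."""
    levels = [(tex.astype(float), valid.copy())]
    while min(levels[-1][0].shape) > 1:
        t, v = levels[-1]
        H, W = (t.shape[0] // 2) * 2, (t.shape[1] // 2) * 2
        t4 = t[:H, :W].reshape(H // 2, 2, W // 2, 2)
        v4 = v[:H, :W].reshape(H // 2, 2, W // 2, 2)
        cnt = v4.sum(axis=(1, 3))
        down = np.where(cnt > 0,
                        (t4 * v4).sum(axis=(1, 3)) / np.maximum(cnt, 1), 0.0)
        levels.append((down, cnt > 0))
    # coarse-to-fine: fill each level's holes from the coarser level below it
    for i in range(len(levels) - 2, -1, -1):
        t, v = levels[i]
        coarse = levels[i + 1][0]
        rr, cc = np.nonzero(~v)
        t[rr, cc] = coarse[np.minimum(rr // 2, coarse.shape[0] - 1),
                           np.minimum(cc // 2, coarse.shape[1] - 1)]
        v[:] = True
    return levels[0][0]
```

The effect matches the text: hole pixels inherit low-pass filtered colour from surrounding valid texture, so no pixel of the final texture map is left empty.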

In Fig. 5 we can see texture from different stereo pairs, an integrated texture map and the result of using a multi-resolution Gaussian image pyramid to fill in the holes. Two of the four derived texture maps are shown. The integrated texture map from all stereo views is shown in Fig. 5c. The result of using a multi-resolution Gaussian image pyramid is shown in Fig. 5d. This process results in a single integrated surface model, which combines the stereo data from multiple views to represent high-resolution surface detail, as illustrated in Fig. 2d. The resulting layered model can then be animated efficiently via the skeleton articulation structure and control model ML. High-resolution surface detail is reconstructed from the displacement map D based on the deformation of the control model surface.

4 Results

In this section we present results of animated models generated using multiple view images captured in a studio environment.

4.1 Multiple camera studio

Two different studio setups have been used for model generation: an 8 camera configuration with narrow baseline stereo from the front and sides; and a 13 camera configuration with narrow baseline stereo from all sides. Both setups have a capture volume of approximately 2 m^3 with the cameras placed at a distance of 4 m from the centre. All cameras are Sony DXC9100P progressive scan 3CCD with a PAL image resolution of 720 x 576.

In the first setup we have used a system of 8 cameras, with 5 cameras placed side by side with a baseline of 50 cm facing the person, 2 cameras on the side of the person and 1 above the person facing down. All of the cameras are used for extracting the visual hull, but only the five frontal cameras are used for stereo. A Matlab chart-based camera calibration toolbox is used, giving an rms error of approximately 0.5 pixels.

The second setup consists of 13 cameras, with 4 cameras facing the front of the model and 4 facing the back, 2 cameras positioned on each side, and 1 above the person facing down. All 13 cameras are used to extract the visual hull, and only the top camera is not used for stereo. Chart camera calibration is used to calibrate the front and back cameras independently, followed by registration of the estimated coordinate frames using the commonly calibrated overhead and side cameras.
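Registering the two independently calibrated coordinate frames amounts to a least-squares rigid alignment of corresponding 3-D points. The paper does not give its registration method; a standard Kabsch/Procrustes sketch (our assumption of the technique, with illustrative names) would be:

```python
import numpy as np

def rigid_registration(P, Q):
    """Least-squares rigid alignment (Kabsch) of corresponding 3-D points
    P -> Q, e.g. calibration-chart points seen in both the front-camera and
    back-camera coordinate frames via the shared overhead and side cameras.
    Returns rotation R and translation t with Q ~= P @ R.T + t."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)            # cross-covariance of centred points
    U, _, Vt = np.linalg.svd(H)
    # reflection guard keeps R a proper rotation (det = +1)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t
```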

Stereo for a single camera pair takes around 2 minutes on a Pentium III 900 MHz machine, and a complete model with manual posing can be reconstructed in less than 2 hours.

4.2 Shape reconstruction

Figure 6 shows results of three models of people generated with the 8 camera setup. The resulting models demonstrate the inclusion of high-resolution surface detail from the multiple view stereo. Comparison with the fitted low-resolution generic model used in previous model-based multi-view reconstruction [7] demonstrates significant improvement in the reconstruction of detailed surface geometry such as creases in the clothing. Reconstruction of significant features is visible above the level of the stereo noise, as shown in Fig. 3. The difference between the low-resolution model reconstruction [7] and the displacement mapped models reconstructed in this work is of the order ±0.02 m. Geometric artefacts remain in regions such as the hands and face where the stereo algorithm fails to reconstruct shape detail. The visual quality of the resulting texture mapped models is comparable to the captured images. Stereo reconstruction results in correct alignment of the surface detail in regions with significant local variation in appearance such as the collar line, patterns on the clothing and creases.

Figure 7 shows an example of a complete model using the multiple view stereo and displacement mapping to

Fig. 5 Texture blending of stereo views
a Right view
b Left view
c Integrated texture
d Blended texture map


reconstruct high-resolution surface detail. Results demonstrate representation of detail such as creases in clothing, which is a significant improvement over the fitted generic model. Figure 7 shows the displacement map image together with a histogram of the displacement map values in the range ±0.02 m from the low-resolution model surface. As expected, the average displacement from the low-resolution model is approximately zero, which confirms a good fit to the data. The distribution indicates a bias towards negative displacements, which is expected as the visual hull used in the model fitting is an over-estimate of the surface shape. A significant proportion of the displacement values are at the extremities (greater than 0.01 m or less than -0.01 m), indicating that the displacement mapped model represents significant geometric structures not present in the low-resolution model. The texture mapped models contain some visible artefacts on the sides owing to the texture integration algorithm, resulting in blurred texture in regions which were not reconstructed in the captured stereo pairs. Overall, the multiple view stereo reconstruction results in improved geometric accuracy and visual fidelity over previous model-based approaches [6, 7].

4.3 Model animation

One of the advantages of using displacement maps is that we can easily animate the resulting model using commercial animation packages such as 3D Studio MAX, with displacement and texture mapping used to animate the high-resolution surface detail. Results of animating the model shown in Fig. 7 are presented in Fig. 8. The sequence

Fig. 7 Reconstruction from multiple all-around views
a Multiple texture mapped views
b Multiple flat shaded views
c Colour mapped displacement map image
d Histogram of displacement values (displacement, m, against number of samples) and colour scale

Fig. 6 Reconstruction from multiple front views
a Reconstructed models with texture
b Reconstructed model shape (shading)


presented demonstrates animation of the high-resolution surface detail, such as creases in the clothing, based on deformation of the underlying generic control model. In this example a simple skeleton-based vertex weighting scheme is used to animate the control model surface, resulting in thinning around the shoulder joint. A more sophisticated animation scheme could be used for animation production, with displacement mapping used to represent and reconstruct surface detail during animation.
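A skeleton-based vertex weighting scheme of the kind described above is, in essence, linear blend skinning. A minimal sketch (our illustration, not the paper's implementation) shows why thinning arises near joints:

```python
import numpy as np

def linear_blend_skinning(rest_vertices, weights, bone_transforms):
    """Minimal linear blend skinning: each control-model vertex is deformed
    by a weighted sum of bone transforms (4x4 homogeneous matrices).  The
    well-known 'candy-wrapper'/thinning artefact near joints arises from
    linearly blending the transformed positions, as noted for the shoulder
    in the text."""
    V = np.hstack([rest_vertices, np.ones((len(rest_vertices), 1))])  # homogeneous
    out = np.zeros_like(V)
    for b, T in enumerate(bone_transforms):
        out += weights[:, b:b + 1] * (V @ T.T)
    return out[:, :3]
```

In the layered model, only the low-resolution control surface is skinned this way; the displacement map then regenerates the fine detail on the deformed surface.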

5 Conclusions

In this paper we have presented a framework for reconstruction of animated models of people from multiple view images captured in a studio environment. This paper extends previous model-based multiple view reconstruction of people [6, 7] to reconstruct and represent the detailed surface shape from multiple view stereo. Stereo reconstruction has been extended using multiple view constraints to achieve dense reconstruction of high-resolution surface detail. Robust constraints are introduced to eliminate outliers and reduce noise in the stereo reconstruction. A novel method using displacement mapping to represent and integrate estimates of surface shape from multiple views has been presented. This approach allows integration of relatively noisy estimates of surface shape from stereo, overcoming limitations of previous range image integration algorithms for noisy data [12, 13].

Results demonstrate the reconstruction of models of people that represent the detailed surface shape of clothing. Model-based stereo achieves reconstruction of the detailed surface shape such as creases in clothing. The size of these features is in the range ±0.02 m, representing significant geometric detail when visualising the model. Currently the stereo reconstruction of smaller features such as fingers, face and hair is limited by the camera image resolution, which gives a surface sampling resolution of approximately 0.01 m. Displacement map integration and representation of multiple view stereo represents the high-resolution surface shape not present in the underlying model. This overcomes limitations of previous model-based [6, 7] and non-model based [3–5, 30] approaches to reconstructing and representing people from multiple camera views. Future research will investigate the use of displacement mapping techniques to reconstruct and represent detailed high-resolution surface dynamics from multiple view stereo, such as cloth deformation.

6 Acknowledgments

This research was supported by EU IST FW5 Project MELIES and EPSRC Platform Grant GR/S13576.

7 References

1 Kanade, T., and Rander, P.: 'Virtualized reality: constructing virtual worlds from real scenes', IEEE Multimedia, 1997, 4, (2), pp. 34–47

2 Moezzi, S., Tai, L.-C., and Gerard, P.: 'Virtual view generation for 3D digital video', IEEE Multimedia, 1997, 4, (2), pp. 18–26

3 Vedula, S., Baker, S., and Kanade, T.: 'Spatio-temporal view interpolation'. Eurographics Workshop on Rendering, 2002, pp. 1–11

4 Matusik, W., Buehler, C., Raskar, R., and Gortler, S.: 'Image-based visual hulls'. Proc. ACM SIGGRAPH, 2000, pp. 369–374

5 Cheung, G.K.M., Baker, S., and Kanade, T.: 'Visual hull alignment and refinement across time: a 3D reconstruction algorithm combining shape-from-silhouette with stereo'. Conf. Comput. Vis. Pattern Recognit., 2003, pp. 375–382

6 Carranza, J., Theobalt, C., Magnor, M., and Seidel, H.-P.: 'Free-viewpoint video of human actors'. Proc. ACM SIGGRAPH, 2003, pp. 565–577

7 Starck, J., and Hilton, A.: 'Model-based multiple view reconstruction of people'. IEEE Int. Conf. Comput. Vis., 2003, pp. 915–922

8 Krishnamurthy, V., and Levoy, M.: 'Fitting smooth surfaces to dense polygon meshes'. ACM Comput. Graph. Proc. SIGGRAPH, New Orleans, USA, 1996

9 Allen, B., Curless, B., and Popovic, Z.: 'Articulated body deformation from range scan data'. Proc. ACM SIGGRAPH, 2002, pp. 612–619

10 Smith, R., Sun, W., Hilton, A., and Illingworth, J.: 'Layered animation using displacement maps'. IEEE Int. Conf. Comput. Animation, 2000, pp. 146–154

11 Starck, J., Collins, G., Smith, R., Hilton, A., and Illingworth, J.: 'Animated statues', J. Mach. Vis. Appl., 2003, 14, (4), pp. 248–259

12 Curless, B., and Levoy, M.: 'A volumetric method for building complex models from range images'. ACM Comput. Graph. Proc. SIGGRAPH, New Orleans, USA, 1996, pp. 303–312

13 Hilton, A., Stoddart, A.J., Illingworth, J., and Windeatt, T.: 'Implicit surface based geometric fusion', Comput. Vis. Image Underst., 1998, 69, (3), pp. 273–291

14 Levoy, M., Pulli, K., Curless, B., Rusinkiewicz, S., Koller, D., Pereira, L., Ginzton, M., Anderson, S., Davis, J., Ginsberg, J., Shade, J., and Fulk, D.: 'The digital Michelangelo project'. ACM Comput. Graph. Proc. SIGGRAPH, 2000, pp. 131–144

15 Lee, A., Moreton, H., and Hoppe, H.: 'Displaced subdivision surfaces'. Proc. ACM SIGGRAPH, 2000, pp. 85–94

16 Sun, W., Hilton, A., Smith, R., and Illingworth, J.: 'Layered animation of captured data', Vis. Comput., 2001, 17, (8), pp. 457–474

17 Ju, X., and Siebert, J.P.: 'Conformation from generic animatable models to 3D scanned data'. Int. Conf. 3D Scanning, Paris

18 Terzopoulos, D.: 'From physics-based representation to functional modeling of highly complex objects'. NSF-ARPA Workshop on Object Representation in Computer Vision, Springer-Verlag, 1994, pp. 347–359

19 Hilton, A., Beresford, D., Gentils, T., Smith, R., Sun, W., and Illingworth, J.: 'Whole body modelling of people from multi-view images to populate virtual worlds', Vis. Comput., 2000, 16, (7), pp. 411–436

20 Isidoro, J., and Sclaroff, S.: 'Stochastic refinement of the visual hull to satisfy photometric and silhouette consistency constraints'. Int. Conf. Comput. Vis., 2003, pp. 1335–1342

21 Plankers, R., and Fua, P.: 'Articulated soft objects for video-based body modeling'. IEEE Int. Conf. Comput. Vis., 2001, pp. 394–401

22 Kobbelt, L., Campagna, S., Vorsatz, J., and Seidel, H.-P.: 'Interactive multi-resolution modeling on arbitrary meshes'. Proc. ACM SIGGRAPH, 1998, pp. 105–114

23 Starck, J., and Hilton, A.: 'Reconstruction of animated models from images using constrained deformable surfaces'. 10th Conf. on Discrete Geometry for Computer Imagery, Lecture Notes in Computer Science, vol. 2301, Springer-Verlag, Bordeaux, France, April 2002, pp. 382–391

24 Okutomi, M., and Kanade, T.: 'A locally adaptive window for signal matching', Int. J. Comput. Vis., 1992, 7, (2), pp. 143–162

Fig. 8 Animation sequence of displacement mapped model


25 Kang, S.B., and Szeliski, R.: '3-D scene recovery using omnidirectional multibaseline stereo'. Conf. Comput. Vis. Pattern Recognit., 1996, pp. 364–372

26 Rusinkiewicz, S., Hall-Holt, O., and Levoy, M.: 'Real-time 3D model acquisition'. Proc. ACM SIGGRAPH, 2002

27 Soucy, M., and Laurendeau, D.: 'A general surface approach to the integration of a set of range views', IEEE Trans. Pattern Anal. Mach. Intell., 1995, 14, (4), pp. 344–358

28 Turk, G., and Levoy, M.: 'Zippered polygon meshes from range images'. ACM Comput. Graph. Proc. SIGGRAPH, Orlando, Florida, 1994, pp. 311–318

29 Burt, P.J., and Adelson, E.H.: 'A multiresolution spline with application to image mosaics', ACM Trans. Graph., 1983, 2, (4), pp. 217–236

30 Kanade, T.: 'Virtualized reality: putting reality into virtual reality'. 2nd Int. Workshop Object Representation Comput. Vis., ECCV, 1996
