Creating and Animating Personalized Head Models from Digital Photographs and Video

Programming and Computer Software, Vol. 30, No. 5, 2004, pp. 242–257. Translated from Programmirovanie, Vol. 30, No. 5, 2004. Original Russian Text Copyright © 2004 by Zhislina, Ivanov, Kuriakin, Lempitskii, Martinova, Rodyushkin, Firsova, Khropov, Shokurov.

V. G. Zhislina**, D. V. Ivanov*, V. F. Kuriakin**, V. S. Lempitskii*, E. M. Martinova**, K. V. Rodyushkin**, T. V. Firsova**, A. A. Khropov*, and A. V. Shokurov*

*Department of Mechanics and Mathematics, Moscow State University, Vorob’evy gory, Moscow, 119992 Russia

**Intel Russian Research Center, Russian Academy of Sciences, ul. Turgeneva 30, Nizhni Novgorod, 603950 Russia

e-mail: denis@fit.com.ru

Received April 27, 2004

Abstract—In this paper, a survey is given of the approaches, methods, and algorithms used for creating personalized three-dimensional models of a human head from photographs or video. All stages of the model construction are considered in detail. These stages include image marking, camera registering, geometrical model adaptation, texture formation, and modeling of additional elements, such as eyes and hair. Some technologies for animating the resulting models, including one based on the MPEG-4 standard, are analyzed, and examples of applications that use these technologies are given. In conclusion, some prospects of developments in this field of computer graphics are discussed.



1. INTRODUCTION

Animated head models are used in various computer applications, including games, movie production, videoconferencing, and many others. The problem of quickly creating realistic high-quality models is one of the most complex problems of computer graphics, and it has not been completely solved so far.

On the one hand, being a solid body located in three-dimensional space, a particular head can be digitized by means of three-dimensional scanners based on laser rangefinders [1] or similar technologies. Such an approach makes it possible to develop a relatively accurate geometric model and the corresponding texture; however, without additional processing, the data obtained cannot be used for animation, and the subsequent adaptation of these data is a rather time-consuming procedure.

On the other hand, the use of a priori knowledge of the object structure (in the given case, a head) allows one to improve the model. This idea is usually implemented in calibration strategies that take a statistically typical head model and adjust it to the input data [2–5]. Alternatively, a set of available models is treated as a basis of some linear vector space [6]. A model obtained in this way is usually more appropriate for animation purposes, since its construction is based on knowledge of the head structure.

The head model calibration methods can be classified in terms of the type of data they operate on. Since information about the scene depth is very expensive to obtain, digital images are usually used as the input data. There exist calibration technologies based on one image [6], two orthogonal images [3, 7], and a sequence of images or video [5, 8, 9]. Some methods require certain camera locations (viewpoints), certain lighting conditions, and other constraints, which considerably restrict the applicability domain of the corresponding algorithms.

An important feature of any calibration method is the degree of user participation in the model development. Some methods are completely automated [2, 6]; others require the selection of several feature points on the image [8, 9]; however, the majority of the technologies assume considerable manual work by the user [5, 7, 10]. The manual input of certain parameters often improves the model; however, the automated methods are more practical from the user's standpoint.

In this paper, we consider methods and algorithms used for the construction of a head model and its subsequent animation that are based on the following assumptions:

• the input data are a set of images or video sequences;

• the construction of the model relies on knowledge of the model structure (a statistically typical head model);

• the model is constructed either automatically or with minimal participation of the user.

The construction of a head model under the above assumptions proceeds in several stages organized in a pipeline, as shown in Fig. 1.


At each stage, the data obtained at the previous stages of the pipeline are used. The objectives of these stages are as follows.

Image marking. The marking of an image or a video sequence identifies and selects certain elements of the head image for their subsequent use. These elements include the following objects:

• Feature points, such as the corners of the eyes and lips, earlobes, etc., standardized in MPEG-4 for the animation of a head model in the framework of synthetic video [3].

• Feature lines, such as the eye, lip, and face contours, the full-face nose contour, and the like.

• Silhouette lines, such as a profile outline (if such an image is available).

• Head masks (an image is segmented into the background and the head itself).

Image registering. The objective of this stage consists in registering the images, that is, estimating the locations of the cameras that were used to obtain these pictures in some global coordinate system associated with the object. Using the elements selected at the previous stage, which are projections of the corresponding 3D objects onto the picture planes under a certain projection model, the problem consists in determining the orientation of each camera and its internal characteristics, such as the focal distance at the shooting moment. In the majority of cases, a perspective camera model is used, which is rather accurate but leads to the minimization of a nonlinear function. In certain particular cases, such as reconstruction from a full-face and a profile image, a simplified orthogonal model can also be used.

Fig. 1. Structure of the head model construction pipeline: input images → image marking → camera registering → geometrical adaptation → texture formation → additional processing → personalized model.


Geometric adaptation. At this stage, a typical head model is placed into the coordinate system with the fixed camera locations and is subjected to deformations that adjust it to the given data. After this adaptation, the projections of the relevant 3D elements of the model should be as close as possible to the corresponding elements selected on the images (with regard to the chosen projection model).

Texture formation. The aim of this stage is to form the texture for the adapted (personalized) model. The input data here are the marked images, which, after appropriate processing, are combined in one way or another into a single mosaic-like texture. The preservation of the visual resolution of the original images can be used as a quality criterion for the resulting texture; however, many methods result in considerable smoothing of the high-frequency details.

Additional processing. Some applications using the head models obtained may impose additional specific requirements, such as, for example, modeling of the eyeballs as separate objects, development of an improved hair model, and the like. In these cases, additional procedures to meet these requirements are needed. In this paper, we analyze the above-mentioned stages and consider in detail the methods and algorithms used in these stages (Sections 2–6).

In addition, we discuss some approaches to the animation of the personalized models (Section 7) and possible applications of the technologies described (Section 8). In conclusion, we discuss the prospects for this field of computer graphics.

2. IMAGE MARKING

The image marking is the first and most important stage of the model construction pipeline, since the quality of the model greatly depends on the quality of the data obtained as a result of the marking. At this stage, various feature elements of the image are selected. In the majority of applications, the marking is done by the user; however, an automated mode of marking is preferable from the standpoint of application efficiency.

Usually, the selected elements are projections of feature points, such as the corners of the eyes or the tip of the nose, and the like. These elements also include feature lines, such as, for example, the eye and lip contours (Fig. 2). In this case, a contour can be described by either a polyline or a compound spline. Silhouette lines are selected in a similar way. In addition, some objects specific to a particular individual, such as birthmarks, wrinkles, etc., can be selected. If there are several projections of one and the same element onto different images, these projections can be taken into account when registering the cameras, and the 3D geometry and location of the element can be recovered for the sake of the subsequent model adjustment.

The automated image marking involves the use of various computer vision algorithms developed recently. The problem of automated image marking in the case of scattered viewpoints (wide baseline matching [11]) has not yet been solved completely for objects like the human head; therefore, automated processing requires video sequences as the input data. Moreover, not all of the data mentioned above can be extracted from the images automatically.

The first step in the automated marking of a video is to detect and locate the face in some frames. For this purpose, an object detection method is used, which is described in [12, 13] and implemented in the OpenCV library [14]. By means of this method, a rectangle containing the face image is roughly estimated.
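As an illustration of this step, the sketch below shows how such a rough face rectangle can be obtained with the OpenCV Haar cascade detector. This is a minimal example in the spirit of [12–14], not the authors' code; the cascade file name refers to the classifier bundled with current OpenCV Python distributions.

```python
import cv2

def detect_face_rectangles(frame_bgr):
    """Roughly locate faces in a frame with a boosted Haar cascade [12-14]."""
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # reduce sensitivity to lighting
    # Each detection is a rectangle (x, y, w, h) roughly bounding a face.
    return detector.detectMultiScale(gray, scaleFactor=1.1,
                                     minNeighbors=5, minSize=(80, 80))
```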

On the frames close to a frontal or profile view, some feature contours and points can be selected. To do this, first, the locations of the eyes and the corners of the mouth are roughly estimated. The data obtained make it possible to determine the regions containing certain elements of the face, such as the eyes, nose, and mouth. Then, by means of the deformable template method [15–17], the locations of the eyes, mouth, and nose in these regions are found accurately, and the shape of the lips, the form of the alae of the nose, the shape of the eyes, and the radii and centers of the pupils are determined (Fig. 3).

By using histogram methods, skin regions on the images are selected. By means of the Harris filter, specific features within these regions are distinguished. Then, these features are tracked to the neighboring frames (Fig. 4). The corresponding algorithm is described in detail in [18]. Since the data selected in this way inevitably contain errors due to noise, model inaccuracies, and other similar effects, the subsequent use of these data for registering the cameras requires statistically robust methods.
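A possible realization of this step with standard OpenCV routines is sketched below (the actual algorithm of [18] may differ in its details): Harris-type corners are picked inside the skin mask and traced to the neighboring frame with a pyramidal Lucas–Kanade tracker. The matched point pairs returned by such a routine are exactly the noisy correspondences that the robust camera registration of Section 3 has to cope with.

```python
import cv2
import numpy as np

def track_skin_features(prev_gray, next_gray, skin_mask):
    """Select Harris corners inside the skin regions and trace them to the next frame."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01,
                                  minDistance=7, mask=skin_mask,
                                  useHarrisDetector=True, k=0.04)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    # Pyramidal Lucas-Kanade optical flow to the neighboring frame.
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None,
                                                 winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    return pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)
```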

By means of the histogram methods, it is also possible to separate the head image from a uniform background (Fig. 5). The head masks obtained in this way bear much information about the geometric shape of the head and can be used at the subsequent stages of the pipeline.

Fig. 2. A part of an image with selected feature contours.

3. CAMERA REGISTERING

This stage is used, as a rule, for registering the cameras in the case where there are several projections of the same 3D points onto different images.

In the majority of cases, a perspective camera model is used. In this model, the projection of a 3D point M onto the picture plane is determined as

$$
\begin{pmatrix} m_X \\ m_Y \end{pmatrix}
=
\begin{pmatrix}
f\,\dfrac{R_{11}M_X + R_{12}M_Y + R_{13}M_Z + t_X}{R_{31}M_X + R_{32}M_Y + R_{33}M_Z + t_Z} \\[2ex]
f\,\dfrac{R_{21}M_X + R_{22}M_Y + R_{23}M_Z + t_Y}{R_{31}M_X + R_{32}M_Y + R_{33}M_Z + t_Z}
\end{pmatrix},
$$

where each camera is characterized by the focal distance f, the rotation matrix R, and the shift vector t. The estimation of these parameters for each camera, with simultaneous determination of the locations of the three-dimensional points from their known projections, is a well-studied problem of scene structure reconstruction from pictures taken with different camera angles (structure-from-motion) [19]. In the given case, the solution of this problem is facilitated by the use of a priori knowledge of the camerawork (for example, it is known that the pictures were first taken almost full-face and then in profile) or of the nature of some points (for example, certain points are known to correspond to points belonging to the human face, and, thus, their configuration in three-dimensional space is approximately known).

Thus, each image is associated with a position and orientation of the camera in the three-dimensional space related to the model being reconstructed. Then, it becomes possible to determine the 3D coordinates of points of other, more complicated, objects. For example, two projections allow us to recover the 3D shape and location of a contour.

If there are many images with known head masks, it is possible to construct the intersection of the masks in three-dimensional space to get a visual hull of the head [20]. The visual hull is obtained from a uniform voxel lattice containing the desired head model by removing the voxels whose projections do not belong to the head masks in all frames and, hence, do not belong to the head. If there are many (several dozen) images, the visual hull approximates the head shape everywhere except for concave regions, such as, for example, the region of the alae of the nose.

Figure 6 shows a typical section of such a voxel structure (the section plane is approximately the symmetry plane of the head). The voxels are colored proportionally to the number of their occurrences inside the head masks of the available images. The white voxels occurred inside the masks in all pictures and, hence, belong to the visual hull of the head.

Fig. 3. Automated frame marking.

Fig. 4. Separating and tracing specific features.

Fig. 5. Masking an image.

Fig. 6. A slice cut from the voxel structure.
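For reference, the projection formula above translates directly into a small routine; the sketch below is a plain transcription of the equation (with f, R, and t as defined in the text), not library code. Camera registration then amounts to choosing f, R, and t for every image so that the projections of the reconstructed 3D points fall as close as possible to their marked positions.

```python
import numpy as np

def project_point(M, f, R, t):
    """Perspective projection of a 3D point M by a camera with focal distance f,
    rotation matrix R, and shift vector t (see the formula above)."""
    X = R @ M + t                       # the point in the camera coordinate system
    return np.array([f * X[0] / X[2],   # m_X
                     f * X[1] / X[2]])  # m_Y
```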

4. GEOMETRIC ADAPTATION

The goal of this stage is to adjust the statistically average head to the input data. In the majority of cases, the model of the statistically average head is given in the form of a polygonal mesh. It should be noted that the feature points are represented by nodes of the mesh, and the feature contours are represented by connected sequences of mesh edges. In addition, some parts of the head (for example, the forehead or chin) can be identified in advance with certain regions of the mesh.

4.1. Adaptation by Means of the RBF and Dirichlet Deformations

The simplest approach to the adaptation relies only on the three-dimensional positions of the feature points. Let the feature points M_i be known, and let P_i be the nodes of the original mesh corresponding to them. Thus, we know the displacements of the nodes P_i and need to interpolate these displacements over all other nodes of the mesh. Then, to model the desired deformation, the following function F: R^3 → R^3 can be used:

$$ F(P) = \sum_i w_i \Phi(P - P_i) + AP + t, $$

where P is an arbitrary point in three-dimensional space. The first term on the right-hand side of this equation contains three-dimensional quantities and some known function Φ: R^3 → R and governs the deformation of the model. The second term contains an affine transformation matrix A and a shift vector t and is responsible for a linear transformation of the model. The parameters w_i, A, and t are found from the system of linear equations F(P_i) = M_i. The function Φ is usually a radial function (it depends only on the magnitude of its vector argument); therefore, this method is referred to as the radial basis function (RBF) deformation [21]. The deformation obtained depends, of course, on the particular form of the function Φ; however, a number of disadvantages of this method manifest themselves for any choice of Φ. For example, far from the feature points (in regions like the cheeks or the back of the head), the resulting surface can look unrepresentative of the given individual or, in some cases, even impossible (for example, nonsmooth). Thus, this approach alone is not suitable for creating a high-quality realistic head model, because the locations of the feature points do not provide all the required information about the shape of the head. The application of this approach to the problem of geometric adaptation is discussed in detail in [5, 22, 23].

The Dirichlet free-form deformations make it possible to take into account the structure of the statistically average head [7, 24]. The method is based on the construction of the Delaunay triangulation on the mesh surface with the nodes at the feature vertices of the mesh. Note that the triangulation is constructed with respect to the metric that defines the distance between two nodes as the length of the shortest path along the mesh edges (the path length is usually computed by the Dijkstra algorithm [25]). Thus, for each node, we know the containing triangle of the Delaunay triangulation, its three support feature points, and the barycentric coordinates. From the known displacements of the feature nodes and the barycentric coordinates, the displacements of all other nodes are found.

This method avoids certain artifacts of the RBF transformation, since it takes into account not only the positions of the feature points but also the head structure. Nevertheless, far from the feature points, the deformations are still not controlled. In addition, the silhouette lines, the visual hull, and some other data are not taken into account in this method.
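To make the RBF construction concrete, the sketch below fits the parameters w_i, A, and t from the interpolation conditions F(P_i) = M_i and returns the resulting deformation. A Gaussian is chosen arbitrarily for Φ, and a least-squares solve is only one of several ways to resolve the resulting system, so this is an illustration of the idea rather than the method of [5, 21–23].

```python
import numpy as np

def fit_rbf_deformation(P, M, sigma=1.0):
    """Fit F(Q) = sum_i w_i * Phi(Q - P_i) + A Q + t so that F(P_i) = M_i.

    P, M : (n, 3) arrays of mesh feature nodes and their target 3D positions.
    Returns a function that deforms an (m, 3) array of arbitrary points.
    """
    P, M = np.asarray(P, float), np.asarray(M, float)
    n = len(P)
    phi = lambda r: np.exp(-(r / sigma) ** 2)          # assumed radial function
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    # One block row per feature node: [Phi(|P_i - P_j|) | P_i | 1] * [w; A^T; t] = M_i.
    G = np.hstack([phi(D), P, np.ones((n, 1))])
    coeffs, *_ = np.linalg.lstsq(G, M, rcond=None)     # shape (n + 4, 3)
    w, At, t = coeffs[:n], coeffs[n:n + 3], coeffs[n + 3]

    def deform(Q):
        Q = np.asarray(Q, float)
        r = np.linalg.norm(Q[:, None, :] - P[None, :, :], axis=2)
        return phi(r) @ w + Q @ At + t
    return deform
```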

4.2. Adaptation by Means of the Displacement Field

Let the original model be affinely transformed with regard to the positions of the feature points. The subsequent transformation should deform the mesh so that all input information about the head geometry is taken into account. The displacements of the nodes under such a deformation are assumed to be small. For each node P_i, we introduce the unknown displacement vector T_i ∈ R^3. Then, all information about the geometry of the head can be expressed in terms of linear equations specifying constraints imposed on T_i. For example, the condition of deformation smoothness is described by smoothness equations, which equate the displacement of a node to a weighted sum of the displacements of the adjacent nodes. The knowledge that a given feature node maps to a 3D point with known coordinates is described by an equation that explicitly sets the node displacement equal to the vector from the node to that point.

In certain cases, the constraints are described by nonlinear equations (for example, in the case where the perspective projection of a given node onto a given image must fall on a silhouette line after the deformation). However, under the assumption that the deformations are small, these equations can be linearized in a neighborhood of any point.

When silhouette lines are used in the deformation process, it is necessary to determine which nodes fall on the silhouette line in a given frame. For this purpose, the original undeformed model is rendered using the camera estimate. The nodes that turn out to be silhouette nodes under such rendering are deformed so that they fall on the silhouette line. If a visual hull is used, we determine in a similar way which nodes fall on the boundary of the visual hull and which nodes belong to concave regions and, thus, do not necessarily belong to this boundary after the deformation.

As a result, all information about the deformation is represented in the form of a large overdetermined system of linear equations, which is solved in the least-squares sense by the conjugate gradient method. The desired solution is the set of displacements of the mesh nodes.

Different groups of equations (smoothness equations, equations describing constraints imposed on the displacements of the feature points, equations for the silhouette lines, and so on) can be weighted by multiplying them by weight coefficients. For example, if the weight of the smoothness equations increases, the deformation of the resulting mesh becomes smoother, whereas the accuracy of the adjustment to the input data decreases.

The algorithm described above allows incorporating various types of data into the deformation and, in the case of small deformations, leads to quite good results. Figure 7 shows an example of the model deformation. The input data (in the given case, feature points and contours) are shown in the drawing on the left. The drawing in the middle shows the original model after the affine transformation. The result of the deformation by means of the algorithm described is shown on the right. This method is described in detail in [26].

However, this approach needs refinement if local deformations of the surface are relatively large (deformations of this kind usually occur on the neck, the back of the head, and near the ears). In this case, an iterative process is used. At each step, the current model is first adjusted affinely and then subjected to the deformation by the algorithm described. At the beginning of the process, the weight of the smoothness equations is very large. Every time the solution stabilizes, the weight of the smoothness equations is reduced. Such a process makes it possible to obtain large deformations and reliable, good results.
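The structure of this overdetermined system can be illustrated by a toy sketch (not the implementation of [26]): smoothness equations and feature-point constraints are stacked with their weights and solved for the displacement field in the least-squares sense. SciPy's LSQR solver stands in here for the conjugate gradient method mentioned above, and silhouette and visual-hull equations would be appended as further weighted rows.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def solve_displacements(n_nodes, neighbors, anchors, w_smooth=10.0, w_anchor=1.0):
    """Solve for per-node 3D displacements T_i from two groups of linear equations.

    neighbors : dict {i: [j, ...]} mesh adjacency (for smoothness equations)
    anchors   : dict {i: target displacement (3,)} feature-node constraints
    """
    rows = len(neighbors) + len(anchors)
    A = lil_matrix((rows, n_nodes))
    b = np.zeros((rows, 3))
    r = 0
    # Smoothness: T_i - mean(T_j over neighbors) = 0, weighted by w_smooth.
    for i, js in neighbors.items():
        A[r, i] = w_smooth
        for j in js:
            A[r, j] = -w_smooth / len(js)
        r += 1
    # Feature points: T_i = known displacement, weighted by w_anchor.
    for i, d in anchors.items():
        A[r, i] = w_anchor
        b[r] = w_anchor * np.asarray(d, float)
        r += 1
    A = A.tocsr()
    # Solve each coordinate independently in the least-squares sense.
    T = np.column_stack([lsqr(A, b[:, k])[0] for k in range(3)])
    return T  # (n_nodes, 3) displacement field
```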

5. TEXTURE FORMATION

A high-quality texture is as important for the realism of the model as high-quality geometric adaptation. The first step in the creation of a single texture is the extraction of texture fragments from different images with the goal of their subsequent merging. In doing so, it is assumed that the original polygonal mesh is equipped with texture coordinates, which, in fact, specify a mapping of the mesh surface onto a plane.

5.1. Texture Extraction

From each image, a fragment of the texture is extracted by means of the estimated camera model and the adapted head model (Fig. 8). Together with the texture, a map of weights is created, which shows the quality of the extraction of the corresponding texture point. The weight is specified as the cosine of the angle between the normal to the model and the view direction. There exist several approaches to combining the extracted fragments into a single texture. The simplest among them consists in the element-wise weighted addition of all textures [5]. However, the quality of the texture obtained by means of this approach leaves much to be desired, since visually noticeable smoothing of fine details takes place. The creation of a realistic high-quality head model requires more sophisticated merging methods.
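The element-wise weighted addition mentioned above amounts to a normalized per-texel sum. Below is a minimal sketch, assuming the fragments have already been resampled into a common texture space and the weight maps hold the cosine term described in the text (zero where a texel is not seen):

```python
import numpy as np

def blend_fragments(fragments, weight_maps, eps=1e-6):
    """Element-wise weighted addition of texture fragments [5].

    fragments   : list of (H, W, 3) float arrays in a common texture space
    weight_maps : list of (H, W) arrays, e.g. max(cos(normal, view), 0), 0 where unseen
    """
    num = np.zeros_like(fragments[0], dtype=float)
    den = np.zeros(fragments[0].shape[:2], dtype=float)
    for tex, w in zip(fragments, weight_maps):
        num += tex * w[..., None]
        den += w
    return num / (den[..., None] + eps)   # texels seen by no camera stay black
```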

5.2. Optimal Merging Line

In this approach, three texture fragments (two profiles and one full face) are used. The first step in this method is to find an optimal line that separates the above-mentioned texture fragments. This line is sought by dynamic programming methods such that the difference in the colors of the different fragments along the line is minimal and the quality of the filled texture elements belonging to the resulting texture is maximal (Fig. 9). At the next step, the brightness of the profile texture fragments is corrected to even out the difference between the texture fragments along the boundary line. This method allows us to obtain a seamless merging of the textures and simultaneously avoid fuzziness and averaging; however, it may result in artifacts when the colors of the fragments are considerably different (for example, if the pictures were taken under different lighting conditions). In other words, this method copes successfully with the processing of high-frequency details but is not able to cope with low-frequency differences in the input images. This method is described in detail in [27].

Fig. 7. An example of the model deformation.

Fig. 8. Extraction of texture fragments.

Fig. 9. Optimal sewing lines (the dashed lines delimit the search region).
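The optimal-line search is a classical dynamic-programming seam computation. The sketch below finds, for two overlapping fragments, one column per texture row so that the accumulated color difference along the resulting line is minimal; the fill-quality term used in [27] is omitted for brevity, so this is only an illustration of the idea.

```python
import numpy as np

def optimal_vertical_seam(frag_a, frag_b, col_range):
    """Dynamic-programming merging line between two overlapping texture fragments.

    frag_a, frag_b : (H, W, 3) fragments in a common texture space
    col_range      : (c0, c1) columns delimiting the search region
    Returns one column index per row describing the merging line.
    """
    c0, c1 = col_range
    cost = np.linalg.norm(frag_a[:, c0:c1].astype(float)
                          - frag_b[:, c0:c1].astype(float), axis=2)
    H, W = cost.shape
    acc = cost.copy()
    for y in range(1, H):                       # accumulate, allowing -1/0/+1 column steps
        left = np.roll(acc[y - 1], 1); left[0] = np.inf
        right = np.roll(acc[y - 1], -1); right[-1] = np.inf
        acc[y] += np.minimum(acc[y - 1], np.minimum(left, right))
    seam = np.empty(H, dtype=int)               # backtrack the cheapest path
    seam[-1] = int(np.argmin(acc[-1]))
    for y in range(H - 2, -1, -1):
        c = seam[y + 1]
        lo, hi = max(c - 1, 0), min(c + 2, W)
        seam[y] = lo + int(np.argmin(acc[y, lo:hi]))
    return seam + c0
```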

5.3. Pyramidal Method

This method uses an arbitrary number of texture fragments. Each fragment is represented as a Laplacian pyramid (Fig. 10). The upper levels of the pyramid represent low frequencies, and the lower levels, high frequencies. The pyramids are combined with regard to their weights, and the texture is recovered from the resulting pyramid. This method is capable of coping with differences in the lower frequencies; however, it averages the high frequencies, which reduces the texture resolution.
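A compact OpenCV-based sketch of such a pyramidal merge is given below. Each fragment is decomposed into a Laplacian pyramid, the levels are mixed using Gaussian pyramids of the weight maps, and the texture is collapsed back; for brevity, the weight maps are assumed to be normalized so that they sum to one at every texel.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels):
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)              # band-pass (higher-frequency) level
        cur = down
    pyr.append(cur)                       # coarsest (low-frequency) residual
    return pyr

def pyramid_blend(fragments, weights, levels=5):
    """Weighted merge of texture fragments in the Laplacian pyramid domain."""
    blended = None
    for img, w in zip(fragments, weights):
        lap = laplacian_pyramid(img, levels)
        gw = [w.astype(np.float32)]
        for _ in range(levels):
            gw.append(cv2.pyrDown(gw[-1]))            # pyramid of the weight map
        weighted = [l * g[..., None] for l, g in zip(lap, gw)]
        blended = weighted if blended is None else [b + l for b, l in zip(blended, weighted)]
    out = blended[-1]                                  # collapse the combined pyramid
    for level in reversed(blended[:-1]):
        out = cv2.pyrUp(out, dstsize=(level.shape[1], level.shape[0])) + level
    return out
```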

5.4. Hybrid Method

This method combines the ideas of the first two approaches. Like the first method, it uses three texture fragments: a full face and two profiles. At the first step, the low frequencies are smoothed in each fragment: the Laplacian pyramids are built and combined up to a certain level, such that the lower levels corresponding to high frequencies are preserved in each fragment. After this, the corrected texture fragments, which no longer differ in the low frequencies, are recovered. Next, the optimal line method is applied to the new textures. This approach makes it possible to sew the fragments together in both the low- and high-frequency ranges (Fig. 11).

The texture obtained contains blanks, i.e., regions whose texture cannot be extracted from the given set of images (for example, the region behind the ears). Note that these regions may become visible upon visualization of the head model from other viewpoints specified by the end user. Therefore, these blanks should be filled with colors similar to those in the neighboring regions. A more sophisticated approach is based on the statistical synthesis of the texture from the available neighboring skin samples [28].

6. ADDITIONAL PROCESSING

The additional processing includes modeling of the hairdo and the eyes. Note that both the modeling of the geometric shapes and the creation of textures are required.

Hairdos may have very different configurations. If a hairdo is not very voluminous, the following method can be applied. The texture is divided (automatically or manually) by a line into hair and skin parts. The part of the mesh related to the hair part is duplicated and deformed with regard to the silhouettes or the visual hull. Note that the hairdo edge is additionally subdivided before the deformation and is "smoothed" to the basic mesh after it (Fig. 12). The creation of the texture for the newly created hair mesh repeats that for the basic mesh.

Fig. 10. Construction of the Laplacian pyramid for a texture fragment.


Fig. 11. Sewing by the hybrid method.


Fig. 12. Hairdo creation.


To make the animation plausible, the eyes have to be modeled as two independent sphere-like objects. In the course of the geometric adaptation, it is necessary to calibrate the radii of the spheres and the locations of their centers. This can be done based on the shapes of the eyelid contours. The region of the basic mesh near the eyelids is also modified so that the basic mesh fits tightly over the eyes.

The quality of the texture is especially important for getting a realistic model. This texture cannot be created in the same way as the texture for the basic mesh for at least two reasons. First, a considerable part of the texture cannot be seen on the images; second, the original images often contain eye textures with poor resolution. If the resolution is acceptable, then good results can be obtained by synthesizing the texture from pieces extracted from the images [28]. It is also possible to fill in the missing parts using the radial symmetry of the eyes [26].

If the resolution of the texture is too low (for example, when a video is processed), then one can take as the texture a high-resolution sample adjusted in the HSV color space to the colors of the low-resolution texture obtained from the image. In the latter case, it is desirable to have several different samples for eyes of different colors (blue, brown, etc.).
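One way to implement the color adjustment described above is to shift the mean and spread of each HSV channel of a high-resolution iris sample toward those of the low-resolution eye texture; this is a hedged sketch of the idea rather than the exact procedure used in the system.

```python
import cv2
import numpy as np

def match_eye_sample(sample_bgr, lowres_bgr):
    """Adjust a high-resolution eye sample to the colors of a low-resolution eye texture."""
    sample = cv2.cvtColor(sample_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    target = cv2.cvtColor(lowres_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    for ch in range(3):                               # H, S, V channels
        s_mean, s_std = sample[..., ch].mean(), sample[..., ch].std() + 1e-6
        t_mean, t_std = target[..., ch].mean(), target[..., ch].std() + 1e-6
        sample[..., ch] = (sample[..., ch] - s_mean) * (t_std / s_std) + t_mean
    sample[..., 0] = np.clip(sample[..., 0], 0, 179)  # OpenCV hue range
    sample[..., 1:] = np.clip(sample[..., 1:], 0, 255)
    return cv2.cvtColor(sample.astype(np.uint8), cv2.COLOR_HSV2BGR)
```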

7. HEAD MODEL ANIMATION

After a personalized head model has been created, it is necessary to model its deformation during speech or emotional expression. In terms of the modeling means used, the animation methods can conventionally be divided into interpolation, parametric (or geometric), physics-based, and image-based methods. The interpolation methods include the widely used key-frame technique. In the framework of this technique, the deformation of the face is specified at certain time moments corresponding to the key frames, and the coordinates of the nodes of the model for the intermediate frames are found by interpolation. The interpolation model is widely used in animated movies and has proved itself by the impressive expressiveness of the motions obtained. However, the success of this approach is also explained by the contribution of professional artists who manually tune the key frames. In the case of a completely automated animation, the quality obtained by means of this method decreases considerably. In the parametric (geometric) methods [29], the deformation of the face is obtained by directly changing the geometry of the surface governed by a set of parameters whose values may change in time. This group includes the free-form deformation model suggested in [30], the surface model based on B-splines [31], and other algorithms [32]. This animation approach can yield good results at a relatively small cost. The physics-based models rely on the idea of modeling the deformations of the skin and facial tissues under the action of the facial muscles and bones. Usually, for this purpose, a finite element mesh or a multilayer system of springs is used [1, 33]. In the image-based models, a set of photographs of the original person with different facial expressions is used, and the resulting image is created by morphing and mixing them [5].

In Russia, there are several groups (which have branches in many countries and sell their products worldwide) working on face animation. One of them is the SeeStorm company, which uses a parametric animation model based on the fragmentation of the frame corresponding to the audio signal and on phoneme recognition. The speech stream is divided into phonetic groups corresponding to the basic lip motions during speech that are sufficient for realistic animation. As a result, one gets a sequence of control parameters for the face animation.

The Life Mode company uses an animation model based on so-called "virtual muscles." The number of these muscles is rather large (about 200), which allows an animator to manually obtain a high-quality animation in a short period of time (several hours). In addition, automated synchronization of the lip motions with speech is possible.

The computer graphics and animation studio at REN-TV developed the program Speech Animator II, a technology for creating speech-synchronized byplay based on the identification (automated if possible) of the current sounds pronounced by a voice. The information about the sequence of sounds is then transformed into information about a sequence of morphemes. In the IFAL (Intel Facial Animation Library) library created at iRRC (Intel Russian Research Center), the facial animation is implemented in accordance with the MPEG-4 standard [34, 35]. This standard defines 84 feature points on the human face whose locations describe specific features of the face and provide spatial references for the animation parameters. The set of feature points includes, for example, the corners of the eyes, lips, and eyebrows, and the like. The deformation of the face on each frame is determined by a set of 68 facial animation parameters (FAPs), which are divided into high-level (emotions and visemes) and low-level parameters. Each low-level FAP is responsible for the motion of one feature point or for the transformation of an entire scene object. The animation parameters associated with the feature points characterize linear displacements of these points with respect to their positions on a face in the emotionally neutral state (the muscles are relaxed, the mouth is closed, and the eyes are open). Numerical values of these displacements are expressed in relative units, which allows the same animation parameters to be applied to faces of different sizes and proportions. Under such an approach, the horizontal displacement of a mouth corner, for example, is expressed in terms of the mouth width as

FAP53 = ∆L/MouthWidth,


where ∆L is the absolute displacement of the mouth corner and MouthWidth is the mouth width.

According to the MPEG-4 requirements, the human head is a synthetic visual object whose representation is based on the VRML standard [36]. In IFAL, the scene graph contains eight basic objects, including the skin, eyes, pupils, teeth, and tongue. To make the animation more realistic, three additional objects (shoulders, hair, and glasses) are introduced.

The algorithm for forming the animation rules, which transforms the FAP values into changes of the positions of all nodes of the model (not only the feature points), is not defined in the MPEG-4 standard. The standard describes only a special structure (FaceDefTable) for storing the animation rules. It is assumed that the rules for interpreting the FAP stream either are passed to the decoder in the form of a FaceDefTable or are a part of the decoder itself. Algorithms for forming the animation rules for both high-level and low-level FAPs have been developed at iRRC [37, 38].

The IFAL animation technology consists of two parts:

• autonomous modeling of the motions of the model and

• frame-by-frame interpretation of FAPs and real-time model deformation.

The scheme of this technology is shown in Fig. 13. The FaceDefTable is computed automatically, in accordance with the animation rules, once for each model. To model different groups of FAPs, different approaches are applied: for the low-level FAPs, a geometric approach is used, and, for modeling visemes and emotions, either a muscle model or a combination of low-level animation parameters is employed. The influence of each FAP is modeled separately. In doing so, each FAP is associated with its influence domain, which is the set of nodes of the polygonal model that have nonzero displacements when the value of this FAP is nonzero. The rectangular boundaries of the influence domains are specified in terms of the coordinates of the feature points of the model. This makes the animation algorithms independent of a particular model. The influence domain of a FAP is defined in such a way that it does not include feature points to which other animation parameters specifying motion in the same direction may be applied.

To model the effect of the low-level FAPs, the following operations are used: transformations (rotation, translation, or scaling) of scene objects; transformations of groups of nodes (shift, rotation); sliding on the surface of the model; and special algorithms for processing "fine" operations, such as eyebrow motions or blinking.

The influence domains of the FAPs responsible for the transformations of the scene objects contain all points of the corresponding object, and the current positions of the object points are computed by multiplying them by the object transformation matrix directly in the course of determining the position of the model in the frame. For the FAPs of the second and third groups, the shift of the ith node belonging to the influence domain of a FAP directed along the r axis (at any time, only one of the directions x, y, or z is determined) is computed by the formula

$$ d_i^r = d_f^r\, W_i^x W_i^y. $$

Here, d_f^r is the displacement of the feature point due to the FAP, and W_i^x and W_i^y are weight functions of the form

$$ W_i^x = \exp\!\left( -\,\frac{a\,(x_i - x_b)^2}{(x_f - x_b)^2} \right), $$

where x_i is the node coordinate, x_b is the coordinate of the closest boundary of the influence domain, x_f is the coordinate of the feature point, and the parameter a governs the rate at which the influence decreases.

Fig. 13. The IFAL animation technology (off-line: personalized head model, model geometry, model semantics, scene graph, feature points, FAP influence domains, animation rules; on-line: FAPs, model deformation block, visualization).
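The two formulas above translate directly into code; the following sketch (an illustration transcribed from the formulas as printed, not the IFAL implementation) computes the shift of a single node of a rectangular influence domain.

```python
import math

def fap_node_shift(d_f, node, boundary, feature, a=1.0):
    """d_i^r = d_f^r * W_i^x * W_i^y for one node of a FAP influence domain.

    d_f      : displacement of the feature point along the FAP axis r
    node     : (x_i, y_i) node coordinates
    boundary : (x_b, y_b) closest boundary of the influence domain
    feature  : (x_f, y_f) feature point coordinates
    a        : rate at which the influence decreases
    """
    def w(coord, b, f):
        return math.exp(-a * (coord - b) ** 2 / ((f - b) ** 2))
    return d_f * w(node[0], boundary[0], feature[0]) * w(node[1], boundary[1], feature[1])
```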

When sliding on the model surface is modeled (for example, for eyebrow motions), not only the weight functions but also the coordinates of the model surface at the target point are used to compute the displacements of the nodes. For the FAPs responsible for implicit rotations (e.g., the vertical motion of the jaw or the eyelid motions upon blinking), the shift of the feature point is replaced by a weighted rotation (with the weight functions defined above) of the skin points in the influence region. When vertical eyebrow motions are processed, coordinated displacements of the points located in the region between the upper and lower eyebrow contours are ensured, which prevents vertical stretching in this region. The accurate processing of blinking requires computing the "target positions" of all nodes belonging to the contour of one eyelid by using the coordinates of the points belonging to the other eyelid; the displacements of the internal nodes of the eyelid are assumed to be proportional to the displacements of the closest contour points.

To model the high-level FAPs, muscle models are used, which take into account the influence of the following layers:

(1) the skin (consists of the nodes of the "skin" scene object);
(2) the fat tissue (the "skin" nodes shifted "inside" the model by a distance equal to the local thickness of the fat layer);
(3) the muscle layer;
(4) the skull (required for constraining the displacements of the nodes of the other layers and for fixing the muscles).

In this model, each node of the skin is connected to the adjacent nodes from its layer and to one node of the fat tissue by elastic springs. The nodes of the second layer interact with each other in a similar way; in addition, they are affected by the muscles. The displacements of the nodes of the first two layers are governed by Newton's equations

$$ m_i \frac{d^2 x_i}{dt^2} + \gamma \frac{d x_i}{dt} + k x_i = f_i(t), $$

where m_i is the mass of the node, γ and k are the damping and stiffness coefficients of the springs connecting the node with those from the same layer, and f_i is the resultant force acting on the node. For the second-layer nodes, this is

• the sum F_i(t) of all muscle forces acting on the node,

• the reaction force S_i acting from the skull, and

• the force g_i due to variations of the volume of the muscles.

For the skin nodes, the sum contains only one component, the force h_i due to the variation of the distance between the first two layers.

The system of ordinary second-order differential equations is solved by the fourth-order Runge–Kutta method. For each muscle in the model, the zones are specified where it is attached to the skull and where it enters the fat tissue. The visemes and emotions are modeled by means of a given combination of facial muscles, both straight and orbicular. The muscle model is used for modeling deformations of only the lower part of the face (cheeks, lips, and chin).
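For a single node, the damped-spring equation above can be advanced in time with the classical fourth-order Runge–Kutta scheme as sketched below; this is a generic integrator, with the force term f(t) standing for the aggregate of the muscle, skull-reaction, and volume forces listed earlier.

```python
import numpy as np

def rk4_step(x, v, t, dt, m, gamma, k, force):
    """One RK4 step for m x'' + gamma x' + k x = f(t), written as a first-order system."""
    def deriv(state, time):
        pos, vel = state
        acc = (force(time) - gamma * vel - k * pos) / m
        return np.array([vel, acc])

    s = np.array([x, v], dtype=float)
    k1 = deriv(s, t)
    k2 = deriv(s + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = deriv(s + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = deriv(s + dt * k3, t + dt)
    s = s + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return s[0], s[1]      # updated position and velocity of the node
```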

All computed information (which, in essence, represents the animation rules) is placed into the FaceDefTable in the format determined by MPEG-4. For each FAP corresponding to a transformation of a scene object, the table contains the object identifier and the rotation axis or the scaling coefficients along the coordinate axes (to allow, for example, the pupils to dilate). For the other FAPs, it contains the identifier of the scene object, the identifier of the corresponding feature point, and the numbers of the points from the influence domain together with their coordinate displacements, grouped by intervals of the possible FAP values ranging from the minimal to the maximal value.

In the case of real-time animation with the use of the FaceDefTable, the set of FAPs is interpreted frame by frame, and the current coordinates of the model points are computed. The displacements of all model nodes corresponding to all FAPs with nonzero amplitudes are added together, and the resultant transformation matrices for the objects of the model are computed. The displacements of the model nodes are measured from their neutral states. If there is a high-level FAP, it is possible to turn on a mode in which the values of the low-level animation parameters are ignored. In this case, instead of the displacements for the "lip" FAPs, only the displacements of the nodes written in the FaceDefTable for the high-level parameter are used. Then, a special correction of the model nodes in the mouth region is applied. Since there are more than 20 FAPs interacting in this region, their actions have to be accurately coordinated (this especially concerns the model points located on the lip contours). Any poor coordination of the node displacements strikes the eye, since these contours on a real human face are always smooth. For the correction of the deformations in the mouth region and the coordination of the lip motions, an algorithm based on cubic splines passing through the feature points on the lip contours is used [39].

Finally, the scene objects are transformed, and the model is ready for the final stage, visualization.

8. APPLICATIONS OF THE TECHNOLOGY

Speaking of the creation and animation of personalized 3D talking heads, we should keep in mind the final goal, namely, the applications where these heads are used. It is the possible applications that determine the required quality of the model personalization and animation and, thus, the technologies to be used (which also depend on the available input data and the desirable output results).

Synthetic talking heads are used for creating virtual speakers, consultants, and teachers; as characters in games, TV shows, and movies; for communication applications of the "messenger" type (exchange of text or voice messages and their visualization by a talking character); and for synthetic videoconferences (synthetic heads completely simulate the speech and byplay of the conference participants).

In order to successfully compete with applications based on real video, all of the above applications require not only the automated creation of realistic high-quality models and the development of efficient animation algorithms but also automated acquisition of the input data for realistic animation and visualization.

In terms of the input data types used for the model animation, all existing applications can be divided into the following groups:

• animation from data measured by special sensors attached to the face;

• text-based animation with the use of the Text-To-Speech technology and superposition of predetermined emotions with the desired influence coefficient;

• animation based on speech audio sequences (the superposition of predetermined emotions is also possible);

• animation based on video sequences, when the synthetic model duplicates the byplay of a real human.

The animation from data measured by sensors is used mainly in the motion picture industry. This method gives good results, but it is very expensive and not applicable in many cases.

Currently, text and audio are most often used as the sources of the animation data. Typically, "manually" created (or muscle-model-generated) visemes corresponding to the phonemes separated from the speech or text are used, and known morphing algorithms are applied to them.

The advantage of the audio or text animation sources is the availability of sophisticated software (Microsoft Text-To-Speech, LIPSinc Voice-To-Animation), which can also be used at the subsequent stages. The serious disadvantage of applications based on such sources of animation is the deficiency of the input information and, hence, the impossibility of expressing the personal byplay features of the 3D model, as well as the byplay accompanying speech (smiles, head turns, eye and eyebrow motions).

These disadvantages are completely absent in the applications where the source of the animation data is a video sequence of a talking person (not necessarily the prototype of the given synthetic model). In this case, the animation is based on a set of animation parameters (rather than on phonemes) obtained by tracking the position of the face (which determines the position of the head as a solid body) and of the feature points or contours on it (which determine the deformations related to the byplay) in the video sequence. This approach is completely compatible with facial animation in accordance with the MPEG-4 standard, which is based on FAPs whose source is not specified by MPEG-4.

The possibility of animation based on automated face tracking in a video sequence is declared by many renowned companies. Note that the most valuable algorithms are those that do not need manual frame marking, do not require special markers on the face, and are not very sensitive to variations in the shooting conditions (lighting). A set of algorithms for face tracking and FAP computation satisfying these requirements has been developed at iRRC in the framework of the Intel Facial Tracking & Analysis Library (IFTA) project [15, 16, 40]. On each frame of the sequence, first, the head position is estimated (by using the positions of the mouth and eye corners), and the corresponding FAPs are computed. Then, using the information about the face position, the other FAPs corresponding to the byplay are computed. The IFTA library includes two different sets of algorithms for computing FAPs. One of them ensures only a rough estimate of a small number of the animation parameters and works in real time (the processing of a 320 × 240 frame on a 2.2-GHz Intel Pentium processor requires approximately 30 ms, which corresponds to a processing rate of 25–30 frames per second). The second set uses the deformable template technology; it allows one to compute many FAPs with high accuracy but spends much more time on this: about 300–400 ms for processing the lip FAPs and 600–700 ms for processing the eye FAPs (the joint estimation of the eyes and lips by an optimized version of these algorithms takes approximately 500 ms).

This implies that, on state-of-the-art computers, the second set of algorithms can be used only in the off-line mode for obtaining high-quality synthetic video from the original one (in situations where a high degree of personalization is required, for example, for creating clones of public figures). The first set of algorithms can be used on-line for creating synthetic video of acceptable quality (for example, for synthetic videoconferences).

Synthetic videoconferences can be used instead of ordinary ones when the capacity of the channel connecting the conference participants is not sufficient for transmitting real video (for example, in cellular phone networks). To transmit a stream of FAPs, it is sufficient to have a channel with a capacity of only 2 Kbaud!

A prototype application for videoconferences has been created at iRRC. It implements the entire pipeline for producing synthetic video that is compatible with the MPEG-4 standard and uses the IFAL and IFTA libraries. The conference is organized according to the following scheme. The computer of each participant has a coder and a decoder. The coder tracks the face and computes the FAPs; the stream of these FAPs is transmitted over the network (using the TCP/IP protocol) to the receiving party. The model (which is already personalized and is available at the receiving computer) is animated and visualized in the decoder of the receiving party in accordance with the received FAPs, synchronously with the stream of real audio, which is also transmitted over the network.

The output of the synthetic video of "talking heads" in any application is a sequence of images of a meshed, textured 3D model deformed in accordance with the animation rules, illuminated, and appropriately projected onto the screen. To obtain this set of images, the last stage of the operation of the synthetic video decoder, visualization, is needed. The quality of the synthetic video is judged precisely by the result of this operation. It is known that one and the same model may look different depending on the visualization algorithm used (different illumination, texture processing, and rasterization). The MPEG-4 synthetic video standard does not specify methods of model visualization. At the same time, in all known applications, the models are visualized by means of the standard libraries OpenGL and Direct3D, which work in real time and ensure acceptable (but not excellent) visualization quality.

To make the personalized models look more natural, the IFAL library, which uses OpenGL for visualization, includes additional automated means that work in real time independently of the particular model. These are the creation and visualization of 3D eyelashes; a special technique for visualizing the eyeballs, which makes it possible to show pupil dilation; specific illumination of the mouth cavity (tongue and teeth), which creates natural shadowing without considerable time expenditures; and the dynamic creation of mimic lines (for example, those appearing upon smiling or those between the corner of the mouth and the nose) [39–41].

The quality of the synthetic video created by means of the application based on the IFAL and IFTA libraries developed at iRRC can be assessed from Fig. 14.

9. CONCLUSIONS

In this paper, a survey has been given of the methods and algorithms used for creating personalized head models from photographs or video sequences. Methods for animating these models and examples of applications where these technologies can be used have also been considered. The emphasis is placed on the technologies developed by Russian researchers.

By combining the technologies described, one can create applications that are capable of working in a fully automated mode or that allow the user to control certain stages of the head model creation and animation process. For example, the image marking can be done either automatically in the case of a video stream or semiautomatically in the case where the images are processed separately. Of course, the intervention of the user in the process generally improves the quality of the model or its animation.

Fig. 14. An example of a "talking" head animated from real video.

The appearance of commercial applications based on the described technologies that are capable of solving the problems discussed demonstrates evident advances (including scientific ones) in this field. On the other hand, the authors are not aware of applications capable of creating and animating, in an interactive mode, head models identical to their prototypes in real pictures or video. It seems likely that the creation of fully realistic synthetic video will require the development of new principles and approaches, which should become the subject of future research in this field.

ACKNOWLEDGMENTS

This work uses materials created in the frameworkof the joint project “Talking Head” of iRRC (Intel Rus-sian Research Center) and the Department of Mechanicsand Mathematics of Moscow State University with theparticipation of the Department of Computational Mathe-matics and Cybernetics of Moscow State University.

This work was supported in part by the Samsung Advanced Institute of Technology and by the Russian Foundation for Basic Research, project no. 03-07-90381.

REFERENCES

1. Lee, Y., Terzopoulos, D., and Waters, K., Realistic Modeling for Facial Animation, Proc. of SIGGRAPH'95, 1995, pp. 55–62.

2. Goto, T., Kshirsagar, S., and Magnenat-Thalmann, N., Automatic Face Cloning and Animation, IEEE Signal Processing Magazine, 2001, vol. 18, no. 3, pp. 17–25.

3. Lavagetto, F., Pockaj, R., and Costa, M., Smooth Surface Interpolation and Texture Adaptation for MPEG-4 Compliant Calibration of 3D Head Models, Image Vision Computing J., 2000, vol. 18, no. 4, pp. 345–354.

4. Lee, W., Kalra, P., and Magnenat-Thalmann, N., Model Based Face Reconstruction for Animation, Proc. of MMM'97, World Sci., 1997, pp. 323–338.

5. Pighin, F., Hecker, J., Lischinski, D., Szeliski, R., and Salesin, D., Synthesizing Realistic Facial Expressions from Photographs, Proc. of SIGGRAPH'98, 1998, pp. 75–84.

6. Blanz, V. and Vetter, T., A Morphable Model for the Synthesis of 3D Faces, Proc. of SIGGRAPH'99, 1999, pp. 187–194.

7. Lee, W. and Magnenat-Thalmann, N., Fast Head Modeling for Animation, Image Vision Computing J., 2000, vol. 18, no. 4, pp. 355–364.

8. Cohen, M., Jacobs, C., Liu, Z., and Zhang, Z., Rapid Modeling of Animated Faces from Video, Microsoft Tech. Report MSR-TR-2000-11, 2000.

9. Fua, P., Regularized Bundle-Adjustment to Model Heads from Image Sequences without Calibration Data, Int. J. Comput. Vision, 2000, vol. 38, no. 2, pp. 153–171.

10. Lee, W., Escher, M., Sannier, G., and Magnenat-Thalmann, N., MPEG-4 Compatible Faces from Orthogonal Photos, Proc. of CA'99, 1999, pp. 186–194.

11. Schaffalitzky, F. and Zisserman, A., Viewpoint Invariant Texture Description and Matching, Tech. Report, Dept. of Eng. Sci., Univ. of Oxford, 2001.

12. Viola, P. and Jones, M.J., Rapid Object Detection Using a Boosted Cascade of Simple Features, IEEE CVPR, 2001.

13. Lienhart, R. and Maydt, J., An Extended Set of Haar-like Features for Rapid Object Detection, IEEE ICIP, 2002, vol. 1, pp. 900–903.

14. Open Source Computer Vision Library, Intel, http://www.intel.com/research/mrl/research/opencv/.

15. Bovyrin, A.B. and Rodyushkin, K.V., Statistical Estimation of Color Components of Lips and Face to Determine the Mouth Contour by the Deformable Template Method, Abstracts of the 11th Conf. "Mathematical Methods of Pattern Recognition," Moscow, 2003, pp. 245–247.

16. Rodyushkin, K.V., Eye Features Tracking by Deformable Template to Estimate Face Animation Parameters, Proc. of Advanced Concepts for Intelligent Vision Systems, Ghent, Belgium, 2003, pp. 267–274.

17. Yuille, A.L., Hallinan, P.W., and Cohen, D.S., Feature Extraction from Faces Using Deformable Templates, Int. J. Comput. Vision, 1992, vol. 8, no. 2, pp. 99–111.

18. Shokurov, A., Khropov, A., and Ivanov, D., Feature Tracking in Image and Video, Proc. of GraphiCon-2003, 2003, pp. 177–179.

19. Pollefeys, M., 3D Modeling from Images, Tutorial Course on SIGGRAPH'2001, 2001.

20. Laurentini, A., The Visual Hull Concept for Silhouette-Based Image Understanding, IEEE Trans. Pattern Anal. Machine Intelligence, 1994, vol. 16, no. 2, pp. 150–162.

21. Schaback, R., Creating Surfaces from Scattered Data Using Radial Basis Functions, Mathematical Methods in CAGD III, Daehlen, M., Lyche, T., and Schumaker, L., Eds., 1995, pp. 1–21.

22. Ambrosini, L., Costa, M., Lavagetto, F., and Pockaj, R., 3D Head Model Calibration Based on MPEG-4 Parameters, Proc. of ISPACS'98, 1998, pp. 626–630.

23. Lempitskii, V., Ivanov, D., and Kuzmin, Ye., High-Quality Head Model Calibration Pipeline, GraphiCon'2002, 2002.

24. Kshirsagar, S., Garchery, S., and Magnenat-Thalmann, N., Feature Point Based Mesh Deformation Applied to MPEG-4 Facial Animation, Proc. of Deform'2000, 2000, pp. 23–34.

25. Dijkstra, E., A Note on Two Problems in Connection with Graphs, Numer. Math., 1959, vol. 1, pp. 269–271.

26. Ivanov, D., Lempitskii, V., Shokurov, A., Khropov, A., and Kuzmin, Ye., Creating Personalized Head Models from Image Series, Proc. of the 13th Int. Conf. on Comput. Graphics GraphiCon'2003, Moscow, 2003.

27. Lempitskii, V., Ivanov, D., and Kuzmin, Ye., Texturing Calibrated Head Model from Images, Proc. of EuroGraphics'2002, 2002, pp. 281–288.

28. Tarini, M., Yamauchi, H., Haber, J., and Seidel, H.-P., Texturing Faces, Proc. of Graphics Interface 2002, 2002, pp. 89–98.


29. Parke, F.I., A Parametric Model for Human Faces, PhD Dissertation, Salt Lake City: University of Utah, 1974.

30. Chadwick, J.E., Haumann, D.R., and Parent, R.E., Layered Constructions for Deformable Animated Characters, in Computer Graphics (Proc. of SIGGRAPH'89 Conf.), ACM SIGGRAPH, 1989, vol. 23, pp. 243–252.

31. Nahas, M., Huitric, H., and Saintourens, M., Animation of a B-spline Figure, Visual Comput., 1988, vol. 3, no. 5, pp. 272–276.

32. Lavagetto, F. and Pockaj, R., The Facial Animation Engine: Towards a High-Level Interface for the Design of MPEG-4 Compliant Animated Faces, IEEE Trans. Circuits Systems Video Techn., 1999, vol. 9, no. 2.

33. Haber, J., Kähler, K., Albrecht, I., Yamauchi, H., and Seidel, H.-P., Face to Face: From Real Humans to Realistic Facial Animation, Proc. Israel–Korea Binational Conf. on Geometrical Modeling and Comput. Graphics 2001, 2001, pp. 73–82.

34. SNHC, Information Technology - Generic Coding of Audio-Visual Objects, Part 2: Visual, ISO/IEC 14496-2, Final Draft of International Standard, Version of November 13, 1998, ISO/IEC JTC1/SC29/WG11 N 2502a, Atlantic City, 1998.

35. Tekalp, A.M. and Ostermann, J., Face and 2-D Mesh Animation in MPEG-4, Image Communication J., Tutorial Issue on the MPEG-4 Standard; http://leonardo.telecomitalialab.com/icjfiles/mpeg-4_si/8-SNHC_visual_paper.htm.

36. The Virtual Reality Modeling Language, http://www.web3d.org/Specifications/VRML97/.

37. Kuriakin, V., Firsova, T., Martinova, E., Mindlina, O., and Zhislina, V., MPEG-4 Compliant 3D Face Animation, 11th Int. Conf. on Comput. Graphics GraphiCon'2001, Nizhni Novgorod, 2001, pp. 54–58.

38. Martinova, E., Recognition and Synthesis of High-Level MPEG-4 Facial Animation Parameters, Proc. of the 6th Int. Conf. on Comput. Graphics and Artificial Intelligence 3IA'2003, Limoges, France, 2003.

39. Firsova, T. and Zhislina, V., MPEG-4 Compliant Talking Heads, Contour Based Approach, Proc. of the 6th Int. Conf. on Comput. Graphics and Artificial Intelligence 3IA'2003, Limoges, France, 2003.

40. Fedorov, A., Firsova, T., Kuriakin, V., Martinova, E., Mindlina, O., Rodyushkin, K., and Zhislina, V., Talking Head: Synthetic Video Facial Animation in MPEG-4, 13th Int. Conf. on Comput. Graphics and Vision GraphiCon'2003, Moscow, 2003, pp. 37–41.

41. Firsova, T., Ivanov, D., Kuriakin, V., Martinova, E., Rodyushkin, K., and Zhislina, V., Life-like MPEG-4 3D "Talking Head" (beyond standard), Proc. of the 5th Int. Conf. on Comput. Graphics and Artificial Intelligence 3IA'2002, Limoges, France, 2002.

42. Ivanov, D., Reconstruction of Structured 3D Models from Images and Video, Proc. of the 4th All-Russian Symp. on Applied and Industrial Math., 2003.

43. Ivanov, D., Lempitskii, V., and Kuzmin, Ye., 3D Talking Heads: Personalized and Animated, NORSIGD Info (the magazine of the Norwegian Computer Society), 2003, 1/03, pp. 4–8.