Adaptation of virtual human animation and representation for MPEG


Thomas Di Giacomo*, Chris Joslin, Stéphane Garchery, HyungSeok Kim, Nadia Magnenat-Thalmann

MIRALab, University of Geneva, C.U.I., 24 rue General Dufour, Geneva-41211, Switzerland

*Corresponding author. Tel.: +41-22-379-7618; fax: +41-22-379-7780. E-mail address: [email protected] (T. Di Giacomo).

Abstract

While level of detail (LoD) methods for the representation of 3D models are efficient and established tools to manage the trade-off between rendering speed and quality, LoD for animation has not yet been studied intensively by the community, and virtual human animation in particular has received little attention. Animation, a major step towards immersive and credible virtual environments, involves heavy computation; as such, its complexity must be controlled before it can be embedded into real-time systems. Today, such control becomes even more critical and necessary with the emergence of powerful new mobile devices and their increasing use for cyberworlds. With the help of suitable middleware solutions, executables are becoming more and more multi-platform. However, the adaptation of content to various network and terminal capabilities, as well as to different user preferences, is still a key feature that needs to be investigated. It would enable the adoption of the "Multiple Target Devices, Single Content" concept for virtual environments and would, in theory, make such virtual worlds possible under any conditions without the need for multiple versions of the content. It is on this issue that we focus, with a particular emphasis on 3D objects and animation. This paper presents theoretical and practical methods for adapting a virtual human's representation and animation stream, both for skeleton-based body animation and for deformation-based facial animation; we also discuss practical details of integrating our methods into the MPEG-21 and MPEG-4 architectures.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Multi-resolution animation and representation; Adaptation; Graphics standardization


1. Introduction

Since the invention of the computer, content has been tailored towards specific devices, mainly by hand. Computer games have been developed with the capabilities of specific computers in mind, videos of various sizes have been produced so that users can select the one most likely to run best, and different formats have been provided so that content can run on different types of machines. In recent years, this trend of multiple devices, multiple content (MDMC) has slowly been shifting towards multiple devices, single content (MDSC), to the great advantage of both the content provider and the end user. Firstly, only a single (usually high-quality) version of the content needs to be provided for an entire suite of devices; secondly, the user is provided with content that is optimised to fit not only the requirements of the device, but also the network and the user's own, often specific, preferences. This highly motivating goal can only be achieved within a standardisation framework, in this case the Moving Picture Experts Group (MPEG), mainly because of the extensibility and range of applications to which it can be taken.


Though early work in MPEG-7, e.g. that proposed by Heuer et al. [1], adapts certain types of content, this kind of MDSC concept is being constructed under the framework of Digital Item Adaptation (DIA): a Digital Item is the MPEG name given to any type of media or content, and "adaptation" because the content is adapted, rather than changed, from its original form into something tailored towards the context of the user and the user's equipment. DIA is part of the MPEG-21 framework [2] for content control, and is in its final stages of standardisation.

In this paper, we discuss DIA in relation to virtual humans, and in this particular case their animation, as it is the main influence on any session involving avatars. Representation is also discussed, but merely as a transition from the general conditions in which it currently exists towards the aforementioned context of standardisation. Here we are mainly concerned with adapting animation, under a variety of different conditions, to a specific context. The paper is split into four major sections: Section 2 describes related work, focusing especially on other types of context-based adaptation for both representation and animation. Section 3 introduces the adaptation of representation and of both Face/Body Animation (FBA), see Preda et al. [3], and Bone-Based Animation (BBA), see Preda et al. [4], together with the high-level adaptation schema and the methodology. Section 4 provides some preliminary results, and Section 5 concludes with an overview of the future work planned over the next development period.


2. Related work

This section presents relevant methods for the adaptation of the body and facial animation of virtual humans. We first discuss work on adapting the representation of these 3D objects, then methods for adaptable animation, and finally approaches related to scalability within MPEG and various attempts at adapting 3D applications.

In computer graphics, there are many methods, such as the one proposed by Hoppe [5], for creating LoD representations, also known as multi-resolution models, of virtual objects. Basically, they consist of refining or simplifying the polygonal mesh, according to criteria such as the distance of the object from the camera, to save computation during rendering and/or to meet timing requirements. Garland et al. [6] also propose a method based on error quadrics to simplify the surfaces of 3D objects. Some of these geometrical methods are even geared specifically towards virtual humans: for instance, Fei et al. [7] propose LoD for virtual human body meshes and Seo et al. [8] for virtual human face meshes. The main consideration for virtual humans, compared to other objects, is to retain a sufficient number of polygons at the joints (or near the control points for facial animation). Another approach to rendering numerous virtual humans in real time on different devices is to convert 3D objects to 2D objects. This procedure is also referred to as transmoding within MPEG, and such methods have been proposed by Aubel et al. [9] with impostors and by Tecchia et al. [10].

Concerning methods to adapt the animation itself, most work has been done in the field of physically based animation. Adaptive techniques, such as those proposed by Wu et al. [11] using a progressive mesh, and by Debunne et al. [12] combining it with an adaptive time-step, reduce the processing time required for animation. Capell et al. [13] also use a multi-resolution hierarchical volumetric subdivision to simulate dynamic deformations with finite elements. Hutchinson et al. [14] refine mass-spring systems if certain constraints on the system are violated. James et al. [15] propose to animate deformations with "dynamic response textures" in a system called DyRT, with the help of precomputations. Adaptation for non-physically based animation has received little attention until now. However, some relevant methods have been proposed: Granieri et al. [16] adapt the sampling frequency of virtual human motions to be played back, and support a basic reduction of the degrees of freedom of virtual humans. For natural scenes, Di Giacomo et al. [17] present a framework to animate trees with the use of a LoD for the animation, and Guerraz et al. [18] animate prairies with three different animation models as three different possible levels.

A specific adaptation of an MPEG file is also possible, depending on the media type. While scalable video is an intensively studied topic, such as the work of Kim et al. [19], as is scalable audio, such as the method proposed by Aggarwal et al. [20], a progression towards graphics is only slowly being initiated, and only by a few, such as Van Raemdonck et al. [21] and Boier-Martin [22]. This is probably even more the case for animation, which, to our knowledge, is mainly investigated in the work of Joslin et al. [23]. Though it is slightly outside the scope of this paper, there is an important point to be mentioned for DIA, namely the impact of the adaptation. Quality of Service (QoS) is a tool to estimate such an influence; for instance, Pham Ngoc et al. [24] propose a QoS framework for 3D graphics. Yet there is still a need for a dedicated adaptation QoS for 3D animation, to evaluate the impact on aesthetic quality and the usability of adaptation for animation and specific 3D objects, e.g. virtual characters.

Finally, some research is being done on specifying adaptable architectures for 3D applications, though most of them are not geared towards multiple scenarios. For instance, Schneider et al. [25] integrate, in their Network Graphics Framework, various transmission methods for downloading 3D models in a client-server environment; a particular transmission method is selected by comparing the quality of the models and the performance of the network. Lamberti et al. [26] propose a platform with accelerated remote rendering on a cluster, which transmits the images to a PDA; user navigation is then calculated on the PDA side and transmitted back to the server. Optimizations for particular devices are also being investigated: not to mention all the methods for standard desktop PC graphics boards, dedicated optimizations, such as those by Kolli et al. [27] for the ARM processor, are explored for various mobile and new devices. Another way to adapt 3D applications to different devices is to directly adapt the underlying model of standard, well-known computer graphics techniques. One of the most relevant examples is the use of image-based rendering techniques by Chang et al. [28]. Another example is the work by Stam [29], demonstrating stable fluids on a PDA with consideration for fixed-point arithmetic, owing to the lack of an FPU on these devices.

Most of the previously described methods perform their respective tasks well; however, we believe there is currently no global and generic architecture to adapt the content, representation and animation of virtual humans to network and target client capabilities, as well as to user preferences. This is the focus of our work.


3. Adaptation of content

3.1. Introduction

The adaptation of content based on MPEG-21 Digital Item Adaptation [30] appears quite complex at first, but is actually quite simple and allows for extremely practical applications. The principles of adaptation are based on XML schemas called the Bit Stream Description Language (BSDL) and its generic form, the generic Bit Stream Description Language (gBSDL), introduced by Amielh et al. [31,32]. The idea is that the codec is described using these schemas. BSDL uses a codec-specific language, meaning that the adaptation engine needs to understand that language and use a specific XML style sheet (explained later) in order to adapt the bitstream. gBSDL uses a generic language, which means that any adaptation engine can transform the bitstream without a codec-specific style sheet (i.e. multiple style sheets are not required for each adaptation).

The adaptation works on multiple levels and is very flexible. The Bit Stream Description (BSD) is basically an XML document that describes the bitstream at a high level (i.e. not on a bit-by-bit basis). It can contain either the bitstream itself, represented as hexadecimal strings, or URI links to the bitstream in another file (usually the original file). This BSD is generated using a Binary to Bitstream Description engine (more commonly, BintoBSD), as shown in Fig. 1.

Fig. 1. Binary to bitstream description conversion.

Once the bitstream is in BSD format, it can be adapted using an XML style sheet, which contains information on how to adapt the XML document according to a set of rules passed by the adaptation engine. This adaptation basically removes elements from the XML document according to the adaptation schema (described in the following sections). During this stage the header might be changed in order to take account of the elements that were removed from the bitstream; for example, the initial mask might indicate the presence of all elements, and would be adapted to indicate which elements remain after adaptation.

The XML document is then parsed by a BSDtoBin converter, which takes the XML document and converts it back into a bitstream, as shown in Fig. 2. In general, the header is converted back from its human-readable form directly into its binary representation, and the remaining elements of the payload are assembled back into the binary stream following the header (using either the URI data or the hexadecimal strings embedded in the BSD).

Fig. 2. Bitstream description to binary.

To overview this at the bitstream level, Fig. 3 provides a syntactic perspective on the entire process. Whilst the header might also be adapted, in general most of it is retained, as it contains the information outlining the format of the rest of the packet. In the following sections we present adaptation schemas for both body and face animation.

Fig. 3. Adaptation process at bitstream level.
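To make the pipeline concrete, the following Python sketch plays the two engine roles described above on a simplified, hypothetical gBSDL-like description; the element and attribute names (unit, zone, start, length) are our own illustrative choices, not the normative MPEG-21 syntax.

```python
# A minimal sketch of the style-sheet and BSDtoBin roles, under the
# assumptions stated above (hypothetical element/attribute names).
import xml.etree.ElementTree as ET

def adapt_bsd(bsd_xml: str, keep_zones: set) -> str:
    """Style-sheet role: drop <unit> elements outside the requested zones."""
    root = ET.fromstring(bsd_xml)
    for parent in root.iter():
        for unit in list(parent):  # copy, since we mutate while iterating
            if unit.tag == "unit" and unit.get("zone") not in keep_zones:
                parent.remove(unit)
    return ET.tostring(root, encoding="unicode")

def bsd_to_bin(bsd_xml: str, source: bytes) -> bytes:
    """BSDtoBin role: reassemble the adapted binary stream."""
    root = ET.fromstring(bsd_xml)
    out = bytearray()
    for unit in root.iter("unit"):
        if unit.text and unit.text.strip():        # embedded hex payload
            out += bytes.fromhex(unit.text.strip())
        else:                                      # URI-style reference into the source
            start, length = int(unit.get("start")), int(unit.get("length"))
            out += source[start:start + length]
    return bytes(out)
```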

3.2. Virtual human adaptation

3.2.1. Introduction

As the schema is used to define the bitstream layout at a high level, it must basically represent a decoder in XML format. This does not mean that it will decode the bitstream, but the structure of the bitstream is important. It also means that the adaptation methods must be in line with the lowest level defined in the schema. For example, if the adaptation needs to skip frames, and as MPEG codecs are byte-aligned, it is practical in this case to search for the "start code" of each frame; a frame can then be dropped without needing to understand most of the payload (with the possible exception of updating the frames-skipped section of the header, though this could be avoided). However, as will be seen in the following sections, the schema is based on a much lower level in order to be more flexible.
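As an illustration of the start-code approach, the following Python sketch drops every other frame of a byte-aligned stream. The 2-byte marker is purely hypothetical (real FBA start codes differ), and a real adapter would also update the header as noted above.

```python
# A minimal sketch of frame dropping on a byte-aligned stream,
# assuming a hypothetical 2-byte frame start code.
FRAME_START = b"\xff\xf1"  # illustrative marker, not the real FBA code

def drop_every_other_frame(stream: bytes) -> bytes:
    header, *frames = stream.split(FRAME_START)
    kept = frames[::2]  # keep one frame out of two
    return header + b"".join(FRAME_START + f for f in kept)
```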

3.2.2. Schema

In terms of the bitstream level, the schema is defined at a low level in order to take into account the adaptations described in Sections 3.3 and 3.4. This is necessary because the FBA codec is defined using groups (which basically define limbs and expressive regions), but these are not sufficiently marked to identify the payload. This does not pose a serious problem, as the schema generally remains on the server side (i.e. its size is not a problem); however, it does take longer to process and is obviously more prone to error.

3.2.3. Scalability of the FBA codec

As a final discussion, we describe the scalability of the FBA codec. On closer examination it can be seen that, although the codec performs well in its general operation, coding face and body movements and even expressions, there are several shortcomings, mainly in the area of scalability. These rest mainly in the grouping of the parameters: the grouping is well suited to body and face interest zones, detailed in Sections 3.4.3 and 3.5.2, but not to Levels of Articulation (LoA), which would be just as useful. This means that the schema must be defined at quite a low level in order for it to adapt properly. In addition, the codec defines each frame with quite a high degree of independence, which means that the coding scheme can be swapped on practically a frame-by-frame basis; in fact, each region of the face can be coded differently (some as expressions, and some as visemes). This makes the codec quite flexible in its approach, but again quite impractical for adaptation, as the schema must account for this diversity.

3.3. Shape representation

The adaptation operation should be simple enough to be implemented in a lightweight environment. There has been much research on multi-resolution shape representation [33]; so far, the proposed representations either consume too much space or require relatively complex decoding mechanisms. In this work we devise a simple representation that conforms to the current standards by clustering all the data, so that a specific complexity can be obtained by simply choosing a set of clusters.

Starting from the complex mesh $M_n$, the model is sequentially simplified into $M_{n-1}, \ldots, M_1, M_0$. By following this sequence, we can identify the set of vertices and faces that are removed from the mesh of level $i$ to make the mesh of level $i-1$, denoted by $C(i)$, where $M_0 = C(0)$. There is also a set of vertices and faces newly generated by the simplification, denoted by $N(i)$. By the properties of the simplification, $N(i)$ is guaranteed to be a subset of the union of the $C(j)$ for all $i > j$. Using this property, each cluster $C(i)$ is sub-clustered into a set of $C(i,j)$, which belong to $N(j)$ where $j > i$, and $C(i,i)$, which does not belong to any $N(j)$. Thus, the level-$i$ mesh is represented by the following equation, which requires only simple set selections over the clusters $C(i,j)$:

$$M_i = \sum_{k=0}^{i} \left( C(k,k) + \sum_{j=i+1}^{n} C(k,j) \right). \qquad (1)$$

The clusters are ordered so that each level requires only a small number of selections; Fig. 4 shows an ordered example of clusters. The representation contains no explicit redundancies, and an adaptation process is simply a selection of part of the stream. The streaming and decoding process can be stopped at almost any cluster, except the clusters of $C(0)$, and it is guaranteed that at least one LoD can be composed from the current subset of clusters. If more of the stream is processed, a higher level is provided.

Fig. 4. Illustration of clusters.
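As an illustration, Eq. (1) translates into a simple set selection. The sketch below assumes the sub-clusters are stored in a dictionary keyed by $(k, j)$; this storage layout is our own assumption, not part of the representation itself.

```python
# A minimal sketch of the cluster selection of Eq. (1): C[(k, k)] is the
# sub-cluster belonging to no N(j), and C[(k, j)], j > k, is the
# sub-cluster of C(k) belonging to N(j).
def select_clusters(C: dict, i: int, n: int) -> list:
    """Return the clusters whose union composes the level-i mesh M_i."""
    selected = []
    for k in range(i + 1):                 # k = 0 .. i
        selected.append(C[(k, k)])
        for j in range(i + 1, n + 1):      # j = i+1 .. n
            if (k, j) in C:                # not every sub-cluster exists
                selected.append(C[(k, j)])
    return selected
```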

In the mesh, there are other properties that have to be taken into account, such as normals, colors, and texture coordinates. Based on the mesh structure, we make a unique mapping from each pair of vertex and face to a property value. By assigning the property to the lowest-level vertex and its corresponding clusters, property values can be represented within the clusters. In real applications, it is often not necessary to generate all levels of detail: differences of a few polygons do not usually make a significant difference either in performance or in quality. Using the proposed representation, the modeling system is able to generate a set of levels in any combination, depending on the model and the application.

3.4. Body animation

3.4.1. Introduction

Body animation is commonly performed by skeleton-based animation, where the hierarchical structure of the bones and joints allows a rather straightforward notion of level of complexity, which basically corresponds to the depth in this hierarchy. The H-Anim group specifies some basic LoA [34], see Table 1, which we extend with modifications and new methods in the following sections.

Table 1
H-Anim LoA profiles

Level   Brief description     No. of joints
0       Minimum               1
1       Low-end/real-time     18
2       Simplified spine      71
3       Full hierarchy        89

3.4.2. Modified LoA

Following the H-Anim specification, we redefine the LoA as shown in Table 2; the new profiles are linked to the original ones, but are more suitable for our global adaptation architecture. For instance, our lowest level includes the shoulders and hips, to enable minimalist motions even at this level, and our medium level is a stronger simplification of the spine compared to the primary H-Anim definition.

Table 2
Newly defined LoA profiles

Level      Brief description     No. of joints
Very low   Minimum               5
Low        Low-end/real-time     10
Medium     Simplified spine      35
High       Full hierarchy        89

The detailed definition of our new LoA is given in Table 3, which lists the animatable joints at each level.

Table 3
Definitions of the new LoA profiles

Name          Definition
LoA High      All joints are animatable.
LoA Medium    Fingers, toes, and spine joints are each composed into a single joint.
LoA Low       The only animatable joints are those of the Very Low profile, plus the neck joint, the elbow joints and the knee joints.
LoA VeryLow   The only animatable joints are the shoulders, the root, and the hip joints.

Such levels of complexity are used to select an adapted LoD for the animation of the virtual human.


For instance, some virtual humans that are part of an audience in a theatre do not require a high LoA, so using LoA VeryLow is enough, while the main character on stage, who is moving, talking and the focus of the scene, requires a high LoA (thus using LoA Medium or High). Though the gain is not so significant for a single virtual human, this feature provides efficient control over the animation of a whole crowd, and/or over adapting animations according to terminal and network capabilities.
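As an illustration, filtering a motion frame down to a given LoA reduces to a joint-set lookup. The sketch below follows Table 3, with illustrative joint names standing in for the full H-Anim list.

```python
# A minimal sketch of LoA-based joint filtering following Table 3;
# the joint names are illustrative stand-ins, not the H-Anim list.
LOA_JOINTS = {
    "VeryLow": {"root", "l_shoulder", "r_shoulder", "l_hip", "r_hip"},
}
LOA_JOINTS["Low"] = LOA_JOINTS["VeryLow"] | {
    "neck", "l_elbow", "r_elbow", "l_knee", "r_knee",
}

def filter_frame(frame: dict, loa: str) -> dict:
    """Keep only the joint rotations animatable at the requested LoA."""
    keep = LOA_JOINTS[loa]
    return {joint: rot for joint, rot in frame.items() if joint in keep}

# e.g. for background characters in the audience:
# reduced = filter_frame(full_frame, "VeryLow")
```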

3.4.3. Body regions

The main purpose of defining interest zones is to allow the user to select the parts they are most interested in; the elementary zones are listed in Table 4.

Table 4
Elementary LoA zones

Zone   Brief description     No. of joints
1      Face only             2
2      Right and left arms   49
3      Torso                 25
4      Pelvis                4
5      Legs                  4
6      Feet                  4

For instance, if the virtual human is presenting a complete script on how to cook, but the user is only interested in listening to the presentation (and not in the motions), then the animation can be adapted to either the upper body, or even just the face and the shoulders. Another example would be a tourist guide pointing at several locations such as theatres or parks; here the user may only wish to focus on the arms and hands of the presenter.

The lower-level interest zones, consisting of the six zones shown in Fig. 5, can be composed into higher-level ones by two different methods:

* Predefined zones: a set of higher-level zones predefined as combinations of elementary zones, as described in Table 5.
* Mask: a value representing which elementary zones are active. The mask is a binary accumulator that can take different values; a few of them, those corresponding to the predefined zones, are presented in Table 6 (see also the sketch after the tables).

Table 5
Predefined zones and their correspondence with elementary zones

Name                    Definition
LoAZone All             1+2+3+4+5+6
LoAZone Face            1
LoAZone FaceShoulders   1+2+3
LoAZone UpperBody       1+2+3+4
LoAZone LowerBody       4+5+6

Table 6
Masks to select elementary zones

Predefined   Mask   No. of joints
All          0      94
Face         1      1
Shoulders    7      77
UpperBody    15     82
LowerBody    45     17

Fig. 5. Body interest zones.
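As an illustration, the mask can be composed as follows, assuming that bit k-1 encodes elementary zone k of Table 4; this bit assignment is our own assumption, and it reproduces the Face, FaceShoulders and UpperBody values of Table 6.

```python
# A minimal sketch of the zone mask as a binary accumulator,
# assuming bit (k - 1) encodes elementary zone k of Table 4.
def zone_mask(zones) -> int:
    """Compose a mask from elementary zone numbers (1..6)."""
    mask = 0
    for z in zones:
        mask |= 1 << (z - 1)
    return mask

assert zone_mask([1]) == 1            # Face
assert zone_mask([1, 2, 3]) == 7      # FaceShoulders
assert zone_mask([1, 2, 3, 4]) == 15  # UpperBody
```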

3.5. Face animation

3.5.1. Introduction

The body LoA is based on a skeleton. For the face we have no such underlying structure; instead, we define different areas of the face and different levels of detail for each of these areas. The different parts of the face, shown in Fig. 6, are specified first; we then explain in detail the different LoA values for the face, before introducing the possible applications of these definitions according to the platform or system used.

3.5.2. Face regions

The face is segmented into different interest zones. This segmentation makes it possible to group the Facial Animation Parameters (FAPs) according to their zone of influence with regard to deformation. For each zone we also define different levels of complexity, as shown in Fig. 6. The zone segmentation is based on the MPEG-4 FAP grouping, although we have merged the tongue, inner lips and outer lips into a single group, because their displacements are strongly linked. The zones are defined as shown in Table 7.

Table 7
Defined facial regions/zones

Group/Zone   Contains
0            Jaw (jaw, chin, lips and tongue)
1            Eyeballs (eyeballs and eyelids)
2            Eyebrows
3            Cheeks
4            Head rotation
5            Nose
6            Ears

Fig. 6. Face zones of interest.

Note that the LoA does not influence the head rotation value. Depending on the user's interest, we can further increase or reduce the complexity in specific zones. For example, an application based on speech requires precise animation around the lips and less in other parts of the face. In this case, we can reduce the complexity to a very low level for the zones containing the eyeballs, eyebrows, nose and ears, use a medium level for the cheek zone, and a very high level for the jaw zone.
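As an illustration, the speech example above translates into a per-zone configuration. The sketch below assumes a lookup table from (zone, LoA) pairs to FAP indices, built from the groupings of Section 3.5.3; that table, and its exact contents, are our own assumption.

```python
# A minimal sketch of per-zone complexity for the speech-driven example;
# zone numbers follow Table 7, and fap_table is an assumed lookup built
# from the groupings of Section 3.5.3.
ZONE_LOA = {
    0: "High",     # jaw (chin, lips, tongue): precise speech animation
    1: "VeryLow",  # eyeballs
    2: "VeryLow",  # eyebrows
    3: "Medium",   # cheeks
    4: "High",     # head rotation (not influenced by the LoA)
    5: "VeryLow",  # nose
    6: "VeryLow",  # ears
}

def kept_faps(fap_table: dict) -> list:
    """Collect the FAP indices kept under the per-zone configuration."""
    faps = []
    for zone, loa in ZONE_LOA.items():
        faps.extend(fap_table[(zone, loa)])
    return faps
```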

3.5.3. Levels of complexity

For the face, we define four levels, as shown in Table 8, with a direct hierarchy like the one defined for the body, but based on the influence of the FAPs according to the desired complexity or user preferences. Two different techniques are used to reduce the level of complexity. The first consists of grouping FAP values together (e.g. all upper-lip values can be grouped into one value). To select which FAPs should be grouped, we defined the following constraints:

1. All grouped FAP values must be in the same area.
2. All grouped FAP values must be under the influence of the same FAP units.
3. When two FAP values or groups are grouped by symmetry, the controlling FAP value is the one on the right side of the face.

Table 8
Levels of articulation for the face

Name          Definition
LoA High      All FAP values.
LoA Medium    Equivalent FAP values are grouped together two by two.
LoA Low       Less important FAP values are suppressed, and FAPs or LoA Medium groups are grouped together.
LoA VeryLow   Unimportant facial animation parameters are suppressed, and full symmetry of the face parameters is added.

The second technique is more destructive to the overall set of values, meaning that quality is reduced more rapidly: after a certain distance from the viewing camera, some FAP values become insignificant (e.g. at the low level we remove the FAP values pertaining to the dilation of the pupils, since this deformation becomes invisible beyond a short distance).
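As an illustration of the grouping technique under constraint 3, the sketch below drops left-side FAP values from the stream and mirrors the right-side (controlling) values back on the client. The two pairs shown use standard MPEG-4 FAP names but are only illustrative; a full table would cover every symmetric pair.

```python
# A minimal sketch of reduction by symmetric grouping (constraint 3):
# the right-side FAP drives both sides, so the left-side value can be
# dropped from the stream and restored on the client.
SYMMETRIC_PAIRS = {
    "stretch_l_cornerlip": "stretch_r_cornerlip",
    "raise_l_i_eyebrow": "raise_r_i_eyebrow",
}

def group_symmetric(faps: dict) -> dict:
    """Server side: drop left-side values; the controlling FAP remains."""
    return {name: v for name, v in faps.items() if name not in SYMMETRIC_PAIRS}

def restore_symmetric(faps: dict) -> dict:
    """Client side: mirror the controlling value back onto the left FAP."""
    restored = dict(faps)
    for left, right in SYMMETRIC_PAIRS.items():
        if right in restored:
            restored[left] = restored[right]
    return restored
```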

* LoA High: uses all FAP values, i.e. 2 high-level FAP values (never reduced, because they already adapt to the level of complexity) and 66 low-level FAP values.
* LoA Medium: groups certain FAP values together in order to maintain a high LoD. At this level, after regrouping, we obtain 44 FAP values rather than the maximum of 68, a reduction of over 35%.
* LoA Low: removes many of the FAP values that become unimportant for face animation, and continues to group FAP values together. After regrouping and deleting FAP values, 26 FAP values remain out of the maximum of 68, a reduction of 62%.
* LoA VeryLow: removes most FAP values and concentrates on the base values necessary for minimal animation. This level represents the minimum set of parameters required to animate a face: we mostly link symmetric LoA Low parameters together and remove all unimportant values. The number of FAP values is reduced to 14, an overall reduction of 79%.

3.5.4. Levels of articulation: overview

Fig. 7 shows a global view of all FAP values and the links between them for each LoA profile. Table 9 shows, for each LoA, how many FAP values drive each part of the face, and Fig. 8 shows the overall reduction of FAP values.

Table 9
No. of FAP values against LoA profile

                High   Medium   Low   V. Low
High level      2      2        2     2
Jaw             31     19       12    6
Eyeballs        12     8        4     1
Eyebrows        8      3        2     1
Cheeks          4      4        1     1
Head rotation   3      3        3     1
Nose            4      3        0     0
Ears            4      2        2     0
Total           68     44       26    14
Ratio           100%   65%      38%   21%

Fig. 7. FAP grouping.

Fig. 8. Graph showing the reduction of FAP values.

3.5.5. Frame rate reduction

Depending on the capability of the target platform to reproduce exact and complex face animation, we can also reduce the frame rate in order to reduce the network and animation load; this is done for each LoA, as shown in Table 10. When different LoAs are used for the different zones of interest, we assume the maximum of the corresponding frame rates. For each frame we assume a bitstream of less than 1 Kbit.

Table 10
Frame rate reduction

              Frame rate (frames/second)
LoA High      25
LoA Medium    20
LoA Low       18
LoA VeryLow   15

3.5.6. Advantages of adaptation

According to the simplifications explained above, we can transmit the minimum facial information appropriate to the context. After simplification, the FAP stream remains compatible with the MPEG-4 specification, but with a reduced size. On the client, there are then two possibilities.


The first consists of decoding the stream and reconstructing all FAP values according to the LoA rules; in this case, we have only reduced the bitstream. The second also consists of simplifying the deformation itself. Most MPEG-4 facial animation engines compute a deformation for each FAP value and then combine them, and on mobile platforms, the fewer computations made during animation, the better. Rather than simply applying the FAP stream to the deformation engine, computing 66 deformation areas and combining them, we can directly design the deformation areas for each level of complexity. This regrouping can be done automatically during the pre-processing step, or be transmitted with the model. In the very low LoD case, only 11 FAP values are included in the linear deformation to be combined on the face, rather than 58.


4. Results

Table 11 shows the results for the multi-resolution model representation. The progressive mesh (PM) approach is known as a near-optimal method for its compactness in size, while the discrete mesh is a set of discrete levels, still quite common in real-world applications. The numbers are the sizes of the VRML and BIFS files, respectively; since a PM cannot be encoded in BIFS format, only the approximate size of the text file is given. At the highest detail the models have 71K and 7K polygons respectively, while at the lowest detail they have 1K and 552 polygons; the models are constructed with five different levels. The proposed method falls in between these approaches, and is flexible and simple enough to allow adaptation with a relatively small file size. More importantly, it can be transmitted via standard MPEG streams, and it uses a simple adaptation mechanism, very similar to the simplest discrete level selection.

Table 11
Size of data (bytes)

             # of polygons (K)   Original model (M)   Proposed method (M)   Progressive mesh (M)   Discrete mesh (M)
Body model   71                  12.7/4.5             17.5/6.5              ≈13                    51.6/12.0
Face model   7                   0.8/0.3              1.0/0.5               ≈0.9                   3.9/1.6

To verify the effectiveness of the adaptation in terms of network performance, we performed some specific comparisons using several standard animation sequences (Table 12).

We ran several experiments with different compressed FAP files (using the same frame rate and compression methods) to estimate the influence of the LoA on a complete animation, comparing the overall sizes (Fig. 9).

Table 12
Overall data size comparison

           Wow     Baf     Face23   Macro   Lips
Size
High       21255   32942   137355   18792   44983
Medium     16493   24654   104273   14445   36532
Low        13416   20377   78388    11681   30948
Very low   12086   17033   70498    8788    29322
Ratio (%)
High       100%    100%    100%     100%    100%
Medium     78%     75%     76%      77%     81%
Low        63%     62%     57%      62%     69%
Very low   57%     52%     51%      47%     65%

Fig. 9. Facial results with no symmetry: (a) LoA High, (b) LoA Medium, (c) LoA Low, (d) LoA VeryLow.


Overall, the profiles give a mean size of 77% for the medium-level profile, 63% for the low profile, and 55% for the very low profile. All of these files except the last one contain animation for all parts of the face, and for each of them we obtain a similar reduction factor. For the last file, which only animates the lips, we observe a smaller reduction factor due to the absence of FAP suppression. Table 13 reports the corresponding computation times.

Table 13
Overall computation time comparison

           Wow    Baf    Face23   Macro   Lips
Time
High       4.17   6.10   25.04    4.08    10.90
Medium     3.56   5.23   21.23    3.54    9.81
Low        3.09   4.56   17.58    3.10    8.81
Very low   2.87   4.11   15.98    2.60    8.40
Ratio (%)
High       100%   100%   100%     100%    100%
Medium     85%    86%    85%      87%     90%
Low        74%    75%    70%      76%     81%
Very low   69%    67%    64%      64%     77%

5. Conclusion and future work

In conclusion, we have presented a completely new area of adaptation within the MPEG framework. We have shown that, whilst in most media types the adaptation process is already quite extensive, virtual humans offer an even greater multitude of adaptations and a variety of context-based situations.

We are currently concentrating on applying this work in an entire system including video, audio, 2D and 3D graphics, within a European project called ISIS. In addition, we are also developing a new codec that is much more scalable and therefore better suited for adaptation; this will also include a better integration of body and face animation in the context of virtual humans (which are currently quite separate). As discussed in Section 3.2.3, the current codec does not provide much flexibility in terms of adaptation; hence we are working on a similar codec that will provide the same kind of framework as the current one, but with a better structure for adaptation.

Acknowledgements

This research has been funded through the European Project ISIS (IST-2001-34545) by the Swiss Federal Office for Education and Science (OFES). The authors would like to thank T. Molet for his consultation.


References

[1] Heuer J, Casas J, Kaup A. Adaptive multimedia messaging based on MPEG-7: the M3-Box. Proceedings of the Second International Symposium on Mobile Multimedia Systems & Applications, 2000. p. 6-13.
[2] MPEG-21, ISO/IEC 21000-7 Committee Draft, ISO/IEC JTC1/SC29/WG11/N5534, March 2003.
[3] Preda M, Preteux F. Critic review on MPEG-4 face and body animation. Proceedings of the IEEE International Conference on Image Processing (ICIP), 2002.
[4] Preda M, Preteux F. Advanced animation framework for virtual characters within the MPEG-4 standard. Proceedings of the IEEE International Conference on Image Processing, September 2002.
[5] Hoppe H. Progressive meshes. SIGGRAPH, 1996. p. 99-108.
[6] Garland M, Heckbert P. Surface simplification using quadric error metrics. SIGGRAPH, 1997.
[7] Fei G, Wu E. A real-time generation algorithm of progressive mesh with multiple properties. Proceedings of the Symposium on Virtual Reality Software and Technology, 1999.
[8] Seo H, Magnenat-Thalmann N. LoD management on animating face models. Proceedings of IEEE Virtual Reality, 2000.
[9] Aubel A, Boulic R, Thalmann D. Real-time display of virtual humans: levels of detail and impostors. IEEE Transactions on Circuits and Systems for Video Technology, 2000.
[10] Tecchia F, Loscos C, Chrysanthou Y. Image-based crowd rendering. IEEE Computer Graphics and Applications 2002:36-43.
[11] Wu X, Downes MS, Goktekin T, Tendick F. Adaptive nonlinear finite elements for deformable body simulation using dynamic progressive meshes. Proceedings of Eurographics EG'01, 2001. p. 349-58.
[12] Debunne G, Desbrun M, Cani MP, Barr A. Dynamic real-time deformations using space & time adaptive sampling. SIGGRAPH, 2001. p. 31-6.
[13] Capell S, Green S, Curless B, Duchamp T, Popovic Z. A multiresolution framework for dynamic deformations. Proceedings of the ACM SIGGRAPH Symposium on Computer Animation, 2002.
[14] Hutchinson D, Preston M, Hewitt T. Adaptive refinement for mass/spring simulations. Proceedings of the Eurographics Workshop on Computer Animation and Simulation, 1996.
[15] James DL, Pai DK. DyRT: dynamic response textures for real-time deformation simulation with graphics hardware. Proceedings of SIGGRAPH, 2002.
[16] Granieri JP, Crabtree J, Badler N. Production and playback of human figure motion for visual simulation. ACM Transactions on Modeling and Computer Simulation, 1995. p. 222-41.
[17] Di Giacomo T, Capo S, Faure F. An interactive forest. Proceedings of the Eurographics Workshop on Computer Animation and Simulation, 2001.
[18] Guerraz S, Perbet F, Raulo D, Faure F, Cani MP. A procedural approach to animate interactive natural sceneries. Proceedings of Computer Animation and Social Agents, 2003.
[19] Kim J, Wang Y, Chang S. Content-adaptive utility-based video adaptation. Proceedings of the IEEE International Conference on Multimedia & Expo, 2003.
[20] Aggarwal A, Rose K, Regunathan S. Compander domain approach to scalable AAC. Proceedings of the 110th Audio Engineering Society Convention, 2001.
[21] Van Raemdonck W, Lafruit G, Steffens E, Otero-Perez C, Bril R. Scalable 3D graphics processing in consumer terminals. IEEE International Conference on Multimedia and Expo, 2002.
[22] Boier-Martin I. Adaptive graphics. IEEE Computer Graphics and Applications 2003:6-10.
[23] Joslin C, Magnenat-Thalmann N. MPEG-4 animation clustering for networked virtual environments. IEEE International Conference on Multimedia and Expo, 2002.
[24] Pham Ngoc N, Van Raemdonck W, Lafruit G, Deconinck G, Lauwereins R. A QoS framework for interactive 3D applications. Proceedings of the Winter School of Computer Graphics, 2002.
[25] Schneider B, Martin I. An adaptive framework for 3D graphics in networked and mobile environments. Proceedings of Interactive Applications on Mobile Computing, 1998.
[26] Lamberti F, Zunino C, Sanna A, Fiume A, Maniezzo M. An accelerated remote graphics architecture for PDAs. Proceedings of the Web3D 2003 Symposium, 2003.
[27] Kolli G, Junkins S, Barad H. 3D graphics optimizations for the ARM architecture. Proceedings of the Game Developers Conference, 2002.
[28] Chang C, Ger S. Enhancing 3D graphics on mobile devices by image-based rendering. Proceedings of the Third IEEE Pacific-Rim Conference on Multimedia, 2002.
[29] Stam J. Stable fluids. SIGGRAPH, 1999. p. 121-8.
[30] MPEG-21 DIA, ISO/IEC JTC1/SC29/WG11/N5612, March 2003.
[31] Amielh M, Devillers S. Multimedia content adaptation with XML. Proceedings of the International Conference on Multimedia Modeling, 2001.
[32] Amielh M, Devillers S. Bitstream syntax description language: application of XML-Schema to multimedia content adaptation. Proceedings of the International WWW Conference, 2002.
[33] Heckbert P, et al. Multiresolution surface modeling. ACM SIGGRAPH Course Notes No. 25, 1997.
[34] H-Anim LoA 1.1 Specification, http://h-anim.org/Specifications/H-Anim1.1/appendices.html, 2000.

Thomas Di Giacomo completed a Master's degree on multi-resolution methods for animation with the iMAGIS lab and the R&D department of Atari (formerly Infogrames). He is now a research assistant and a Ph.D. candidate at MIRALab, University of Geneva. His work focuses on level of detail for animation and on physically based animation.

Chris Joslin obtained his Master's degree from the University of Bath, UK, and his Ph.D. in Computer Science from the University of Geneva, Switzerland. His research has focused on networked virtual environment systems and real-time 3D spatial audio, and he is currently a Senior Research Assistant developing scalable 3D graphics codecs for animation and representation, specifically for MPEG-4 and MPEG-21 adaptation.

Stéphane Garchery is a computer scientist who studied at the Universities of Grenoble and Lyon, in France. He is working at the University of Geneva as a senior research assistant at MIRALab, participating in research on facial animation for real-time applications. One of his main tasks is the development of MPEG-4 facial animation engines, applications, and tools for automatic facial data construction. He has developed different kinds of facial animation engines based on MPEG-4 Facial Animation Parameters for different platforms (stand-alone, web applet and mobile device), as well as tools for quick and interactive facial design.

HyungSeok Kim is a post-doctoral assistant at MIRALab, University of Geneva. He received his Ph.D. in Computer Science in February 2003 at VRLab, KAIST, with the thesis "Multiresolution model generation of texture-geometry for real-time rendering". His main research field is real-time rendering for virtual environments, more specifically multi-resolution modeling of geometry and texture. He is also interested in 3D interaction techniques and virtual reality systems.

Nadia Magnenat-Thalmann has pioneered research into virtual humans over the last 20 years. She obtained several Bachelor's and Master's degrees in various disciplines and a Ph.D. in Quantum Physics from the University of Geneva. From 1977 to 1989, she was a Professor at the University of Montreal in Canada. In 1989, she founded MIRALab at the University of Geneva.