Computers & Graphics ] (]]]]) ]]]–]]]
ARTICLE IN PRESS
*Corresponding author. Tel.: +41-22-379-7618; fax: +41-22-379-7780.
E-mail address: [email protected] (T. Di Giacomo).
0097-8493/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cag.2004.04.004
Adaptation of virtual human animation and representation for MPEG
Thomas Di Giacomo*, Chris Joslin, Stephane Garchery, HyungSeok Kim, Nadia Magnenat-Thalmann
MIRALab, University of Geneva, C.U.I., 24 rue General Dufour, CH-1211 Geneva 4, Switzerland
Abstract
While level-of-detail (LoD) methods for the representation of 3D models are efficient and established tools to manage the trade-off between rendering speed and quality, LoD for animation has not yet been studied intensively by the community, and virtual human animation in particular has received little attention. Animation, a major step towards immersive and credible virtual environments, involves heavy computation and therefore needs control over its complexity to be embedded in real-time systems. Such control becomes even more critical and necessary with the emergence of powerful new mobile devices and their increasing use for cyberworlds. With the help of suitable middleware solutions, executables are becoming more and more multi-platform. However, the adaptation of content to various network and terminal capabilities, as well as to different user preferences, is still a key feature that needs to be investigated. It would ensure the adoption of the "Multiple Target Devices, Single Content" concept for virtual environments, and would in theory make such virtual worlds possible under any conditions without the need for multiple versions of the content. It is on this issue that we focus, with a particular emphasis on 3D objects and animation. This paper presents theoretical and practical methods for adapting a virtual human's representation and animation stream, both for skeleton-based body animation and for deformation-based facial animation. We also discuss practical details of integrating our methods into the MPEG-21 and MPEG-4 architectures.
© 2004 Elsevier Ltd. All rights reserved.
Keywords: Multi-resolution animation and representation; Adaptation; Graphics standardization
1. Introduction
Since the invention of the computer, content has been tailored towards specific devices, mainly by hand. Computer games have been developed with specific computer capabilities in mind, video has been produced at various sizes so that users can select the version most likely to run best, and different formats have been provided so that content can run on different types of machine. In recent years, this
trend of multiple devices, multiple content (MDMC) is slowly shifting towards multiple devices, single content (MDSC), to the great advantage of both the content provider and the end user. Firstly, only a single (usually high quality) piece of content need be provided for an entire suite of devices; secondly, the user is provided with content that is optimised for, and fits, not only the requirements of the device but also the network and the user's own, often specific, preferences. This highly motivating goal can only be achieved within a standardisation framework, in this case that of the Moving Picture Experts Group (MPEG), mainly because of the extensibility and range of applications to which it can be taken. Though early work in MPEG-7, e.g. that proposed by Heuer et al. [1], adapts certain types of content, this kind of
MDSC concept is being constructed under the framework of Digital Item Adaptation (DIA): a Digital Item being the MPEG name for any media type or content, and adaptation because the content is adapted, rather than changed, from its original form into something tailored to the context of the user and the user's equipment. DIA is part of the MPEG-21 framework [2] for content control, and is in its final stages of standardisation.
In this paper, we discuss DIA in relation to virtual humans, and in particular their animation, as it is the main influence on any session involving avatars. Representation is also discussed, but merely as a transition from the general conditions in which it currently exists towards the aforementioned context of standardisation. Here we are mainly concerned with adapting animation, under a variety of different conditions, to a specific context. We split this paper into four major sections: Section 2 describes related work, focusing especially on other types of context-based adaptation for both representation and animation. Section 3 introduces the adaptation of representation and of both Face/Body Animation (FBA), see Preda et al. [3], and Bone-Based Animation (BBA), see Preda et al. [4], along with the high-level adaptation schema and the methodology. Section 4 provides some preliminary results, and Section 5 provides a conclusion and an overview of the future work that we have planned over the next development period.
2. Related work
This section presents relevant methods for the adaptation of the body and facial animation of virtual humans. We first discuss work on adapting the representation of these 3D objects, then methods for adaptable animation, and finally approaches related to scalability within MPEG and various attempts at adapting 3D applications.
In computer graphics there are many methods, such as the one proposed by Hoppe [5], for creating LoD representations, a.k.a. multi-resolution models, of virtual objects. Basically, they consist of refining or simplifying the polygonal mesh, according to criteria such as the distance of the object to the camera, to save computation during rendering and/or to meet timing requirements. Garland et al. [6] also propose a method based on error quadrics to simplify the surfaces of 3D objects. Some of these geometrical methods are even specifically geared towards virtual humans: for instance, Fei et al. [7] propose LoD for virtual human body meshes and Seo et al. [8] for virtual human face meshes. The main consideration for virtual humans, over other objects, is to retain a sufficient number of polygons at the joints (or near control points
for the facial animation). Another approach to rendering numerous virtual humans in real-time on different devices is to convert 3D objects to 2D objects. This procedure is also referred to as transmoding within MPEG, and such methods have been proposed by Aubel et al. [9], with impostors, and by Tecchia et al. [10].
Concerning methods to adapt the animation itself, most work has been done in the field of physically based animation. Adaptive techniques, such as those proposed by Wu et al. [11] using a progressive mesh, and by Debunne et al. [12] combining it with an adaptive time-step, reduce the processing time required for animation. Capell et al. [13] also use a multi-resolution hierarchical volumetric subdivision to simulate dynamic deformations with finite elements. Hutchinson et al. [14] refine mass-spring systems if certain constraints on the system are violated. James et al. [15] propose to animate deformations with "dynamic responsive textures" in a system called DyRT, with the help of precomputations. Adaptation for non-physically based animation has received little attention until now. However, some relevant methods have been proposed: Granieri et al. [16] adapt the sampling frequency of virtual human motions to be played back and support a basic reduction of the degrees of freedom of virtual humans. For natural scenes, Di Giacomo et al. [17] present a framework to animate trees with the use of a level of detail (LoD) for the animation, and Guerraz et al. [18] animate prairies with three different animation models as three different possible levels.
A specific adaptation of an MPEG file is also possible, depending on the media type. While scalable video is an intensively studied topic, e.g. the work of Kim et al. [19], as is scalable audio, e.g. the method proposed by Aggarwal et al. [20], a progression towards graphics is only slowly being initiated, and only by a few, such as Van Raemdonck et al. [21] and Boier-Martin et al. [22]. It is probably even more the case with animation, which, to our knowledge, is mainly investigated in the work of Joslin et al. [23]. Though it is slightly outside the scope of this paper, there is an important point to be mentioned for DIA, i.e. the impact of the adaptation. Quality of Service (QoS) is a tool to estimate such an influence; for instance, Pham Ngoc et al. [24] propose a QoS for 3D graphics. Yet there is a need for a dedicated Adaptation QoS for 3D animation, to evaluate the impact of adaptation on the aesthetic quality and usability of animation and specific 3D objects, e.g. virtual characters.
Finally, some research is being done to specify adaptable architectures for 3D applications, though most of them are not geared towards multiple scenarios. For instance, Schneider et al. [25] integrate, in their Network Graphics Framework, various transmission methods for downloading 3D models in a client–server environment. A particular transmission method is
Fig. 1. Binary to bitstream description conversion.
Fig. 2. Bitstream description to binary.
selected by comparing the quality of the models and the performance of the network. Lamberti et al. [26] propose a platform with accelerated remote rendering on a cluster which transmits the images to a PDA; user navigation is then calculated on the PDA side and transmitted back to the server. Optimizations for particular devices are also being investigated: besides the many methods for standard desktop PC graphics boards, dedicated optimizations, such as those by Kolli et al. [27] for the ARM processor, are being explored for mobile and other new devices. Another way to adapt 3D applications to different devices is to directly adapt the underlying model of standard and well-known computer graphics techniques. One of the most relevant examples is the use of image-based rendering techniques by Chang et al. [28]. Another example is the work by Stam [29], demonstrating stable fluids on a PDA with consideration for fixed-point arithmetic, because of the lack of an FPU on these devices.
Most of the previously described methods perform their respective tasks well; however, we believe there is currently no global and generic architecture for adapting the content, representation and animation of virtual humans to network and target client capabilities, as well as to user preferences. This is the focus of our work.
3. Adaptation of content
3.1. Introduction
The adaptation of content based on MPEG-21 Digital Item Adaptation [30] appears quite complex in overview, but is actually quite simple and allows for extremely practical applications. The principles of adaptation are based on XML schemas called the Bitstream Description Language (BSDL) and its generic form, the generic Bitstream Description Language (gBSDL), introduced by Amielh et al. [31,32]. The idea is that the codec is described using these schemas. BSDL uses a codec-specific language, meaning that the adaptation engine needs to understand that language and use a specific XML style sheet (explained later) in order to adapt the bitstream. gBSDL uses a generic language, which means that any adaptation engine can transform the bitstream without a specific style sheet (i.e. multiple style sheets are not required for each adaptation).
The adaptation works on multiple levels and is very flexible. The Bitstream Description (BSD) is basically an XML document that contains a description of the bitstream at a high level (i.e. not on a bit-by-bit basis). This can contain either the bitstream itself, represented as hexadecimal strings, or URI links to the bitstream in another file (usually the original file). This BSD is generated using a Binary to Bitstream Description engine (more commonly, BintoBSD), as shown in Fig. 1.
Once the bitstream is in BSD format, it can be adapted using an XML style sheet, which basically contains information on how to adapt the XML document according to a set of rules passed by the adaptation engine. This adaptation essentially removes elements from the XML document according to the adaptation schema (described in the following sections). During this stage the header might be changed to take account of the elements that were removed from the bitstream; for example, the initial mask might indicate the presence of all elements, but this would be adapted to indicate which elements remain after adaptation.
The XML document is then parsed by a BSDtoBin converter, which takes the XML document and converts it back into a bitstream, as shown in Fig. 2. In general the header is converted back from its human-readable form directly into its binary representation, and the remaining elements of the payload are assembled back into the binary stream following the header (using either the URI data or the hexadecimal strings embedded in the BSD).
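As a minimal sketch of this pipeline, the following Python fragment adapts a toy bitstream description and converts it back to binary as a BSDtoBin step would. The element and attribute names, the header layout, and the one-byte mask are invented for illustration; they are not the actual gBSD schema or FBA syntax.

```python
import xml.etree.ElementTree as ET

# Toy bitstream description (BSD): a header mask plus hex-encoded payload
# elements, each tagged with the zone it animates. This layout is purely
# illustrative; the real gBSD schema is defined by MPEG-21 DIA.
BSD = """
<bitstream>
  <header><mask>63</mask></header>
  <element zone="1" data="0A0B"/>
  <element zone="2" data="0C0D"/>
  <element zone="5" data="0E0F"/>
</bitstream>
"""

def adapt(bsd_xml, keep_zones):
    """Drop payload elements outside keep_zones and rewrite the header
    mask so it flags only the zones that remain after adaptation."""
    root = ET.fromstring(bsd_xml)
    for elem in list(root.findall("element")):
        if int(elem.get("zone")) not in keep_zones:
            root.remove(elem)
    mask = 0
    for elem in root.findall("element"):
        mask |= 1 << (int(elem.get("zone")) - 1)
    root.find("header/mask").text = str(mask)
    return root

def bsd_to_bin(root):
    """BSDtoBin step: header back to binary, then the remaining
    hex-encoded payload elements concatenated behind it."""
    header = int(root.find("header/mask").text).to_bytes(1, "big")
    payload = b"".join(bytes.fromhex(e.get("data"))
                       for e in root.findall("element"))
    return header + payload

adapted = adapt(BSD, keep_zones={1, 2})
binary = bsd_to_bin(adapted)  # header mask 0x03, then zones 1 and 2
```

The point of the sketch is the division of labour: the adaptation only manipulates XML, and only the final conversion touches bits, which is what makes the engine codec-agnostic in the gBSDL case.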
To give an overview at the bitstream level, Fig. 3 provides a syntactic perspective on the entire process. Whilst the header might also be adapted, in general most of it will be retained, as it contains the information outlining the format of the rest of the packet. In the following sections we present adaptation schemas for both body and face animation.
3.2. Virtual human adaptation

3.2.1. Introduction
As the schema is used to define the bitstream layout at a high level, it must basically represent a decoder in
Fig. 3. Adaptation process at bitstream level.
Fig. 4. Illustration of clusters.
XML format. This does not mean that it will decode the bitstream, but the structure of the bitstream is important. This means that the adaptation methods must be in line with the lowest level defined in the schema. For example, if the adaptation needs to skip frames, and as MPEG codecs are byte-aligned, it is practical in this case to search for the "start code" of each frame; a frame can then be dropped without needing to understand most of the payload (with the possible exception of updating the frames-skipped section of the header, which could be avoided). However, as will be seen in the following sections, the schema is based on a much lower level in order to be more flexible.
3.2.2. Schema
In terms of the bitstream level, the schema is defined at a low level in order to take into account the adaptation described in Sections 3.3 and 3.4. This is necessary because the FBA codec is defined using groups (which basically define limbs and expressive regions), but these are not sufficiently marked to identify the payload. This does not pose a serious problem, as the schema generally remains on the server side (i.e. its size is not a problem); however, it does take longer to process and is obviously more prone to error.
3.2.3. Scalability of the FBA codec
As a final discussion, we describe the scalability of the FBA codec. On closer examination it can be seen that although the codec performs well in its general operation, coding face and body movements and even expressions, it has several shortcomings, mainly in the area of scalability. These rest mainly in the grouping of the parameters: this grouping is mainly suitable for body/face interest zones, detailed in Sections 3.4.2 and 3.4.3, but not for Level of Articulation (LoA), which would be just as useful. This means that the schema must be defined at quite a low level in order for it to adapt properly. In addition, the codec defines each frame with quite a high level of independence, which means that the coding scheme can be swapped on practically a frame-by-frame basis; in fact, each region of the face can be coded differently (some as expressions, and some as visemes). This means that the codec is quite flexible in its approach, but again
quite impractical for adaptation, as the schema must account for this diversity.
3.3. Shape representation
The adaptation operation should be simple enough to be implemented in a lightweight environment. There has been much research on multi-resolution shape representation [33]; so far, the proposed representations either consume too much space or require relatively complex decoding mechanisms. In this research we devise a simple representation that conforms to the current standards, clustering all the data so that a specific complexity can be obtained by simply choosing a set of clusters.
Starting from the complex mesh M_n, it is sequentially simplified to M_{n-1}, ..., M_1, M_0. By following this sequence, we can identify the set of vertices and faces that are removed from the mesh of level i to make the mesh of level i-1, denoted by C(i), where M_0 = C(0). There is also a set of vertices and faces that are newly generated by the simplification, denoted by N(i). By the properties of the simplification, N(i) is guaranteed to be a subset of the union of the C(j) for all j < i. Using this property, the cluster C(i) is sub-clustered into the sets C(i, j), which belong to N(j) with j > i, and C(i, i), which does not belong to any N(j). Thus, the level-i mesh is represented by the following equation, which requires only simple set selections over the clusters C(i, j):

M_i = sum_{k=0}^{i} ( C(k, k) + sum_{j=i+1}^{n} C(k, j) )     (1)
The clusters are ordered so as to require a small number of selections for each level; Fig. 4 shows an example of ordered clusters. The representation does not have any explicit redundancies, and an adaptation process is simply a selection of part of the stream. The streaming and decoding process can be stopped at almost any cluster, except the clusters of C(0), and it is guaranteed that at least one LoD can be composed from the current subset of clusters. The more of the stream is processed, the higher the level provided.
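The selection in Eq. (1) can be sketched directly as set unions. The cluster contents below are a made-up toy example, not real mesh data: C(k, k) holds original geometry never produced by simplification, while C(k, j) holds geometry generated at simplification level j.

```python
# Clusters C(k, j) for a toy 3-level simplification (n = 2). Per Eq. (1),
# the level-i mesh is the union of C(k, k) and C(k, j) for all k <= i
# and j >= i + 1.
n = 2
clusters = {
    (0, 0): {"v0"}, (0, 1): {"v1"}, (0, 2): {"v2"},
    (1, 1): {"v3"}, (1, 2): {"v4"},
    (2, 2): {"v5"},
}

def mesh_level(i):
    """Select the clusters composing the level-i mesh, as in Eq. (1)."""
    selected = set()
    for k in range(i + 1):
        selected |= clusters.get((k, k), set())          # C(k, k)
        for j in range(i + 1, n + 1):
            selected |= clusters.get((k, j), set())      # C(k, j), j > i
    return selected
```

Note how the coarsest mesh mesh_level(0) is exactly C(0), and the finest mesh keeps only the C(k, k) clusters, i.e. the original geometry with none of the proxy vertices introduced by simplification; this is what lets streaming stop after any cluster while still composing a valid LoD.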
In the mesh, there are other properties that have to be
taken into account, such as normal, color, and texture
Table 3
Definitions of the new LoA profiles

Name         | Definition
LoA High     | All joints are animatable.
LoA Medium   | Fingers, toes, and spine joints are composed into a single joint.
LoA Low      | The only animatable joints are those of the VeryLow profile, plus the neck joint, the elbow joints and the knee joints.
LoA VeryLow  | The only animatable joints are the shoulders, the root, and the hip joints.

Table 4
Elementary LoA zones

Zone | Brief description   | No. of joints
1    | Face only           | 2
2    | Right and left arms | 49
3    | Torso               | 25
4    | Pelvis              | 4
5    | Legs                | 4
6    | Feet                | 4
coordinates. Based on the mesh structure, we define a unique mapping from each vertex–face pair to a property value. By assigning the property to the lowest-level vertex and its corresponding clusters, property values can be represented within the clusters. In real applications, it is often not necessary to generate all levels of detail: differences of a few polygons do not usually produce significant differences in either performance or quality. Using the proposed representation, the modeling system is able to generate a set of levels in any combination, depending on the model and the application.
3.4. Body animation

3.4.1. Introduction
Body animation is commonly performed by skeleton-based animation, where the hierarchical structure of the bones and joints allows a rather straightforward notion of level of complexity, which basically corresponds to the depth in this hierarchy. The H-Anim group specifies some basic LoA [34], see Table 1, which we extend with modifications and new methods in the following sections.
3.4.2. Modified LoA
Following the H-Anim specification, we redefine the LoA, as shown in Table 2, linked to the original values but more suitable for our global adaptation architecture. For instance, the lowest level includes the shoulders and hips to enable minimalist motions even at this level, and the medium level applies a stronger simplification of the spine than the primary H-Anim definition.
The detailed definition of our new LoA is given in Table 3, listing the possible joints at each level. Such levels of complexity are used to select an adapted LoD for the animation of the virtual human. For
Table 1
H-Anim LoA profiles

Level | Brief description | No. of joints
0     | Minimum           | 1
1     | Low-end/real-time | 18
2     | Simplified spine  | 71
3     | Full hierarchy    | 89

Table 2
Newly defined LoA profiles

Level    | Brief description | No. of joints
Very low | Minimum           | 5
Low      | Low-end/real-time | 10
Medium   | Simplified spine  | 35
High     | Full hierarchy    | 89
instance, some virtual humans that are part of an audience in a theatre do not require a high LoA, so using LoA VeryLow is enough, while the main character on stage, who is moving, talking and the focus of the scene, requires a high LoA (thus LoA Medium or High). Though the gain is not so significant for a single virtual human, this feature provides efficient control for the animation of a whole crowd and/or for adapting animations according to terminal and network capabilities.
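A minimal sketch of how such LoA profiles could filter a body animation stream: each frame maps joints to rotations, and a profile keeps only its animatable joints. The joint names follow H-Anim conventions, but the exact per-profile sets and the frame format here are illustrative assumptions, not the FBA syntax.

```python
# Illustrative joint sets for the two lowest profiles of Table 3.
LOA_VERYLOW = {"l_shoulder", "r_shoulder", "sacroiliac", "l_hip", "r_hip"}
LOA_LOW = LOA_VERYLOW | {"skullbase", "l_elbow", "r_elbow",
                         "l_knee", "r_knee"}

def adapt_frame(frame, animatable):
    """Drop rotations of joints that are not animatable at this LoA."""
    return {joint: rot for joint, rot in frame.items()
            if joint in animatable}

frame = {"l_shoulder": (0.1, 0.0, 0.0), "l_elbow": (0.4, 0.0, 0.0),
         "l_wrist": (0.2, 0.1, 0.0)}
reduced = adapt_frame(frame, LOA_VERYLOW)  # only l_shoulder survives
```

Applied per virtual human, the same filter lets a crowd use LoA VeryLow while the main character keeps LoA Medium or High.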
3.4.3. Body regions
The main purpose of defining interest zones is to allow the user to select the parts they are most interested in, as shown in Table 4. For instance, if the virtual human is presenting a complete script on how to cook, but the user is only interested in listening to the presentation (and not in the motions), then the animation can be adapted to either the upper body, or even just the face and the shoulders. Another example would be a tourist guide pointing at several locations, such as theatres or parks, where the user may only wish to focus on the arms and hands of the presenter.
The lower-level interest zones, consisting of the six zones shown in Fig. 5, can be composed into higher-level ones by two different methods:
Table 5
Predefined zones and their correspondence with elementary zones

Name                  | Definition
LoAZone All           | 1+2+3+4+5+6
LoAZone Face          | 1
LoAZone FaceShoulders | 1+2+3
LoAZone UpperBody     | 1+2+3+4
LoAZone LowerBody     | 4+5+6

Table 6
Mask to select elementary zones

Predefined | Mask | No. of joints
All        | 0    | 94
Face       | 1    | 1
Shoulders  | 7    | 77
UpperBody  | 15   | 82
LowerBody  | 45   | 17
Fig. 5. Body interest zones.
Table 7
Defined facial regions/zones

Group/Zone | Contains
0          | Jaw (jaw, chin, lips and tongue)
1          | Eyeballs (eyeballs and eyelids)
2          | Eyebrows
3          | Cheeks
4          | Head rotation
5          | Nose
6          | Ears
* Predefined zones: a set of higher-level zones predefined as combinations of elementary zones, as described in Table 5.
* Mask: a value representing which elementary zones are active. This mask is a binary accumulator which can take different values; a few of them, those related to the predefined zones, are presented in Table 6.
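The mask accumulator can be sketched as a plain bitmask over the six elementary zones of Table 4. Here bit k-1 set means zone k is active; the actual encoding behind Table 6 may differ, so this convention is an illustrative assumption only.

```python
# The six elementary zones of Table 4, keyed by zone number.
ZONES = {1: "face", 2: "arms", 3: "torso", 4: "pelvis", 5: "legs", 6: "feet"}

def zones_to_mask(zones):
    """Accumulate elementary zone numbers into a single bitmask."""
    mask = 0
    for z in zones:
        mask |= 1 << (z - 1)
    return mask

def mask_to_zones(mask):
    """Recover the set of active elementary zones from a mask."""
    return {z for z in ZONES if mask & (1 << (z - 1))}

upper_body = zones_to_mask({1, 2, 3, 4})  # cf. LoAZone UpperBody
```

Under this convention, FaceShoulders (zones 1+2+3) yields mask 7 and UpperBody (zones 1+2+3+4) yields mask 15, matching two entries of Table 6.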
3.5. Face animation

3.5.1. Introduction
The body LoA is based on a skeleton. For the face we do not have this kind of base information, so instead we define different areas on the face, and different levels of detail for each of these areas. The different parts of the face, as shown in Fig. 6, are specified first; we then explain in detail the different LoA values for the face, before introducing the possible applications of these definitions according to the platform or system used.
3.5.2. Face regions
The face is segmented into different interest zones. This segmentation makes it possible to group the Facial Animation Parameters (FAP) according to their zone of influence with regard to deformation. For each zone we also define different levels of complexity, as shown in Fig. 6. The zone segmentation is based on the MPEG-4 FAP grouping, although we have merged the tongue, inner lips and outer lips into a single group, because their displacements are strongly linked. The zones are defined as shown in Table 7.
Note that the LoA does not influence the head rotation value. Depending on the user's interest, we can further increase or reduce the complexity in specific zones. For example, an application based on speech requires precise animation around the area of the lips and less in other parts of the face. In this case, we can reduce the complexity to a very low level for the zones containing the eyeballs, eyebrows, nose and ears, use a medium level for the cheek zones, and a very high level in the jaw zone.
3.5.3. Levels of complexity
For the face, we define four levels, as shown in Table 8, with a direct hierarchy as defined for the body, but based on the influence of the FAP according to the desired complexity or user preferences. Two different techniques are used to reduce the level of complexity. The first consists of grouping FAP values together (i.e. all upper
ARTICLE IN PRESS
1
3
5
7
9
11
13
15
17
19
21
23
25
27
29
31
33
35
37
39
41
43
45
47
49
51
53
55
57
59
61
63
65
67
69
71
73
75
77
79
81
83
85
87
89
91
93
95
97
99
101
103
105
107
109
111
Table 8
Level of articulation for the face

Name        | Definition
LoA High    | All FAP values
LoA Medium  | Equivalent FAP values grouped together two by two
LoA Low     | Less important FAP values suppressed first, then FAP or LoA Medium FAP results grouped together
LoA VeryLow | Unimportant facial animation parameters suppressed and full symmetry of the face parameters added

Fig. 6. Face zones of interest.
Table 9
No. of FAP values shown against LoA profile

              | High | Medium | Low | V. Low
High level    | 2    | 2      | 2   | 2
Jaw           | 31   | 19     | 12  | 6
Eyeballs      | 12   | 8      | 4   | 1
Eyebrows      | 8    | 3      | 2   | 1
Cheeks        | 4    | 4      | 1   | 1
Head rotation | 3    | 3      | 3   | 1
Nose          | 4    | 3      | 0   | 0
Ears          | 4    | 2      | 2   | 0
Total         | 68   | 44     | 26  | 14
Ratio         |      | 65%    | 38% | 21%
lip values can be grouped into one value). To select which FAP values should be grouped, we defined the following constraints:

1. All grouped FAP values must be in the same area.
2. All grouped FAP values must be under the influence of the same FAP units.
3. When two FAP values or groups are grouped by symmetry, the controlling FAP values are those of the right part of the face.

The second technique, suppression, is more destructive to the overall set of values, meaning that quality is reduced more rapidly. Beyond a certain distance from the viewing camera, some FAP values become insignificant (e.g. at the low level we remove the FAP values pertaining to the dilation of the pupils, as this deformation becomes invisible after a short distance).
* LoA High: uses all FAP values, i.e. 2 high-level FAP values (never reduced because they already adapt to the level of complexity) and 66 low-level FAP values.
* LoA Medium: certain FAP values are grouped together in order to maintain a high LoD. At this level, after regrouping, we obtain 44 FAP values rather than the maximum of 68, a reduction of over 35%.
* LoA Low: many of the FAP values that are unimportant for face animation are removed, and FAP values continue to be grouped together. After regrouping/deleting, 26 FAP values remain against the maximum of 68, a reduction of 62%.
* LoA VeryLow: most FAP values are removed, concentrating on the base values necessary for minimal animation. This level represents the minimum set of parameters required for animating a face: we mostly link symmetric LoA Low parameters together and remove all unimportant values. At this level, the number of FAP values is reduced to 14, an overall reduction of 79%.
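The symmetry-based grouping used at the lower profiles can be sketched as follows: the right-side FAP value drives its left-side counterpart, so only one value per symmetric pair is transmitted. The FAP names below are illustrative stand-ins for the MPEG-4 parameter numbers, and the pairing table is an assumption for the sketch.

```python
# Hypothetical symmetry table: left-side FAP -> controlling right-side FAP.
SYMMETRY = {
    "raise_l_i_eyebrow": "raise_r_i_eyebrow",
    "stretch_l_cornerlip": "stretch_r_cornerlip",
}

def reduce_frame(faps):
    """Server side: keep only controlling FAPs (right side of each pair)."""
    return {name: v for name, v in faps.items() if name not in SYMMETRY}

def expand_frame(reduced):
    """Client side: rebuild left-side FAPs from their right-side drivers."""
    full = dict(reduced)
    for left, right in SYMMETRY.items():
        if right in full:
            full[left] = full[right]
    return full

frame = {"raise_r_i_eyebrow": 30, "raise_l_i_eyebrow": 28,
         "stretch_r_cornerlip": 12}
sent = reduce_frame(frame)      # left-side values are not transmitted
restored = expand_frame(sent)   # client rebuilds them by symmetry
```

Note the reconstruction is lossy by design: the left eyebrow takes the right eyebrow's value (30), not its original one (28), which is exactly the quality trade-off the LoA profiles accept.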
3.5.4. Levels of articulation: overview
Fig. 7 shows a global view of all FAP values and the links between them for each LoA profile. Table 9 shows, for each LoA, how many FAP values drive each part of the face, and Fig. 8 indicates the overall reduction of FAP values.
3.5.5. Frame rate reduction
Depending on the capability of the target platform to reproduce the exact and complex face animation, we can also reduce the frame rate in order to reduce the network and animation load; this is done per LoA, as shown in Table 10. In the case of a different LoA for each zone of interest, we assume the maximum frame rate among them. For each frame we assume a bitstream of less than 1 Kbit.
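A sketch of this reduction: resample a 25 fps FAP stream down to the target rate of the active LoA profile of Table 10. The nearest-frame index mapping used here is one plausible policy for the sketch, not one mandated by the codec.

```python
# Target frame rates per LoA profile (Table 10).
RATES = {"High": 25, "Medium": 20, "Low": 18, "VeryLow": 15}

def resample(frames, src_fps, dst_fps):
    """Pick, for each target time step, the closest preceding source frame."""
    count = max(1, round(len(frames) * dst_fps / src_fps))
    return [frames[min(len(frames) - 1, int(i * src_fps / dst_fps))]
            for i in range(count)]

stream = list(range(25))                  # one second of animation at 25 fps
low = resample(stream, 25, RATES["Low"])  # 18 frames remain
```

With each frame assumed to be under 1 Kbit, dropping from 25 to 15 fps alone reduces the animation bandwidth by 40% before any FAP-level reduction is applied.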
3.5.6. Advantages of adaptation
With the simplification explained above, we can transmit the minimum facial information required by the context. After simplification, the FAP stream remains compatible with the MPEG-4 specification, but with a reduced size; on the client, there are two possibilities.
The first consists of decoding the stream and reconstructing all FAP values according to the LoA
Fig. 7. FAP grouping.
Fig. 8. Graph showing reduction of FAP values.
Table 10
Frame rate reduction

            | Frame rate (fps)
LoA High    | 25
LoA Medium  | 20
LoA Low     | 18
LoA VeryLow | 15
Table 11
Size of data (Mbytes)

           | # of polygons (K) | Original model (M) | Proposed method (M) | Progressive mesh (M) | Discrete mesh (M)
Body model | 71                | 12.7/4.5           | 17.5/6.5            | ~13                  | 51.6/12.0
Face model | 7                 | 0.8/0.3            | 1.0/0.5             | ~0.9                 | 3.9/1.6
rules. In this case, we have simply reduced the bitstream. The second technique also simplifies the deformation itself. Most MPEG-4 facial animation engines compute a deformation for each FAP value and combine them, and on mobile platforms the fewer computations made during animation, the better. Rather than simply applying the FAP stream to the deformation engine, in this case we do not compute 66 deformation areas and combine them, but directly design the deformation area for each level of complexity. This work of regrouping could be done automatically during the pre-processing step or be transmitted with the model. In the case of the very low LoD, only 11 FAP values would be included in the linear deformation to be combined on the face, rather than 58.
4. Results
Table 11 shows the results of the multiresolution model representation. The progressive mesh (PM) approach is known as a near-optimal method for its compact size. The discrete mesh is a set of discrete levels, which is still quite common in real-world applications. The numbers are the sizes of the VRML and BIFS files, respectively. Since the PM cannot be encoded in the BIFS format, only the approximate size of the text file is noted. The highest levels of detail have 71K and 7K polygons for the body and face models, respectively, whilst the lowest have 1K and 552 polygons. Each model is constructed with 5 different levels. The proposed method lies in-between these approaches, and is flexible and simple enough to allow adaptation with a relatively small file size. More importantly, it can be transmitted via standard MPEG streams. It also uses a simple adaptation mechanism, very similar to the simplest discrete level selection.
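That simplest discrete level selection can be sketched as follows. Only the extreme polygon counts (71K and 1K for the body model) and the number of levels come from the text; the three intermediate counts and the polygon-budget criterion are assumptions:

```python
# Choosing a discrete level under a terminal polygon budget.
# Only the 71K/1K extremes and the 5-level structure are from the text;
# the intermediate counts are hypothetical.
BODY_LEVELS = [71000, 30000, 12000, 4000, 1000]   # polygons, finest to coarsest

def select_level(levels, polygon_budget):
    """Return the finest level that fits within the terminal's polygon budget."""
    for count in levels:              # levels sorted finest -> coarsest
        if count <= polygon_budget:
            return count
    return levels[-1]                 # fall back to the coarsest level

print(select_level(BODY_LEVELS, 15000))   # -> 12000
print(select_level(BODY_LEVELS, 500))     # -> 1000
```

The adaptation decision itself is a single comparison per level, which is why the mechanism stays cheap even on weak terminals.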
To verify the effectiveness of the adaptation in terms of network performance, we performed some specific comparisons using several standard animation sequences (Table 12).
We ran several experiments with different compressed FAP files (using the same frame rate and compression methods) to estimate the influence of LoA on the complete animation, comparing the overall sizes (Fig. 9).
Table 12
Overall data size comparison (bytes)

            Wow     Baf     Face23   Macro   Lips
Size
High        21255   32942   137355   18792   44983
Medium      16493   24654   104273   14445   36532
Low         13416   20377   78388    11681   30948
Very low    12086   17033   70498    8788    29322
Ratio (%)
High        100     100     100      100     100
Medium      78      75      76       77      81
Low         63      62      57       62      69
Very low    57      52      51       47      65
Fig. 9. Facial results with no symmetry ((a) LoA High, (b) LoA Medium, (c) LoA Low, (d) LoA Very Low).
Table 13
Overall computation time comparison

            Wow     Baf     Face23   Macro   Lips
Time
High        4.17    6.10    25.04    4.08    10.90
Medium      3.56    5.23    21.23    3.54    9.81
Low         3.09    4.56    17.58    3.10    8.81
Very low    2.87    4.11    15.98    2.60    8.40
Ratio (%)
High        100     100     100      100     100
Medium      85      86      85       87      90
Low         74      75      70       76      81
Very low    69      67      64       64      77
Overall, the profiles yield a mean size of 77% for the medium profile, 63% for the low profile, and 55% for the very low profile. All these files, except the last one, contain animation for all parts of the face, and for each of them we obtain the same reduction factor. For the last file, which only animates the lips, we observe a smaller reduction factor due to the absence of FAP suppression (Table 13).
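The percentage rows of Tables 12 and 13 follow directly from the raw values; for instance, for the byte counts of Table 12:

```python
# Reduction ratios recomputed from the raw byte counts in Table 12
# (two of the five sequences shown, for brevity).
SIZES = {
    "Wow": {"high": 21255, "medium": 16493, "low": 13416, "verylow": 12086},
    "Baf": {"high": 32942, "medium": 24654, "low": 20377, "verylow": 17033},
}

def ratio_percent(sizes, level):
    """Size at this level as a rounded percentage of the high-profile size."""
    return round(100.0 * sizes[level] / sizes["high"])

print(ratio_percent(SIZES["Wow"], "medium"))    # -> 78
print(ratio_percent(SIZES["Baf"], "verylow"))   # -> 52
```

These match the Ratio rows of Table 12 (78% and 52%), confirming the rounding used in the table.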
5. Conclusion and future work
In conclusion, we have introduced a completely new area of adaptation within the MPEG framework. We have shown that, whilst for most media types the adaptation process is already quite extensive, virtual humans offer an even greater range of adaptations and variety of context-based situations.
We are currently concentrating on applying this work to an entire system, including video, audio, 2D and 3D graphics, in the European Project ISIS. In addition, we are developing a new codec that is much more scalable and therefore better suited to adaptation; this will also include a better integration of body and face animation in the context of virtual humans (which are currently quite separate). As discussed in Section 3.2.3, the current codec does not provide much flexibility in terms of adaptation; hence we are working on a similar codec that will provide the same kind of framework, but with a better structure for adaptation.
Acknowledgements
This research has been funded through the European
Project ISIS (IST-2001-34545) by the Swiss Federal
Office for Education and Science (OFES). The authors
would like to thank T. Molet for his consultation.
References
[1] Heuer J, Casas J, Kaup A. Adaptive multimedia messaging based on MPEG-7: the M3-Box. Proceedings of the Second International Symposium on Mobile Multimedia Systems & Applications, 2000. p. 6–13.
[2] MPEG-21, ISO/IEC 21000-7 Committee Draft, ISO/IEC/JTC1/SC29/WG11/N5534, March 2003.
[3] Preda M, Preteux F. Critic review on MPEG-4 face and body animation. Proceedings of the IEEE International Conference on Image Processing (ICIP), 2002.
[4] Preda M, Preteux F. Advanced animation framework for virtual character within the MPEG-4 standard. Proceedings of the IEEE International Conference on Image Processing, September 2002.
[5] Hoppe H. Progressive meshes. SIGGRAPH, 1996. p. 99–108.
[6] Garland M, Heckbert P. Surface simplification using quadric error metrics. SIGGRAPH, 1997.
[7] Fei G, Wu E. A real-time generation algorithm of progressive mesh with multiple properties. Proceedings of the Symposium on Virtual Reality Software and Technology, 1999.
[8] Seo H, Magnenat-Thalmann N. LoD management on animating face models. Proceedings of IEEE Virtual Reality, 2000.
[9] Aubel A, Boulic R, Thalmann D. Real-time display of virtual humans: level of details and impostors. IEEE Transactions on Circuits and Systems for Video Technology, 2000.
[10] Tecchia F, Loscos C, Chrysanthou Y. Image-based crowd rendering. IEEE Computer Graphics and Applications 2002:36–43.
[11] Wu X, Downes MS, Goktekin T, Tendick F. Adaptive nonlinear finite elements for deformable body simulation using dynamic progressive meshes. Proceedings of Eurographics EG'01, 2001. p. 349–358.
[12] Debunne G, Desbrun M, Cani MP, Barr A. Dynamic real-time deformations using space & time adaptive sampling. SIGGRAPH, 2001. p. 31–36.
[13] Capell S, Green S, Curless B, Duchamp T, Popovic Z. A multiresolution framework for dynamic deformations. Proceedings of the ACM SIGGRAPH Symposium on Computer Animation, 2002.
[14] Hutchinson D, Preston M, Hewitt T. Adaptive refinement for mass/spring simulations. Proceedings of the Eurographics Workshop on Computer Animation and Simulation, 1996.
[15] James DL, Pai DK. DyRT: dynamic response textures for real time deformation simulation with graphics hardware. Proceedings of SIGGRAPH, 2002.
[16] Granieri JP, Crabtree J, Badler N. Production and playback of human figure motion for visual simulation. ACM Transactions on Modeling and Computer Simulation, 1995. p. 222–241.
[17] Di Giacomo T, Capo S, Faure F. An interactive forest. Proceedings of the Eurographics Workshop on Computer Animation and Simulation, 2001.
[18] Guerraz S, Perbet F, Raulo D, Faure F, Cani MP. A procedural approach to animate interactive natural sceneries. Proceedings of Computer Animation and Social Agents, 2003.
[19] Kim J, Wang Y, Chang S. Content-adaptive utility based video adaptation. Proceedings of the IEEE International Conference on Multimedia & Expo, 2003.
[20] Aggarwal A, Rose K, Regunathan S. Compander domain approach to scalable AAC. Proceedings of the 110th Audio Engineering Society Convention, 2001.
[21] Van Raemdonck W, Lafruit G, Steffens E, Otero-Perez C, Bril R. Scalable 3D graphics processing in consumer terminals. IEEE International Conference on Multimedia and Expo, 2002.
[22] Boier-Martin I. Adaptive graphics. IEEE Computer Graphics and Applications 2003;6–10.
[23] Joslin C, Magnenat-Thalmann N. MPEG-4 animation clustering for networked virtual environments. IEEE International Conference on Multimedia and Expo, 2002.
[24] Pham Ngoc N, Van Raemdonck W, Lafruit G, Deconinck G, Lauwereins R. A QoS framework for interactive 3D applications. Proceedings of the Winter School of Computer Graphics, 2002.
[25] Schneider B, Martin I. An adaptive framework for 3D graphics in networked and mobile environments. Proceedings of Interactive Applications on Mobile Computing, 1998.
[26] Lamberti F, Zunino C, Sanna A, Fiume A, Maniezzo M. An accelerated remote graphics architecture for PDAs. Proceedings of the Web3D 2003 Symposium, 2003.
[27] Kolli G, Junkins S, Barad H. 3D graphics optimizations for ARM architecture. Proceedings of the Game Developers Conference, 2002.
[28] Chang C, Ger S. Enhancing 3D graphics on mobile devices by image-based rendering. Proceedings of the Third IEEE Pacific-Rim Conference on Multimedia, 2002.
[29] Stam J. Stable fluids. SIGGRAPH, 1999. p. 121–128.
[30] MPEG-21 DIA, ISO/IEC/JTC1/SC29/WG11/N5612, March 2003.
[31] Amielh M, Devillers S. Multimedia content adaptation with XML. Proceedings of the International Conference on MultiMedia Modeling, 2001.
[32] Amielh M, Devillers S. Bitstream syntax description language: application of XML-Schema to multimedia content adaptation. Proceedings of the International WWW Conference, 2002.
[33] Heckbert P, et al. Multiresolution surface modeling. ACM SIGGRAPH Course Notes No. 25, 1997.
[34] H-Anim version LoA 1.1 Specification, http://h-anim.org/Specifications/H-Anim1.1/appendices.html, 2000.
Thomas Di Giacomo completed a Master's degree on multiresolution methods for animation with the iMAGIS lab and the Atari (ex-Infogrames) R&D department. He is now a research assistant and a Ph.D. candidate at MIRALab, University of Geneva. His work focuses on level of detail for animation and physically based animation.
Chris Joslin obtained his Master's degree from the University of Bath, UK and his Ph.D. in Computer Science from the University of Geneva, Switzerland. His research has been focused on networked virtual environment systems and real-time 3D spatial audio, and he is currently a Senior Research Assistant developing scalable 3D graphics codecs for animation and representation, specifically for MPEG-4 and MPEG-21 adaptation.
Stephane Garchery is a computer scientist who studied at the Universities of Grenoble and Lyon, in France. He is working at the University of Geneva as a senior research assistant at MIRALab, participating in research on facial animation for real-time applications. One of his main tasks is the development of MPEG-4 facial animation engines, applications, and tools for the automatic construction of facial data. He has developed different kinds of facial animation engines based on MPEG-4 Facial Animation Parameters for different platforms (standalone, web applet, and mobile device), as well as tools for quick and interactive design.
HyungSeok Kim is a post-doctoral assistant at MIRALab, University of Geneva. He received his Ph.D. in Computer Science in February 2003 at VRLab, KAIST: "Multiresolution model generation of texture-geometry for real-time rendering". His main research field is real-time rendering for virtual environments, more specifically multiresolution modeling for geometry and texture. He is also interested in 3D interaction techniques and virtual reality systems.
Nadia Magnenat-Thalmann has pioneered research into virtual
humans over the last 20 years. She obtained several Bachelor’s
and Master’s degrees in various disciplines and a Ph.D. in
Quantum Physics from the University of Geneva. From 1977 to
1989, she was a Professor at the University of Montreal in
Canada. In 1989, she founded MIRALab at the University of
Geneva.