Gesture and Speech Multimodal Conversational Interaction

Francis Quek, David McNeill, Robert Bryll, Susan Duncan, Xin-Feng Ma, Cemil Kirbas, Karl E. McCullough, and Rashid Ansari

Vision Interfaces and Systems Laboratory, Wright State University, Dayton OH
Department of Psychology, University of Chicago
Signal and Image Research Laboratory, The University of Illinois at Chicago

Correspondence: [email protected]

February 1, 2001

Robert Bryll; Department of Computer Science and Engineering, 3640 Colonel Glenn Hwy., Dayton, OH 45435-0001; phone: (937) 775-3930, fax: (937) 775-5133.
Submitted to ACM Transactions on Computer-Human Interaction. Also as VISLab Report: VISLab-01-01.

Categories and subject descriptors; general terms: User interfaces (H.5.2): theory and meth-

ods, interaction styles (conversational interaction).

Keywords and phrases: Multimodal interaction, conversational interaction, gesture, speech, dis-

course, human interaction models, gesture analysis.


Abstract

Gesture and speech combine to form a rich basis for human conversational interaction. To exploit

these modalities in HCI, we need to understand the interplay between them and the way in which

they support communication. We propose a framework for the gesture research done to date, and

present our work on the cross-modal cues for discourse segmentation in free-form gesticulation

accompanying speech in natural conversation as a new paradigm for such multimodal interaction.

The basis for this integration is the psycholinguistic concept of the co-equal generation of gesture

and speech from the same semantic intent. We present a detailed case study of a gesture and speech

elicitation experiment in which a subject describes her living space to an interlocutor. We perform

two independent sets of analyses on the video and audio data: video and audio analysis to extract

segmentation cues, and expert transcription of the speech and gesture data by micro-analyzing the

video tape using a frame-accurate video player to correlate the speech with the gestural entities.

We compare the results of both analyses to identify the cues accessible in the gestural and audio

data that correlate well with the expert psycholinguistic analysis. We show that ‘handedness’ and

the kind of symmetry in two-handed gestures provide effective supersegmental discourse cues.

1 Introduction

Natural human conversation is a rich interaction among multiple verbal and non-verbal channels.

Gestures are an important part of human conversational interaction [1] and they have been studied

extensively in recent years in efforts to build computer-human interfaces that go beyond traditional

input devices such as keyboard and mouse manipulations. In a broader sense “gesture” includes

not only hand movements, but also facial expressions and gaze shifts. For human-computer inter-

action to approach the level of transparency of inter-human discourse, we need to understand the

phenomenology of conversational interaction and the kinds of extractable features that can aid in

its comprehension. In fact, we believe that we need new paradigms for looking at this nexus of

multimodal interaction before we can properly exploit the human facility with this rich interplay

of verbal and non-verbal behavior in an instrumental fashion.

In this paper, we present our work on hand-gesture and speech interaction in human discourse

segmentation. In the course of our presentation we hope to motivate a view of this multimodal

interaction that gives us a glimpse into the construction of discourse itself.

1.1 Previous Work

There is a growing body of literature on the instrumental comprehension of human gestures.

Predominantly, research efforts are clustered around two kinds of gestures: manipulative and

semaphoric. We define manipulative gestures as those whose intended purpose is to control some

entity by applying a tight relationship between the actual movements of the gesturing hand/arm

and the entity being manipulated. Semaphores are systems of signalling using flags, lights or

arms [2]. By extension, we define semaphoric gestures to be any gesturing system that employs a

stylized dictionary of static or dynamic hand or arm gestures.

1.1.1 Manipulative Gestures

Research employing the manipulative gesture paradigm may be thought of as following the seminal

“Put-That-There” work by Richard Bolt [3, 4]. Since then, there have been a plethora of systems

that implement finger tracking/pointing [5, 6, 7, 8, 9, 10], a variety of ‘finger flying’ style naviga-

tion in virtual spaces or direct-manipulation interfaces [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,

23], control of appliances [24], computer games [25, 26, 27], and robot control [28, 29, 30, 31].


Other manipulative applications include interaction with wind-tunnel simulations [32, 33], voice

synthesizers [34, 35, 36] and optical flow-based estimation of one of 6 gross full-body gestures

(jumping, waving, clapping, drumming, flapping, marching) for controlling a musical instrument

[37]. Some of these approaches cited (e.g. [28, 38, 39, 34, 35]) use special gloves or trackers, while

others employ only camera-based visual tracking. Such manipulative gesture systems typically use

the shape of the hand to determine the mode of action (e.g. to navigate, pick something up, point

etc.), while the motion indicates the path or extent of the action.

Gestures used in communication/conversation differ from manipulative gestures in several sig-

nificant ways [40, 41]. First, because the intent of the latter is for manipulation, there is no guar-

antee that the salient features of the hands are visible. Second, the dynamics of hand movement in

manipulative gestures differ significantly from conversational gestures. Third, manipulative ges-

tures may typically be aided by visual, tactile or force feedback from the object (virtual or real)

being manipulated, while conversational gestures are typically performed without such constraints.

Gesture and manipulation are clearly different entities sharing between them possibly only the fea-

ture that both may utilize the same bodily parts.

There is a class of gestures that sits between pure manipulation and natural gesticulation (see

section 1.3). This class of gestures, broadly termed deictics or pointing gestures, has some of

the flavor of manipulation in its capacity for immediate spatial reference. Deictics also facilitate

the ‘concretization’ of abstract or distant entities in discourse, and so are the subject of much

study in psychology and linguistics. Following [3, 4], work done in the area of integrating direct

manipulation with natural language and speech has shown some promise in such combination.

Earlier work by Cohen et al [42, 43] involved the combination of the use of a pointing device

and typed natural language to resolve anaphoric references. By constraining the space of possible

referents by menu enumeration, the deictic component of direct manipulation was used to augment

the natural language interpretation. [44, 45] describe similar work employing mouse pointing for

deixis and spoken and typed speech in a system for querying geographical databases. Oviatt et al

[46, 47, 48] extended this research direction by combining speech and natural language processing

and pen-based gestures. We have argued that pen-based gestures retain some of the temporal

coherence with speech seen in natural gesticulation [49], and this co-temporality was employed

in [46, 47, 48] to support mutual disambiguation of the multimodal channels and the issuing of

spatial commands to a map interface. Koons et al [50, 51] describe a system for integrating deictic


gestures, speech and eye gaze to manipulate spatial objects on a map. Employing a tracked glove,

they extracted the gross motions of the hand to determine such elements as ‘attack’ (motion toward

the gesture space over the map), ‘sweep’ (side-to-side motion), and ‘end reference space’ (the

terminal position of the hand motion). They relate these spatial gestural references to the gaze

direction on the display, and to speech to perform a series of ‘pick-and-place’ operations. This

body of research differs from that reported in this paper in that we address more free-flowing

gestures accompanying speech, and are not constrained to the two-dimensional reference to screen

or pen-tablet artifacts of pen or mouse gestures.

1.1.2 Semaphoric Gestures

Semaphoric gestures are typified by the application of some recognition-based approach to identify

some gesture g_i ∈ G, where G is a set of predefined gestures. Semaphoric approaches may be

termed ‘communicative’ in that gestures serve as a universe of symbols to be communicated

to the machine. A pragmatic distinction between semaphoric gestures and manipulative ones is

that the former does not require the feedback control (e.g. hand-eye, force-feedback, or haptic)

necessitated for manipulation. Semaphoric gestures may be further categorized as being static or

dynamic. Static semaphore gesture systems interpret the pose of a static hand to communicate the

intended symbol. Examples of such systems include color-based recognition of the stretched-open

palm where flexing specific fingers indicates menu selection [52], Zernike moments-based hand

pose estimation [53], the application of orientation histograms (histograms of directional edges) for

hand shape recognition [54], graph-labeling approaches where labeled edge segments are matched

against a predefined hand graph [55] (they show recognition of American Sign Language, ASL-

like, finger spelling poses), a ‘flexible-modeling’ approach in which a feature average of a set of hand poses

is computed and each individual hand pose is recognized as a deviation from this mean (principal

component analysis, PCA, of the feature covariance matrix is used to determine the main modes of

deviation from the ‘average hand pose’) [56], the application of ‘global’ features of the extracted

hand (using color processing) such as moments, aspect ratio, etc. to determine the shape of the hand

out of 6 predefined hand shapes [57], and model-based recognition based on 3D model prediction

[58].

In dynamic semaphore gesture systems, some or all of the symbols represented in the semaphore

library involve predefined motion of the hands or arms. Such systems typically require that


gestures be performed from a predefined viewpoint to determine which g_i ∈ G is being per-

formed. Approaches include finite state machines for recognition of a set of editing gestures for

an ‘augmented whiteboard’ [59], trajectory-based recognition of gestures for ‘spatial structuring’

[60, 61, 40, 41, 62, 63], recognition of gestures as a sequence of state measurements [64], recog-

nition of oscillatory gestures for robot control [65], and ‘space-time’ gestures that treat time as a

physical third dimension [66, 67].

One of the most common approaches for the recognition of dynamic semaphoric gestures is

based on the Hidden Markov Model (HMM). First applied by Yamato, Ohya, and Ishii [68] to

the recognition of tennis strokes, it has been applied in a myriad of semaphoric gesture recogni-

tion systems. The power of the HMM lies in its statistical rigor and ability to learn semaphore

vocabularies from examples. An HMM may be applied in any situation in which one has a stream

of input observations formulated as a sequence of feature vectors and a finite set of known clas-

sifications for the observed sequences. HMM models comprise state sequences. The transitions

between states are probabilistically determined by the observation sequence. HMMs are ‘hidden’

in that one does not know which state the system is in at any time. Recognition is achieved by

determining the likelihood that any particular HMM model may account for the sequence of in-

put observations. Typically, HMM models for different gestures within a semaphoric library are

rank-ordered by likelihood, and the one with the greatest likelihood is selected. An excellent in-

troduction to HMMs may be found in [69]. While different configurations are used in HMMs,

semaphoric gesture recognition systems typically take a left-to-right state transition topology. Hof-

mann, Heyer, and Hommel, in describing their HMM-based system, provide a detailed discussion

of how HMMs may be set up for semaphoric gesture recognition [70]. We shall look at their work

as prototypical of such systems. Their work involves the recognition of dynamic gestures using

discrete Hidden Markov Models with a continuous gesture stream as input data. The data is

obtained using the TUB (Tech. U. of Berlin) “Sensor Glove” that measures hand position and

orientation, and joint finger angles using accelerometers and magnetic sensors that measure the

absolute angles of finger flexure. Because only a right-handed TUB glove existed, all their work is

done with gestures of one hand. In their work, they ignore static pose and work only with gesture

dynamics. They show results from two stylized dictionaries, a special emblem dictionary compiled

at Berlin, and gestures from the German Sign Language. Owing to the limitation of their glove,

some gestures had to be entered slower than the natural signers would prefer. They take a two-stage


recognition approach: a low level stage, dividing the input data space into regions and assigning

input vectors to them (what is known as the vector codebook approach in HMM’s). In the second

stage, the vector indices of the first stage are used to classify gestures using a discrete HMM. Their

system does not handle ‘co-articulations’. The dataset is divided into a set of G different gestures

and each gesture is represented by P (typically 10) prototypes. The prototypes are converted into

a time-feature vector series both for the orientation-acceleration, and the joint angle matrix. They

convert all data into a time-derivative series and they take a running average to smooth it. These

data are thresholded to determine pauses. The remaining segments are treated as the gesture set.

They further extract the velocity extrema from both vectors (velocity-acceleration, and joint angle)

that are used as feature vectors for the low-level classification. Because the feature vectors are too

large to be handled by a discrete HMM, they do a preclustering where they group the feature vec-

tors of the prototypes of each gesture using an iterative clustering algorithm based on the Euclidean

distance measure to obtain prototype centers for each gesture type. The output of a prematch of

any input vector to this set of prototypes will be the input to the next stage high level HMM. They

experimented with two different discrete HMM structures, a fully-connected ergodic HMM with

different numbers of nodes, and a left-to-right HMM with various numbers of nodes to see if there is

any optimal HMM structure or size. They collected data from 2 subjects, each performing 500 gestures

10 times, yielding 10,000 prototypes. However, because their system takes a whole month to

train 500 gestures, they dealt with only 100 gestures, yielding 2,000 prototypes for the 2 signers.

They also reported the effects of various mixture coefficients for the HMMs. Their results show

an optimal value of the mixture coefficient to be 0.1, and for the ergodic model, the best result was

95.6% accuracy on the 100 gesture set with 5 nodes. For the left-to-right HMM, they got the same

results with a 3 node system. They tested the system with set sizes of 10, 20, 30, 40, 50, and

100. Their results show increasing error rates with set size. The growth appears to be linear. It

is difficult to extrapolate the results obtained to other gesture sets with different features, and also

with other gesture set sizes. Assan and Grobel [71] also provide a good discussion of the theory

and application of HMMs to sign language recognition.
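Several of the HMM-based systems surveyed above share the same decision rule: vector-quantize the input feature stream, score the resulting symbol sequence against one trained model per gesture, and select the model with the greatest likelihood. The sketch below illustrates only that rank-ordering step; it is not the implementation of any cited system, and the model parameters (log_pi, log_A, log_B) and the classify_gesture helper are hypothetical placeholders that would come from training.

    import numpy as np

    def forward_log_likelihood(obs, log_pi, log_A, log_B):
        """Forward algorithm in log space for a discrete HMM.

        obs    : sequence of codebook indices (the output of a low-level
                 vector quantization stage)
        log_pi : (N,)   log initial-state probabilities
        log_A  : (N, N) log state-transition matrix
        log_B  : (N, M) log emission probabilities over the M codebook symbols
        Returns log P(obs | model).
        """
        alpha = log_pi + log_B[:, obs[0]]                    # first observation
        for o in obs[1:]:
            # log-sum-exp over predecessor states, then emit the next symbol
            alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
        return float(np.logaddexp.reduce(alpha))

    def classify_gesture(obs, models):
        """Rank the per-gesture HMMs by likelihood and return the best label."""
        scores = {name: forward_log_likelihood(obs, *params)
                  for name, params in models.items()}
        return max(scores, key=scores.get), scores

For a left-to-right topology of the kind these systems typically use, log_A would additionally be constrained so that only self-transitions and forward transitions carry non-zero probability.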

A parametric extension to the standard HMM (a PHMM) has been proposed to recognize degrees (or parameters)

of motion [39, 72]. For example, the authors describe a ‘fish size’ gesture with inward opposing

open palms that indicate the size of the fish. Their system encodes the degree of motion by making

the output densities a function of the gesture parameter in question (e.g. separation of the hands


in the ‘fish size’ gesture). Schlenzig, Hunter and Jain apply a recursive recognition scheme based

on HMMs and utilize a set of rotationally invariant Zernike moments in the hand shape description

vector [73, 74]. Their system recognized a vocabulary of 6 semaphoric gestures for communication

with a robot gopher. Their work was unique in that they used a single HMM in conjunction with

a finite state estimator for sequence recognition. The hand shape in each state was recognized by

a neural net. The authors of [75] describe a system using Hidden Markov Models to recognize a

set of 24 gestures. The gestures are dynamic, i.e. the motion component is used for recognition.

The system uses 24 different HMMs for 24 gestures. The recognition rate (92.9%) is high, but it

was obtained for isolated gestures, i.e. gesture sequences were segmented by hand. The problem,

however, is in filtering out the gestures that do not belong to the gesture vocabulary (folding arms,

scratching head). The authors trained several “garbage” HMM models to recognize and filter out

such gestures, but the experiments performed were limited to the gesture vocabulary and only a

few transitional garbage gestures. Assan and Grobel [71] describe a HMM system for video-based

sign language recognition. The system recognizes 262 different gestures from the Sign Language

of the Netherlands. The authors present both results for recognition of isolated signs and for

a reduced vocabulary of connected signs. Colored gloves are used to aid in recognition of hands

and specific fingers. The colored regions are extracted for each frame to obtain hand positions

and shapes, which form the feature vector. For connected signs the authors use additional HMMs

modeling the transitions between signs. The experiments were done in a controlled environment

and only a small set of connected signs was recognized (73% recognition versus 94% for

isolated signs).

Other HMM-based systems include the recognition of a set of 6 ‘musical instrument’ sym-

bols (e.g. playing the guitar) [76], recognition of 10 gestures for presentation control [77], music

conducting [64, 78], recognition of unistroke-like finger spelling performed in the air [79], and

communication with a molecular biology workstation [9].

1.1.3 Other Paradigms

Wexelblat [21] describes research whose goal is to “understand and encapsulate gestural interac-

tion in such a way that gesticulation can be treated as a datatype – like graphics and speech – and

incorporated into any computerized environment where it is appropriate.” The author does not

make any distinction between the communicative aspect of gesture and the manipulative use of the


hand, citing the act of grasping a virtual door knob and twisting as a ‘natural’ gesture for open-

ing a door in a virtual environment. The paper describes a set of experiments for determining the

characteristics of human gesticulation accompanying the description of video clips subjects have

viewed. Although these experiments were rather naive since there is a large body of literature on

narration of video episodes [1], the study does confirm what psycholinguists already know, namely

that females do not produce fewer gestures than males, and that second language speakers do not

produce more gestures than native speakers. In fact, the author states that "in general we could not

predict what users would gesture about." Although these would be overly sweeping generalizations

to draw from a study of such limited scope were they novel discoveries, they do agree with the large

body of psycholinguistic literature. Wexelblat also states “there were things in common between

subjects that were not being seen at a full-gesture analysis level. Gesture command languages gen-

erally operate only at a whole gesture level, usually by matching the user’s gesture to a pre-stored

template. . . . [A]ttempting to do gesture recognition solely by template matching would quickly

lead to a proliferation of templates and would miss essential commonalities” [of real gestures]. As

will be seen later, this affirms our assertion concerning the characteristics of human gesticulation

accompanying speech. Proceeding from these observations, the author describes a system that

divides the gesture analysis process into two phases – feature analysis and interpretation (“where

meaning is assigned to the input”). The system uses CyberGloves (see www.virtex.com), body position sensors and eye

trackers as inputs (the data is sampled at 20Hz). The data are then segmented by feature detectors.

The features are then temporally integrated to form “frames” representing phases of a gesture. Aside

from stating that the outputs of these analyses would go to domain-dependent gesture interpreters

which may be built, the system makes no attempt to ‘recognize’ the gestures. The architecture of

the system is similar to the architecture presented in [80]: the signals are analyzed by “layers” of

parallel detectors/analyzing units, and the level of abstraction increases with each level of analysis.

Wilson, Bobick and Cassell [81] proposed a tri-phasic gesture segmenter that expects all ges-

tures to be a rest, transition, stroke, transition, rest sequence. They use an image-difference ap-

proach along with a finite state machine to detect these motion sequences. Natural gestures are,

however, seldom clearly tri-phasic in the sense of this paper. Speakers do not normally terminate

each gesture sequence with the hands in their rest positions. Instead, retractions from the preceding

gesture often merge with the preparation of the next.


Kahn et al [10] describe their Perseus architecture that recognizes a standing human form

pointing at various predefined artifacts (e.g. coke cans). They use an object-oriented represen-

tation scheme that uses a ‘feature map’ comprising intensity, edge, motion, disparity and color

features to describe objects (standing person and pointing targets) in the scene. Their system rea-

sons with these objects to determine the object being pointed at. Extending Perseus, [82] describe

the application of this work to directing and interacting with a mobile robot.

There is a class of systems that applies a combination of semaphoric and manipulative ges-

tures within a single system. This class is typified by [9], which combines HMM-based gesture

semaphores (move forward, backward), static hand poses (grasp, release, drop etc.) and pointing

gestures (finger tip tracking using 2 orthogonally oriented cameras – top and side). The system is used

to manipulate graphical DNA models.

Sowa and Wachsmuth [83, 84] describe a study based on a system for using coverbal iconic ges-

tures for describing objects in the performance of an assembly task in a virtual environment. They

use a pair of CyberGloves for gesture capture, three Ascension Flock of Birds electromagnetic

trackers (see www.ascension-tech.com) mounted to the subject's back (for torso tracking) and wrists, and a headphone-mounted

microphone for speech capture. In this work, subjects describe the contents of a set of 5 virtual parts

(e.g. screws and bars) that are presented to them on a wall-size display. The gestures were anno-

tated using the Hamburg Notation System for sign languages [85]. The authors found that “such

gestures convey geometric attributes by abstraction from the complete shape. Spatial extensions

in different dimensions and roundness constitute the dominant ‘basic’ attributes in [their] corpus

. . . geometrical attributes can be expressed in several ways using combinations of movement trajec-

tories, hand distances, hand apertures, palm orientations, hand-shapes and index finger direction.”

In essence, even with the limited scope of their experiment in which the imagery of the subjects

was guided by a wall-size visual display, a panoply of iconics relating to some (hard-to-predict)

attributes of each of the 5 target objects was produced by the subjects.

1.2 Conversational Gestures

In their review of gesture recognition systems, Pavlovic, Sharma and Huang [86] conclude that

natural, conversational gesture interfaces are still in their infancy. They state that most current

work “address a very narrow group of applications: mostly symbolic commands based on hand


postures or 3D-mouse type of pointing”, and that “real-time interaction based on 3D hand model-

based gesture analysis is yet to be demonstrated”.

In reviewing the challenges to automatic gesture recognition, Wexelblat [87] emphasizes the

need for development of systems able to recognize natural, non-posed and non-discrete gestures.

Wexelblat disqualifies systems recognizing artificial, posed and discrete gestures as unnecessary

and superficial:

If users must make one fixed gesture to, for example, move forward in a system then

stop, then make another gesture to move backward, I find myself wondering why the

system designers bother with gesture in the first place. Why not simply give the person

keys to press: one for forward and one for backward?

He considers natural gestural interaction to be the only “real” and useful mode of

interfacing with computer systems:

... one of the major points of gesture modes of operation is their naturalness. If you

take away that advantage, it is hard to see why the user benefits from a gestural inter-

face at all.

He underscores the need for systems working with truly conversational gestures, and also em-

phasizes the tight connection of gestures and speech (conversational gestures cannot be analyzed

without considering speech). He expresses an urgent need for standard datasets that could be used

for testing gesture recognition algorithms ([88] describes the development of such a dataset

or corpus containing hand-arm pointing gestures). One of his conclusions, however, is that the

need for conversational gesture recognition still remains to be proven (by proving, for example,

that natural gesture recognition can improve speech recognition):

An even broader challenge in multimodal interaction is the question of whether or not

gesture serves any measurable useful function, particularly in the presence of speech.

1.3 Gesture and Speech in Natural Conversation

In natural conversation between humans, gesture and speech function together as a ‘co-expressive’

whole, providing one’s interlocutor access to semantic content of the speech act. Psycholinguistic


evidence has established the complementary nature of the verbal and non-verbal aspects of hu-

man expression [1]. Gesture and speech are not subservient to each other, as though one were

an afterthought to enrich or augment the other. Instead, they proceed together from the same

‘idea units’, and at some point bifurcate to the different motor systems that control movement and

speech. For this reason, human multi-modal communication coheres topically at a level beyond

the local syntax structure.

This discourse structure is an expression of the ‘idea units’ that proceed from human thought.

For this same reason, while the visual form (the kinds of hand shapes etc.), magnitude (distance

of hand excursions), and trajectories (paths along which hands move) may change across cultures

and individual styles, underlying governing principles exist for the study of gesture and speech

in discourse. Chief among these is the temporal coherence between the modalities at the level of

communicative intent. A football coach may say “We have to throw the ball into the end zone”.

Depending on the intent of the expression, the speaker may place the speech and gestural stresses

on different parts on the expression. If the emphasis is on the act, the stresses may coincide on the

verb ‘THROW’, and if the emphasis is on the place of the act, the stresses accompany the utterance

‘END ZONE’. This temporal coherence is governed by the constants of the underlying neuronal

processing that proceed from the nascent ‘idea unit’ or ‘growth point’ [89, 90]. Furthermore, the

utterance in our example makes little sense in isolation. The salience of the emphasis comes into

focus only when one considers the content of the preceding discourse. The emphasis on the act

‘THROW’ may be in contrast to some preceding discussion of the futility of trying to run the ball

earlier in the game. The discourse may have been: “They are stacking up an eight-man front to

stop the run . . . We have to THROW the ball into the end zone”. The accompanying gesture might

have taken the form of the hand moving forward in a throwing motion coincident with the prosodic

emphasis on ‘THROW’. Alternatively, the emphasis on the place specifier ‘END ZONE’ may be

in contrast to other destinations of passing plays. In this case, the discourse may have been: “We

have no more time-outs, and there is only 14 seconds left in the game . . . We have to throw the ball

into the END ZONE.” The accompanying gesture may be a pointing gesture to some distal point

from the speaker indicating the end zone. This will again be co-temporal with the prosodically

emphasized utterance ‘END ZONE’. Hence, gesture and speech prosody features may provide

access to the management of discourse at the level of its underlying threads of communicative

intent.


gesticulation → language-like gestures → pantomimes → emblems → sign languages

Figure 1: A continuum of gestures (reproduced from Hand and Mind, David McNeill, 1992)

We seek an understanding of gesture and speech in natural human conversation. We believe that

an understanding of the constants and principles of such speech-gesture cohesion is essential

to its application in human computer interaction involving both modalities. Several observations

are necessary to frame the premise of such conversational interaction. While gesture may be ex-

tended to include head and eye gestures, facial gestures and body motion, we shall explore only

the relationship of hand gestures and speech.

When one considers a communicative channel, a key question is the degree of formalism re-

quired. Kendon [91, 1] describes a philology of gesture summarized in Figure 1 which is useful

in this consideration. At one end of this continuum is gesticulation which describes the free-form

gesturing that typically accompanies verbal discourse. At the other end of this continuum are sign

languages which are characterized by complete lexical and grammatical specification. In between

we have ‘language-like gestures’ in which the speaker inserts a gesture in the place of a syntactic

unit during speech; ‘pantomimes’ in which the gesturer creates an iconic and motion facsimile of

the referent, and ‘emblems’ which are codified gestural expressions that are not governed by any

formal grammar (The ‘thumbs-up’ gesture to signify ‘all is well’ is an example of such a gesture.)

This paper focuses on the ‘gesticulation’ end of this continuum.

We shall present the underlying psycholinguistic model by which gesture and speech entities

may be integrated, describe our research method that encompasses processing of the video data

and psycholinguistic analysis of the underlying discourse, and present the results of our analysis.

Our purpose, in this paper, is to motivate a perspective of conversational interaction, and to show

that the cues for such interaction are accessible in video data. In this study, our data comes from a

single video camera, and we consider only gestural motions in the camera’s image plane.

2 Psycholinguistic Basis

Gesture and speech clearly belong to different modalities of expression but they are linked on

several levels and work together to present the same semantic idea units. The two modalities are


not redundant; they are ‘co-expressive,’ meaning that they arise from a shared semantic source

but are able to express different information, overlapping this source in their own ways. A simple

example will illustrate. In the living space text we present below, the speaker describes at one

point entering a house with the clause, “when you open the doors.” At the same time she performs

a two-handed anti-symmetric gesture in which her hands, upright and palms facing forward, move

left to right several times. Gesture and speech arise from the same semantic source but are non-

redundant; each modality expresses its own part of the shared constellation. Speech describes an

action performed in relation to the doors, gesture shows their shape and extent and that there are

two of them rather than one; thus speech and gesture are co-expressive. Since gesture and speech

proceed from the same semantic source, one might expect the semantic structure of the resulting

discourse to be accessible through both the gestural and speech channels. Our ‘catchment’ concept

provides a locus along which gestural entities may be viewed to provide access to this semantic

discourse structure.

The catchment is a unifying concept that associates various discourse components [89, 92, 93].

A catchment is recognized when gesture features recur in two or more (not necessarily consecutive)

gestures. The logic is that the recurrence of imagery in a speaker’s thinking will generate recurrent

gesture features. Recurrent images suggest a common discourse theme. These gesture features can

be detected and the recurring features offer clues to the cohesive linkages in the text with which

they co-occur. A catchment is a kind of thread of visuospatial imagery that runs through the dis-

course to reveal emergent larger discourse units even when the parts of the catchment are separated

in time by other thematic material. By discovering the catchments created by a given speaker, we

can see what this speaker is combining into larger discourse units – what meanings are regarded as

similar or related and grouped together, and what meanings are being put into different catchments

or are being isolated, and thus seen by the speaker as having distinct or less related meanings.

By examining interactively shared catchments, we can extend this thematic mapping to the social

framework of the discourse. Gestural catchments have their own distinctive prosodic profiles and

gaze control in that gaze redirection is an accompaniment of catchment (re)introduction.


Consider one of the most basic gesture features, handedness. Gestures can be made with one


Figure 2: A frame of the video data collected in our gesture elicitation experiment

hand (1H) or two (2H); if 1H, they can be made with the left hand (LH) or the right (RH); if

2H, the hands can move and/or be positioned in mirror images or with one hand taking an active

role and the other a more passive ‘platform’ role. Noting groups of gestures that have the same

values of handedness can identify catchments. We can add other features such as shape, location

in space, and trajectory (curved, straight, spiral, etc.), and consider all of these as also defining

possible catchments. A given catchment could, for example, be defined by the recurrent use of the

same trajectory and space with variations of hand shapes. This would suggest a larger discourse

unit within which meanings are contrasted. Individuals differ in how they link up the world into

related and unrelated components, and catchments give us a way of detecting these individual

characteristics or cognitive styles.
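To make the catchment idea concrete in computational terms, the following sketch groups hand-coded gestures by shared feature values; any feature combination that recurs across two or more (not necessarily consecutive) gestures is a candidate catchment. This is an illustrative data structure only — the Gesture fields and feature names are hypothetical, and in our study the coding itself is produced by the psycholinguistic analysis described in Section 3.3, not by an automatic recognizer.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class Gesture:
        # Hypothetical per-gesture coding; the real coding is done by a human analyst.
        index: int          # order of occurrence in the discourse
        handedness: str     # '1H-LH', '1H-RH', '2H-sym', '2H-antisym'
        hand_shape: str     # e.g. 'open palm', 'G point'
        space: str          # e.g. 'center', 'upper left'
        trajectory: str     # e.g. 'straight', 'curved', 'spiral'

    def candidate_catchments(gestures, features=('handedness',)):
        """Group (not necessarily consecutive) gestures that share the same values
        on the chosen features; each recurring group is a candidate catchment."""
        groups = defaultdict(list)
        for g in gestures:
            groups[tuple(getattr(g, f) for f in features)].append(g.index)
        # Recurrence is required: the feature values must appear in 2+ gestures.
        return {key: idxs for key, idxs in groups.items() if len(idxs) >= 2}

Richer feature sets, e.g. features=('trajectory', 'space'), would pick out the finer-grained groupings described above, such as a recurring trajectory and space used with varying hand shapes.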

3 Experimental Method

Hand gestures are seen in abundance when humans try to communicate spatial information. In our

gesture and speech elicitation experiment, subjects are asked to describe their living quarters to an

interlocutor. This conversation is recorded on a Hi-8 tape using a consumer-quality camcorder (a

Sony TR-101 for the results presented here). Figure 2 is a frame from the experimental sequence

that is presented here. Two independent sets of analyses are performed on the video and audio data.

The first set of analyses entails the processing of the video data to obtain the motion traces of both


of the subject’s hands. The synchronized audio data are also analyzed to extract the fundamental

frequency signal and speech power amplitude (in terms of the RMS value of the audio signal).

The second set of analyses entails the expert transcription of the speech and gesture data. This

transcription is done by micro-analyzing the Hi-8 video tape using a frame-accurate video player

to correlate the speech with the gestural entities. We also perform a higher level analysis using the

transcribed text alone. Finally, the results of the psycholinguistic analyses are compared against

the features computed in the video and audio data. The purpose of this comparison is to identify the

cues accessible in the gestural and audio data that correlate well with the expert psycholinguistic

analysis. We shall discuss each step in turn.

3.1 Extraction of Hand Motion Traces for Monocular Video

In the work described here our purpose is to see what cues are afforded by gross hand motion

for discourse structuring. Human hand gesture in standard video data poses several processing

challenges. First, one cannot assume contiguous motion. A sweep of the hand across the body

can span just 0.25 s. to 0.5 s. This means that the entire motion is captured in 7 to 15 frames.

Depending on camera field-of-view on the subject, inter-frame displacement can be quite large.

This means that dense optical flow methods cannot be used. Second, because of the speed of

motion, there is considerable motion blur. Third, the hands tend to occlude each other.

We apply a parallelizable fuzzy image processing approach known as Vector Coherence Map-

ping (VCM) [94, 95] to track the hand motion. VCM is able to apply spatial coherence, momentum

(temporal coherence), motion, and skin color constraints in the vector field computation by using

a fuzzy-combination strategy, and produces good results for hand gesture tracking. Figure 3 il-

lustrates how VCM applies a spatial coherence constraint (minimizing the directional variance) in

vector field computation. Assume 3 feature points p_1^t, p_2^t, p_3^t at time t (represented by the squares at the top of the figure) move to their new locations (represented by circles) in the next frame. If all three feature points correspond equally to each other, an application of convolution to detect matches (e.g. by an Absolute Difference Correlation, ADC) from each p_i^t would yield correlation maps with 3 hotspots (shown as N(p_1^t), N(p_2^t), and N(p_3^t) in the middle of Figure 3). If all three correlation maps are normalized and summed, we obtain the vector coherence map (vcm) at the bottom of the figure; the ‘correct’ correlations would reinforce each other, while the chance correlations would not. Hence a simple weighted summation of neighboring correlation maps would yield a vector that minimizes the local variance in the computed vector field. We can adjust the degree of coherence enforced by adjusting the contributing weights of the neighboring correlation maps as a function of the distance of these maps from the point of interest.

Figure 3: Spatial Coherence Constraint in VCM

Figure 4: Hand and head movements tracked in the same sequence without (left) and with the skin color constraint applied.

A normalized vcm computed for each feature point can be thought of as a ‘likelihood’ map

for the spatial variance-minimizing vector at that point. This may be used in a ‘fuzzy image

processing’ process where other constraints may be ‘fuzzy-ANDed’ with it. We use a temporal

constraint based on momentum, and a skin-color constraint to extract the required hand motion

vector fields. Figure 4 shows the effect of incorporating the skin-color constraint to clean up the

computed vector fields. This approach is derived in detail in [94, 95].
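As a rough illustration of the fuzzy-combination step, the sketch below combines, for a single feature point, the weighted neighboring correlation maps (spatial coherence) with momentum and skin-color likelihood maps by an element-wise product, and picks the most likely displacement. It is a simplified sketch of the idea, not the full VCM algorithm of [94, 95]; the map and weight inputs are assumed to have been computed elsewhere.

    import numpy as np

    def combine_constraints(neighbor_corr_maps, weights, momentum_map, skin_map,
                            eps=1e-9):
        """Fuzzy combination of constraints for one feature point.

        neighbor_corr_maps : list of (H, W) correlation maps computed for
                             neighboring feature points over the same search window
        weights            : per-neighbor weights, decreasing with distance from
                             the point of interest
        momentum_map       : (H, W) likelihood of each candidate displacement
                             given the previous vector (temporal coherence)
        skin_map           : (H, W) skin-color likelihood at the candidate locations
        Returns the (row, col) displacement with the highest combined likelihood.
        """
        # Weighted sum of normalized neighbor maps enforces spatial coherence:
        # 'correct' correspondences reinforce each other, chance ones do not.
        vcm = np.zeros_like(neighbor_corr_maps[0], dtype=float)
        for w, corr in zip(weights, neighbor_corr_maps):
            vcm += w * (corr / (corr.sum() + eps))
        vcm /= vcm.sum() + eps

        # 'Fuzzy-AND' (element-wise product) with the momentum and skin constraints.
        combined = vcm * momentum_map * skin_map
        return np.unravel_index(np.argmax(combined), combined.shape)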

3.2 Audio Processing

Since our experiments require ‘natural interaction’ between the subject and interlocutor, we de-

cided not to use intrusive head-mounted devices. In the experiments here reported, the audio

signal came directly from the built-in microphones on the Sony TR-101 camcorders (one of our original goals was to obtain discourse segmentation from the video and audio of regular camcorders). Given the

irregular distance and orientation between the subject’s mouth and the microphone, and other envi-

ronmental factors, our analysis approach has to handle a fair amount of noise. For the experiment


reported here, we make use of the speech fundamental frequency F0 estimate for the speaker to

extract voiced speech components, and the amplitude of the speech (in terms of the RMS value

of the audio signal). We first obtain an estimate of both the F0 and RMS of the speech using En-

tropic's Xwaves+ software. The resulting F0 signal contains two kinds of errors: (i) false F0s

due to background interference, and (ii) F0 harmonics at half and a third of the correct value. For the

first kind of error, we apply a median filter to the computed RMS signal; use knowledge about the

speaker’s typical F0 to eliminate signals that are too far from this expected value; and, suppress

low amplitude F0 signals if the corresponding signal power (RMS) falls below a threshold. To

address the second error component, we apply a running estimate on the speaker’s F0 value. If the

F0 computed by Xwaves+ changes abruptly by a factor of 0.5 or 0.33, it is frequency shifted while

preserving the signal amplitude. Finally, since the amplitude of spoken (American) English tends

to decline over the course of phrases and utterances [96], we apply a linear declination estimator

to group F0 signal intervals into such declination units.
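The F0 clean-up rules above can be summarized in a short sketch. The thresholds, the use of scipy's median filter, and the comparison against a single running F0 estimate are illustrative assumptions; the actual parameter values and the declination fitting are detailed in [97].

    import numpy as np
    from scipy.signal import medfilt

    def clean_f0(f0, rms, typical_f0, rms_floor=0.02, band=0.5, median_width=5):
        """Heuristic F0 clean-up following the rules described above.

        f0, rms    : frame-synchronous pitch (Hz) and signal-power estimates
        typical_f0 : running estimate of the speaker's F0 (Hz)
        The threshold values here are illustrative placeholders.
        """
        f0 = np.asarray(f0, dtype=float).copy()
        rms_s = medfilt(np.asarray(rms, dtype=float), median_width)

        # Rule 1: flag false F0s -- frames that are too weak or whose pitch is
        # too far from the speaker's expected value.
        too_far = np.abs(f0 - typical_f0) > band * typical_f0
        too_weak = rms_s < rms_floor
        suspect = too_far | too_weak

        # Rule 2: rescue harmonic errors at 1/2 and 1/3 of the correct value by
        # shifting them back toward the running estimate.
        voiced = f0 > 0
        half = voiced & np.isclose(f0, typical_f0 / 2.0, rtol=0.1)
        third = voiced & np.isclose(f0, typical_f0 / 3.0, rtol=0.1)
        f0[half] *= 2.0
        f0[third] *= 3.0

        # Frames that are still suspect (and not rescued) are treated as unvoiced.
        f0[suspect & ~(half | third)] = 0.0
        return f0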

We compute the pause durations within the F0 signal. The remaining F0 signal is mapped

onto a parametric trend function to locate likely phrase units. For further discussion on our audio

processing algorithm, please see [97].

3.3 Detailed Psycholinguistic Analysis

Perceptual analysis of video (analysis by unaided ear and eye) plays an important role in such

disciplines as psychology, psycholinguistics, linguistics, anthropology, and neurology. In psy-

cholinguistic analysis of gesture and speech, researchers micro-analyze videos of subjects using a

high quality video cassette recorder that has a digital freeze capability down to the specific frame.

The analysis typically proceeds in three iterations. First, the speech is carefully transcribed by

hand, and then typed into a text document. The beginning of each linguistic unit (typically a

phrase) is marked by the time-stamp of the beginning of the unit in the video tape. Second, the

researcher revisits the video and annotates the text, marking co-occurrences of speech and ges-

tural phases (rest-holds, pre-stroke and post-stroke holds, gross hand shape, trajectory of motion,

gestural stroke characteristics etc.). The researcher also inserts locations of audible breath pauses,

speech disfluencies, and other salient comments. Third, all these data are formatted onto a final

transcript for psycholinguistic analysis. This is a painstaking process that takes a week to ten days

to analyze about a minute of discourse. We have developed a multimedia system to assist in this

process [98, 99], but this system was not available for the work reported here.

Figure 5: Sample first page of a psycholinguistic analysis transcript

The outcome of the psycholinguistic analysis process is a set of detailed transcripts. We have

reproduced the first page of the transcript for our example dataset in Figure 5. The gestural phases

are annotated and correlated with the text (by underlining, bolding, brackets, symbols insertion

etc.), F0 units (numbers above the text), and video tape time stamp (time signatures to the left).

Comments about gesture details are placed under each transcribed phrase.

In addition to this, we perform a second ‘levels of discourse’ analysis based on the transcribed

text alone without looking at the video (see section 4.1.1).

3.4 Comparative Analysis

In our ‘discovery experiments’, we analyze the gestural and audio signals in parallel with the

perceptual psycholinguistic analysis to find cues for high level discourse segmentation that are

accessible to computation. The voiced units in the F0 plots are numbered and related to the text

manually. These are plotted together with the gesture traces computed using VCM. First, we

perform a perceptual analysis of the data to find features that correlate well between the gesture

signal and the high level speech content. We call this the discovery phase of the analysis. Next, we

devise algorithms to extract these features. We shall present the features and cues we have found


in our data.

4 Results

Figures 6 and 7 are summary plots of 32 seconds of experimental data plotted with the video

frame number as the time scale (480 frames in each chart to give a total of 960 frames at 30

fps). The x-axes of these graphs are the frame numbers. The top two plots of each figure describe

the horizontal (x) and vertical (y) hand positions respectively. The horizontal bars under the y

direction plot are an analysis of hand motion features – LH hold, RH hold, 2H anti-symmetric, 2H

symmetric (mirror symmetry), 2H (no detected symmetry), single LH and single RH respectively.

Beneath this is the fundamental frequency F0 plot of the audio signal. The voiced utterances in

the F0 plot are numbered sequentially to facilitate correlation with the speech transcript. We have

reproduced the synchronized speech transcript at the bottom of each chart, as it correlates with

the F0 units. The vertical shaded bars that run across the charts mark the durations in which both

hands are determined to be stationary (holding). We have annotated the hand motion plots with

parenthesized markers and brief descriptive labels. We will use these labels along with the related

frame-number durations for our ensuing discussion.
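The hold, handedness, and symmetry bars in Figures 6 and 7 can be approximated from the x and y motion traces with simple frame-wise rules, as in the sketch below. The window size and thresholds are hypothetical placeholders, and the classification of mirror versus anti-symmetric motion is a simplified stand-in for the analysis described in the sections that follow.

    import numpy as np

    def frame_features(lx, ly, rx, ry, win=5, hold_thresh=2.0):
        """Assign each frame a hold/handedness/symmetry label from the hand traces.

        lx, ly, rx, ry : per-frame left/right hand positions (pixels)
        win            : half-width (frames) of the window used to detect motion
        hold_thresh    : maximum excursion (pixels) within the window for a 'hold'
        """
        lx, ly, rx, ry = (np.asarray(a, dtype=float) for a in (lx, ly, rx, ry))

        def moving(x, y, t):
            lo, hi = max(0, t - win), t + win + 1
            return max(x[lo:hi].ptp(), y[lo:hi].ptp()) > hold_thresh

        def velocity(x, y, t):
            return np.array([x[t] - x[max(0, t - 1)], y[t] - y[max(0, t - 1)]])

        labels = []
        for t in range(len(lx)):
            l_mov, r_mov = moving(lx, ly, t), moving(rx, ry, t)
            if not l_mov and not r_mov:
                labels.append('2H hold')
            elif l_mov and not r_mov:
                labels.append('1 LH (RH hold)')
            elif r_mov and not l_mov:
                labels.append('1 RH (LH hold)')
            else:
                vl, vr = velocity(lx, ly, t), velocity(rx, ry, t)
                # Mirror symmetry: opposed horizontal motion (e.g. hands moving
                # apart); anti-symmetry: the hands translate in parallel (e.g.
                # both sweeping left to right).
                if vl[0] * vr[0] < 0:
                    labels.append('2H mirror symmetric')
                elif vl[0] * vr[0] > 0 and vl[1] * vr[1] >= 0:
                    labels.append('2H anti-symmetric')
                else:
                    labels.append('2H (no detected symmetry)')
        return labels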

4.1 Discourse Segments

4.1.1 Linguistic Segmentation

Barbara Grosz and colleagues [100] have devised a systematic procedure for recovering the dis-

course structure from a transcribed text. The method consists of a set of questions with which to

guide analysis and uncover the speaker’s goals in producing each successive line of text.

Following this procedure with the text displayed in Figures 6 and 7 produces a three-level

discourse structure. This structure can be compared to the discourse segments inferred from the

objective motion patterns shown in the gestures. The gesture-based and text-based pictures of

the discourse segmentation are independently derived but the resulting correlation is remarkably

strong, implying that the gesture features were not generated haphazardly but arose as part of a

structured, multi-level process of discourse-building. Every gesture feature corresponds to a text-

based segment. Discrepancies arise where the gesture structure suggests discourse segments that

the text-based hierarchy fails to reveal, implying that the gesture modality has captured additional

discourse segments that did not find their way into the textual transcription.

Figure 6: Hand position, handedness analysis and F0 graphs for the frames 1–480

Figure 7: Hand position, handedness analysis and F0 graphs for the frames 481–961

The uppermost level

of the Grosz-type hierarchy can be labeled “Locating the Back Staircase.” It is at this level that

the discourse as a whole had its significance. The middle level concerned the first staircase and its

location, and the lowest level the front of the house and the restarting of the house tour from there.

4.1.2 Gesture-Speech Discourse Correlations

Labels (A) through (E) mark the discourse segments accessible from the gestural traces indepen-

dently from the speech data. These segments are determined solely from the hold and motion

patterns of the speaker’s hands. We summarize the correspondences of these segments with the

linguistic analysis in what follows.

(A) Back-of-house discourse segment, 1 RH (Frames 1–140): These one-handed gestures, all with

the RH, accompany the references to the back of the house that launch the discourse. This

1H catchment is replaced by a series of 2H gestures in (B), marking the shift to a different

discourse purpose, that of describing the front of the house. Notice that this catchment

feature of 1H-RH gestures (i.e. the LH is holding) reprises itself in segment (D) when the

subject returns to describing the back of the house.

(B) Front door discourse segment, 2 Synchronized Hands (Frames 188–455): Two-handed ges-

tures occur when the discourse theme is the front of the house, but there are several variants

and these mark sub-parts of the theme – the existence of the front door, opening it, and

describing it. Each sub-theme is initiated by a gesture hold, marking off in gesture the in-

ternal divisions of the discourse hierarchy. These sub-divisions are not evident in the text

and thus not picked up by the text-only analysis that produced the purpose hierarchy and its

segmentation (described in Section 4.1.1). This finer grained segmentation is confirmed by

psycholinguistic analysis of the original video.

(B.1.) ‘Enter house from front’ discourse segment 2H Anti-symmetric (Frames 188–298):

Anti-symmetric two-handed movements iconically embody the image of the two front

doors; the anti-symmetric movements themselves contrast with the following mirror-

image movements, and convey, not motion as such, but the surface and orientation of

the doors.


(B.2.) ‘Open doors’ discourse segment 2H Mirror Symmetry (Frames 299–338): In contrast

with the preceding two-handed segment, this gesture shows opening the doors and

the hands moving apart. This segment terminates in a non-rest two-handed hold of

sizeable duration of more than 0.75s (all other pre-stroke and post-stroke holds are less

than 0.5s in duration). This suggests that it is itself a ‘hold-stroke’ (i.e. an information-

laden component). This corresponds well with the text transcription. The thrusting

open of the hands indicates the action of opening the doors (coinciding with the words:

‘open the’), and the 2H hold-stroke indicates the object of interest (coinciding with

the word: ‘doors’). This agrees with the discourse analysis that carries the thread

of the ‘front door’ as the element of focus. Furthermore, the 2H mirror symmetric

motion for opening the doors carries the added information that these are double doors

(information unavailable from the text transcription alone).

(B.3.) Door description discourse segment 2H Anti-symmetric (Frames 351–458): The form

of the doors returns as a sub-theme in its own right, and again the movement is anti-

symmetric, in the plane of the closed doors.

(C) Front staircase discourse segment, 1 LH (Frames 491–704): The LH becomes active in a

series of distinctive up-down movements coinciding exactly with the discourse goal of in-

troducing the front staircase. These are clearly one-handed gestures with the RH at the rest

position.

(D) Back staircase discourse segment 1 RH (Frames 754–929): The gestures for the back staircase

are again made with the RH, but now, in contrast to the (A) catchment, the LH is at a non-rest

hold, and still in play from (C). This changes in the final segment of the discourse.

(E) ‘Upstairs’ discourse segment 2H synchronized (Frames 930–): The LH and RH join forces in a

final gesture depicting ascent to the second floor via the back staircase. This is another place

where gesture reveals a discourse element not recoverable from the text (no text accompanied

the gesture).
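Each of the labels (A) through (E) above follows directly from which hand is gesturing and which is holding. The sketch below is an illustration only: the function names, the (N, 2) pixel-coordinate traces, the window size, and the hold threshold are assumptions of ours rather than parameters of the tracking system described in this paper.

```python
import numpy as np

def hold_flags(positions, win=5, eps=2.0):
    """Flag a hand as holding when its bounding box over a short window of
    frames stays within a small pixel extent (window and threshold assumed)."""
    n = len(positions)
    flags = np.zeros(n, dtype=bool)
    for t in range(n):
        window = positions[max(0, t - win):t + win + 1]
        extent = window.max(axis=0) - window.min(axis=0)
        flags[t] = np.linalg.norm(extent) < eps
    return flags

def handedness_labels(lh_pos, rh_pos):
    """Label each frame by which hand is gesturing: '2H', '1LH', '1RH', or 'REST'."""
    lh_hold, rh_hold = hold_flags(lh_pos), hold_flags(rh_pos)
    labels = []
    for lh, rh in zip(lh_hold, rh_hold):
        if not lh and not rh:
            labels.append('2H')    # both hands in motion
        elif lh and not rh:
            labels.append('1RH')   # RH gesturing while the LH holds
        elif rh and not lh:
            labels.append('1LH')   # LH gesturing while the RH holds
        else:
            labels.append('REST')  # both hands holding (at rest or in a non-rest hold)
    return labels
```

Runs of identical labels, merged across brief gaps, would then yield candidate segments of the kind labeled (A) through (E).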

4.2 Other Gestural Features

Besides the overall gesture-hold analysis, these 32 seconds of discourse also contain several examples of gestural features and cues. We have labeled these (F) through (L). The following discussion


summarizes the features we found under these labels.

(F) Preparation for glass door description (Frames 340–359): In the middle of the discourse

segment on the front door (B), we have the interval marked (F) which appears to break

the symmetry. This break is actually the preparation phase of the RH to the non-rest hold

(for both hands) section that continues into the strongly anti-symmetric (B.3.) ‘glass door’

segment. This clarifies the interpretation of the 2H holds preceding and following (F). The

former is the post-stroke hold for the ‘open doors’ segment (B.2.), and the latter is the pre-stroke

hold for segment (B.3.). Furthermore, we can then extend segment (B.3.) backward to

the beginning of (F). This would group F0 unit 23 (‘with the’) with (B.3.). This matches the

discourse segmentation ‘with the . . .<uumm> glass in them’. It is significant to note that the

subject has interrupted her speech stream and is ‘searching’ for the next words to describe

what she wants to say. The cohesion of the phrase would be lost in a pure speech pause

analysis. We therefore introduce a rule of gesture-segment extension: a discourse segment is extended backward to include the last movement leading into its pre-stroke hold (a sketch of this rule, together with the rest-extension rule introduced in (G), follows segment (G) below).

(G) RH retraction to rest (Frames 468–490, straddling both plots): The RH movement labeled (G) spanning both plots terminates in the resting position for the hand. We can therefore

extend the starting point of the target rest backward to the start of this retraction for discourse

segmentation. Hence, we might actually move the (C) front staircase discourse segment

marked by the 1 LH feature backward from frame 491 to frame 470. This matches the

discourse analysis from F0 units 28–30: “. . . there’s the front- . . . ” This provides us with a rule of rest-extension: the rest is extended backward to the start of the last motion that results in it. The pauseless

voiced speech section from F0 units 28–35 would provide complementary evidence for this

segmentation. We have another example for this rule in the LH motion preceding the hold

labeled (I), effectively extending the ‘back staircase’ discourse segment (D) backward to

frame 705. Again this corresponds well with the speech transcript.
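The two backward-extension rules of (F) and (G) can be stated procedurally. The sketch below is illustrative only; the per-frame speed array and the movement threshold are assumptions, not values taken from our analysis.

```python
def extend_backward(boundary, speed, moving_thresh=1.5):
    """Push a segment boundary back to the onset of the last continuous
    movement leading into the hold or rest that begins at `boundary`.
    `speed` holds per-frame hand speeds in pixels/frame (threshold assumed)."""
    t = boundary
    while t > 0 and speed[t - 1] > moving_thresh:
        t -= 1                      # step back through frames that are still moving
    return t

# Rule from (F): a discourse segment absorbs the movement into its pre-stroke
# hold, e.g. extending (B.3.) back to the start of interval (F).
# Rule from (G): a rest absorbs the retraction that ends in it, e.g. moving the
# start of the (C) front staircase segment from frame 491 back to frame 470.
```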

(H) & (I) Non-Hold for (H) in (C) (Frames 643–704) and Hold for (I) in (D) (Frames 740–811):

The LH was judged by the expert coder to be not holding in (H) while it was judged to be

holding in (I). An examination of the video shows that in (H) the speaker was making a

series of small oscillatory motions (patting motion with her wrist to signify the floor of the

‘second floor’) with a general downward trend. In segment (I), the LH was holding, but the


Figure 8: Side-by-side comparison of the back-staircase catchment. For each of the two back-staircase intervals, the left- and right-hand columns show the speech transcript, the audio pitch (F0 value), and the hand movement along the X- and Y-directions (in pixels), annotated with (J.1.), (J.2.), (K.1.), (K.2.), the 2H hold, the discourse correction retraction, and the discourse repair pause.

entire body was moving slightly because of the rapid and large movements of the RH. This

distinction cannot be made from the motion traces of the LH alone. Here, we introduce a

dominant motion rule for rest determination. We use the motion energy differential of the

movements of both hands to determine if small movements in one hand are interpreted as

holds. In segment (H), the RH is at rest, hence any movement in the alternate LH becomes

significant. In segment (I), the RH exhibits strong motion, and the effects of the LH motion

are attenuated.
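A minimal sketch of this dominant motion rule follows. The sum-of-squared-displacements energy measure, the window size, and the dominance ratio are assumptions of ours rather than calibrated values from the study.

```python
import numpy as np

def motion_energy(positions, t, win=15):
    """Sum of squared frame-to-frame displacements in a window around frame t."""
    lo, hi = max(1, t - win), min(len(positions), t + win)
    steps = np.diff(positions[lo - 1:hi], axis=0)
    return float((steps ** 2).sum())

def counts_as_hold(lh_pos, rh_pos, t, hand='LH', dominance=4.0):
    """Dominant motion rule: small motion in one hand is read as a hold only
    when the other hand's motion energy dominates it (ratio assumed)."""
    e_lh, e_rh = motion_energy(lh_pos, t), motion_energy(rh_pos, t)
    if hand == 'LH':
        return e_rh > dominance * e_lh  # strong RH motion: LH jitter is a hold
    return e_lh > dominance * e_rh      # and symmetrically for the RH
```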

(J.1.) & (J.2.) Backstaircase Catchment 1 (Frames 86–132), and Backstaircase Catchment 2 (Frames

753–799): Figure 8 is a side-by-side comparison of the motions of the right hand that consti-

tute the back staircase description. The subject described the spiral staircase with an upward

twirling motion of the RH. In the first case, the subject aborted the gesture and speech se-


quence with an abrupt interruption and went on to describe the front of the house and the

front staircase. In the second case, the subject completed the gestural motion and the back

staircase discourse. We can make several observations about this pair of gestures that are

separated by more than 22 seconds. First, both are gestures of one hand (RH). Second, the

general forms of both gestures are similar. Third, up to the discourse repair break, both iterations had exactly the same duration of 47 frames. Fourth, the first motion is muted with respect to the second, suggesting that the speaker was already planning the change of direction in her discourse.

(K.1.) & (K.2.) Discourse repair retraction (Frames 133–142), and discourse repair pause (Frames

143–159): Segments (K.1.) and (K.2.) correspond with the speech ‘Oh I forgot to say’ and

flag a repair in the discourse structure. The subject actually pulls her RH back toward herself

rapidly and holds an emblematic gesture with an index finger point. While more data in

experiments targeted at discourse repair are needed for us to make a more definitive state-

ment, it is likely that abrupt gestural trajectory changes in which the hand is retracted from the gestural space suggest discourse repair. (We are in the process of performing such experiments and building a video-audio-transcription corpus to answer these questions.)

(L) Non-rest hold (Frames 740–929): An interesting phenomenon is seen in the (L) non-rest hold.

This is a lengthy hold spanning 190 frames or 6.33 seconds. This means that it cannot be

a pre-stroke or post-stroke hold. It could be a hold gesture or a stationary reference hand

in a 2H gesture sequence. In the discourse example at hand, it actually serves as an ‘idea

hold’. The subject had just ended her description of the front staircase with a mention of the

second floor. While her LH is holding, she proceeds to describe the back staircase that takes

her to the same location. At the end of this non-rest hold, she goes on to a 2H gesture

sequence (E) describing the second floor. This means that her ‘discourse plan’ to proceed

to the second floor was already in place at the end of the (C) discourse segment. The LH

suspended above her shoulder could be interpreted as holding the upstairs discourse segment

in abeyance while she describes the back staircase (segment (D)). Such holds may thus be

thought of as super-segmental cues for the overall discourse structure. The non-rest hold (L),

in essence, allows us to connect the end of (C) with the (E) discourse segment.
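Taken together, the hold durations noted in this discourse (pre-stroke and post-stroke holds under 0.5s, the (B.2.) hold-stroke above 0.75s, and the 6.33-second non-rest hold (L)) suggest a rough duration-based triage of detected holds. The thresholds in the sketch below merely restate these observations; the 2.0-second cut-off in particular is an assumption, not a calibrated value.

```python
FPS = 30  # frames per second; consistent with the 190-frame, 6.33 s hold (L)

def classify_hold(duration_frames, at_rest_position):
    """Rough triage of a detected hold by duration and location."""
    seconds = duration_frames / FPS
    if at_rest_position:
        return 'rest'
    if seconds < 0.5:
        return 'pre/post-stroke hold'   # transitional, little semantic load
    if seconds < 2.0:
        return 'hold-stroke'            # information-laden, cf. (B.2.)
    return 'non-rest (idea) hold'       # super-segmental cue, cf. (L)

# Example: the (L) hold spans frames 740-929, i.e. 190 frames, about 6.3 s.
print(classify_hold(190, at_rest_position=False))  # -> 'non-rest (idea) hold'
```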


5 Conclusions

Natural conversational interaction is a complex composition of multimodal, visually and audibly accessible information. We believe that the key to exploiting such interaction is an understanding of the underlying psycholinguistic phenomenology, along with a grasp of the kinds of computable visual and audio features that aid its interpretation.

We have shown that conversational discourse analysis using both speech and gesture captured

in video is possible. We overviewed our algorithms for gesture tracking in video and showed our

results for the extended tracking across 960 video frames. The quality of this tracking permits us

to perform the psycholinguistic analysis presented.

In the example discourse analyzed, we have shown a strong correlation between handedness and the high-level semantic content of the discourse. Where both hands are moving, the kind of

synchrony (anti-symmetry or mirror symmetry) also provides cues for discourse segmentation.

In some cases, the gestural cues reinforce the phrase segmentation based on F0 pause analysis

(segment boundaries coincide with such pauses in the voiced signal). In some other cases the

gestural analysis provides complementary information permitting segmentation where no pauses

are evident, or grouping phrase units across pauses. The gestural analysis corresponds well with the

discourse analysis performed on the text transcripts. In some cases, the gestural stream provides

segmentation cues for linguistic phenomena that are inaccessible from the text transcript alone.

The finer grained segmentations in these cases were confirmed when the linguists reviewed the

original experimental video tapes.

We also presented other observations about the gestural signal that are useful for discourse

analysis. For some of these observations, we have derived rules for extracting the associated ges-

tural features. We have shown that the final motion preceding a pre-stroke hold should be grouped

with the discourse unit of that hold. Likewise, the last movement preceding a rest-hold should

be grouped with the rest for that hand. When a hand movement is small, our ‘dominant motion

rule’ permits us to distinguish if it constitutes a hold or gestural hand motion. We have also made

observations about repeated motion trajectory catchments, discourse interruptions and repair, and

non-rest holds.

Our data also show that while 2D monocular video affords a fair amount of analysis, more

could be done with 3D data from multiple cameras. In the two discourse segments labeled (B.1.)


and (B.3.) in Figure 6, a significant catchment feature is that the hands move in a vertical plane

iconic of the surface of the doors. The twirling motion for the back staircase catchment would

be more reliably detected with 3D data. The discourse repair retraction toward the subject’s body

(K.1.) would be evident in 3D. There are also perspective effects that are difficult

to remove from 2D data. Because of the camera angle (we use an oblique view because subjects

are made uncomfortable sitting face-to-face with the interlocutor and directly facing the camera),

a horizontal motion appears to have a vertical trajectory component. This trajectory is dependent

on the distance of the hand from the subject’s body. In our data, we corrected for this effect by

assuming that the hand is moving in a fixed plane in front of the subject. This produces some

artifacts that are hard to distinguish without access to the three-dimensional data. In the ‘open

doors’ stroke in (B.2.), for example, the hands move in an arc centered on the subject’s elbows and shoulders. Hence, there is a large displacement in the ‘z’ direction (toward and away from the body). The LH also appears to move upward, while the RH’s motion appears to be almost entirely

in the x-dimension. Consequently, the LH seems to move a shorter distance than the RH in the

x-dimension. This has directed us to move a good portion of our ongoing experiments to 3D.
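The fixed-plane correction described above can be pictured as a standard pinhole back-projection onto an assumed plane. The sketch below is one way to realize it; the intrinsic matrix K and the plane parameters n and d are placeholders, not our calibration values.

```python
import numpy as np

def backproject_to_plane(u, v, K, n, d):
    """Intersect the viewing ray through pixel (u, v) with an assumed fixed
    plane n . X = d expressed in the camera frame (pinhole model, intrinsics K)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # direction of the viewing ray
    t = d / float(n @ ray)                          # scale placing the point on the plane
    return t * ray                                  # 3D point on the assumed plane
```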

We believe that the work reported here just barely scratches the surface of a compelling scien-

tific direction. This research is necessarily cross-disciplinary, and involves a fair amount of collab-

oration between the computational sciences and psycholinguistics. The computational requirements

for the video processing are significant. On a four-processor (4 R10000) Silicon Graphics Onyx

workstation, we could process 10 seconds of data in 2 hours. We have recently obtained National

Science Foundation support for a supercomputer to handle this data-intensive and computationally intensive

task. We are in the process of testing the results obtained from the case study reported here on a

corpus of experimental video data we are assembling in new human subject discourse elicitations.

We are also continuing to determine other cues for discourse analysis in different conversational

scenarios through detailed case studies like the one reported here.

The bulk of prior research on gesture interaction has focused on manipulative and semaphoric

interfaces. Apart from the use of the hands in direct manipulation, stylized semaphore recogni-

tion has dominated ‘communicative’ gesture research. We contend that far from being naturally

communicative, such stylized use of the hand[s] is in fact a rather unnatural means of communi-

cation that requires specialized training. Nonetheless, gesture and speech are a potent medium for

human-machine interface. To tap into the richness of this multimodal interaction, it is essential


that we have a better understanding of the ‘natural’ function of gesture and speech. This paper

introduces a paradigm of analysis that may find application in areas such as multimodal speech/discourse recognition, detection of speech repairs in interactive speech systems, and modeling of multimodal elements for distance communication (for example, avatar control [49]).

More information about our ongoing research is available at http://vislab.cs.wright.edu/KDI.

Acknowledgement

This research and the preparation of this paper have been supported by the U.S. National Science Foun-

dation STIMULATE program, Grant No. IRI-9618887, “Gesture, Speech, and Gaze in Discourse

Segmentation”, and the National Science Foundation KDI program, Grant No. BCS-9980054,

“Cross-Modal Analysis of Signal and Sense: Multimedia Corpora and Tools for Gesture, Speech,

and Gaze Research”.

References

[1] D. McNeill, Hand and Mind: What Gestures Reveal about Thought. Chicago: University of

Chicago Press, 1992.

[2] Britannica.com, “Encyclopaedia Britannica web site.” http://www.brittanica.com.

[3] R. A. Bolt, “Put-that-there,” Computer Graphics, vol. 14, pp. 262–270, 1980.

[4] R. A. Bolt, “Eyes at the interface,” in ACM CHI Human Factors in Computing Systems

Conference, pp. 360–362, 1982.

[5] J. Crowley, F. Berard, and J. Coutaz, “Finger tracking as an input device for augmented

reality,” in IWAFGR, (Zurich, Switzerland), pp. 195–200, June 1995.

[6] R. Kjeldsen and J. Kender, “Toward the use of gesture in traditional user interfaces,” in

FG96, (Killington, VT), pp. 151–156, Oct. 14–16 1996.

[7] F. Quek, T. Mysliwiec, and M. Zhao, “FingerMouse: A freehand pointing interface,” in Intl

Wksp Auto. Face & Gesture Rec., pp. 372–377, 1995.


[8] M. Fukumoto, K. Mase, and Y. Suenaga, “Real-time detection of pointing actions for a

glove-free interface,” in Proceedings of IAPR Workshop on Machine Vision Applications,

(Tokyo, Japan), Dec. 1992.

[9] V. I. Pavlovic, R. Sharma, and T. S. Huang, “Gestural interface to a visual computing envi-

ronment for molecular biologists,” in FG96, (Killington, VT), pp. 30–35, IEEE Comp. Soc.,

Oct. 14–16 1996.

[10] R. Kahn, M. Swain, P. Prokopowicz, and R. Firby, “Gesture recognition using the Perseus

architecture,” in CVPR, pp. 734–741, IEEE, CS, 1994.

[11] J. Segen, “Controlling computers with gloveless gestures,” in Virtual Reality Systems’93,

pp. 2–6, 1993.

[12] P. Wellner, “Interacting with paper on the DigitalDesk,” Comm. ACM, vol. 36, pp. 87–95,

July 1993.

[13] J. Segen and S. Kumar, “Fast and accurate 3D gesture recognition interface,” in ICPR, vol. 1,

(Brisbane, AU), IEEE Comp. Soc., Aug.16–20 1998.

[14] S. Kumar and J. Segen, “Gesture based 3d man-machine interaction using a single camera,”

in IEEE Int. Conf. on MM Comp. & Sys., vol. 1, (Florence, IT), IEEE Comp. Soc., June

7–11 1999.

[15] J. Segen and S. Kumar, “Shadow gestures: 3D hand pose estimation using a single camera,”

in CVPR, vol. 1, (Ft. Collins, CO), pp. 479–485, IEEE Com. Soc., June 23–25 1999.

[16] J. Rehg and T. Kanade, “DigitEyes: Vision-based human hand tracking,” Tech. Rep. CMU-

CS-93-220, CMU, 1993.

[17] P. Nesi and D. A. Bimbo, “Hand pose tracking for 3-D mouse emulation,” in IWAFGR,

(Zurich, Switzerland), pp. 302–307, June 1995.

[18] T. Baudel and M. Beaudouin-Lafon, “Charade – Remote control of objects using free-hand

gestures,” Comm. ACM, vol. 36, pp. 28–35, July 1993.


[19] R. Kadobayashi, K. Nishimoto, and K. Mase, “Design and evaluation of gesture interface

of an immersive walk-through application for exploring cyberspace,” in FG98, (Nara, Jp),

pp. 534–539, IEEE, CS, Apr. 14–16 1998.

[20] R. O’Hagan and A. Zelinsky, “Visual gesture interfaces for virtual environments,” in 1st

Australasian UI Conf., (Canberra, AU), IEEE Comp. Soc., Jan. 31 – Feb. 3 2000.

[21] A. Wexelblat, “An approach to natural gesture in virtual environments,” ACM Transactions

on Computer-Human Interaction, vol. 2, pp. 179–200, Sept. 1995.

[22] A. Azarbayejani, T. Starner, B. Horowitz, and A. Pentland, “Visually controlled graphics,”

PAMI, vol. 16, pp. 602–605, June 1993.

[23] A. G. Hauptmann, “Speech and gestures for graphic image manipulation,” in Proceedings

of CHI’89, pp. 241–245, 1989.

[24] W. Freeman and C. Weissman, “Television control by hand gestures,” in IWAFGR, (Zurich,

Switzerland), pp. 179–183, June 1995.

[25] W. Freeman, K. Tanaka, J. Ohta, and K. Kyuma, “Computer vision for computer games,” in

FG96, (Killington, VT), pp. 100–105, Oct. 14–16 1996.

[26] W. Freeman, “Computer vision for television and games,” in Proceedings of the Interna-

tional Workshop on Recognition, Analysis, and Tracking of Gestures in Real-Time Systems,

(Corfu, Greece), p. 118, 26-27 Sept. 1999.

[27] W. Freeman, D. Anderson, P. Beardsley, C. Dodge, M. Roth, C. Weissman, S. Yerazunis,

H. Kage, K. Kyuma, Y. Mayake, and K.-I. Tanaka, “Computer vision in interactive computer

graphics,” IEEE Computer Graphics and Applications, vol. 18, no. 3, pp. 42–53, 1998.

[28] M. B. Friedman, “Gestural control of robot end effectors,” in Proc. SPIE Conf. Intel. Rob.

& Comp. Vis., vol. 726, pp. 500–502, 1986.

[29] A. Katkere, E. Hunter, D. Kuramura, J. Schlenzig, S. Moezzi, and R. Jain, “Robogest:

Telepresence using hand gestures,” Tech. Rep. VCL-94-204, Visual Computing Laboratory,

University of California, San Diego, Dec. 1994.


[30] J. Triesch and C. von der Malsburg, “Robotic gesture recognition,” in Proceedings of the

International Gesture Workshop: Gesture and Sign Language in Human-Computer Inter-

action (I. Wachsmuth and M. Frohlich, eds.), (Bielefeld, Germany), pp. 233–244, Springer,

Sept. 17–19 1997.

[31] J. Triesch and C. von der Malsburg, “A gesture interface for human-robot-interaction,” in

FG98, (Nara, Jp), pp. 546–551, IEEE, CS, Apr. 14–16 1998.

[32] S. Bryson and C. Levit, “The virtual wind tunnel: An environment for the exploration of

three-dimensional unsteady flows,” in Proceedings of IEEE Visualization ’91, pp. 17–24,

Oct. 1991.

[33] S. Bryson and C. Levit, “The virtual wind tunnel,” IEEE Computer Graphics and Applica-

tions, pp. 25–34, July 1992.

[34] S. Fels and G. Hinton, “Glove-talk - A neural network interface between a data-glove and a

speech synthesizer,” IEEE Transactions on Neural Networks, vol. 4, pp. 2–8, Jan. 1993.

[35] S. Fels, Glove-TalkII: Mapping Hand Gestures to Speech Using Neural Networks-An Ap-

proach to Building Adaptive Interfaces. PhD thesis, Computer Science Dept., The Univer-

sity of Toronto, 1994.

[36] R. Pausch and R. Williams, “Giving candy to children – user-tailored gesture input driving

an articulator-based speech synthesizer,” Comm. ACM, vol. 35, pp. 58–66, May 1992.

[37] R. Cutler and M. Turk, “View-based interpretation of real-time optical flow for gesture

recognition,” in FG98, (Nara, Jp), pp. 416–421, IEEE, CS, Apr. 14–16 1998.

[38] A. Wilson and A. Bobick, “Nonlinear PHMMs for the interpretation of parameterized ges-

ture,” in CVPR, (Santa Barbara, CA), pp. 879–884, IEEE Comp. Soc, June 23–25 1998.

[39] A. Wilson and A. Bobick, “Parametric hidden Markov models for gesture recognition,”

PAMI, vol. 21, pp. 884–900, Oct. 1999.

[40] F. Quek, “Eyes in the interface,” Int. J. of Image and Vision Comp., vol. 13, pp. 511–525,

Aug. 1995.


[41] F. Quek, “Unencumbered gestural interaction,” IEEE Multimedia, vol. 4, no. 3, pp. 36–47,

1996.

[42] P. Cohen, M. Dalrymple, D. Moran, F. Pereira, J. Sullivan, R. Gargan, J. Schlossberg, and

S. Tyler, “Synergistic use of direct manipulation and natural language,” in Human Factors

in Computing Systems: CHI’89 Conference Proceedings, pp. 227–234, ACM, Addison-

Wesley Publishing Co., Apr. 1989.

[43] P. Cohen, M. Dalrymple, D. Moran, F. Pereira, J. Sullivan, R. Gargan, J. Schlossberg, and

S. Tyler, “Synergistic use of direct manipulation and natural language,” in Readings in Intel-

ligent User Interfaces (M. Maybury and W. Wahlster, eds.), pp. 29–35, Morgan Kaufman,

1998.

[44] J. G. Neal, C. Y. Thielman, Z. Dobes, S. M. Haller, and S. C. Shapiro, “Natural language

with integrated deictic and graphic gestures,” in Proc. of the Speech and Natural Language

Workshop, (Cape Cod, MA), pp. 410–423, 1989.

[45] J. G. Neal, C. Y. Thielman, Z. Dobes, S. M. Haller, and S. C. Shapiro, “Natural language

with integrated deictic and graphic gestures,” in Readings in Intelligent User Interfaces

(M. Maybury and W. Wahlster, eds.), pp. 38–51, Morgan Kaufman, 1998.

[46] S. Oviatt, A. DeAngeli, and K. Kuhn, “Integration and synchronization of input modes

during multimodal human-computer interaction,” in Proc. CHI 97, Vol. 1, pp. 415–422,

1997.

[47] S. Oviatt and P. Cohen, “Multimodal interfaces that process what comes naturally,” Comm.

ACM, vol. 43, no. 3, pp. 43–53, 2000.

[48] S. Oviatt, “Mutual disambiguation of recognition errors in a multimodal architecture,” in

Proc. CHI 99, Vol. 1, pp. 576–583, 1999.

[49] F. Quek, R. Yarger, Y. Hachiahmetoglu, J. Ohya, K. Shinjiro, D. Nakatsu, and D. McNeill,

“Bunshin: A believable avatar surrogate for both scripted and on-the-fly pen-based control

in a presentation environment,” in Emerging Technologies, SIGGRAPH 2000, (New Orleans,

LA), pp. 187 (abstract) and CD–ROM (full paper), July 23–28 2000.


[50] D. Koons, C. Sparrell, and K. Thorisson, “Integrating simultaneous input from speech, gaze,

and hand gestures,” in Intelligent Multimedia Interfaces (M. Maybury, ed.), pp. 257–276,

AAAI Press; The MIT Press, 1993.

[51] D. Koons, C. Sparrell, and K. Thorisson, “Integrating simultaneous input from speech,

gaze, and hand gestures,” in Intelligent Multimedia Interfaces (M. Maybury, ed.), pp. 53–

62, AAAI Press; The MIT Press, 1998.

[52] X. Zhu, J. Yang, and A. Waibel, “Segmenting hands of arbitrary color,” in FG2000, (Greno-

ble, France), pp. 446–453, 28-30 Mar. 2000.

[53] E. Hunter, J. Schlenzig, and R. Jain, “Posture estimation in reduced-model gesture input

systems,” in IWAFGR, (Zurich, Switzerland), pp. 290–295, June 1995.

[54] W. T. Freeman and M. Roth, “Orientation histograms for hand gesture recognition,” in Intl Wksp

Auto. Face & Gesture Rec., (Zurich, Switzerland), pp. 296–301, June 1995.

[55] J. Triesch and C. von der Malsburg, “Robust classification of hand postures against complex

backgrounds,” in FG96, (Killington, VT), pp. 170–175, Oct. 14–16 1996.

[56] A. Lanitis, C. Taylor, T. Cootes, and T. Ahmed, “Automatic interpretation of human faces

and hand gestures,” in IWAFGR, (Zurich, Switzerland), pp. 98–103, June 1995.

[57] C. Maggioni, “Gesturecomputer - new ways of operating a computer,” in IWAFGR, (Zurich,

Switzerland), pp. 166–171, June 1995.

[58] K. H. Munk and E. Granum, “On the use of context and a priori knowledge in motion

analysis for visual gesture recognition,” in Proceedings of the International Gesture Work-

shop: Gesture and Sign Language in Human-Computer Interaction (I. Wachsmuth and

M. Frohlich, eds.), (Bielefeld, Germany), pp. 123–134, Springer, Sept. 17–19 1997.

[59] M. Black and A. Jepson, “Recognizing temporal trajectories using the condensation algo-

rithm,” in FG98, (Nara, Jp), pp. 16–21, IEEE, CS, Apr. 14–16 1998.

[60] F. Quek, “Hand gesture interface for human-machine interaction,” in Virtual Reality Sys-

tems’93 Spring Conference, (New York), pp. 13–19, Mar. 15-17 1993.


[61] F. Quek, “Toward a vision-based hand gesture interface,” in Virtual Reality Software and

Technology Conference, (Singapore), pp. 17–29, Aug. 23-26 1994.

[62] F. K. H. Quek and M. Zhao, “Inductive learning in hand pose recognition,” in FG96,

(Killington, VT), pp. 78–83, IEEE Comp. Soc., Oct. 14–16 1996.

[63] M. Zhao, F. Quek, and X. Wu, “RIEVL: Recursive induction learning in hand gesture recog-

nition,” PAMI, vol. 20, no. 11, pp. 1174–1185, 1998.

[64] A. Wilson and A. Bobick, “Configuration states for representation and recognition of ges-

ture,” in IWAFGR, (Zurich, Switzerland), pp. 129–134, June 1995.

[65] C. J. Cohen, L. Conway, and D. Koditschek, “Dynamical system representation, generation,

and recognition of basic oscillatory motion gestures,” in FG96, (Killington, VT), pp. 60–65,

Oct. 14–16 1996.

[66] T. Darrell and A. Pentland, “Recognition of space-time gestures using a distributed rep-

resentation,” in CVPR, IEEE, CS, 1993. Also appears as M.I.T. Media Lab. Vision and

Modeling Group Tech. Rept. No. 197.

[67] T. J. Darrell, I. A. Essa, and A. P. Pentland, “Task-specific gesture analysis in real-time using

interpolated views,” Tech. Rep. MIT Media Lab. Rept. 364, M.I.T. Media Laboratory, 1995.

[68] J. Yamato, J. Ohya, and K. Ishii, “Recognizing human action in time-sequential images

using hidden markov model,” in Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, pp. 379–385, 1992.

[69] L. Rabiner, “A tutorial on hidden markov models and selected applications in speech recog-

nition,” Proc. of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[70] F. G. Hofmann, P. Heyer, and G. Hommel, “Velocity profile based recognition of dynamic

gestures with discrete hidden markov models,” in Proceedings of the International Gesture

Workshop: Gesture and Sign Language in Human-Computer Interaction (I. Wachsmuth and

M. Frohlich, eds.), (Bielefeld, Germany), pp. 81–95, Springer, Sept. 17–19 1997.


[71] M. Assan and K. Grobel, “Video-based sign language recognition using hidden markov

models,” in Proceedings of the International Gesture Workshop: Gesture and Sign Lan-

guage in Human-Computer Interaction (I. Wachsmuth and M. Frohlich, eds.), (Bielefeld,

Germany), pp. 97–109, Springer, Sept. 17–19 1997.

[72] A. Wilson and A. Bobick, “Recognition and interpretation of parametric gesture,” Tech.

Rep. MIT Media Lab. Rept. 421, M.I.T. Media Laboratory, 1998.

[73] J. Schlenzig, E. Hunter, and R. Jain, “Recursive identification of gesture inputs using hidden

markov models,” in Proceedings of the Second IEEE Workshop on Applications of Computer

Vision, (Pacific Grove, California), Nov. 1994.

[74] J. Schlenzig, E. Hunter, and R. Jain, “Vision-based human hand gesture interpretation us-

ing recursive estimation,” in Twenty-Eighth Asilomar Conference on Signals, Systems, and

Computers, (Pacific Grove, California), pp. 187–194, Dec. 1994.

[75] G. Rigoll, A. Kosmala, and S. Eickeler, “High performance real-time gesture recognition

using hidden markov models,” in Proceedings of the International Gesture Workshop: Ges-

ture and Sign Language in Human-Computer Interaction (I. Wachsmuth and M. Frohlich,

eds.), (Bielefeld, Germany), pp. 69–80, Springer, Sept. 17–19 1997.

[76] Y. Iwai, H. Shimizu, and M. Yachida, “Real-time context-based gesture recognition using

HMM and automaton,” in Proceedings of the International Workshop on Recognition, Anal-

ysis, and Tracking of Gestures in Real-Time Systems, (Corfu, Greece), pp. 127–135, 26-27

Sept. 1999.

[77] H.-K. Lee and J. Kim, “An HMM-based threshold model approach for gesture recognition,”

PAMI, vol. 21, pp. 961–973, Sept. 1999.

[78] A. F. Bobick and Y. A. Ivanov, “Action recognition using probabilistic parsing,” Tech. Rep.

MIT Media Lab. Rept. 449, M.I.T. Media Laboratory, 1998.

[79] J. Martin and J.-B. Durand, “Automatic handwriting gestures recognition using hidden

markov models,” in FG2000, (Grenoble, France), pp. 403–409, 28-30 Mar. 2000.


[80] H.-J. Boehme, A. Brakensiek, U.-D. Braumann, M. Krabbes, and H.-M. Gross, “Neural

architecture for gesture-based human-machine-interaction,” in Proceedings of the Interna-

tional Gesture Workshop: Gesture and Sign Language in Human-Computer Interaction

(I. Wachsmuth and M. Frohlich, eds.), (Bielefeld, Germany), pp. 219–232, Springer, Sept.

17–19 1997.

[81] A. D. Wilson, A. F. Bobick, and J. Cassell, “Recovering temporal structure of natural ges-

ture,” in FG96, (Killington, VT), pp. 66–71, Oct. 14–16 1996.

[82] D. Franklin, R. E. Kahn, M. J. Swain, and R. J. Firby, “Happy patrons make better tip-

pers: Creating a robot waiter using Perseus and the Animate Agent architecture,” in FG96,

(Killington, VT), pp. 253–258, Oct. 14–16 1996.

[83] T. Sowa and I. Wachsmuth, “Coverbal iconic gestures for object descriptions in virtual en-

vironments: An empirical study,” in Submitted to: Post-Proceedings of the Conference on

Gestures: Meaning and Use, (Porto, Portugal), Apr. 1999.

[84] T. Sowa and I. Wachsmuth, “Understanding coverbal dimensional gestures in a virtual de-

sign environment,” in Proceedings of the ESCA Workshop on Interactive Dialogue in Multi-

Modal Systems (P. Dalsgaard, C.-H. Lee, P. Heisterkamp, and R. Cole, eds.), (Kloster Irsee,

Germany), pp. 117–120, June 22–25 1999.

[85] S. Prillwitz et al., Hamburg Notation System for Sign Languages – An Introductory Guide.

Hamburg: Signum Press, 1989.

[86] V. I. Pavlovic, R. Sharma, and T. S. Huang, “Visual interpretation of hand gestures for

human-computer interaction: A review,” IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence, vol. 19, pp. 677–695, July 1997.

[87] A. Wexelblat, “Research challenges in gesture: Open issues and unsolved problems,” in

Proceedings of the International Gesture Workshop: Gesture and Sign Language in Human-

Computer Interaction (I. Wachsmuth and M. Frohlich, eds.), (Bielefeld, Germany), pp. 1–

11, Springer, Sept. 17–19 1997.

[88] S. Gibet, J. Richardson, T. Lebourque, and A. Braffort, “Corpus of 3d natural movements

and sign language primitives of movement,” in Proceedings of the International Gesture


Workshop: Gesture and Sign Language in Human-Computer Interaction (I. Wachsmuth

and M. Frohlich, eds.), (Bielefeld, Germany), pp. 111–121, Springer, Sept. 17–19 1997.

[89] D. McNeill, “Growth points, catchments, and contexts,” Cognitive Studies: Bulletin of the

Japanese Cognitive Science Society, vol. 7, no. 1, 2000.

[90] D. McNeill and S. Duncan, “Growth points in thinking-for-speaking,” in Language and

Gesture (D. McNeill, ed.), ch. 7, pp. 141–161, Cambridge: Cambridge University Press,

2000.

[91] A. Kendon, “Current issues in the study of gesture,” in The Biological Foundations of

Gestures:Motor and Semiotic Aspects (J.-L. Nespoulous, P. Peron, and A. Lecours, eds.),

pp. 23–47, Hillsdale, NJ: Lawrence Erlbaum Assoc., 1986.

[92] D. McNeill, “Catchments and context: Non-modular factors in speech and gesture,” in Lan-

guage and Gesture (D. McNeill, ed.), ch. 15, pp. 312–328, Cambridge: Cambridge Univer-

sity Press, 2000.

[93] D. McNeill, F. Quek, K.-E. McCullough, S. Duncan, N. Furuyama, R. Bryll, X.-F. Ma, and

R. Ansari, “Catchments, prosody and discourse,” Gesture, in press, 2000.

[94] F. Quek and R. Bryll, “Vector Coherence Mapping: A parallelizable approach to image flow

computation,” in Proc. Asian Conf. on Comp. Vis., vol. 2, (Hong Kong), pp. 591–598, Jan.

1998.

[95] F. Quek, X. Ma, and R. Bryll, “A parallel algorithm for dynamic gesture tracking,” in

ICCV’99 Wksp on Rec., Anal. & Tracking of Faces & Gestures in R.T. Sys., (Corfu, Greece),

pp. 119–126, Sept.26–27 1999.

[96] D. Ladd, Intonational Phonology. Cambridge: Cambridge University Press, 1996.

[97] R. Ansari, Y. Dai, J. Lou, D. McNeill, and F. Quek, “Representation of prosodic structure

in speech using nonlinear methods,” in 1999 Workshop on Nonlinear Signal and Image

Processing, (Antalya, Turkey), 1999.


[98] F. Quek and D. McNeill, “A multimedia system for temporally situated perceptual psy-

cholinguistic analysis,” in Measuring Behavior 2000, (Nijmegen, NL), p. 257, Aug. 15–18

2000.

[99] F. Quek, R. Bryll, H. Arslan, C. Kirbas, and D. McNeill, “A multimedia database system

for temporally situated perceptual psycholinguistic analysis,” Accepted, to appear in Multi-

media Tools and Applications (Kluwer Academic Publishers), 1999.

[100] C. Nakatani, B. Grosz, D. Ahn, and J. Hirschberg, “Instructions for annotating discourses,”

Tech. Rep. TR-21-95, Ctr for Res. in Comp. Tech., Harvard U., MA, 1995.
