A Multimedia System for Temporally Situated Perceptual Psycholinguistic Analysis


FRANCIS QUEK†, ROBERT BRYLL†, CEMIL KIRBAS†, HASAN ARSLAN†, AND DAVID MCNEILL‡
[email protected]

†Vision Interfaces and Systems Laboratory, Wright State University, Dayton, OH
‡Department of Psychology, University of Chicago

Abstract. Perceptual analysis of video (analysis by unaided ear and eye) plays an important role in such disciplines as psychology, psycholinguistics, linguistics, anthropology, and neurology. In the specific domain of psycholinguistic analysis of gesture and speech, researchers micro-analyze videos of subjects using a high quality video cassette recorder that has a digital freeze capability down to the specific frame. Such analyses are very labor intensive and slow. We present a multimedia system for perceptual analysis of video data using a multiple, dynamically linked representation model. The system components are linked through a time portal with a current time focus. The system provides mechanisms to analyze overlapping hierarchical interpretations of the discourse, and integrates visual gesture analysis, speech analysis, video gaze analysis, and text transcription into a coordinated whole. The various interaction components facilitate accurate multi-point access to the data. While this system is currently used to analyze gesture, speech and gaze in human discourse, the system described may be applied to any other field where careful analysis of temporal synchronies in video is important.

Keywords: Multimedia Data Visualization; Temporal Analysis; User Interface; Multiple, Linked Representation; Gesture Coding; Gesture, Speech and Gaze Analysis

Multimedia Tools and Applications, Kluwer Academic Publishers, Vol. 18, No. 2, pp. 91-113, August 2002. Also available as VISLab Report VISLab-00-03.

1. Introduction

Perceptual analysis of video (analysis by unaided ear and eye) plays an important

role in such disciplines as psychology, psycholinguistics, linguistics, anthropology,

and neurology. In the specific domain of psycholinguistic analysis of gesture and speech, researchers micro-analyze videos of subjects using a professional quality video cassette recorder that has a digital freeze capability down to the specific frame. This is a painstaking and laborious task. In our own work on the integrated analysis of gesture, speech, and gaze (GSG), the labor intensiveness of such analysis is one of the key bottlenecks in our research.

We have developed a multimedia system for perceptual analysis of GSG in video

(henceforth MDB-GSG). This system, developed with attention to the interactive

needs of psycholinguistic perceptual analysis, has resulted in at least a ten-fold

increase in coding efficiency. Furthermore, MDB-GSG provides a level of access

to GSG entities computationally extracted from the video and audio streams that

facilitates new analysis and discoveries.

In this paper, we shall discuss the task of perceptual analysis of video, the inter-

active model of our MDB-GSG system based on time situated multiple-linked and

related representations, and the MDB-GSG system design and implementation.

2. Perceptual Analysis of Video

Psycholinguistic perceptual analysis of video typically proceeds in three iterations.

First, the speech is carefully transcribed by hand, and then typed into a text doc-

ument. The beginning of each linguistic unit (typically a phrase) is marked by the

time-stamp of the beginning of the unit in the video tape. Second, the researcher

revisits the video and annotates the text, marking co-occurrences of speech and

gestural phases (rest-holds, pre-stroke and post-stroke holds, gross hand shape, tra-

jectory of motion, gestural stroke characteristics etc.). The researcher also inserts

locations of audible breath pauses, speech disfluencies, and other salient comments. Third, all these data are formatted onto a final transcript for psycholinguistic analysis. This is a painstakingly laborious process that takes a week to ten days to analyze about a minute of discourse.


Figure 1. Hand position, analysis, F0, transcript and RMS graphs for frames 1-481. (Panels: speech transcript; RMS speech amplitude; audio pitch (F0); left and right hand rest marks; hand movement along the Y-direction; hand movement along the X-direction for LH and RH; x-axis: frame number 1-481.)


Figure 2. Sample first page of a psycholinguistic analysis transcript. (Time-stamped phrases with numbered F0 units above the text, bracketed gestural phases, and per-phrase gesture annotations below the text.)

In our current work [17, 16], we relate such analysis to speech prosody, three-

dimensional traces of hand motion (plotted as x, y, and z displacements against

time), three-dimensional traces of head motion, and three-dimensional (turn, nod,

roll) traces of gaze orientation. The fundamental frequency plots of F0 envelopes

are extracted using Entropic's Xwaves™ [1], transferred to a page layout package

and printed together with the hand and head motion/direction traces. These F0

plots are then numbered and related to the text manually. The addition of these

steps on top of the traditional perceptual analysis makes such research even more

labor intensive. Finally, we use a graphical page layout program to combine all

the plots manually so that the time scales are aligned. This is essential to provide

visualization of the data to support discovery. It typically takes an entire day to

organize a set of data in this way. Figure 1 is an example of such plots. The

outcome of the psycholinguistic analysis process is a set of detailed transcripts.

We have reproduced the first page of the transcript for our example dataset in Figure 2. The gestural phases are annotated and correlated with the text (by underlining, bolding, brackets, symbol insertion, etc.), F0 units (numbers above

the text), and video tape time stamp (time signatures to the left). Comments


about gesture details are placed under each transcribed phrase. Each step of this

analysis requires significant research labor.

3. Time Situated Multiple-Linked and Related Representations

This paper presents a multimedia system that is designed to reduce the labor of

perceptual analysis, and to provide a level of analysis that heretofore has not been

possible. Our goal is not to do away with expert perceptual analysis. Rather,

we seek to provide higher level objects and representations and to mitigate the

labor-intensiveness of analysis that has access only to the time stamp of the video

signal. By providing direct access to computed entities (GSG plots, automatically

segmented gesture units, speech prosody groupings etc.), the underlying video and

audio, and other representations, the MDB-GSG system also enables a level of

analysis heretofore not available to researchers.

An interactive system may be viewed as a conduit of communication between the

human and the machine [9]. Modern psychological and linguistic theories of discourse

stress the importance of maintaining a state of `situatedness' for communication

to be successful [4, 3, 5]. Under this model, the user and the computer system

maintain a stream of communication that keeps the user situated within an abstract

interactive space. In the complex environment of multi-modal discourse analysis,

this becomes all the more important. In our system, the key element of this user-

system coordination is temporal situatedness. To motivate this situatedness, all

the interface components are linked by their time synchrony. Each representation

of the complex multi-modal space is focussed on the same moment in time. To

enforce this mental model, we call this a time-portal through which we view the

extended timeline of the subject's GSG behavior. Hence, this is an example of

Multiple, Linked Representations (MLR) of dynamic components [7, 8, 15, 14] in

which each representation reinforces the situatedness condition. Furthermore, each

representation in the system is active, thereby enabling multiple-point access to the

underlying data. We shall flesh out these concepts using the actual components of

our MDB-GSG system as examples.


Temporal cohesion is especially critical in GSG analysis. The motor and speech

channels are not subservient to one another, but spring from a common psycho-

logical (and neurophysiological) source. They proceed on independent pathways to

the final observable behavioral output. The temporal cohesion of the GSG channels is thus governed by the constants of neurophysiology and psycholinguistics

[10, 6, 12, 13]. The time portal metaphor provides a panoramic snapshot of the

functioning of all modalities at synchronized time instants.

4. The MDB-GSG System

Figure 3 shows our MDB-GSG system with all the representational components

open. The essence of our MLR strategy is that each of these components is syn-

chronized with the current time focus. This means that each component is animated

to track this time focus, and when it changes, all the system components change

to reflect this. Not all components, however, have to be active at the same time.

When they are inactive, the system `deregisters' them, and they are not updated.

When a system component is opened (e.g. the avatar representation at the bottom

left of Figure 3), it is registered with the system and linked with the current time

focus. These system components will be considered in five groups:

The VCR-like interface and player

The hierarchical shot-based representation and editor

The animated strip chart representation

The synchronized speech transcription interface

The GSG abstraction representation or avatar

Each of these representational components was chosen to advance our psycholin-

guistic analysis and support our computer vision and signal processing efforts. We

shall motivate each interface element as it is discussed.


Figure 3. System Screen of the MDB-GSG Analysis System. (Labeled components: VCR-Like Control Panel, Digital Video Monitor, Timeline Representation, Text Transcription Interface, Synchronized Transcript Display, Avatar Representation, Hierarchical Keyframe Browser, Current Shot Keyframe, Animated Strip Chart Panel, Hierarchy Editor Panel, and the Current Time Focus.)


4.1. VCR-Like Interface and Player

The panels labeled VCR-like Control Panel and Digital Video Monitor at the bottom

right of Figure 3 provide a handle to the data using the familiar video metaphor.

The MDB-GSG system is designed so that different virtual players may be plugged

into the system. Currently, we have drivers to control a digital video (e.g. MJPEG,

MPEG, QuickTime) player and two physical devices (a computer controlled Hi-8

video player, and a laser videodisc player). Drivers for other media such as DVD

players can easily be added.

The functions of this control panel and video display are similar to other computer-

based players except for several enhancements. The frame shown in the Digital Video

Monitor is always the frame at the current time focus. As the video plays, the time

focus changes accordingly. If the time focus is changed through some other interface

component, the video player will jump to the corresponding video frame.

Our choice of MJPEG for the video is driven by the need for random frame

access and reverse play. The media player in the Silicon Graphics media library is

able to play both the video and audio tracks together both forward and in reverse

at various speeds. This is important for coding the exact beginning of particular

utterances in the audio because of the psychological effect where a listener perceives a word a fraction of a second after it is uttered. Humans hear coherent sounds as words, and perceive the words as they emerge from the mental processing. Hence, it is difficult to locate the exact synchronies of the beginning of gesture phrases and speech phrases when the audio is played forward at regular speeds. When the audio is played backward or at slow speeds, this effect is removed and the coder

can perceive the synchronies of interest.

The circular loop button on the top row of the control panel toggles the `shot

loop' mode. When this mode is set, the video player will keep looping through the

current shot at the current level until play is halted. A `shot' is a video segment of

significance to the GSG analysis. The loop mode permits the researcher to examine

a particular GSG entity (e.g. a stroke) to identify its idiosyncrasies at various play

speeds. The jump-to-start and jump-to-end (double arrows with a vertical terminal


bar) buttons at the right end of the top row allow the user to skip from shot to

shot in the shot hierarchy (described in the next section).

The Step Rev and Step Fwd buttons in the second row permit the researcher to

step through the video a frame at a time (similar to the frame jog operation in

a professional video player). This is important for micro gesture and gaze shift

coding. The triple and quadruple directional arrow buttons play the video and

audio backward and forward at various faster-than-realtime speeds (with audio).

The Slow Rev and Slow Fwd buttons in the third row permit play at various fractions

of the regular video rate with the accompanying audio. These rates (0.25, 0.5 and

0.75) are set via the pull-down menu at the top of the VCR-Like Control Panel. This,

again, is important for detailed analysis and the coding of exact synchronies in the

GSG signals.

The Timeline Representation at the bottom of the VCR-Like Control Panel shows the offset of the current frame in the entire video. Consistent with the

rest of the interface, the slider is strongly coupled to the other representational

components. As the video plays in the VCR-like representation, the location of

the current frame is tracked on the timeline. Any change in the current frame by

interaction from either the visual summary representation or the hierarchical shot

representation is reflected by the timeline as well. The timeline also serves as a

slider control. As the slider is moved, all the other representational components

alter to re ect this change. The number above the slider represents the percent

offset of the current frame into the video.

Rationale

We have already discussed the importance of careful temporal analysis in the study of GSG cross-modal communication. Expert gesture and speech coders are well experienced in the operation of high-end VCRs in their analysis. The design choices in our video player, the VCR-like control panel and the timeline slider, build on the users' familiarity with professional quality VCR controls and extend them for temporal analysis.


Figure 4. Shot Hierarchical Organization. (Level 0: shots 1-8; level 1: subshots 2.1-2.4 and 6.1-6.3; level 2: subshots 2.3.1-2.3.3; the current shot at each level is marked.)

4.2. Hierarchical Shot-Based Representation and Editor

The panels labeled Hierarchical Keyframe Browser and Hierarchy Editor Panel facili-

tate the organization of the video stream into a nested hierarchical structure, and

the visualization of this hierarchy in summary format.

4.2.1. Shot Architecture Before we proceed, we need to define several terms to facilitate our discussion. A video sequence can be thought of as a series of video frames. These frames can be organized into shots. We define a shot as any sequential series of video frames delineated by a first and a last frame. These shots may be organized into a nested hierarchy as illustrated in Figure 4. In the figure, each shot

at the highest level (we call this level 0) is numbered starting from 1. Shot 2 is

shown to have four subshots which are numbered 2.1 to 2.4. These shots are said

to be in level 1 (as are shots 6.1 to 6.3). Shot 2.3 in turn contains three subshots,

2.3.1 to 2.3.3. Shot 2 spans its subshots. In other words, all the video frames in

shots 2.1 to 2.4 are also frames of shot 2. Similarly, all the frames of shots 2.3.1 to

2.3.3 are also in shot 2.3. Hence, the same frame may be contained in a shot in each

level of the hierarchy. One could, for example, select shot 2 and begin playing the

video from its first frame. The frames would play through shot 2, and eventually enter shot 3. If one begins playing from the first frame of shot 2.2, the video would

play through shot 2.4 and proceed to shot 3.


Figure 5. Shot Hierarchy Data Model. (Each node records its first frame F, keyframe K, last frame L, and a pointer P to its subshot list; predecessor, successor, and subshot links are null-terminated.)

Next, we define a series of concepts which define the temporal situatedness of the

system. The user may select any shot to be the current shot. The video system

will play the frames in that shot, and the frame being displayed at any instant is

known as the current frame. These are dynamic concepts since the current frame

and current shot change as the video is being played. Suppose we are at level 2 of

the hierarchy and select shot 2.3.3 as the current shot (shown as a shaded box in

Figure 4). Shot 2.3 at level 1 and shot 2 at level 0 would conceptually become the

current shot at those levels. This could lead to confusion in the user, and hence

we introduce the concept of the current level. At any moment in the interface, the

system is situated at one level, and only the shots at that level appear in the visual

keyframe summary representation. In our current example, the system would be

in level 2 although the current frame is also in shot 2.3 and shot 2 in levels 1 and

0 respectively.

Figure 5 shows the data model in our shot hierarchy. Each shot defines its first and final frames within the video (represented by F and L respectively) and its keyframe for use in the visual summary. Each shot unit is linked to its predecessor and successor in the shot sequence. Each shot may also comprise a list of subshots. This allows the shots to be organized in a recursive hierarchy. The shot data model is essentially a structure containing the F and L indices into a


video. Hence, we may maintain multiple shot hierarchies over the same video/audio

sequence. This is important for GSG analysis since natural communication typically

contains multiple overlapping semantic threads.
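To make the data model of Figure 5 concrete, the following C++ fragment is a minimal sketch of such a recursive shot node and of the lookup that yields the chain of current shots, one per level, for a given frame. The identifiers (Shot, firstFrame, currentShotChain, and so on) are illustrative and are not the system's actual class names.

#include <string>
#include <vector>

// Minimal sketch of the shot hierarchy node of Figure 5 (hypothetical names).
struct Shot {
    int firstFrame = 0;              // F: index of the first frame in the video
    int lastFrame = 0;               // L: index of the last frame in the video
    int keyframe = 0;                // K: frame used in the keyframe visual summary
    std::string label;               // short label entered in the browser
    std::string description;         // longer annotation text
    Shot* prev = nullptr;            // predecessor at the same level
    Shot* next = nullptr;            // successor at the same level
    std::vector<Shot*> subshots;     // nested subshots (empty if none)

    bool contains(int frame) const { // a frame belongs to a shot and to all its ancestors
        return frame >= firstFrame && frame <= lastFrame;
    }
};

// Given the level-0 shot list and a frame number, return the chain of current
// shots, one per level, down to the deepest shot containing the frame.
inline std::vector<Shot*> currentShotChain(const std::vector<Shot*>& level0, int frame) {
    std::vector<Shot*> chain;
    const std::vector<Shot*>* level = &level0;
    while (true) {
        Shot* hit = nullptr;
        for (Shot* s : *level)
            if (s->contains(frame)) { hit = s; break; }
        if (!hit) break;
        chain.push_back(hit);
        level = &hit->subshots;
    }
    return chain;
}

Because a shot stores only frame indices and its own links, several independent hierarchies of such nodes can index the same video, which is what permits the multiple overlapping hierarchies described above.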

4.2.2. Keyframe-Based Visual Summary The use of the keyframe representation

as a visual summary for video segments has proven surprisingly effective [18, 2]. For GSG segmentation, a keyframe representation of each video segment permits the researcher to see the general hand configurations and gaze directions represented in the keyframe. The system takes the first frame of each shot/subshot as the default

keyframe, but the user may substitute this with any frame of her choice through

the Hierarchy Editor Panel.

Figure 3 shows the standard keyframe representation in the top left corner. Each

shot is summarized by its keyframe, and the keyframes are displayed in a scrollable

table. The keyframe representing the "current shot" is highlighted with a boundary.

A shot can be selected as the current shot from this interface by selecting its

keyframe with the computer pointing device. In accordance with the MLR strategy,

the current time focus is set to the beginning of the shot, and all other interface

components are updated (the position of the current shot in the shot hierarchy

appears in the shot hierarchy representation, the first frame of the shot appears as the "current frame" in the display of the Digital Video Monitor, and the timeline representation is updated to show the offsets, etc.). The video can be played using

the VCR-Like Control Panel. When the video is being played, the current keyframe

highlight boundary blinks. When the current frame advances beyond the selected

shot, the next shot becomes the current shot, and the highlight is moved to the

new current shot.

Figure 6 shows the keyframe browser with a larger keyframe presentation than

that in Figure 3. Our MDB-GSG implements three keyframe sizes that are gener-

ated dynamically from the MJPEG video to permit the user to trade off between

keyframe resolution and the number of shots concurrently visible. The two text

entry boxes at the bottom of Figure 6 permit the user to enter textual annotation

for the current shot. The user enters a short label for the shot in the smaller box on


Figure 6. Keyframe Visual Summary Representation with Annotation Boxes


the left and a more complete textual description in the box on the right. The label

entry is used to provide a textual synopsis and the description provides detail.

This hierarchical structure is an effective means of representing nested semantic discourse structures. However, this may be insufficient to represent discourse

models in which multiple concurrent semantic threads are pursued through the dis-

course. While each thread may be amenable to hierarchical analysis, the multiple

threads taken together are not. In our shot hierarchy architecture, each shot object

stores only the frame indices and its position within one hierarchy. This economy

of representation permits us to impose multiple hierarchies on the same discourse

video. These multiple hierarchies are synchronized through a single time portal

with a single current time focus.

4.2.3. Shot Hierarchy Editor The Hierarchy Editor Panel in the top right of Fig-

ure 3 is designed to allow the user to navigate, to view, and to organize video in

our hierarchical shot model shown in Figure 4. It comprises three sub-panels. The

one on the left labeled Shot Editing permits the construction and deletion of shots

from the video stream. This panel allows the user to create new shots by setting

the first and last frames in the shot and capturing its keyframe. When the "Set Start" or "Set End" buttons are pressed, the current frame (displayed in the Digital Video Monitor) becomes the start or end frame respectively of the new shot. The default keyframe is the first frame in the new shot, but the user can select any frame within the shot by pausing the video display at that frame and activating the "Grab Keyframe" button.

The middle sub-panel labeled Subshot Editing permits the user to organize shots

in a hierarchical fashion. This sub-panel is context sensitive. The buttons that

represent currently unavailable operations are blanked. As is obvious from the

button icons in this panel, it permits the deletion of a subshot sequence (the subshot

data remains within the first and last frames of the supershot, which is the current shot). The "promote subshot" button permits the current shot to be replaced by its subshots, and the "group to subshot" button permits the user to create a new shot as

the current shot and make all shots marked in the Hierarchical Keyframe Browser

window its subshots. The blank button at the top of the subshot editing panel is a "Create Subshots" button. In the example shown, the current shot already has its subshot list, so this button is disabled and blanked.
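As an illustration of the "group to subshot" operation described above, the sketch below creates a new parent shot spanning a contiguous run of marked shots at one level and adopts them as its subshots. It restates a compact version of the shot node from the earlier sketch; all names are hypothetical, and the handling of keyframes and sibling links is simplified.

#include <vector>

struct Shot {                         // compact restatement of the shot node sketch
    int firstFrame = 0, lastFrame = 0, keyframe = 0;
    Shot *prev = nullptr, *next = nullptr;
    std::vector<Shot*> subshots;
};

// Hypothetical "group to subshot": replace a contiguous run of marked shots at
// one level with a single new shot that spans them and holds them as subshots.
// 'level' is the shot list at the current level; [first, last] index into it.
Shot* groupToSubshot(std::vector<Shot*>& level, size_t first, size_t last) {
    Shot* parent = new Shot();
    parent->firstFrame = level[first]->firstFrame;   // span from the first marked shot...
    parent->lastFrame  = level[last]->lastFrame;     // ...to the last marked shot
    parent->keyframe   = level[first]->keyframe;     // default keyframe: that of the first subshot
    parent->subshots.assign(level.begin() + first, level.begin() + last + 1);

    // Splice the new shot into the level in place of the marked shots.
    level.erase(level.begin() + first, level.begin() + last + 1);
    level.insert(level.begin() + first, parent);

    // Re-link predecessor/successor pointers at this level.
    parent->prev = (first > 0) ? level[first - 1] : nullptr;
    parent->next = (first + 1 < level.size()) ? level[first + 1] : nullptr;
    if (parent->prev) parent->prev->next = parent;
    if (parent->next) parent->next->prev = parent;
    return parent;
}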

The rightmost subpanel labeled Subshot Navigation displays the ancestry of the

current shot (the shot labels entered by the user in the Hierarchical Keyframe Browser

are listed) and permits navigation up and down the hierarchy. The "Down" button in this panel indicates that the current shot has subshots, and the user can descend the hierarchy by clicking on the button. If the current shot has no subshots, the button becomes blank. Since the current shot in the figure (labeled as "L1/C (WOMBATS)") is a top level shot, the "Up" button (above the "Down" button) is

left blank. The user can also navigate the hierarchy from the Hierarchical Keyframe

Browser window. If a shot contains subshots, the user can click on the shot keyframe

with the right mouse button to descend one level into the hierarchy. The user can

ascend the hierarchy by clicking on any keyframe with the middle mouse button.

These button assignments are, however, arbitrary and can be replaced by any mouse

or key-chord combinations.

The hierarchical shot representation panel also permits the user to hide the cur-

rent shot or a series of marked shots in the visual summary display. This permits

the user to remove video segments that do not participate in the particular dis-

course structure of interest in a study. The hide feature can, of course, be switched

off to reveal all shots.

Rationale

In our work on discourse analysis, we employ such discourse structure models

as that of Grosz and colleagues [11] to parse the underlying text transcripts. The

method consists of a set of questions with which to guide analysis and uncover the

speaker's goals in producing each successive line of text. Such discourse models

are amenable to hierarchical representation. We compare this structure against the

discourse segments inferred from the objective motion patterns shown in the gesture

and gaze modalities [17, 16]. Our Hierarchical Keyframe Browser and the underlying

nested shot architecture directly support such discourse patterning. The Keyframe

Browser spatializes the time dimension so that the user may view the discourse units

as a hierarchy of keyframes. Each keyframe is a `snapshot' view of the gestural

morphology of the corresponding discourse element, and serves as a memory cue of


the gestural and gaze configuration for the coder. The Shot Hierarchy Editor is the

tool for the coder to add discourse segmentation and hierarchy information to the

data. The shot labeling facility is used by coders to annotate each discourse segment

with psycholinguistic observations. As will be described later, these annotations are

used to generate text transcript formats with which psycholinguistics researchers

are familiar.

4.3. Strip Chart Abstraction of Communicative Signal

The Animated Strip Chart Panel at the left middle of Figure 3 provides the user

access to the computed GSG entities. The user may select any signal to be displayed

in a pane in this panel. In the figure, the x-position traces describing the motion of both of the subject's hands are in the top pane, and her voiced utterances are displayed as fundamental frequency (F0) plots in the lower pane. Each pane may be displayed in Huge, Normal, and Compressed resolutions in the y-dimension. The x-dimension of the plots is time, expressed as video frame offset into the video. The time scale may be displayed in three resolutions: Small, Normal (as shown), and Large

(each successive scale being twice the previous).

The red line down the middle of the plots represents the current time focus. When

the video plays, this line stays in the middle of the panel, and the plots animate

so that the signal points under this current time focus always represent the signal

at that instant. The user is able to drag the plots right and left by pulling the

time scale at the bottom of the panel in the desired direction with the middle

mouse button depressed. All other MLR interaction components will animate to

track this change (for practical reasons, audio does not play when this happens).

The user may also bring any point of the graph immediately to the current time

focus line by clicking on it with the left mouse button. If the mouse is in any

portion of the Animated Strip Chart Panel other than the time scale, the middle and

right mouse button will toggle forward and reverse play respectively at the current

speed (set using the VCR-Like Control Panel). This feature was added because the

psycholinguists wanted rapid control of the video playing without having to move

to the VCR-Like Control Panel repeatedly.


The user may select any available plot to be displayed in any pane. All that the

system needs to represent a plot is for its name to be registered with the system,

and for an ASCII file containing a list of points to be provided. Although there is no

theoretical limit to the number of plots in this scrollable panel, this is limited by the

pragmatic concerns of screen real-estate (why animate all plots when a maximum

of 5 may be seen at any time) and processor speed (how many plots can the system

animate before it impacts performance). Our current limit is 10 animated plots at

any time.
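Because a plot needs only a registered name and an ASCII file of points, adding a new signal is inexpensive. The sketch below shows one plausible loader; the one-pair-per-line "frame value" format and the names Plot and loadPlot are assumptions for illustration, not the system's documented file format.

#include <fstream>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// One strip-chart plot: a registered name plus a list of (frame, value) points.
struct Plot {
    std::string name;
    std::vector<std::pair<int, double>> points;   // x = video frame offset, y = signal value
};

// Hypothetical loader: reads an ASCII point list with one "frame value" pair per line.
bool loadPlot(const std::string& name, const std::string& path, Plot& plot) {
    std::ifstream in(path);
    if (!in) return false;
    plot.name = name;
    plot.points.clear();
    int frame; double value;
    while (in >> frame >> value)
        plot.points.push_back({frame, value});
    return !plot.points.empty();
}

int main() {
    Plot f0;
    if (loadPlot("F0", "f0_envelope.txt", f0))     // file name is illustrative only
        std::cout << f0.name << ": " << f0.points.size() << " samples\n";
    return 0;
}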

Rationale

This representational component has proven invaluable in our research into GSG.

First, it provides the researcher with a `god's eye view' into the video time line

so that she can conceptualize about the GSG entities represented beyond the im-

mediacy of the current point being played. This has helped immensely to speed

up the psycholinguistic coding. Second, since it is trivial to change and add sig-

nals to the display, different extracted time-based traces may be displayed in this

way. This has been extremely useful in the development of algorithms to process

the video and segment the GSG signals. In our work in `deconstructing' the hand

motion traces into atomic `strokelet' motion units, for example, we simply gener-

ate a signal stream that has value spikes at the `strokelet' transition points and

is zero elsewhere. This allows us to evaluate the effectiveness of our segmentation

perceptually with access to the underlying video and audio through the interface.

This system malleability directly supports our reciprocal cross-disciplinary research

strategy. Psycholinguists provide perspective and analysis to guide the engineering

eorts in audio, video, and signal processing. The engineering team provides access

to GSG signals and entities, and the tools to access and visualize them. The MDB-

GSG system provides the locus of integration and interaction among researchers

from both disciplines.

4.4. Transcription Interface

The Transcription Interface shown in Figure 7 consists of a text display and editing

area (the Transcript Display Pane), a status display, and a set of control buttons and a


Figure 7. The Transcription Interface

pull down menu. These provide access to, and manipulation of, the temporal properties and content of the underlying syntax of the subject's speech. The speech is first transcribed to text manually to obtain a preliminary ASCII text file that may be organized and refined using the MDB-GSG system. When this text transcript is

registered with the system, it is indexed and displayed in the Transcript Display

Pane.

The Transcription Interface has three modes of operation: Time, Edit, and Playback.

In Time mode, the user associates text entities with timestamps; in Edit mode

the user may modify the underlying transcript text; and, in Playback mode the

MDB-GSG system animates the text to track the current time focus. The mode

of operation is selectable from the `Mode' pull down list. The default mode of

operation is Playback, and the system returns to Playback mode whenever Time or

Edit mode is terminated.

4.4.1. Transcript and Associated Representation In our system, the transcrip-

tion is maintained in two different files. The transcription text (and other embedded information) is maintained as a straight ASCII file. The transcription is divided


into `separable entities' in the form of alphabetic strings that are delineated by

separators (white space or special characters). These entities are represented as a

list in the Tagged Transcription File. Each item in the Tagged Entity List may be

associated with a time stamp that is synchronized with the rest of the database.

The time stamp represents the onset of voicing of the particular transcript entity.

These timestamps, therefore, describe a set of intervals between successive entities

in the list. If a timestamp is not assigned to a particular entity, that entity is said

to belong to the interval between the last previously tagged, and next subsequently

tagged entities in the list. Besides white spaces, our system provides for other delimiters, such as a period between two alphabetic character strings with no spaces. This permits the separate tagging of different syllables or phones within a single word. For example, the word "Transcription" may be stored in the ASCII Transcription File as "Tra.ns.crip.tion". This is represented as a sublist of four items, "[tra]-[ns]-[crip]-[tion]", in the Tagged Transcription File, allowing the separate tagging of each item. The user may also add comments to the ASCII Transcription File. Following

programming convention, comment lines in the le begin with a semi-colon imme-

diately following the preceding line break. Comment text is ignored in the Tagged

Entity List as are all text delimiters.
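To make the parse concrete, the following sketch tokenizes an ASCII transcript into separable entities, honoring in-word separators (a period between alphabetic strings) and skipping comment lines that begin with a semicolon. The structure and function names are hypothetical, and the delimiter handling is a simplification of the system's actual rules.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// One item in the tagged entity list: a text fragment and an optional time stamp
// (frame number); -1 means "not yet tagged".
struct TaggedEntity {
    std::string text;
    long timeStamp = -1;
};

// Hypothetical parse: split each non-comment line on white space, then split each
// word on '.' so that "Tra.ns.crip.tion" yields the sublist [Tra][ns][crip][tion].
std::vector<TaggedEntity> parseTranscript(std::istream& in) {
    std::vector<TaggedEntity> entities;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == ';') continue;      // comment line: ignored
        std::istringstream words(line);
        std::string word;
        while (words >> word) {
            std::string piece;
            for (char c : word) {
                if (c == '.') {                              // in-word separator
                    if (!piece.empty()) entities.push_back({piece, -1});
                    piece.clear();
                } else {
                    piece += c;
                }
            }
            if (!piece.empty()) entities.push_back({piece, -1});
        }
    }
    return entities;
}

int main() {
    std::istringstream demo("; a comment line\nTra.ns.crip.tion interface");
    for (const auto& e : parseTranscript(demo)) std::cout << "[" << e.text << "] ";
    std::cout << "\n";   // prints: [Tra] [ns] [crip] [tion] [interface]
}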

4.4.2. Time Mode Operation Transcription typically begins with an untagged

text transcript file that is generated manually by a transcriber viewing the exper-

iment video tape. This le is imported into the system and forms the basis for

the ASCII Transcription File. Upon importation, the MDB-GSG system parses the

ASCII transcription le to produce the list of separate entities that are initially

untagged. This forms the basis of the Tagged Entity List.

In Time mode operation, an `Accept' button appears next to the mode status

indicator. The user may mark a text entity to associate it with the current time

focus. This time tag is entered into the Tagged Entity List when the `Accept' button

is pressed or when the user hits the return key on the keyboard. The system checks

that the time tags entered are temporally constrained (i.e. items in the front of

the list have earlier time stamps), and flags ordering errors for user correction.

Since the only criterion we use for indexing the textual entities is a parse based


on word separators (spaces, tabs, and line breaks), the user may enforce a syllable

and phone level parsing by inserting in-word separators. Once a time stamp is

associated with a text entity, the system automatically highlights the next text

entity to be associated with the new timestamp. The user may of course highlight

some other text entity using either the mouse or the cursor control keys on the

keyboard.

Not all words need to be time-annotated. If the first words at the beginnings of

two consecutive phrases are annotated, the system associates all the words from

the rst annotated word up to the one before the second annotated word with the

duration between the time indices. This allows sentence-, phrase- and word-level

analyses using the MDB-GSG system.
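The interval lookup implied by this scheme can be sketched as follows: the entities associated with a given time focus are those between the last tagged entity at or before the focus and the next tagged entity after it. The names are illustrative, and monotonically increasing time stamps are assumed (the system checks this when tags are accepted).

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct TaggedEntity {                 // as in the parsing sketch: -1 means untagged
    std::string text;
    long timeStamp = -1;
};

// Return the index range [first, last) of entities that belong to the interval
// containing 'focus' (a frame number). Untagged entities inherit the interval of
// the last tagged entity before them.
std::pair<std::size_t, std::size_t> entitiesAt(const std::vector<TaggedEntity>& list,
                                               long focus) {
    std::size_t first = 0, last = list.size();
    for (std::size_t i = 0; i < list.size(); ++i) {
        if (list[i].timeStamp < 0) continue;             // untagged: skip
        if (list[i].timeStamp <= focus) first = i;       // last tag at or before the focus
        else { last = i; break; }                        // first tag after the focus
    }
    return {first, last};
}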

4.4.3. Edit Mode Operation In Edit mode operation, the user may modify the

underlying transcript text (add missed words, correct transcription errors, add au-

dible breath pauses etc.). When the user selects `Edit' from the pull down mode

menu, an edit session begins. A pair of buttons labeled `Done' and `Cancel' respec-

tively appear next to the mode status indicator. The user may also add or delete

words (or word fragments) in the tagged transcription list by changing the text in

the Transcript Display Pane directly. In Edit mode, in-word separators and white

space may also be added. Edit mode is exited when the user activates either the

`Done' or `Cancel' button. If `Cancel' is selected, the changes made in the editing

session are discarded. If the user selects `Done', the changes are parsed and incor-

porated into the Tagged Entity List. Upon exiting edit mode operation, Playback

mode is automatically activated.

4.4.4. Synchronized Transcript Playback In Playback mode, the Transcript Dis-

play Pane serves as a synchronized playback display. Once the text is time anno-

tated, the word[s] associated with the current time focus in the Transcript Display

Pane are highlighted in synchrony with all other MLR interaction components.

When the video plays, this highlighting animates and the Transcript Display Pane

scrolls to keep the current time focus text within the pane. When the current time

focus is changed in any other interface component, the appropriate scrolling and


highlighting takes place. The user is also able to select any word in the Transcript

Display Pane and make its time index the current time focus of the entire system.

The user may choose either to show or hide comment lines and within-word delim-

iters in the Transcript Display Pane.

Since the Transcript Display Pane highlights the appropriate word[s] and scrolls

during playback, the user may activate Time or Edit mode to modify and refine

the time annotation or transcript text at any point during playback.

4.4.5. Importing and Exporting ASCII Transcript Files Since the transcript is

an ASCII text file, it may be prepared and edited independently outside the

system, and then imported. Text may also be exported. During the importation

process, it is critical that we reassociate it with any existing Tagged Transcription

File (if this does not exist, the system simply creates a new one). The imported

ASCII Text File is parsed to produce a new tagged transcription list that is compared

against the old list. If the only changes added are comments and formatting (the

most common situation), the two lists would be identical, and the time tags are

simply transferred to the new list. If the underlying text entities are changed, the

MDB-GSG system highlights these changes to bring them to the attention of the

user for time tagging. Once the lists have been fully resolved, the new list is saved

as the Tagged Transcription File.
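A minimal sketch of this resolution step appears below: when the re-parsed entity list matches the old one text-for-text (the common case, where only comments and formatting changed), the existing time tags are carried over; mismatching entities are reported for re-tagging. The positional comparison shown is a simplification, and the names are illustrative; a real resolution would need to align the two lists when entities are inserted or deleted.

#include <cstddef>
#include <string>
#include <vector>

struct TaggedEntity { std::string text; long timeStamp = -1; };

// Transfer time tags from the old list to the freshly parsed one. Returns the
// indices of new entities whose text does not match the old list (these would be
// highlighted for the user to time-tag by hand).
std::vector<std::size_t> resolveLists(const std::vector<TaggedEntity>& oldList,
                                      std::vector<TaggedEntity>& newList) {
    std::vector<std::size_t> unresolved;
    for (std::size_t i = 0; i < newList.size(); ++i) {
        if (i < oldList.size() && oldList[i].text == newList[i].text)
            newList[i].timeStamp = oldList[i].timeStamp;  // identical entity: keep its tag
        else
            unresolved.push_back(i);                      // changed or added entity
    }
    return unresolved;
}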

Rationale

Since text-level manual transcription using high quality frame-accurate VCRs

is the way psycholinguistic GSG research is traditionally done, there was much

interaction with our psycholinguist colleagues in the design of this sub-system. The

`within-string' delimiters were incorporated because this was deemed essential to

the transcription process. Similarly, the addition of the commenting capability was

motivated by the research need to add scientific observations in the commentary.

The comments and formatting also permit the researcher to use indented formatting

to represent discourse-level structure. For this reason, the ASCII Transcription File

leaves the white space formatting of the transcript intact, and the system displays

this in the Transcript Display Pane.


The capability to import and resolve new ASCII Transcription Files is also driven

by the working style of our research's primary critical resource: the expert psycholinguistic researcher. The ability to export the ASCII Transcription File for editing on

a standard word processor and reimporting the edited result allows the researcher

to analyze the text without being tied to the workstation that runs the MDB-GSG

system. Since it is much easier to do time tagging and syllable and phone level pars-

ing (these depend on the content of the video and audio tracks) on the MDB-GSG

system, such o-line editing is typically done to add comments and indentation

structure. Since such operations leave the Tagged Transcription File unchanged,

minimal labor is required to resolve the new ASCII Transcription File with the exist-

ing MDB-GSG representation. The resolution process often serves as a debugging

operation to remove inadvertently modified text (e.g. comments added without the commenting flag).

4.5. Avatar Abstraction

The Avatar Representation at the bottom left of the screen displays an animated

avatar that moves in synchrony with the current time focus. In our current GSG

work, we image the subjects using three cameras (two calibrated stereo, one closeup

on the head) to extract the three-dimensional velocities and positions of the hands,

the three-dimensional position of the head, and the head gaze direction in terms

of the turn, nod, and roll angles. These values are fed into the avatar simulation

that plays in synchrony with the rest of our MDB-GSG interface. This provides

essentially a simulated analog of the subject in the Digital Video Monitor displaying

only the signal dimensions extracted.

Rationale

This avatar serves three purposes to support GSG research. First, it is not re-

stricted to a particular viewpoint. For example, the simulation may be rotated

to give the user a top-down view of how the hands move toward and away from

the body. A top-down view also aids the examination of the direction of gaze in

terms of the head `turn' alone. This provides a better understanding of how the

subject is structuring her gestural space in the `z-direction'. Second, it permits


researchers to see the communicative effects of each extracted signal. For example, we applied the system with a constant z-dimension to see the effect of depth on how one perceives a gesticulatory stream with the hands' motions constrained to be in a plane in front of the subject. We also saw the effectiveness of head and gaze direction in giving the avatar a sense of communicative realism by disabling that signal. We expect that this avatar interaction will also provide insight into the effects of slight dissynchronies in speech and gesture, or the removal of different

gestural components. We cannot do this with the original video, but this is trivial

in the avatar. Third, the avatar permits us to do a qualitative evaluation of our

hand and head tracking algorithms. Since our purpose is not absolute position but

conversational intent, the avatar facilitates a quick evaluation of the effectiveness

of our extraction algorithms in comparison with the original video.

5. Transcript Generation

The MDB-GSG system permits researchers to analyze and organize multi-modal

discourse by applying specific psycholinguistic models. Hence, the resulting seg-

mentation, structure, transcription and annotation are the intellectual product of

the analysis. This may be accessed and communicated in two ways.

First, the MDB-GSG system itself provides multi-media access to these GSG

entities. The database associated with a particular analysis may be loaded for

perusal and query. Different databases may be generated on the same discourse data to reflect different discourse organization theories and methodologies.

Second, the MDB-GSG system is able to produce a text transcript from the anal-

ysis. Figure 8 shows a fragment of such a transcript. The hierarchical organization

of the transcript derives automatically from the shot hierarchy along with its labels

and annotation. The speech transcription text associated with each item in this

hierarchy derives directly from the time-annotated speech transcript.
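The hierarchical transcript of Figure 8 can be thought of as the output of a recursive walk over the shot hierarchy, with each shot contributing its number, frame range, label, annotation, and the speech text falling within its interval. The sketch below illustrates the shape of such a walk; the names and the callback used to fetch the speech text are illustrative, not the system's actual interfaces.

#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Shot {                                   // compact restatement of the shot node
    int firstFrame = 0, lastFrame = 0;
    std::string label, description;
    std::vector<Shot*> subshots;
};

// Recursive transcript walk: emit "2.1 label (first - last) annotation" lines, with
// the speech text for each shot supplied by a lookup over the tagged entity list.
void emitTranscript(const std::vector<Shot*>& shots, const std::string& prefix,
                    const std::function<std::string(int, int)>& speechBetween,
                    std::ostream& out) {
    for (std::size_t i = 0; i < shots.size(); ++i) {
        const Shot* s = shots[i];
        std::string number = prefix.empty() ? std::to_string(i + 1)
                                            : prefix + "." + std::to_string(i + 1);
        out << number << " " << s->label << " (" << s->firstFrame << " - "
            << s->lastFrame << ") " << s->description << "\n";
        out << "   Transcript: \"" << speechBetween(s->firstFrame, s->lastFrame) << "\"\n";
        emitTranscript(s->subshots, number, speechBetween, out);   // recurse into subshots
    }
}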

Rationale:

This transcript is similar to that produced manually in Figure 2. Psycholinguis-

tic researchers familiar with manual transcription find this automatically gener-


1 clapper (0:7:12:23 - 0:7:14:17) (0 - 54) beginning of the film clip
   Transcript: ""
2 L1 (0:7:14:18 - 0:7:17:1) (55 - 128) Introduces top discourse layer
   Transcript: "okay what we need to do <whisper?> "
2.1 L1 (0:7:14:18 - 0:7:15:18) (55 - 85) "okay" - Macro level discourse marker; BH fists PTB/FTC contact in cc
   Transcript: "okay "
2.2 L1 (0:7:15:19 - 0:7:17:1) (86 - 128) RH G to interlocutor
   Transcript: "what we need to do <whisper?> "
3 whisper by LSNR (0:7:17:2 - 0:7:17:6) (129 - 133) motivates slight SPKR hesitation
   Transcript: ""
4 L2/C(TRAINS) (0:7:17:7 - 0:7:28:9) (134 - 466) RH, BH, LH
   Transcript: "is we're gonna ride in through town umhm uhm get off at the train station uhm we'll be getting off at the right so we're coming from this direction <smack> past this to the station "

Figure 8. Transcript produced by the MDB-GSG system


ated transcript invaluable. MDB-GSG, therefore, facilitates greater communication

among researchers.

6. Object-Oriented Implementation and Temporal Synchronization

Figure 9 shows the simplified object hierarchy of our system and also the method of

temporal synchronization. All MDB-GSG interface elements are separate objects

in our C++ implementation of the system. The system is implemented under SGI

IRIX, using X Windows, Motif and SGI Movie libraries.

To permit multiple time portals, our system can open multiple datasets comprising

shot hierarchies, video recordings, transcriptions, and charts simultaneously. Some

interface elements are shared by all datasets while some are associated with specic

datasets.

6.1. Architecture Overview

Figure 9 presents a simplified diagram of our MDB-GSG system architecture. The VCR-like Control Panel, Shot Hierarchy Editor and Virtual Video Player are shared "common" objects and are instantiated only once. The Virtual Video Player is

designed to give the system independence from the actual device that plays the

video data. This player may be either a hardware device or a software media

player. In the current implementation, the MDB-GSG system can handle two

physical devices (a RS-232 controlled VCR and a laser disk player) and the Digital

Video Monitor discussed in section 4.1.

Besides these shared common objects, all other objects are instantiated on demand and tied to the specific dataset for which they are created. The key interface

component for a particular dataset is the Keyframe Browser that is associated with

a specific shot hierarchy. At any point in time only one Keyframe Browser may be active (or `in focus'). The active Keyframe Browser is shown highlighted in Figure

9. Each Keyframe Browser contains a list of one or more Browser Windows, only

one of which is active at any time (shown highlighted in Figure 9). Each Browser

Window is a kind of a `view' into the shot hierarchy, maintaining a separate time

focus, and active level. Our time portal conceptual model is embodied by a specic


Figure 9. Simplified Architecture of the MDB-GSG System and the Method of Media Synchronization. (Shared objects: Shot Hierarchy Editor, VCR-like Control Panel, and Virtual Video Player driving a VCR, laser disk, MPEG, or MJPEG video storage or device. Per-dataset objects: Keyframe Browsers, each with Browser Windows and associated Strip Chart, Avatar, and Text Transcript interfaces. The video is played independently of the main X event loop; a polling function (timeout) in the event loop queries the player for the current position in the media, and a "synchronization dispatcher" propagates the "update position" message to all objects.)


Browser Window with its time focus and active level. As discussed earlier, this time

portal concept is critical in the analysis of GSG interaction. Each browser window is

associated with its own Strip Chart, Avatar, and Text Transcription interfaces. When

a browser window is active or in focus, the entire system uses its time focus as the

current time focus.

6.2. Media Synchronization

All objects, except the virtual video player object and the actual video device (im-

plemented in software or hardware), share the same X event loop, shown as a dashed

circle in Figure 9. The Digital Video Monitor is the most commonly used video

device, but our system allows other types of video devices to be connected (e.g.

a laser disk player or a computer-controlled VCR) and interfaced to the system

through an appropriate `virtual player'. The only requirements for each video device are that it be able to return the current (currently played) frame number,

and go to a particular frame number on demand. The video device plays the video

independently from the rest of the system. This is obviously the case when an ex-

ternal physical device is used. Software video players, such as the Digital Video Monitor, play the video data on a separate thread of execution.

The system's main X event loop contains a function (implemented as a timeout)

which periodically polls the video device for the current position in the media. The

function then sends an `update position' message to the `synchronization dispatcher'

object (see Figure 9). The `synchronization dispatcher' propagates this message

to all its sub-objects (keyframe browsers), and all currently active objects (i.e.

currently opened windows) update themselves in response. Each object, in turn,

propagates the message to all active sub-objects (e.g. if any keyframe browser

window has some active animated strip charts). So the update message originates

in the X event loop and is propagated in a tree-like fashion throughout all currently

active objects.
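A minimal sketch of this polling-and-propagation scheme, abstracted away from the actual Xt timeout and Motif widgets, is shown below: a virtual player interface that reports and seeks frame numbers, and a dispatcher that fans the new position out to the registered interface objects. All names are illustrative; in the real system the poll is installed as an X event-loop timeout, and each object forwards the message to its own active sub-objects.

#include <vector>

// Minimal virtual player contract: report the current frame, seek to a frame.
class VirtualVideoPlayer {
public:
    virtual ~VirtualVideoPlayer() = default;
    virtual long currentFrame() const = 0;       // polled by the event-loop timeout
    virtual void seekTo(long frame) = 0;         // e.g. when a slider is moved
};

// Any interface object that tracks the current time focus.
class TimeLinkedObject {
public:
    virtual ~TimeLinkedObject() = default;
    virtual void updatePosition(long frame) = 0; // redraw/scroll/highlight for this frame
};

// The "synchronization dispatcher": fans the new position out to every active
// (registered) object; inactive objects are deregistered and not updated.
class SynchronizationDispatcher {
public:
    void registerObject(TimeLinkedObject* o)   { objects_.push_back(o); }
    void deregisterObject(TimeLinkedObject* o) {
        for (auto it = objects_.begin(); it != objects_.end(); ++it)
            if (*it == o) { objects_.erase(it); break; }
    }
    void propagate(long frame) {
        for (TimeLinkedObject* o : objects_) o->updatePosition(frame);
    }
private:
    std::vector<TimeLinkedObject*> objects_;
};

// Body of the roughly 33 ms poll that the real system installs as an X timeout.
void pollOnce(const VirtualVideoPlayer& player, SynchronizationDispatcher& dispatcher) {
    dispatcher.propagate(player.currentFrame());
}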

Polling an independently executing virtual player makes the video recording the basis of temporal synchronization for all the system elements and is consistent with the basic philosophy of the entire system, where the video is central to the analysis


and all other data. In fact, all system data (e.g. text transcript, strip chart data,

hand position data for the avatar) are derived from the video. It is also easy to

implement and makes all object interfaces relatively simple (any object has to be

able - in general - to update itself with a new current frame number, and also to

send a new frame number to the virtual player - e.g. when a slider is moved by

the user on the VCR-like Control Panel). Furthermore, most video players maintain

accurate temporal synchronization, making them ideal `clocks' for the system.

By using the virtual media player as the central synchronization agent, dropping

frames during video playback does not pose a problem, since all the remaining

interface modules respond to such an event by updating themselves with the correct

new frame number. In fact, our synchronization strategy has the added benefit of

automatic system load balancing. On a slower machine, the video player is likely

to drop frames if many interface components are active and animated. This in turn

causes a coarser animation update, giving more resources to the video player.

In the current implementation the polling is performed in 33 ms intervals (slightly faster than 30 times per second), which is sufficient, considering that the typical video framerate is 30 fps. Some events in certain interface objects (e.g. pressing the "skip single frame" buttons in the VCR-like Control Panel) automatically force

dispatching of the `update position' message to all interface elements.

7. System Use

Our MDB-GSG system replaces a process of manual perceptual analysis using a

frame-accurate videotape player (a Sony EVO-9650); hand transcription; manual

production of gesture and gaze tracking charts and audio F0 and RMS charts; man-

ual tagging of F0 charts and synchronization with the text transcript; and manual

reproduction of the analysis results into a text transcript. For this reason, it is not

possible to evaluate the MDB-GSG system against a predecessor. Furthermore, the

kind of psycholinguistic analysis performed is extremely skill-intensive and tedious.

For this reason, the number of people actively engaged in micro-analysis of video

for GSG coding is small. We hope that a system like our MDB-GSG will bring

modern multimedia tools to bear on such research and increase the number of new


researchers in the field. We hope to have an effect similar to that of PC-controlled

telescopes bringing many new amateur astronomers to the discovery of celestial

phenomena.

The system is being used by two doctoral students in psycholinguistic research,

and both have given the MDB-GSG system high marks in their subjective evalu-

ation. The work is an order of magnitude faster (a day instead of a week and a half of intense labor)

with the MDB-GSG system. Furthermore, with direct access to the multimedia,

multimodal data, the degree of integrated analysis is enhanced.

8. Lessons Learned

We worked closely with domain scientists in all phases of our research and imple-

mentation of the system. We had been doing GSG research in collaboration with

our psycholinguistics partners, and had observed the degree of detail required in

the micro analysis of the video data. We also saw the importance of temporal anal-

ysis for such research. At each stage of our system development, we proposed the

structure of each interface component to, and discussed how it would function with

the psycholinguists. The computer scientists took the lead in this tool development

effort as they are more familiar with what can be done. Since we are proposing new ways for GSG access, the psycholinguists initially found it difficult to visualize how

the new interface would work. To overcome this, after initial discussion we devel-

oped rapid prototypes of the new tools and showed them to the psycholinguists for

comment. This is when we usually got the most effective insights on how to modify the tools to suit psycholinguistic GSG analysis. Most of the tools required two or three

iterations to refine both the interface and the back-end processing. The components that elicited the most discussion and refinements were, not surprisingly, the Text

Transcription Interface and the VCR-like Control Panel. These were the components

that matched most precisely how video micro analysis for GSG research was previ-

ously conducted. The other components like the video Keyframe Browser, the Avatar

Representation, and the Strip Chart Interface were completely new to the experience

of the psycholinguists, and we expect that new comments and renement requests

would be forthcoming as they become more familiar with these tools. With respect


to system architecture, clean objects defined in relation to the interface components facilitate such interactive refinement and system development.

As for data representation, our discussions led to the implementation of the time portal concept to visualize and compare multiple instances in time. The time portal conceptual model provides a good mechanism for visualizing and accessing single or multiple time instances in a spatialized representation of time. We also found that single hierarchies are insufficient to represent complex human discourse structures. Our implementation of multiple keyframe browsers associated with different shot hierarchies is a first attempt at visualizing the tangled hierarchies and overlapping semantic units at the various levels of psycholinguistic analysis. Our research on this representation continues.
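
To make the notion of tangled hierarchies concrete, the sketch below shows one way overlapping shot hierarchies over the same frame range might be represented. The class names (Segment, Hierarchy) and the cross_section helper are our illustrative inventions, not the system's actual data model.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Segment:
    """A node in one shot hierarchy: a labeled interval of video frames."""
    label: str
    start_frame: int
    end_frame: int
    children: List["Segment"] = field(default_factory=list)

    def covers(self, frame: int) -> bool:
        return self.start_frame <= frame <= self.end_frame

@dataclass
class Hierarchy:
    """One interpretation of the discourse (e.g. discourse segments versus
    gesture phrases); several hierarchies may overlap freely in time."""
    name: str
    roots: List[Segment]

    def segments_at(self, frame: int) -> List[Segment]:
        """All segments in this hierarchy that contain the current time focus."""
        found, stack = [], list(self.roots)
        while stack:
            seg = stack.pop()
            if seg.covers(frame):
                found.append(seg)
                stack.extend(seg.children)
        return found

def cross_section(hierarchies: List[Hierarchy], frame: int) -> Dict[str, List[str]]:
    """What each keyframe browser would highlight for one current time focus."""
    return {h.name: [s.label for s in h.segments_at(frame)] for h in hierarchies}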

Multiple-linked representations are essential in the design of complex tools for temporal analysis. Requiring all interface components to function under a single current time focus helps the user stay situated in the multimodal data. Animating the strip charts, avatar, keyframe highlighting, and text transcription display in synchrony with the video and audio helps maintain this sense of situatedness.

For media synchronization, we found the use of the media (video) player as the master clock for the entire system to be an effective and simple way to keep the various interface components in synchrony.
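
As one illustration of how an individual component can follow that master clock, a strip chart only needs to map the shared current frame onto an offset within its signal window. The sketch below is hypothetical; the StripChart class and its redraw placeholder are not part of the actual implementation.

class StripChart:
    """Minimal sketch of an animated strip chart slaved to the current time
    focus; redraw() stands in for the real widget repaint code."""

    def __init__(self, samples, fps=30.0, window_seconds=5.0):
        self.samples = samples                   # one signal value per video frame
        self.window = int(window_seconds * fps)  # number of frames visible at once

    def update_position(self, frame):
        # Keep the time cursor centered over the visible window.
        start = max(0, frame - self.window // 2)
        visible = self.samples[start:start + self.window]
        cursor_x = frame - start
        self.redraw(visible, cursor_x)

    def redraw(self, visible, cursor_x):
        pass  # a real implementation would repaint the chart here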

9. Conclusions

We have presented a multimedia database for perceptual analysis of video data using

a multiple, dynamically linked representations model. The system components

are linked through a time portal with a current time focus. The system provides

mechanisms to analyze overlapping hierarchical interpretations of the discourse,

and integrates visual gesture analysis, speech analysis, visual gaze analysis, and

text transcription into a coordinated whole. The various interaction components

facilitate accurate multi-point access to the data.

While our system is currently applied to gesture, speech, and gaze research, it

may be applied to any other field where careful analysis of temporal synchronies in


video is important. The VCR-Like Control Panel, Digital Video Monitor, Hierarchical Keyframe Browser, Hierarchy Editor Panel, Animated Strip Charts, Text Transcription Interface, and Synchronized Transcript Display may be loaded with any time-based signal files supplied as ASCII point lists, and may therefore be used for any synchronized signal-based analysis (e.g., video compression rate against time for a compression application, or force measurements in a videotaped ergonomic lifting experiment). The Avatar Representation may be customized for any other abstract representation.
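
The exact point-list layout is not spelled out here; the sketch below assumes a simple two-column ASCII format of time (in seconds) and value per line, which is one plausible reading of the description above, and shows how such a file could be aligned to the video time base.

def load_point_list(path, fps=30.0):
    """Read an ASCII point list of '<time_seconds> <value>' pairs and return
    (frame_number, value) tuples aligned to the video time base. The
    two-column layout is an assumption made for illustration only."""
    points = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            t, v = line.split()[:2]
            points.append((int(round(float(t) * fps)), float(v)))
    return points

# For example, force measurements from a videotaped lifting experiment
# (hypothetical file name) could then feed an animated strip chart:
#     forces = load_point_list("lift_forces.txt")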

The time portal concept permits the simultaneous analysis of multiple time foci. In our work on GSG, we have found situations where we need to compare recurrences across multiple time indices, and sometimes across video datasets. A single time portal can then represent the events converging at one nexus in time, which may be compared against the nexus of an alternate time portal.
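
One way to read this is that each time portal simply pairs a dataset with a frame index, and comparison is side-by-side inspection of whatever each linked representation reports at the two foci. The sketch below is only an illustration under that assumption; TimePortal and compare_portals are invented names.

from dataclasses import dataclass

@dataclass(frozen=True)
class TimePortal:
    """A time focus: a dataset identifier plus a frame index."""
    dataset: str
    frame: int

def compare_portals(portal_a, portal_b, describe):
    """Return whatever describe(dataset, frame) reports at each focus, so the
    analyst can inspect the two nexuses side by side."""
    return (describe(portal_a.dataset, portal_a.frame),
            describe(portal_b.dataset, portal_b.frame))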

Acknowledgments

This research has been funded by the U.S. National Science Foundation STIMULATE program, Grant No. IRI-9618887, "Gesture, Speech, and Gaze in Discourse Segmentation," and the National Science Foundation KDI program, Grant No. BCS-9980054, "Cross-Modal Analysis of Signal and Sense: Multimedia Corpora and Computational Tools for Gesture, Speech, and Gaze (GSG) Research."
