© Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
A Multimedia System for Temporally Situated
Perceptual Psycholinguistic Analysis
FRANCIS QUEK†, ROBERT BRYLL†, CEMIL KIRBAS†, HASAN ARSLAN†,
AND DAVID MCNEILL‡
†Vision Interfaces and Systems Laboratory, Wright State University, Dayton, OH
‡Department of Psychology, University of Chicago
Abstract. Perceptual analysis of video (analysis by unaided ear and eye) plays an important
role in such disciplines as psychology, psycholinguistics, linguistics, anthropology, and neurology.
In the specific domain of psycholinguistic analysis of gesture and speech, researchers micro-analyze
videos of subjects using a high-quality video cassette recorder that has a digital freeze capability
down to the specific frame. Such analyses are very labor intensive and slow. We present a
multimedia system for perceptual analysis of video data using a multiple, dynamically linked
representation model. The system components are linked through a time portal with a current
time focus. The system provides mechanisms to analyze overlapping hierarchical interpretations of
the discourse, and integrates visual gesture analysis, speech analysis, video gaze analysis, and text
transcription into a coordinated whole. The various interaction components facilitate accurate
multi-point access to the data. While this system is currently used to analyze gesture, speech, and
gaze in human discourse, it may be applied to any other field where careful
analysis of temporal synchronies in video is important.
Keywords: Multimedia Data Visualization; Temporal Analysis; User Interface; Multiple, Linked
Representation; Gesture Coding; Gesture, Speech and Gaze Analysis
D R A F T March 23, 2000, 12:23am D R A F T
1. Introduction
Perceptual analysis of video (analysis by unaided ear and eye) plays an important
role in such disciplines as psychology, psycholinguistics, linguistics, anthropology,
and neurology. In the specific domain of psycholinguistic analysis of gesture and
speech, researchers micro-analyze videos of subjects using a professional-quality
video cassette recorder that has a digital freeze capability down to the specific
frame. This is a painstaking and laborious task. In our own work on the integrated
analysis of gesture, speech, and gaze (GSG), the labor intensiveness of such analysis
is one of the key bottlenecks in our research.
We have developed a multimedia system for perceptual analysis of GSG in video
(henceforth MDB-GSG). This system, developed with attention to the interactive
needs of psycholinguistic perceptual analysis, has resulted in at least a ten-fold
increase in coding efficiency. Furthermore, MDB-GSG provides a level of access
to GSG entities computationally extracted from the video and audio streams that
facilitates new analysis and discoveries.
In this paper, we shall discuss the task of perceptual analysis of video, the inter-
active model of our MDB-GSG system based on time situated multiple-linked and
related representations, and the MDB-GSG system design and implementation.
2. Perceptual Analysis of Video
Psycholinguistic perceptual analysis of video typically proceeds in three iterations.
First, the speech is carefully transcribed by hand, and then typed into a text doc-
ument. The beginning of each linguistic unit (typically a phrase) is marked by the
time-stamp of the beginning of the unit in the video tape. Second, the researcher
revisits the video and annotates the text, marking co-occurrences of speech and
gestural phases (rest-holds, pre-stroke and post-stroke holds, gross hand shape, tra-
jectory of motion, gestural stroke characteristics etc.). The researcher also inserts
locations of audible breath pauses, speech disfluencies, and other salient comments.
Third, all these data are formatted onto a final transcript for psycholinguistic anal-
ysis. This is a painstakingly laborious process that takes a week to ten days to
analyze about a minute of discourse.
Figure 1. Hand position, analysis, F0, transcript, and RMS graphs for frames 1–481.
[Figure panels: Speech Transcript; Speech Amplitude (RMS amplitude vs. frame number); Audio Pitch (F0 values with numbered units); Hand Movement along Y-Direction (pixels); Hand Movement along X-Direction (pixels); left-hand (LH) and right-hand (RH) traces with rest annotations.]
Figure 2. Sample first page of a psycholinguistic analysis transcript.
[Figure content: time-stamped transcript lines (e.g. 00:16:46:28) with bracketed gesture-phrase markup, numbered F0 units above the text, and per-phrase comments describing gesture type (iconic, deictic, emblem-ish), handedness, hand shape, and motion trajectory.]
In our current work [17, 16], we relate such analysis to speech prosody, three-
dimensional traces of hand motion (plotted as x, y, and z displacements against
time), three-dimensional traces of head motion, and three-dimensional (turn, nod,
roll) traces of gaze orientation. The fundamental frequency plots of F0 envelopes
are extracted using Entropic's Xwaves™ [1], transferred to a page layout package
and printed together with the hand and head motion/direction traces. These F0
plots are then numbered and related to the text manually. The addition of these
steps on top of the traditional perceptual analysis makes such research even more
labor intensive. Finally, we use a graphical page layout program to combine all
the plots manually so that the time scales are aligned. This is essential to provide
visualization of the data to support discovery. It typically takes an entire day to
organize a set of data in this way. Figure 1 is an example of such plots. The
outcome of the psycholinguistic analysis process is a set of detailed transcripts.
We have reproduced the first page of the transcript for our example dataset in
Figure 2. The gestural phases are annotated and correlated with the text (by
underlining, bolding, brackets, symbol insertion, etc.), F0 units (numbers above
the text), and video tape time stamps (time signatures to the left). Comments
about gesture details are placed under each transcribed phrase. Each step of this
analysis requires significant research labor.
3. Time Situated Multiple-Linked and Related Representations
This paper presents a multimedia system that is designed to reduce the labor of
perceptual analysis, and to provide a level of analysis that heretofore has not been
possible. Our goal is not to do away with expert perceptual analysis. Rather,
we seek to provide higher level objects and representations and to mitigate the
labor-intensiveness of analysis that has access only to the time stamp of the video
signal. By providing direct access to computed entities (GSG plots, automatically
segmented gesture units, speech prosody groupings etc.), the underlying video and
audio, and other representations, the MDB-GSG system also enables a level of
analysis heretofore not available to researchers.
An interactive system may be viewed as a conduit of communication between the
human and the machine [9]. Modern psychology and linguistics theories of discourse
stress the importance of maintaining a state of `situatedness' for communication
to be successful [4, 3, 5]. Under this model, the user and the computer system
maintain a stream of communication that keeps the user situated within an abstract
interactive space. In the complex environment of multi-modal discourse analysis,
this becomes all the more important. In our system, the key element of this user-
system coordination is temporal situatedness. To motivate this situatedness, all
the interface components are linked by their time synchrony. Each representation
of the complex multi-modal space is focussed on the same moment in time. To
enforce this mental model, we call this a time-portal through which we view the
extended timeline of the subject's GSG behavior. Hence, this is an example of
Multiple, Linked Representations (MLR) of dynamic components [7, 8, 15, 14] in
which each representation reinforces the situatedness condition. Furthermore, each
representation in the system is active, thereby enabling multiple-point access to the
underlying data. We shall flesh out these concepts using the actual components of
our MDB-GSG system as examples.
Temporal cohesion is especially critical in GSG analysis. The motor and speech
channels are not subservient to one another, but spring from a common psycho-
logical (and neurophysiological) source. They proceed on independent pathways to
the final observable behavioral output. The temporal cohesion of the GSG chan-
nels is thus governed by the constants of neurophysiology and psycholinguistics
[10, 6, 12, 13]. The time portal metaphor provides a panoramic snapshot of the
functioning of all modalities at synchronized time instants.
4. The MDB-GSG System
Figure 3 shows our MDB-GSG system with all the representational components
open. The essence of our MLR strategy is that each of these components is syn-
chronized with the current time focus. This means that each component is animated
to track this time focus, and when it changes, all the system components change
to reflect this. Not all components, however, have to be active at the same time.
When they are inactive, the system `deregisters' them, and they are not updated.
When a system component is opened (e.g. the avatar representation at the bottom
left of Figure 3), it is registered with the system and linked with the current time
focus. These system components will be considered in five groups:
The VCR-like interface and player
The hierarchical shot-based representation and editor
The animated strip chart representation
The synchronized speech transcription interface
The GSG abstraction representation or avatar
Each of these representational components was chosen to advance our psycholin-
guistic analysis and support our computer vision and signal processing efforts. We
shall motivate each interface element as it is discussed.
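The register/deregister cycle just described — every open component tracks a single current time focus, and inactive components are deregistered so they are no longer updated — can be sketched as a simple observer pattern. This is our hypothetical sketch; the class and method names are illustrative, not the system's actual API:

```python
class TimeFocus:
    """Shared current time focus; notifies every registered component."""

    def __init__(self):
        self._frame = 0
        self._components = []

    def register(self, component):
        """An opened component is linked to the current time focus."""
        self._components.append(component)
        component.update(self._frame)   # sync the newly opened component

    def deregister(self, component):
        """A closed component is no longer updated."""
        self._components.remove(component)

    def set_frame(self, frame):
        """Any interaction that moves the focus updates all active views."""
        self._frame = frame
        for component in self._components:
            component.update(frame)


class StripChart:
    """Stand-in for one representational component (e.g. a strip chart)."""

    def __init__(self):
        self.frame = None

    def update(self, frame):
        self.frame = frame
```

Whether the focus is moved by the video player, the timeline slider, or the keyframe browser, the same `set_frame` call keeps every open representation on the same moment in time.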
Figure 3. System Screen of the MDB-GSG Analysis System.
[Labeled components: VCR-Like Control Panel, Text Transcription Interface, Synchronized Transcript Display, Avatar Representation, Hierarchical Keyframe Browser, Animated Strip Chart Panel, Hierarchy Editor Panel, Current Time Focus, Current Shot Keyframe, Timeline Representation, Digital Video Monitor.]
4.1. VCR-Like Interface and Player
The panels labeled VCR-like Control Panel and Digital Video Monitor at the bottom
right of Figure 3 provide a handle to the data using the familiar video metaphor.
The MDB-GSG system is designed so that different virtual players may be plugged
into the system. Currently, we have drivers to control a digital video (e.g. MJPEG,
MPEG, QuickTime) player and two physical devices (a computer-controlled Hi-8
video player, and a laser videodisc player). Drivers for other media such as DVD
players can easily be added.
The functions of this control panel and video display are similar to other computer-
based players except for several enhancements. The frame shown in the Digital Video
Monitor is always the frame at the current time focus. As the video plays, the time
focus changes accordingly. If the time focus is changed through some other interface
component, the video player will jump to the corresponding video frame.
Our choice of MJPEG for the video is driven by the need for random frame
access and reverse play. The media player in the Silicon Graphics media library is
able to play both the video and audio tracks together both forward and in reverse
at various speeds. This is important for coding the exact beginning of particular
utterances in the audio because of the psychological effect whereby a listener perceives
a word a fraction of a second after it is uttered. Humans hear coherent sounds as
words, and perceive the words as they emerge from the mental processing. Hence,
it is difficult to locate the exact synchronies of the beginning of gesture phrases
and speech phrases when the audio is played forward at regular speeds. When the
audio is played backward or at slow speeds, this effect is removed and the coder
can perceive the synchronies of interest.
The circular loop button on the top row of the control panel toggles the `shot
loop' mode. When this mode is set, the video player will keep looping through the
current shot at the current level until play is halted. A `shot' is a video segment of
significance to the GSG analysis. The loop mode permits the researcher to examine
a particular GSG entity (e.g. a stroke) to identify its idiosyncrasies at various play
speeds. The jump-to-start and jump-to-end (double arrows with a vertical terminal
bar) buttons at the right end of the top row allow the user to skip from shot to
shot in the shot hierarchy (described in the next section).
The Step Rev and Step Fwd buttons in the second row permit the researcher to
step through the video a frame at a time (similar to the frame jog operation in
a professional video player). This is important for micro gesture and gaze shift
coding. The triple and quadruple directional arrow buttons play the video and
audio backward and forward at different faster-than-realtime speeds (with audio).
The Slow Rev and Slow Fwd buttons in the third row permit play at various fractions
of the regular video rate with the accompanying audio. These rates (0.25, 0.5 and
0.75) are set via the pull-down menu at the top of the VCR-Like Control Panel. This,
again, is important for detailed analysis and the coding of exact synchronies in the
GSG signals.
The Timeline Representation at the bottom of the VCR-Like Control Panel
shows the offset of the current frame in the entire video. Consistent with the
rest of the interface, the slider is strongly coupled to the other representational
components. As the video plays in the VCR-like representation, the location of
the current frame is tracked on the timeline. Any change in the current frame by
interaction from either the visual summary representation or the hierarchical shot
representation is reflected by the timeline as well. The timeline also serves as a
slider control. As the slider is moved, all the other representational components
alter to reflect this change. The number above the slider represents the percent
offset of the current frame into the video.
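The slider arithmetic is straightforward; a minimal sketch with hypothetical helper names (the paper does not give the implementation):

```python
def percent_offset(frame, total_frames):
    """Percent offset of the current frame into the video
    (the number shown above the timeline slider)."""
    return 100.0 * frame / total_frames


def frame_at(percent, total_frames):
    """Inverse mapping: a slider position (percent) back to a frame index,
    used when the user drags the slider to move the current time focus."""
    return round(percent * total_frames / 100.0)
```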
Rationale
We have already discussed the importance of careful temporal analysis in the
study of GSG cross-modal communication. Expert gesture and speech coders are
well experienced in the operation of high-end VCRs in their analysis. The design
choices in our video player (the VCR-like control panel and the timeline sliders)
build on the users' familiarity with professional-quality VCR controls and extend
them for temporal analysis.
Figure 4. Shot Hierarchical Organization.
[Diagram: level-0 shots numbered 1–8; shot 2 contains subshots 2.1–2.4 at level 1; shot 2.3 contains subshots 2.3.1–2.3.3 at level 2; shot 6 contains subshots 6.1–6.3. The current shot at each level is marked.]
4.2. Hierarchical Shot-Based Representation and Editor
The panels labeled Hierarchical Keyframe Browser and Hierarchy Editor Panel facili-
tate the organization of the video stream into a nested hierarchical structure, and
the visualization of this hierarchy in summary format.
4.2.1. Shot Architecture Before we proceed, we need to define several terms to
facilitate our discussion. A video sequence can be thought of as a series of video
frames. These frames can be organized into shots. We define a shot as any sequential
series of video frames delineated by a first and a last frame. These shots may be
organized into a nested hierarchy as illustrated in Figure 4. In the figure, each shot
at the highest level (we call this level 0) is numbered starting from 1. Shot 2 is
shown to have four subshots which are numbered 2.1 to 2.4. These shots are said
to be in level 1 (as are shots 6.1 to 6.3). Shot 2.3 in turn contains three subshots,
2.3.1 to 2.3.3. Shot 2 spans its subshots. In other words, all the video frames in
shots 2.1 to 2.4 are also frames of shot 2. Similarly, all the frames of shots 2.3.1 to
2.3.3 are also in shot 2.3. Hence, the same frame may be contained in a shot in each
level of the hierarchy. One could, for example, select shot 2 and begin playing the
video from its first frame. The frames would play through shot 2, and eventually
enter shot 3. If one begins playing from the first frame of shot 2.2, the video would
play through shot 2.4 and proceed to shot 3.
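Because a shot spans its subshots, a single frame is simultaneously inside one shot at each level of the hierarchy. The chain of shots containing a frame can be computed recursively; in this sketch the labels follow Figure 4, but the frame ranges are illustrative values of our own, not from the paper:

```python
# A shot is a tuple (label, first_frame, last_frame, subshots).
def shots_containing(frame, shots, path=()):
    """Return the chain of shots, one per level, whose span contains `frame`."""
    for label, first, last, subshots in shots:
        if first <= frame <= last:
            deeper = shots_containing(frame, subshots, path + (label,))
            return deeper if deeper else path + (label,)
    return ()


# Hypothetical frame ranges for the Figure 4 hierarchy.
LEVEL0 = [
    ("1", 0, 99, []),
    ("2", 100, 499, [
        ("2.1", 100, 199, []),
        ("2.2", 200, 299, []),
        ("2.3", 300, 399, [
            ("2.3.1", 300, 329, []),
            ("2.3.2", 330, 359, []),
            ("2.3.3", 360, 399, []),
        ]),
        ("2.4", 400, 499, []),
    ]),
    ("3", 500, 599, []),
]
```

For a frame inside shot 2.3.3, the chain is ("2", "2.3", "2.3.3"): shot 2.3 at level 1 and shot 2 at level 0 are the current shots at those levels.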
Figure 5. Shot Hierarchy Data Model.
[Diagram: a linked structure of shot nodes, each carrying its first frame (F), keyframe (K), last frame (L), and a subshot pointer (P); nodes are linked to their predecessors and successors at the same level, and leaf pointers are null.]
Next, we define a series of concepts which define the temporal situatedness of the
system. The user may select any shot to be the current shot. The video system
will play the frames in that shot, and the frame being displayed at any instant is
known as the current frame. These are dynamic concepts since the current frame
and current shot change as the video is being played. Suppose we are at level 2 of
the hierarchy and select shot 2.3.3 as the current shot (shown as a shaded box in
Figure 4). Shot 2.3 at level 1 and shot 2 at level 0 would conceptually become the
current shot at those levels. This could lead to confusion in the user, and hence
we introduce the concept of the current level. At any moment in the interface, the
system is situated at one level, and only the shots at that level appear in the visual
keyframe summary representation. In our current example, the system would be
in level 2 although the current frame is also in shot 2.3 and shot 2 in levels 1 and
0 respectively.
Figure 5 shows the data model in our shot hierarchy. Each shot defines its first
and final frames within the video (represented by F and L respectively) and its
keyframe for use in the visual summary. Each shot unit is linked to its predecessor
and successor in the shot sequence. Each shot may also contain a list
of subshots. This allows the shots to be organized in a recursive hierarchy. The
shot data model is essentially a structure containing the F and L indices into a
video. Hence, we may maintain multiple shot hierarchies into the same video/audio
sequence. This is important for GSG analysis since natural communication typically
contains multiple overlapping semantic threads.
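The Figure 5 data model can be sketched as a small node class (the names are ours; the paper does not give the actual implementation). Because each node stores only frame indices into the video, several independent hierarchies can be built over the same footage without duplicating it:

```python
class Shot:
    """One node in the shot hierarchy: first/last frame indices (F, L),
    a keyframe, predecessor/successor links, and an optional subshot list."""

    def __init__(self, first, last, keyframe=None):
        self.first, self.last = first, last
        self.keyframe = first if keyframe is None else keyframe  # default: F
        self.prev = self.next = None   # siblings in the shot sequence
        self.subshots = []             # empty list = leaf (null in Figure 5)

    def add_subshot(self, shot):
        """Append a subshot, maintaining predecessor/successor links."""
        if self.subshots:
            shot.prev = self.subshots[-1]
            self.subshots[-1].next = shot
        self.subshots.append(shot)

    def spans(self, frame):
        """A shot spans its subshots: every subshot frame is also in it."""
        return self.first <= frame <= self.last
```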
4.2.2. Keyframe-Based Visual Summary The use of the keyframe representation
as a visual summary for video segments has proven surprisingly effective [18, 2]. For
GSG segmentation, a keyframe representation of each video segment permits the
researcher to see the general hand configurations and gaze directions represented in
the keyframe. The system takes the first frame of each shot/subshot as the default
keyframe, but the user may substitute this with any frame of her choice through
the Hierarchy Editor Panel.
Figure 3 shows the standard keyframe representation in the top left corner. Each
shot is summarized by its keyframe, and the keyframes are displayed in a scrollable
table. The keyframe representing the "current shot" is highlighted with a boundary.
A shot can be selected as the current shot from this interface by selecting its
keyframe with the computer pointing device. In accordance with the MLR strategy,
the current time focus is set to the beginning of the shot, and all other interface
components are updated (the position of the current shot in the shot hierarchy
appears in the shot hierarchy representation, the first frame of the shot appears as
the "current frame" in the display of the Digital Video Monitor, and the timeline
representation is updated to show the offsets, etc.). The video can be played using
the VCR-Like Control Panel. When the video is being played, the current keyframe
highlight boundary blinks. When the current frame advances beyond the selected
shot, the next shot becomes the current shot, and the highlight is moved to the
new current shot.
Figure 6 shows the keyframe browser with a larger keyframe presentation than
that in Figure 3. Our MDB-GSG implements three keyframe sizes that are gener-
ated dynamically from the MJPEG video to permit the user to trade off between
keyframe resolution and the number of shots concurrently visible. The two text
entry boxes at the bottom of Figure 6 permit the user to enter textual annotation
for the current shot. The user enters a short label for the shot in the smaller box on
Figure 6. Keyframe Visual Summary Representation with Annotation Boxes
the left and a more complete textual description in the box on the right. The label
entry is used to provide a textual synopsis and the description provides detail.
This hierarchical structure is an effective means of representing nested seman-
tic discourse structures. However, this may be insufficient to represent discourse
models in which multiple concurrent semantic threads are pursued through the dis-
course. While each thread may be amenable to hierarchical analysis, the multiple
threads taken together are not. In our shot hierarchy architecture, each shot object
stores only the frame indices and its position within one hierarchy. This economy
of representation permits us to impose multiple hierarchies on the same discourse
video. These multiple hierarchies are synchronized through a single time portal
with a single current time focus.
4.2.3. Shot Hierarchy Editor The Hierarchy Editor Panel in the top right of Fig-
ure 3 is designed to allow the user to navigate, to view, and to organize video in
our hierarchical shot model shown in Figure 4. It comprises three sub-panels. The
one on the left labeled Shot Editing permits the construction and deletion of shots
from the video stream. This panel allows the user to create new shots by setting
the first and last frames in the shot and capturing its keyframe. When the "Set
Start" or "Set End" buttons are pressed, the current frame (displayed in the Digital
Video Monitor) becomes the start or end frame respectively of the new shot. The
default keyframe is the first frame in the new shot, but the user can select any
frame within the shot by pausing the video display at that shot and activating the
"Grab Keyframe" button.
The middle sub-panel labeled Subshot Editing permits the user to organize shots
in a hierarchical fashion. This sub-panel is context sensitive. The buttons that
represent currently unavailable operations are blanked. As is obvious from the
button icons in this panel, it permits the deletion of a subshot sequence (the subshot
data remains within the first and last frames of the supershot, which is the current
shot). The "promote subshot" button permits the current shot to be replaced by
its subshots, and the "group to subshot" button permits the user to create a new shot as
the current shot and make all shots marked in the Hierarchical Keyframe Browser
window its subshots. The blank button at the top of the subshot editing panel is a
"Create Subshots" button. In the example shown, the current shot already has its
subshot list, so this button is disabled and blanked.
The rightmost subpanel labeled Subshot Navigation displays the ancestry of the
current shot (the shot labels entered by the user in the Hierarchical Keyframe Browser
are listed) and permits navigation up and down the hierarchy. The "Down" button
in this panel indicates that the current shot has subshots, and the user can descend
the hierarchy by clicking on the button. If the current shot has no subshots, the
button becomes blank. Since the current shot in the figure (labeled "L1/C
(WOMBATS)") is a top-level shot, the "Up" button (above the "Down" button) is
left blank. The user can also navigate the hierarchy from the Hierarchical Keyframe
Browser window. If a shot contains subshots, the user can click on the shot keyframe
with the right mouse button to descend one level into the hierarchy. The user can
ascend the hierarchy by clicking on any keyframe with the middle mouse button.
These button assignments are, however, arbitrary and can be replaced by any mouse
or key-chord combinations.
The hierarchical shot representation panel also permits the user to hide the cur-
rent shot or a series of marked shots in the visual summary display. This permits
the user to remove video segments that do not participate in the particular dis-
course structure of interest in a study. The hide feature can, of course, be switched
off to reveal all shots.
Rationale
In our work on discourse analysis, we employ such discourse structure models
as that of Grosz and colleagues [11] to parse the underlying text transcripts. The
method consists of a set of questions with which to guide analysis and uncover the
speaker's goals in producing each successive line of text. Such discourse models
are amenable to hierarchical representation. We compare this structure against the
discourse segments inferred from the objective motion patterns shown in the gesture
and gaze modalities [17, 16]. Our Hierarchical Keyframe Browser and the underlying
nested shot architecture directly support such discourse patterning. The Keyframe
Browser spatializes the time dimension so that the user may view the discourse units
as a hierarchy of keyframes. Each keyframe is a `snapshot' view of the gestural
morphology of the corresponding discourse element, and serves as a memory cue of
the gestural and gaze configuration for the coder. The Shot Hierarchy Editor is the
tool for the coder to add discourse segmentation and hierarchy information to the
data. The shot labeling facility is used by coders to annotate each discourse segment
with psycholinguistic observations. As will be described later, these annotations are
used to generate text transcript formats with which psycholinguistics researchers
are familiar.
4.3. Strip Chart Abstraction of Communicative Signal
The Animated Strip Chart Panel at the left middle of Figure 3 provides the user
access to the computed GSG entities. The user may select any signal to be displayed
in a pane in this panel. In the figure, the x-position trace describing the motion of
both of the subject's hands is in the top pane, and her voiced utterances are displayed as
the fundamental frequency F0 plots in the lower pane. Each pane may be displayed
in Huge, Normal and Compressed resolutions in the y-dimension. The x-dimension
of the plots is time, expressed in terms of video frame offset into the video. The
time scale may be displayed in three resolutions: Small, Normal (as shown), and Large
(each successive scale being twice the previous).
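A plausible sketch of the plot-to-screen mapping, keeping the current time focus on the centre line of the panel. The frames-per-pixel values for the three scales are our assumption, chosen only to satisfy the "each successive scale is twice the previous" relation; the paper does not specify them:

```python
# Assumed frames-per-pixel for each time-scale resolution (hypothetical).
SCALES = {"Small": 4.0, "Normal": 2.0, "Large": 1.0}


def frame_to_x(frame, focus_frame, panel_width, scale="Normal"):
    """Map a frame offset to an x pixel so that the current time focus
    always sits on the centre line of the strip chart panel."""
    centre = panel_width // 2
    return centre + round((frame - focus_frame) / SCALES[scale])
```

As the video plays, `focus_frame` advances and the plots animate leftward under the fixed centre line; dragging the time scale simply moves `focus_frame` in the opposite direction.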
The red line down the middle of the plots represents the current time focus. When
the video plays, this line stays in the middle of the panel, and the plots animate
so that the signal points under this current time focus always represent the signal
at that instance. The user is able to drag the plots right and left by pulling the
time scale at the bottom of the panel in the desired direction with the middle
mouse button depressed. All other MLR interaction components will animate to
track this change (for practical reasons, audio does not play when this happens).
The user may also bring any point of the graph immediately to the current time
focus line by clicking on it with the left mouse button. If the mouse is in any
portion of the Animated Strip Chart Panel other than the time scale, the middle and
right mouse button will toggle forward and reverse play respectively at the current
speed (set using the VCR-Like Control Panel). This feature was added because the
psycholinguists wanted rapid control of the video playing without having to move
to the VCR-Like Control Panel repeatedly.
The user may select any available plot to be displayed in any pane. All that the
system needs to represent a plot is for its name to be registered with the system,
and for an ASCII file containing a list of points to be provided. Although there is no
theoretical limit to the number of plots in this scrollable panel, this is limited by the
pragmatic concerns of screen real-estate (why animate all plots when a maximum
of 5 may be seen at any time) and processor speed (how many plots can the system
animate before it impacts performance). Our current limit is 10 animated plots at
any time.
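Registering a plot therefore needs little more than a name and a point-list file. A minimal sketch; the file format details (whitespace-separated fields with the value last, blank lines ignored) are our assumption, as the paper only says the file contains a list of points:

```python
def register_plot(name, path, registry):
    """Register a named plot with the system from an ASCII file of points,
    one point per line; the last field on each line is taken as the value."""
    points = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:                        # skip blank lines
                points.append(float(fields[-1]))
    registry[name] = points
    return points
```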
Rationale
This representational component has proven invaluable in our research into GSG.
First, it provides the researcher with a `god's eye view' into the video time line
so that she can conceptualize about the GSG entities represented beyond the im-
mediacy of the current point being played. This has helped immensely to speed
up the psycholinguistic coding. Second, since it is trivial to change and add sig-
nals to the display, different extracted time-based traces may be displayed in this
way. This has been extremely useful in the development of algorithms to process
the video and segment the GSG signals. In our work in `deconstructing' the hand
motion traces into atomic `strokelet' motion units, for example, we simply gener-
ate a signal stream that has value spikes at the `strokelet' transition points and
is zero elsewhere. This allows us to evaluate the effectiveness of our segmentation
perceptually with access to the underlying video and audio through the interface.
This system malleability directly supports our reciprocal cross-disciplinary research
strategy. Psycholinguists provide perspective and analysis to guide the engineering
eorts in audio, video, and signal processing. The engineering team provides access
to GSG signals and entities, and the tools to access and visualize them. The MDB-
GSG system provides the locus of integration and interaction among researchers
from both disciplines.
4.4. Transcription Interface
The Transcription Interface shown in Figure 7 consists of a text display and editing
area (the Transcript Display Pane), a status display, and a set of control buttons and a
pull down menu.
Figure 7. The Transcription Interface
These provide access to, and manipulation of, the temporal properties
and content of the underlying syntax of the subject's speech. The speech is first
transcribed to text manually to obtain a preliminary ASCII text file that may be
organized and refined using the MDB-GSG system. When this text transcript is
registered with the system, it is indexed and displayed in the Transcript Display
Pane.
The Transcription Interface has three modes of operation: Time, Edit, and Playback.
In Time mode, the user associates text entities with timestamps; in Edit mode
the user may modify the underlying transcript text; and, in Playback mode the
MDB-GSG system animates the text to track the current time focus. The mode
of operation is selectable from the `Mode' pull down list. The default mode of
operation is Playback, and the system returns to Playback mode whenever Time or
Edit mode is terminated.
4.4.1. Transcript and Associated Representation In our system, the transcription
is maintained in two different files. The transcription text (and other embedded
information) is maintained as a straight ASCII file. The transcription is divided
into `separable entities' in the form of alphabetic strings that are delineated by
separators (white space or special characters). These entities are represented as a
list in the Tagged Transcription File. Each item in the Tagged Entity List may be
associated with a time stamp that is synchronized with the rest of the database.
The time stamp represents the onset of voicing of the particular transcript entity.
These timestamps, therefore, describe a set of intervals between successive entities
in the list. If a timestamp is not assigned to a particular entity, that entity is said
to belong to the interval between the last previously tagged, and next subsequently
tagged entities in the list. Beside white spaces, our system provides for other delim-
iters such as a period between two alphabetic character strings with no spaces. This
permits the separate tagging of dierent syllables or phones within a single word.
For example, the word \Transcription" may be stored in the ASCII Transcription
File as \Tra.ns.crip.tion". This is represented as a sublist of four items: \[tra]-[ns]-
[crip]-[tion]" in the Tagged Transcription File, allowing the separate tagging of each
item. The user may also add comments to the ASCII Transciption File. Following
programming convention, comment lines in the le begin with a semi-colon imme-
diately following the preceding line break. Comment text is ignored in the Tagged
Entity List as are all text delimiters.
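The entity-splitting rules above can be sketched as a small tokenizer. The function name and the lower-casing are our assumptions; only the delimiter behavior (white space, in-word periods, semicolon comment lines) comes from the text:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split one transcript line into separable entities. White space separates
// words; a period inside a word separates syllable/phone-level entities
// ("Tra.ns.crip.tion" -> [tra][ns][crip][tion]); a line beginning with a
// semicolon is a comment and yields no entities.
std::vector<std::string> parseEntities(const std::string& line) {
    std::vector<std::string> entities;
    if (!line.empty() && line[0] == ';') return entities;  // comment line
    std::string current;
    for (char c : line) {
        if (std::isspace(static_cast<unsigned char>(c)) || c == '.') {
            if (!current.empty()) entities.push_back(current);
            current.clear();
        } else {
            current += static_cast<char>(
                std::tolower(static_cast<unsigned char>(c)));
        }
    }
    if (!current.empty()) entities.push_back(current);
    return entities;
}
```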
4.4.2. Time Mode Operation Transcription typically begins with an untagged
text transcript file that is generated manually by a transcriber viewing the experiment
video tape. This file is imported into the system and forms the basis for
the ASCII Transcription File. Upon importation, the MDB-GSG system parses the
ASCII Transcription File to produce the list of separate entities that are initially
untagged. This forms the basis of the Tagged Entity List.
In Time mode operation, an `Accept' button appears next to the mode status
indicator. The user may mark a text entity to associate it with the current time
focus. This time tag is entered into the Tagged Entity List when the `Accept' button
is pressed or when the user hits the return key on the keyboard. The system checks
that the time tags entered are temporally ordered (i.e., items earlier in the
list have earlier time stamps), and flags ordering errors for user correction.
Since the only criterion we use for indexing the textual entities is a parse based
on word separators (spaces, tabs, and line breaks), the user may enforce a syllable
and phone level parsing by inserting in-word separators. Once a time stamp is
associated with a text entity, the system automatically highlights the next text
entity to be associated with the new timestamp. The user may of course highlight
some other text entity using either the mouse or the cursor control keys on the
keyboard.
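The ordering check can be sketched as follows; the data layout and the function name are our assumptions, not the MDB-GSG internals:

```cpp
#include <cstddef>
#include <vector>

// A transcript entity with an optional onset time stamp (frame number);
// -1 denotes "untagged".
struct TaggedEntity { long onsetFrame = -1; };

// Attempt to tag entity `index` with `frame`. The tag is rejected if it
// would violate temporal ordering against tagged neighbours (earlier
// entities must have earlier stamps), mirroring the check in the text.
bool tagEntity(std::vector<TaggedEntity>& list, std::size_t index, long frame) {
    for (std::size_t i = 0; i < index; ++i)
        if (list[i].onsetFrame >= 0 && list[i].onsetFrame > frame) return false;
    for (std::size_t i = index + 1; i < list.size(); ++i)
        if (list[i].onsetFrame >= 0 && list[i].onsetFrame < frame) return false;
    list[index].onsetFrame = frame;
    return true;
}
```

On rejection, the real system would flag the error for the user rather than silently discard the tag.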
Not all words need to be time-annotated. If the first words at the beginnings of
two consecutive phrases are annotated, the system associates all the words from
the first annotated word up to the one before the second annotated word with the
duration between the time indices. This allows sentence-, phrase- and word-level
analyses using the MDB-GSG system.
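The interval lookup for untagged words can be sketched as below; the representation (a vector of onset frames, -1 for untagged) is our assumption:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Onset stamps for a list of entities; -1 means untagged. Returns the
// [start, end) frame interval an entity belongs to: from the last tagged
// entity at or before `index` to the next tagged entity after it. A -1
// on either side means that boundary is unknown (open interval).
std::pair<long, long> entityInterval(const std::vector<long>& onsets,
                                     std::size_t index) {
    long start = -1, end = -1;
    for (std::size_t i = 0; i <= index; ++i)
        if (onsets[i] >= 0) start = onsets[i];
    for (std::size_t i = index + 1; i < onsets.size(); ++i)
        if (onsets[i] >= 0) { end = onsets[i]; break; }
    return {start, end};
}
```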
4.4.3. Edit Mode Operation In Edit mode operation, the user may modify the
underlying transcript text (add missed words, correct transcription errors, add
audible breath pauses, etc.). When the user selects `Edit' from the pull down mode
menu, an edit session begins. A pair of buttons, labeled `Done' and `Cancel'
respectively, appears next to the mode status indicator. The user may add or delete
words (or word fragments) in the tagged transcription list by changing the text in
the Transcript Display Pane directly. In Edit mode, in-word separators and white
space may also be added. Edit mode is exited when the user activates either the
`Done' or `Cancel' button. If `Cancel' is selected, the changes made in the editing
session are discarded. If `Done' is selected, the changes are parsed and incorporated
into the Tagged Entity List. Upon exiting Edit mode operation, Playback
mode is automatically activated.
4.4.4. Synchronized Transcript Playback In Playback mode, the Transcript Display
Pane serves as a synchronized playback display. Once the text is time-annotated,
the word[s] associated with the current time focus in the Transcript Display
Pane are highlighted in synchrony with all other MLR interaction components.
When the video plays, this highlighting animates and the Transcript Display Pane
scrolls to keep the current time focus text within the pane. When the current time
focus is changed in any other interface component, the appropriate scrolling and
highlighting takes place. The user is also able to select any word in the Transcript
Display Pane and make its time index the current time focus of the entire system.
The user may choose either to show or hide comment lines and within-word
delimiters in the Transcript Display Pane.
Since the Transcript Display Pane highlights the appropriate word[s] and scrolls
during playback, the user may activate Time or Edit mode to modify and refine
the time annotation or transcript text at any point during playback.
4.4.5. Importing and Exporting ASCII Transcript Files Since the transcript is
an ASCII text file, it may be prepared and edited independently outside the
system, and then imported. Text may also be exported. During the importation
process, it is critical that we reassociate the imported text with any existing Tagged
Transcription File (if none exists, the system simply creates a new one). The imported
ASCII Transcription File is parsed to produce a new tagged transcription list that is compared
against the old list. If the only changes are added comments and formatting (the
most common situation), the two lists will be identical, and the time tags are
simply transferred to the new list. If the underlying text entities have changed, the
MDB-GSG system highlights these changes to bring them to the attention of the
user for time tagging. Once the lists have been fully resolved, the new list is saved
as the Tagged Transcription File.
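The common-case reconciliation (identical entity texts, tags carried over) can be sketched as follows; the structure and function names are our assumptions:

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Entity { std::string text; long onsetFrame = -1; };

// Reconcile a re-imported entity list with the previously tagged one.
// If the underlying texts are identical (the common case: only comments
// or formatting changed), the old time tags are transferred and true is
// returned; otherwise false, signalling that the changes must be
// highlighted for the user to re-tag.
bool transferTags(const std::vector<Entity>& oldList,
                  std::vector<Entity>& newList) {
    if (oldList.size() != newList.size()) return false;
    for (std::size_t i = 0; i < oldList.size(); ++i)
        if (oldList[i].text != newList[i].text) return false;
    for (std::size_t i = 0; i < oldList.size(); ++i)
        newList[i].onsetFrame = oldList[i].onsetFrame;
    return true;
}
```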
Rationale
Since text-level manual transcription using high quality frame-accurate VCRs
is the way psycholinguistic GSG research is traditionally done, there was much
interaction with our psycholinguist colleagues in the design of this sub-system. The
`within-string' delimiters were incorporated because they were deemed essential to
the transcription process. Similarly, the addition of the commenting capability was
motivated by the research need to add scientific observations in the commentary.
The comments and formatting also permit the researcher to use indented formatting
to represent discourse-level structure. For this reason, the ASCII Transcription File
leaves the white space formatting of the transcript intact, and the system displays
this in the Transcript Display Pane.
The capability to import and resolve new ASCII Transcription Files is also driven
by the working style of our research's primary critical resource: the expert
psycholinguistic researcher. The ability to export the ASCII Transcription File for editing on
a standard word processor and to reimport the edited result allows the researcher
to analyze the text without being tied to the workstation that runs the MDB-GSG
system. Since it is much easier to do time tagging and syllable- and phone-level
parsing (these depend on the content of the video and audio tracks) on the MDB-GSG
system, such off-line editing is typically done to add comments and indentation
structure. Since such operations leave the Tagged Transcription File unchanged,
minimal labor is required to resolve the new ASCII Transcription File with the
existing MDB-GSG representation. The resolution process often serves as a debugging
operation to remove inadvertently modified text (e.g. comments added without the
commenting flag).
4.5. Avatar Abstraction
The Avatar Representation at the bottom left of the screen displays an animated
avatar that moves in synchrony with the current time focus. In our current GSG
work, we image the subjects using three cameras (two calibrated stereo, one closeup
on the head) to extract the three-dimensional velocities and positions of the hands,
the three-dimensional position of the head, and the head gaze direction in terms
of the turn, nod, and roll angles. These values are fed into the avatar simulation
that plays in synchrony with the rest of our MDB-GSG interface. This provides,
in essence, a simulated analog of the subject in the Digital Video Monitor,
displaying only the extracted signal dimensions.
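The per-frame signal record driving the avatar can be sketched as a plain structure holding exactly the dimensions the text says are extracted from the three-camera setup; the field names are our assumptions:

```cpp
// A 3-D quantity extracted from the calibrated stereo pair.
struct Vec3 { double x, y, z; };

// One frame of extracted signals, synchronized with the current time focus:
// 3-D hand positions and velocities, 3-D head position, and head gaze
// direction as turn, nod, and roll angles.
struct AvatarFrame {
    long frame;                       // frame number in the video time line
    Vec3 leftHand, rightHand;         // 3-D hand positions
    Vec3 leftHandVel, rightHandVel;   // 3-D hand velocities
    Vec3 head;                        // 3-D head position
    double turn, nod, roll;           // head gaze direction angles
};
```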
Rationale
This avatar serves three purposes to support GSG research. First, it is not re-
stricted to a particular viewpoint. For example, the simulation may be rotated
to give the user a top-down view of how the hands move toward and away from
the body. A top-down view also aids the examination of the direction of gaze in
terms of the head `turn' alone. This provides a better understanding of how the
subject is structuring her gestural space in the `z-direction'. Second, it permits
researchers to see the communicative effects of each extracted signal. For example,
we applied the system with a constant z-dimension to see the effect of depth on
how one perceives a gesticulatory stream with the hand motions constrained to
lie in a plane in front of the subject. We also saw the effectiveness of head and
gaze direction in giving the avatar a sense of communicative realism by disabling
that signal. We expect that this avatar interaction will also provide insight into
the effects of slight dissynchronies in speech and gesture, or of the removal of
different gestural components. We cannot do this with the original video, but it is trivial
in the avatar. Third, the avatar permits us to do a qualitative evaluation of our
hand and head tracking algorithms. Since we are concerned not with absolute
position but with conversational intent, the avatar facilitates a quick evaluation of
the effectiveness of our extraction algorithms in comparison with the original video.
5. Transcript Generation
The MDB-GSG system permits researchers to analyze and organize multi-modal
discourse by applying specific psycholinguistic models. Hence, the resulting
segmentation, structure, transcription, and annotation are the intellectual product of
the analysis. This product may be accessed and communicated in two ways.
First, the MDB-GSG system itself provides multi-media access to these GSG
entities. The database associated with a particular analysis may be loaded for
perusal and query. Different databases may be generated on the same discourse
data to reflect different discourse organization theories and methodologies.
Second, the MDB-GSG system is able to produce a text transcript from the
analysis. Figure 8 shows a fragment of such a transcript. The hierarchical organization
of the transcript derives automatically from the shot hierarchy along with its labels
and annotation. The speech transcription text associated with each item in this
hierarchy derives directly from the time-annotated speech transcript.
Rationale:
This transcript is similar to that produced manually in Figure 2. Psycholinguistic
researchers familiar with manual transcription find this automatically generated
transcript invaluable. MDB-GSG, therefore, facilitates greater communication
among researchers.
1 clapper (0:7:12:23 - 0:7:14:17) (0 - 54) beginning of the film clip
Transcript: ""
2 L1 (0:7:14:18 - 0:7:17:1) (55 - 128) Introduces top discourse layer
Transcript: "okay what we need to do <whisper?> "
2.1 L1 (0:7:14:18 - 0:7:15:18) (55 - 85) "okay"-Macro level discourse marker
BH fists PTB/FTC contact in cc
Transcript: "okay "
2.2 L1 (0:7:15:19 - 0:7:17:1) (86 - 128) RH G to interlocutor
Transcript: "what we need to do <whisper?> "
3 whisper by LSNR (0:7:17:2 - 0:7:17:6) (129 - 133) motivates slight SPKR hesitation
Transcript: ""
4 L2/C(TRAINS) (0:7:17:7 - 0:7:28:9) (134 - 466) RH, BH, LH
Transcript: "is we're gonna ride in through town umhm uhm get off at the train station uhm we'll be getting off at the right so we're coming from this direction <smack> past this to the station "
Figure 8. Transcript produced by the MDB-GSG system
6. Object-Oriented Implementation and Temporal Synchronization
Figure 9 shows the simplified object hierarchy of our system and also the method of
temporal synchronization. All MDB-GSG interface elements are separate objects
in our C++ implementation of the system. The system is implemented under SGI
IRIX, using X Windows, Motif and SGI Movie libraries.
To permit multiple time portals, our system can open multiple datasets comprising
shot hierarchies, video recordings, transcriptions, and charts simultaneously. Some
interface elements are shared by all datasets while some are associated with specic
datasets.
6.1. Architecture Overview
Figure 9 presents a simplified diagram of our MDB-GSG system architecture. The
VCR-like Control Panel, Shot Hierarchy Editor and Virtual Video Player are shared
\common" objects and are instantiated only once. The Virtual Video Player is
designed to give the system independence from the actual device that plays the
video data. This player may be either a hardware device or a software media
player. In the current implementation, the MDB-GSG system can handle two
physical devices (an RS-232-controlled VCR and a laser disk player) and the Digital
Video Monitor discussed in section 4.1.
Besides these shared common objects, all other objects are instantiated on demand
and tied to the specific dataset for which they are created. The key interface
component for a particular dataset is the Keyframe Browser that is associated with
a specific shot hierarchy. At any point in time only one Keyframe Browser may be
active (or `in focus'). The active Keyframe Browser is shown highlighted in Figure
9. Each Keyframe Browser contains a list of one or more Browser Windows, only
one of which is active at any time (shown highlighted in Figure 9). Each Browser
Window is a kind of `view' into the shot hierarchy, maintaining a separate time
focus and active level. Our time portal conceptual model is embodied by a specific
Browser Window with its time focus and active level. As discussed earlier, this time
portal concept is critical in the analysis of GSG interaction. Each browser window is
associated with its own Strip Chart, Avatar, and Text Transcription interfaces. When
a browser window is active or in focus, the entire system uses its time focus as the
current time focus.
Figure 9. Simplified Architecture of the MDB-GSG System and the Method of Media Synchronization
6.2. Media Synchronization
All objects, except the virtual video player object and the actual video device
(implemented in software or hardware), share the same X event loop, shown as a dashed
circle in Figure 9. The Digital Video Monitor is the most commonly used video
device, but our system allows other types of video devices to be connected (e.g.
a laser disk player or a computer-controlled VCR) and interfaced to the system
through an appropriate `virtual player'. The only requirements for each video
device are that it be able to return the current (currently playing) frame number
and go to a particular frame number on demand. The video device plays the video
independently from the rest of the system. This is obviously the case when an
external physical device is used. Software video players, such as the Digital Video
Monitor, play the video data on a separate thread of execution.
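The two-requirement device contract above amounts to a minimal abstract interface. This is a hedged sketch; the class and method names are our invention, not the actual MDB-GSG classes:

```cpp
// Any playback device (software player, laser disk, RS-232 VCR) qualifies
// as a virtual player if it can report the frame it is currently playing
// and seek to a given frame on demand.
class VirtualVideoPlayer {
public:
    virtual ~VirtualVideoPlayer() {}
    virtual long currentFrame() const = 0;    // polled by the event loop
    virtual void seekToFrame(long frame) = 0; // e.g. when a slider moves
};
```

A concrete device wraps its own control protocol (RS-232 commands, media-library calls) behind these two methods, giving the rest of the system device independence.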
The system's main X event loop contains a function (implemented as a timeout)
which periodically polls the video device for the current position in the media. The
function then sends an `update position' message to the `synchronization dispatcher'
object (see Figure 9). The `synchronization dispatcher' propagates this message
to all its sub-objects (keyframe browsers), and all currently active objects (i.e.
currently opened windows) update themselves in response. Each object, in turn,
propagates the message to all active sub-objects (e.g. if any keyframe browser
window has some active animated strip charts). So the update message originates
in the X event loop and is propagated in a tree-like fashion throughout all currently
active objects.
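The tree-fashion propagation above can be sketched with a generic node type standing in for the actual keyframe browsers, browser windows, and strip charts; the names here are our assumptions:

```cpp
#include <vector>

// A node in the synchronization tree. On `update position', each object
// updates itself with the new frame number and forwards the message to
// its currently active sub-objects, so the message fans out from the
// dispatcher through the whole tree of open windows.
class SyncTarget {
public:
    virtual ~SyncTarget() {}
    std::vector<SyncTarget*> children;  // active sub-objects
    long lastFrame = -1;                // last frame this object drew
    virtual void updatePosition(long frame) {
        lastFrame = frame;              // redraw with the new time focus
        for (SyncTarget* c : children)  // propagate down the tree
            c->updatePosition(frame);
    }
};
```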
Polling an independently executing virtual player makes the video recording the
basis of temporal synchronization of all the system elements, and is consistent with
the basic philosophy of the entire system, in which the video is central to the
analysis. In fact, all system data (e.g. text transcript, strip chart data,
hand position data for the avatar) are derived from the video. This approach is also
easy to implement and keeps all object interfaces relatively simple: in general, any
object has to be able to update itself with a new current frame number, and to
send a new frame number to the virtual player (e.g. when the user moves a slider
on the VCR-like Control Panel). Furthermore, most video players maintain
accurate temporal synchronization, making them ideal `clocks' for the system.
By using the virtual media player as the central synchronization agent, dropping
frames during video playback does not pose a problem, since all the remaining
interface modules respond to such an event by updating themselves with the correct
new frame number. In fact, our synchronization strategy has the added benefit of
automatic system load balancing. On a slower machine, the video player is likely
to drop frames if many interface components are active and animated. This in turn
causes a coarser animation update, giving more resources to the video player.
In the current implementation the polling is performed at 33 ms intervals (slightly
faster than 30 times per second), which is sufficient, considering that the typical
video frame rate is 30 fps. Some events in certain interface objects (e.g. pressing
the \skip single frame" buttons in the VCR-like Control Panel) automatically force
dispatching of the `update position' message to all interface elements.
7. System Use
Our MDB-GSG system replaces a process of manual perceptual analysis using a
frame-accurate videotape player (a Sony EVO-9650); hand transcription; manual
production of gesture and gaze tracking charts and audio F0 and RMS charts; man-
ual tagging of F0 charts and synchronization with the text transcript; and manual
reproduction of the analysis results into a text transcript. For this reason, it is not
possible to evaluate the MDB-GSG system against a predecessor. Furthermore, the
kind of psycholinguistic analysis performed is extremely skill-intensive and tedious.
For this reason, the number of people actively engaged in micro-analysis of video
for GSG coding is small. We hope that a system like our MDB-GSG will bring
modern multimedia tools to bear on such research and increase the number of new
researchers in the field. We hope to have an effect similar to that of PC-controlled
telescopes bringing many new amateur astronomers to the discovery of celestial
phenomena.
The system is being used by two doctoral students in psycholinguistic research,
and both have given the MDB-GSG system high marks in their subjective evaluation.
The work is an order of magnitude faster (a day versus a week and a half of
intense labor) with the MDB-GSG system. Furthermore, with direct access to the multimedia,
multimodal data, the degree of integrated analysis is enhanced.
8. Lessons Learned
We worked closely with domain scientists in all phases of our research and imple-
mentation of the system. We had been doing GSG research in collaboration with
our psycholinguistics partners, and had observed the degree of detail required in
the micro analysis of the video data. We also saw the importance of temporal
analysis for such research. At each stage of our system development, we proposed the
structure of each interface component to the psycholinguists and discussed with
them how it would function. The computer scientists took the lead in this tool development
effort, as they are more familiar with what can be done. Since we are proposing new
ways for GSG access, the psycholinguists initially found it difficult to visualize how
the new interface would work. To overcome this, after initial discussion we
developed rapid prototypes of the new tools and showed them to the psycholinguists for
comment. This is when we usually get the most effective insights on how to modify
the tools to suit psycholinguistic GSG analysis. Most of the tools required two or three
iterations to refine both the interface and the back-end processing. The components
that elicited the most discussion and refinement were, not surprisingly, the Text
Transcription Interface and the VCR-like Control Panel. These were the components
that matched most precisely how video micro analysis for GSG research was
previously conducted. The other components, like the video Keyframe Browser, the Avatar
Representation, and the Strip Chart Interface, were completely new to the experience
of the psycholinguists, and we expect that new comments and refinement requests
will be forthcoming as they become more familiar with these tools. With respect
to system architecture, clean objects defined in relation to interface components
facilitate such interactive refinement and system development.
As for data representation, our discussion led to the implementation of the time
portal concept to visualize and compare multiple instances in time. The time portal
conceptual model provides a good mechanism to visualize and access single or
multiple time instances in a spatialized representation of time. We also found that
single hierarchies are insufficient to represent complex human discourse structures.
Our implementation of multiple keyframe browsers associated with different shot
hierarchies is the first attempt at visualizing the tangled hierarchies and overlapping
semantic units at various levels of the psycholinguistic analysis. Our research on
this representation continues.
Multiple-linked representations are essential in the design of complex tools for
temporal analysis. Requiring all interface components to function under a single
current time focus helps the user stay situated with the multi-modal data.
The use of animation of the strip chart, avatar, key frame highlighting, and text
transcription display in synchrony with video and audio helps to maintain the sense
of situatedness.
For media synchronization, we found the use of the media or video player as the
master clock for the entire system to be an effective and simple way to ensure
synchrony among the various interface components.
9. Conclusions
We have presented a multimedia database for perceptual analysis of video data using
a multiple, dynamically linked representation model. The system components
are linked through a time portal with a current time focus. The system provides
mechanisms to analyze overlapping hierarchical interpretations of the discourse,
and integrates visual gesture analysis, speech analysis, visual gaze analysis, and
text transcription into a coordinated whole. The various interaction components
facilitate accurate multi-point access to the data.
While our system is currently applied to gesture, speech, and gaze research, it
may be applied to any other field where careful analysis of temporal synchronies in
video is important. The VCR-Like Control Panel, Digital Video Monitor, Hierarchical
Keyframe Browser, Hierarchy Editor Panel, Animated Strip Charts, Text Transcription
Interface, and Synchronized Transcript Display may be loaded with any time-based
signal files as ASCII point lists, and may be used for any synchronized signal-based
analysis (video compression rate against time for a compression application, force
measurements in a videotaped ergonomic lifting experiment, etc.). The Avatar
Representation may be customized for any other abstract representation.
The time portal concept permits the simultaneous analysis of multiple time foci.
In our work on GSG, we have found situations where we need to compare recur-
rences across multiple time indices and sometimes across video datasets. A single
time portal may then represent the events converging in one nexus of time. This
may be compared against that of an alternate time portal.
Acknowledgments
This research has been funded by the U.S. National Science Foundation STIMU-
LATE program, Grant No. IRI-9618887, \Gesture, Speech, and Gaze in Discourse
Segmentation", and the National Science Foundation KDI program, Grant No.
BCS-9980054, \Cross-Modal Analysis of Signal and Sense: Multimedia Corpora
and Computational Tools for Gesture, Speech, and Gaze (GSG) Research."