
Final Report on Multimodal Experiments - Part II: Experiments for data collection and

technology evaluation

Andreas Korthauer (ed.), Ivana Kruijff-Korbayová, Tilman Becker, Nate Blaylock, Ciprian Gerstenberger, Michael Kaißer, Peter Poller, Verena Rieser, Jan Schehl, Oliver Lemon, Kallirroi Georgila, James Henderson, Pilar Manchón, Carmen Del Solar

Distribution: public

TALK - Talk and Look: Tools for Ambient Linguistic Knowledge

IST-507802 Deliverable 6.4 (Part II)

20th December 2006

Project funded by the European Community under the Sixth Framework Programme for Research and Technological Development

The deliverable identification sheet is to be found on the reverse of this page.

Project ref. no.: IST-507802
Project acronym: TALK
Project full title: Talk and Look: Tools for Ambient Linguistic Knowledge
Instrument: STREP
Thematic Priority: Information Society Technologies
Start date / duration: 01 January 2004 / 36 Months

Security: public
Contractual date of delivery: M36 = December 2006
Actual date of delivery: 20th December 2006
Deliverable number: 6.4 (Part II)
Deliverable title: Final Report on Multimodal Experiments - Part II: Experiments for data collection and technology evaluation
Type: Report
Status & version: Final 1.0
Number of pages: 60 (excluding front matter)
Contributing WP: 6
WP/Task responsible: BOSCH
Other contributors: USAAR, DFKI, UEDIN, USE, UCAM
Author(s): Andreas Korthauer (ed.), Ivana Kruijff-Korbayová, Tilman Becker, Nate Blaylock, Ciprian Gerstenberger, Michael Kaißer, Peter Poller, Verena Rieser, Jan Schehl, Oliver Lemon, Kallirroi Georgila, James Henderson, Pilar Manchón, Carmen Del Solar
EC Project Officer: Evangelia Markidou
Keywords: experimental methods, multimodal experiments, evaluation, data collection

The partners in TALK are: Saarland University USAAR

University of Edinburgh HCRC UEDIN

University of Gothenburg UGOT

University of Cambridge UCAM

University of Seville USE

Deutsches Forschungszentrum für Künstliche Intelligenz DFKI

Linguamatics LING

BMW Forschung und Technik GmbH BMW

Robert Bosch GmbH BOSCH

For copies of reports, updates on project activities and other TALK-related information, contact:

The TALK Project Co-ordinator
Prof. Manfred Pinkal
Computerlinguistik
Fachrichtung 4.7 Allgemeine Linguistik
Postfach 15 11 50
66041 Saarbrücken
[email protected]
+49 (681) 302-4343 - Fax +49 (681) 302-4351

Copies of reports and other material can also be accessed via the project's administration homepage, http://www.talk-project.org

©2006, The Individual Authors.

No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.

Contents

Summary

1 Introduction

2 SAMMIE-1 and SAMMIE-2 Wizard-of-Oz Experiments
   2.1 Summary of the experiments
      2.1.1 Experiment Setup
      2.1.2 Collected Data
   2.2 Assessment of the data collection methods and lessons learnt
      2.2.1 SAMMIE-1
      2.2.2 SAMMIE-2
   2.3 Use of results in the development of the SAMMIE system
      2.3.1 Multimodal Presentation Strategies
      2.3.2 Lexical and syntactic alignment
      2.3.3 Multimodal Clarification Strategies
      2.3.4 Speech Recognition Grammar and Generation Templates

3 MIMUS Wizard-of-Oz Data Collection
   3.1 Overview
   3.2 Introduction
   3.3 The USE WoZ platform
   3.4 The USE multimodal WOZ experiments
      3.4.1 Motivation and Objectives
      3.4.2 Subjects
      3.4.3 Experiments
      3.4.4 User-Wizard interactions
      3.4.5 Logging
      3.4.6 Surveys
   3.5 Data annotation
      3.5.1 Annotation
      3.5.2 Personal information and user profile
      3.5.3 Experiments conditions and procedure
      3.5.4 Tasks and subtasks
      3.5.5 Automatic logging
      3.5.6 Dialogue
      3.5.7 Gestures
      3.5.8 Annotation Tools
      3.5.9 Inter-annotators agreement
      3.5.10 Assessment of the data collection methods
   3.6 The MIMUS corpus in a nutshell
   3.7 Experiments description and results
      3.7.1 EXPERIMENT 1A
      3.7.2 EXPERIMENT 1B
      3.7.3 Survey Information
      3.7.4 Multimodal interaction: performance
   3.8 Conclusions and future work

4 The SACTI Data Collections
   4.1 Simulated ASR Channel - Tourist Information (SACTI) corpus
      4.1.1 Setup
      4.1.2 Collection Summary
      4.1.3 Analysis of Data
      4.1.4 Observations on data
      4.1.5 Conclusions on the data
   4.2 Other multimodal collections
      4.2.1 Conclusions
   4.3 Consequences for multimodal fusion
      4.3.1 Lack of data
      4.3.2 Lack of time alignment
      4.3.3 Redundancy of gesture acts
   4.4 Conclusions

5 Evaluating Effectiveness and Portability of Reinforcement Learned Dialogue Strategies with real users: the TOWNINFO Evaluation
   5.1 Overview
   5.2 Introduction
   5.3 Related work
      5.3.1 Automatic evaluation using simulated users
   5.4 TOWNINFO System overview
      5.4.1 Overview of system features
   5.5 Portability: moving between COMMUNICATOR and TOWNINFO domains
   5.6 Evaluation Methodology
   5.7 Results
      5.7.1 Perceived Task Completion (PTC) and User Preference
      5.7.2 Dialogue Length and Reward
      5.7.3 Qualitative description of the learnt policy
   5.8 Conclusion and Future directions
      5.8.1 Future directions


Summary

Deliverable D6.4 reports on the various multimodal experiments carried out in the scope of the TALK project. The purpose of the experiments has been twofold:

• Several data collection experiments were conducted in order to enable dialogue design and system development.

• Based on the evolving prototype systems, experiments were carried out to evaluate methodologies, technologies and systems.

D6.4 is split into two parts:

1. The first part concentrates on the evaluation of the final SAMMIE in-car system.

2. In the second part we report on the data collection experiments SAMMIE, MIMUS, and SACTI. Moreover, we present results from the evaluation experiments using the TOWNINFO system.

SAMMIE: The dialogue data collected in the SAMMIE-1 and SAMMIE-2 experiments informed the development of multimodal presentation strategies in the SAMMIE system, and its natural language generation and speech recognition components. The observed multimodal presentation strategies involve varied combinations of visual presentation of detailed vs. abridged search result tables, accompanied by verbal output either enumerating the results or just providing descriptive summaries based on search result clustering [12, 9]. Additionally, we observed adaptation to the user in terms of lexical and syntactic alignment to his/her formulations [10]. The collected data was also used to build a cluster-based user simulation for training a reinforcement learning-based policy for multimodal clarification strategies [17], and to build feature-based models of human clarification behavior [19, 18].

MIMUS: The MIMUS corpus is the result of the WoZ series of experiments described in D6.2 [5]. The main objective was to gather information about different users and their performance, preferences and usage of a multimodal multilingual natural dialogue system in the Smart Home scenario. The subject profile for this corpus is that of wheelchair-bound users, because of their special motivation and interest in using this multimodal technology, along with their specific needs. The corpus comprises a set of three related experiments. The focus is on subjects' preferences, multimodal behavioural patterns, and willingness to use this technology. The results are important since they have been used to design and configure the MIMUS system, and they will be used for future implementations.

SACTI: An exploratory Wizard-of-Oz data collection was carried out in Cambridge, using an interactive map interface. A detailed description of the data collection can be found in the D6.2 Annotators Handbook [5]. Status Report T1.4s2 [1] contains a comprehensive analysis of the SACTI multimodal Wizard-of-Oz data collections and describes the lessons learned from them. The main purpose of this data collection was to produce training data for acoustic models and language models of the Cambridge ATK speech recogniser. Another focus of the data analysis was to find the right system architecture for integrating speech and mouse clicks.


The most important results of this investigation can be summarised as follows:

• Less than 2.5% of the data collected feature integrated multimodal dialog acts.

• Only 8 of the 36 participants used these acts.

• The time relation between speech and clicks is not on the word level, but rather on the utterance level.

This suggests that the integration of multimodal acts in spoken dialogue systems should preferably not be done in the language model component of the recogniser, but at a later stage.
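To make this late-integration recommendation concrete, the following sketch (Python; names and scoring are illustrative assumptions, not the SACTI implementation) combines an N-best list of recognised utterances with click events that fall within an utterance-level time window, rescoring hypotheses after recognition rather than inside the language model:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeechHypothesis:
    text: str     # recognised utterance text
    score: float  # recogniser confidence (higher is better)
    start: float  # utterance start time (seconds)
    end: float    # utterance end time (seconds)

@dataclass
class ClickEvent:
    target: str   # id of the object clicked on the map
    time: float   # click timestamp (seconds)

def fuse_post_recognition(nbest: List[SpeechHypothesis],
                          clicks: List[ClickEvent],
                          slack: float = 1.0) -> dict:
    """Late fusion: attach clicks to whole utterances (not single words)
    and rescore hypotheses that plausibly refer to a clicked object."""
    best, best_score = {}, float("-inf")
    for hyp in nbest:
        # clicks that fall inside the utterance window (plus some slack)
        related = [c for c in clicks
                   if hyp.start - slack <= c.time <= hyp.end + slack]
        # simple bonus if the hypothesis contains a deictic expression
        bonus = 0.5 * len(related) if any(
            w in hyp.text.lower() for w in ("this", "that", "here")) else 0.0
        score = hyp.score + bonus
        if score > best_score:
            best, best_score = {"utterance": hyp.text,
                                "referents": [c.target for c in related]}, score
    return best

# Example: "show me this hotel" plus a click on hotel_3 during the utterance
print(fuse_post_recognition(
    [SpeechHypothesis("show me this hotel", 0.7, 2.0, 3.5)],
    [ClickEvent("hotel_3", 3.1)]))
```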

TOWNINFO: In the TOWNINFO evaluation of the learned dialogue policy versus a state-of-the-art hand-coded policy (with 18 users) we found that users of the learned policy had an average gain in perceived task completion of 14.2% (from 67.6% to 81.8%, at p < .03), and that the learned-policy dialogues had on average 3.3 fewer system turns (p < .01). These results are important because they show a) that results for real users are consistent with results from automatic evaluation [54, 52] of learned policies using simulated users [49, 50], b) that a policy learned using linear function approximation over a very large policy space [52, 54] is effective for real users, and c) that dialogue policies learned using data for one domain can be used successfully in other domains.
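As a rough illustration of how such a comparison can be computed from per-dialogue logs (this is not the project's evaluation script; the field names, toy numbers, and the choice of Welch's two-sample t-test are assumptions), one might aggregate perceived task completion and system-turn counts per policy and test the differences:

```python
from statistics import mean
from scipy import stats  # two-sample t-test; the actual test used is an assumption

# Hypothetical per-dialogue records: (policy, perceived_completion 0/1, system_turns)
dialogues = [
    ("learned", 1, 10), ("learned", 1, 12), ("learned", 0, 15),
    ("handcoded", 1, 14), ("handcoded", 0, 18), ("handcoded", 0, 16),
]

def column(policy, idx):
    return [d[idx] for d in dialogues if d[0] == policy]

for label, idx in (("perceived task completion", 1), ("system turns", 2)):
    learned, handcoded = column("learned", idx), column("handcoded", idx)
    t, p = stats.ttest_ind(learned, handcoded, equal_var=False)
    print(f"{label}: learned={mean(learned):.2f} "
          f"hand-coded={mean(handcoded):.2f} (p={p:.3f})")
```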

Use of data: The data of these experiments has been used in the development of systems and in further technology evaluation experiments throughout the project, e.g. [3, 42, 43]. All the reported data collections are available as part of the annotated TALK data archive [2]. This should form an excellent resource for further exploitation and research work in advanced multimodal dialogue systems.


Chapter 1

Introduction

The second part of the deliverable D6.4 concentrates on the experiments for multimodal data collection and evaluation, which have been conducted by USAAR, USE, UCAM, and UEDIN in the scope of the TALK project.

Document outline: The document summarizes the following experiments and the main results of the data analysis:

• Chapter 2 presents the setup and conduct of the SAMMIE Wizard-of-Oz experiments. To determine the interaction strategies and range of linguistic behavior naturally occurring in the MP3 player scenario, USAAR conducted two series of Wizard-of-Oz experiments: SAMMIE-1 involved only spoken interaction, while SAMMIE-2 was multimodal, with speech and screen input and output. The methods used are assessed and lessons learnt from the experiments are presented. Moreover, we describe how the research results of the SAMMIE experiments have been used to build the SAMMIE in-car dialogue system [3].

• Chapter 3 summarizes the multimodal Wizard-of-Oz experiments for the Smart Home scenario and the collection of the MIMUS corpus. The MIMUS corpus is the result of the WoZ series of experiments described in D6.2 [5]. These experiments aimed to collect information about different users and their performance, preferences and usage of a multimodal multilingual natural dialogue system in an in-home domain. The singularity of these user-system interactions lies in the fact that they exhibited multimodal integrative input and output patterns of communication, which are currently helping to extend and configure the already existing spoken dialogue system Delfos by adding new functionality, and new input and output modalities.

• Chapter 4 describes the SACTI "Wizard of Oz" collection, with a user and a "wizard" in the role of the system. "Wizard" and user communicated via a simulated speech recognition channel and an interactive map interface. The data is analysed with regard to the amount, nature and timing of multimodal interactions. Finally, the recommendation is made that combining multimodal streams post-recognition, given multiple recognition hypotheses, appears to be the more suitable approach.

• Chapter 5 discusses the UEDIN experiments with the TOWNINFO tourist information system and corresponding evaluation results comparing hand-coded vs. learnt dialogue policies. We report results from experiments with 18 real users (in 180 dialogues) for a learned dialogue policy (as explained in deliverable D4.1 [54]) versus a hand-coded dialogue policy in the TALK project's "TownInfo" tourist information system [55], reported in deliverable D4.2 [56]. The learned policy, for filling and confirming information slots, was derived from COMMUNICATOR (flight-booking) data as described in [52] and deliverable D4.1 [54], ported to the tourist information domain, and tested using human users, who also used a state-of-the-art hand-coded dialogue policy embedded in an otherwise identical system. We also present a generic method for porting learned policies between domains in similar ("slot-filling") applications.


Chapter 2

SAMMIE-1 and SAMMIE-2 Wizard-of-Oz Experiments

2.1 Summary of the experiments

To determine the interaction strategies and range of linguistic behavior naturally occurring in the MP3 player scenario, we conducted two series of Wizard-of-Oz experiments: SAMMIE-1 involved only spoken interaction, SAMMIE-2 was multimodal, with speech and screen input and output.1

In both SAMMIE-1 and 2 the subjects performed several tasks as users of an MP3 player application simulated by a wizard. The tasks involved exploring the contents of a database of information (but not actual music) of more than 150,000 music albums (almost 1 million songs), to which only the wizard had access.2

Our goal was not only to collect data on user interactions with such a system, but also to observe what interaction strategies humans naturally use and how efficient they are. Besides the usual goal of WOZ data collection, namely to obtain realistic examples of the behavior and expectations of the users, an equally important goal for us was to observe the natural behavior of multiple wizards in order to guide our system development; the wizards' responses were therefore not constrained by a script.

The speech-only SAMMIE-1 experiment was essentially a pilot study aimed at getting an idea of the range of linguistic and dialogue phenomena in this domain of application. We used a straightforward method of recording human-human interactions. The SAMMIE-1 experiment is described in detail in [7]. We used our experience from SAMMIE-1 to design the more complex setup for the multimodal SAMMIE-2 experiment, which was geared towards our research questions concerning multimodal presentation strategies and multimodal clarification strategies. We designed a setup where the wizard has freedom of choice w.r.t. their response and its realization through single or multiple modalities. This makes it different from previous multimodal experiments, e.g., in the SmartKom project [20], where the wizard(s) followed a strict script. But our design also differs in several aspects from taking recordings of straight human-human interactions: the wizard does not hear the user's input directly, but only gets a transcription, parts of which are sometimes randomly deleted (in order to approximate imperfect speech recognition); the user does not hear the wizard's spoken output directly either, as the latter is transcribed and re-synthesized (to produce system-like sounding output).

1 SAMMIE stands for Saarbrücken Multimodal MP3 Player Interaction Experiment.
2 The information was extracted from the FreeDB database, freely available at http://www.freedb.org.


The interactions should thus more realistically approximate an interaction with a system, and thereby contain similar phenomena (cf. [8]). A detailed description of the SAMMIE-2 experiment is given in [6]; the complex setup developed for SAMMIE-2 was published in [12]. The corpora collected in the experiments are described in [11].

2.1.1 Experiment Setup

In SAMMIE-1, 24 subjects each participated in one session with one of two wizards. They worked on eight tasks, for at most 30 minutes in total. Tasks were of three types: (1) finding a specified title; (2) selecting a title satisfying certain constraints; (3) building a playlist satisfying certain constraints.

In SAMMIE-2, 42 subjects each participated in one session with one of six wizards. They worked on two times two tasks3 for at most twice 15 minutes. Tasks were of two types: (1) searching for a title either in the database or in an existing playlist; (2) building a playlist satisfying a number of constraints.

In both experiments, some of the tasks were specific, such as searching for a song by a given artist, whereas others were rather vague, such as searching for favorite songs from the seventies. The purpose of the latter was to give the users the opportunity to formulate their own goals, and specify their own search criteria based on their knowledge and music preferences.

Both users and wizards could speak freely. The interactions were in German (although most of the titles and artist names in the database are English). In SAMMIE-2, the wizards could use speech only, display only, or combine speech and display, and the users could speak and/or make selections on the screen. One of the challenges we had to address was to allow the wizards to produce varied screen output in real time. We implemented modules to automatically calculate screen output options the wizard could select from to present search results, e.g., various versions of lists and tables [12].

In SAMMIE-1 the users and the wizards could hear each other directly, and there were no disruptions to the speech signal. In SAMMIE-2, we used a more complex setup with no direct spoken contact, in order to reproduce more realistic conditions resembling interaction with a dialogue system.

Figure 2.1 shows the major data flow within the SAMMIE-2 experiment. The wizard's utterances were immediately transcribed and presented to the user via a speech synthesizer. The user's utterances were also transcribed and the wizard was only presented with the transcript. As described in [12], we sometimes corrupted the transcript in a controlled way by replacing parts of the transcribed utterances with dots, in order to simulate understanding problems at the acoustic level. In order to invoke clarification behavior we introduced uncertainties on several levels, for example, multiple matches in the database, lexical ambiguities (e.g., titles that can be interpreted as denoting a song or an album), and the above-mentioned errors on the acoustic level. Whenever the wizard made a clarification request, the experiment leader invoked a questionnaire window on the screen, where the wizard had to classify his clarification request according to the primary source of the understanding problem. At the end of each task, users were asked to what extent they believed they had accomplished their tasks and how satisfied they were with the results.
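As an illustration of the controlled corruption described above (a minimal sketch, not the tool actually used in the experiments; the deletion rate is an assumed parameter), one can randomly replace words of a transcribed utterance with dots to mimic acoustic-level understanding problems:

```python
import random

def corrupt_transcript(utterance: str, deletion_rate: float = 0.2,
                       seed: int = 42) -> str:
    """Replace a random fraction of the words with '...' so the wizard
    only sees a partially 'understood' version of the user's utterance."""
    rng = random.Random(seed)
    words = utterance.split()
    return " ".join("..." if rng.random() < deletion_rate else w
                    for w in words)

# Example (German, as in the experiments): some words become unreadable
print(corrupt_transcript("Bitte spiele ein Rocklied von den Beatles"))
```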

Since it would be impossible for the wizard to construct layouts for screen output on the fly, he gets support for this task from the WOZ system: when the wizard performs a database query, a graphical interface presents him with a first level of output alternatives, as shown in Figure 2.2. The choices are the found (i) albums, (ii) songs, or (iii) artists. For a second level of choice, the system automatically computes four possible screens, as shown in Figure 2.3. The wizard can choose one of the offered options to display to the user, or decide to clear the user's screen. Otherwise, the user's screen remains unchanged. It is therefore up to the wizard to decide whether to use speech only, display only, or to combine speech and display.

3 For the second two tasks there was a primary task using a Lane Change driving simulator [15].

Figure 2.1: Data flow of the experiment. The wizard (above left) controls the system output; the subject (above right) performs the lane change task on the left screen, while the graphics output of the system is shown on the right screen. The two lower pictures show the respective typists.

We implemented the experimental system on the basis of the Open Agent Architecture (OAA) [14], a framework for integrating a community of software agents in a distributed environment. We made use of the OAA monitor agent to trace all communication events within the system for logging purposes.

In both experiments, the subjects were interviewed using a detailed questionnaire after completing their session with the system. The questions addressed different levels, such as concrete details about the interaction, advantages and disadvantages of the actual MP3 player, and suggestions for future functionality and the communication interface. In SAMMIE-2 we additionally asked the subjects to give more detailed feedback concerning multimodality, clarification strategies and answer generation. The wizards were also debriefed at the end of the SAMMIE-1 and SAMMIE-2 experiments as a whole, and asked about their strategies.

2.1.2 Collected Data

For both SAMMIE-1 and 2 the data for each session consists of a video and audio recording and a user questionnaire; for SAMMIE-2 there is also a log file for each session,4 which consists of OAA messages in chronological order, each marked by a timestamp. The messages contain various information obtained during the experiment, e.g., the transcriptions of the spoken utterances, the wizard's database query and the number of results, the screen option chosen by the wizard, the selections made by the user in the graphical output, the wizard's online classification of clarification requests, user satisfaction and their perceived task completion, etc.

4 Due to data loss caused by a technical failure, complete data (video, audio and log files) only exists for 21 of the 42 sessions.
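To give a concrete idea of what such a timestamped session log can look like once parsed, here is a small sketch (illustrative field names and values only; the actual OAA message format is not reproduced here):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class LogEvent:
    timestamp: float          # seconds since session start
    source: str               # e.g. "wizard", "subject", "system"
    event_type: str           # e.g. "transcription", "db_query", "screen_option", "click"
    payload: Dict[str, Any] = field(default_factory=dict)

@dataclass
class SessionLog:
    session_id: str
    events: List[LogEvent] = field(default_factory=list)

    def turns(self) -> int:
        """Count wizard and subject utterance transcriptions as turns."""
        return sum(1 for e in self.events if e.event_type == "transcription")

log = SessionLog("sammie2-s07", [
    LogEvent(12.4, "subject", "transcription", {"text": "Zeige mir alle Rockalben"}),
    LogEvent(15.9, "wizard", "db_query", {"constraints": {"genre": "rock"}, "n_results": 49}),
    LogEvent(17.2, "wizard", "screen_option", {"option": "detailed_table"}),
])
print(log.turns())
```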


Figure 2.2: Screenshot from the FreeDB-based database application, as seen by the wizard. First level of choice: what to display.


Figure 2.3: Screenshot from the display presentation tool offering screen output options to the wizard. Second level of choice: what to display and how.


The SAMMIE-1 corpus contains 24 sessions with approximately 2600 wizard and subject turns in total; the transcripts amount to approximately 248 KB of plain text. The SAMMIE-2 corpus contains 21 sessions with 1700 turns; the transcripts amount to approximately 164 KB of plain text. The data has been transcribed and annotated at multiple levels [11, 5].

2.2 Assessment of the data collection methods and lessons learnt

2.2.1 SAMMIE-1

The primary purpose of SAMMIE-1 was to get initial insight into the language and problem-solving strategies employed by both the users and the wizards. All in all, the experiment satisfied its purpose.

Subjects Most of the subjects managed quite well to control the "system" and were very enthusiastic about "speaking" with an MP3 player. The interaction was to some extent so authentic that some users even believed that the system was real. Nevertheless, other subjects had enough prior knowledge to realize that this was a Wizard-of-Oz experiment. Those who own a real MP3 player compared the advantages and disadvantages of the two alternatives. The majority found the idea of a voice-operated MP3 player very practical and reasonable.

Wizards Not only the subjects' behavior but also the wizards' behavior was of great interest. They acted very differently: one tried to be very computer-like and the other one was very natural. They varied certain aspects of their behavior, which triggered different reactions from the subjects: cooperation, initiative, feedback, functionality, etc.

Although for the most part the SAMMIE-1 experiment ran rather smoothly, we outline some of the problems we encountered below. These were pitfalls to be avoided in the future.

Video capture The resolution of the digital video recordings was not sufficient to analyze how the wizards used the FreeDB interface. Another solution would thus be needed in the future. Other small technical disturbances did not really influence the recordings' quality.

Microphones Using a free-standing microphone may not have been the best idea for the wizard. Environmental noises, especially keyboard typing, were picked up by the microphone. This not only destroyed what little hope we had for the subjects to believe that they were actually talking to a computer, it was also archived as part of the speech signal, which may make it less useful for, e.g., training a speech recognizer.

Headphones This problem could also be attributed to the use of free-standing mics. Normal speakers were used instead of headphones. This created a situation where each participant's speech was picked up by the other's microphone. Thus, although we have separate audio signals for each speaker, each signal contains a faint 'copy' of the other signal. Again, this could negatively affect things like speech recognition training.

Subject and Wizard Recruitment Although recruitment was done around the Saarland University campus, a large number of Computational Linguistics students ended up participating, including one of the wizards. Such students had a preconceived notion of what to expect from a dialogue system, and therefore many seemed to pre-adjust their language accordingly. This is especially apparent in the wizard. He not only used language he thought a computer capable of; in an exit interview, he even mentioned that he purposely pretended not to understand subjects' utterances when he thought a reasonable dialogue system would not be able to understand them. In the future one should recruit subjects without much knowledge about state-of-the-art dialogue systems, or instruct the wizards differently.

The main issue for future experiment design thus emerged to be to simulate a dialogue-system setup more accurately, both for the users, who should be convinced that they are interacting with a computer, and for the wizards, who should not enact their own preconceptions about the capability of a system.

2.2.2 SAMMIE-2

The goal of SAMMIE-2 was not only to get realistic examples of the behavior and expectations of the users; an equally important goal for us was to observe the natural behavior of multiple wizards in order to guide our system development. In general, we were satisfied with the setup, which simulated a dialogue system more accurately. However, we also encountered some complications, as described below.

Experience Using LaneChange

LaneChange and Multi-modality: Asking the subject to use the driving simulation system LaneChange while operating the "dialogue system" imposed an additional cognitive load on the subject. Especially multimodal tasks such as reading from the display or selecting an item via mouse click were much more difficult, as driving is an "eyes-and-hands-busy" application. Some of the subjects reported that they would prefer not to have to use the display at all while driving. 76.2% of the subjects would prefer to stress verbal feedback while driving. 85.7% of the subjects reported that the graphical presentation was distracting them. On the other hand, subjects in our study perceived the graphical output to be very helpful in less stressful driving situations and when not driving. Especially when they wanted to verify whether a complex task was completed (e.g., building a playlist), they asked for graphical proof (i.e., showing the playlist).

Selecting an item by mouse click seemed to be inappropriate as well, especially for less expert users. The experimenter observed that an elderly user stopped using the driving simulator (by simply taking both hands off the wheel) each time she wanted to operate the mouse or read from the display.

LaneChange and Learnability: At the beginning of each session each subject got a short introduction on how to operate the driving simulation. Depending on their driving skills and how much they were used to operating in simulated realities, some of our subjects found it difficult to get started. But after a short training period almost every subject felt comfortable in handling the software. Less skilled drivers seemed to adjust their speed accordingly.

Problems Using Human Typists

The use of human typists for online dialogue presented several challenges, described below, which we were only partially able to resolve.


Spelling Errors First, there was the challenge of spelling mistakes. These naturally occur during online transcription, and the necessity to transcribe sometimes unknown artist names or song titles only exacerbated the problem. For the transcription of the subject utterances (which were presented in text to the wizard), the only real problem was that this made it hard to make the wizard believe that the transcription they were reading came from a speech recognizer. However, on the other side, i.e., the transcription of the wizard's utterances, spelling mistakes were much more problematic. This was because the transcribed text was then sent to a TTS system to be synthesized and played back to the subject. Thus spelling errors sometimes resulted in huge differences in the spoken output of the system.

To combat this problem, we integrated a spell-checker into the transcription tools. Although this helped avoid many spelling problems, mistakes still did occur. Also, the need to do spell-checking added to the time delay in dialogue turns: a problem we turn our attention to now.

Speed Probably the biggest problem with the whole experiment was the time delay in the interaction. We experienced this somewhat in the SAMMIE-1 data collection [13], where it took time for the wizard to properly perform searches on the database (especially for wizards who were slow typists). However, the introduction of human typists made things incredibly slow. We have not analyzed time delays systematically in the corpus, but anecdotally, it was not uncommon for a turn to take on the order of 1-2 minutes to complete (meaning the time from when a participant, e.g., the subject, uttered something to when the response from the wizard was synthesized). This severely affected the amount of data (e.g., turns) we were able to gather during a session. We also strongly suspect it affected the naturalness of the dialogue.

This does not necessarily mean, however, that WOZ experiments with human typists are impossible. We believe several things could have sped things up: first, as stated above, we recruited typists from the university community. We believe professional typists could have done a much better job. Also, it may be possible to augment the transcription tool to (a) make it more ergonomic and (b) incorporate functions known to professional transcribers such as word/phrase completion, etc. Also, it may be possible, depending on the research question, to just have a typist for one channel of the conversation (e.g., wizard to subject) and then to use audio for the other.

Problems Using Open Task Descriptions

One of the evaluation criteria for spoken dialogues is whether the user was able to accomplish the goal. By formulating some of the task descriptions in an open manner, we allowed the users to set their own goals. We experienced two problems with this method.

First, we cannot inspect the user's own goals and therefore cannot evaluate whether he was able to fully satisfy his aims. Therefore we had the users indicate subjective task completion at the end of each task. An inspection of these results indicates that the users judged their ability to solve the task as quite high (as they indicated on the task cards) while judging the system's abilities as very low (as mentioned in the debriefing session). Therefore the user's experience of task success and task satisfaction must be evaluated carefully.

Second, we are under the impression that open task descriptions lead to less motivated users. Subjects in Wizard-of-Oz studies are in general less motivated to solve a task than real users. By having underspecified task descriptions, some of the subjects adopted a "do-whatever-you-want" strategy. For example, if the wizard asked a clarification request, presenting different options, the subject would not want to further specify his goal, but would just take any option to get the task done quickly. This behavior was also reinforced by the time delay.

Experience with on-line clarification classification by wizards

The problem sources selected on-line by the wizards in the pop-up questionnaire window cannot be considered completely reliable. Some of the wizards reported that the categories were unclear to them or too general. Furthermore, the pop-up window sometimes distracted them from their primary search task. Some wizards just clicked on any button to make the pop-up vanish.

Importance of daily backup

An embarrassing lesson learned was the importance of daily backup. We lost all but the (analog) video data for the first 15 sessions of the experiment because of a failed hard drive.

The main issues for future experiments would thus concern speeding up the interaction, without compromising its dialogue-system-like character, and improving the handling of the primary (driving) task by the users.

2.3 Use of results in the development of the SAMMIE system

By interviewing the wizards and the subjects and by analyzing the collected data we obtained valuable information concerning various aspects: presentation strategies and the use of the two modalities, linguistic output realization strategies involving alignment to the user's formulations, and multimodal clarification strategies. The collected data of course also served as a guideline with respect to the coverage of the grammar for the speech recognizer and the range of formulations for the spoken system output generation.

2.3.1 Multimodal Presentation Strategies

Below we summarize the main observations concerning presentation strategies and the use of modalities, and we indicate how these are reflected in the SAMMIE system implementation. A complete description of multimodal turn planning in the SAMMIE system is provided in [9]; sentence planning and realization are described in [10].

Visual Presentation of Search Results There were differences in how the different wizards rated and used the different screen output options: the table containing most of the information about the queried song(s) or album(s) was rated best and shown most often by some wizards, while others thought it contained too much information and would not be clear at first glance to the users, and hence they used it less or never. The screen option containing the least information in tabular form, namely only a list of songs/albums with their length, received complementary judgments: some of the wizards found it useless because it contained too little information, and thus did not use it, while others found it very useful because it would not confuse the user by presenting too much information, and thus used it frequently. Finally, the screen containing a text message conveying only the number of matches, if any, was hardly used by the wizards (we have consequently not implemented it as an output option in the SAMMIE system).


Figure 2.4: SAMMIE system GUI

As for the subjects, they found the multimodal presentation strategies helpful in general. However, they often thought that too much information was displayed. They found it distracting, especially while driving. They also asked for more personalized data presentation (cf. [12] for more detail).

The differences in the wizards' opinions about what the users would find useful and the varied judgments by the subjects clearly indicate the need for an evaluation of the usefulness of the different screen output options in particular contexts from the users' point of view.

The design of the screen output presentation in the SAMMIE system follows the finding that most often the wizards tried to be as informative as possible when using the pre-designed screen output options. Basically our display consists of a table viewer which is used to present the queried objects (e.g., songs or albums), a context panel that gives additional grounding information w.r.t. the user query, and a song viewer which holds information about the currently playing song, album or playlist. In situations where the presentation of additional information is expedient, the display presents more information about the listed objects. See Figure 2.4 for an example of graphical output based on the verbal user request 'Show me all rock albums', where in addition to each table entry the appropriate artist is listed. However, this type of content augmentation is implemented in a selective way, in order to avoid the presentation of less informative content.

Verbal Presentation of Search Results In speech-only interaction, the wizards typically say the number of results and list them when the number is small (up to approx. 10, cf. (1)). For more results, they often say the number, and sometimes ask whether or not to list them (cf. (2)). For very large sets of results, the wizards typically say the number and ask the user to narrow down the search (cf. (3)).

(1) I found 3 tracks. Blackbird, Michelle and Yesterday.
(2) I found 17 tracks. Should I list them?
(3) I found 500 tracks. Please constrain the search.
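A simple way to mirror this observed behavior in a system is to branch on the size of the result set; the thresholds below (10 and 100) are illustrative choices, not values specified in the deliverable:

```python
def verbalize_results(titles, list_threshold=10, ask_threshold=100):
    """Choose a verbal presentation strategy based on the number of results,
    mimicking the wizard behavior in examples (1)-(3)."""
    n = len(titles)
    if n <= list_threshold:
        return f"I found {n} tracks: " + ", ".join(titles) + "."
    if n <= ask_threshold:
        return f"I found {n} tracks. Should I list them?"
    return f"I found {n} tracks. Please constrain the search."

print(verbalize_results(["Blackbird", "Michelle", "Yesterday"]))
print(verbalize_results([f"track {i}" for i in range(500)]))
```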

In multimodal interaction, a commonly used pattern is to simultaneously display screen output and describe what is shown (e.g., 'I'll show you the songs by Prince'); see below for more details.

Multimodality in Output Presentation When showing screen output, the most common pattern used by the wizards was to accompany the displayed information by telling the user what was shown (e.g., 'I'll show you the songs by Prince'). Some wizards adapted to the user's requests: if asked to show something (e.g., 'Show me the songs by Prince'), they would show it without verbal comments; but if asked a question (e.g., 'What songs by Prince are there?' or 'What did you find?'), they would show the screen output and answer in speech.

Following the above approach employed by the wizards, the SAMMIE system handles explicit modality requests as follows. If a user asks to show something (e.g., 'Show me all rock songs by the Beatles'), SAMMIE presents the list just graphically. In such cases, speech is only used to refer to the displayed content, including the number of found objects and the user constraints for the given request (e.g., 'I found 49 rock songs by the Beatles, which are presented on the screen'). If a user asks explicitly to name something, the resulting objects are both displayed and verbalized, whereby only the objects currently visible within the table viewport are named (e.g., 'I found 49 rock songs by the Beatles. The first six are: ...'). In cases where no explicit modality request is given by the user (e.g., 'What rock songs by the Beatles are there?' or 'Scroll down'), the system uses the focused modality of the last explicit modality request, derived from the extended information state (see [4] for details).
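The decision rule described above can be sketched as follows (a simplified illustration, not the actual SAMMIE implementation; the request classification and the handling of the information state are assumptions):

```python
def choose_output_modalities(request_type: str, last_focused: str) -> dict:
    """Pick output modalities for a turn.

    request_type: 'show' (explicit display request), 'name' (explicit verbal
    request) or 'none' (no explicit modality request).
    last_focused: modality focused by the last explicit request
    ('display' or 'speech'), as kept in the information state.
    """
    if request_type == "show":
        # display only; speech merely refers to what is shown
        return {"display": True, "speech": "refer_to_display", "focused": "display"}
    if request_type == "name":
        # display and verbalize the objects currently visible in the viewport
        return {"display": True, "speech": "verbalize_visible", "focused": "speech"}
    # no explicit request: fall back to the last focused modality
    return {"display": last_focused == "display",
            "speech": ("verbalize_visible" if last_focused == "speech"
                       else "refer_to_display"),
            "focused": last_focused}

print(choose_output_modalities("show", last_focused="speech"))
print(choose_output_modalities("none", last_focused="display"))
```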

Summaries Based on Clustering A common characteristic in both the SAMMIE-1 and SAMMIE-2 setups is that the wizards often verbally summarize the search results in some way: most commonly by just reporting the number of results found, as in (3). But sometimes they describe the similarities or differences between the results, as in (4).

(4) 200 are from the 70’s and 300 from the 80’s.

Such descriptions may help the user to make a choice, and are a desirable type of collaborative behavior for a system. Their automatic generation provides an interesting challenge: it requires the clustering of results, abstraction over specific values, and the production of a corresponding natural language realization.

We implemented a domain-independent clustering mechanism which allows search result sets to be clustered in the SAMMIE system. This mechanism is used to create an appropriate presentation in order to help the user narrow down the result set when a specific object is requested (e.g., the user wants to hear a song, but provides underspecified information, as in 'I would like to hear a rock song').
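As a rough sketch of the idea (not the actual SAMMIE mechanism; the attribute name and grouping criterion are assumptions), one can group the results by a shared attribute such as decade and verbalize the cluster sizes, producing summaries like example (4):

```python
from collections import Counter

def summarize_by_attribute(results, attribute):
    """Cluster search results by one attribute and verbalize the cluster sizes."""
    counts = Counter(r[attribute] for r in results)
    parts = [f"{n} are from the {value}" for value, n in counts.most_common()]
    return " and ".join(parts) + "."

songs = ([{"title": f"song {i}", "decade": "70's"} for i in range(200)] +
         [{"title": f"song {i}", "decade": "80's"} for i in range(300)])
print(summarize_by_attribute(songs, "decade"))
# -> "300 are from the 80's and 200 are from the 70's."
```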

Adaptation to Driving Situation Concerning the adaptation of multimodal presentation strategies w.r.t. whether the user was driving or not, four of the six wizards reported that they consciously used speech instead of screen output if possible when the user was driving. The remaining two wizards did not adapt their strategy.

The multimodal turn planning module of the SAMMIE system is able to take the different levels of cognitive load imposed by the primary driving task into account and adapt the amount of displayed and verbalized content accordingly. However, more experimental results would be needed in order to determine optimal multimodal fission decisions in various driving contexts.

2.3.2 Lexical and syntactic alignment

The collected dialogues contain many occurrences of the wizard aligning to the user, lexically (e.g., using "Song" vs. "Lied" to refer to a song, depending on what word the user uses) or syntactically (e.g., using personal style, which involves the explicit realization of an agent, typically by a personal pronoun, vs. impersonal style, which avoids the attribution of an agent; the examples below illustrate this difference).


(5) Personal style:

a. I found 20 albums.

b. You have 20 albums.

(6) Impersonal style:

a. There are 20 albums.

b. The database contains 20 albums.

Consequently, we implemented a range of cases of alignment to the user input in the SAMMIE system. The interpretation module provides the features that characterize the formulation of the user's input, e.g. the head nouns referring to domain concepts, sentence mood, completeness (i.e., full vs. fragmentary) and agentivity and person, as illustrated by the examples below.

(7) U. How many rock songs are there?
    song noun=song; mood=interrogative; completeness=full; agentivity=impersonal

S. There are 35 rock songs.

(8) U. How many rock tracks do I have?
    song noun=track; mood=interrogative; completeness=full; agentivity=1st-person

S. You have 20 albums.

(9) U. Rock songs.
    song noun=song; mood=indicative; completeness=fragmentary; agentivity=impersonal

S. 35 rock songs found.

The NLG module then uses these features to control the output realization for each turn and tailor it accordingly. Based on that, and according to the specific content that has to be verbalized, the NLG module will realize different combinations of alignment phenomena: alignment w.r.t. lexical choice, personal/impersonal style, use of telegraphic/non-telegraphic expressions, and formal/informal address (see [10] for details). Although in our implementation we assume that the system should always align, this does not mean that there is only a single output realization option for a specific turn (see the examples above). It should be mentioned that some of the alignment phenomena are independent of each other (for instance, lexical alignment and personal/impersonal style). If some feature slot coming from the interpretation is empty, the NLG module uses predefined (default) values for each alignment phenomenon implemented.
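A minimal sketch of this kind of feature-driven alignment is given below (illustrative only; the real NLG module is template- and grammar-based and covers many more phenomena than this single realization choice):

```python
def realize_result_count(n: int, features: dict) -> str:
    """Realize 'N results found' aligned to the user's formulation features."""
    noun = features.get("song_noun", "song")          # lexical alignment
    agentivity = features.get("agentivity", "impersonal")
    completeness = features.get("completeness", "full")

    if completeness == "fragmentary":                 # telegraphic style
        return f"{n} rock {noun}s found."
    if agentivity == "1st-person":                    # personal style
        return f"You have {n} rock {noun}s."
    return f"There are {n} rock {noun}s."             # impersonal style

# Roughly corresponding to examples (7)-(9) above:
print(realize_result_count(35, {"song_noun": "song", "agentivity": "impersonal"}))
print(realize_result_count(20, {"song_noun": "track", "agentivity": "1st-person"}))
print(realize_result_count(35, {"song_noun": "song", "completeness": "fragmentary"}))
```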

2.3.3 Multimodal Clarification Strategies

The data gathered in the SAMMIE-2 setup is used to "bootstrap" a reinforcement learning-based clarification strategy. The overall framework is described in [16]. In particular, we use this data to build a cluster-based user simulation for training a reinforcement learning-based policy [17], and to build feature-based models of human clarification behavior [19, 18]. We are currently testing whether features selected by these models improve the learning results. We hope to use these results to make more general statements about similarities and differences between human clarification strategies and clarification strategies optimized for dialogue systems.
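For intuition only, the following toy sketch shows the general shape of training a clarification policy by reinforcement learning against a simulated user (a hand-rolled tabular Q-learning toy with invented states, actions and rewards; the actual TALK work uses a cluster-based user simulation and a far richer state and feature space [16, 17]):

```python
import random
from collections import defaultdict

ACTIONS = ["clarify", "answer"]          # ask a clarification question or answer directly
STATES = ["low_confidence", "high_confidence"]

def simulated_user_reward(state, action, rng):
    """Toy simulated user: clarifying pays off when understanding is uncertain,
    but costs a turn (annoys the user) when understanding was already good."""
    if state == "low_confidence":
        return 1.0 if action == "clarify" else (1.0 if rng.random() < 0.3 else -1.0)
    return -0.2 if action == "clarify" else 1.0

def train(episodes=5000, alpha=0.1, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)               # Q[(state, action)]
    for _ in range(episodes):
        state = rng.choice(STATES)
        if rng.random() < epsilon:       # epsilon-greedy exploration
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        reward = simulated_user_reward(state, action, rng)
        # one-step update (each episode is a single decision in this toy)
        q[(state, action)] += alpha * (reward - q[(state, action)])
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}

print(train())   # e.g. {'low_confidence': 'clarify', 'high_confidence': 'answer'}
```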


2.3.4 Speech Recognition Grammar and Generation Templates

Last but not least, the collected data was used as a guideline when developing the grammar for the speech recognizer, as well as an important source of inspiration for the formulation of the verbal output of the system.


Chapter 3

MIMUS Wizard-of-Oz Data Collection

3.1 Overview

This chapter summarizes the Wizard-of-Oz collection of multimodal data in a Smart Home scenario; it describes the motivation, collection methodology and format of the MIMUS corpus (MultIModal, University of Seville), as well as the in-depth and issue-focused analysis of the data.

The MIMUS corpus is the result of the WoZ series of experiments described in D6.2 [5]. The main objective was to gather information about different users and their performance, preferences and usage of a multimodal multilingual natural dialogue system in the Smart Home scenario. The subject profile for this corpus is that of wheelchair-bound users, because of their special motivation and interest in using this multimodal technology, along with their specific needs. Throughout this chapter, the WoZ platform, experiments, methodology, annotation schemes and tools, and all relevant data will be discussed, as well as the results of the in-depth analysis performed on these data. The corpus comprises a set of three related experiments. Due to the limited scope of this deliverable, only the issues related to the first two experiments (1A and 1B) will be discussed. The focus is on subjects' preferences, multimodal behavioural patterns, and willingness to use this technology.

3.2 Introduction

This corpus was collected in order to perform research on some Human-Computer Interaction related areas, with the ultimate goal of gaining sufficient information to re-design and configure the MIMUS multimodal dialogue system. This was originally the speech-only system known as Delfos, developed throughout previous European projects. The data collected in MIMUS was used to design the new system extensions and define its configuration and overall behaviour. The chapter is organized as follows. Firstly, the WOZ platform will be briefly introduced. Then, the full set of experiments as well as the motivation behind them will be discussed. Data, annotation tools and methodology, assessment of the used data collection methods, inter-annotators agreement, and final format will then be described and justified. Following this supporting information, the data from experiments 1A and 1B will be dissected, presented and interpreted. Finally, some conclusions will be drawn and future research areas will be proposed.


3.3 The USE WoZ platform

The platform is based on Delfos, the original spoken dialogue system developed at the University of Seville. Since the objective of this corpus was to obtain relevant information in order to design, implement and configure a multimodal version of this original system, all the previous spoken functionality was made available (as well as the new multimodal additions).

In terms of hardware, the platform consists of a PC used by the wizard, a tablet PC used by the subject, a Wi-Fi router by means of which both PCs can communicate, and a set of real home devices which make up the Smart Home setup. In addition, software consisting of a set of wizard agents and subject agents has been developed.

The former set consists of:

1. A Wizard Helper, which is a control panel that enables the wizard to talk to the user and remotely play audio and video files.

2. A Device Manager, which enables the wizard to control the physical home devices and to see what the subject is clicking on, if that is the case.

The set of subject agents consists of:

1. A Home Setup agent, which displays the virtual house and its devices, and where the subject may click using a pen or mouse.

2. A Telephone Simulator, where the subject can simulate a phone call and other regular telephone options.

3. A TTS Manager, which synthesizes the wizard's messages when appropriate.

4. A Log Manager, where all the interaction data is logged, and

5. A Video Client, used to simulate an outside camera.

3.4 The USE multimodal WOZ experiments

3.4.1 Motivation and Objectives

The MIMUS corpus is the result of a multimodal WoZ set of three experiments. The experimental design was stimulated by Oviatt's previous experiments and results [22] [23]. As mentioned before, the original objective of these experiments was to collect data in order to extend and configure an existing spoken dialogue system by adding new input and output modalities. The goal was to identify and gather information regarding [21]:

• any possible obstacles or difficulties to communicate,

• any biases that prevent naturalness,


Figure 3.1: The subject’s touchscreen display

• a corpus of natural language in the home domain,

• modality preference in relation to task,

• modality preference in relation to task and scenario,

• output modality preference in relation to the type of information provided,

• modality preference in relation to system familiarity

• task completion time,

• combination of modalities for one particular task,

• inter-modality timing,

• user evolution, learnability and change in attitude,

• how new modalities affect interaction in other modalities,

• context relevance and interpretation in multimodal environments,

• pro-activity and response thresholds in multimodal environments,

• relevance of scenario-specific factors/needs,


• multimodal multitasking: multimodal input fusion and ambiguity resolution, and

• multimodal multitasking.

The experiments bring some insight into the users' speech and pen multimodal integration patterns on a system application that controls lights, a blind, a radio, a heater, an alarm, the main door, a security camera, and a telephone. The interactions between users and the human wizard were recorded from different perspectives.

3.4.2 Subjects

Two groups of naive informants were recruited for the experiment. A primary group was formed by a number of wheelchair-bound subjects (17); a secondary group included subjects without disabilities (6). Informants' ages ranged between 19 and 54 years. There were a total of 7 women and 16 men among the subjects. All were native speakers of Spanish. They showed varying levels of computer expertise.

Figure 3.2: Subjects’ age distribution

3.4.3 Experiments

The experiments took place in a lab especially prepared to simulate a smart house, where all the devices to be controlled were in sight. Subjects were alone and undisturbed during the experiments. The overall architecture of the user-system multimodal interaction can be summarized as follows:

1. Experiments 1.A and 1.B: the subject is a naive user; the wizard is a trained expert.

• User moves: spontaneous speech, as well as graphical input.


• Wizard’s moves: both pre-defined and spontaneous sentences, as well as graphical input.

2. Experiment 2: the subject adopts the wizard’s role; a knowledgeable user pretends to be naive.

• User moves: pre-established commands, and graphical input.

• Wizard’s moves: spontaneous speech, and graphical output.

Figure 3.3: The experiments’ set-up

Instructions were provided for all tasks, which ranged from simple actions (turning a light on) to more complex simultaneous actions (making a phone call while monitoring the camera and opening the door).

Additional information on the subjects was collected prior to the experiments (computer expertise, etc.); more information about their perception of the interaction with the system and the system's performance was collected after the experiments.

3.4.4 User-Wizard interactions

The interaction between subject and system was recorded from different perspectives. A digital camera recorded the progression of the experiment. A web camera captured the subjects' faces as they performed the tasks (Figure 3.5). The touch-screen activity was logged.


Figure 3.4: Naive subject during the experiments (DV recording)

Figure 3.5: Naive subject (webcam recording)

3.4.5 Logging

This is the information recorded in execution time during the experiment. The logging is therefore focused on low-level information, and especially on the time at which each utterance occurs.

The information automatically logged can be summarized as:

• Modality


• Clicks

• Time of events

• Wizard Messages

• Wizard Message Time

The format chosen to record the information is EMMA [24], which distinguishes two properties for the annotation of input modality: (1) indicating the broader medium or channel (medium) and (2) indicating the specific mode of communication used on that channel (mode). The input medium is defined from the users' perspective and indicates whether they use their voice (acoustic), touch (tactile), or visual appearance/motion (visual) as input.
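As an illustration of this medium/mode distinction, the minimal Python sketch below builds one EMMA-style interpretation element for a logged click, using the standard xml.etree.ElementTree module. The emma:medium and emma:mode attributes follow the distinction described above; the namespace is the one used in the W3C EMMA drafts, and the concrete event values and timestamps are invented for the example.

    import xml.etree.ElementTree as ET

    EMMA_NS = "http://www.w3.org/2003/04/emma"   # namespace of the W3C EMMA drafts

    def make_interpretation(event_id, medium, mode, start_ms, end_ms, value):
        """Build one EMMA-style interpretation carrying medium/mode annotations."""
        interp = ET.Element(f"{{{EMMA_NS}}}interpretation", {
            "id": event_id,
            f"{{{EMMA_NS}}}medium": medium,   # acoustic | tactile | visual
            f"{{{EMMA_NS}}}mode": mode,       # e.g. voice, gui
            f"{{{EMMA_NS}}}start": str(start_ms),
            f"{{{EMMA_NS}}}end": str(end_ms),
        })
        interp.text = value
        return interp

    # A click on the kitchen-light icon, logged as a tactile/gui input (hypothetical values):
    click = make_interpretation("int1", "tactile", "gui", 12030, 12210, "kitchen_light_icon")
    print(ET.tostring(click, encoding="unicode"))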

3.4.6 Surveys

Various surveys were also conducted before and after each experiment in order to collect and compare data. Skills, experience, preferences, biases and personal data were collected beforehand. Then, after each experiment, subjects were asked about their experience with the system, their level of satisfaction, preferences, etc.

3.5 Data annotation

3.5.1 Annotation

Given the different types of analyses to be performed on these data, different levels of annotation were established. These levels can be summarized as follows:

• Personal information and user profile

• Experiment conditions and procedure

• Tasks and Subtasks

• Automatic Logging

• Dialogue

• Gestures

3.5.2 Personal information and user profile

This is the information related to the users: their computer skills, disabilities, age, gender, cultural level, degree of familiarity with speech and/or graphical interfaces, nationality and language proficiency.

It also includes the information collected in pre-experimental and post-experimental surveys regarding the users' biases towards automated interfaces and their opinions, suggestions or satisfaction level after interacting with the system.


3.5.3 Experiment conditions and procedure

This is the type of information that defines the conditions under which the experiments were conducted and the procedures followed to ensure data reliability and coherence.

The main parameters taken into account at this level were the time of day at which the experiments were conducted, their duration, the general instructions given to the subjects, and any incidents or mistakes.

The experiments were always conducted between 4:00 and 6:00 P.M., to ensure the smallest possible difference in terms of biological cycles. All three experiments were conducted sequentially and lasted between 1 and 1.5 hours.

The technicians were knowledgeable about the experiments' requirements and were also given precise instructions as to what they had to say and how. In order to ensure coherence in the instructions for all subjects, check-lists were used. Every subject was given exactly the same information, and was completely naive as to the objective of the experiments.

Some small incidents took place in some of the experiments. All of them were duly annotated and included in the corpus documentation.

3.5.4 Tasks and subtasks

The description of the tasks and subtasks to be performed by the subjects is recorded at this level. In this case it is also relevant to record the exact way in which the subjects were given the information to perform each task, and when and how such information was provided. This is particularly important since some of these tasks and subtasks were especially designed to encourage, or at least allow, subjects to perform several tasks simultaneously. It is also important in order to determine the cognitive load imposed on the subjects.

3.5.5 Automatic logging

This includes all the information logged automatically during the experiments. This is mainly low-level information (time stamps, modality, icons clicked on, etc.).

It also includes all the information predetermined and/or introduced manually by the wizard (pre-specified or spontaneous wizard messages, etc.).

3.5.6 Dialogue

This level includes transcription and segmentation of the user's utterances as well as the Dialogue Move (DM) and Subdialogue annotation.

In MIMUS, dialogue-level annotations follow the classification of the Natural Command Language Dialogues (NCLDs), as defined in [25]. Since it is the broader concept of NCL that encapsulates the present framework of analysis, it seems natural to also employ Dialogue Moves in annotating dialogue turns.

A reason for choosing the NCL approach over Traum's [26] Conversation Acts is that the former focuses on the internal aspects of the dialogue itself, whereas the latter builds up “to a level of common ground that is necessary for communication of beliefs, intentions, and obligations” [26]. That is, a model built on the grounds of an NCL should be based more on what is said than on what is in the minds of the participants


when things are said. In other words, it should try to model external aspects of the dialogue rather thanthe participants’ internal state.

The Dialogue Moves are therefore classified as follows:

• Command-Oriented Dialogue Moves

– askCommand: The system requests the user to specify a command or function to be performed.

– specifyCommand: A specific command or function is selected.

– informExecution: The system acknowledges the execution of the task.

• Parameter-Oriented Dialogue Moves

– askParameter: The system asks for the value of a specific parameter.

– specifyParameter: The assignment of some value to one parameter.

• Interaction-Oriented Dialogue Moves

– askConfirmation: Once a command has been completed, some situations will require an explicit and/or implicit confirmation.

– answerYN: The user replies yes/no.

– askContinuation: The system asks for the continuation of the dialogue.

– askRepeat: Any of the participants may request the other to repeat the last utterance, or even a specific parameter or command.

– askHelp: A request for help (general, for a specific command, or for a specific parameter).

– answerHelp: The reply to an askHelp move.

– errorRecovery: For a situation in which the continuation of the dialogue is impossible.

– greet: The usual greeting.

– quit: The usual closing operation.

As for Subdialogue Annotation, and as [25] state, “An important aspect of NLCDs is that they exhibit functional embeddings . . . [that] occur when the goal of a sub-dialogue shifts to another dialogue type,” the following types of sub-dialogues are distinguished within an NCLD:

1. Deliberation dialogue

2. Action-oriented dialogue

3. Information-seeking dialogue

4. Negotiative dialogue
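To make the annotation scheme concrete, the following minimal Python sketch represents a single annotated turn using the Dialogue Move and sub-dialogue labels defined above. The data-class layout, field names and example utterance are illustrative assumptions, not the actual ANVIL storage format.

    from dataclasses import dataclass

    DIALOGUE_MOVES = {
        "askCommand", "specifyCommand", "informExecution",        # command-oriented
        "askParameter", "specifyParameter",                       # parameter-oriented
        "askConfirmation", "answerYN", "askContinuation",         # interaction-oriented
        "askRepeat", "askHelp", "answerHelp", "errorRecovery",
        "greet", "quit",
    }
    SUBDIALOGUE_TYPES = {"deliberation", "action-oriented",
                         "information-seeking", "negotiative"}

    @dataclass
    class AnnotatedTurn:
        start: float            # seconds on the annotation timeline
        end: float
        transcription: str
        dialogue_move: str      # one of DIALOGUE_MOVES
        subdialogue: str        # one of SUBDIALOGUE_TYPES

    # Hypothetical user turn in the smart-home domain:
    turn = AnnotatedTurn(12.4, 14.1, "enciende la luz de la cocina",
                         dialogue_move="specifyCommand",
                         subdialogue="action-oriented")
    assert turn.dialogue_move in DIALOGUE_MOVES
    assert turn.subdialogue in SUBDIALOGUE_TYPES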


3.5.7 Gestures

The Gestures classification and annotation is defined according to a closed set of values for the attribute “gestureType”. These have been adapted from the SmartKom Project collection of multimodal data [27]:

1. anger/irritation

2. pondering/reflecting

3. joy/gratification (being successful)

4. surprise

5. helplessness

6. neutral/anything else

7. face partially not visible

3.5.8 Annotation Tools

ANVIL [28] is the annotation tool used for the transcription process and the encoding of the elements recorded during the experiments.

The ANVIL track “UserInput.spoken” includes the manual segmentation and transcription mentioned above. The track “UserInput.graphical” was generated automatically from the information logged (in execution time) in the XML file “gui-in.xml”. The tracks “GUIOutput.spoken” (from the log “speech-out.xml”) and “GUIOutput.graphical” (from the log “gui-out.xml”) are also loaded automatically.

The resulting ANVIL tracks are listed below:

1. Track 1: Waveform

2. Track 2: WizardActions (wizard’s clicks)

3. Track 3: UserInput.graphical (user’s clicks)

4. Track 4: UserInput.spoken (manual segmentation and transcription of user’s speech)

5. Track 5: GUIOutput.graphical (graphical output)

6. Track 6: GUIOutput.spoken (endpointed TTS speech)

7. Track 7: DialogueMoves (hand-annotated)

8. Track 8: Subdialogues (hand-annotated)

9. Track 9: FacialExpressions (hand-annotated)


3.5.9 Inter-annotator agreement

In manual coding, inter-annotator agreement plays a relevant role. It provides an easy way to obtain insights into the overall reliability of the annotation process [29]. The quality of a given annotation may be understood as the amount of agreement between two or more human annotators. This can be measured by percentage agreement, that is, the percentage of cases annotated with the same category over the total number of cases. The resulting number indicates the degree of agreement among annotators. In Bayerl's [29] own words, “percentage agreement (PA) refers to the proportion of judgements that two coders make in agreement in relation to the total number of judgements”.
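The PA measure described above is straightforward to compute; the following minimal Python sketch illustrates it on two hypothetical label sequences (the labels and values are ours, not taken from the corpus).

    def percentage_agreement(labels_a, labels_b):
        """PA: judgements made in agreement over the total number of judgements."""
        assert len(labels_a) == len(labels_b), "both coders must judge the same cases"
        agreements = sum(a == b for a, b in zip(labels_a, labels_b))
        return 100.0 * agreements / len(labels_a)

    # Two coders labelling the same five Dialogue Moves (hypothetical data):
    coder1 = ["specifyCommand", "askParameter", "specifyParameter", "informExecution", "greet"]
    coder2 = ["specifyCommand", "askParameter", "specifyCommand",   "informExecution", "greet"]
    print(percentage_agreement(coder1, coder2))   # -> 80.0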

In order to ensure a high level of reliability and coherence throughout the corpus, only two annotators performed the annotations. They were given precise instructions as to how to proceed, and worked together during the full annotation process. The set of labels and the various criteria used to annotate the events were defined and agreed in advance. Moreover, the first two subjects were annotated jointly by the two annotators, to ensure methodological coherence. The annotators were also instructed to discuss any unforeseen events or annotation anomalies before making any decisions.

Even so, once the corpus annotation process was over, the inter-annotator agreement analysis revealed some mismatches. They were immediately corrected, and a second inter-annotator analysis was then carried out.

Both reliability of classification (i.e., the allocation of categories to elements) and reliability of segmentation (i.e., whether annotators agree on how to segment elements) have been taken into account in the present analysis. The segmentation of dialogues into Dialogue Moves and Subdialogue types, together with the categorization of the resulting elements, was the focus of attention in this analysis.

Reliability of classification

The annotators coded (independently) the same instance of a user-system interaction. These are the results of the inter-annotator agreement evaluation:

• For the DialogueMoves track:

– PA=96%.

• For the Subdialogues track:

– PA=100%.

Reliability of segmentation

As regards the reliability of segmentation, agreements were counted as the number of times both annotators marked the start/end boundary at exactly the same point on the ANVIL time-line.

• For the DialogueMoves track:

– START: PA=98%

– END: PA=90%

• For the Subdialogues track:


– START: PA=100%

– END: PA=91.67%

3.5.10 Assessment of the data collection methods

Regarding self-evaluation, it is worth considering the following observations.

Subjects

• Recruitment: Subjects were recruited through two associations for people with disabilities, both of which are located in Seville: FAMS-Cocemfe and ASPHEB.

• Motivation: Most of the subjects felt extremely motivated to be part of research which could ultimately have a positive impact on their own quality of life. More often than not they shared their enthusiasm during the post-experimental questionnaire.

• Expected behaviour: The disabled subjects were entirely naive. None of them had gone through a similar (Wizard-of-Oz) experience before. They were very cooperative and eager to perform well.

Wizard

The wizard was well trained before the actual experiments were conducted. This was important since some of the wizard's tasks were complex and very demanding in terms of cognitive load. A poor wizard performance would have had a negative impact on the experiments. Nonetheless, as expected, the wizard became more proficient as the experiments progressed, and performed reasonably well at all times. It is also worth noting that the wizard ended up being more collaborative and responsive than instructed. The experiment design was not followed to the letter in terms of verbal confirmations or wizard proactivity. Therefore, part of the supporting data regarding these issues could not be collected.

Data collection

• Webcam recording: The webcam was connected to the tablet PC on which the user's platform was running. This somewhat increased the load on the computer's CPU.

• Digital videos: The process of digitizing the analog recordings was thorough and time-consuming. Tool limitations (codec incompatibilities affecting their use in Anvil) slowed down the process.

• Microphone: The clip-on microphone was attached to a lamp placed near the users. This turned out to be very useful to prevent them from making a conscious or unconscious effort to make themselves heard.

Scenario

• X-10 modules: The connection between the serial port in the wizard's computer and the X-10 modules attached to the devices was somewhat fragile; on a few occasions, the experiment had to be


interrupted artificially for a couple of minutes, until the connection was restored. The WoZ make-believe was nonetheless not jeopardized, and the subjects believed the system was recovering by itself.

• Devices: The presence of the actual devices helped users become more involved in the experiments, and build a mental image of the virtual smart house they were controlling.

3.6 The MIMUS corpus in a nutshell

The MIMUS corpus is the result of conscientious design and a rigorous, methodical, predefined procedure for conducting the experiments, as well as for logging and annotating all relevant information. It is therefore a reliable and growing source of information for research on HCI and Multimodal Dialogue Systems, as well as other related disciplines. It consists of a total of 73 dialogues, by 23 different users, in 32 different tasks.

3.7 Experiments description and results

In the pre-analysis of the tasks at hand, and prior to the design of the experiments, some hypotheses were established:

Hypothesis 1: In general, subjects use more speech-only communication than graphics-only or mixed-mode communication. However, the type of information exchanged or the context will in some cases demand graphics-only or mixed-mode communication.

Hypothesis 2: Subjects' communication needs are scenario-dependent: the same type of information may not require the same communication type in different scenarios.

Hypothesis 3: The presence of additional modalities has a significant effect on the Natural Language used by subjects.

Hypothesis 4: The preferred interaction mode may vary according to the subject's familiarity with the system. Once the subject is familiar or skilled with the system, they may choose a different interaction mode more in accordance with their personality type or physical skills. This is particularly relevant in the scenario chosen (subjects with disabilities in the Smart Home scenario).

Hypothesis 5: System response time will vary depending on the interaction mode: subjects will need more or less time to complete their requests or commands depending on the input mode used. This implies that the system should be configured with different degrees of pro-activity and/or allotted times for responses or command completion.

In order to confirm or refute these hypotheses, as well as others that emerged as the experiments were designed and the WoZ platform implemented, three experiments were conducted, two of which will be described here.


3.7.1 EXPERIMENT 1A

Objective

The main objective was to record the interactions between human users and the wizard from different perspectives, in order to gather information to configure the basic system and then test it again. Completely naive subjects provided reliable data about the first reaction of an untrained user before becoming more familiar with the system.

Subjects

The subjects were given just enough information to perform the tasks, but were not given precise instructions as to how to proceed with the system. They were given only very general information, such as “you may talk to the system”, “you may select things by touching the screen” or “you may do both things at the same time”. As far as the subjects were concerned, they were interacting with an intelligent multimodal dialogue system and no other human was involved. Subjects were provided with one task at a time, which appeared on the computer screen.

Scenario

Subjects were alone in a room especially prepared for the experiment.

Data Collection

The interaction between subject and system was recorded from all perspectives. A digital video camera recorded the experiment. Special software was used to record the touch-screen activity, and all agents in the experiment set-up logged their actions.

Methodology

The wizard was out of sight but able to hear what the subject said and to see their touch-screen. Although the subjects' input was processed and logged by a speech recognition engine, the wizard pretended to understand everything (within a predefined set of guidelines), except for a few artificially introduced recognition errors. In response to the subjects' actions, the wizard produced speech, displayed a written message or image, executed an action, or any combination of the former. When producing speech, the wizard used synthetic speech.

Tasks

For this particular experiment, tasks were very basic and focused on a general survey of the functionality available. Accessibility, friendliness, usability and naturalness were taken into account before and after the experiment.


Wizard

The wizard was a knowledgeable system expert perfectly capable of simulating the system functionality as well as its constraints. The wizard was also provided with a more detailed script that showed not only what the user knew, but also more precise instructions as to how to proceed in each case. Special tools were also developed in order to enable the wizard to simulate the system functionality within a reasonable time frame. The wizard script ensured consistent wizard behaviour throughout the experiments.

3.7.2 EXPERIMENT 1B

Objective

The main objective of this phase was not only to record the interactions between human users and the wizard, but also to compare results. These subjects were already somewhat familiar with the system and provided valuable data about learning, usability for more experienced users, possible evolution of modality preferences, possible reductions in task completion time, and possible variations in prejudice for or against automated systems.

Subjects

The subjects were given more information to perform the tasks, but were not given precise instructions as to what to say. Nonetheless, they were given more information about the kinds of expressions that are also valid, as well as examples of modality combinations that they may not have tried before. As with the previous experiment, as far as the subjects were concerned, they were interacting with an intelligent multimodal dialogue system and no other human was involved.

Scenario, Data Collection, Methodology and Wizard

Identical to the processes and conditions previously described for experiment 1A.

Tasks

Tasks were more complex than in the former experiment. Accessibility, friendliness, usability and naturalness were all taken into account before and after the experiment. Multimodal multitasking and mixed-modality events were encouraged, although not enforced, during the experiment.

3.7.3 Survey Information

The analysis of these surveys provides very valuable insight into the subjects' perception, expectations and level of satisfaction.

The basic differences between 1A and 1B were:

1. In 1B subjects were asked to use a vocative when addressing the system (“Ambrosio”).

2. Tasks were more complex and combined several everyday actions.


3. The subjects were already somewhat familiar with the system, had learnt how to interact with it, and felt a little less insecure or nervous about the tasks.

Figure 3.6: Opinion on Naturalness after 1A

Figure 3.7: Opinion on Naturalness after 1B

Although the number of subjects, tasks and dialogues does not suffice to perform a reliable statistical analysis, the results seem to indicate that the subjects felt more at ease as they became more familiar with the system. It is also possible that they were a little more forgiving with the system when the latter was personified (given a name), according to some previous studies [30].


Figure 3.8: Pre-existing bias vs. satisfaction

Bias

One of the objectives was also to observe the effect of any pre-existing biases the subjects may have had in relation to the overall subject satisfaction after interacting with the system.

According to the data, although some of the subjects had a strong bias against automated dialogue systems, this did not seem to be an obstacle. As a matter of fact, in some specific cases, the impression was that the previous bias had been annulled.

Output modality preference

As to the presentation modality of preference, the subjects' answers seem to indicate that in some cases they had no preference at all, at least not one they were fully aware of.

The results seem to confirm that even the slightest increase in the complexity of the tasks performed might have had an impact on the preferred output modality, since a number of subjects changed their opinion and appreciated the importance of written messages. Given that 30% of the subjects did not specify a modality, it may be that the information redundancy provided was also appreciated (information was at times provided by both speech and text on the screen).

Subjects’ preferred input modality

In this case, it seems that in the first experiment some users were not quite sure about the system's reliability, or about the way to interact with it. However, with the increase in the complexity of the tasks and the personification of the system, subjects' preferences became more pronounced.


Figure 3.9: Presentation modality of preference in 1A

Figure 3.10: Presentation modality of preference in 1B

As illustrated in figures 3.11 and 3.12, there is a clear evolution of the subjects' preferences towards spoken input, at least according to what they report. Some subjects seem to be “uni-modal”, that is, they choose the modality they feel comfortable with (usually speech) and stick with it. This pattern only seems to be interrupted:

• When multitasking

• When there is ambiguity

Speech is clearly the preferred input modality.

3.7.4 Multimodal interaction: performance

In the first experiments, subjects were not given any specific examples of what to say or do with the system. The information was very general and subjects knew that they could use either modality (speech


Figure 3.11: Input modality of preference in 1A

Figure 3.12: Input modality of preference in 1B

or click) or a combination of both (speech + click). In the second experiment, subjects were given specific examples and were somewhat encouraged to use a combination of modalities. This “encouragement” was not overt but induced by the experiment set-up.

The results indicate that the subjects were more inclined to use mixed-modality events in the second experiment. Those subjects who already used this strategy in 1A seemed in most cases to increase its use in 1B. Some subjects who did not use it in 1A started to do so in 1B, and the rest did not use a combination of modalities in either experiment.

At this time, it cannot be determined whether this is due to personal preferences, skills or other factors.

Time-range in mixed-modality inputs

One of the important goals of this data collection process was to determine the time frame to be expected when users combine complementary inputs in different modalities:


Figure 3.13: Actual mixed-modality inputs in 1A

Figure 3.14: Actual mixed-modality inputs in 1B

• “Turn this on” (click on an icon)

The literature [22] [23] indicates that subjects could click either before starting to speak, during the utterance or after the utterance. In order to configure the system and minimize ambiguity, it is important to know what the appropriate time frame for assuming a mixed-modality input is.


In experiment 1A, the time frame between the speech start and the click ranged from -2.44 seconds to 5.67 seconds, with the within-subject variation ranging from 1.63 to 4.52 seconds.

In experiment 1B, however, the overall time frame ranged from -3.57 seconds to 5.20 seconds, and the within-subject variation from 1.01 to 5.98 seconds.

Since the level of familiarity reached by the subjects cannot really be considered very high, these data may not be altogether definitive.

Figure 3.15: Subject evolution in hybrid inputs

Since most of the events (nearly 70%) fell in the [-2, 2] second interval, and taking into account that some clicks will occur after the end of speech (note this interval is relative to speech start), any graphical event occurring more than 2 seconds before the start of speech, or more than 2 seconds after the end of speech, is considered unrelated to the previous or following one.
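The resulting decision rule can be stated compactly; in the Python sketch below the 2-second window comes from the analysis above, while the function and variable names are our own.

    RELATED_WINDOW_S = 2.0   # ~70% of the observed events fall within [-2, 2] seconds

    def is_mixed_modality(click_t, speech_start_t, speech_end_t, window=RELATED_WINDOW_S):
        """Treat a graphical event as part of a mixed-modality input only if it occurs
        no more than `window` seconds before speech start or after speech end."""
        return (speech_start_t - window) <= click_t <= (speech_end_t + window)

    # Click 1.2 s before speech starts: fused with the utterance.
    print(is_mixed_modality(10.0, 11.2, 13.5))   # True
    # Click 4.0 s before speech starts: treated as an unrelated graphical event.
    print(is_mixed_modality(10.0, 14.0, 15.0))   # False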

3.8 Conclusions and future work

The results presented in this chapter seem to confirm most of the previous hypotheses (1, 3, 4 and 5), as well as previous results in similar experiments. In order to confirm or refute hypothesis 2, these data must be compared with similar data in a different scenario.

• The complexity of the tasks to be performed or of the information presented has an impact on the preferred presentation modality. The higher the complexity of the task or the information, the higher the need or demand for graphical presentation and modality redundancy.


Figure 3.16: Hybrid inputs per subjects in 1A and 1B

Figure 3.17: Modality distribution in 1A

• Once subjects become familiar with the system and learn to trust it, their modality of preference is clearly speech, although certain situations require other modalities.

• The time frame within which two different events in different modalities are considered related is too wide at its extremes, but can be modelled in terms of the most frequent intervals. Although the data cannot confirm it, it seems to decrease as the subjects become more familiar with the system.

This corpus contains a great deal of information. The results presented here represent an important part of the information to be extracted from the corpus, although further analysis is needed.

Experiment 2 deals with a different kind of issue, more related to subject coherence. The results obtained from this analysis will be presented in forthcoming publications.

Although the corpus does not quite have statistical weight, it does provide interesting information about the subjects' tendencies and supports the researchers' intuitions about the most convenient system configuration.


Figure 3.18: Modality distribution in 1B


Chapter 4

The SACTI Data Collections

The SACTI data collection was carried out in two parts. The first part was speech-only; the second part was a multimodal data collection. This chapter outlines the results of the multimodal data collections, with particular regard to the amount, nature and timing of the multimodal interactions collected from the users in the task.

4.1 Simulated ASR Channel - Tourist Information (SACTI) corpus

The Simulated ASR Channel Tourist Information (SACTI) corpus was collected in Cambridge during February 2004 [45]. The basic setup was a “Wizard of Oz” collection with a user and a “wizard” in the role of the system [46]. The user's utterances were transcribed and ASR-like errors added [41].

A second collection, SACTI-2, was made in October 2004 using the original setup and tasks, augmented with a map on which the user and wizard could both click [44].

4.1.1 Setup

The basic setup is shown in figure 4.1. Two experimental participants, the “subject” and the “wizard”, communicate via a simulated ASR channel. The participants are located in different rooms and cannot see each other. The speech of both participants is end-pointed (i.e., segmented into utterances for performing recognition) using a standard energy-based end-pointer.

The subject can hear the wizard directly. However, the wizard cannot hear the subject; rather, both participants are told that the subject is speaking to a speech recogniser, which will take its best guess of what the subject says, and display it on a screen in front of the wizard.

A turn-taking model patterned after typical human-computer turn-taking models is used, in which the user may “barge in” over (interrupt) the wizard, but the wizard may not interrupt the user [45]. Allowable transitions between states are summarised in Figure 4.2. In reality, the subject is speaking to a typist, who quickly transcribes the user's utterance. This transcription is passed to a system which simulates ASR errors, the output of which is displayed to the wizard [41].
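The error simulator itself is described in [41]; purely as an illustration of the idea, the Python sketch below corrupts a typed transcription at a rough target word error rate using a toy substitution/deletion/insertion scheme (the vocabulary and probabilities are made up).

    import random

    def simulate_asr_errors(transcript, target_wer=0.3, vocab=None, rng=random):
        """Toy stand-in for the ASR-error simulator: corrupt roughly `target_wer`
        of the words by substitution, deletion or insertion."""
        vocab = vocab or ["hotel", "bar", "restaurant", "street", "left", "right"]
        out = []
        for word in transcript.split():
            if rng.random() < target_wer:
                op = rng.choice(["sub", "del", "ins"])
                if op == "sub":
                    out.append(rng.choice(vocab))
                elif op == "ins":
                    out.extend([word, rng.choice(vocab)])
                # "del": drop the word entirely
            else:
                out.append(word)
        return " ".join(out)

    print(simulate_asr_errors("how do i get back to the hotel primus", target_wer=0.3))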

The task set is a number of different tourist-information scenarios. Users are presented with general goals


Figure 4.1: Setup for Wizard-of-Oz experiments

Figure 4.2: State machine for wizard interaction


to achieve. Example tasks were:

• “You are staying at the Hotel Primus, located at the corner of Alexander Street and West Loop. You are in the park and need to get back ... what public transportation options exist?”

• “You are at the fountain, and would like to find the nearest bar or cafe”

• “You have just arrived in town and are looking for a moderately priced hotel (double room)”

For SACTI-2, the interface was extended to allow both the user and the wizard to click on locations on the map. To make this more useful, the wizard could also add the locations of all the hotels, restaurants, etc. in the town [44]. A screenshot of the enhanced map is shown in figure 4.3. Six new tasks were added, involving driving a car around the town pictured in the map, parking the car, and coping with a system of one-way streets.


Figure 4.3: The SACTI-2 user map

4.1.2 Collection Summary

The SACTI-2 collection ran for 6 days. For the collection, 6 wizards were used, with each wizard running a full day's experiments with 6 different users, for a total of 36 users. Each user attempted roughly 5 tasks with their wizard. In all, 179 separate dialogues were recorded and transcribed in ANVIL XML format.

A range of ASR-simulation error rates was used. For 30 dialogues, no errors were added. A “medium” ASR error rate setting of about 30% word error rate was used for 120 dialogues. A “high” error rate of over 60% WER was used for the final 30 dialogues.


4.1.3 Analysis of Data

The collection consisted of 4,972 turns in total, of which 2,567 were user utterances. A total of 522 clicks were recorded from users.

Of all user utterances, 92% were unimodal, spoken only. A further 1.5% were click-only unimodal utterances, leaving 6.9% of utterances where the user used both the speech and gesture interfaces (multimodal). Within the 6.9% of multimodal data, over 90% of the speech and gesture acts were redundant or reinforcements, where the speech stream alone would be sufficient, i.e. of the form “The hotel is left of the cinema <click>”. Only 2.5% of the multimodal data (i.e. about 0.2% of the total user dialogue acts) actually featured any anaphoric references, i.e. “The hotel is here <click>”.

4.1.4 Observations on data

The wizards exhibited more multimodal behaviour than the users: about 18% of the wizard dialogue acts featured speech and gestures. The wizards also exhibited some learning behaviour, with the average number of clicks per dialogue tripling (from 10 to 30) over the course of the sessions. Large variations in the average number of clicks could also be observed between wizards.

Under high noise conditions (over 60% error rate), users eventually adapted to using the clicking interface more: the average number of user clicks per dialogue tripled (from 2 to 6) when moving from the medium to the high noise tasks.

For wizards five and six, the interface was altered to add the “Display all hotels/bars/restaurants” options for the wizards. Enriching the interface thus doubled the average number of user clicks per dialogue.

Two types of user behaviour have been noted [40]. The first type is sequential integration, where the user switches between modes of operation. The second is simultaneous integration, where the speech and gesture acts are concurrent and complementary. As mentioned previously, 6.9% of the collection contained multimodal data. Of this, about 35% (i.e. 2.5% overall) featured clicks during the user's speech.

There were also marked patterns of consistent user behaviour. Of the 36 users:

• 11 used no multimodal input at all

• 17 were sequential integrators (Oviatt 97)

• 8 were simultaneous integrators

4.1.5 Conclusions on the data

Generally, very little “simultaneous” multimodal data (2.5%) was collected in SACTI-2. There was very little time synchronisation between modes, and less than 1% of the collected data featured anaphoric references. Instead, the multimodal interface was used to present information redundant to the speech act. In other words, when the users used a click, this had no effect on the utterance accompanying it.

On the other hand, the wizard role seemed to elicit much more multimodal interaction. This may be in part due to the much longer time the wizards were able to use the system and the associated learning effect. However, the role of wizard seems more suited to a multimodal interface. Providing information or instructions requires a richer information channel than issuing requests, and the wizard role is suited to supplying multiple redundant cues and disambiguating references to multiple objects. The user is also typically the first to reference a place name, and the (written) descriptions tend to encourage spoken input.


Less than half of the multimodal data featured clicks during the wizard's speech. Only 2.5% of the multimodal speech featured any form of anaphora or speech where the timing of the click was important. Generally speaking, the users fell into categories of “early” and “late” clickers, with very little timing information between the modalities.

4.2 Other multimodal collections

It is possible to review the literature for other multimodal data collection experiments. Other work has suggested methods and techniques for combining multiple input modalities to produce natural user interfaces [32]. An agent-based interface was created to allow the rapid production of synergistic multimodal interfaces. Spoken language, gesture and handwriting could all be used as input modalities. A multimodal map task was developed based on this interface [33]. Following previous work with multimodal interfaces, a set of Wizard-of-Oz experiments was carried out [37]. The fully automated system was used in a set of hybrid Wizard-of-Oz experiments using the system and wizards with users [34]. Results on these data tended to follow the results on the SACTI data listed above [34]. The most common form of interaction with the system was unimodal speech or pen actions. The use of multimodal actions varied significantly over the set, with some users issuing almost no multimodal commands. Although there was some increased use of anaphora such as “this” or “here”, most users still used the full noun phrase to refer to an object in a multimodal act. These results follow the results of the SACTI data. In other words, users tended to use the multimodal interface to reinforce information and provide redundancy, and did not significantly alter the form of the speech collected.

Work by Oviatt [38] has shown more preference for multimodal actions in a Wizard-of-Oz setting. The majority of users expressed a preference for the multimodal interface, and roughly 20% of the user acts were multimodal. However, there was a significant spread of multimodal usage over the various command types: some complex commands were issued over 50% multimodally, whereas simple actions were less than 10% multimodal. The more complex a task, the more likely the multimodal interaction: “move”, “add” and “calculate distance” were the acts most likely to feature multimodality. This follows earlier observations that the more expressive an act, the more likely it is to be multimodal. In this collection, there was also little strong alignment between modalities. Users divided between simultaneous and sequential integrators, with the gestural or written act preceding the speech [39].

4.2.1 Conclusions

From analysis of the SACTI-2 data and other multimodal data collections, it appears that it is difficult to collect truly “concurrent” data from users. This can be partly attributed to the experimental setup, but the general forms of task also seem unlikely to elicit data of this sort. The multimodal data that has been collected has poor or no time alignment between words and the associated clicks.

Very few systems exist that encourage users to generate truly concurrent multimodal streams. Further, it seems that users will only interleave modalities if it is a requirement of the system. One example is the case of a drawing task where commands and gestures must be aligned. However, if it is not necessary, the user will not interact in this way. Instead, the speech is of the same form with or without clicks. Users are reinforcing the speech acts with clicks or gestures.


4.3 Consequences for multimodal fusion

4.3.1 Lack of data

In the SACTI task, only 6.9% of the total user acts were multimodal. Of the 36 users, 11 issued no multimodal commands at all. Generally, there was a reluctance to use the multimodal interface in the task. The input to the system was primarily spoken. There are several possible reasons for this, which are discussed below.

First, the task itself may not encourage or require users to use multimodal input. Ideally, to elicit multimodal interaction from users, a task would encourage concurrent and complementary use of speech and gesture acts, with little time lag when switching between modalities. To encourage multimodal interaction, the task should feature actions not possible by unimodal input alone. In addition, multimodal acts can be encouraged by the presence of multiple entities which can only be disambiguated by the use of gesture acts.

Second, the setup of the task itself may not encourage users to use multimodal input. The users had to complete a form for each task, which may have distracted them from using the multimodal interface. Also, the strict turn-taking model did not provide wizards with alignment information for the speech and click streams. Changing the interface and the prompts in the tasks more than doubled the average number of user clicks per dialogue. However, the overall proportion of multimodal acts was still extremely low.

A solution to these problems would be to replace, constrain or modify the task with the above considerations in mind. The SACTI tourist information tasks were designed to be generic and flexible in the strategies that the user and wizard could adopt. Changing the task would place more constraints on user behaviour and make the interface less natural. Another problem is that there are few examples of tasks or systems which actually require concurrent multimodal input, especially for the in-home or in-car domains. An alternative solution would be to brief or prompt the users to generate more multimodal acts. This is unlikely to yield natural responses or dialogue acts from the subjects, and is unlikely to be of any use in real systems.

4.3.2 Lack of time alignment

The majority of the users were not simultaneous integrators. Of the users who issued multimodal commands, the majority favoured sequential integration. Sequential integration would not be modelled by the original proposal to treat speech and gesture acts as a single symbol stream in the SLM.

When sequential commands were used in the SACTI task, there was still a significant variation in the time offset between modalities. Users could usually be classed as “early” or “late” clickers, and the integration between modalities was at the utterance rather than the word level. As in section 4.3.1, this could be attributed to the choice of task or the setup of the experiments. In the SACTI collections, the multimodal information was presented at the utterance end, with no information aligning clicks to words for the wizard, although sequential information about click order was provided to the wizard.

Given the relative lack of data, the further lack of time alignment between the clicks and the word sequence presents an additional modelling challenge and reduces the information content of the clicks relative to the surrounding words. In the case where the clicks are not aligned with the words, the combination of speech and gesture acts can only really be performed at the utterance end.


4.3.3 Redundancy of gesture acts

When simultaneous multimodal data did occur, the overwhelming majority of it provided redundant information. Less than 3% of the total user dialogue acts provided complementary information such as anaphoric and deictic resolution. If the gesture information is redundant, the speech act alone is a complete act.

Given an unfamiliar interface and an error-prone channel, users tended to reinforce the presentation of data through the interfaces to maximise their chances of being understood. In the vast majority of cases, the speech acts have the same form regardless of whether a click was present or not. Trying to condition or switch a language model based on the presence of a click would not improve recognition performance for the majority of the speech, and would further split the training data set.

4.4 Conclusions

Reviewing existing systems, the results so far indicate that users will only generate truly simultaneous complementary information when the interface and the form of the task dictate it. Such applications have the user in a heavily information-providing role, and encourage rapid switching of modalities with near-instantaneous feedback and reward [35]. Examples of such tasks would be “command post” tasks such as WITAS [35] or CommandTalk [36], or drawing tasks.

Such tasks are not easily suited to the TALK project in-home and in-car domains, which tend to put the user in information-seeking roles, or provide multiple complete interfaces. For the current TALK project task, the established solution of combining the information streams post-recognition, given multiple recognition hypotheses, appears to be more suitable.
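As a minimal sketch of what such post-recognition combination could look like, the Python function below re-ranks an N-best list of recognition hypotheses according to whether each hypothesis is compatible with a concurrent click; the scoring scheme, bonus value and example entities are illustrative assumptions, not the TALK implementation.

    def rescore_with_click(nbest, clicked_entity, bonus=0.2):
        """nbest: list of (hypothesis, score) pairs from the recogniser.
        Boost hypotheses that mention the clicked entity or contain a deictic word,
        then return the list re-ranked by the adjusted score."""
        deictics = {"here", "this", "that"}
        rescored = []
        for hyp, score in nbest:
            words = set(hyp.lower().split())
            if clicked_entity and (clicked_entity.lower() in words or words & deictics):
                score += bonus
            rescored.append((hyp, score))
        return sorted(rescored, key=lambda pair: pair[1], reverse=True)

    nbest = [("show me the bars", 0.55), ("show me this hotel", 0.50)]
    print(rescore_with_click(nbest, clicked_entity="hotel"))
    # -> [('show me this hotel', 0.7), ('show me the bars', 0.55)]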


Chapter 5

Evaluating Effectiveness and Portability of Reinforcement Learned Dialogue Strategies with real users: the TOWNINFO Evaluation

5.1 Overview

In this chapter we report evaluation results (also published in [53]) for real users of a learnt dialogue management policy versus a hand-coded policy in the TALK project's “TownInfo” tourist information system [55]. The learnt policy, for filling and confirming information slots, was derived from COMMUNICATOR (flight-booking) data using Reinforcement Learning (RL) as described in [52], ported to the tourist information domain (using a general method that we propose here), and tested with 18 human users in 180 dialogues, who also used a state-of-the-art hand-coded dialogue policy embedded in an otherwise identical system. We found that users of the (ported) learned policy had an average gain in perceived task completion of 14.2% (from 67.6% to 81.8% at p < .03), that the hand-coded policy dialogues had on average 3.3 more system turns (p < .01), and that the user satisfaction results were comparable, even though the policy was learned for a different domain. Combining these in a dialogue reward score, we found a 14.4% increase for the learnt policy (a 23.8% relative increase, p < .03). These results are important because they show a) that results for real users are consistent with results for automatic evaluation [52] of learned policies using simulated users [49, 50], b) that a policy learned using linear function approximation over a very large policy space [52] is effective for real users, and c) that policies learned using data for one domain can be used successfully in other domains. We also present a qualitative discussion of the learnt policy.

5.2 Introduction

In this chapter we evaluate the “TownInfo” [55] system of Deliverable D4.2, a multimodal dialogue system using reinforcement learning (RL) for tourist information scenarios. This is the first “Information State


Update” (ISU) dialogue system to employ a learned dialogue policy (using the methods described in deliverable D4.1), mapping complex dialogue contexts, or Information States (IS), to dialogue actions. Figure 5.1 shows the system's GUI, using a map and database entities developed by [61], which is used for system output only, marking the locations of various entities (hotels, restaurants, bars) on the map.

In prior work on RL for dialogue systems (e.g. [57, 58, 60, 59]), only very simple state/context representations have been used, often consisting of only the status of information “slots” (e.g. destination city is filled with low confidence), and only specific choice points were available for learning (e.g. initiative and confirmation in [58]), whereas the policy we test [52] was learnt using linear function approximation methods over a very large state space (including e.g. the speech act history) and the full range of potential dialogue actions in every state. The issues arise of how effective policies learned with this method are for real users, rather than in evaluation with simulated users [52, 49], and whether evaluation results from simulated users are consistent with those for real users.

Figure 5.1: The TOWNINFO system GUI.

We also address in this work the question of to what extent dialogue policies learnt from data gathered for one system, or family of systems, can be re-used or adapted for use in another system. We propose a general method for porting policies between domains in section 5.5. Our hypothesis is that the slot-filling policies learnt from our experiments with COMMUNICATOR will also be good policies for other slot-filling tasks – that is, that we have learnt “generic” slot-filling or information-seeking dialogue policies.

The evaluation presented here was thus designed to answer the following questions:

• How good is the learnt strategy compared to a baseline hand-coded system strategy, for real users?

• Are results for automatic evaluation using simulated users consistent with evaluation results withreal users?

• Can dialogue strategies learned in one domain be successful when they are ported to other domains?

Section 5.3 discusses related work, section 5.4 describes the system functionality, and in section 5.5 we describe how the dialogue policies learnt for slot filling on the COMMUNICATOR data set can be ported to the TOWNINFO scenarios. Section 5.6 presents the evaluation methodology and section 5.7 describes our results, including a qualitative discussion of the learnt policy (section 5.7.3).


5.3 Related work

We only know of one evaluation of a learned dialogue policy with real users, the NJFUN system [58], and we are not aware of any prior work on the portability of learned dialogue policies.

In [58], 21 subjects performed 6 tasks with the NJFUN system. However, note that the baseline strategy for comparison was not a fully hand-coded policy (as we use here), but one where random choices of action were made at certain points (the “Exploratory for Initiative and Confirmation” strategy). In these conditions task completion for the learnt policy rose from 52% to 64%, with p < .06.

Data was then divided into the first 2 tasks (“novice users”) and tasks 3-6 (“expert users”) to control for learning effects. The learned strategy led to a significant improvement in task completion for experts but a non-significant degradation for novices.

5.3.1 Automatic evaluation using simulated users

In [52] we automatically evaluated learnt dialogue management policies by running them against user simulations [49]. Both the policies and the user simulations were trained using the annotated COMMUNICATOR data described in [51] and developed on the basis of [62]. We compared our results against the performance of the original COMMUNICATOR systems, using an evaluation metric derived from the PARADISE methodology [63]. The results presented in [52] showed that the learnt policies performed better than any of the COMMUNICATOR systems, and 37% better than the best system, when tested in simulation. The important next step is to discover whether this result carries over into tests with real human users, which we present below.

5.4 TOWNINFO System overview

Two versions of the TOWNINFO dialogue system (hand-coded vs. learnt policy) are built around the DIPPER dialogue manager [47]. This system is used to conduct information-seeking dialogues with a user (e.g. find a particular hotel, bar, or restaurant). This allows us to compare hand-coded against learnt strategies within the same system (i.e. the other components such as the speech synthesizer, recognizer, GUI, etc. all remain fixed).

5.4.1 Overview of system features

The following features are implemented, as described in deliverable D4.2:

• Interfaced to learnt or hand-coded dialogue policies

• Multiple tasks: information for hotels, bars, and restaurants

• Mixed-initiative, question accommodation/overanswering

• Open speech recognition using n-grams (HTK) [64]

• Use of dialogue plans (hand-coded version only)

• Open-initiative initial question (“How can I help you?”)


• User goal/task recognition (i.e. hotels/bars/restaurants)

• Confirmations: explicit and implicit based on ASR confidence (hand-coded version only)

• Template-based NLG for presentation of database results

• Multimodal output: highlighting and naming on GUI

• Start over, quit, and help commands

• Simple user commands (e.g. “Show me the hotels”)

• Logging in TALK ISU format [51]

5.5 Portability: moving between COMMUNICATOR and TOWNINFO domains

The learnt policies in [52] are derived from the COMMUNICATOR systems corpora [62, 51], which are in the domain of flight-booking dialogues. In deliverable D4.1 and [52] we reported learning a promising initial policy for COMMUNICATOR dialogues, which was evaluated only in simulation (see section 5.3.1), but the issue arises of how we could transfer this policy to new domains – for example the tourist information domain of TOWNINFO – and test its effectiveness for real users.

There are two main problems to be dealt with here:

• mapping between TOWNINFO system dialogue contexts/information states and COMMUNICATOR information states (IS)

• mapping between the learnt COMMUNICATOR system actions and TOWNINFO system actions.

The learnt COMMUNICATOR policy tells us, based on the current context (or IS), what the optimal system action is (for example “request_info(dest_city)” or “explicit_confirm(depart_date)”). Obviously, in the TOWNINFO scenario we have no use for task types such as “destination city” and “departure date”. Our method therefore is to abstract away from the particular details of the task type, but to maintain the information about dialogue moves and the slot numbers that are under discussion. That is, we construe the learnt COMMUNICATOR policy as a policy concerning how to fill up to 4 information slots, and then access a database and present results to the user. We also note that some slots are more important than others. For example, in COMMUNICATOR it is essential to have a destination city, otherwise no results can be found for the user. Likewise, for the TOWNINFO tasks, we considered the food-type, bar-type, and hotel-location slots to be more important to fill than the others. This suggests future work on investigating learned policies for partial orderings on slots via their importance for an application.

We defined the mappings shown in table 5.1 between COMMUNICATOR dialogue actions and TOWNINFO dialogue actions, for each subtask type of the TOWNINFO system. For example, if we are in the restaurant subtask, when the learnt policy outputs the COMMUNICATOR action “request_info(dest_city)”, that dialogue move gets mapped to the TOWNINFO action “request_info(food_type)”.

COMMUNICATOR action    TOWNINFO action
dest-city              food-type
depart-date            food-price
depart-time            food-location
dest-city              hotel-location
depart-date            room-type
depart-time            hotel-price
dest-city              bar-type
depart-date            bar-price
depart-time            bar-location

Table 5.1: Porting between domains: subtask mappings for system actions and Information States.

Note that we treat each of the 3 TOWNINFO subtasks (hotels, restaurants, bars) as a separate slot-filling dialogue thread, governed by COMMUNICATOR actions. This means that the very top level of the dialogue ("How may I help you?") is not governed by the learnt policy. Only when we are in a recognized subtask do we ask the COMMUNICATOR policy for the next action. Since the COMMUNICATOR policy was learnt for 4 slots, we simply "pre-fill" a slot (origin city, since this was usually already known at the start of COMMUNICATOR dialogues) in the IS when we send it to the learned policy in order to retrieve an action.

As for the context/information state mappings, these follow the same principles. That is, we abstract over the TOWNINFO states to form states that are meaningful for COMMUNICATOR policies, using the same mapping (Table 5.1). This means that, for example, a TOWNINFO state where food-type and food-price are filled with high confidence is mapped to a COMMUNICATOR state where dest-city and depart-date are filled with high confidence, and all other state information is identical (modulo the task names).
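The following sketch illustrates this kind of action and state mapping for the restaurant subtask column of Table 5.1. The dictionary and function names, and the slot-value representation, are our own illustrative choices rather than the actual system code.

# Illustrative mapping for the restaurant subtask of Table 5.1: TOWNINFO slots are
# relabelled as COMMUNICATOR slots before consulting the learnt policy, and the
# returned COMMUNICATOR action is relabelled back into TOWNINFO terms.

RESTAURANT_SLOT_MAP = {            # TOWNINFO slot -> COMMUNICATOR slot
    "food_type": "dest_city",
    "food_price": "depart_date",
    "food_location": "depart_time",
}
REVERSE_SLOT_MAP = {comm: town for town, comm in RESTAURANT_SLOT_MAP.items()}

def towninfo_state_to_communicator(towninfo_state):
    """Relabel the slots, and pre-fill orig_city since the policy was learnt for 4 slots."""
    comm_state = {RESTAURANT_SLOT_MAP[slot]: value for slot, value in towninfo_state.items()}
    comm_state["orig_city"] = ("known", 1.0)    # the pre-filled slot described above
    return comm_state

def communicator_action_to_towninfo(comm_action):
    """Map e.g. ('request_info', 'dest_city') back to ('request_info', 'food_type')."""
    act_type, slot = comm_action
    return (act_type, REVERSE_SLOT_MAP.get(slot, slot))

# Example: food_type is filled with high confidence, the other slots are empty.
towninfo_state = {"food_type": ("italian", 0.9), "food_price": None, "food_location": None}
comm_state = towninfo_state_to_communicator(towninfo_state)
# ... consult the learnt COMMUNICATOR policy with comm_state; suppose it returns:
learnt_action = ("request_info", "depart_date")
print(communicator_action_to_towninfo(learnt_action))   # -> ('request_info', 'food_price')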

5.6 Evaluation Methodology

We implemented both the learned policy and a hand-coded policy in the TOWNINFO dialogue system of [55]. The hand-coded policy was constructed using our experience in dialogue system design. It has fixed confidence-score thresholds for determining the type of confirmation and allows mixed initiative, and is thus a reasonable "state-of-the-art" dialogue policy. This hand-coded policy has a 67.6% perceived task completion score (see section 5.7), which is comparable to the 52% task completion score for the NJFUN baseline policy [58]. Both policies had the same (fixed) information presentation routines, and the grammar, recognizer, GUI, synthesizer, and database were equivalent across the two conditions. We evaluated the system with 18 real users, via Perceived Task Completion, Dialogue Length, and subjective evaluations for each dialogue in the two conditions.

Following the methodology of [63], we presented each subject with 10 tasks (5 in each condition), controlled for learning and temporal ordering effects (i.e. for half the subjects the learned policy was encountered first, and for the other half the hand-coded policy was encountered first).
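A simple way to realise this counterbalancing is sketched below; the subject numbering and condition labels are illustrative rather than the exact assignment procedure used in the experiment.

# Sketch of the counterbalanced design: 10 tasks per subject (5 per policy),
# with half of the subjects meeting the learnt policy first.

def assign_conditions(subject_id, n_tasks=10):
    if subject_id % 2 == 0:
        first, second = "learnt", "hand-coded"
    else:
        first, second = "hand-coded", "learnt"
    half = n_tasks // 2
    return [first] * half + [second] * half

for subject_id in range(4):    # e.g. the first four subjects
    print(subject_id, assign_conditions(subject_id))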

The tasks were presented to the subjects in the following way, to prevent subjects "reading" the tasks to the system:

Task 1: You are on a business trip on your own. You need to find a hotel room in the middle of town. Price is no problem.

The users' perceived task completion (PTC) was collected as follows:

Write the name of the result that the system presented to you (e.g. FOG BAR) here: _________

Does this item match your search? Yes/No

For User Preference scores, the users were then asked, for each dialogue, to evaluate the following on a 5-point Likert scale:

• “In this conversation, it was easy to get the information that I wanted.”

• “The system worked the way I expected it to, in this conversation.”

• “Based on my experience in this conversation, I would like to use this system regularly.”

We also collected dialogue length, since longer dialogues are known to be less satisfactory for users [63], and dialogue length is a component of the reward signals used in RL for dialogue management [52, 59, 57, 58] (usually a small negative score for each system turn).

The full corpus currently consists of 180 dialogues with 18 users, and we are collecting more data with the system. The corpus is also currently being transcribed, and n-best recognition hypotheses are being generated for user utterances using ATK [64]. The final corpus will be freely released to the research community.

5.7 Results

5.7.1 Perceived Task Completion (PTC) and User Preference

As shown in Table 5.2, the results of the evaluation were a gain of 14.2 percentage points (from 67.6% to 81.8%) in average perceived task completion for the learnt policy (significant at p < 0.03), which is a 21% relative increase in PTC. User preference is not significantly different between the two systems.
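For reference, the relative increase is computed directly from these figures: (81.8 - 67.6) / 67.6 ≈ 0.21, i.e. a 21% relative gain in PTC.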

This compares favourably with the results for NJFUN [58], where task completion rose from 52% to 64% with p < .06. Thus, our results are consistent with those of [58] and support the claim that learned policies can outperform hand-coded ones.

Policy        PTC (Av. %)    User pref. (Av.)    System turns (Av.)    Reward (Av.)
hand-coded    67.6           2.75                14.9                  60.5
learnt        81.8           2.67                11.6                  74.9

Table 5.2: Evaluation of the policies (18 users, 180 dialogues).


5.7.2 Dialogue Length and Reward

For the 180 test dialogues we found that the average number of system turns in learnt policy dialogues is 11.6, whereas for the hand-coded policy dialogues the average number of system turns is 14.9 (a significant difference at p < 0.01).

This means that hand-coded strategy dialogues had on average 28.4% more system turns (3.3 per dialogue) than dialogues with the learnt policy. Combined with the task completion results above, we can see that users of the learnt policy had more effective overall interactions with the system (on average shorter dialogues with greater task completion). When we compute dialogue reward by giving a score of 100 for perceived task completion and -1 per system turn (as is common in RL approaches [52, 59, 57, 58]), the average reward of the learned policy dialogues is 74.9, versus 60.5 for the hand-coded policy (a gain of 14.4 reward points for the learnt strategy, significant at p < .03, and a 23.8% relative reward increase). This is consistent with the simulated evaluation results of [52], which showed 37% more reward (than the best COMMUNICATOR system) for the learnt policy.
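As a concrete illustration of this scoring scheme, the sketch below computes per-dialogue rewards and the relative gain between two averages; the example dialogue records are hypothetical and are not intended to reproduce the exact corpus figures.

# Dialogue reward as used above: 100 for perceived task completion, minus 1 per
# system turn. The example dialogues are hypothetical.

def dialogue_reward(task_completed, system_turns, completion_bonus=100, turn_penalty=1):
    return completion_bonus * int(task_completed) - turn_penalty * system_turns

hand_coded_rewards = [dialogue_reward(True, 15), dialogue_reward(False, 14)]
learnt_rewards = [dialogue_reward(True, 12), dialogue_reward(True, 11)]

def average(values):
    return sum(values) / len(values)

def relative_gain(new, old):
    return (new - old) / old

print(average(hand_coded_rewards), average(learnt_rewards))
print(relative_gain(average(learnt_rewards), average(hand_coded_rewards)))
# Applying the same relative-gain formula to the reported averages:
print(relative_gain(74.9, 60.5))   # ~0.238, i.e. the 23.8% relative increase quoted above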

Overall, this evaluation shows that a policy learned over large feature spaces and action sets using linear function approximation [52] can outperform a hand-coded policy, that evaluations with our simulated users [52, 49, 50] are consistent with results for real users, and that a dialogue policy learned for one domain can be ported successfully to similar domains.

5.7.3 Qualitative description of the learnt policy

The question naturally arises of why, or in what respects, the learnt policy is "better" than the hand-coded one. Qualitative analysis of the results led to the following observations:

1) The learnt policy did not confirm as often as the hand-crafted policy, which was designed to implicitly confirm utterances with confidence scores over a fixed threshold and explicitly confirm utterances with scores under that threshold (see the first sketch after these observations). This threshold was tuned by hand in the baseline policy. Note, however, that the learnt policy was not optimized at all for this threshold, since the COMMUNICATOR data does not contain ASR confidence scores for training.

2) The learnt policy could choose to skip slots, whereas the hand-crafted policy would insist on always filling a slot before moving to the next unfilled one (i.e. although the hand-coded policy does allow the user to skip ahead and overanswer, it always returns to unfilled slots). This feature of the hand-crafted policy caused problems for users whom the system had trouble recognizing. The finding is similar to the result of [48], where a learnt policy shows (in simulation) a successful "focus switching" strategy which asks for a different slot if it is encountering problems with the current slot (see the second sketch after these observations).

3) Our database consisted of 6 hotels, 6 bars, and 6 restaurants. Therefore, if the system skipped a slot, the worst thing that could happen would be to present all hotels (or bars, or restaurants) to the user. This would be more of a problem with a much larger database. Thus, in some cases, even though the learned policy did not fill all slots, users got the information they wanted, since the superset of the query results caused by unfilled slots was still small enough to be presented. Part of our future work will thus be to investigate the effect of database size on perceived task completion and general user satisfaction with such a strategy.
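First, the hand-coded confirmation behaviour described in observation 1 can be summarised by a simple threshold rule; the sketch below is illustrative, and the 0.7 threshold is an arbitrary placeholder rather than the hand-tuned value used in the system.

CONFIDENCE_THRESHOLD = 0.7    # placeholder, not the tuned value

def choose_confirmation(slot, value, asr_confidence):
    """Implicitly confirm high-confidence values, explicitly confirm low-confidence ones."""
    if asr_confidence >= CONFIDENCE_THRESHOLD:
        return ("implicit_confirm", slot, value)   # e.g. "A cheap hotel. Where in town?"
    return ("explicit_confirm", slot, value)       # e.g. "Did you say a cheap hotel?"

print(choose_confirmation("hotel_price", "cheap", 0.85))
print(choose_confirmation("hotel_price", "cheap", 0.40))

Second, the slot-skipping behaviour of observation 2 resembles the focus-switching strategy of [48]; a minimal sketch, assuming a hypothetical per-slot failure counter, is:

MAX_FAILURES_PER_SLOT = 2     # hypothetical threshold

def next_slot_to_ask(unfilled_slots, failure_counts):
    """Prefer slots that have not repeatedly failed; otherwise take the least-failed slot."""
    fresh = [s for s in unfilled_slots if failure_counts.get(s, 0) < MAX_FAILURES_PER_SLOT]
    candidates = fresh if fresh else unfilled_slots
    return min(candidates, key=lambda s: failure_counts.get(s, 0))

print(next_slot_to_ask(["food_type", "food_price"], {"food_type": 3}))   # -> food_price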

Clearly, these sorts of considerations could be implemented in a better hand-coded baseline system for comparison. However, it is policy learning itself that has revealed these strategies. In addition, we can expect that learned policies will only become better when we train them on more data and allow them to optimize on more features, e.g. ASR confidence thresholds for implicit/explicit confirmation.

5.8 Conclusion and Future directions

We reported evaluation results with 18 real users (in 180 dialogues) for a learned dialogue policy (as explained in deliverable D4.1 [54]) versus a hand-coded dialogue policy in the TALK project's "TownInfo" tourist information system [55], reported in deliverable D4.2 [56]. The learned policy, for filling and confirming information slots, was derived from COMMUNICATOR (flight-booking) data as described in [52] and deliverable D4.1 [54], ported to the tourist information domain, and tested with human users, who also used a state-of-the-art hand-coded dialogue policy embedded in an otherwise identical system. We also presented a generic method for porting learned policies between domains in similar ("slot-filling") applications.

We found that users of the (ported) learned policy had an average gain in perceived task completion of 14.2 percentage points (p < .03) and comparable user satisfaction results, even though the policy was learned for a different domain. For dialogue length, we found that hand-coded strategy dialogues have on average 3.3 more system turns (p < .01) than dialogues with the learnt policy. Combining these in a dialogue reward score, we found a gain of 14.4 reward points for the learnt policy (a 23.8% relative increase, p < .03). These results are important because they show that a) a policy learned using linear function approximation over a very large policy space [52] is effective for real users, b) the automatic evaluation of the learned policy [52] using simulated users [49, 50] is consistent with results for real users, and c) policies learned using dialogue data for one domain can be used successfully in other similar domains or applications.

5.8.1 Future directions

Future work, as well as conducting a larger evaluation and using other domains/tasks, will be to explore the notion of similarity between domains/tasks and to determine under exactly what conditions learned policies are reliably portable (e.g. comparing the number, type, and ordering constraints on slots, database size, etc.). We should also develop automatic evaluation metrics for dialogue policies in simulation that are strongly correlated with results from real evaluations.


Bibliography

[1] S. Young, M. Stuttle, and K. Weilhammer. Integrated SLMs for multiple input modes. Status report T1.4s2, TALK Project, 2006.

[2] K. Weilhammer (ed.), Andreas Korthauer, I. Kruijff-Korbayova, P. Manchon, C. del Solar, and K. Georgila. Annotated data archive. Deliverable D6.5, TALK Project, 2006.

[3] T. Becker, C. Gerstenberger, N. Perera, P. Poller, J. Schehl, J. Steigner, F. Steffens, A. Korthauer, R. Stegmann, and N. Blaylock. In-Car Showcase Based on TALK Libraries. Deliverable D5.3, TALK Project, 2006.

[4] G. Amores, S. Ericsson, C. Gerstenberger, P. Manchon, and J. Schehl (ed.). Plan library for multimodal turn planning. Deliverable D3.2, TALK Project, 2006.

[5] N. Blaylock, B. Fromkorth, C. Gerstenberger, I. Kruijff-Korbayova (ed.), O. Lemon, P. Manchon, A. Moos, V. Rieser, C. del Solar, and K. Weilhammer. Annotators handbook. Deliverable D6.2, TALK Project, 2006.

[6] N. Blaylock, C. Gerstenberger, P. Poller, V. Rieser, and J. Schehl. Description of the SAMMIE-2 experiment. Internal TALK project report, October 2005.

[7] N. Blaylock, D. Steffen, and D. Bobbert. The Saarbrucken TALK prepilot MP3 player experiment. Internal TALK project report, August 2004.

[8] Christine Duran, John Aberdeen, Laurie Damianos, and Lynette Hirschman. Comparing several aspects of human-computer and human-human dialogues. In Proc. of the 2nd SIGDIAL Workshop on Discourse and Dialogue, Aalborg, 1-2 September 2001, pages 48–57, 2001.

[9] I. Kruijff-Korbayova (ed.), G. Amores, N. Blaylock, S. Ericsson, G. Perez, K. Georgila, M. Kaisser, S. Larsson, O. Lemon, P. Manchon, and J. Schehl. Extended information state modeling. Deliverable D3.1, TALK Project, 2005.

[10] I. Kruijff-Korbayova (ed.), G. Amores, N. Blaylock, S. Ericsson, G. Perez, K. Georgila, M. Kaisser, S. Larsson, O. Lemon, P. Manchon, and J. Schehl. Modality-specific resources for presentation. Deliverable D3.3, TALK Project, 2006.

[11] I. Kruijff-Korbayova, T. Becker, N. Blaylock, C. Gerstenberger, M. Kaißer, P. Poller, V. Rieser, and J. Schehl. The SAMMIE corpus of multimodal dialogues with an MP3 player. In Proc. of LREC (to appear), 2006.


[12] I. Kruijff-Korbayova, T. Becker, N. Blaylock, C. Gerstenberger, M. Kaißer, P. Poller, J. Schehl, and V. Rieser. An experiment setup for collecting data for adaptive output planning in a multimodal dialogue system. In Proc. of ENLG, pages 191–196, 2005.

[13] I. Kruijff-Korbayova, T. Becker, N. Blaylock, C. Gerstenberger, M. Kaißer, P. Poller, J. Schehl, and V. Rieser. Presentation strategies for flexible multimodal interaction with a music player. In Proc. of DIALOR'05, 2005.

[14] D. L. Martin, A. J. Cheyer, and D. B. Moran. The open agent architecture: A framework for building distributed software systems. Applied Artificial Intelligence: An International Journal, 13(1–2):91–128, 1999.

[15] S. Mattes. The lane-change-task as a tool for driver distraction evaluation. In Proc. of IGfA, 2003.

[16] Verena Rieser, Ivana Kruijff-Korbayova, and Oliver Lemon. A corpus collection and annotation framework for learning multimodal clarification strategies. In Proc. of SIGdial6-2005, pages 97–106, 2005.

[17] Verena Rieser and Oliver Lemon. Cluster-based user simulations for learning dialogue strategies. In Proc. of Interspeech06 - ICSLP, 2006.

[18] Verena Rieser and Oliver Lemon. Using logistic regression to initialise reinforcement-learning-based dialogue systems. In IEEE/ACL Workshop on Spoken Language Technology (SLT), 2006.

[19] Verena Rieser and Oliver Lemon. Using machine learning to explore human multimodal clarification strategies. In Proc. ACL, 2006.

[20] Ulrich Turk. The technical processing in SmartKom data collection: a case study. In Proc. of Eurospeech 2001, Aalborg, Denmark, 2001.

[21] P. Manchon, C. del Solar, G. Amores, and G. Perez. The MIMUS Corpus. In LREC International Workshop on Multimodal Corpora: From Multimodal Behaviour Theories to Usable Models, Genoa, Italy, pp. 56–59, 2006.

[22] S. L. Oviatt. Multimodal interactive maps: Designing for human performance. In Human-Computer Interaction (special issue on "Multimodal interfaces"), pp. 93–129, 1997.

[23] S. L. Oviatt, A. DeAngeli, and K. Kuhn. Integration and synchronization of input modes during multimodal human-computer interaction. In Proceedings of the Conference on Human Factors in Computing Systems (CHI'97), 1997.

[24] W. Chou, D. A. Dahl, G. McCobb, and D. Raggett (eds.). EMMA: Extensible MultiModal Annotation markup language. W3C Working Draft, 16 September 2005.

[25] J. Gabriel Amores and J. Francisco Quesada. Dialogue Moves for Natural Command Languages. In Procesamiento del Lenguaje Natural, 27: 89–96, 2001.

[26] D. Traum and T. Allen. A Speech-Acts Approach to Grounding in Conversation. In Proceedings of the 2nd International Conference on Spoken Language Processing (ICSLP-92), pp. 137–140, 1992.


[27] S. Steininger, F. Schiel, and A. Glesner. User-state labeling procedures for the multimodal data collection of SmartKom. Report 28, SMARTKOM Project, 2002.

[28] Michael Kipp. Gesture Generation by Imitation - From Human Behavior to Computer Character Animation. Boca Raton, Florida: Dissertation.com, 2004.

[29] P. S. Bayerl. Investigating inter-coder agreement: log-linear models as alternatives to the Kappa statistics. Justus-Liebig University, Giessen, 2003.

[30] B. Reeves and C. Nass. The Media Equation. CSLI/Cambridge University Press, 1998.

[31] J. Carletta et al. The reliability of a dialogue structure coding scheme. In Computational Linguistics, 23(1): 13–31, 1997.

[32] A. Cheyer and L. Julia. Multimodal Human-Computer Communication, Lecture Notes in Artificial Intelligence 1374, chapter Multimodal Maps: An Agent-based Approach, pages 112–121. Springer, 1998.

[33] A. Cheyer, L. Julia, and J. C. Martin. A Unified Framework for Constructing Multimodal Experiments and Applications. In Proceedings of CMC, 1998.

[34] A. Kehler, J. C. Martin, A. Cheyer, L. Julia, J. R. Hobbs, and J. Bear. On Representing Salience and Reference in Multimodal Human-Computer Interaction. In Proceedings of the AAAI-98 Workshop on Representations for Multi-modal Human-Computer Interaction, 1998.

[35] O. Lemon, A. Gruenstein, and S. Peters. Collaborative Activities and Multi-tasking in Dialogue Systems. Traitement Automatique des Langues, special issue on dialogue, 43(2):131–154, 2002.

[36] R. Moore, J. Dowding, H. Bratt, M. J. Gawron, Y. Gorfu, and A. Cheyer. CommandTalk: A Spoken-Language Interface for Battlefield Simulations. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 1–7, 1997.

[37] S. Oviatt. Multimodal interfaces for dynamic interactive maps. In Proc. CHI96, 1996.

[38] S. Oviatt. Multimodal interactive maps: designing for human performance. Human Computer Interaction, 12:93–129, 1997.

[39] S. Oviatt. Ten myths of multimodal interaction. Communications of the ACM, 42(1):74–81, 1999.

[40] S. Oviatt, A. DeAngeli, and K. Kuhn. Integration and synchronisation of multiple input modes during multimodal human-computer interaction. In Proc. SIGCHI Conference on Human Factors in Computing Systems, 1997.

[41] M. N. Stuttle, J. D. Williams, and S. J. Young. A framework for Wizard of Oz experiments with a simulated ASR channel. In Proc. ICSLP, 2004.

[42] K. Weilhammer, R. Johnson, and M. Stuttle. D1.3: Generating SLMs from abstract context-specific grammars. Technical report, TALK Project, 2006.

[43] K. Weilhammer, M. N. Stuttle, and S. Young. Bootstrapping language models for dialogue systems. In Proc. of the International Conference on Spoken Language Processing.


[44] K. Weilhammer, J. D. Williams, and S. J. Young. The SACTI-2 corpus: guide for research users. Technical report, CUED, 2004. CUED/F-INFENG/TR 133.

[45] J. Williams and S. Young. The SACTI-1 corpus: guide for research users. Technical report, CUED, 2004. CUED/F-INFENG/TR 482.

[46] J. D. Williams and S. J. Young. Characterizing Task-Oriented Human-Human Dialog using a Simulated ASR Channel. In Proc. ICSLP, 2004.

[47] Johan Bos, Ewan Klein, Oliver Lemon, and Tetsushi Oka. DIPPER: Description and Formalisation of an Information-State Update Dialogue System Architecture. In 4th SIGdial Workshop on Discourse and Dialogue, pages 115–124, Sapporo, 2003.

[48] Matthew Frampton and Oliver Lemon. Learning more effective dialogue strategies using limited dialogue move features. In Proceedings of ACL 2006, 2006.

[49] Kallirroi Georgila, James Henderson, and Oliver Lemon. Learning User Simulations for Information State Update Dialogue Systems. In Interspeech/Eurospeech: the 9th biennial conference of the International Speech Communication Association, 2005.

[50] Kallirroi Georgila, James Henderson, and Oliver Lemon. User simulation for spoken dialogue systems: Learning and evaluation. In Proceedings of Interspeech/ICSLP 2006, 2006.

[51] Kallirroi Georgila, Oliver Lemon, and James Henderson. Automatic annotation of COMMUNICATOR dialogue data for learning dialogue strategies and user simulations. In Ninth Workshop on the Semantics and Pragmatics of Dialogue (SEMDIAL: DIALOR), 2005.

[52] James Henderson, Oliver Lemon, and Kallirroi Georgila. Hybrid Reinforcement/Supervised Learning for Dialogue Policies from COMMUNICATOR data. In IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, 2005.

[53] Oliver Lemon, Kallirroi Georgila, and James Henderson. Evaluating Effectiveness and Portability of Reinforcement Learned Dialogue Strategies with real users: the TALK TownInfo Evaluation. In Spoken Language Technology, (to appear), 2006.

[54] Oliver Lemon, Kallirroi Georgila, James Henderson, Malte Gabsdil, Ivan Meza-Ruiz, and Steve Young. D4.1: Integration of Learning and Adaptivity with the ISU approach. Technical report, TALK Project, 2005.

[55] Oliver Lemon, Kallirroi Georgila, James Henderson, and Matthew Stuttle. An ISU dialogue system exhibiting reinforcement learning of dialogue policies: generic slot-filling in the TALK in-car system. In Proceedings of EACL, 2006.

[56] Oliver Lemon, Kallirroi Georgila, and Matthew Stuttle. D4.2: Showcase exhibiting Reinforcement Learning for dialogue strategies in the in-car domain. Technical report, TALK Project, 2005.

[57] E. Levin and R. Pieraccini. A stochastic model of computer-human interaction for learning dialogue strategies. In Proc. Eurospeech, 1997.

[58] Diane Litman, Michael Kearns, Satinder Singh, and Marilyn Walker. Automatic optimization of dialogue management. In Proc. COLING, 2000.


[59] Olivier Pietquin. A Framework for Unsupervised Learning of Dialogue Strategies. Presses Universitaires de Louvain, SIMILAR Collection, 2004.

[60] Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research (JAIR), 2002.

[61] Matthew Stuttle, Jason Williams, and Steve Young. A framework for dialog systems data collection using a simulated ASR channel. In ICSLP 2004, Jeju, Korea, 2004.

[62] M. Walker, A. Rudnicky, J. Aberdeen, E. Bratt, J. Garofolo, H. Hastie, A. Le, B. Pellom, A. Potamianos, R. Passonneau, R. Prasad, S. Roukos, G. Sanders, S. Seneff, D. Stallard, and S. Whittaker. DARPA Communicator Evaluation: Progress from 2000 to 2001. In Proc. ICSLP, 2002.

[63] Marilyn Walker, Candace Kamm, and Diane Litman. Towards Developing General Models of Usability with PARADISE. Natural Language Engineering, 6(3), 2000.

[64] Steve Young. ATK: an application toolkit for HTK, version 1.4, 2004.
