
Speech Communication 48 (2006) 863–887

www.elsevier.com/locate/specom

An advanced platform to speed up the design of multilingual dialog applications for multiple modalities

Luis Fernando D'Haro a, Ricardo de Córdoba a,*, Javier Ferreiros a, Stefan W. Hamerich b, Volker Schless b, Basilis Kladis c, Volker Schubert b, Otilia Kocsis c, Stefan Igel d, José M. Pardo a

a Grupo de Tecnología del Habla, Universidad Politécnica de Madrid, Ciudad Universitaria s/n, 28040 Madrid, Spain
b Harman/Becker Automotive Systems, Ulm, Germany
c Knowledge S.A. (LogicDIS group), Patras, Greece
d Forschungsinstitut für anwendungsorientierte Wissensverarbeitung (FAW), Ulm, Germany

Received 16 February 2005; received in revised form 30 September 2005; accepted 2 November 2005

Abstract

In this paper, we present a complete platform for the semiautomatic and simultaneous generation of human–machine dialog applications in two different and separate modalities (Voice and Web) and several languages, to provide services oriented to obtaining or modifying information from a database (data-centered). Given that one of the main objectives of the platform is to unify the application design process regardless of its modality or language and then to complete it with the specific details of each one, the design process begins with a general description of the application, the data model, the database access functions, and a generic finite state diagram describing the application flow. With this information, the actions to be carried out in each state of the dialog are defined. Then, the specific characteristics of each modality and language (grammars, prompts, presentation aspects, user levels, etc.) are specified in later assistants. Finally, the scripts that execute the application in the real-time system are automatically generated.

We describe each assistant in detail, emphasizing the methodologies followed to ease the design process, especially in its critical aspects. We also describe different strategies and characteristics that we have applied to provide portability, robustness, adaptability and high performance to the platform. We also address important issues in dialog applications such as mixed initiative and over-answering, confirmation handling, or providing long lists of information to the user. Finally, we present the results obtained in a subjective evaluation with different designers and in the creation of two full applications, which confirm the usability, flexibility and standardization of the platform and provide new research directions.
© 2005 Elsevier B.V. All rights reserved.


doi:10.1016/j.specom.2005.11.001

This work was partly supported by the European Commission's Information Society Technologies Programme under contract no. IST-2001-32343. The authors are solely responsible for the contents of this publication.

* Corresponding author. Tel.: +34 91 3367366x4209; fax: +34 91 3367323.


Keywords: Automatic dialog systems generation; Dialog management tools; Multiple modalities; Multilinguality; XML; VoiceXML

1. Introduction

1.1. Motivation

The growing interest from companies in using new information technologies as a means of getting closer to final users has led to the rapid growth and improvement of automatic dialog systems. They are now able to provide services such as reservations (San Segundo et al., 2001; Lamel et al., 2000; Levin et al., 2000), customer care (Strik et al., 1997) or information retrieval (Zue et al., 2000; Seneff and Polifroni, 2000), 24 hours a day, 7 days a week. Besides, given the different characteristics and requirements of the final users, the service is expected to be available in several languages (Turunen et al., 2004; Meng et al., 2002; Uebler, 2001) and to provide quick assistance in real time using different modalities such as voice, gestures, touch screens, Web pages, interactive maps, etc. (Almeida et al., 2002; Wahlster et al., 2001; Gustafson et al., 2000; Oviatt et al., 2000). Obviously, to build this kind of system, a complete platform is necessary to design, debug, execute and maintain these services; at the same time, it has to provide the maximum number of features to the designer and final user, together with a high level of portability, standardization and scalability, to minimize design time and costs. With these objectives in mind, and based on the results and experience obtained in previous projects (Córdoba et al., 2001; Lehtinen et al., 2000; Ehrlich et al., 1997), we undertook the European project GEMINI (Generic Environment for Multilingual Interactive Natural Interfaces), developed from 2002 to 2004 (Gemini Project Homepage, 2004). The result is a complete platform consisting of a set of tools and assistants that guide the designer from the first steps of the design until the execution of the service. This platform, its assistants, the strategies to simplify the design process, and the solutions to the problems of handling multiple modalities and multilinguality in a unified design environment are what we describe here.

1.2. Alternative approaches

Nowadays, several companies and academic institutions work on the development of such tools or platforms. To begin with, we have studied several approaches to find out their characteristics, positive aspects and limitations, so that we could contribute new ideas to the field.

The following tools developed in an academic environment are significant. CSLU's RAD toolkit from Oregon University (McTear, 1999; Cole, 1999) allows the development of multimodal system-initiative dialogs (voice and images) using a representation based on state graphs (McTear, 1998), which are built using a toolbar with objects that represent the different functions, such as flow diagrams and actions in the dialog. It also provides an interface to execute Tcl/Tk scripts that increases the possibilities of interaction, data retrieval, analysis, grammar development, etc., in the modules and application development. GULAN (Gustafson et al., 1998) is a platform used to build a yellow-pages system, using voice and interactive maps on the Web, that searches for several services in Stockholm. The dialog flow is defined using a tree representation whose nodes model the structure, the focus and the actions that are going to be executed inside each dialog state. SpeechBuilder (Glass and Weinstein, 2001), developed at the Spoken Language Systems Group at MIT, lets the designer use the modules that make up the Galaxy architecture (Polifroni and Seneff, 2000) as a real-time platform, which communicates with the application over the HTTP protocol, passing parameters through a CGI script. For the design, a Web interface is provided to define, using an example-based action definition language, the relevant semantic concepts and actions that are allowed in the application.

Although all these tools are easy to use thanks to their visual interface and their real-time platform, their main problem is that, in order to use or extend their features, the designer has to know several programming languages, and they offer low standardization as they are tied to a specific execution platform. Besides, they have serious limitations when trying to implement dialog strategies that take into account the user level or when simultaneously building the application in several languages.

Regarding commercial platforms (e.g., Nuance, IBM WebSphere, Microsoft Speech Application SDK, Audium Builder, UNISYS, etc.), in general they provide several high-level tools to build multimodal and multilingual dialog applications (focused mainly on voice access systems) using widespread standards such as VoiceXML and SALT (Wang, 2002), and a support technology that accelerates the development of any service. In addition, there are several Web portals, e.g., the ones from BeVocal Cafe, Tellme Studio, VoiceGenie, etc., that help in script development, provide external hosting services, and offer a real-time platform for companies that do not have the required technology. However, a large drawback is that the runtime platform depends on the underlying technology (speech recognizer, text-to-speech systems, dialog managers, etc.), so the behavior may vary between platforms and it is difficult to integrate proprietary modules. They also present difficulties in integrating new modalities or transferring the service between operating systems.

We should also mention the growing interest in systems that use markup languages based on XML to define the input/output data exchanged between their different modules (Flippo et al., 2003; Katsurada et al., 2002; Wang, 2000). Examples of this kind of language are VoiceXML and, more recently, SALT, which have helped a lot in the definition of dialogs, while offering the flexibility and portability needed to provide complete voice services quickly (Komatani et al., 2003; Bennett et al., 2002). Considering this and several factors such as the possibility of including new modalities and improvements, the independence of the service execution platform, and the ease of debugging of tag-based languages, we decided to create our own language (called GDialogXML) as a format for internal communication between the platform assistants, and to use the standard VoiceXML and/or xHTML as output scripts for the runtime system.

Regarding systems that offer strategies to accelerate dialog design, we should mention (Denecke, 2002), where a complete three-layer architecture for rapid prototyping of dialog applications is presented. In the first layer, language- and domain-independent algorithms are provided to describe the dialog objectives, the discourse history and the semantic representation of the speech recognizer output. In the second layer, the interaction mechanism between user and system is described (e.g., variables used by the recognizer, database access variables and methods, dialog states, etc.). Finally, the third layer contains the dialog controller that uses the information from the other two layers and interacts with the final user. This system is similar to our proposal in some aspects, such as the handling of concepts to facilitate multilingual interaction, the use of special variables related to the system and dialog status, and the use of automatic templates for each dialog state. Nevertheless, as the author admits, the templates fail when not all the states of the dialog can be covered. In our system, we have tried to avoid this by using more flexible and general templates, although less automatic ones.

In (Polifroni et al., 2003) a rapid development environment for speech dialogs based on online resources is described. The development process first extracts knowledge from various Web applications and composes a dynamic database from it. The design automation is mainly based on the contents of that database. This is one of the differences with our approach: our platform uses the database structure, not its contents, to extract knowledge for the design process. Because of this, the design is more domain independent, as it is more feasible to find data structures that are similar between several services and, therefore, can be applied in several applications. Besides, there are databases whose content cannot be easily used for research purposes for security reasons, as in banking databases. Another important difference is that the speech dialog applications generated by our platform are implemented in VoiceXML, which allows the generated dialogs to be executed with any VoiceXML interpreter.

More recently, in (Pargellis et al., 2004) a complete platform to build voice applications is described. The dialog structure can be modified using a set of templates adapted to the final user of the system, as well as several resources and service features. As in our proposal, the platform automates the generation of the final script in VoiceXML, the grammars and prompts, and the application flow; nevertheless, their proposal differs in its automation efforts, which, in a similar way to (Polifroni et al., 2003), are focused more on the dynamic contents of the database than on its structure, so it could be more domain dependent.

In relation to multimodal systems, we should especially mention the work in (Johnston et al., 2002), where a multimodal architecture for a dialog system based on finite states is described. This architecture allows the synchronous multimodal input/output of data using speech and/or gestures/images with a pen on a PDA. The authors emphasize the methods followed to guarantee the multimodal interaction, and the features provided for each modality and for the design of new applications. Even though we do not provide this kind of interaction in our system right now, our future work is oriented towards the creation of a similar mechanism using a standard language such as X + V, the use and generation of multimodal grammars, and the adaptation of our platform to a more distributed architecture that guarantees the synchronization of the different modalities.

Regarding tools that can be used to debug and evaluate the application off-line, we should mention the SUEDE platform (Klemmer et al., 2000), developed by the Group for User Interface Research at Berkeley University, which offers a graphical interface to capture and analyze training data using the Wizard of Oz (WOZ) technique. The objective is to provide the designer with a way of testing the prompts and system behavior by studying the system mistakes arising from problems in real time or bad recognition of user input. In our case, we have tried to minimize some of these design problems using automatic and configurable templates for the treatment of common recognition errors.

Finally, we have found that our platform follows a very similar approach to the Agenda system (now called RavenClaw) from CMU (Rudnicky and Xu, 1999; Bohus and Rudnicky, 2003), in that the designer can create a service using a hierarchical representation of the task and its subcomponents, facilitating maintenance and scalability, and in that each state is described using a set of forms with information regarding its restrictions and optional slots.

1.3. Relevant definitions

Throughout this paper we are going to use some terms that do not necessarily have the same meaning as in the common literature or that do not have a generally accepted definition. To clarify them and avoid confusion, we define them here from the perspective of our platform.

Designer and user: The term designer refers to the person that uses the platform to build the service, and user refers to the final client of the developed service.

Multiple modalities: The common usage of the term multimodality in dialog applications refers to the ability to support communication with the user through several channels to obtain and provide him/her information (Nigay and Coutaz, 1993). The most widely used modalities are voice, gestures, mouse, images or writing, which can be combined simultaneously or otherwise during the dialog.

However, instead of multimodality, we have focused on providing multiple modalities from the designer's point of view, referring to the platform's ability to generate the service for two modalities in a unified and simultaneous way: Web and voice. Right now, these modalities work apart from each other instead of being combined (synchronized) in the real-time system.

Mixed initiative and over-answering: It is well known that the concept of mixed initiative includes over-answering, as mixed initiative is a generic term used to refer to a flexible interaction in which the user and the system work together to reach a common final solution (Allen et al., 1999). However, we preferred to differentiate them to maintain consistency with the specifications and implementation of the VoiceXML standard (McGlashan et al., 2004). In this sense, we will use the term mixed initiative to indicate the system's ability to ask simultaneously for two or more compulsory data items from the user; if the user's answer is incomplete, or the recognizer fails, new subdialogs are started to obtain the missing data. With over-answering, we indicate the user's ability to provide additional data, not compulsory at that state, to the system.

Dialog, state and action: From the terminology established by the W3C for an event-driven model of dialog interaction (W3C, 1999), we can find the following definitions:

• Dialog: a model of interactive behavior underlying the interpretation of the markup language. The model consists of states, variables, events, event handlers, inputs and outputs.

• State: the basic interactional unit defined in the markup language . . . A state can specify variables, event handlers, outputs and inputs.

In spite of the differences in these definitions, throughout the paper we will use both terms with very little difference, as they will refer, from the perspective of a finite state machine, to each interaction with the user (or a set of them) needed to fulfill a service task. Nevertheless, the term dialog will be more associated with the interaction with the user, whereas state will mostly refer to a set of interactions and other additional actions, such as a database access.

On the other hand, the term action refers to each procedure needed to complete a state or a dialog, for example: calls to other dialogs, arithmetic or string operations, programming constructs, variable assignments, etc.

Slot: This term refers to each piece of compulsory information that the system has to ask the user for in order to offer the service.

1.4. Paper organization

The paper is organized as follows. In Section 2 we present the overall architecture of the platform, its internal communication format based on XML and a brief description of its scope and limitations. In Sections 3–5, a full description of each platform module is presented, emphasizing the techniques and strategies used to ease the design of dialog applications. The separation of the sections corresponds to the three levels in the architecture. In Section 6 we present the approaches followed to provide portability and standardization to the platform, and some aspects of the implementation and the runtime system. In Section 7 we show the results of a subjective evaluation of the platform made by several designers. After that, we present some future plans in Section 8 and finish with the conclusions in Section 9.

[Fig. 1. Platform architecture. The original figure shows the three layers of the Application Generation Platform and their assistants (Application Description, Data Modeling and Data Connector Modeling Assistants; State Flow Modeling and Retrieval Modeling Assistants; User Modeling Assistant, Modality Extension Retrieval Assistant for Speech, Modality & Language Extension Assistant, Speech Generator, Web Generator and Linker), the resulting dialog model, the generated VoiceXML, grammars and xHTML, and the runtime system. Its annotations read: overall aspects related to the application, database structure and access functions; dialog flow definition in a modality and language independent way; dialog completion with all aspects that are language and modality dependent; automatic script generation for each modality (VoiceXML and xHTML); additional assistants.]

2. Platform structure

Fig. 1 shows the complete platform architecture, also called the Application Generation Platform (AGP). All these modules are independent of each other; nevertheless, they were integrated into a common graphical interface (GUI) to guide the designer through the design step by step and, at the same time, let him/her go back and forth. The platform is divided into three main layers. The reason for this division is to clearly separate the aspects that are service specific (general characteristics of the application, database structure, database access), those corresponding to the high-level dialog flow of the application (modality and language independence), and the specific details imposed by each modality and language. In this way, the designer is able to create several versions of the same service (for different modalities and languages) in a single step at the intermediate level.

In more detail, the assistants of the first layer are used to specify the overall aspects of the service (e.g., modalities and languages to be implemented, general default values for each modality, libraries, etc.); then, the database structure, not its contents, is described (classes, attributes, relations, etc.); and



finally, the database access functions, needed for the real-time system, are defined (not their implementation).

In the second layer, the general flow of the application is modeled, including all the actions that form it (transitions and calls between dialogs, input/output information, calls to subdialogs, procedures, etc.). It is important to mention that in this layer no modality/language specific details are defined, such as prompts/grammars, recognition errors or the design of the Web page, as all these will be defined in the next layer. To achieve modality and language independence, in this layer all the input/output data provided by/to the user are managed as concepts.

Finally, the third layer contains the assistants that complete the general flow, specifying for each dialog the details that are modality and language dependent. Here, the prompts and grammars for each language, the appearance and contents of the Web pages, the error treatment for speech recognition mistakes or Internet access, the presentation of information on screen or using speech, etc., are defined. Furthermore, in this layer the final scripts of the service are generated, unifying all the information from the previous assistants.

As we describe in Section 2.2, all the assistants communicate with each other using a common GDialogXML syntax. More details of the architecture can be found in (Hamerich et al., 2004a,b). In Sections 3–5, each layer and assistant of the platform is described. To clarify the design process and the interaction with the assistants, we will show some steps of an example dialog where a bank transfer between accounts is carried out, asking the user for the number of the source account, the destination account and the amount of money to be transferred.

2.1. Scope and limitations

The main objective of the platform is to allow the construction of dialog applications for multiple modalities and languages at the same time. The generated applications can be used to access services based on database queries/modifications (e.g., banking, train reservations, share prices in real time, etc.) through a telephone or a Web browser. Considering the limitations imposed by the standards used in the scripts generated by the platform (see Section 6), the current version is limited to the execution of each modality on its own. In any case, we consider that the platform is well prepared for true multimodality. The only missing pieces right now are new code elements for synchronization in our XML syntax and a new code generator (e.g., for X + V).

For the speech modality, the platform generates a script using the VoiceXML 2.0 standard, so the main limitation is the impossibility of creating user-initiative dialogs, although a certain degree of mixed initiative is allowed. Regarding the Web modality, the platform generates pages made up of Web forms (including radio buttons, text boxes, combo boxes, etc.), coded using the xHTML language, so they are accessible from a conventional Web browser. Besides, the platform allows the coding of multimedia contents (e.g., videos, recordings, images, etc.) as part of the user output. Finally, because the output is coded in xHTML, an expert designer might use it as a base to add other more complex audiovisual resources, such as animations, interactive maps, etc., using specialized Web design tools.
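As an illustration of the kind of markup involved (this is a hand-written sketch, not actual AGP output; the action URL and field names are our own), an xHTML form for the bank transfer example could look roughly as follows:

    <form action="/banking/transfer" method="post">
      <fieldset>
        <legend>Bank transfer</legend>
        <!-- the three slots of the example: source account, destination account and amount -->
        <label for="debit">Source account</label>
        <input type="text" id="debit" name="DebitAccountNumber" />
        <label for="credit">Destination account</label>
        <input type="text" id="credit" name="CreditAccountNumber" />
        <label for="amount">Amount</label>
        <input type="text" id="amount" name="Amount" />
        <input type="submit" value="Transfer" />
      </fieldset>
    </form>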

In (Allen et al., 1999) four levels of mixed initiative are identified: unsolicited reporting, subdialog initiation, fixed subtask initiative, and negotiated mixed initiative. Unsolicited reporting allows an agent to inform others about critical information out of turn. Subdialog initiation allows the system to initiate a subdialog in certain situations, e.g., to ask for a clarification. In a fixed subtask initiative, the system keeps the initiative for a task and executes the task, interacting with the user when necessary. In the negotiated mixed-initiative level, there is no fixed assignment of responsibilities or initiative, so agents can negotiate who takes the initiative and proceeds with the interaction based on it. On the other hand, (McTear, 2002) states that finite-state models are always fixed system-initiative, while frame-based systems may permit some degree of mixed initiative, but may also be fixed user-initiative. Finally, (Allen et al., 2001) survey five levels of systems in increasing complexity of software architecture: finite-state, frame-based, sets of contexts, plan-based, and agent-based models. From this perspective, our platform covers the first two levels of task complexity in Allen's classification, and supports, as a frame-based system, the lower levels of mixed-initiative interaction: unsolicited reporting and a few cases of subdialog initiation for the management of lists of objects (see Section 5.2.1).

2.2. GDialogXML

In order to ease communication inside the platform, we have developed a new object-oriented abstract language based on XML tags, named Gemini Dialog XML or GDialogXML. Its main feature is its flexibility, which allows the modeling of all application data, the database access functions, the definition of all variables and actions needed in each dialog state, system prompts, grammars, user models, the Web graphical interface, etc. This information is then used to carry out the conversion to the languages used for the final presentation of the service according to the modality (VoiceXML and/or xHTML).

Besides, the syntax allows the addition of new modalities, and the update to new versions of the script languages generated by the platform, with little effort.

As described in more detail in (Schubert and Hamerich, 2005; Hamerich et al., 2003; Wang et al., 2003), the GDialogXML syntax provides the means needed to model the following aspects: general concepts, data modeling, and dialog modeling, both in a modality- and language-independent way and in a dependent one. As general concepts, we can mention: variable and constant definition, variable assignments, file paths, arithmetic, boolean or string operations, control structures for loops and jumps, variable types (lists, objects, references to objects, atomic data), etc. For data modeling, we can specify classes with attributes, which can have simple data types such as string, integer or boolean, or complex types such as embedded or referenced objects or lists, supporting inheritance from base classes, etc. Regarding dialog modeling, all dialog models consist of dialog modules that call each other. As the length of this paper does not allow us to include the complete specification of the syntax, the reader can find it in (Gemini Project Homepage, 2004). In any case, to clarify the input/output representation used in the assistants, we will include some fragments of the generated code for some assistants, giving suitable explanations of it.

3. Framework layer

This layer has three assistants that allow the overall specification of the service, the description of the database structure and the database access functions.

3.1. Application description assistant (ADA)

In this assistant, several overall aspects of the application are specified, such as the number of modalities and languages, the location of some services such as the database access, database connection settings (total number of connection errors, timeouts), etc.; for the speech modality, the timeout values for some events such as no input, the default confidence levels for speech recognition, the maximum number of repetitions/errors before transferring to the operator, etc.; and for the Web modality, the possible errors (e.g., page not found, not authorized, timeouts, etc.). More information regarding the error handling capabilities of the AGP can be found in (Wang et al., 2003). Besides, the default overall strategy for dialogs is defined: system-driven or mixed initiative.

Finally, the designer specifies the libraries that will be used to speed up the design process. Several types of libraries can be selected, containing the definition of: data models, database access functions, lists of prompts and grammars for each language, and dialogs from the general model of the application (see Section 4.2). The platform provides some generic libraries, such as prompts and grammars for confirmations, generic data models, etc., but its main potential is the possibility of saving most of the work done in the platform as libraries, including complete dialogs, so that after the creation of a few applications, the designer will have a complete set of libraries adapted to his/her liking that can be reused in future applications. The platform allows the loading of libraries and provides the functionality to edit their code to adapt them to a new application.

3.2. Data model assistant (DMA)

This assistant defines the data structure (or data model) of the service, specifying the classes, including inheritance, attributes and types, that make up the database. It uses as input the location of the libraries and files specified in the ADA. It is possible to define a class with attributes inherited from other classes. The attributes can be of several types: (a) atomic (e.g., string, boolean, float, integer, date, time, etc.), (b) fully embedded objects or pointers to existing classes, or (c) lists of atomic attributes or complex objects. A graphical view of a class and its attributes can be seen in Fig. 2 where, for the bank transfer example, the Transaction class has been defined, which is made up of two object-type attributes of the class Account: the first one, DebitAccount, specifies the source account and the second one, CreditAccount, specifies the destination account. On the other hand, the class Account has several atomic-type attributes (balance and account number in the example) and other complex ones (account holder and list of last transactions). We can also see the code generated for the Transaction class, together with a reference to its base class, called TransactionDescription (number 1), from which Transaction inherits everything, and the attributes that will be inherited (number 2). In number 3, the DebitAccount attribute is an object reference (ObjRefr) to the class Account, and the same applies to the CreditAccount attribute.

Fig. 2. Graphical details of a class and its attributes, and code fragment generated for the Transaction class.
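Since the code fragment of Fig. 2 is not reproduced in this transcript, the following sketch approximates what the text describes. Apart from ObjRefr and the class and attribute names, which are quoted in the text, the element names and layout are illustrative assumptions and need not match the real GDialogXML schema:

    <!-- illustrative sketch only; element names other than ObjRefr are assumptions -->
    <Class name="Transaction">
      <xBaseClasses>
        <!-- (1) base class whose attributes are inherited (2) -->
        <ClassRefr name="TransactionDescription"/>
      </xBaseClasses>
      <xAttributes>
        <!-- (3) object references to the class Account -->
        <Attribute name="DebitAccount">  <ObjRefr class="Account"/> </Attribute>
        <Attribute name="CreditAccount"> <ObjRefr class="Account"/> </Attribute>
      </xAttributes>
    </Class>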

The design of the data model is accelerated in the assistant by the following features:

1. Re-utilization of libraries with models previously created, which can be copied totally or partially, or a new class can be created by mixing several original classes.

2. Automatic creation of a class when it is referenced as an attribute inside another one.

3. Definition of classes inheriting the attributes of a base class.

Finally, one advantage of this way of defining the data model is that it is not necessary to have the information contained in the database. This could be important if the real database cannot be accessed for security reasons (e.g., a bank database with confidential information regarding the clients). This was one reason to base the automation efforts on the structure of the database, not on its contents.

3.3. Data connector modeling assistant (DCMA)

This assistant allows the definition of the structure of the database access functions that are called from the runtime system. These functions are specified as interface definitions including their input and output parameters. This allows the use of database functions by dialog designers without needing to know much about database programming at all. It uses the libraries specified in the ADA and the data model defined in the DMA.

As the runtime platform itself must be independent from the backend systems and databases used in an application scenario, we leave the concrete implementation (in any suitable programming language: SQL, ORACLE, Informix, etc.) of the access functions to database or backend experts, meaning they will provide the functionality for the database functions, which have been created by the dialog experts. As the resulting model is independent from any implementation detail, it is not affected by changes in the system backend as long as the interface remains stable.

The main acceleration strategy in this assistant is the possibility of relating the input/output variables to attributes and classes from the data model, which were defined in the previous assistant. All this information is kept in the output model, which is going to be used automatically in future stages of the design process (see Section 4.2.2, item 2). In Fig. 3, the code generated by the assistant for the banking example is shown. In this case, the tag xArgumentVars (number 1) contains the information regarding the input parameters (the debit account number, the destination account and the amount to be transferred) and the tag xReturnValueVars (number 2) contains the return arguments (in this case, a boolean variable that indicates whether the transfer is successful or not).
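Since Fig. 3 itself is not reproduced in this transcript, a sketch of the kind of interface definition it describes is given below. The tags xArgumentVars and xReturnValueVars and the variable TransactionPerformed are taken from the text; the function name, the remaining element names and the types are illustrative assumptions:

    <!-- illustrative sketch only -->
    <Function name="MakeTransfer">
      <xArgumentVars>                                <!-- (1) input parameters -->
        <Var name="DebitAccountNumber"  type="String"/>
        <Var name="CreditAccountNumber" type="String"/>
        <Var name="Amount"              type="Float"/>
      </xArgumentVars>
      <xReturnValueVars>                             <!-- (2) return arguments -->
        <Var name="TransactionPerformed" type="Boolean"/>
      </xReturnValueVars>
    </Function>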

Fig. 3. GDialogXML code generated by the DCMA for the bank transfer.

L.F. D’Haro et al. / Speech Communication 48 (2006) 863–887 871

4. Retrievals layer

In this layer, the service flow is defined at a high level, i.e., in a language and modality independent way, so everything is done using concepts.

4.1. State flow modeling assistant (SFMA)

This assistant is very important because it drastically accelerates the design process, especially in the next assistant. As input, it uses the general strategy for the service specified in the ADA and the data model specified in the DMA. Then, the designer has to specify the states that make up the dialog flow, the data (slots) that have to be filled by the user in each state and the transitions between the current state and the following one(s). Additionally, it is possible to specify which slots are optional (for over-answering) and which ones can be asked for using mixed initiative. Here, only the flow structure is defined, not the conditions that determine the transitions between states, internal actions, or other more detailed aspects, because these are defined in the next assistant in a rule-based manner.

Fig. 4 shows the code generated by the SFMA for the example dialog; it includes information regarding the slots (field xInputFieldVars), the dialog transitions (field xCalls), and generic information about the application, such as the name of the initial dialog. The figure also shows the definition of the state where the bank transfer data are collected. In this example, only the account names have been selected in that state, and the collection of the amount to be transferred has been left for the next state, called GetTransactionAmount. Besides, both slots are collected using mixed initiative (the tag xIsMixedInititative is set to true).
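Again, since Fig. 4 is not reproduced in this transcript, the following sketch gives an idea of the information the SFMA stores for this state. The tags xInputFieldVars, xCalls and xIsMixedInititative and the state names come from the text; everything else is an illustrative assumption:

    <!-- illustrative sketch only -->
    <State name="TransactionDialog">
      <xInputFieldVars>                    <!-- slots collected in this state -->
        <Var name="DebitAccountNumber"/>
        <Var name="CreditAccountNumber"/>
      </xInputFieldVars>
      <xIsMixedInititative>true</xIsMixedInititative>
      <xCalls>                             <!-- transition to the next state -->
        <Call state="GetTransactionAmount"/>
      </xCalls>
    </State>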

As a speed-up strategy, the slots can be defined by making reference to atomic attributes of the classes defined in the data model, which is the most usual case, or as independent items. These references to the data model ease the work in the following assistant, where action proposals for that specific dialog are automatically presented (see Section 4.2.2). Moreover, if a transition to an undefined state is specified, that state is created automatically.

4.2. Retrieval modeling assistant (RMA)

In this assistant, a detailed definition of each dialog in the application is made; it therefore uses all the information from the previous assistants, generating the detailed definition of all the actions (e.g., loops, if-conditions, math or string operations, transitions between states, information regarding mixed initiative and over-answering, calls to dialogs to provide/obtain information to/from the user, etc.) to be carried out in each state defined in the SFMA, or in new states that can be defined later.

4.2.1. Capabilities and dialog types

Given the large number of actions that can be carried out in each state, a large programming effort was necessary here, aiming at automation and

Fig. 4. GDialogXML code generated by the SFMA.


flexibility. Starting with the main window, the assistant offers several editing and visualization capabilities, such as a tree-structured flow diagram where each leaf and branch represents the states and possible transitions defined in the previous assistant. A color-coding convention shows whether a dialog has been edited or not, the dialog type, etc. In addition, it is possible to access information regarding the actions and variables already defined for each leaf (dialog) in the flow. All the automatically generated dialogs (DGets and DSays), libraries and database access functions already defined can also be used and edited. Other available capabilities are the safe creation/deletion of dialogs, variables and constants, and the visualization of information from previous assistants.

The platform provides four basic dialog types that cover the usual possibilities in programming: based on a loop, based on a sequence of actions (or subdialogs), a switch construct based on information input by the user, or a switch construct based on the value of a variable. Besides, empty dialogs, with no action inside, can be created (used to specify the call to a dialog that will be defined completely afterwards) so that a top-down design of dialogs can be made; in this case, the dialog type is selected whenever the designer tries to edit the empty dialog. Another possibility is dialog cloning, which is useful when the dialog to be defined is very similar to an existing one.

The tool also provides the possibility of manually creating dialogs to obtain information from the user (called DGet) and dialogs to provide information to the user (called DSay), which will be described further in the next section.

4.2.2. Strategies to accelerate the design

The following useful strategies have been added to this assistant to accelerate dialog design (D'Haro et al., 2004):

(1) Automatic dialogs: When the RMA is started, it analyses the information from the Data Model, looking for all attributes defined as atomic types, to generate dialogs to obtain information from the user (DGet) and dialogs to provide information to the user (DSay). These dialogs include a tag used by the Modality Extension Assistants (see Sections 5.2 and 5.3) to find out whether the prompt/output concept to be presented to the user (for DSay), the grammar/input concept used by the recognizer/Web generator and the confirmation strategies (for DGet) have to be specified.

If the attributes are complex or include object inheritance, the assistant is not able to generate automatic dialogs for them. However, the assistant provides configurable DSay dialogs using a template (see Fig. 5) that shows the class and its attributes, expanding the complex attributes (with inheritance and objects) and selecting any attribute that will form part of the prompt. Other DSay dialog templates are also available: a generic DSay to provide concepts, a configurable DSay to present variables from a dialog, a DSay to present lists of objects, a specific DSay to provide the value returned by the database access functions, and predefined DSays such as Welcome, Goodbye, Transfer to operator, etc. Besides the DGet and DSay dialogs, dialogs from loaded libraries and database access functions can be used.

Fig. 5 shows all the dialogs mentioned above, with an example of the configurable template for the Transaction class called 'DSay for Transaction', where several attributes have been selected that will be provided to the user in the real-time system. The flexibility of this template lets the designer select attributes from the different child classes of the Transaction class (e.g., AccountNumber), as well as complex attributes coming from inherited classes and contained in another class (e.g., LastName from class AccountHolder included in class CreditAccount). When the dialog definition is over, it is added to the list of dialogs in the DSay tab, so that it can be used later on.

(2) Automatic generation of action proposals in each state: Every time a dialog defined in the SFMA (see Section 4.1) is edited, a popup window called 'SFM proposals' is shown (see Fig. 6), in which the assistant includes all the actions that are considered relevant (and very likely to be used) for that state. To decide which actions are relevant, all the information already defined in previous assistants, especially the SFMA, is analyzed using the following strategies for each of the four sections:

Fig. 5. Auxiliary screen of the RMA and popup window for dialog configuration.

Fig. 6. Example with automatic dialogs and database access function proposals.


• Slots asked in the current state, the transitions and the corresponding slots in those destination states. Direct information from the SFMA is used for that.

• State-specific DGets: to select them, the system looks for the slots defined in the SFMA; if they are related to the Data Model, the system selects the corresponding dialogs automatically; if not, a more relaxed criterion is used, which is to look for a match in the name or attribute type.

• Database access functions: to filter the possible functions already defined in the DCMA, the system first considers functions with the same number and type of input parameters as the defined slots for the current dialog. The next criterion is as follows: if the parameter includes a reference to the data model (see Section 3.2), there should be a match in class and attribute between slot and parameter; if not, they should match in type. If no function passes these filters, a more relaxed filter is applied (e.g., similarity between names). If, even with the relaxed filter, there is no function in this window, it probably means that there is no database access function suitable for that state and that it should have been defined before (see Section 3.3). The assistant offers the possibility of creating those functions and then reloading the information in this window.

• State-specific DSays: they are selected in a similar way to the DGet dialogs, but we also include DSays specific to the values returned by the database functions selected in the previous step.

Fig. 6 shows the banking application example where, given the currency name, the system provides its specific information (buy and sell price, general information, etc.). All the designer would need to do is select the corresponding DGet in the window (DGet_ATTR_CurrencyName_IN_CLASS_Currency), then the database access function GetCurrencyByName, and finally the DSays that provide the desired attributes of the currency. To finish the design, he/she would drop the call to the next state (e.g., AskOtherExchangeRates).

(3) The passing of arguments between actions is automated: This is a critical aspect of the design of dialog applications. Several actions and states have to be 'connected', as they use the information from the preceding dialog.

To automate this connection, the assistant detects the input/output variables required in each action and, using a popup window, offers the most suitable already defined variable of a compatible type; if there is more than one variable of a compatible type, the assistant sorts them according to the name similarity between variable and dialog. If there is no compatible variable already defined in the system, or the name proposed by the assistant is not desired, a new local or global variable can be created in the same window.

Moreover, if the designer makes a mistake or needs to edit the matching made in the previous steps, the assistant provides a window where all this matching can be edited.

4.2.3. Mixed initiative and over-answering

This capability allows the creation of complex dialogs where the system can ask for several slots at the same time or the user can answer with optional information. Even though these two functionalities might be considered speech-modality dependent or unnecessary for the Web modality, and therefore out of place at this stage, we preferred to include them here for two main reasons: first, because some other modality included in the future might benefit from them; and, second, because they make it possible for the user not to have to fill in all the required fields of a Web page at the same time. In this case, the Web system would detect that a certain slot is missing and, instead of generating an error, it would ask for the missing data in a later form, in a similar way as VoiceXML handles mixed initiative with several slots.
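For reference, this is how a mixed-initiative form for the two account slots might be written directly in VoiceXML 2.0. It is a generic hand-written example, not code produced by the AGP, and the grammar file and form/field names are our own:

    <form id="transaction">
      <!-- form-level grammar that can fill both fields from a single utterance -->
      <grammar src="transfer.grxml" type="application/srgs+xml"/>
      <initial name="start">
        <prompt>From which account and to which account do you want to transfer money?</prompt>
      </initial>
      <!-- fields asked individually when the initial answer leaves them unfilled -->
      <field name="debitaccount">
        <prompt>Which is the source account?</prompt>
      </field>
      <field name="creditaccount">
        <prompt>Which is the destination account?</prompt>
      </field>
      <filled mode="all">
        <goto next="#gettransactionamount"/>
      </filled>
    </form>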

To provide this functionality, the system offers a Mixed Initiative Template which the designer can drag and drop over the dialog being edited. The template shows the available slots, which can be selected (by default, the ones specified in the SFMA for the current dialog). Moreover, the template gives the possibility of adding optional slots to be used for over-answering at the same time. With this information, the system generates the necessary programming code (calls and automatic dialogs) in GDialogXML syntax that controls the mixed initiative handling: asking for several slots at the same time; handling the situation in which the user answers partially or only some of the slots are filled after recognition, so that the system has to ask again for the unfilled slots; and keeping the optional slots when they are provided by the user.

To allow over-answering, the procedure is very similar: when the designer drops any DGet (action to obtain data from the user), he/she is offered the option of selecting additional slots for over-answering, both from that specific state and from the following states in the flow (with a limit of two levels in the hierarchy). By default, the slots defined as optional in the SFMA are automatically converted into over-answering slots here. In the runtime system, the behavior is that before any DGet the system checks whether the data to be asked for has already been obtained in a previous state in the flow (as would be the case with over-answering). To help in this checking, in the final script all slots are declared as global variables, so they can be accessed from any state.

Fig. 7. Code generated by the RMA for the bank transfer example.


4.2.4. RMA output

We can see the output generated by this assistant in Fig. 7. The first section shows the global variables of the application, which store the slots defined for the application that may need to be accessed in all dialogs (for over-answering, as we will see in Section 5.5.1). The actions executed in the two dialogs that form the bank transfer (included in the xReaction tag) are also shown: SFM_TransactionDialog (number 1) and SFM_GetTransactionAmount (number 4). In the first one, there is a call to a subdialog (marked as number 2) that fills the source and destination account data using mixed initiative; then, there is a call (marked as number 3) to the second dialog. In the second dialog there is a call to a subdialog that collects the amount to be transferred (number 5), a call to the database access function (number 6), using as input parameters the three items already collected, which returns a boolean variable called TransactionPerformed, and a call to the next two dialogs specified in the SFMA (numbers 7 and 8).
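To give an impression of the structure just described (Fig. 7 is not reproduced in this transcript), a much-simplified sketch follows. Only xReaction, the two SFM dialog names and the variable TransactionPerformed are taken from the text; the remaining element, dialog and function names are illustrative assumptions:

    <!-- illustrative sketch only -->
    <Dialog name="SFM_TransactionDialog">
      <xReaction>
        <Call dialog="MixedInit_GetAccounts"/>          <!-- (2) fills both account slots -->
        <Call dialog="SFM_GetTransactionAmount"/>       <!-- (3) next dialog -->
      </xReaction>
    </Dialog>
    <Dialog name="SFM_GetTransactionAmount">
      <xReaction>
        <Call dialog="DGet_Amount"/>                    <!-- (5) collects the amount -->
        <CallFunction name="MakeTransfer"
                      returns="TransactionPerformed"/>  <!-- (6) database access -->
        <Call dialog="DSay_TransactionResult"/>         <!-- (7) -->
        <Call dialog="SFM_Goodbye"/>                    <!-- (8) -->
      </xReaction>
    </Dialog>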

5. Dialogs layer

In this layer, the dialog is completed with all modality and language dependent aspects. It has four main assistants that are dedicated to the following tasks:

• To define the user levels and their settings (UMA, see Section 5.1),

• To complete the modality dependent aspects of dialog design (MERA, see Section 5.2),

• To complete the language dependent aspects and the input and output concepts for each modality (MEA, see Section 5.3) and, finally,

• To unify all the information and generate the execution scripts according to the modality (see Sections 5.4 and 5.5).

Finally, in Section 5.6, we will briefly outline some additional assistants.

5.1. User modeling assistant (UMA)

This assistant allows the specification of different user levels and settings for each dialog in the application in order to provide more personalized attention to the final user. It uses as input the default confidence and error values defined in the ADA, and all the dialogs defined in the RMA.

To start with, all the values are specified for some specific user levels, but they can later be customized for each specific dialog state, so that all settings can be both user-level dependent and dialog-state dependent. This way, the designer may impose, for example, a stricter confirmation for some critical data, such as the amount in a banking transaction. To speed up the process, defaults are used: user-level settings take the values defined in the ADA and dialog-dependent settings inherit the user-level settings.

The designer can specify different settings, such as the possibility of barge-in for a particular user level, the maximum number of retries if there is an error (considering several error types), the maximum timeouts for several events, etc. Besides, the confidence levels that should be used in recognition for each user level are specified, as they determine the confirmation type that should be used (see Section 5.2.2): no confirmation (confidence between the specified value and 1.0), implicit (confidence between the specified value and the value for 'no confirmation'), explicit (between the specified value and the value for 'implicit') and repeat (between 0 and the value for 'explicit').

The decision as to the current user level is made by a runtime component that is called after each interaction in the script generated by the platform and that sets the common internal variable used in the final script. This way, the platform is independent of the user modeling technology that is used.

5.2. Modality extension retrieval assistant for speech (MERA)

In this assistant, the modality extension is completed by adding special subdialogs that complement the dialogs already defined for the application in the RMA. This way, the designer can include complex dialogs to deal with modality-specific problems. We have focused on researching semiautomatic solutions for two basic problems that are specific to the speech modality: the presentation of object lists in several steps (applied to DSay dialogs concerning a list) and confirmation handling, i.e., how to handle recognition errors in dialogs that obtain information from the user (applied to DGet dialogs). The input is the database model specified in the DMA and the dialogs defined in the RMA (especially those marked as DGet and DSay for lists).


5.2.1. Presentation of object lists

Object lists are the result of a database query, so there is usually a lot of information to be provided to the user. We consider four different cases as a function of the number of items in the list. For each case, a simple form allows the designer to specify the actions that have to be carried out. After filling in the forms, the actions and new dialogs needed to provide/obtain information to/from the user are automatically generated. Furthermore, the assistant proposes the most reasonable default values for all dialog and slot names after analyzing the input files. The four different cases and their actions are as follows:

1. The list is empty. The system tells the user that there is no available information and then jumps back to a state selected by the designer, where the user is asked again for some selected slots defined by the designer, looking for a less restrictive query.

2. The list has one item. The designer defines a configurable DSay that provides complete or partial info from the item found.

3. More than one item and fewer than the maximum allowed. This is a more complex situation, as the items have to be provided in groups. After playing the information in each group, the user is asked if he/she wants to continue, repeat the group, begin from scratch, exit or select a specific item to receive more detailed information (a sketch of such a command menu is given after this list). Here, we have used some command names proposed by the Universal Speech Interface project (Toth et al., 2002). When the user selects the item he/she desires, the system provides detailed information about the object using a new selection of attributes specified by the designer. Another situation that we must face is when the system finishes reading the whole list and the user does not like any item or has cancelled before the end of the list. In this case, the system informs the user and repeats the same process as for case 1. The designer also has the option of informing the user of how many items there are in the list, and the user may choose how many he/she wants to listen to.

4. More items than the maximum allowed. As thereare too many items, the search has to be morerestrictive. We can have two different situations.First, if all slots of the application are alreadyfilled, the user has to change some of them tomake them more restrictive (e.g., he/she wants

last month’s transactions but there are toomany). The designer specifies these slots (theywill be cleared) and the questions will berepeated, in a similar way as in case 1. In the sec-ond situation, if there are still some slots to beasked the system continues with the normal dia-log flow until the next database query. We havealso considered a simplified case: when the listonly depends on one slot input by the user, e.g.,when he/she asks for a list of banking transac-tions. In this case, we present a simplified versionof the previous windows where the designer doesnot need to specify the slots to be cleared.
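As an illustration only, this case selection could be flattened into VoiceXML conditions along the following lines; the variable, form and maximum-size names are ours, not the identifiers actually produced by the assistant:

<form id="DispatchItemList">
  <block>
    <!-- itemList is assumed to hold the result of the last database query -->
    <if cond="itemList.length == 0">
      <goto next="#RelaxQuery"/>          <!-- case 1: jump back and clear some slots -->
    <elseif cond="itemList.length == 1"/>
      <goto next="#SayItemDetails"/>      <!-- case 2: configurable DSay for the single item -->
    <elseif cond="itemList.length &lt;= maxListSize"/>
      <goto next="#BrowseListInGroups"/>  <!-- case 3: present the items in groups -->
    <else/>
      <goto next="#RestrictQuery"/>       <!-- case 4: ask for more restrictive slot values -->
    </if>
  </block>
</form>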

5.2.2. Confirmation handling

The result of speech recognition has to be confirmed before making a database query. We confirm normal slots, mixed-initiative slots and over-answering slots. We have considered two types of confirmation in this assistant: Simple and Complete. Simple is recommended for dialogs that need a very high confidence, such as Yes/No or password questions; in this case only two levels are allowed: none and repeat the question (as in a no-match situation). Complete, instead, uses several levels of confidence to determine the confirmation type: none, implicit (the recognized value is embedded in the next system question), explicit (the user is asked directly to confirm the value) or repeat the question.

The MERA uses the same common internal variable mentioned for the UMA to store the confidence value returned by the last recognition call. Then, the final script tells the system to compare this value with the current confidence limits, stored in four fixed-name variables, as defined in the UMA for each user level and specific dialog.
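A minimal sketch of this comparison in VoiceXML is given below; the variable names are illustrative, not the platform's fixed names, and the fragment would sit inside a generated form:

<block>
  <!-- xConfidence_global holds the confidence of the last recognition call -->
  <if cond="xConfidence_global &gt;= xNoConfirmationLimit">
    <assign name="xConfirmationType" expr="'none'"/>
  <elseif cond="xConfidence_global &gt;= xImplicitLimit"/>
    <assign name="xConfirmationType" expr="'implicit'"/>
  <elseif cond="xConfidence_global &gt;= xExplicitLimit"/>
    <assign name="xConfirmationType" expr="'explicit'"/>
  <else/>
    <assign name="xConfirmationType" expr="'repeat'"/>
  </if>
</block>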

The assistant automatically selects the input dialogs (DGet) that need confirmation and analyzes their flow to propose the most suitable confirmation type (Simple or Complete); the designer can change that proposal, and the assistant checks whether the chosen type is feasible. The algorithm that analyzes the dialog is as follows: first, the system examines the number and type of slots to be retrieved by the DGet dialog; if there is only one slot, its type is Boolean or string (as used to contain an alphanumeric password) and the number of actions in the calling dialog is not too high, the system selects the Simple case; otherwise, it selects the Complete case. The assistant also controls whether implicit confirmation can be allowed. For example, if the next step in the dialog flow is the database access, explicit confirmation should be used regardless of the confidence level. Moreover, the assistant stores the name of this dialog and, with this information, sets the place to jump back to in case the user rejects the implicit confirmation in the following state in the flow (the rejection is detected in that state).

Finally, the assistant automatically generates the dialog flow (consisting of calls to internal, automatically generated dialogs for each type of confirmation and dialog state) to carry out all the confirmation and subsequent correction. These internal dialogs are named in a systematic way (following the USI project conventions) so that the designer can easily identify them when he/she has to define grammars and prompts in the next assistant.

5.3. Modality and language extension assistant (MEA)

Here, the language-dependent aspects of an application (input and output prompts/concepts) are specified for each modality. For the speech modality, the extensions consist of links to grammar and prompt concepts, while for the Web modality they consist of links to input and output concepts. In both cases, the extensions are language independent. In addition, language-dependent information, specifically the wording of both speech prompts and Web output concepts, is also set here. All this information is saved in different files for each language and modality, whose content and organization are explained at the end of this section. As input, the assistant uses all dialogs defined in the RMA and MERA, together with the specification of the user levels from the UMA. The assistant detects the input/output dialogs (DGets and DSays) defined in previous assistants and asks the designer to define prompts and recognition grammars for them. In addition, the assistant lets the designer define the help prompts for high-level dialogs that are not classified as input/output.

For the speech modality, several prompts have to be defined for each input dialog: the default one, those for the different user levels, and those for all possible recognition errors. In all dialogs, the input/output parameters and the global variables can be used as part of the prompt. To speed up the process of typing all these prompts, the assistant offers two possibilities: reusing prompts already available for the current application, or reusing prompts generated in previous applications and saved as libraries. Prompts are set using three alternatives: text-to-speech (TTS) prompts, prerecorded audio files, or prompts generated by a Natural Language Generation (NLG) module in the runtime system.

In the case of TTS prompts, the SSML markup language can optionally be used. The tags that we have considered for the runtime system are as follows (a short example combining them is given after the list):

1. "emphasis" (to emphasize specific fragments), with the following values for the "level" attribute: "strong", "moderate", "none", and "reduced";

2. "break" (a break of a specific duration in ms), with the "time" attribute;

3. "prosody", with "pitch", "rate" and "volume" attributes. To specify them, in the platform we use a relative value expressed as a positive or negative percentage, e.g., "+10%".
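For instance, a confirmation prompt of the banking example could combine these tags as shown below; the wording and the slots used are only illustrative, and the relative percentage follows the platform convention described above:

<prompt>
  You are about to transfer
  <emphasis level="strong"> <value expr="Amount"/> </emphasis> euros
  to account <prosody rate="-10%"> <value expr="DebitAccountNumber"/> </prosody>.
  <break time="500ms"/>
  Is that correct?
</prompt>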

In Fig. 8 we can see how this information is input using the assistant, together with the SSML tags used and the slots that serve as arguments for the prompt (e.g., DebitAccountNumber). We can also see, in the bottom part, the sections dedicated to the specification of prerecorded audio files (Audio Prompt) and the file used by the Natural Language Generation (NLG) module to generate the prompt in real time. For NLG, the prompts can be defined using the Language Modeling Toolkit (see Section 5.6), so they are coded in JSGF format.

Once the prompts for the main language have been specified, the designer has to specify them for the additional languages. This process is accelerated by using the main-language prompt as a template in which only the string parts have to be edited. These prompts can be specified either at once for one language for all dialogs, or for each dialog for all additional languages.

For the Web modality, the procedure is somewhat different because the concepts for user interaction differ. On the one hand, each output concept corresponds to some xHTML markup code, optionally parametrized. On the other hand, each input concept corresponds to a Web form control such as a text input field, a text area, a select or a choice box, etc. In addition, a set of attributes can be defined for each component: text elements for a label, a hint, an alert or an error message can be set, and the rendering behavior of the control can be defined.

Fig. 8. Example of the definition of a TTS prompt using SSML tags.

As output, the assistant generates four different files for each modality. The first file contains information regarding every dialog in the application and references to the input/output concepts used in each one. In Fig. 9 we can see the code of this file for the speech modality and for the dialog that collects the amount to be transferred in the bank example. We highlight the use of the tag Realisation because it tells the linker that this code is an extension of a dialog already defined in the RMA. Besides, the tag xPresentation holds the information related to the system prompt concepts (marked as number 1); the tag xFilling gives the information related to the behavior of the recognizer, i.e., the prompts used to inform the user of an unrecognized utterance (number 3) and of no input detected (number 4), together with the grammar to be used in the recognition (number 2). As we have already mentioned, multilinguality is achieved using concepts, so all definitions here for prompts and grammars refer to concepts (PC suffix for prompts and GC for grammars). These references are resolved in auxiliary files that we describe below. Finally, we should mention that for the Web modality the same tags are used (xFilling and xPresentation) but, instead of prompts and grammars, input (tag InputControlCall) and output (tag OutputControlCall) concept references are used.

The second file, in the case of the speech modality, is called the 'grammar concept file', and it contains the association between each 'grammar concept' (GC) and the filename of the grammar(s) that will be used in the real-time system, so it is language independent. As we have different grammars for each language, to achieve multilinguality the grammars in all languages have the same name but are stored in separate directories; the directory name is the language code (e.g., en_UK or es_ES), so the real-time application just concatenates the language code with the filename to retrieve the correct grammar. The third file is the 'prompt concept file'; it contains, for each input/output dialog, the association between each 'prompt concept' (PC) and the name of the text concept or the audio file that has to be used for it (not the prompts for each language). The fourth file, called the 'text concept file', holds the actual prompts (the real texts in SSML format, as mentioned above) that correspond to the text concepts defined in the 'prompt concept file'. Therefore, this file is language dependent and is repeated for every language and, again, is kept with the same name in separate directories. This separation might seem complicated, but it is the only way to ensure multilinguality and the flexibility to handle audio files, prompts, etc., in the same application.

Fig. 9. GDialogXML code generated by the MEA for speech modality.


For the Web modality, similar files are generated, but now, instead of the 'grammar concept file' and the 'prompt concept file', the files are the 'input concept file' and the 'output concept file', again language independent, which describe the appearance of each input or output item (radio button, submit button, secret text, labels, combo boxes, lists, etc.) and also include a reference to the text concepts that they use, which are specified in a 'text concept file' just like for the speech modality.

5.4. Dialog model linker (DML)

This module generates one file for each selected modality, where all the information from the previous assistants is automatically linked together: dialogs, actions, input/output concepts, prompts and grammars, etc.

The final dialog model is a combination of the files produced by the RMA, the MERA and the MEA. Both kinds of model are linked together by filling different sections of the GDialogXML dialog units; see Hamerich et al. (2003) for further information.

5.5. Script generators

In this section, we describe the modules that convert the dialogs coded in GDialogXML syntax into the execution scripts needed for each modality (VoiceXML and xHTML). To carry out this process, they solve the problems and limitations of each standard and manage the issues regarding the handling of multilinguality, database access, the preparation of prompts or Web text, and the handling of concepts in the language-independent specification of the dialog.

One important issue is how the system handles, at run time, a prompt that includes information returned by a database query. To solve this, the script generators include in the script a variable called xLanguageId_global, which codes the language of the service with an ISO 639-1 identifier (language code) followed by an ISO 3166 identifier (country code). This variable is set by the language identification module at the beginning of the session and is always passed as an argument to all database access functions specified in the DCMA, where it is concatenated with the name of the field that is going to be retrieved. Obviously, the database (there is a single database for all languages) should contain the same information for each language used in the service, using fields with the same base name but different codes as suffixes, e.g., info_text_en_UK for the field with the information in English, info_text_es_ES for the information in Spanish, etc. Once the query is made, the right variables are filled and the information is provided to the user in the correct language.
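A minimal sketch of this concatenation as it could appear in the generated script (xFieldName is an illustrative name; xLanguageId_global is the variable described above):

<var name="xFieldName"/>
<!-- xLanguageId_global is set by the language identification module, e.g. to 'en_UK' -->
<assign name="xFieldName" expr="'info_text_' + xLanguageId_global"/>
<!-- xFieldName now holds 'info_text_en_UK' and is passed on to the database access function -->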

5.5.1. VoiceXML generator and connection with the runtime platform

Using the file created by the linker (DML) in the previous step for the speech modality, this module generates a file in VoiceXML format for each language used in the service.

The script generator for VoiceXML has to overcome the limitations imposed by version 2.0 of this language. The main limitation is probably that VoiceXML does not allow returning calls (subroutines), which are needed to solve the problem of presenting lists of objects (see Section 5.2.1), as ordinary statements (returning calls are only allowed at certain positions). Therefore, all complex statements and value expressions have to be 'flattened' into simpler operations and into calls to intermediate dialogs that allow jumps to other states in the dialog flow.

Besides, VoiceXML does not allow input fields to be global variables; however, we use global variables for over-answering so that they can be filled in previous states and do not lose their contents when jumping to other states. Therefore, a synchronization strategy had to be implemented to map global variables to local input fields and vice versa. To handle over-answering it is also necessary that, if a slot is optional or is already filled, the recognition process can be omitted. In VoiceXML all the slots in a form must be filled, so we introduced additional intermediate variables and conditional blocks to find out whether the slots associated with over-answering are filled or not. Another issue is how to clear the content of a slot in a form and jump back to previous dialogs or states, which is another behavior needed for list handling, as the VoiceXML interpreter would automatically repeat the filling process for that slot, and that is not the desired behavior. To solve this, the VoiceXML generator creates intermediate variables so that, when a slot has been cleared intentionally, the filling process is not repeated.
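The following fragment sketches one possible realization of this synchronization for an amount slot (the variable and form names are ours): the field is initialized from the global variable, so recognition is skipped when the slot was already filled by over-answering, and the recognized value is copied back otherwise.

<var name="xAmount_global"/>   <!-- may have been filled by over-answering in a previous state -->
<form id="GetAmount">
  <field name="Amount" type="currency" expr="xAmount_global">
    <prompt>Which amount would you like to transfer?</prompt>
    <filled>
      <!-- keep the global variable synchronized with the local input field -->
      <assign name="xAmount_global" expr="Amount"/>
    </filled>
  </field>
</form>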

GDialogXML supports the idea of connecting to services during runtime, e.g., services providing access to databases, services generating prompts on the fly, and so on. The VoiceXML generator implements these calls as HTTP requests to a CGI script. This CGI script works as a data bridge and contacts the actual services. The bridge is needed because its response has to be VoiceXML code, since this is the only way to integrate dynamic data (result values) into the dialog flow. By using the bridge, the services are freed from the burden of producing VoiceXML themselves.
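Such a service call could appear in the generated script as a subdialog request to the bridge, which answers with a small VoiceXML document returning the result values; the script name, parameters and result fields below are invented for illustration:

<!-- Balance is assumed to be declared as a global variable in the generated script -->
<subdialog name="balanceQuery"
           src="http://server/cgi-bin/gemini_bridge.cgi"
           namelist="xLanguageId_global AccountNumber">
  <filled>
    <assign name="Balance" expr="balanceQuery.balance"/>
  </filled>
</subdialog>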

To generate prompts on the fly, we decided to use language-dependent JSGF grammars in which the correspondence between prompt and concept is specified, and the recognizer would return in real time the concept specified in the grammar instead of the prompt.

To increase performance, the VoiceXML generator uses a reference resolution strategy for the result values of the runtime services. This means that if the result of a service request is a reference to a complex named object (e.g., a person), only the reference (consisting of the identifying attributes) is transmitted. When the details of the object are needed, the complete data structure of attribute values is transmitted. This is particularly important when the result of a service request is a large list of references to complex data objects, which happens frequently when navigating through information bases.

Finally, the VoiceXML generator automatically creates global variables in the final script where the dynamic runtime values returned by the corresponding modules are kept, in order to handle several aspects of the runtime system: the user level, the speaker ID, the confidence value from the last recognition, the current language, etc. The user-level variable, for example, is needed for switching prompts depending on the user level, which is set by calling the User-Level-Detector runtime service.

In (Cordoba et al., 2004; Hamerich et al., 2003), other limitations of VoiceXML are described, together with some recommendations to improve the standard.

5.5.2. Web script generator

Using the file created by the linker (DML) in the previous step for the Web modality, this module generates a file in xHTML format for each language used in the service.

Unlike the voice modality, for the Web the distinction between the flow control, the data and the presentation (buttons, images, frames, etc.) is not so clear. Nowadays, there are many integrated development environments for the presentation part (Web editors), whereas the control and data access have to be specified using script languages (Perl, PHP, Python, ...) or general-purpose programming languages (Java, .Net, ...). In other cases, the overall flow control is supported by frameworks (Jakarta Struts, ...), but in general there is no widespread language, so the task of integrating all of them is difficult.

Therefore, the objective of this assistant is not to compete with widespread Web editors but to provide complementary support that facilitates the separation between the modeling of the flow control, the data and the presentation. To this end, the assistant automatically transforms the GDialogXML models related to input and output concepts into xHTML files with embedded xForms elements. This way, the generated files can be used as templates for Web designers, who can add additional design elements (xHTML tags, images, styles, etc.), while the dialog flow is preserved separately. A runtime interpreter for GDialogXML may execute the dialog model "as is" in the Web server environment to control the dialog flow and take care of the database transactions.
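A generated template could look roughly like the fragment below (field names, labels and the submission target are illustrative); the Web designer can then add styling and layout around the xForms controls without touching the flow model:

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xf="http://www.w3.org/2002/xforms">
  <head>
    <title>Transfer</title>
    <xf:model>
      <xf:instance>
        <transfer xmlns=""><Amount/><DebitAccountNumber/></transfer>
      </xf:instance>
      <xf:submission id="doTransfer" method="post" action="transfer-handler"/>
    </xf:model>
  </head>
  <body>
    <xf:input ref="Amount">
      <xf:label>Amount to transfer</xf:label>
      <xf:hint>Enter the amount in euros</xf:hint>
      <xf:alert>Please enter a valid amount</xf:alert>
    </xf:input>
    <xf:submit submission="doTransfer">
      <xf:label>Transfer</xf:label>
    </xf:submit>
  </body>
</html>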

Thanks to this separation, the final script is platform independent and easily adaptable to multiple display devices (browsers, PDAs, public terminals, etc.). Although there are some limitations with xForms, as not all browsers support them, the use of plugins or rendering programs on the server provides that support. Besides, xForms is expected to become a Web standard for form design, as part of the xHTML 2.0 specification, so its popularity will grow. The use of xHTML tags could also favor the integration of the two modalities in a future development, so that they can work at the same time using the X + V standard (xHTML, 2004).

5.6. Auxiliary assistants

Besides the assistants described above, there are some other assistants that complement the work needed for the voice modality.

The first one is the vocabulary builder (VB), which prepares the vocabularies that will be used by the recognizer. This component takes the language model resources as input and produces the lexicon, which contains the phonetic transcription of each word and, in most cases, its phonetic alternatives. There are equivalent dictionaries for each of the languages allowed in the platform.

A second assistant is the Language Modeling Toolkit (LMT), which allows the designer to specify the language models that will be used in the runtime system to "understand" the different user answers to the system questions. The assistant allows the creation and editing of grammars in JSGF (Hunt, 2000), both for recognition and for prompt generation using the Natural Language Generation (NLG) module.

Finally, in order to be able to edit the different GDialogXML models produced by the assistants of the AGP, a separate component has been developed. This assistant, called DiaGen, allows the creation and editing of GDialogXML models and libraries by providing auto-complete templates and other editing functionalities.

6. Portability and use of standards

Among the main objectives of the platform were portability (meaning independence of the operating system and the runtime platform), scalability, widespread use of standards, and the ability to use existing or new technologies. The main efforts made in this direction are described below.

6.1. OpenVXI

To test the VoiceXML script generated by the platform, an interpreter is needed to execute it on a runtime platform. We selected OpenVXI 2.0.1 from ScanSoft (OpenVXI Web page, 2004) because it is an open-source solution and, thanks to its portability, it imposes neither a specific recognition or text-to-speech engine nor a specific telephony platform, so it can be adapted to any runtime platform. Besides, it provides an important part of the functionality required to execute dialog applications, namely an XML interface that processes the VoiceXML script, a JavaScript API, a WWW functions API and a Register API (Eberman et al., 2002). Regarding the functions related to input/output (recognition, text-to-speech and telephone control), it provides interfaces that can be modified and completed to adapt them to the needs of any platform.

An additional advantage of this interpreter was that it allowed us to work with our own technology in all respects, so the tests could be controlled even further. In (Cordoba et al., 2004), we describe some of the improvements and adaptations that we had to make to the interpreter to adapt it to our runtime system.

6.2. Programming environment and other standards

All the graphic components of the platform have been programmed and generated using Qt from Trolltech, a multi-platform (Linux, Windows, Mac, X11) development framework compatible with C++, with which the designer can write code that can be executed in different operating systems and development environments (e.g., Visual Studio, Visual .NET, Borland) just by recompiling. Moreover, Qt provides several methods and tools to translate all the texts of the graphical interface in order to adapt them to another language.

We have also used the UTF-8 format (8-bit Unicode Transformation Format), a variable-length character encoding for the Unicode standard that covers multiple languages. Besides, it is the default encoding in XML and is widely used in Internet protocols. This format is crucial in the definition of prompts/grammars for the multilingual service.

The platform also uses some ideas from the USI (Universal Speech Interface) project (Toth et al., 2002), applied to the generation of names in automatic dialogs: the names are derived from the attributes in the database, so that they are easily recognizable. Furthermore, we also use keywords for some universal commands available in the runtime system.

Finally, as we mentioned in Section 5.3, to generate the prompts used by the text-to-speech system in the voice modality we have adapted the platform and the runtime system so that they can process the SSML format (Speech Synthesis Markup Language) (Burnett et al., 2002).

7. Evaluation and test applications

7.1. Subjective evaluation of the platform

To rate the acceptance and friendliness of the platform for the rapid and efficient generation of multilingual/multimodal dialog applications, we carried out a subjective evaluation with 41 subjects (24 novices in dialog application design, 11 intermediate users and 6 experts) from Greece, Germany and Spain, with ages ranging from 21 to 49 years and, in general, with basic knowledge of programming.

Table 1
Subjective evaluation of the platform

Question                                                                            Average rating
The provision of data modeling and connecting to external data sources             8.7
The provision of application state flow modeling                                   8.6
Easy adaptability to other languages                                               8.2
Easy adaptability to other modalities                                              7.9
Ready-made error-handling (nomatch, noinput)                                       8.1
Speed up of development time as compared to writing VoiceXML/xHTML code by hand    9.0
Provision of user modeling                                                         7.8
Provision of mixed-initiative dialog handling                                      8.5
Provision of list handling                                                         8.6
Provision of over-answering                                                        9.0
Provision of easy connection to run-time modules                                   7.9

A short introductory tutorial was given to the participants and then they were asked to carry out predefined tasks of dialog design in each assistant. These tasks were part of the design of an overall application (e.g., get the information about a house loan, make a transaction of a certain amount between two accounts using mixed initiative, obtain the current value of a currency, etc.) and covered almost 90% of the functionality of the AGP, obviously including all the AGP assistants. The participants had to use all of them in turn: they created the data model for three complex classes, prepared at least three database access functions, designed at least four states in the SFM, completed the states in the RMA to carry out the mentioned tasks, created and reused some libraries, etc. The reason for not building a complete system from scratch was to keep the evaluation time within a limit of 3–4 h, including time for the tutorials.

Finally, they had to answer a questionnaire that consisted of two parts: one to evaluate each assistant individually and the other to evaluate the platform as a whole, with scores between 1 and 10 representing very poor and excellent ratings, respectively. The results are shown in Table 1.

The overall score was an average of 8.37, with the maximum scores in the following aspects:

• Speeding up the development time of an application

• Over-answering and mixed-initiative functionality

• List handling for speech applications



Regarding the results for the specific assistants, in general, all of them obtained average ratings between 7 and 8.5, which are quite homogeneous. We should mention some remarks from the evaluators regarding the RMA and the MEA. The RMA was rated as having a very good overall appearance and extremely good functionality. However, it was harder to learn and was perceived as less intuitive. The most probable reason is that it is the assistant that provides the largest functionality for dialog design, as it is the place where the detailed design is made, and the evaluating designers were given very little time to learn and practice each assistant, so in practice the evaluators only read part of the documentation written about the designs they were asked to do. In the case of the MEA, the lower ratings are probably due to the time required to set the prompts and grammars (text typing), especially for the definition of additional languages, where machine translation and more prompt and grammar libraries would be desirable.

7.2. Services tested in the runtime system

To evaluate and validate the efficiency of the platform and its service independence, two main test applications were set up.

The first one is a banking application called EGBanking, with several services: general information (deposit products, loans, cards, e-banking or exchange rates), an authentication process where the user inputs the account number and the PIN, personal information (personal accounts and cards, last transactions, balance), and a transaction dialog to transfer money between accounts or to make payments (VAT, Social Security, public pension, income tax, credit card and other fund payments). It has been implemented in four languages: Greek, German, English, and Spanish. It has been running since September 2003 and is used as a commercial product by a Greek bank (Egnatia bank Web page, 2004) for its registered clients, with very good results.

The second application, called CitizenCare, offers basic voice information retrieval functionality in the context of public authorities. The caller can either ask for solutions to specific concerns or for general information about certain departments, such as addresses, phone numbers or opening hours. Specific information may be retrieved in one call. This application was developed for German and English and for both modalities, voice and Web.

8. Future plans

Considering the good results obtained in the evaluation and the test applications and the growing demand for better and more complex dialog systems, and in order to broaden the functionality of this kind of tool, we are considering the following improvements to the platform:

• Increase the number of automatically generated dialogs by making a more complex and exhaustive analysis of the database structure. These dialogs would consist of multiple actions grouped by the assistant, but with the possibility of being edited by the designer.

• Increase the number of libraries available with the platform.

• Provide more flexibility in the confirmation handling for the voice modality by offering more options adapted to each specific data type and more fine-tuning possibilities.

• Implement new strategies to reduce design time, e.g., in the generation of the structure and access functions of the Data Model, and in the generation of grammars and prompts.

• Integrate the two current modalities so that they can work at the same time through the X + V standard (xHTML, 2004).

9. Conclusions

Throughout the Gemini project we have studied different systems and strategies for the design of human–computer dialog applications and, as a result, we have developed an application generation platform that is both powerful and flexible, with a high degree of automation, allowing the generation of state-of-the-art speech- and Web-based applications. The platform architecture is able to generate, in a semiautomatic way, valid dialogs for multiple languages and two modalities using just a description of the database, a basic state flow of the application and a simplified, user-friendly interaction with the designer. In this paper we have shown several strategies to speed up the design process, such as the possibility of handling mixed-initiative and over-answering dialogs within the same framework. Detailed procedures for list presentation and confirmation handling have been presented too. Features like user modeling, speaker verification and language identification can be included easily through runtime modules included in the platform. Moreover, the use of standards like VoiceXML and xHTML keeps the platform open for further development and allows the use of other tools and existing technology. Another important result of the project is the newly designed abstract dialog description language, called GDialogXML. Furthermore, we have carried out a subjective evaluation and developed two test applications that demonstrate the user- and designer-friendliness and robustness of the platform, as well as the fulfillment of all the objectives initially planned, together with a proposal of improvements and future plans for the platform.

Acknowledgements

We want to thank the following people for their contribution to the design and development of this platform: Fernando Fernandez, Valentín Sama-Rojo, Ignacio Ibarz and Javier Morante from the Universidad Politécnica de Madrid (GTH, Spain), for their contribution to the coding of this platform and the runtime system; Yu-Fang Helena Wang from TEMIC Speech Dialog Systems (Germany), for evaluating all the CitizenCare applications; Thomas Eppelt, Manfred Paula and Robert Stegmann from FAW (Germany), for setting up the backend database of the CitizenCare application; Nikos Katsaounos and Alex Sarafian from Knowledge S.A.; and Georgila Kalliroi, Todor Ganchev and Nikos Fakotakis from Patras University (Greece), for their work on several assistants. We cannot forget Gianna Tsakou and Anastasios Tsopanoglou for their excellent job as project coordinators. Finally, we want to thank the reviewers of this paper for their comments and appropriate suggestions.

References

Allen, J., Guinn, C., Horvitz, E., 1999. Mixed-initiative interaction. IEEE Intelligent Syst. 14 (5), 14–23.

Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., 2001. Towards conversational human–computer interaction. AI Magazine 22 (4), 27–37.

Almeida, L. et al., 2002. Implementing and evaluating a multimodal and multilingual tourist guide. In: Proc. Internat. CLASS Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, pp. 1–7.

Bennett, C., Llitjos, A.F., Shriver, S., Rudnicky, A., Black, A.W., 2002. Building VoiceXML-based applications. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), pp. 2245–2248.

Bohus, D., Rudnicky, A.I., 2003. RavenClaw: dialog management using hierarchical task decomposition and an expectation agenda. In: Proc. 8th European Conf. on Speech Communication and Technology (Eurospeech), pp. 597–600.

Burnett, D.C., Walker, M.R., Hunt, A., 2002. Speech Synthesis Markup Language Version 1.0. W3C Working Draft. Available from: <http://www.w3.org/TR/speech-synthesis>.

Cole, R., 1999. Tools for research and education in speech science. In: Proc. Internat. Conf. of Phonetic Sciences (ICPhS), pp. 1277–1280.

Cordoba, R., San-Segundo, R., Montero, J.M., Colas, J., Ferreiros, J., Macías-Guarasa, J., Pardo, J.M., 2001. An interactive directory assistance service for Spanish with large-vocabulary recognition. In: Proc. 7th European Conf. on Speech Communication and Technology (Eurospeech), Vol. II, pp. 1279–1282.

Cordoba, R., Fernandez, F., Sama, V., D'Haro, L.F., San-Segundo, R., Montero, J.M., 2004. Implementation of dialog applications in an open-source VoiceXML platform. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), pp. I-257–I-260.

Denecke, M., 2002. Rapid prototyping for spoken dialogue systems. In: Proc. 19th Internat. Conf. on Computational Linguistics (COLING'02).

D'Haro, L.F., de Cordoba, R., San-Segundo, R., Montero, J.M., Macías-Guarasa, J., Pardo, J.M., 2004. Strategies to reduce design time in multimodal/multilingual dialog applications. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), pp. IV-3057–3060.

Eberman, B., Carter, J., Goddeau, D., 2002. Building VoiceXML browsers with OpenVXI. In: Proc. 11th Internat. Conf. on World Wide Web, pp. 713–717.

Egnatia bank Web page, 2004. Available from: <http://egnatiasite.egnatiabank.gr/EN/default.htm>.

Ehrlich, U., Hanrieder, G., Hitzenberger, L., Heisterkamp, P., Mecklenburg, K., Regel-Brietzmann, P., 1997. ACCeSS—automated call center through speech understanding system. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech), pp. 1819–1822.

Flippo, F., Krebs, A., Marsic, I., 2003. A framework for rapid development and multimodal interfaces. In: Proc. Internat. Conf. on Multimodal Interfaces, pp. 109–116.

Gemini Project Homepage, 2004. Available from: <http://www-gth.die.upm.es/projects/gemini/>.

Glass, J., Weinstein, E., 2001. SPEECHBUILDER: facilitating spoken dialogue system development. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech), pp. 1335–1339.

Gustafson, J., Elmberg, P., Carlson, R., Jonsson, A., 1998. An educational dialogue system with a user controllable dialogue manager. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP), pp. 33–37.

Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granstrom, B., House, D., Wiren, M., 2000. AdApt—a multimodal conversational dialogue system in an apartment domain. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), pp. II-134–137.

Hamerich, S.W., Wang, Y.-F., Schubert, V., Schless, V., Igel, S., 2003. XML-based dialogue descriptions in the Gemini Project. In: Proc. Berliner XML-Tage, pp. 404–412.

Hamerich, S.W., Schubert, V., Schless, V., Cordoba, R., Pardo, J.M., D'Haro, L.F., Kladis, B., Kocsis, O., Igel, S., 2004a. Semi-automatic generation of dialogue applications in the Gemini Project. In: Proc. Workshop for Discourse and Dialogue (SigDial), pp. 31–34.

Hamerich, S.W., de Cordoba, R., Schless, V., D'Haro, L.F., Kladis, B., Schubert, V., Kocsis, O., Igel, S., Pardo, J.M., 2004b. The Gemini Platform: semi-automatic generation of dialogue applications. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), pp. IV-2629–2632.

Hunt, A., 2000. JSpeech Grammar Format. W3C Note. Available from: <http://www.w3.org/TR/jsgf>.

Johnston, M., Bangalore, S., et al., 2002. MATCH: an architecture for multimodal dialogue systems. In: Proc. 40th Annual Meeting of the ACL, pp. 376–383.

Katsurada, K., Ootani, Y., Nakamura, Y., Kobayashi, S., Yamada, H., Nitta, T., 2002. A modality independent MMI system architecture. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), pp. 2549–2552.

Klemmer, S.R., Sinha, A.K., Chen, J., Landay, J.A., Aboobaker, N., Wang, A., 2000. SUEDE: a Wizard of Oz prototyping tool for speech user interfaces. In: CHI Letters, ACM Symposium on User Interface Software and Technology (UIST), Vol. 2 (2), pp. 1–10.

Komatani, K., Ueno, S., Kawahara, T., Okuno, H.G., 2003. User modeling in spoken dialogue systems for flexible guidance generation. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech), pp. 745–748.

Lamel, L., Rosset, S., Gauvain, J.L., Bennacef, S., Garnier-Rizet, M., Prouts, B., 2000. The LIMSI ARISE system. Speech Commun. 31 (4), 339–354.

Lehtinen, G., Safra, S., Gauger, M., Cochard, J.-L., Kaspar, B., Hennecke, M.E., Pardo, J.M., Cordoba, R., San-Segundo, R., Tsopanoglou, A., Louloudis, D., Mantazas, M., 2000. IDAS: interactive directory assistance service. In: Proc. COST249 ISCA Workshop on Voice Operated Telecom Services (VOTS-2000), pp. 51–54.

Levin, E., Narayanan, S., Pieraccini, R., Biatov, K., Bocchieri, E., Di Fabbrizio, G., Eckert, W., Lee, S., Pokrovsky, A., Rahim, M., Ruscitti, P., Walker, M., 2000. The AT&T-DARPA communicator mixed-initiative spoken dialog system. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), Vol. 2, pp. 122–125.

McGlashan, S., Burnett, D.C., Carter, J., Danielsen, P., Ferrans, J., Hunt, A., Lucas, B., Porter, B., Rehor, K., Tryphonas, S., 2004. Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C Recommendation. Available from: <www.w3.org/TR/voicexml20>.

McTear, M., 1998. Modelling spoken dialogues with state transition diagrams: experiences with the CSLU toolkit. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), pp. 1223–1226.

McTear, M., 1999. Software to support research and development of spoken dialogue systems. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech), pp. 339–342.

McTear, M., 2002. Spoken dialogue technology: enabling the conversational user interface. ACM Comput. Surveys 34 (1), 90–169.

Meng, H.M., Lee, S., Wai, C., 2002. Intelligent speech for information systems: towards biliteracy and trilingualism. Interacting with Computers 14 (4), 327–339.

Nigay, L., Coutaz, J., 1993. A design space for multimodal systems—concurrent processing and data fusion. In: Proc. INTERCHI—Conf. on Human Factors in Computing Systems, pp. 172–178.

OpenVXI Web page, 2004. Available from: <http://fife.speech.cs.cmu.edu/openvxi/>.

Oviatt, S.L. et al., 2000. Designing the user interface for multimodal speech and gesture applications: state-of-the-art systems and research directions. J. Human Comput. Interact. 15 (4), 263–322.

Pargellis, A.N., Kuo, H.J., Lee, C., 2004. An automatic dialogue generation platform for personalized dialogue applications. Speech Commun. 42, 329–351.

Polifroni, J., Seneff, S., 2000. Galaxy-II as an architecture for spoken dialogue evaluation. In: Proc. Internat. Conf. on Language Resources and Evaluation (LREC), pp. 725–730.

Polifroni, J., Chung, G., Seneff, S., 2003. Towards the automatic generation of mixed-initiative dialogue systems from Web content. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech), pp. 193–196.

Rudnicky, A., Xu, W., 1999. An agenda-based dialog management architecture for spoken language systems. In: IEEE Automatic Speech Recognition and Understanding Workshop, pp. 337–340.

San Segundo, R., Montero, J.M., Colas, J., Gutierrez, J.M., Ramos, J.M., Pardo, J.M., 2001. Methodology for dialogue design in telephone-based spoken dialogue systems: a Spanish train information system. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech), pp. 2165–2168.

Schubert, V., Hamerich, S.W., 2005. The dialog application metalanguage GDialogXML. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech), pp. 789–792.

Seneff, S., Polifroni, J., 2000. Dialogue management in the Mercury flight reservation system. In: Proc. ANLP-NAACL Satellite Workshop, pp. 1–6.

Strik, H., Russel, A., van den Heuvel, H., Cucchiarini, C., Boves, L., 1997. A spoken dialog system for the Dutch public transport information service. Int. J. Speech Technol. 2 (2), 121–131.

Toth, A.R., Harris, T.K., et al., 2002. Towards every-citizen's speech interfaces: an application generator for speech interfaces to databases. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), pp. 1497–1500.

Turunen, M. et al., 2004. AthosMail—a multilingual adaptive spoken dialogue system for e-mail domain. In: Proc. Workshop on Robust and Adaptive Information Processing for Mobile Speech Interfaces.

Uebler, U., 2001. Multilingual speech recognition in seven languages. Speech Commun. 35 (1), 53–69.

W3C, 1999. Available from: <http://www.w3.org/TR/voice-dialog-reqs/>.

Wahlster, W., Reithinger, N., Blocher, A., 2001. SmartKom: multimodal communication with a life-like character. In: Proc. European Conf. on Speech Communication and Technology (Eurospeech), pp. 1547–1550.

Wang, K., 2000. Implementation of a multimodal dialog system using extended markup languages. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), Vol. 2, pp. 138–141.

Wang, K., 2002. SALT: a spoken language interface for Web-based multimodal dialog systems. In: Proc. Internat. Conf. of Spoken Language Processing (ICSLP), pp. 2241–2244.

Wang, Y.-F.H., Hamerich, S.W., Schless, V., 2003. Multi-modal and modality specific error handling in the Gemini Project. In: Proc. Workshop on Error Handling in Spoken Dialogue Systems, pp. 139–144.

xHTML+Voice, 2004. Available from: <http://www.voicexml.org/specs/multimodal/x+v/12/spec.html>.

Zue, V., Seneff, S., Glass, J., Polifroni, J., Pao, C., Hazen, T.J., Hetherington, L., 2000. JUPITER: a telephone-based conversational interface for weather information. IEEE Trans. Speech Audio Process. 8 (1), 85–96.