VoiceXML for Pervasive Environments

Stefan Radomski, Dirk Schnelle-Walka

Technische Universität Darmstadt

The language support of VoiceXML2.1 to express flexible dialogs in pervasive environments is still lacking key features. Missing information about the environment and the inability to react to external events lead to rigid and verbose dialogs. By introducing these features as ECMAScript variables and event handlers in an interpreter, dialog authors can adapt their dialogs' behavior with regard to the users' surroundings and incorporate available information from the pervasive environment. Adding these features extends the expressiveness of VoiceXML2.1 and enables the modeling of previously inexpressible, more flexible dialogs.

Introduction

The modality of speech is widely considered to be a promising way to interact with systems in pervasive environments (Sawhney & Schmandt, 2000; Turunen, 2004; Ranganathan, Chetan, Al-Muhtadi, Campbell, & Mickunas, 2005). Being surrounded by a wealth of information processing entities leaves a user with the problem of how to interact with them. While devices can employ dedicated interaction panels (e.g. touch interfaces and graphical displays), such interfaces still require the user to step up to them or carry a mobile device. With the modality of speech comes the promise of casual, effortless and hands-free interaction, suitable to command all the functionality of a pervasive environment.

Apart from the actual speech recognition technology, such an interaction paradigm needs a way to formally describe the spoken dialogs between the system and the user: VoiceXML is one such dialog description language. As part of the standardization efforts of the W3C for multimodal applications, version 2.1 of the language was given recommendation status in June 2007, with version 3.0 being underway. But with its roots in telephony applications and interactive voice response systems, common requirements regarding an application in pervasive environments remain unsatisfied.

Improving the integration of dialogs modeled in VoiceXML with a pervasive environment is, foremost, to extend an interpreter's capabilities to exchange information with other systems in the environment and to react accordingly. The illustration in figure 1 classifies existing VoiceXML language features within a general push/pull scheme for information exchange between a VoiceXML interpreter and external systems.

While there is language support to pull information into a running VoiceXML session and eventually push gathered information to external systems, the defined semantics exhibit behavior well suited for telephony applications but are cumbersome when applied to pervasive environments. Other communication schemes, like pushing information into a running VoiceXML session and pulling information from a running VoiceXML session, have no language support at all.

Figure 1. Pushing and pulling information into and from a VoiceXML session. The figure classifies the language support in four quadrants:
  Pulling information into VoiceXML: the object element can return ECMAScript object models from native objects; the data element can transform external XML into ECMAScript object models.
  Pushing information from VoiceXML: the submit element sends ECMAScript variables per GET or POST to a server (only defined for scalars); the object element can use ECMAScript variables to pass parameters to the object.
  Pushing information into VoiceXML: no language support (N/A).
  Pulling information from VoiceXML: no language support (N/A).

We identified four areas with regard to the communication schemes where VoiceXML needs to be extended to better support dialogs in pervasive environments:

1. Enable other systems to push information into a running VoiceXML session and allow the interpreter to adapt the dialog.

2. Continuously reflect current values in the object models representing external systems in a VoiceXML session, not the fixed values from the time when a pull was performed.

3. Allow the VoiceXML interpreter to push gathered information to other systems without losing the state of the dialog.

4. Enable other systems to pull information from a running VoiceXML session without waiting for the session to push its information.

In this article, we will describe our work with regard to the first two problems in the context of the JVoiceXML interpreter¹. We consider them to be more important than the last two, which we will address in upcoming work.

The rest of this article is organized as follows. After a presentation of related work, we give a condensed view of the VoiceXML2.1 standard and its expressiveness and outline its core concepts, to subsequently identify the possibilities to extend the language with our approach. The article concludes with an outlook to VoiceXML3.0 and some subsequent conclusions.

¹ http://jvoicexml.sourceforge.net

In a previous publication, we already described our work of using the JSAPI2² layer beneath our JVoiceXML implementation to realize a distributed voice user interface incorporating audio streaming to and from mobile devices (Schnelle-Walka & Radomski, 2010; Radomski & Schnelle-Walka, 2010).

Related Work

The inability of VoiceXML to react to external events was noticed as early as 2001 for the VoiceXML1.0 standard, when Niklfeld et al. bemoaned that it would not be possible to make a VoiceXML interpreter aware of changes from the graphical component of a multimodal application (Niklfeld, Finan, & Pucher, 2001). Similar concerns were raised by Pakucs when implementing a plug and play architecture for voice user interfaces for mobile environments (Pakucs, 2002).

To overcome these problems, most of the work regarding a deployment of VoiceXML in pervasive environments focused on creating small, dynamic documents on a server, including information and behavior for a standard VoiceXML interpreter to process. By keeping the dialog within a generated document short, the VoiceXML interpreter would send frequent requests for new documents, in which updated information and behavior could then be included.

INSPIRE (Dimopulos, Albayrak, Engelbrecht, Lehmann, & Möller, 2007), developed at the Strategic Research Laboratories of the Deutsche Telekom and the DAI-Labor, Berlin University of Technology, is a home automation system with a multimodal user interface using, among other modalities, voice and gestures. All interactions are controlled by the INSPIRE Core, leaving VoiceXML as the means to express the spoken presentation layer. The VoiceXML documents are dynamically generated from ontological descriptions per request, and knowledge about the environment is embedded into the documents at creation-time. As such, this information might be deprecated as soon as the document is processed by the interpreter and is only updated by subsequent requests. Additionally, generated UIs are still imperfect and do not reach the quality of hand-crafted UIs (Sukaviriya et al., 1994; Szekely, 1996). Thus, existing design knowledge is hard to use optimally in the generated UI case. Moreover, it is not possible to author mixed-initiative dialogs with INSPIRE. User input is only accepted by simple fields that accept only a single piece of information, making it impossible to ask for missing information as it would be necessary in the following dialog, taken from (Schnelle-Walka, Arndt, & Feldes, 2011).

User: Please close the shutter.
System: Which shutter shall be closed?
(The user does not know how to continue and says nothing.)
System: Do you want to close the shutter to the garden or to the terrace?
User: The terrace.

Most of these problems are shared by OwlSpeak (Heinroth, Denich, & Schmitt, 2010). Developed at the University of Ulm, it employs an information state update (ISU) model (Larsson & Traum, 2000) reflecting the current dialog state as beliefs. Like INSPIRE, VoiceXML documents are generated per request based on knowledge that is captured in ontologies. Since the way OwlSpeak uses VoiceXML is comparable to INSPIRE, it inherits all the drawbacks regarding design quality and reaction to changes in the environment. In contrast to INSPIRE, mixed-initiative dialogs are possible with OwlSpeak, but they rely on new concepts that come with ISU. Therefore, dialog designers have to learn these new concepts to transfer their existing design knowledge to the generation of dialogs.

A similar approach is taken by Nyberg et al. (Nyberg, Mitamura, & Hataoka, 2002). They do not make use of ontologies, but employ a combination of two proprietary XML dialects, namely DialogXML and ScenarioXML, to generate VoiceXML documents. The benefit is a more straight-forward modeling of dialogs as state transition networks expressed in scenarios, while sacrificing the mixed-initiative elements of VoiceXML. Their approach allows for a better integration of subdialogs to express large-scale voice user interfaces, and they subsequently employed their system to model a voice user interface for a car (Obuchi et al., 2005).

An inherent limitation of all approaches employing generated VoiceXML documents is the inability to process events as they occur. For instance, it is not possible to express a dialog such as the following:

User: Add milk to the grocery list.
System: How much milk do ...
--- Incoming call detected.
System: Your wife is calling from her mobile phone, do you want to speak to her?
User: Sure.
--- Waiting for call to finish.
System: So, how much milk do you want?
User: Cancel that.

There are other frameworks to author voice based applications for pervasive environments that do not make use of VoiceXML. One of them is the ODP (Ontology-based Dialog Platform) from SemVox, a spin-off of the German Research Center for Artificial Intelligence (DFKI GmbH). Multimodal input, like speech, touch or keyboard, coming from various clients is connected by semantic processing. ODP supports standards such as MRCP/SIP, EMMA and SSML, GUI frameworks such as Ajax, Flash/Flex or JavaFX, as well as novel modalities (for example multitouch, gesture recognition or virtual characters)³. SemVox also provides its own workbench for rapid application development (Sonntag, Sonnenberg, Nesselrath, & Herzog, 2009). Applications are written in a proprietary XML format. Hence, existing design knowledge is hard to reuse. Although ODP introduces very expressive concepts, they have to be learned and understood by the developers. Furthermore, the inherent complexities of ontologies are also often underestimated (Albertsen & Blomqvist, 2007).

² http://jcp.org/en/jsr/detail?id=113
³ http://www.semvox.de/en/menu-home.html

VoiceXML in a Nutshell

To motivate our approach and discuss its advantages and shortcomings, we would first like to present a condensed view of the VoiceXML2.1 standard (McGlashan et al., 2004; Oshry et al., 2007). We will give an impression of its expressiveness and outline core concepts to subsequently present our approach of extending it for modeling dialogs in pervasive environments.

The VoiceXML2.1 standard is a formal description for interactions as spoken dialogs between a human and a machine. It models all voice interactions as a sequence of forms, in which the user is required to fill information slots as form items. For every user turn, there is a set of active grammars, used by the system to match the user's utterance to one or more of these information slots and to direct dialog flow within and between forms. The overall structure bears some resemblance to HTML, as form items can also be submitted to a server to get another VoiceXML document to process in response.

Core Concepts

A VoiceXML2.1 document consists of an enclosing vxml element which contains at least one dialog as a form element⁴ (see figure 2). A form in turn contains, most notably, form items, which can be divided into input and control items: input items, as their name suggests, gather input from the user; control items do not. The other items potentially contained in a form are not considered form items, as they are not eligible to be visited during form processing, as we will see in the section on the form interpretation algorithm.

The major building block of forms is the field form input item, which defines a slot of information to be filled by the user during form processing via speech recognition⁵. In a simple system-initiative dialog, these fields will be visited in document order and the user is expected to provide utterances, each matching the current field-level grammar. Other input items are available to gather input from e.g. a subdialog, other voice applications or native objects.
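As an illustration (a minimal sketch, not taken from the standard; grammar URIs, prompts and the target URL are invented), a system-initiative form with two fields that are visited in document order and submitted at the end:

<form id="order">
  <field name="drink">
    <grammar src="drinks.grxml" />
    <prompt>What would you like to drink?</prompt>
  </field>
  <field name="size">
    <grammar src="sizes.grxml" />
    <prompt>Which size?</prompt>
  </field>
  <block>
    <submit next="http://example.com/order" namelist="drink size" />
  </block>
</form>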

There can be two different control item elements within a form: one or more initial elements and one or more block elements. Block elements contain sequences of procedural statements, e.g. to enqueue new spoken system prompts or perform some computation. The initial element plays an important role for the mixed-initiative approach in VoiceXML: when a form is initially entered or its form item variables have been cleared, the first iteration only considers the initial form items. If the form defines a form-level grammar, an initial element can output a rather open prompt, and the form-level grammar is used to map the user's utterance to one or many information slots defined by the field elements.

Figure 2. Essential VoiceXML2.1 elements (multiplicities in braces):

vxml {1}
  form {1,n}
    control form items:
      initial {1,n}: First elements to be visited when a form is processed. Initial elements are eligible to selection until any form-item variable is set.
      block {1,n}: Sequences of procedural statements.
    input form items:
      field {1,n}: A piece of data to be filled in this form.
        grammar {1,n}: Field-level grammar, only active in this field.
        catch {1,n}: Field-scoped event-handlers.
        filled {1,n}: Block to be executed when the field item variable was set.
        prompt {1,n}: Text to be spoken via TTS when the field is visited.
      object {1,n}: Bridge to native functionality.
      subdialog {1,n}: Subdialogs invoke forms in a new context and return upon their completion.
    other items:
      grammar {1,n}: Form-level grammar, active in any field or initial element unless otherwise specified.
      catch {1,n}: Dialog-scoped event-handlers.
      filled {1,n}: Block to be executed when a given combination of form-level variables is set.
  link {1,n}: Redirect form processing to a new form.
  grammar {1,n}: Document-level grammar, active in any field or initial element unless otherwise specified.
  script {1,n}: ECMAScript to be executed when document is initialized.
  catch {1,n}: Document-scoped event-handlers.

Usually, SISR⁶ embedded in the form-level SRGS⁷ grammar is used to provide a semantic interpretation of the recognition result. Information slots not filled during the initial utterance are then explicitly queried for.
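As an illustration of this mixed-initiative pattern (a minimal sketch, not from the standard; the grammar URI and prompts are invented), a form-level grammar together with an initial element allows a single utterance such as "from Frankfurt to Berlin" to fill both fields at once, while a partial answer leaves the remaining field to be queried explicitly:

<form id="travel">
  <grammar src="travel.grxml" />  <!-- form-level grammar with SISR tags -->
  <initial name="start">
    <prompt>How do you want to travel?</prompt>
  </initial>
  <field name="origin">
    <prompt>Where do you start?</prompt>
  </field>
  <field name="destination">
    <prompt>Where do you want to go?</prompt>
  </field>
</form>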

Throughout most of a VoiceXML document, ECMAScript is available to perform computations or adapt the selection order of form items. ECMAScript variables within a VoiceXML interpreter are organized into a scope stack, from SESSION to ANONYMOUS (see figure 3). A scope is entered when the interpreter processes a corresponding element and left when processing of that element is completed, in which case all variables of the scope vanish.
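For instance (a minimal sketch using the standard var and assign elements; all variable names are invented), variables can be declared at different levels of the stack:

<vxml version="2.1">
  <var name="docVar" expr="0" />        <!-- document scope -->
  <form>
    <var name="dialogVar" expr="0" />   <!-- dialog scope; vanishes when the form is left -->
    <block>
      <var name="blockVar" expr="0" />  <!-- anonymous scope of this block -->
      <assign name="docVar" expr="docVar + 1" />
    </block>
  </form>
</vxml>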

These concepts suffice to explain our approach. There are other concepts within VoiceXML, e.g. with regard to multi-document voice applications, call transfer control for telephony and touch-tone recognition, but we consider these out of scope for motivating our approach.

⁴ There is a menu element as well, but it is just syntactic sugar for a form tailored to navigation.

⁵ Information slots can also be filled via DTMF, but we consider this to be out of scope.

⁶ http://www.w3.org/TR/semantic-interpretation/
⁷ http://www.w3.org/TR/speech-grammar/


Session: Persistent during all interactions between a user and a VoiceXML interpreter runtime.
Application: Persistent for all interactions with a VoiceXML application as a set of leaf-documents and a root document.
Document: Persistent during all processing with a single VoiceXML document.
Dialog: Persistent during processing of a single form.
Anonymous: Persistent during processing of a block, filled or catch element.

Figure 3. Scope hierarchy (adapted from Oshry et al., 2007).

Form Interpretation Algorithm

When a VoiceXML document is processed, interpretation will eventually reach a form element, either the one specified in the request URL per fragment identifier, or the first one in document order. The algorithm responsible for the order of interpretation of the contained form items is the form interpretation algorithm (FIA).

Every form item has a name attribute, either explicitly set by the dialog's author or automatically assigned to an inaccessible internal value. Form items can also have a condition as an ECMAScript expression in the item's cond attribute. These form what is known in VoiceXML as the guard condition (listing 1) of a form item and determine its selectability.

FUNC: isSelectable(formItem)
  IF ECMAVarDefined(formItem.name) THEN
    RETURN false
  END IF
  IF XMLAttrSet(formItem.cond) THEN
    IF ECMAEval(formItem.cond) == TRUE THEN
      RETURN true;
    ELSE
      RETURN false;
    END IF
  END IF
  RETURN true;

Listing 1. Pseudocode to determine whether a form item is selectable.

The FIA operates in iterations, where each iteration ends when the first selectable form item was visited and its variable was set. After an iteration, the FIA starts processing anew with the first form item in document order. This is important, as processing starts from the beginning of the form and does not continue with subsequent items after one was visited. Additionally, the clear element is available within a form item to reset any or all form item variables. Form processing is finished when there are no more selectable items within an iteration, some element explicitly ends form processing, or the user hangs up.

<form>
  <block name="item1" cond="item2" />
  <field name="item2" />
  <block cond="item1">
    <clear />  <!-- reset all form item variables -->
  </block>
</form>

Listing 2. Example form for selection order.

We would like to introduce the notation in table 1 to illustrate the form interpretation and item selection order of the example form in listing 2. The rows labeled FIAn denote the nth iteration of form interpretation by the FIA. The first few columns represent the different form items, with their negated form item variable name and eventual condition, in document order. If their conjunction is true (indicated by an x), the FIA will deem the item selectable and visit the left-most one. For each iteration of the FIA, the actually visited form item is given in the last column by its name. If a form item does not specify a name, as with the block in the third form item, we denote its name as the element name with its index from the document order.

       !item1 ∧ item2   !item2   !block3 ∧ item1   visited
FIA1          -            x            -           item2
FIA2          x            -            x           item1
FIA3          -            -            x           block3
FIA4          -            x            -           item2

Table 1. Processing of the form in listing 2.

Events

In VoiceXML2.1, the implementation platform as well as the document can throw events. Events are identified by character strings and carry an optional message with some human readable explanation. To throw events, the throw element is used, with the event's character string in the event attribute. Catching events is done by using the catch element with the event's character string in its event attribute. Catch elements are inherited from containing elements "as if by copy", that is, a document-scoped catch element is implicitly available at the form and field level and has access to the variables defined in that scope.
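For illustration (a minimal sketch using only elements and shadow variables defined by the standard; the event name itself is invented), a custom event thrown in a block and handled by a form-level catch element:

<form>
  <block>
    <throw event="org.example.custom" message="something happened" />
  </block>
  <catch event="org.example.custom">
    <!-- _message holds the optional message of the caught event -->
    <prompt>Caught an event: <value expr="_message" /></prompt>
  </catch>
</form>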


Extending VoiceXML

With the expressiveness and core concepts of VoiceXML2.1 outlined, we now would like to present our approach of extending the language to improve support for modeling dialogs for pervasive environments.

As motivated in the introduction, a key feature missing from VoiceXML is the ability to react to events from the pervasive environment. In other words, we need to provide the means to "push information into a VoiceXML session" and have the interpreter react. Furthermore, pulling information into VoiceXML, while possible, will result in variables and object models which contain the values as they were when the pull was performed. Having environment synchronous variables avoids some complicated form constructs with an ever-polling FIA.

The other VoiceXML extensions motivated in the introduction (pushing and pulling information from VoiceXML), while certainly possible with our overall approach, are not yet implemented, as we focused on augmenting the VoiceXML runtime with events and environment synchronous variables.

One design consideration we committed to was not to change any of the syntax or semantics of VoiceXML2.1, as we wanted our approach to be attachable to any existing VoiceXML document, so one could, to some degree, add pervasive behavior as an afterthought.

Approach

To implement our approach, we extended JVoiceXML with ECMAScript variables reflecting an environmental object model, containing various information about the current user and her surroundings, and with the possibility to register ECMAScript functions as event-handlers for external events.

At the moment, our environmental object model contains information about the user and the available audio in- and output devices (see listing 3).

"env" : {

/ / Concurrency r e l a t e d v a r i a b l e s"wait" : Access b loc ks u n t i l an event occured ,"lock" : P r o c e s s i n g of e v e n t s i s suspended ,"unlock" : P r o c e s s i n g of e v e n t s i s resumed ,

/ / Model o f t h e c u r r e n t u s e r"user" : {"location" : {"latitude" = WGS84 Lat i tude ,"longitude" = WGS84 Longitude ,"altitude" = Meters above sea l e v e l

}

} ,

/ / A l l a v a i l a b l e d e v i c e s"devices" : ["default" : {"info" : {"model" : Device Model ,. . .

} ,"sink" : Speaker objec t ,"source" : Microphone objec t ,"location" : {"latitude" = WGS84 Lat i tude ,"longitude" = WGS84 Longitude ,"altitude" = Meters above sea l e v e l

}

"orientation" : {"pitch" = [ − 1 . . 1 ] ,"roll" = [ − 1 . . 1 ]"yaw" = [ − 1 . . 1 ] ,

} ,} ,. . .

]

/ / C u r r e n t I /O d e v i c e s"sink" : The current s ink of audio"source" : The current source o f audio

}

Listing 3 Pseudo-JSON notation of the environment objectmodel.

Most variables are read-only, but others can be modified to trigger changes in the environment. For instance, to use another output device, the env.sink attribute can be set to the sink attribute of any of the available devices.
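A hypothetical usage example (our own sketch, assuming the object model from listing 3; the iteration style over the devices collection is an assumption):

<script>
  // Route audio output to the first non-default device that provides a speaker.
  for (var name in env.devices) {
    var dev = env.devices[name];
    if (name != "default" && dev.sink != null) {
      env.sink = dev.sink;  // writable: triggers a change in the environment
      break;
    }
  }
</script>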

In addition to the information in the object model, a dialog author can register ECMAScript functions as callbacks, which will be called when various events occur within the environment.

// Appearance and disappearance of devices
env.devices.added = function(dev) { ... }
env.devices.removed = function(dev) { ... }

// Interaction on devices
dev.raisedToTalk = function() { ... }
dev.pushedToTalk = function() { ... }

Listing 4. Available hooks for callbacks in the environment model.

Many more callbacks and variables are conceivable; we chose the given set to illustrate the approach and give an impression of its capabilities.
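As a usage sketch (our own example on top of the hooks from listing 4, not part of the implementation), a handler that falls back to the default device when the currently used device disappears:

env.devices.removed = function(dev) {
  // Fall back to the default device if the vanished device was in use.
  if (env.sink == dev.sink) {
    env.sink = env.devices["default"].sink;
  }
  if (env.source == dev.source) {
    env.source = env.devices["default"].source;
  }
};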

The insertion of the ECMAScript object model conforms to the scope mechanism as it is defined by the VoiceXML specification (Oshry et al., 2007). In order to introduce a new object into a JVoiceXML interpreter runtime, we register a new class as a VariableProvider in the VoiceXML SESSION scope.

<beans:bean id="VariableProviders">
  <beans:constructor-arg
    type="java.lang.String"
    value="SESSION" />
  <beans:property name="containers">
    <beans:map>
      <beans:entry
        key="env"
        value="EnvContainer" />
    </beans:map>
  </beans:property>
</beans:bean>

Listing 5. Pseudo XML configuration for ECMAScript variables in JVoiceXML.

The ECMAScript class EnvContainer is implemented as a ScriptableObject in Rhino⁸, the ECMAScript runtime used within JVoiceXML. When a session with the VoiceXML interpreter is started, a new EnvContainer is instantiated and its ECMAScript attributes and functions are provided to the dialog author through the env object.
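Conceptually (a sketch in plain ECMAScript, not the actual Java implementation; the middleware object is a stand-in for the state kept by the EnvContainer), the container behaves as if each property were backed by a getter that is evaluated at access time, so reads always return current values:

var middleware = { currentSink: "speaker-1" };  // stand-in for middleware state
var env = {};
env.__defineGetter__("sink", function() {
  return middleware.currentSink;  // evaluated anew at every access
});

middleware.currentSink = "speaker-2";
// env.sink now reads "speaker-2": the value is current at access time.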

Within the EnvContainer we employ MundoCore (Aitenbichler, Kangasharju, & Mühlhäuser, 2007) as our pervasive environment middleware, allowing us to connect to a wide range of sensors to retrieve the state of the user's situation. At the moment, the location and orientation data is retrieved from the CoreLocation framework as it can be found on current Macintosh and iOS platforms. Alternatively, a plethora of other location tracking systems with implementations in MundoCore are available as well (Aitenbichler, Lyardet, Hadjakos, & Mühlhäuser, 2009). As a platform-independent implementation of the publish/subscribe paradigm with remote procedure calls, MundoCore is suited to refine the provided object model even further in the future.

The overall approach is somewhat comparable to the transition from HTML/CGI to DHTML/AJAX, as it allows more application logic to be performed within the client (see figure 4). One consequence is that the number of required request/response cycles is lowered and a VoiceXML document gets a more permanent character. Where prior approaches relied on dynamic information to be included at document creation-time by the server (see figure 5), we enable the runtime in the client to react to these changes, thereby providing a dialog author with a central place to model most of the dialog logic.

In the following sections, we will outline the advantages of our approach and provide a set of idioms to utilize these extensions to model dynamic dialogs with VoiceXML.

Pulling Information into VoiceXML

The established way to pull information into a running VoiceXML session is to use the data element, as it was added in version 2.1 of the standard. It is available to fetch some XML document at a given URL and transform it into an ECMAScript object model. The element can be used within any block of procedural code, like the block or initial element. Another, less obvious way is to use the object input item element to call some object's method provided within the VoiceXML interpreter runtime and have it return an ECMAScript object model. As an input form item, it is only usable as a direct child of a form element.

Both approaches have in common that the returned object model contains the values as they were when the request was performed. With our approach, every access to e.g. env.user.location will return environment synchronous values, not the value from when the env object was fetched.

Figure 4. Handling events and using dynamic information from ECMAScript variables. [Sequence diagram: the VoiceXML browser fetches the document root from the document server and processes forms; while processing, it queries the pervasive middleware directly and receives asynchronous events from it; only completed forms are submitted to the server, which generates a response document from the form values.]

We achieve this by having our middleware publish relevant changes from remote devices and update the values in the EnvContainer, or by triggering remote procedure calls within its ECMAScript accessor methods for Rhino. This enables us to make use of current information, e.g. in the cond attribute of a form item. Modeling this technique with the data or object element would have required updating all relevant variables in a dedicated block or object form item for every other FIA iteration, taking special care that the form item is selected as per guard condition.
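For example (a hypothetical helper, assuming the location model from listing 3; the coordinates and the planar distance approximation are invented for illustration), a document-level script can define the isInKitchen predicate later used in the guard conditions of listing 7. Since env.user.location is environment synchronous, the predicate is evaluated against fresh data in every FIA iteration:

<script>
  // True when the user is within ~2 meters of a fixed kitchen
  // reference point (planar approximation, fine at room scale).
  function isInKitchen(loc) {
    var kitchen = { latitude: 49.8728, longitude: 8.6512 };
    var dLat = (loc.latitude - kitchen.latitude) * 111320;  // meters per degree latitude
    var dLon = (loc.longitude - kitchen.longitude) * 71550; // meters per degree longitude, approx. at 50° N
    return Math.sqrt(dLat * dLat + dLon * dLon) < 2.0;
  }
</script>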

Pushing Information into VoiceXML

Currently, there is no language support to push information into a VoiceXML session and have it react in some way. A common work-around is to create the VoiceXML documents dynamically on a document server and only send a single form, which will be submitted to trigger the generation of the subsequent form (see figure 5). This enables the inclusion of dynamic information at creation time, with potentially plenty of document creations during a dialog, but it will not permit reacting to events during form processing.

Another work-around is to abuse the FIA to continuously poll for changes in the environment, either via the data or the object element. But this approach will lead to what is known as a "busy wait", essentially saturating one processing unit with wasteful polling.

With our approach, we provide hooks within the env object model for dialog authors to register ECMAScript functions to be called when events occur in the pervasive environment. This allows for flexible handling of the event, even during form processing. Within the event handlers, a dialog author can throw VoiceXML events, set variables in the document scope or modify read-write properties of the environment object model (see listing 6).

⁸ http://www.mozilla.org/rhino/


Figure 5. Dynamically generating a response document on the server after form submit. [Sequence diagram: the VoiceXML browser fetches the document root, processes the form and submits it to the document server; the server queries the pervasive middleware for the environment state and its asynchronous events while generating the response document it sends back.]

<vxml>
  <script>
    env.devices.added = function(dev) {
      if (dev.sink != null) {
        env.sink = dev.sink;
      }
      if (dev.source != null) {
        env.source = dev.source;
      }
      env.throw("recog.suspend");
    };
  </script>
  ...
</vxml>

Listing 6. Event handler to instantaneously use a new interaction device.

Enabling event-handlers to throw events, just as the VoiceXML platform does, enables e.g. the suspension of the recognition process when the FIA is currently visiting a field form item. When combined with the setting and unsetting of variables used in the guard conditions of the form's items, dynamic dialogs become possible.

Adapting the Dialog

The transitioning through the dialog states, and therefore the appearance of the dialog as a sequence of system prompts and recognized user utterances, is almost exclusively determined by the FIA. It is the FIA which selects form input items to prompt for and form items that trigger a transition to other forms. Therefore, to adapt the dialog is to control the FIA's selection of form items and eventually to interrupt the recognizer in a field element.

The simplest approach is to use information from the environmental object model in an item's guard condition, together with an ECMAScript function as a predicate to transform a measured value into some suitable boolean expression. Another possibility is to control item selection from within the ECMAScript event handlers by modifying document-scoped variables and using these as part of an item's guard condition. Listing 7 gives an example of both techniques. For the scenario, imagine a user who is moving into the kitchen, where he picks up a dedicated interaction device to order some food.

<vxml>
  <script>
    var devAdded = null;
    env.devices.added = function(dev) {
      devAdded = dev;
      env.throw("recog.suspend");
    };
  </script>
  <form>
    <field name="newDev" cond="devAdded">
      <grammar src="builtin:grammar/boolean" />
      <prompt>Use the new device?</prompt>
      <filled>
        <script>
          if (newDev == "true") {
            env.sink = devAdded.sink;
            env.source = devAdded.source;
            devAdded = null;
          }
        </script>
        <prompt>Acknowledged</prompt>
        <clear namelist="newDev" />
      </filled>
    </field>

    <field
      name="orderFood"
      cond="isInKitchen(env.user.location)">
      <grammar src="orderFood.grxml" />
      <filled>
        <submit
          namelist="orderFood"
          next="..." />
      </filled>
    </field>

    <block
      name="wait"
      cond="!isInKitchen(env.user.location)">
      <script>env.wait;</script>
      <clear namelist="wait" />
    </block>

    <block>
      <clear />
    </block>

    <catch event="recog.suspend" />
  </form>
</vxml>

Listing 7. Adapting the dialog with event-handlers and information from the environment.

       !newDev ∧ devAdded   !orderFood ∧ isInKitchen   !wait ∧ !isInKitchen   visited
FIA1            -                      -                         x             wait
  (event: unrelated)
FIA2            -                      -                         x             wait
  (event: moved into kitchen)
FIA3            -                      x                         -             orderFood
  (event: env.devices.added)
FIA4            x                      x                         -             newDev
FIA5            -                      x                         -             orderFood

Table 2. Processing of the form in listing 7.

First, the user is neither in the kitchen, nor was a new interaction device introduced; therefore, the FIA is suspended in its first iteration within the access to env.wait in the block named wait until events arrive. Even if there are subsequent events, causing the FIA to resume, it will always return to the wait block, as it remains the first selectable form item in document order. Eventually, the user will have moved into the kitchen, and the orderFood field will become selectable and visited. Normally, the user would now interact with the system, but for this example, the user activates a new interaction device, which causes the registered event handler to set the document-scoped variable devAdded to the ECMAScript object of the device and throw recog.suspend. This will cause the interpreter to exit the orderFood item and enter the form-level catch element. With nothing specified there, the FIA starts its next iteration, this time selecting the newDev field, prompting the user whether or not she wants to use the new device. Either way, orderFood is selected again in the next iteration.

This example also illustrates a remaining problem. When the form is submitted within the orderFood item, a request for a new VoiceXML document is sent to the document server, causing a request/reply cycle and, most importantly, the loss of all variable values not in session or application scope. While this can always be mitigated by declaring persistent variables in one of these two scopes, it is not in line with the approach of having a persistent VoiceXML document to model most, if not all, of the dialogs in a pervasive environment.

This is related to the third problem we motivated in the introduction and could be mitigated either by introducing an explicit ECMAScript representation for the remote system, or by providing an ECMAScript function to publish a serialized object model to subscribed systems. However, we have not yet implemented such a functionality.

Locking the Environment

Allowing asynchronous processing of events arriving from the pervasive environment introduces potential race conditions and requires explicit handling of locking for areas of mutual exclusion. While the VoiceXML standard was explicitly designed not to burden a dialog author with these aspects, we feel that making them explicit enables a more refined management of potential dialog interruptions.

With our implementation, event processing is suspended for the current session whenever the ECMAScript variable env.lock is accessed, and resumed by accessing env.unlock. Without locking, there are potentially many threads running in the ECMAScript runtime at once, concurrently modifying variables and calling functions, which may lead to undefined behavior. Therefore, special care has to be taken not to handle events, e.g. while evaluating a script element.
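As a usage sketch (our own example, assuming the locking semantics described above), a script element can guard a consistent read of several environment values:

<script>
  env.lock;                     // suspend event processing
  var loc = env.user.location;  // these reads are now not interleaved
  var out = env.sink;           // with concurrent event handlers
  env.unlock;                   // resume event processing
</script>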

Another aspect of explicit locking is to manage areas where potential dialog interruptions are allowed and others where events are not immediately processed. The following sections give an idea of suitable idioms for different locking scenarios.

Locking all of the Form. The easiest approach with regard to locking is to suspend all event processing within a form. With such an approach, events will only be processed when the VoiceXML runtime transitions to another form:

<form>
  <block name="locked">
    <script>env.lock;</script>
  </block>

  ...

  <block name="unlocked" cond="locked">
    <script>env.unlock;</script>
  </block>
</form>

Listing 8. VoiceXML idiom to suspend event processing in a whole form.

The form interpretation order of the VoiceXML fragment in listing 8 is depicted in table 3.

With regard to the ability to react to events, this approach is functionally equivalent to submitting the form at the end and having the server generate a new document incorporating all changes due to pending events from the environment.

       !locked   ...   !unlocked ∧ locked   islocked?   visited
FIA1      x       ?            -                -        locked
FIA2      -       ?            x                x        ?
FIA3      -       -            x                x        unlocked
FIA4      -       -            -                -        exit form

Table 3. Processing of a locked form.

Locking Individual Form Items. In order to lock individual form items, we need a way to access env.lock when an item was selected and is about to be visited, and to access env.unlock when the interpreter is about to leave the item to start the next iteration. We can make use of the lazy evaluation strategy of the ECMAScript expression within a form item's cond attribute to lock the environment just as the interpreter is about to visit the item. This relies on the unspecified behavior that an item's cond attribute is only checked if the item's form item variable is still undefined as part of the guard condition algorithm of the FIA, and not the other way around. To unlock the environment again, we can just use a script element as the last element in the item's filled and eventual catch elements:

<form>
  <field cond="someCond && env.lock">
    ...
    <filled>
      ...
      <script>env.unlock</script>
    </filled>
  </field>
</form>

Listing 9. VoiceXML idiom to suspend event processing in a single item.

With this idiom, event processing will only take place when the FIA selects an unlocked item or when an iteration is restarted after a locked item was visited. It is still possible, though, that event processing will take place in between the selection of two locked items.

Locking Areas in the Form. When processing of events and eventual dialog interruptions should not occur during the collection of a set of form items, the locking of individual items is insufficient, as events may still arrive in between the form items, just when the FIA is about to select another locked item. Here, we need an idiom to lock a whole area within a form. We can rely on the fact that the FIA starts each iteration from the top of the form and introduce a dependency in an item's cond attribute on an item which suspended event processing (see listing 10 and table 4):

<form>
  <block name="locked">
    <script>env.lock;</script>
    <clear namelist="unlocked" />
  </block>

  <field
    name="protected1"
    cond="locked">
  </field>

  <field
    name="protected2"
    cond="locked">
  </field>

  <block name="unlocked" cond="locked">
    <script>env.unlock;</script>
    <clear namelist="locked" />
  </block>
</form>

Listing 10. VoiceXML idiom to suspend event processing in an area.

       !locked   !protected1 ∧ locked   !protected2 ∧ locked   !unlocked ∧ locked   islocked?   visited
FIA1      x               -                      -                     -                -        locked
FIA2      -               x                      x                     x                x        protected1
FIA3      -               -                      x                     x                x        protected2
FIA4      -               -                      -                     x                x        unlocked
FIA5      x               -                      -                     -                -        locked

Table 4. Processing of locked areas.

Remaining Problems

In this section, we would like to discuss some remaining problems and deficiencies of our approach.

The explicit management of areas of suspended event processing via locks is still somewhat complicated. Introducing implicit locks within the VoiceXML interpreter, e.g. for script elements, would lessen the burden for a dialog author to coordinate the concurrent threads active in the ECMAScript runtime at the same time. But this would require adaptations to the actual VoiceXML interpreter, something we have tried to avoid so far. Furthermore, identifying areas of uninterrupted interactions would still require explicit locking.

The adaptation of the dialog in event handlers by modifying ECMAScript variables used in a form item's cond attribute feels rather oblique. It would be more straight-forward to enable the explicit selection of form items. This can't be helped without altering the semantics of form processing, as it would require adapting or even abandoning the FIA.

Handling events in ECMAScript functions rather than catch elements feels somewhat out of place with the specification; nevertheless, it is still possible to throw events within an event-handler, and a dialog author could move the majority of event handling logic into a catch element. As such, we do not conceive it as a real shortcoming, as it allows minor events to be handled in ECMAScript only.

Some of the idioms outlined above do not scale well. For instance, it is impractical to include form items to handle the addition and removal of devices in every form. It might be beneficial to move the handling of global aspects into dedicated subdialogs and call these within form items or catch elements.
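One conceivable shape of such an idiom (our own sketch, not an implemented feature; the file name and form id are invented) is a guarded subdialog form item that handles newly added devices whenever the document-scoped devAdded variable from listing 7 is set:

<form id="main">
  <subdialog name="handleDev" src="globalAspects.vxml#newDevice" cond="devAdded">
    <filled>
      <clear namelist="handleDev" />  <!-- allow the subdialog to run again -->
    </filled>
  </subdialog>
  ...
</form>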

VoiceXML 3.0

Version 3.0 of the VoiceXML standard (McGlashan et al., 2010) will modularize the overall language and gather functionality into profiles. There will be a "2.1" profile, incorporating all language features of the current standard as modules. Additional functionality will be provided in other modules. The VoiceXML3.0 standard is still very much work in progress at the moment, but the requirements and the current version indicate that asynchronous events from external entities will be part of the language. In contrast to our approach, external events will indeed be handled in catch elements exclusively, and the payload associated with an event will be available in a global variable. It is not clear yet what will happen when such an event handler throws a recog.suspend message, e.g. to interrupt the collection of a selected field. In addition to asynchronous events, a new language element send will be available to push object models to remote entities, thus alleviating the problems associated with the behavior of the submit element, pertaining to the third problem motivated in the introduction. Environment synchronous object models will not be part of the language, and an approach similar to the one presented in this article would be needed. We do, however, feel that VoiceXML3.0 will provide a more flexible platform, with modules as natural extension points for additional language features, and many aspects of our approach could have been modeled more straight-forwardly. We look forward to its eventual recommendation status and will continue our work with VoiceXML3.0 once it is finalized.

Conclusion & Outlook

Extending the expressiveness of VoiceXML2.1 by introducing missing language features through its ECMAScript implementation is a flexible way to experiment with suitable idioms to model adaptive dialogs. Introducing asynchronous events arriving from external systems, and environment synchronous object models with ever-current values, is an important part of a VoiceXML runtime suitable for pervasive environments. It is our hope that further adaptations of VoiceXML, in line with the HTML/CGI to DHTML/AJAX transition for web applications, will eventually lead to a VoiceXML platform with the majority of dialog logic modeled in the VoiceXML document and a more persistent character of these documents. Once the VoiceXML3.0 standard is finalized, we plan to incorporate our additions as modules to hide some of the complexities with regard to locking and dialog adaptations.

References

Aitenbichler, E., Kangasharju, J., & Mühlhäuser, M. (2007). MundoCore: A light-weight infrastructure for pervasive computing. Pervasive and Mobile Computing, 332–361. doi:10.1016/j.pmcj.2007.04.002

Aitenbichler, E., Lyardet, F., Hadjakos, A., & Mühlhäuser, M. (2009). Fine-grained evaluation of local positioning systems for specific target applications. In Proceedings of the 6th International Conference on Ubiquitous Intelligence and Computing (pp. 236–250). Berlin, Heidelberg: Springer-Verlag. Available from http://dx.doi.org/10.1007/978-3-642-02830-4_19

Albertsen, T., & Blomqvist, E. (2007). Describing ontology applications. In ESWC '07: Proceedings of the 4th European Conference on the Semantic Web (pp. 549–563). Berlin, Heidelberg: Springer-Verlag.

Dimopulos, T., Albayrak, S., Engelbrecht, K., Lehmann, G., & Möller, S. (2007). Enhancing the flexibility of a multimodal smart home environment. Fortschritte der Akustik, 33(2), 639.

Heinroth, T., Denich, D., & Schmitt, A. (2010, March). OwlSpeak - adaptive spoken dialogue within intelligent environments. In 8th IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops) (pp. 666–671). Available from http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5470518

Larsson, S., & Traum, D. (2000). Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering, 6(3&4), 323–340.

McGlashan, S., Burnett, D. C., Akolkar, R., Auburn, R., Baggia, P., Barnett, J., et al. (2010, December). Voice Extensible Markup Language (VoiceXML) Version 3.0, W3C Working Draft. Available from http://www.w3.org/TR/voicexml30/

McGlashan, S., Burnett, D. C., Carter, J., Danielsen, P., Ferrans, J., Hunt, A., et al. (2004, March). Voice Extensible Markup Language (VoiceXML) Version 2.0, W3C Recommendation. Available from http://www.w3.org/TR/voicexml20/

Niklfeld, G., Finan, R., & Pucher, M. (2001). Architecture for adaptive multimodal dialog systems based on VoiceXML. In Proceedings of Eurospeech 2001.

Nyberg, E., Mitamura, T., & Hataoka, N. (2002). DialogXML: Extending VoiceXML for dynamic dialog management. In Proceedings of the Second International Conference on Human Language Technology Research (pp. 298–302). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. Available from http://dl.acm.org/citation.cfm?id=1289189.1289215

Obuchi, Y., Nyberg, E., Mitamura, T., Judy, S., Duggan, M., & Hataoka, N. (2005). Robust dialog management architecture using VoiceXML for car telematics systems. In H. Abut, J. H. Hansen, & K. Takeda (Eds.), DSP for In-Vehicle and Mobile Systems (pp. 83–96). Springer US.

Oshry, M., Auburn, R., Baggia, P., Bodell, M., Burke, D., Burnett, D. C., et al. (2007, June). Voice Extensible Markup Language (VoiceXML) Version 2.1, W3C Recommendation. Available from http://www.w3.org/TR/voicexml21/

Pakucs, B. (2002). VoiceXML-based dynamic plug and play dialogue management for mobile environments. In Proceedings of the ISCA Tutorial and Research Workshop on Multi-Modal Dialogue in Mobile Environments, Kloster Irsee.

Radomski, S., & Schnelle-Walka, D. (2010, June). Pervasive speech API demo. In WoWMoM 2010 Conference Proceedings.

Ranganathan, A., Chetan, S., Al-Muhtadi, J., Campbell, R., & Mickunas, M. (2005). Olympus: A high-level programming model for pervasive computing environments.

Sawhney, N., & Schmandt, C. (2000). Nomadic radio: Speech and audio interaction for contextual messaging in nomadic environments. ACM Transactions on Computer-Human Interaction (TOCHI), 7(3), 353–383.

Schnelle-Walka, D., Arndt, J., & Feldes, S. (2011, February). Towards mixed-initiative concepts in smart environments. In Proceedings of the Workshop on Interacting with Smart Objects.

Schnelle-Walka, D., & Radomski, S. (2010, September). An API for voice user interfaces in pervasive environments. In SiMPE 2010 Conference Proceedings.

Sonntag, D., Sonnenberg, G., Nesselrath, R., & Herzog, G. (2009). Supporting a rapid dialogue engineering process. In Proceedings of the First International Workshop on Spoken Dialogue Systems Technology (IWSDS).

Sukaviriya, N., Kovacevic, S., Foley, J., Myers, B., Olsen Jr., D., & Schneider-Hufschmidt, M. (1994). Model-based user interfaces: What are they and why should we care? In Proceedings of the 7th Annual ACM Symposium on User Interface Software and Technology (pp. 133–135).

Szekely, P. (1996). Retrospective and challenges for model-based interface development. In Design, Specification and Verification of Interactive Systems (Vol. 96, pp. 1–27).

Turunen, M. (2004). Jaspis - a spoken dialogue architecture and its applications. PhD thesis, University of Tampere, Department of Information Studies.