Multimodal Cooperative Communication

Robbert-Jan Beun¹ and Harry Bunt²

¹ Department of Information and Computing Science, University of Utrecht, Utrecht, The Netherlands
[email protected]

² Computational Linguistics and AI Group, Tilburg University, Tilburg, The Netherlands
[email protected]

1 Introduction

When we interact with computers, we often want them to be endowed with characteristics similar to those we find in human communication and are familiar with. One of these characteristics is the ability to use a combination of various communication modalities. In everyday conversation, people effortlessly combine modalities such as speech, gestures, facial expressions, touch and sounds to express meaningful conversational contributions. Since the perceptual, cognitive and motor abilities of humans are well adapted to the real-time processing of these various modalities, we expect that including the possibility to use various modalities in interfaces may contribute to more efficient and satisfactory human-computer interaction.

It was only twenty years ago that interaction with computers was for the most part only possible through symbols that could be understood exclusively by expert users. Today we can hardly imagine that the interface once did not include the graphical apparatus of icons, buttons, pictures and diagrams that we have become so accustomed to. Clearly, the visual interactive qualities of interfaces have improved considerably, but they are still unable to utilise and integrate communication modalities as powerfully as human communication does. Commercially available interfaces are still unable to integrate speech and gestures, to adapt the modality to the circumstances of the communicative setting, or to decide in an intelligent manner whether particular information should be presented in a pictorial or textual format, spoken or written, or both.

In general, the term ‘modality’ refers to an attribute or circumstance that denotes the mode, manner or form of something. In the context of human-computer interaction, the modality of a message usually pertains to particular aspects of the surface structure or form in which information is conveyed. Message forms can be organised into a variety of physical, spatial and temporal structures. Depending on the nature of these structures, messages can, for instance, be volatile or have a more permanent character, and can be received by a particular perceptual channel. Speech and gestures, for instance, are evanescent and, if not recorded, disappear the moment they are performed; written messages, on the other hand, may persist over hundreds or even thousands of years. A user interface designer must be aware of at least some of these properties, since they may have important consequences for the quality of the communication process.

Modality should not be confused with medium. Although both notions are related to the form of the message, a medium usually refers to the various physical channels and carriers that are used to transfer information, ranging from the human perceptual channels to carriers such as coaxial cable and radio waves. A modality, on the other hand, often refers to a particular communicative system, i.e. conventions of symbols and rules to use these symbols and to express messages. Language, for instance, can be conceived as a modality, not as a medium. In some cases, however, the distinction between the two notions is rather ill-defined. Speech is in some of the literature considered a different modality than written language, where the main difference is the medium used to transfer the messages. By contrast, a particular modality is sometimes considered to derive its character from different communication forms that make use of the same medium, e.g. written language and pictures.

Systems that combine different modalities in communication are usually called ‘multimodal’. This book focuses on the use of multimodality in interface design and broaches topics such as the interpretation and production of multimodal messages by computer systems. Multimodal systems seem to derive their character from the possibility that different messages sent through two or more channels or by means of different communication forms can be integrated into a single message. For instance, the combination of pointing to a particular object and speaking the words Put that in front of the TV contains two different messages that can be integrated into a single one where the word that is assigned to the object referred to.

Multimodality often involves several media, but since a transmission medium is not always a determining factor of the modality, multimodality can also be achieved through the same medium. When different messages are transferred through different media, as in pointing and speaking, we use the term ‘diviplexing’, following Taylor (1989); if the same medium is used, for instance in written text and pictures, we use the term ‘multiplexing’. In the latter case, the two signals that carry the information of the separate messages are multiplexed into one signal, i.e. parts of one of the messages are interleaved with parts of the other message (time-division multiplexing) or the messages are simultaneously carried over different partitions of the channel (frequency-division multiplexing). Note that in order to reconstruct the original messages from a multiplexed signal, the two signals first have to be separated at the receiver’s side before they can be integrated again.
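As a rough, purely illustrative sketch of the time-division case (the chunking and tagging scheme below are our own assumptions, not part of Taylor’s account), two messages can be interleaved into one signal and separated again at the receiver’s side:

```python
# Illustrative sketch: time-division multiplexing of two messages over one channel.
# The tagging scheme and chunk size are invented for this example.

from typing import List, Tuple

def multiplex(msg_a: str, msg_b: str, chunk: int = 4) -> List[Tuple[str, str]]:
    """Interleave fixed-size chunks of two messages into a single signal."""
    signal = []
    for i in range(0, max(len(msg_a), len(msg_b)), chunk):
        if i < len(msg_a):
            signal.append(("A", msg_a[i:i + chunk]))   # part of message A
        if i < len(msg_b):
            signal.append(("B", msg_b[i:i + chunk]))   # part of message B
    return signal

def demultiplex(signal: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Separate the interleaved signal back into the two original messages."""
    msg_a = "".join(part for tag, part in signal if tag == "A")
    msg_b = "".join(part for tag, part in signal if tag == "B")
    return msg_a, msg_b

if __name__ == "__main__":
    mixed = multiplex("Put that in front of the TV", "[picture of a TV set]")
    print(demultiplex(mixed))   # the receiver first separates, then integrates
```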

Developments in interaction technology and the resulting expansion of bandwidth of interactive systems enable designers to incorporate a variety of media and modalities in the computer interface. But merely adding amazing technological feats or increasing bandwidth does not necessarily improve the communication process. A signal has to be tuned to the many aspects that play a role in the interaction, such as the characteristics of the user’s information processes, the content of the message, the task, or the communication channel. In many cases, a picture tells us more than a thousand words, but it is difficult, or even impossible, to put an abstract message such as It is possible that John walks in a picture without using symbols. A warning signal that indicates a fire in the building should be different from the signal that indicates the daily receipt of our electronic mail, and in a noisy environment, a visual signal may be preferred over an auditory one.
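A toy sketch of this kind of tuning is given below; the context attributes and selection rules are invented for illustration and are not drawn from any of the systems discussed in this book.

```python
# Illustrative sketch: choosing an output modality from simple context rules.
# The attributes, rules and return values are invented for this example.

from dataclasses import dataclass

@dataclass
class Context:
    noisy: bool             # acoustic conditions at the user's side
    urgent: bool            # e.g. a fire alarm vs. a mail notification
    abstract_content: bool  # content like "It is possible that John walks"

def choose_modality(ctx: Context) -> str:
    if ctx.abstract_content:
        return "text"                   # abstract propositions resist purely pictorial form
    if ctx.urgent:
        # an urgent warning should differ clearly from routine signals
        return "visual alarm" if ctx.noisy else "auditory alarm"
    return "visual" if ctx.noisy else "speech"

print(choose_modality(Context(noisy=True, urgent=False, abstract_content=False)))
```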

The use of more than one modality is often a cause of redundancy, which can be exploited to maximise the effectiveness of communication, as in the case of lip-reading. Perhaps even more important is that the efficiency of communication can be maximised by using different modalities simultaneously for different aspects of information. For example, when we use language to exchange information, we often do not express our emotional reactions explicitly in the language, but use gestures and facial expressions diviplexed with the language, or intonation patterns and other prosodic features in a multiplexed fashion. Feedback and other forms of dialogue control in face-to-face dialogue, such as attention monitoring, are only partly expressed linguistically, and often take a diviplexed form by nodding and looking at the dialogue partner. In written text, topic changes are often multiplexed with the spatial organisation of the words in paragraphs and sections. And pointing to an object can be much more efficient than using a complex linguistic description. The diviplexed and multiplexed use of different modalities and media is a fundamental feature of natural communication; the ‘Multimax Principle’ (Bunt, 1998) claims that people do not leave modalities unused that are available and useful in the given communication situation. (For instance, in face-to-face communication it would be very strange not to make any gestures or to keep one’s facial expression exactly the same all the time.) Since the integration of various modalities is such a prominent characteristic of human information processing, the inclusion of integrated multimodality in human-computer interfaces opens the way for more effective and efficient forms of communication.¹

Taking care of the appropriate combination of modalities can be considered a feature of cooperativeness. In general, cooperation involves joint activity and mutual consideration of goals. Focusing on conversational contributions, cooperation is often analysed in terms of the Gricean maxims (Grice, 1975), which roughly state that speakers have substantial evidence for their statements (maxim of quality), make their contributions as informative as required for the purpose of the conversation (maxim of quantity), are relevant (maxim of relevance) and behave perspicuously and economically (maxim of manner). The first three maxims usually refer to the content of the contributions, while the last one concerns the form of messages.

¹ Speaking of ‘integrated multimodality’ may seem pleonastic, since the integration of information that is exchanged using different modalities may be considered to be inherent to multimodal communication. In their seminal discussion of the concepts of multimodality and multimedia, Nigay and Coutaz (1993) take the integration of information from different channels to be one of the defining characteristics of multimodality. A case of unintegrated multimodal communication could perhaps be said to occur when one speaks to one person and winks at someone else.

Grice’s maxim of manner can in fact be seen as tying together the concepts of cooperation and multimodality. Given a particular content, a speaker is supposed to present his utterances in an effective and efficient form that supports the addressee’s understanding of the meaning of the sender’s contribution for the ongoing conversation. Hence, the sender avoids, for instance, ambiguities and obscurities, because otherwise the addressee is unable to identify the original message. Avoiding ambiguities tends to lead to articulate, explicit utterances of considerable length. On the other hand, the maxim says that the sender should be as brief and efficient as possible, so that the decoding and encoding processes of the signal can be optimised. Multimodality not only enables the sender to choose the most efficient form for a particular piece of information, it also allows the sender to transfer different pieces of information simultaneously. Participants in a cooperative dialogue may be said to pursue several goals simultaneously: on the one hand the concrete goal(s) that motivate the dialogue, and at a meta-level also the general goal of communicating successfully, which entails such subgoals as being understood correctly, obtaining reliable and relevant information, and so on, much in the spirit of the Gricean maxims. These goals can be pursued simultaneously by performing multimodal communicative acts that are multifunctional,² and exploit the use of different modalities among other things to convey different information, relating to goals at different levels.

² See Allwood (2000) and Bunt (2000) for discussions of the multifunctionality of dialogue contributions.

2 Multimodal Generation, Interpretation, Collaboration, and System Design

The chapters in this book have been grouped in four parts, which are concerned with the generation of multimodal dialogue contributions, with cooperativeness in multimodal communication, with interpretation in multimodal dialogue, and with multimodal platforms and test environments.

2.1 Multimodal Presentation and Generation

The first chapter by Donia Scott and Richard Power, Generating Textual Diagrams and Diagrammatic Texts, is concerned with the automatic coordination of diagrams and texts in electronic document generation. Scott and Power argue that there is no sharp division between text and diagrams, since texts are presented diagrammatically and, vice versa, most diagrams contain textual elements. In their paper, they take the position that diagrams can be conceived as texts with a rich graphical layout. Therefore, the production of the two communication forms is integrated and generated from a common architecture. In their approach, they extend a proposal by Nunberg who makes a distinction between text structure, realised by layout and punctuation, and the syntactic structure of sentences. Power and Scott postulate an intermediate abstract data structure that captures those features of layout and punctuation that interact with meaning. A feature that admits vertical lists, called ‘Indentation’, extends Nunberg’s text-grammar. The results are applied in the Iconoclast system, where the abstract data structure mediates between rhetorical/semantic structures and the details of the graphical layout and punctuation.
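For concreteness, the following sketch gives an invented impression of such an intermediate layer; the class names and the 'indented' attribute are our own assumptions and do not reproduce the actual Iconoclast representation.

```python
# Illustrative sketch of an abstract "text structure" layer between rhetorical
# structure and concrete layout; names and attributes are invented, not Iconoclast's.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TextNode:
    category: str                 # e.g. 'text-sentence', 'text-clause', 'paragraph'
    content: str = ""
    indented: bool = False        # a feature licensing vertical (indented) lists
    children: List["TextNode"] = field(default_factory=list)

def render(node: TextNode, depth: int = 0) -> str:
    """Map the abstract structure to simple layout decisions."""
    pad = "  " * depth if node.indented else ""
    lines = [pad + node.content] if node.content else []
    lines += [render(child, depth + 1) for child in node.children]
    return "\n".join(line for line in lines if line)

doc = TextNode("paragraph", "Take the following steps:", children=[
    TextNode("text-clause", "- insert the cartridge;", indented=True),
    TextNode("text-clause", "- close the cover.", indented=True),
])
print(render(doc))
```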

Susanne van Mulken in Chapter 2, Pedro: Assessing Presentation Decodability on the Basis of Empirically Validated Models, describes an empirical study to test hypotheses that follow from a model of the decodability of pictorial presentations of object references. The model, which is implemented in a multimodal presentation system called ‘Pedro’, exploits Bayesian networks to make predictions about how well the user will be able to decode a presentation. Three independent variables were defined: a. Similarity Advantage, concerning the resemblance of the antecedent and the presentation, b. Relative Salience, i.e. the perceptual salience of the object with respect to its surroundings, and c. Domain Expertise, which relates to the user’s knowledge of names and sizes of components in a technical device. In general, the results of the experiment strongly supported the hypothesis that the number of correct responses decreases with decreasing levels of the independent variables. It was found, however, that Relative Salience only plays a role in case of complete ambiguity, i.e. if the object could not be determined by similarity and user knowledge, and that similarity only played a role if expertise was low. The outcome of the experiment had no impact on the model that was originally implemented in Pedro.
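The toy function below is not Pedro’s Bayesian network; it merely mimics, under invented thresholds, the reported pattern that salience only matters when similarity and expertise leave the referent undecided.

```python
# Toy sketch (not Pedro's actual Bayesian network): estimate how likely a user is
# to decode a pictorial reference from three coarse variables, reflecting the
# reported pattern that salience only matters when similarity and expertise
# leave the referent ambiguous. Thresholds and probabilities are invented.

def decodability(similarity: float, salience: float, expertise: float) -> float:
    """All inputs in [0, 1]; returns a rough probability of correct decoding."""
    if expertise > 0.7:              # expert users resolve the reference by knowledge
        return 0.9
    if similarity > 0.7:             # strong resemblance to the antecedent decides it
        return 0.8
    # complete ambiguity: fall back on how much the object stands out perceptually
    return 0.3 + 0.5 * salience

print(decodability(similarity=0.2, salience=0.9, expertise=0.1))  # ~0.75
```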

In Chapter 3, Improvise: Automated Generation of Animated Graphics for Coordinated Multimedia Presentations, a generation system that automatically creates sequences of animated graphical illustrations, called Improvise, is presented by Michelle Zhou and Steven Feiner. Four features are discussed that facilitate coordinated multimedia design: a semantic model of the input data, a temporal model of visual techniques, an action-based inference engine, and an application-independent visual realiser. In order to create a wide range of visual presentations, the semantic data model contains a taxonomy of application-independent properties, such as ‘type’, ‘attribute’ and ‘relation’. Visual techniques are used to assemble a new visual presentation or to modify an existing one; in Improvise, visual techniques are extended with temporal constraints to coordinate animated graphics with other temporal media. Finally, Zhou and Feiner present the inference engine that creates the design specifications rendered by the visual realiser. In their paper, the generation and modification processes of animated visual narratives are illustrated by two examples. In the first example, a narrative is generated and combined with spoken sentences to present a hospital patient’s information to a nurse; in the second example, an existing presentation of a computer network is modified.

In Chapter 4, Multimodal Reference to Objects: An Empirical Approach, Robbert-Jan Beun and Anita Cremers present an empirical study about object reference in task dialogues. A dialogue experiment was carried out where both participants had visual as well as physical access to a shared task domain in which objects, consisting of coloured Lego blocks, had to be manipulated. Communication about these objects was possible by uttering linguistic expressions and/or non-verbal references, such as pointing. The main question was how people refer to a particular target object and why a specific surface structure is chosen. Several hypotheses were formulated on the basis of the principle of minimal cooperative effort and the assumption that dialogue participants establish different kinds of focus spaces. Beun and Cremers showed that in the dialogues obtained in the experiment most referential expressions were ambiguous with respect to the domain of discourse and that focus was an important factor in disambiguating these expressions. Also, references to objects outside the focus area were significantly more redundant than references inside the focus area. It was found that in a multimodal environment focus is not only a discourse-related phenomenon, but also depends on particular properties of the domain of conversation combined with the perceptual abilities of the dialogue partners.

2.2 Multimodal Cooperation

One of the key concepts in modelling cooperation in communication is that of communicative action. The four chapters in this part of the book are all concerned with certain aspects of the planning, expression, and interpretation of actions in multimodal communication.

Oliviero Stock, Carlo Strapparava, and Massimo Zancanaro in their chapter, Augmenting and Executing SharedPlans for Multimodal Communication, propose the adoption of an extended form of the concept of shared plans, as developed by Grosz and Kraus (1993). The theory of shared plans (and their technical form, ‘SharedPlans’) is based on distinguishing between the plans that an agent ‘knows’ (recipes for actions) and the plans that an agent constructs and adopts. The theory of shared plans attempts to model interaction as a joint activity in which the participants try to build a plan in the latter sense; the plan is shared in the sense that the participants have compatible beliefs and intentions. Stock et al. take the view that the interface in multimodal interaction has a double nature: some of the actions are intended to augment the current SharedPlan, whereas others are intended to execute the related recipe. The authors therefore propose an extension to the model of plan augmentation put forward by Lochbaum (1994), in which SharedPlans are meant not only to be augmented, but also executed. They discuss and illustrate their proposal with examples from the prototypical multimodal systems they are developing, building on the AlFresco system (Stock et al., 1993).
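The following toy fragment, with invented class and method names rather than the authors’ formalism, only illustrates the distinction between acts that augment a shared plan and acts that execute a step of its recipe.

```python
# Toy illustration of the "double nature" of interface actions: some augment the
# shared plan under construction, others execute a step of an adopted recipe.
# Class and method names are invented for this sketch.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SharedPlan:
    agreed_steps: List[str] = field(default_factory=list)   # the recipe built so far
    executed: List[str] = field(default_factory=list)

    def augment(self, step: str) -> None:
        """A communicative act that adds a mutually agreed step to the plan."""
        self.agreed_steps.append(step)

    def execute_next(self) -> None:
        """An interface act that carries out the next agreed step of the recipe."""
        if len(self.executed) < len(self.agreed_steps):
            step = self.agreed_steps[len(self.executed)]
            self.executed.append(step)
            print(f"executing: {step}")

plan = SharedPlan()
plan.augment("show the map of the exhibition")   # negotiation: the plan grows
plan.execute_next()                               # execution: the step is performed
```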

With the advent of multimodal cooperative systems, a lasting source of inspiration for the design of interactive systems will be the way multimodal cooperation is achieved in human-human dialogue. Jens Allwood’s contribution, Cooperation and Flexibility in Multimodal Communication (Chapter 6), explores some of the ways cooperativeness is multimodally manifested in natural dialogue, with a view to the design of future cooperative human-computer interfaces. After a discussion and characterisation of the notion of cooperation, he proposes to extend cooperation into a notion of ‘mutual flexibility’. An empirical study is presented of how verbal and nonverbal gestural means are used to achieve flexibility in Swedish face-to-face conversation. Allwood discusses possible implications for the design of interactive systems, which concern such questions as whether systems should be friendly; should sometimes be non-serious or vague; should be non-imposing; should give and elicit supportive or other types of feedback; should show consideration and interest; and should be able to invoke mutual awareness and belief. If the answer to any of these questions is positive, then we are well advised to look at the ways in which these features are achieved in natural human communication, in order to know what means to use for making this happen in human-computer interaction.

In Chapter 7, Communication and Manipulation in a Collaborative Dialogue Model, Martine Hurault-Plantet and Cécile Balkanski present a dialogue model that has been developed and tested in an application that simulates a telephone switchboard, with data drawn from human-human dialogues recorded at a telephone switchboard in an industrial setting. This model rests on a theory of collaborative discourse based on the mental states of dialogue agents. The theory establishes conditions on the beliefs and intentions of the agents for them to be able to cooperate. The authors show how their model allows for the treatment of both communicative acts and manipulation acts. The model is thus equipped to model cooperative human-machine communication in a multimodal context, where natural language (in this case typed rather than spoken) is used in combination with direct manipulation.

In a multimodal user interface where language plays an important part, user requests for actions to be performed by the system may conveniently take the form of an imperative expression. Moreover, in a ‘user-friendly’ interface the user should be allowed to formulate logically complex requests such as Do A1 or A2, or Do A if B. From a semantic point of view, such logically complex imperatives can lead to strange results if they are interpreted as communicative acts aiming at a situation where their propositional content is true. For instance, an implication A if B is logically equivalent to the disjunction ¬B or A; therefore a request like Hit the Enter key if you see the symbol ‘@’ could be interpreted as Hit the Enter key or do not see the symbol ‘@’ – which is intuitively wrong. Paul Piwek, in his contribution Relating Imperatives to Action, provides an analysis of the use of complex imperatives which avoids such problems by taking the influence of background information into account. The analysis is carried out within a model of communicating agents in which imperatives modify their commitments, construed in terms of beliefs combined with plans for action. This analysis explicates what it means for an agent to have a successful policy for action with respect to satisfying his commitments, where some of these commitments have been introduced as a result of imperative language use. The work reported in this chapter was carried out within the framework of the DenK multimodal dialogue project (see also Chapter 11).
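A minimal sketch of the contrast (our own illustration, not Piwek’s formal analysis): the material-implication reading counts the imperative as satisfied simply by never seeing the ‘@’ symbol, whereas a commitment-based reading triggers the action exactly when the condition is observed.

```python
# Contrast sketch: two readings of "Hit the Enter key if you see the symbol '@'".
# Invented for illustration; this is not Piwek's formal analysis.

def satisfied_as_proposition(saw_at_symbol: bool, hit_enter: bool) -> bool:
    # Material-implication reading: B -> A, i.e. (not B) or A.
    # It counts as "satisfied" merely by not seeing '@', which is intuitively
    # wrong as an account of what the imperative asks the agent to do.
    return (not saw_at_symbol) or hit_enter

def act_on_commitment(saw_at_symbol: bool) -> list:
    # Commitment-based reading: the imperative installs a conditional plan;
    # the action is performed exactly when the condition is observed.
    plan = []
    if saw_at_symbol:
        plan.append("hit Enter key")
    return plan

print(satisfied_as_proposition(saw_at_symbol=False, hit_enter=False))  # True (!)
print(act_on_commitment(saw_at_symbol=True))                           # ['hit Enter key']
```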

2.3 Multimodal Interpretation

In Chapter 9, Interpretation of Gestures and Speech: A Practical Approach to Multimodal Communication, Xavier Pouteau discusses the notion of a ‘semantic frame’ to integrate the interpretation of gestures and speech in multimodal interfaces. Pouteau distinguishes three functions of gestures: a. the epistemic function, which refers to the sense of touch, b. the ergative function, which pertains to the manipulation of objects, and c. the semiotic function, which concerns the information content of the gesture. It is argued that contemporary interfaces only use the first, i.e. touch, through the contact with keyboard and mouse, and that the other functions are neglected. In order to combine gestures with speech, Pouteau focuses on the semiotic function of gesture, in particular the notion of demonstratum for pointing. In the semantic frame, Pouteau abstracts from the actual devices that support pointing and only registers the visual elements possibly pointed at during the interaction, supplemented with information about the time and coordinates of the pointing act. This information is integrated with semantic information from the speech signal. Feedback messages are generated in case of interpretation failures.
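As an invented illustration of such a device-independent record (the field names below are assumptions, not Pouteau’s actual frame), pointing information and speech meaning might be combined as follows:

```python
# Illustrative sketch of a device-independent record of a pointing act that can be
# merged with the semantic content of the accompanying speech. Field names are
# invented for this example and do not reproduce Pouteau's actual frame.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PointingAct:
    timestamp: float                 # when the pointing occurred
    coordinates: tuple               # (x, y) on the display
    candidates: List[str]            # visual elements possibly pointed at

@dataclass
class SpeechMeaning:
    timestamp: float
    predicate: str                   # e.g. "move"
    description: Optional[str]       # e.g. "the red block"; None for bare "that"

def integrate(speech: SpeechMeaning, pointing: PointingAct) -> dict:
    """Fill the semantic frame by combining the demonstratum with the speech content."""
    referents = [c for c in pointing.candidates
                 if speech.description is None or speech.description in c]
    if not referents:
        return {"feedback": "Which object do you mean?"}   # interpretation failure
    return {"action": speech.predicate, "object": referents[0]}

speech = SpeechMeaning(timestamp=1.2, predicate="move", description=None)
point = PointingAct(timestamp=1.1, coordinates=(220, 140), candidates=["red block"])
print(integrate(speech, point))    # {'action': 'move', 'object': 'red block'}
```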

Michael Streit in his chapter Why are Multimodal Systems so Difficult to Build? - About the Difference between Deictic Gestures and Direct Manipulation, addresses the question of synchronisation in diviplexed messages. Streit concentrates on the synchronisation of speech and gestures and argues that there is a gap between empirical findings and our expectations about natural conversation. Close observation of users interacting with the multimodal systems Mofa and Talky has shown that, in contrast with human-human communication, long delays may exist between gestures and the accompanying parts of the speech. Since the linguistic part of the message may help to disambiguate the meaning of the gesture, these long delays may cause important problems for the interpretation process. Streit calls this the ‘wait problem’ and shows that the problem has no unequivocal solution. Therefore, several approaches to solving the problem are discussed, depending on the type of pointing, the pointing device, the verbal expressions and the type of object referred to. It is argued that appropriate feedback in combination with suitable pointing devices and gesture forms may reduce the problem.
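The toy fusion loop below illustrates the trade-off behind the wait problem; the timeout value and event format are invented for this sketch and are not taken from Mofa or Talky.

```python
# Toy sketch of the 'wait problem': after a pointing gesture, how long should the
# system wait for the accompanying speech before acting or asking for feedback?
# The timeout value and event format are invented for this illustration.

import time
from queue import Queue, Empty

def fuse_gesture_with_speech(speech_events: Queue, gesture, timeout_s: float = 2.0):
    """Try to pair a gesture with a speech fragment arriving within the timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            speech = speech_events.get(timeout=deadline - time.monotonic())
            return {"gesture": gesture, "speech": speech}       # integrated message
        except Empty:
            break
    # No speech arrived in time: fall back on feedback instead of guessing.
    return {"gesture": gesture, "feedback": "Please say what to do with this object."}

events = Queue()
events.put("delete that")
print(fuse_gesture_with_speech(events, gesture={"object": "icon_3"}))
```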

In Chapter 11, Multimodal Cooperative Resolution of Referential Expressions in the DenK System, Leen Kievit, Paul Piwek, Robbert-Jan Beun and Harry Bunt describe a module of the DenK system that finds the antecedents of referring expressions. The DenK system reflects a multimodal dialogue situation where users have direct access to a domain (by pointing and direct manipulation) or indirect access through natural language utterances. In the DenK system different types of context are distinguished, for instance the perceptual context (what the user can see in the application domain), the private context (the system’s knowledge of the application), the common context (beliefs shared by system and user) and the dialogue history. The question arises, first, in which context to look for a referent and, second, which object the user meant if more than one object satisfies the conditions of the expression. The resolution process is guided by the type of definiteness of the expression. For instance, the antecedent of a pronominal expression is first looked up in the dialogue history and, if no fitting antecedent can be selected, the process takes another context to find one or more suitable candidates. Within a particular context, objects have some degree of salience, and given the semantic agreement between the object and the linguistic description, only salient objects are potential candidates. An evaluation of the resolution process showed that the algorithm failed in only 5% of the cases in a corpus of 523 referential acts that were taken from various dialogue corpora. The system is furthermore equipped with mechanisms to ask clarification questions and to re-evaluate ambiguous utterances on the basis of the user’s reply.
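A toy version of such definiteness-guided search over an ordered list of contexts is sketched below; the context ordering, salience threshold and matching test are invented and do not reproduce the DenK algorithm.

```python
# Toy sketch of definiteness-guided reference resolution over an ordered list of
# contexts; the ordering, threshold and matching test are invented for illustration.

from typing import Dict, List, Optional

def resolve(description: str,
            expression_type: str,
            contexts: Dict[str, List[dict]]) -> Optional[dict]:
    # The type of definiteness determines where to look first.
    order = {
        "pronoun": ["dialogue_history", "perceptual", "common"],
        "definite": ["perceptual", "dialogue_history", "common", "private"],
    }[expression_type]
    for name in order:
        # Only sufficiently salient objects that agree with the description qualify.
        candidates = [o for o in contexts.get(name, [])
                      if o["salience"] > 0.5 and description in o["description"]]
        if len(candidates) == 1:
            return candidates[0]
        if len(candidates) > 1:
            return {"clarify": f"Which {description} do you mean?"}
    return None

contexts = {"perceptual": [{"description": "red resistor", "salience": 0.9}]}
print(resolve("resistor", "definite", contexts))
```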

2.4 Multimodal Platforms and Test Environments

In Chapter 12, The IntelliMedia WorkBench - An Environment for Building Multimodal Systems, Tom Brøndsted, Paul Mc Kevitt and others present a generic multimedia platform, called Chameleon, that can be tailored to various applications and distributed over various hardware and software platforms. An initial application of Chameleon is the IntelliMedia WorkBench, which integrates modules such as a blackboard, a dialogue manager, a domain model and various input and output modalities. The blackboard keeps a history of the interaction in terms of frames that are coded as predicate-argument structures. The dialogue manager decides which actions the system has to perform and subsequently sends the information to the output modules, i.e. a speech synthesiser and a laser pointer. The current domain model contains information about the architectural and functional layout of a building. Input modules consist of a gesture recogniser and a speech recognition system. Since different modules may be running on separate machines, modules are integrated by the so-called Dacs system, which supports communication facilities, such as synchronous and asynchronous remote procedures, by means of a demon that acts as a router for all internal traffic and establishes connections to demons on remote machines.
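As a purely illustrative sketch, interaction-history frames coded as predicate-argument structures on a blackboard might look as follows; the predicate names and fields are our own assumptions, not Chameleon’s actual frame format.

```python
# Toy sketch of interaction-history frames coded as predicate-argument structures
# on a shared blackboard; names and fields are invented for this example.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Frame:
    predicate: str                       # e.g. "pointing", "utterance", "output"
    arguments: Tuple                     # the predicate's arguments
    timestamp: float

@dataclass
class Blackboard:
    history: List[Frame] = field(default_factory=list)

    def post(self, frame: Frame) -> None:
        self.history.append(frame)       # every module writes its results here

    def latest(self, predicate: str) -> Frame:
        return next(f for f in reversed(self.history) if f.predicate == predicate)

bb = Blackboard()
bb.post(Frame("pointing", ("office_a204",), timestamp=0.8))
bb.post(Frame("utterance", ("whose office is this",), timestamp=1.0))
# A dialogue manager could now combine the two most recent frames into an answer.
print(bb.latest("pointing").arguments)
```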

In the last chapter, A Unified Framework for Constructing Multimodal Experiments and Applications, Adam Cheyer, Luc Julia and Jean-Claude Martin discuss a prototype system that enables users to interact with a map display by means of a combination of pen and voice input. The system, which will be used as a platform for Wizard-of-Oz (WOZ) experiments, offers facilities for a multimodal interface on which the user may draw, write or speak about a travel-planning domain. The approach employs an agent-based framework to coordinate distributed information sources that work in parallel to resolve different types of ambiguities in the interpretation process. Since multiple users are allowed to share a common workspace and the interface can be configured on a per-user basis to include more or fewer GUI controls, the system already offers important facilities for the fully automated application part of a WOZ experiment. The idea is that the Wizard will use the full functionality of the automated system, while the user interacts in an unconstrained manner. Hence, in the same experiment, information will be collected both about the most natural way to interact and about the performance of the real system. It is expected that these data can be applied directly to the evaluation and improvement of the system.

References

Allwood, J. (2000) An activity-based approach to pragmatics. In: H.C. Bunt and W.J. Black (eds.) Abduction, Belief and Context in Dialogue. Studies in Computational Pragmatics. Benjamins, Amsterdam, 47–80.

Bunt, H.C. (1998) Issues in Multimodal Human-Computer Communication. In: H.C. Bunt, R.-J. Beun and T. Borghuis (eds.) Multimodal Human-Computer Communication. Springer Verlag, Berlin, 1–12.

Bunt, H.C. (2000) Dialogue Pragmatics and Context Specification. In: H.C. Bunt and W.J. Black (eds.) Abduction, Belief and Context in Dialogue. Studies in Computational Pragmatics. Benjamins, Amsterdam, 81–150.

Grice, H.P. (1975) Logic and Conversation (from the William James Lectures, Harvard University, 1967). In: P. Cole and J. Morgan (eds.) Syntax and Semantics 3: Speech Acts. Academic Press, New York, 41–58.

Grosz, B. and Kraus, S. (1993) Collaborative plans for group activities. In: Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry, 367–373.

Lochbaum, K. (1994) Using Collaborative Plans to Model the Intentional Structure of Discourse. PhD thesis, Harvard University, Cambridge, MA.

Nigay, L. and Coutaz, J. (1993) A Design Space for Multimodal Systems: Concurrent Processing and Data Fusion. In: Proceedings of INTERCHI’93, 172–178.

Stock, O. and the AlFresco Project Team (1993) AlFresco: Enjoying the combination of NLP and hypermedia for information exploration. In: M. Maybury (ed.) Intelligent Multimodal Interfaces. MIT Press, Cambridge, MA, 197–224.

Taylor, M.M. (1989) Response Timing in Layered Protocols: A Cybernetic View of Natural Dialogue. In: M.M. Taylor, F. Néel and D.G. Bouwhuis (eds.) The Structure of Multimodal Dialogue. North-Holland, Amsterdam, 159–172.