Generating Finely Synchronized Gesture and Speech for Humanoid Robots: A Closed-Loop Approach

Maha Salem
Faculty of Technology, Bielefeld University
D-33594 Bielefeld, Germany
[email protected]

Stefan Kopp
Sociable Agents Group, Bielefeld University
D-33594 Bielefeld, Germany
[email protected]

Frank Joublin
Honda Research Institute Europe
D-63073 Offenbach, Germany
[email protected]

Abstract—To engage in natural interactions with humans, social robots should produce speech-accompanying non-verbal behaviors such as hand and arm gestures. Given the special constraints imposed by the physical properties of a humanoid robot, successful multimodal synchronization is difficult to achieve. Previous work focusing on the production of communicative robot behavior has not sufficiently addressed the challenge of speech-gesture synchrony. Introducing the first closed-loop approach to speech and gesture generation for humanoid robots, we propose a novel multimodal scheduler that comprises two features to improve the synchronization process. First, the scheduler integrates an experimentally fitted forward model at the behavior planning stage to provide a more accurate estimation of the robot’s gesture preparation time. Second, the model incorporates a feedback-based adaptation mechanism which allows for an on-line adjustment of the synchronization of the two modalities during execution. Technical results obtained with the implemented scheduler are presented to demonstrate the feasibility of our approach; empirical results from an evaluation study are further discussed to highlight the implications of the present work.

I. INTRODUCTION

Speech-accompanying hand and arm gestures are a fundamental component of human communicative behavior, frequently used to illustrate what is expressed in speech [18]. At the same time, human listeners are very attentive to information conveyed via such non-verbal behaviors [11]. Therefore, such non-verbal behaviors represent ideal candidates for extending the communicative capabilities of humanoid robots that are to interact socially with humans. In addition, providing multiple modalities helps to dissolve ambiguity and thus to increase robustness of communication.

In human communication, co-verbal gestures are tightly connected to speech as part of an integrated utterance, yielding semantic, pragmatic and temporal synchrony between both modalities [17]. A hand gesture typically consists of three phases ordered in a sequence over time to form a so-called gesture phrase: preparation, stroke, and retraction [12]. The stroke is the expressive phase of the gesture and bears meaning in relation to the corresponding part of speech, which is commonly referred to as the affiliate [18]. Based on studies of human communication (e.g., [12], [17]), gestures are found to be finely synchronized with the accompanying linguistic affiliate to convey their intended communicative meaning: the gesture stroke typically precedes or ends at – but does not follow – the stressed syllable of the affiliate. To achieve this phonological synchrony, gesture may adjust to speech, e.g., by means of adequately timed preparation or additional hold phases, or speech may adapt to gesture, e.g., by means of pauses [12]. Ensuring such co-expressive synchrony between the two modalities poses a major challenge for artificial communicators such as virtual agents or social robots. Up to the present time, most technical systems dedicated to co-verbal gesture generation merely approximate cross-modal synchronization, typically by means of one modality adjusting to the other without any mutual adaptability. However, relevant literature (e.g., [2], [3]) emphasizes the importance of accurate speech-gesture synchrony even for artificial agents, since asynchrony may result in conveying incorrect meaning. While effects of speech-gesture (a)synchrony have been studied with regard to human communication (e.g., [9]), they have not yet been sufficiently examined regarding communicative robots.

The work described in this paper contributes to and complements existing work by providing a novel multimodal scheduler for closed-loop control of communicative robot behavior. Our approach centers on the generation of deictic and representational (i.e., iconic and metaphoric) gestures as they are considered particularly communicative (see [11] for a review). The paper is organized as follows. We first discuss related work in Section 2, showing that little research has focused on the generation and evaluation of finely synchronized robot gesture and speech. In Section 3, we describe the model of our multimodal behavior scheduler which endows humanoid robots with the ability to generate finely synchronized gesture and speech. Multimodal behaviors realized in our proposed framework are then presented as technical results in Section 4. In Section 5, we evaluate the generated behaviors in an empirical study. Finally, we conclude and give an outlook on future work in Section 6.

II. RELATED WORK

A. Social Robotics

Only a few robotic systems incorporate both speech and gesture synthesis; however, in most cases the robots are equipped with a set of predefined gestures or even prerecorded behaviors that are not generated on-line but simply replayed during human-robot interaction (e.g., [24]). Crucially, to the best of our knowledge, no previous approach in robotics has realized exact, i.e., empirically sound, synchronization of robot gesture and speech. For example, the system presented by Mead et al. [19] roughly attempts to synchronize the robot’s gesture with the speech output at the phrase level rather than with the exact affiliate. In a different approach, Bremner et al. [1] predefine timings for speech-gesture synchronization based on estimations at the coding stage, resulting in gesture movements that are entirely preplanned in form and duration. During execution, movements of the different gesture phases (i.e., preparation, stroke, and retraction) are triggered by events in the speech output. Generally, in all existing robotic systems that attempt to synchronize synthetic speech and gesture, running speech completely dictates the timing of the generated gestures, while gestures, in contrast, cannot affect the produced speech. Moreover, while some approaches obtain exact speech timing information from the text-to-speech system, the timing of gesture is typically based on estimated approximations which lack accuracy. Another characteristic of related approaches is the open-loop control of the implementations: once speech and gesture output have been planned and scheduled for the robot, both modalities are generated ballistically and cannot be further adjusted at run-time, e.g., based on sensory feedback. Such unidirectional synchronization as well as the open-loop production of multimodal behaviors are characteristic of all currently existing approaches to robot gesture generation (e.g., [20], [16]). As a result, the work presented in this paper stands out from other robotic systems for speech-gesture generation by incorporating reactive closed-loop feedback for a more flexible approach to multimodal robot expressiveness.

B. Virtual Agents

In contrast to the research area of robotics, the challenge of generating speech and co-verbal gesture has already been tackled rather extensively and in various ways within the domain of embodied virtual agents. In their seminal work, Cassell et al. [2] presented the REA system, in which a conversational humanoid agent operates as a real estate salesperson. Another system proposed by Cassell et al. [4] is the Behavior Expression Animation Toolkit (BEAT), which enables an animated virtual human to generate appropriate and synchronized non-verbal behaviors and synthesized speech from typed input text. The interactive expressive GRETA system [21] is another example of a real-time animated conversational agent. The female character is able to communicate with a human user by means of verbal and non-verbal behaviors such as gestures, head movements, gaze, and facial expressions. Despite their elaborate architectures, all these systems are limited with regard to the timing and synchronization of multimodal behaviors: to achieve cross-modal synchrony, gesture timing is heuristically adapted to the timing of generated speech, i.e., start and end times of phonemes are used to parametrize gesture time. In addition, both speech and gesture are executed ballistically, i.e., the systems do not provide any means for feedback-based adjustments once behavior execution has been initiated.

Complementing these approaches, Kopp and Wachsmuth [14] introduced the Articulated Communicator Engine (ACE) as the first multimodal behavior realizer for virtual agents that provides for mutual adaptation mechanisms between the timing of speech and gesture. Inspired by theories from human gesture research, ACE is based on the assumption that the production of continuous speech and gesture is organized in successive segments. Each of these segments corresponds to what Kopp and Wachsmuth [14] define as a multimodal chunk of speech-gesture production which, in turn, comprises a pair of an intonation phrase and a co-expressive gesture phrase. That is, complex utterances with several different gestures consist of multiple successive chunks. Given such a multi-chunk utterance, the ACE system coordinates the concurrent modalities at two different levels. First, for intra-chunk scheduling, temporal synchrony is mainly accomplished by adapting gesture to the timing of speech. For this, absolute phoneme timing information is obtained from the text-to-speech system. This process is similar to the synchronization method used in the systems discussed above. Once scheduled and ready to be executed, speech and gesture run ballistically within a chunk, i.e., their execution is unaffected by the progress of the respective other modality. Second, for inter-chunk scheduling between two successive chunks, both speech and gesture can anticipate the forthcoming chunk and adapt to it if required. For gesture, this adaptation, referred to as a co-articulation effect, may range from the insertion of an intermediate rest pose to a direct transition movement skipping the retraction phase or parts thereof. For speech, a silent pause may be inserted between two intonation phrases for the duration of the preparation phase of the following gesture. Such adaptive flexibility in the form of inter-chunk synchrony is achieved by a global scheduler which plans the following chunk in advance while monitoring the chunk currently in execution. Once the predecessor has been executed, the plan for the next chunk may be refined based on the current state of speech and gesture generation. In summary, the ACE system attempts to overcome some of the issues found in the previously described frameworks by providing a greater level of flexibility. However, even in ACE the realized cross-modal synchronization mechanisms are not entirely realistically modeled: despite providing for mutual adaptation at the inter-chunk level, the lack of adjustability within a chunk as well as the ballistic generation of complete gesture and intonation phrases conflict with findings from psychology (e.g., [12]).

The present work draws inspiration from the innovative quality of the ACE system, especially with regard to its incremental on-line scheduling of multimodal behavior at the inter-chunk level. Specifically, in our previous work we presented a system that employs the virtual agent framework ACE with its original scheduler as an underlying behavior realizer for multimodal robot behavior generation [22]. In the work described in this paper, we expand the model of the underlying scheduler by countering its conceptual shortcomings at the intra-chunk level in order to realize more finely synchronized speech and gesture for humanoid robots.

[Figure: for each multimodal chunk, the planning stage comprises phonological encoding, determination of speech timing information and generation of the speech output, as well as movement planning and prediction of the gesture preparation time tg based on the empirically validated forward model; the execution stage aligns affiliate and stroke via feedback and on-line adjustment, inserting a silent pause into speech or skipping/shortening the gesture retraction where required. ts = time for speech output before affiliate onset; tg = time for gesture preparation before stroke onset.]

Fig. 1. Model of the ‘planning-execution’ procedure of the proposed multimodal scheduler.

III. CLOSED-LOOP SCHEDULER

In view of the limitations of existing approaches discussed in the previous section and the physically constrained behavior generation capacities of a robot, an improved multimodal behavior scheduler needs to address the following two questions. First, with regard to the behavior planning phase, how can a more reliable prediction of robot-specific gesture execution time be obtained to allow for better multimodal scheduling? Second, with regard to the execution phase, how can planning and scheduling errors that would otherwise cause mistimed synchrony be compensated for during execution? Addressing the first issue, the proposed extended scheduler incorporates an empirically verified forward model that predicts an estimate of the gesture preparation time required by the robot prior to actual execution. Addressing the second issue, an on-line adjustment mechanism was integrated into the synchronization process for cross-modal adaptation within a chunk based on afferent sensory feedback. Generation and continuous synchronization of gesture and speech are flexibly conducted at run-time. Figure 1 illustrates the ‘planning-execution’ procedure of the proposed improved scheduler. Section III-A gives an overview of the complete multimodal generation process within a multimodal chunk; Section III-B provides details on the implementation of the predictive forward model, which has been tailored to the requirements of the Honda humanoid robot [10] used for the present work; finally, Section III-C describes the current realization of the incorporated feedback-based adaptation mechanism.

A. Concept Overview

In our behavior generation system, multimodal utterances are specified using the XML-based Multimodal Utterance Representation Markup Language (MURML [15]). It provides specific tags to label the speech affiliate using time identifiers; in addition, the posture of the gesture stroke can be specified in an overt form (see [22] for more details). An example MURML file is illustrated in Figure 2. Given such a file, the proposed scheduler operates as follows.

Phase 1: Planning:

• Speech preparation. The planning phase begins with the phonological encoding of the designated speech output. During this process, the text-to-speech synthesis system MARY [23], which is integrated into our system for speech output generation in multiple languages, establishes a complete list of phonemes and their respective durations. Based on this list, relevant speech timing information such as the onset time of the affiliate is determined. Speech output is subsequently generated in an audio file which is further processed depending on the position of the affiliate. If the affiliate is located at the beginning of the utterance, the audio file remains as it is. If the affiliate is located in the middle or at the end of the utterance, the sound file is split into two parts as follows: the first part contains the speech output to be uttered before the affiliate onset; the second part contains the speech affiliate and, if applicable, any subsequent remaining parts of speech (a sketch of this timing step is given after the list below).

<definition>
  <utterance>
    <specification>
      <time id="t1"/> This one <time id="t2"/> is my favorite picture.
    </specification>
    <behaviorspec>
      <gesture id="gesture_1" scope="hand">
        <affiliate onset="t1" end="t2"/>
        <constraints>
          <parallel>
            <static slot="HandShape" value="BSffinger"/>
            <static slot="ExtFingerOrientation" value="DirA"/>
            <static slot="PalmOrientation" value="DirL"/>
            <static slot="HandLocation" value="LocShoulder LocRight LocNorm"/>
          </parallel>
        </constraints>
      </gesture>
    </behaviorspec>
  </utterance>
</definition>

Fig. 2. Example of a MURML specification for multimodal utterances.

• Gesture preparation. Based on the specification of the designated location for the gesture stroke onset, a suitable trajectory is calculated, resulting in a detailed movement plan. Meanwhile, the predictive forward model further described in Section III-B computes the estimated execution time required by the robot for the gesture preparation phase before the stroke onset.
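
To make the speech-preparation step more concrete, the following minimal sketch (in Python, with hypothetical names; it is not part of the described system) derives the affiliate onset from a word-timing list such as the one obtained from the TTS front end and decides whether the audio needs to be split before the affiliate:

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TimedWord:
    text: str        # word as delivered by the TTS front end
    onset: float     # seconds from utterance start
    duration: float  # seconds

def affiliate_onset(words: List[TimedWord], affiliate_first_word: str) -> float:
    """Return the onset time of the first word of the affiliate."""
    for w in words:
        if w.text == affiliate_first_word:
            return w.onset
    raise ValueError("affiliate not found in utterance")

def split_audio_plan(words: List[TimedWord],
                     affiliate_first_word: str) -> Tuple[Optional[float], float]:
    """Return (cut_point, t_s): where to cut the audio file and the speech time
    t_s before the affiliate onset; cut_point is None when the affiliate starts
    the utterance, in which case the audio file remains unsplit."""
    t_s = affiliate_onset(words, affiliate_first_word)
    return (None if t_s == 0.0 else t_s), t_s

# Example: "This one is my favorite picture." with the affiliate "This one"
words = [TimedWord("This", 0.0, 0.25), TimedWord("one", 0.25, 0.20),
         TimedWord("is", 0.45, 0.10), TimedWord("my", 0.55, 0.15),
         TimedWord("favorite", 0.70, 0.40), TimedWord("picture", 1.10, 0.50)]
print(split_audio_plan(words, "This"))  # -> (None, 0.0): no pre-affiliate audio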

Following a comparison of the derived timing information for both modalities, start times for speech and gesture are determined so that temporal synchrony is expected to be achieved at the affiliate-stroke level. That is, if the duration of speech before the affiliate onset is longer than the estimated gesture preparation time before the stroke onset, then speech output starts first; otherwise, gesture execution starts first.
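
As an illustration of this scheduling rule (again a hedged sketch rather than the authors’ implementation), the onsets of the two modalities can be computed from ts and tg as follows:

def schedule_onsets(t_s: float, t_g: float) -> tuple:
    """Return (speech_start, gesture_start) relative to the chunk start, chosen
    so that the affiliate onset (t_s after speech start) and the stroke onset
    (t_g after gesture start) are expected to coincide."""
    if t_s >= t_g:
        return 0.0, t_s - t_g   # speech needs more lead time: speech starts first
    return t_g - t_s, 0.0       # gesture preparation takes longer: gesture starts first

print(schedule_onsets(t_s=0.9, t_g=0.6))  # (0.0, 0.3): speech first
print(schedule_onsets(t_s=0.3, t_g=0.8))  # (0.5, 0.0): gesture first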

Phase 2: Execution:

• Before the affiliate onset. Speech-gesture production is initiated as scheduled in the final step of the planning phase. If there is speech output scheduled to precede the affiliate, the first sound file containing this part of speech is replayed. While the robot performs the gesture preparation movement, the variance between the target and its actual wrist position is constantly monitored utilizing afferent feedback from the robot controller.

• Ensuring synchrony between affiliate and stroke onset. Once the robot’s wrist has reached a position within a predefined range of the target location for the stroke onset, playback of the sound file containing the speech affiliate is triggered. If the predicted time was accurate, there should be no noticeable interruption in the speech flow. If the gesture preparation takes longer than scheduled, the feedback-based adaptation mechanism described in Section III-C becomes effective such that a pause is inserted before the affiliate.

B. Predictive Forward Model

Neurophysiological findings suggest that human motor control relies more on sensory predictions than on sensory feedback: due to their inherent delay, such feedback loops are considered too slow for efficient trajectory control [6]. Drawing inspiration from this neurobiological perspective, the concept of internal forward models is of particular interest for the realization of the proposed scheduler for multimodal robot behaviors. Given the physical properties of the robot and its inability to move the arms with arbitrary speed, precise movement times of gesture execution are difficult to determine a priori. Importantly, though, to successfully synchronize the robot’s gesture with concurrent speech, the gesture preparation phase must be completed before the onset of the speech affiliate. Therefore, the execution time of this gesture phase in particular needs to be estimated as accurately as possible so that the behavior onset times of the two modalities can be adequately scheduled. Ideally, an optimized movement time prediction will thus lead to speech-gesture production in which the feedback mechanism of our closed-loop approach does not frequently need to intervene.

Trajectory Simulation Using the Robot’s WBM Controller: Our approach to a robot-specific forward model was realized using the whole body motion (WBM) software controlling the Honda humanoid robot [8]. The robot’s WBM controller provides the possibility of accelerated internal simulation of designated trajectories prior to the actual movement [7]. Although the identical WBM controller is used for both internal simulation and real robot control, the two processes are temporally decoupled. Since the simulation employs the same controller as used for subsequent generation, the WBM-based prediction model can account for the actual path of the simulated trajectory with regard to the robot’s physical properties, e.g., joint limits. To simulate the robot’s future state, the internal predictor iterates the robot model, albeit ten times faster than real-time control. For this speed-up, longer sampling time intervals and fewer degrees of freedom than applied during real robot control are employed for the iterations [25]. This way, the internal predictor model can simulate the desired trajectory much faster, requiring only negligible computation time. As a result, it offers a viable option for estimating the movement time required for the typically linear trajectory of the gesture preparation phase. The outcome of the predicted trajectory time tg is influenced by a number of parameters that are specific to the WBM software, such as speed response and maximal velocity. Therefore, the controller-based forward model had to be fitted in order to determine appropriate values for these parameters.
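
The following sketch only illustrates the general idea of such an accelerated internal simulation; it is an assumption-laden stand-in rather than the WBM interface. Whereas the real predictor iterates the WBM controller itself, here a simple velocity-limited approach to the stroke-onset wrist position is stepped with a coarse sampling interval, and speed_response and v_max merely mimic the role of the WBM-specific parameters mentioned above.

import numpy as np

def predict_preparation_time(start, target, speed_response=2.0, v_max=0.6,
                             dt=0.05, tol=0.01, t_max=5.0):
    """Estimate the gesture preparation time t_g by stepping a simplified
    wrist model toward the stroke-onset position (illustrative only)."""
    pos = np.asarray(start, dtype=float)
    goal = np.asarray(target, dtype=float)
    t = 0.0
    while np.linalg.norm(goal - pos) > tol and t < t_max:
        v = speed_response * (goal - pos)   # proportional approach
        speed = np.linalg.norm(v)
        if speed > v_max:                   # respect the velocity limit
            v *= v_max / speed
        pos += v * dt
        t += dt
    return t                                # predicted t_g in seconds

print(predict_preparation_time([0.20, -0.30, 0.60], [0.35, -0.45, 0.95]))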

For this fitting, a range of different values was empirically tested for each parameter on a set of experimental training data. This data set comprised 60 target trajectories of the right arm, each starting from the robot’s default arm position and reaching to different targets within a typical range of gesture space. The resulting set of targets specified a diverse variety of both short-distance and long-distance trajectories. Evaluation was based on comparing the predicted movement times with the actual trajectory times derived from the robot’s performance of each test trajectory, so that the combination of values for the WBM-specific parameters yielding the smallest mean error could be identified. For the given training data, the mean prediction error of this combination was 0.1458 seconds, with a minimum deviation of 0.0024 seconds for the best predicted trajectory time and a maximum deviation of 0.6180 seconds for the poorest trajectory time prediction.
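
Conceptually, this fitting step amounts to a grid search over candidate parameter values, keeping the combination with the smallest mean absolute error on the training trajectories. A hedged sketch follows; the parameter names and the predictor are the hypothetical ones from the sketch above, not the WBM interface:

from itertools import product
import numpy as np

def fit_predictor(predict, trajectories, speed_grid, vmax_grid):
    """Grid-search the predictor parameters.

    predict(start, target, speed_response, v_max) -> predicted time;
    trajectories is a list of (start, target, measured_time) triples."""
    best_params, best_err = None, float("inf")
    for k, v in product(speed_grid, vmax_grid):
        errors = [abs(predict(s, t, speed_response=k, v_max=v) - t_meas)
                  for s, t, t_meas in trajectories]
        mean_err = float(np.mean(errors))
        if mean_err < best_err:
            best_params, best_err = (k, v), mean_err
    return best_params, best_err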

Ideally, machine learning algorithms (e.g., based on neural networks) should be incorporated into the model to obtain potentially more accurate forward models in the future. For the time being, however, the realized WBM-based forward model provides a more advanced – and, with a mean error below 0.15 seconds, generally acceptable – approach to gesture time prediction and planning than most existing schedulers. In addition, despite this potential for improvement, a fully accurate prediction model is almost impossible to achieve, which emphasizes the remaining need for reactive feedback-based adaptation mechanisms during the gesture execution phase.
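
As one very simple illustration of such a learned forward model (not part of the presented system, and far simpler than the neural networks mentioned above), the preparation time could be regressed on the start-target distance using the measured trajectories:

import numpy as np

def fit_time_regression(distances, measured_times):
    """Least-squares fit t ≈ a * distance + b on the training trajectories."""
    X = np.column_stack([distances, np.ones(len(distances))])
    (a, b), *_ = np.linalg.lstsq(X, np.asarray(measured_times), rcond=None)
    return a, b

def predict_time(distance, a, b):
    return a * distance + b

# toy data standing in for the 60 measured (distance, time) pairs
dists = np.array([0.15, 0.30, 0.45, 0.60])
times = np.array([0.60, 0.85, 1.15, 1.40])
a, b = fit_time_regression(dists, times)
print(predict_time(0.40, a, b))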

C. Feedback-Based Adaptation Mechanism

Despite the improved accuracy of the timing estimation for the gesture preparation phase based on the implemented forward model, the actual timing of multimodal utterances might still deviate from the prediction during execution. Therefore, the second feature realized in our proposed scheduler provides a feedback-based adaptation mechanism that allows for reactive cross-modal adjustment during the execution phase.

Generally, two types of real-time deviation from the scheduled timing plan are possible: first, the forward model may overestimate the time required for gesture preparation, resulting in a premature gesture stroke onset; second, the predictive model may underestimate the time required for the gesture preparation phase, resulting in a delayed stroke onset. According to findings from human gesture research stating that the gesture stroke may precede but not follow the speech affiliate [17], the first type of deviation appears less problematic than the second. In fact, for the 60 tested training trajectories, the WBM-based predictive model yielded a maximal overestimation error of only 0.276 seconds; therefore, such cases may well be tolerated. In general, the implemented forward model was found to overestimate gesture preparation time by more than 0.1 seconds only in very rare instances (7 out of 60 cases), which further supports the decision to leave this type of prediction error unattended.1 However, given its inherent tendency to underestimate the time required by the robot, the present approach focuses on this more problematic type of prediction error with regard to gesture preparation time.

Implementation of Cross-Modal Adjustment: In his work, Kendon [12] demonstrated that human speech may pause before the affiliate onset if the gesture preparation movement has not yet been completed, suggesting that the inserted pause ensures temporal synchrony of the gesture stroke with speech. Similarly, experiments conducted by de Ruiter [5] showed that synchronization between the two modalities can indeed be achieved by the timing of running speech adapting to the timing of gesture. As mentioned in Section II, existing multimodal schedulers do not account for such speech adaptation to gesture within a multimodal chunk, since they operate fully ballistically at the intra-chunk level during the execution phase. When used on a robotic platform, this lack of adaptability is particularly problematic if gesture preparation time is significantly underestimated, in which case the delay of the gesture stroke onset may cause speech-gesture mistiming.

1 If, however, the model exhibited a pattern of frequent overestimation of gesture preparation time, a pre-stroke hold phase could be implemented.

To account for the human ability to align speech with gesture also within a chunk, our proposed scheduler replaces the commonly used open-loop execution mechanism with a more flexible, closed-loop approach. This is achieved by utilizing afferent feedback from the robot’s WBM controller for reactive on-line adjustment of the multimodal synchronization process. Specifically, as outlined in the description of the remodeled execution phase in Section III-A, cross-modal intra-chunk adaptation is achieved as follows. The sensory feedback received from the WBM controller provides information about the current wrist position of the robot’s arm at each time step, which enables the scheduler to constantly check for variance between the target and the actual position. This information is used as a ‘triggering’ mechanism, so that the speech affiliate is only uttered once the robot’s hand has reached a position within range of the stroke onset position.

If gesture preparation time is overestimated or correctly predicted, feedback about the successful completion of the preparation phase is transmitted to the speech processing module before, or at the very latest at, the scheduled affiliate onset time. As a result, speech output is uttered according to the previously established generation plan without any audible disruption. If, however, gesture preparation time is underestimated, the described cross-modal processing dependency results in a deviation of the actual multimodal production process from the scheduled behavior plan. That is, if the robot’s hand has not yet reached the stroke onset location by the predicted time, the reactive feedback mechanism intervenes in the speech production process: if the affiliate is located at the beginning of the utterance, it causes a delayed speech onset; if the affiliate is situated in the middle or at the end of speech, it results in a speech pause inserted right before the affiliate. To re-establish speech-gesture synchrony in the latter case, speech is paused until sensory feedback confirming the completion of the gesture preparation phase triggers its continuation.2 Such anticipation and reactive adaptation of the flow of speech to gesture timing complies with psycholinguistic models from the gesture literature promoting an interactive view of the relationship between speech and gesture (e.g., [18]).
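
A minimal sketch of this closed-loop trigger is given below. The interfaces are hypothetical stand-ins (get_wrist_position for the afferent WBM feedback, play for audio playback); the point is only that the affiliate audio is released at the later of the scheduled onset and the moment the preparation movement reaches the stroke-onset region, so that a late preparation automatically yields a delayed onset or an inserted pause.

import time
import numpy as np

def trigger_affiliate(get_wrist_position, play, affiliate_audio,
                      stroke_onset_pos, scheduled_onset, tol=0.03, dt=0.01):
    """Release the affiliate audio once the wrist is within 'tol' (m) of the
    stroke-onset position, but never before the scheduled onset time."""
    t0 = time.monotonic()
    target = np.asarray(stroke_onset_pos, dtype=float)
    # wait for the preparation movement; a late arrival means speech pauses here
    while np.linalg.norm(np.asarray(get_wrist_position()) - target) > tol:
        time.sleep(dt)
    # do not start earlier than scheduled (the stroke may precede the affiliate)
    remaining = scheduled_onset - (time.monotonic() - t0)
    if remaining > 0:
        time.sleep(remaining)
    play(affiliate_audio)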

IV. TECHNICAL RESULTS

The following examples were generated using the proposed multimodal scheduler, which incorporates both the experimentally fitted WBM controller-based predictive model and the reactive adaptation mechanism. Ideally, the movement time required for gesture preparation is estimated precisely enough during the planning phase so that the actual execution is consistent with the scheduled multimodal plan. In such an instance, the feedback-based adaptation mechanism would trigger the production of the speech affiliate just in time so that speech production is not noticeably interrupted. However, to demonstrate the cross-modal adjustment capability of our scheduler, we provide examples in which sensory feedback induced the insertion of pauses into the flow of speech. Such cases emerge when the gesture preparation time is considerably underestimated by the predictive model during planning. Two scenarios are demonstrated: first, a multimodal utterance in which the speech affiliate is located at the beginning of the chunk, and second, one in which it is situated in the middle of the utterance.

2 Alternatively, the speech rate could be modulated (e.g., ‘slowed down’) until gesture preparation has completed; however, since this is difficult to realize with a TTS engine that preproduces the audio files, our approach currently employs the more straightforward solution based on inserted pauses.

[Figure: velocity profile (m/sec) and z-coordinate (m) of the robot’s right wrist plotted over time (sec), aligned with the transcribed speech "This one is my favorite picture."; the predicted preparation time and the feedback-induced speech delay before the speech onset are marked.]

Fig. 3. Multimodal utterance generated with the proposed multimodal scheduler demonstrating a feedback-induced speech delay prior to speech onset (affiliate at the beginning of the chunk).

[Figure: velocity profile (m/sec) and z-coordinate (m) of the robot’s right wrist plotted over time (sec), aligned with the transcribed speech "The television is about this big and they have it on sale."; the predicted preparation time and the feedback-induced speech pause before the affiliate are marked.]

Fig. 4. Multimodal utterance generated with the proposed multimodal scheduler demonstrating a feedback-induced speech pause within a chunk (affiliate in the middle of the chunk).

Affiliate at the Beginning of a Chunk: Figure 3 depicts an example of a multimodal utterance generated with our proposed scheduler which illustrates the feedback-based adaptation mechanism taking effect at the beginning of the utterance. The upper graph visualizes the velocity profile of the robot’s right wrist during gesture execution; the lower graph plots the z-coordinate of the robot’s wrist trajectory over time, which represents the vertical rising of the arm and thus the most prominent dimension of this gesture. Speech output is transcribed in temporal alignment with the generated gesture trajectory, with the words of the affiliate highlighted in red. The figure illustrates the underestimation of the gesture preparation time by ∼0.4 seconds, resulting in a feedback-induced delay of the speech affiliate onset until the stroke onset position has been reached. This way, despite the initially mistimed gesture preparation onset, temporal synchrony between the speech affiliate and the gesture stroke is ensured.

Affiliate in the Middle of a Chunk: Figure 4 shows an example of a multimodal utterance which illustrates the feedback-based adaptation mechanism taking effect in the middle of the chunk. As described in Section III-A, the position of the affiliate yields the generation of two separate speech output files. Initially, speech and gesture commence according to their assigned onset times as scheduled in the planning phase. In this case, the speech onset precedes the beginning of the gesture preparation phase, i.e., playback of the first audio file containing the speech to be uttered before the affiliate onset is initiated first and is accompanied by gesture movements after ∼0.72 seconds. Again, gesture preparation time is underestimated, resulting in a feedback-induced speech pause immediately preceding the affiliate onset. Once the preparation phase of the gesture has been completed, the feedback mechanism triggers the continuation of speech so that the affiliate and the remaining speech output are replayed from the second audio file.

V. EVALUATION

To evaluate the optimized generation of synchronized co-verbal robot gesture with our new scheduler, we conducted an empirical study with 20 participants (10 male, 10 female), ranging in age from 19 to 36 years (M = 28.5, SD = 4.53). Each participant was invited to view and rate a set of video clips showing the Honda humanoid robot delivering various multimodal utterances (speech was generated in German for the study, as all participants were German native speakers). For each utterance, two different versions were presented in randomized order: one version was recorded using the original ACE scheduler, while the other one employed our novel extended scheduler. Participants were asked to identify the version in which the robot’s gesture was better synchronized with the speech output, as well as the version in which the synchronization seemed more natural to them. Evaluation was based on twelve video sequences covering pairs of six different multimodal utterances (including the two examples presented in Section 4); in three of these utterances the speech affiliate was located at the beginning of the chunk, while the affiliate in the other three utterances was situated in the middle of the chunk. Participants were allowed to watch the video sequences as many times as needed to come to a decision.

[Figure: bar chart of average ratings comparing the old and the new scheduler. "Better synchronized" – old scheduler: 20.85% (affiliate at beginning of chunk), 18.35% (affiliate in middle of chunk); new scheduler: 29.15% (beginning), 31.65% (middle). "More natural synchronization" – old scheduler: 17.55% (beginning), 24.55% (middle); new scheduler: 32.45% (beginning), 25.45% (middle).]

Fig. 5. Evaluation of synchronization qualities of old vs. new scheduler.

In summary, the majority of respondents (60.8%) identified the multimodal robot behavior generated with the new scheduler as better synchronized, with 57.9% further describing it as the more natural version compared to the output generated using the original ACE scheduler (see Figure 5). Although the differences were not statistically significant, the ratings nevertheless attest to a more positive average perception of the speech-gesture output resulting from our proposed scheduler. Notably, most participants reported having difficulties distinguishing and identifying the allegedly better synchronized version. This observation is in line with recent findings from human gesture research showing that our language perception system is fairly tolerant with regard to temporal synchrony in human-human communication ([9], [13]). That is, a variety of synchrony offsets can all lead to multimodal behavior that is perceived as natural. Thus, our findings not only validate our technical approach but also contribute to the ongoing debate about the role of speech-gesture synchrony by providing an HRI perspective for the systematic study of human perception of (a)synchrony.

VI. DISCUSSION AND CONCLUSION

We presented a novel multimodal scheduler that overcomes the conceptual shortcomings of existing behavior schedulers. The proposed model comprises two features that improve the synchronization process on the given robotic platform.

First, the scheduler incorporates a biologically inspired forward model that predicts a more accurate estimate of the gesture preparation time required by the robot prior to actual execution. Our technical approach was implemented and empirically fitted using a set of experimental test data comprising 60 target trajectories. Future work should extend the forward model to incorporate machine learning algorithms so that a more accurate and, from a neurobiological perspective, more intuitive prediction model may be obtained. The second feature of the extended scheduler incorporates an on-line adjustment mechanism into the synchronization process for cross-modal adaptation within a chunk based on sensory feedback from the robot. This way, the extended scheduler not only provides cross-modal alignment between two successive chunks, but also within a single chunk. In the near future, other strategies to modulate speech (e.g., adjusting the speech rate rather than inserting pauses) should be explored.

The functionality of the reactive adaptation mechanism provided by our scheduler at the intra-chunk level was illustrated by means of two examples. They show how synchrony between robot gesture and speech is ensured during execution in spite of prediction errors made earlier during behavior planning and scheduling. This increased generation flexibility is a novelty among robotic platforms that aim at synchronized speech-gesture production. In this way, the extended scheduler represents the first closed-loop approach to co-verbal gesture generation for humanoid robots and, as a major contribution of the present work, enables the robot to plan, generate and continuously synchronize gesture and speech at run-time. Thus, it advances the state of the art by providing a more flexible and natural way to realize multimodal behavior for sociable robots and other artificial communicators.

In addition to the technical implementation, the realized scheduler was evaluated in a video-based user study. Our results suggest that human perception of speech-gesture synchrony – both in humans and in robots – is not a straightforward process and is thus difficult to predict. Hence, to model acceptable and natural synchronization of speech and gesture for HRI, a systematic investigation of the role and effects of (a)synchrony is required. Our generation system for synchronized robot gesture and speech provides a useful tool for this purpose.

ACKNOWLEDGMENT

The work described was supported by the Honda Research Institute Europe.

REFERENCES

[1] P. Bremner, A. Pipe, C. Melhuish, M. Fraser, and S. Subramanian. Conversational Gestures in Human-Robot Interaction. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pages 1645–1649, 2009.

[2] J. Cassell, T. Bickmore, L. Campbell, H. Vilhjalmsson, and H. Yan. Human Conversation as a System Framework: Designing Embodied Conversational Agents. In Embodied Conversational Agents, pages 29–63. MIT Press, Cambridge, MA, 2000.

[3] J. Cassell and M. Stone. Living Hand to Mouth: Psychological Theories about Speech and Gesture in Interactive Dialogue Systems. In Proceedings of the AAAI 1999 Fall Symposium on Psychological Models of Communication in Collaborative Systems, pages 34–42, North Falmouth, MA, 1999.

[4] J. Cassell, H. Vilhjalmsson, and T. Bickmore. BEAT: the Behavior Expression Animation Toolkit. In Proceedings of ACM SIGGRAPH 2001, 2001.

[5] J. P. de Ruiter. Gesture and Speech Production. PhD Thesis. MPI Series in Psycholinguistics, University of Nijmegen, 1998.

[6] M. Desmurget and S. Grafton. Forward modeling allows feedback control for fast reaching movements. Trends in Cognitive Sciences, 4(11):423–431, 2000.

[7] M. Gienger, B. Bolder, M. Dunn, H. Sugiura, H. Janssen, and C. Goerick. Predictive Behavior Generation – A Sensor-Based Walking and Reaching Architecture for Humanoid Robots. In K. Berns and T. Luksch, editors, Autonome Mobile Systeme 2007, Informatik Aktuell, pages 275–281. Springer, Berlin, Heidelberg, 2007.

[8] M. Gienger, H. Janßen, and C. Goerick. Task-Oriented Whole Body Motion for Humanoid Robots. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots, Tsukuba, Japan, 2005.

[9] B. Habets, S. Kita, Z. Shao, A. Ozyurek, and P. Hagoort. The Role of Synchrony and Ambiguity in Speech-Gesture Integration During Comprehension. Journal of Cognitive Neuroscience, 23(8):1845–1854, 2011.

[10] Honda Motor Co. Ltd. The Honda Humanoid Robot Asimo, year 2000 model, 2000. http://asimo.honda.com/downloads/pdf/honda-asimo-robot-fact-sheet.pdf – accessed May 2012.

[11] A. B. Hostetter. When Do Gestures Communicate? A Meta-Analysis. Psychological Bulletin, 137(2):297–315, 2011.

[12] A. Kendon. Gesture: Visible Action as Utterance. Cambridge University Press, Cambridge, 2004.

[13] C. Kirchhof and J. P. De Ruiter. On the audiovisual integration of speech and gesture. In Book of Abstracts of the 5th Conference of the International Society for Gesture Studies, page 62, Lund, Sweden, 2012.

[14] S. Kopp and I. Wachsmuth. Synthesizing Multimodal Utterances for Conversational Agents. Computer Animation and Virtual Worlds, 15(1):39–52, 2004.

[15] A. Kranstedt, S. Kopp, and I. Wachsmuth. MURML: A Multimodal Utterance Representation Markup Language for Conversational Agents. In Proceedings of the AAMAS 2002 Workshop on Embodied Conversational Agents – Let’s Specify and Evaluate Them, Bologna, Italy, 2002.

[16] Q. A. Le, S. Hanoune, and C. Pelachaud. Design and implementation of an expressive gesture model for a humanoid robot. In Proceedings of the 11th IEEE-RAS International Conference on Humanoid Robots, pages 134–140, Bled, Slovenia, 2011.

[17] D. McNeill. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago, 1992.

[18] D. McNeill. Gesture and Thought. University of Chicago Press, Chicago, 2005.

[19] R. Mead, E. Wade, P. Johnson, A. B. St. Clair, S. Chen, and M. J. Mataric. An Architecture for Rehabilitation Task Practice in Socially Assistive Human-Robot Interaction. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication.

[20] V. Ng-Thow-Hing, P. Luo, and S. Okita. Synchronized Gesture and Speech Production for Humanoid Robots. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4617–4624, 2010.

[21] R. Niewiadomski, E. Bevacqua, M. Mancini, and C. Pelachaud. Greta: An Interactive Expressive ECA System. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems, pages 1399–1400, 2009.

[22] M. Salem, S. Kopp, I. Wachsmuth, K. Rohlfing, and F. Joublin. Generation and Evaluation of Communicative Robot Gesture. International Journal of Social Robotics, Special Issue on Expectations, Intentions, and Actions, 4(2):201–217, 2012.

[23] M. Schröder and J. Trouvain. The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching. International Journal of Speech Technology, pages 365–377, 2003.

[24] C. Sidner, C. Lee, and N. Lesh. The Role of Dialog in Human Robot Interaction. In International Workshop on Language Understanding and Agents for Real World Interaction, 2003.

[25] H. Sugiura, H. Janßen, and C. Goerick. Instant Prediction for Reactive Motions with Planning. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5475–5480, Piscataway, NJ, USA, 2009. IEEE Press.