
What Iconic Gesture Fragments Reveal about Gesture–Speech Integration: When Synchrony Is Lost, Memory Can Help

Christian Obermeier¹, Henning Holle², and Thomas C. Gunter¹

Abstract

The present series of experiments explores several issues related to gesture–speech integration and synchrony during sentence processing. To be able to more precisely manipulate gesture–speech synchrony, we used gesture fragments instead of complete gestures, thereby avoiding the usual long temporal overlap of gestures with their coexpressive speech. In a pretest, the minimal duration of an iconic gesture fragment needed to disambiguate a homonym (i.e., disambiguation point) was therefore identified. In three subsequent ERP experiments, we then investigated whether the gesture information available at the disambiguation point has immediate as well as delayed consequences on the processing of a temporarily ambiguous spoken sentence, and whether these gesture–speech integration processes are susceptible to temporal synchrony. Experiment 1, which used asynchronous stimuli as well as an explicit task, showed clear N400 effects at the homonym as well as at the target word presented further downstream, suggesting that asynchrony does not prevent integration under explicit task conditions. No such effects were found when asynchronous stimuli were presented using a more shallow task (Experiment 2). Finally, when gesture fragment and homonym were synchronous, similar results as in Experiment 1 were found, even under shallow task conditions (Experiment 3). We conclude that when iconic gesture fragments and speech are in synchrony, their interaction is more or less automatic. When they are not, more controlled, active memory processes are necessary to be able to combine the gesture fragment and speech context in such a way that the homonym is disambiguated correctly.

INTRODUCTION

In everyday face-to-face conversation, speakers not only use speech to transfer information but also rely on facial expressions, body posture, and gestures. In particular, co-speech gestures play an important role in daily communication. This category of spontaneous hand movements consists of different subtypes such as beats, emblems, and deictic, metaphoric, and iconic gestures. Iconic gestures are distinguished by their “close formal relationship to the semantic content of speech” (McNeill, 1992, p. 12) and are the most thoroughly studied gesture type. For instance, a speaker might perform a typing movement with his fingers while saying: “Yesterday I wrote the letter.” In this example, both gesture and speech refer to the same underlying idea, namely, writing something. A steadily increasing number of studies have shown that such iconic gestures are not only closely linked to the content of the accompanying speech but that they also have an effect on speech comprehension (ERP studies: Holle & Gunter, 2007; Kelly, Ward, Creigh, & Bartolotti, 2007; Özyürek, Willems, Kita, & Hagoort, 2007; Wu & Coulson, 2005; Kelly, Kravitz, & Hopkins, 2004; fMRI studies: Holle, Obleser, Rueschemeyer, & Gunter, 2010; Green et al., 2009; Holle, Gunter, Rueschemeyer, Hennenlotter, & Iacoboni, 2008; Willems, Özyürek, & Hagoort, 2007). Although such experiments suggest that a listener can extract additional information from iconic gestures (e.g., in the example above, we know that the letter was written on a keyboard and not with a pen) and use that information linguistically, little is known so far about the timing issues involved in this process and how the interplay between gesture and speech actually functions. Before discussing how the present study investigated speech–gesture timing, we will review how timing has been discussed in the gesture literature so far.

Based on physical properties, a gesture phrase can usually be divided into three consecutive time phases: preparation, stroke, and retraction (McNeill, 1992). The preparation phase is the period from the beginning of the gesture movement up to the stroke. Typically, the hands start to rise from a resting position and travel to the gesture space where the stroke is to be performed. Very often, some “phonological” features of the stroke are already present during the preparation phase. This is why it has been claimed that a key characteristic of the preparation phase is anticipation (McNeill, 1992).

The most salient part of a gesture is the stroke phase. The peak effort of the hand movement marks the onset of the stroke phase. This phase lasts until the hands start to return to the resting position. The form of the stroke does not anticipate some upcoming event (like the preparation does), but “the effort is focused on the form of the movement itself—its trajectory, shape, posture” (McNeill, 1992, p. 376). Most researchers consider the stroke as the essential part of a gesture phrase (but see Kita, van Gijn, & van der Hulst, 1998), whereas the preparation and the retraction can be omitted. The retraction phase begins where the hands start to return to the resting position and ends when they have reached this position. This final gesture phase mirrors the anticipation phase in that the effort is focused on reaching an end position.1

¹Max-Planck-Institute for Human Cognitive and Brain Sciences, Leipzig, Germany; ²University of Sussex, Brighton, UK

© 2011 Massachusetts Institute of Technology. Journal of Cognitive Neuroscience 23:7, pp. 1648–1663

Although these time phases are defined on a formal level, previous work suggests that meaning is conveyed to a different degree throughout the three gesture phases. In particular, the stroke phase has been described as the segment of an iconic gesture in which the meaning is expressed (McNeill, 1992). Speakers have a high tendency to produce the onset of the stroke in combination with that part of speech which the gesture describes (cf. the lexical affiliate; Levelt, Richardson, & la Heij, 1985). This synchronization between gesture and speech at the stroke is probably one important cause for the observation that the stroke phase is the most meaningful segment of an iconic gesture.

In essence, it seems that gesture timing is discussed in rather broad terms of three consecutive phases. The present series of experiments set out to detail the timing issues related to gesture–speech integration during sentence processing. To do so, we were faced with an interesting challenge, namely, the large overlap in time between gesture and speech. It is clear from the foregoing paragraphs that before the most meaningful part of a gesture (i.e., the stroke) is present, some preparatory motor activity is going on, whereas after the stroke, the gesture tends to linger on during the retraction phase. Such large overlaps with speech will make a precise measurement of speech–gesture integration (including synchrony issues) very difficult in a sentence context. As an illustration, the timing parameters of the stimulus material as used by Holle and Gunter (2007) are given in Figure 1A. As can be seen clearly, the full-length gesture completely overlapped with the first part of the second sentence, which makes it impossible to manipulate gesture–speech synchrony using full gestures without simultaneously changing the amount of gesture information that is available at a given point in time. To avoid such an undesirable confound between synchrony and gesture information, we decided to find a way to reduce the length of gesture information without changing its impact on speech substantially. Determining such gesture fragments would then enable us to investigate timing issues with much greater precision.

Figure 1. Timing of gesture and speech. (A) The temporal alignment of the original full-length gesture material used by Holle and Gunter (2007). As can be seen, the full-length gestures overlap with most of the main clause, including verb, homonym, and target word. Thus, a precise investigation of the temporal aspects of gesture–speech integration, as well as a distinction between integration and disambiguation, is virtually impossible with full-length gestures. (B) The timing of the gesture fragments with respect to their accompanying speech, as used in the present study. In Experiments 1 and 2, gesture fragment (disambiguation point [DP]) and speech (identification point [IP]) were asynchronous, that is, on average, the gesture did not temporally overlap with the homonym. In contrast, gesture fragments and speech were synchronous in Experiment 3.

One possibility to find gesture fragments of minimal length that still have a communicative impact is the use of a gating paradigm. Although usually implemented in studies on spoken word recognition, gating can be used with virtually any kind of sequential linguistic stimuli. In gating, stimuli are presented in sequences of increasing duration. After each segment, participants are asked to classify the segment on a given parameter. For example, participants can judge the sequence with respect to prosodic (Grosjean, 1983) or syntactic parameters (Friederici, Gunter, Hahne, & Mauth, 2004) or try to identify the complete stimulus based on the perceived segment (Grosjean, 1996). An important difference between these examples is the number of possible response alternatives, ranging from very few in the case of word category identification to thousands of possibilities in the case of spoken word recognition. With regard to iconic gestures, it is known that participants have difficulties identifying the correct corresponding speech unit when these gestures are presented in isolation (Hadar & Pinchas-Zamir, 2004; Krauss, Morrel-Samuels, & Colasante, 1991). Thus, iconic gestures more clearly convey their meaning when viewed in combination with a semantic context. Therefore, in contrast to, for instance, words or signs, which are lexicalized and can be identified without any additional contextual information, the only appropriate way to identify when an iconic gesture becomes meaningful is to investigate it in the presence of a semantic context.

Homonyms (words with multiple unrelated meanings, e.g., ball) may be especially suited to provide such a context. In the case of an unbalanced homonym such as ball, there are only two response alternatives to choose from: a more frequent dominant meaning (e.g., game) and a less frequent subordinate meaning (e.g., dance). Previously conducted experiments indicate that a complete iconic gesture can disambiguate a homonym (Holle & Gunter, 2007). In these ERP experiments, participants watched videos of an actress producing German sentences that contained an unbalanced homonym in the initial clause.2 The homonym was disambiguated at a target word in the following subclause. Coincident with the initial part of the sentence, the actress made a gesture that supported either the dominant or the subordinate meaning. Holle and Gunter (2007) found that the amplitude of the N400³ at the target words varied reliably as a function of the congruency of the preceding gesture–homonym combination. The N400 was found to be smaller when the targets were preceded by a congruent gesture–homonym combination and larger when preceded by a gesture–homonym combination incongruent with their meaning.

As mentioned above, the present series of experiments tried to detail the role of timing/synchrony in gesture–speech integration during sentence processing. To do so, we first use a gating procedure as a pretest in order to solve the challenge of the large overlap in time between gesture and speech normally seen for complete gestures. In this context-guided gating paradigm, we determine a so-called disambiguation point (DP) of the original gestures of Holle and Gunter (2007). In the next step, these DPs are used to shorten the original gestures. These gesture fragments are then used in three ERP experiments which investigate whether the gesture information available at the DP has immediate as well as delayed consequences on the processing of a temporarily ambiguous spoken sentence, and whether these gesture–speech integration processes are susceptible to temporal synchrony (see Figure 1B). Because timing is a critical issue here, we use ERPs as the dependent measure as they have excellent temporal resolution and give the opportunity to measure gesture–speech interaction under different task conditions. Additionally, by using ERPs, we are able to investigate the disambiguating effect of gesture at different points in time. We conclude that when gesture and speech are in synchrony, their interaction is more or less automatic. In contrast, when both domains are not in synchrony, active gesture-related memory processes are necessary to be able to combine the gesture fragment and speech context in such a way that the homonym is successfully disambiguated.

EXPERIMENT 1: DISAMBIGUATION USING GESTURE FRAGMENTS

Experiment 1 serves as a basis for the other two experiments. It tests whether gesture fragments presented up to their DP are able to disambiguate speech. The DPs were assessed in a pretest using a context-guided gating done on the original material of Holle and Gunter (2007). The big advantage of using gesture fragments made out of this particular material is that we can measure the brain activity of our participants with greater precision at two positions in time. At the homonym position, speech–gesture integration can be directly measured, whereas a few words downstream, the target word position gives us direct on-line evidence whether this integration indeed led to a successful disambiguation. As can be seen in Figure 1B, the gesture fragment was presented earlier than the homonym in Experiment 1 (i.e., there was a clear gesture–speech asynchrony).

Methods

The Original Stimulus Material

The original stimulus material of Holle and Gunter (2007) made use of 48 homonyms having a clear dominant and subordinate meaning (for a description of how dominant and subordinate meaning were determined, see Gunter, Wagner, & Friederici, 2003). For each of these homonyms, two 2-sentence utterances were constructed including either a dominant or the subordinate target word. The utterances consisted of a short introductory sentence introducing a person, followed by a longer complex sentence describing an action of that person. The complex sentence was composed of a main clause containing the homonym and a successive subclause containing the target word. Prior to the target word, the sentences for the dominant and subordinate versions were identical. A professional actress was videotaped while uttering the sentences. She was asked to simultaneously perform a gesture that supported the sentence. To minimize the influence of facial expression, the face of the actress was masked with a nylon stocking. All gestures directly related to the target word or resembling emblems were excluded. To improve the sound quality, the speech of the sentences was re-recorded in a separate session. Cross-splicing was applied to minimize the possibility that participants might use prosodic cues for lexical ambiguity resolution. Afterwards, the re-recorded speech was recombined with the original video material in such a way that the temporal alignment of a gesture and the corresponding homonym was identical to the alignment in the original video recordings (for more details about the recording scenario and preparation of the original stimuli, please see Holle & Gunter, 2007).

Rating of the Gesture Phases

In order to get a more detailed understanding of the stimulus material and as a preparation for the present set of experiments, the onset of the gesture preparation, as well as the on- and offset of the gesture stroke of the original gesture material, was independently assessed by two raters. The phases of the gestures were determined according to their kinetic features as described in the guidelines on gesture transcription by McNeill (1992, p. 375f). To avoid a confounding influence of speech, the gesture videos were presented without sound for the rating procedure, as has been suggested in the Neuropsychological Gesture Coding System (Lausberg & Slöetjes, 2008; see also Lausberg & Kita, 2003). First, the onset and the offset of the complete hand movement were determined; then the on- and offset of the stroke were identified based on the change of effort in the movement, that is, changes in movement trajectory, shape, posture, and movement dynamics (for details, see McNeill, 1992). The phase prior to the stroke onset was defined as the preparation phase, the phase after the stroke offset as the retraction. The movements did not include any holds. Both raters agreed highly on the classification of the different gesture phases [e.g., interrater reliability (time of stroke onset) > .90]. In cases where there was dissent about the exact point in time of preparation onset, stroke onset, or stroke offset, the raters afterward discussed the results and chose the point in time they both felt appropriate. The values for the on- and offsets did not differ significantly across gesture conditions [all F(1, 94) < 1; see Table 1].
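The agreement between the two raters is reported only as an interrater reliability above .90 for the stroke onset, without naming the exact index. The following is a minimal sketch, assuming the reliability is indexed as a simple Pearson correlation between the two raters' stroke-onset times; the rater data shown are hypothetical.

```python
# Minimal sketch (not the authors' code): one plausible way to index
# interrater reliability for stroke-onset times, assuming it is computed
# as a Pearson correlation between the two raters' annotations.
import numpy as np

def interrater_reliability(onsets_rater1, onsets_rater2):
    """Pearson correlation between two raters' stroke-onset times (seconds)."""
    return np.corrcoef(onsets_rater1, onsets_rater2)[0, 1]

# Hypothetical example: stroke onsets (in seconds) for a handful of gestures.
rater1 = np.array([2.04, 2.16, 2.32, 1.98, 2.40])
rater2 = np.array([2.08, 2.12, 2.36, 2.00, 2.44])
print(f"interrater reliability (stroke onset): r = {interrater_reliability(rater1, rater2):.2f}")
```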

Pretest: Gating

To determine the point in time at which a gesture can reliably disambiguate a homonym, a context-guided gating was applied. Gating is a very popular paradigm in spoken word recognition (Grosjean, 1996). Its rationale is based on the assumption that spoken word recognition is a discriminative process, that is, with increasing auditory information, the number of potential candidate words is reduced until only the correct word remains (e.g., cohort model; see Gaskell & Marslen-Wilson, 1997; Marslen-Wilson, 1987). In this context, the identification point (IP) is defined as the amount of information needed to identify a word without change in response thereafter. Although gating is most common in spoken word recognition, it can be used with virtually any kind of sequential material [e.g., ASL (Emmorey & Corina, 1990); music sequences (Jansen & Povel, 2004)]. Because iconic gestures convey their meaning more clearly when produced with co-occurring speech, we employed a context-guided gating task to identify the isolation points of the gestures. Homonyms were used as context. This procedure has the advantage that the number of possible gesture interpretations is restricted to two, namely, the dominant meaning and the subordinate meaning of the homonym.

Table 1. Stimulus Properties

Condition (Gesture Speech) | Gesture Preparation Onset | Disambiguation Point (DP) | Gesture Stroke Onset | Gesture Stroke Offset | Homonym Onset | Homonym Identification Point (IP) | Target Word Onset | Target Word Offset
D D  | 1.72 (0.41) | 2.10 (0.44) | 2.07 (0.46) | 2.91 (0.48) | 2.84 (0.40) | 3.09 (0.41) | 3.78 (0.38) | 4.16 (0.38)
D S  | 1.72 (0.41) | 2.10 (0.44) | 2.07 (0.46) | 2.91 (0.48) | 2.84 (0.40) | 3.09 (0.41) | 3.80 (0.38) | 4.17 (0.38)
S D  | 1.68 (0.50) | 2.10 (0.51) | 2.17 (0.52) | 3.01 (0.51) | 2.84 (0.40) | 3.09 (0.41) | 3.78 (0.38) | 4.16 (0.38)
S S  | 1.68 (0.50) | 2.10 (0.51) | 2.17 (0.53) | 3.01 (0.51) | 2.84 (0.40) | 3.09 (0.41) | 3.80 (0.38) | 4.17 (0.38)
Mean | 1.70 (0.45) | 2.10 (0.47) | 2.12 (0.49) | 2.96 (0.50) | 2.84 (0.40) | 3.09 (0.41) | 3.79 (0.38) | 4.17 (0.38)

Mean onset and offset values are in seconds relative to the onset of the introductory sentence (SD in parentheses). The first four value columns describe the gesture; the remaining four describe the speech.


Forty native German-speaking participants took part in the gating pretest. A gating trial started with the visual presentation of the homonym for 500 msec (e.g., ball), followed by the gated gesture video. At 500 msec after the video offset, the participants had to determine whether the homonym referred to the dominant or the subordinate meaning based on the gesture information. Three response alternatives were possible and simultaneously presented on the screen: (1) dominant meaning (e.g., the word game was displayed on the screen), (2) subordinate meaning (e.g., dance), and (3) “next frame.” Participants were instructed to choose the third response alternative until they felt they had some indication of which meaning was targeted by the gesture. The increment size was one video frame, which corresponded to 40 msec, that is, each gate was 40 msec longer than the previous one. Gating started at the onset of the preparation phase and ended either when the offset of the stroke phase was reached or when the subject gave a correct response for 10 consecutive segments. Because very short video sequences are difficult to display and recognize, each segment also contained the 500 msec directly before the onset of the preparation. Thus, the shortest segment of each gesture had a length of 540 msec (500 + 40 msec for the first frame of the preparation phase). The gesture items were pseudorandomly distributed across two experimental lists. Each of the lists contained 24 of the original dominant and 24 of the original subordinate gestures, resulting in a total of 48 gestures per experimental list. For each homonym, either the dominant or the subordinate gesture was presented within one list.

The relevant information was the DP, which corresponds to the amount of gesture information needed to identify a gesture as either being related to the dominant or the subordinate meaning of a homonym without any changes in response thereafter. The mean DPs for the single items ranged from 2.22 to 19.63 frames (M = 9.88, SD = 3.6), calculated relative to preparation onset. Thus, on average, the participants needed to see about 400 msec of gesture to disambiguate a homonym. An ANOVA with the factors Word meaning frequency (2) and List (2) revealed that dominant gestures (M = 9.33, SD = 3.6) were identified earlier than subordinate gestures (M = 10.42, SD = 3.58), as indicated by the significant main effect of Word meaning frequency [F(1, 94) = 4.2, p < .05]. This result indicates that more gesture information is needed to select the subordinate meaning.
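The DP logic described above can be made concrete with a small sketch. This is not the authors' analysis code; it simply assumes that each gating trial is stored as an ordered list of per-gate responses and returns the earliest gate from which the correct response no longer changes.

```python
# Minimal sketch (not the authors' analysis code) of the DP logic:
# the DP is the earliest gate from which the response is correct and
# never changes at any later gate. Gates are 40-msec video frames.
FRAME_MS = 40

def disambiguation_point(gate_responses, correct_meaning):
    """gate_responses: per-gate choices, e.g. ['next', 'next', 'dominant', ...].
    Returns the DP as a 1-based frame index, or None if never stable."""
    for gate in range(len(gate_responses)):
        if all(r == correct_meaning for r in gate_responses[gate:]):
            return gate + 1          # 1-based frame count
    return None

# Hypothetical trial: the participant switches to the dominant meaning at gate 9.
responses = ['next'] * 8 + ['dominant'] * 6
dp_frames = disambiguation_point(responses, 'dominant')
print(dp_frames, "frames =", dp_frames * FRAME_MS, "msec of gesture relative to preparation onset")
```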

When investigating the distribution of the DPs relative to the stroke onset, we found a surprising result. DPs ranged from almost 20 frames before the stroke onset to 9 frames past the stroke onset, with the DPs of 60 gestures being prior to the stroke onset. This means that almost two thirds of all gestures enabled a meaning selection before the participants had actually seen the stroke (see Supplementary Figure 1). The difference between DP and stroke onset was found to be significantly smaller than zero across participants [t1(1, 39) = −4.7, p < .001] and items [t2(1, 95) = −2.3, p < .05]. The corresponding minF′ statistic (Clark, 1973) was significant [minF′(1, 128) = 4.26, p < .05], indicating that gestures reliably enabled a meaning selection before stroke onset.
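For readers unfamiliar with the combined statistic, the following sketch shows how minF′ and its denominator degrees of freedom follow from the by-participants and by-items tests (Clark, 1973), using t² = F for the one-df contrasts reported above.

```python
# Sketch: minF' (Clark, 1973) combining the by-participants (F1) and by-items (F2) tests.
def min_f_prime(F1, df1_err, F2, df2_err):
    minf = (F1 * F2) / (F1 + F2)
    # Denominator degrees of freedom of minF'
    df = (F1 + F2) ** 2 / (F1 ** 2 / df2_err + F2 ** 2 / df1_err)
    return minf, df

F1 = (-4.7) ** 2   # t1(39) = -4.7  ->  F1(1, 39) = 22.09
F2 = (-2.3) ** 2   # t2(95) = -2.3  ->  F2(1, 95) = 5.29
minf, df = min_f_prime(F1, 39, F2, 95)
print(f"minF'(1, {df:.0f}) = {minf:.2f}")   # ~ minF'(1, 128) = 4.27, matching the reported value
```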

Stimuli: Gesture Fragments

The stimuli for Experiment 1 were constructed as follows. First, the original gesture and speech streams were separated. Full-length gesture streams were then replaced with gesture streams cut at the DP. The duration of the gesture streams was adjusted to the duration of the speech streams by adding a recording of the corresponding empty video background. This manipulation created the illusion of a speaker disappearing from the screen while the speech was still continuing for a short amount of time. Speech streams were recombined with both the clipped dominant as well as the clipped subordinate gesture streams, resulting in a 2 × 2 design, with Gesture (Dominant and Subordinate) and Speech (Dominant and Subordinate) as within-subject factors (see Table 2; see also the supplementary on-line materials for examples of the videos). Each of the four conditions (DD, DS, SD, and SS) contained 48 items, resulting in an experimental set of 192 items. The items were pseudorandomly distributed across four blocks of 48 items, ensuring that (i) each block contained 12 items of each of the four conditions and (ii) each block contained only one of the four possible gesture–speech combinations for each homonym (one possible assignment scheme is sketched below).

In order to investigate the on-line integration of the gesture fragments with the homonym with as much temporal precision as possible, we also determined the earliest point in time at which the homonym is identified. In a gating paradigm, spoken word fragments of increasing duration (increment size: 20 msec) were presented to 20 participants who did not participate in any of the experiments reported here. The IP was determined as the gate at which the participants started to give the correct response without any change in response thereafter. On average, participants were able to identify the homonyms after 260 msec. These homonym IPs were used as triggers for the ERPs that dealt with the direct integration of the gesture fragments with the homonyms (see also Figure 1B).
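As a concrete illustration of the block constraints mentioned above, the following sketch shows one Latin-square-style assignment that satisfies them. It is an illustration only, not the randomization procedure actually used, which is not described in further detail.

```python
# Illustrative sketch (not the authors' code): a Latin-square-style assignment that
# satisfies the two block constraints described above:
#   (i) each block contains 12 items of each of the four conditions (DD, DS, SD, SS);
#  (ii) each block contains only one gesture-speech combination per homonym.
import random

CONDITIONS = ["DD", "DS", "SD", "SS"]
N_HOMONYMS = 48
N_BLOCKS = 4

def assign_blocks(seed=0):
    rng = random.Random(seed)
    blocks = {b: [] for b in range(N_BLOCKS)}
    for homonym in range(N_HOMONYMS):
        # Rotate the condition order per homonym so that, across the 48 homonyms,
        # every block ends up with 12 items of each condition.
        offset = homonym % N_BLOCKS
        for c, condition in enumerate(CONDITIONS):
            blocks[(c + offset) % N_BLOCKS].append((homonym, condition))
    # Shuffle presentation order within each block (full pseudorandomization would
    # add further constraints not specified in the text).
    for b in blocks:
        rng.shuffle(blocks[b])
    return blocks

blocks = assign_blocks()
# Each block holds 48 items: one combination per homonym, 12 items per condition.
print({b: len(items) for b, items in blocks.items()})
```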

Participants

Thirty-nine native German-speaking participants were paid for their participation and signed a written informed consent. Seven of them were excluded because of excessive artifacts. The remaining 32 participants (16 women; 20–28 years, mean = 23.8 years) were right-handed (mean laterality coefficient = 94.3; Oldfield, 1971), had normal or corrected-to-normal vision, had no known hearing deficits, and had not taken part in the pretest of the stimulus material.

Procedure

Participants were seated in a dimly lit, sound-proof booth facing a computer screen. They were instructed to attend both to the movements in the video as well as to the accompanying speech. After each item, participants judged whether gesture and speech were compatible. Note that in order to perform this task, participants had to compare the meaning indicated by the homonym–gesture combination with the meaning expressed by the target word (see Table 2). A trial started with a fixation cross, which was presented for 2000 msec, followed by the video presentation. The videos were centered on a black background and extended 10° of visual angle horizontally and 8° vertically. Subsequently, a question mark prompted the participants to respond within 2000 msec, after which feedback was given for 1000 msec.

The experiment was divided into four blocks of approximately 9 min each. For all blocks, the presentation order of the items was varied in a pseudorandomized fashion. Block order and key assignment were counterbalanced across participants, resulting in a total of eight different experimental lists with 192 items each. One of the eight lists was randomly assigned to each participant. Thus, each experimental list was presented to four participants. An experimental session lasted approximately 45 min.

ERP Recording

The EEG was recorded from 59 Ag/AgCl electrodes (Electro-Cap International, Eaton, OH). It was amplified using a PORTI-32/MREFA amplifier (DC to 135 Hz) and digitized at 500 Hz. Electrode impedance was kept below 5 kΩ.

Table 2. Stimulus Examples

Introductory sentence (identical for all four conditions): Alle waren von Sandra beeindruckt. (Everybody was impressed by Sandra.)

Gesture D / Target word D:
Sie beherrschte den Ball[amb], was sich im Spiel beim Aufschlag deutlich zeigte.
(She controlled the ball[amb], which during the game at the serve clearly showed.)

Gesture D / Target word S:
Sie beherrschte den Ball[amb], was sich im Tanz mit dem Bräutigam deutlich zeigte.
(She controlled the ball[amb], which during the dance with the bridegroom clearly showed.)

Gesture S / Target word S:
Sie beherrschte den Ball[amb], was sich im Tanz mit dem Bräutigam deutlich zeigte.
(She controlled the ball[amb], which during the dance with the bridegroom clearly showed.)

Gesture S / Target word D:
Sie beherrschte den Ball[amb], was sich im Spiel beim Aufschlag deutlich zeigte.
(She controlled the ball[amb], which during the game at the serve clearly showed.)

The two labels of each condition indicate the conveyed meaning of the gesture and of the subsequent target word: dominant (D) or subordinate (S). In the original table, target words are set in bold and literal translations in italics. Cross-splicing was performed at the end of the main clause (i.e., in this case after the word “Ball”).


The left mastoid served as a reference. The vertical and horizontal electrooculogram was measured for artifact rejection purposes.

Data Analysis

Participantsʼ response accuracy was assessed by a repeated measures ANOVA with Gesture (D, S) and Target word (D, S) as within-subject factors. EEG data were rejected off-line by applying an automatic artifact rejection using a 200-msec sliding window on the electrooculogram (±30 μV) and EEG channels (±40 μV). All trials followed by incorrect responses were also rejected. On the basis of these criteria, approximately 33% of the data were excluded from further analysis. Single-subject averages were calculated for every condition, both at the homonym and at the target word position.
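The rejection criterion can be sketched schematically as follows. The exact implementation (e.g., whether the ±30/±40 μV limits apply to the deviation from the window mean or to the peak-to-peak amplitude) is not specified in the text, so the interpretation below is an assumption and the code is illustrative rather than the authors' pipeline.

```python
# Schematic sketch of a sliding-window artifact check (not the authors' exact
# implementation). Assumption: within any 200-msec window, the signal may not
# deviate more than +/- threshold microvolts from the window mean.
import numpy as np

SRATE = 500                      # Hz (as reported above)
WIN = int(0.2 * SRATE)           # 200-msec window = 100 samples

def trial_is_artifact(signal, threshold_uv):
    """signal: 1-D array (one channel, one epoch) in microvolts."""
    for start in range(0, len(signal) - WIN + 1):
        window = signal[start:start + WIN]
        if np.max(np.abs(window - window.mean())) > threshold_uv:
            return True
    return False

def reject_trial(eog_channels, eeg_channels):
    """EOG threshold +/-30 microvolts, EEG threshold +/-40 microvolts."""
    return (any(trial_is_artifact(ch, 30.0) for ch in eog_channels) or
            any(trial_is_artifact(ch, 40.0) for ch in eeg_channels))

# Hypothetical epoch: 1.2 sec of one EEG channel with a step-like artifact at 0.8 sec.
t = np.arange(0, 1.2, 1.0 / SRATE)
eeg = 5 * np.random.randn(len(t)) + 60 * (t > 0.8)
print(reject_trial(eog_channels=[], eeg_channels=[eeg]))   # -> True
```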

In the analyses at the homonym position, epochs were time-locked to the IP of the homonyms and lasted from 200 msec prior to the IP to 1000 msec afterward. A 200-msec prestimulus baseline was applied. Ten regions of interest (ROIs) were defined, namely, anterior left [AL] (AF7, F5, FC5), anterior center–left [ACL] (AF3, F3, FC3), anterior center [AC] (AFz, Fz, FCz), anterior center–right [ACR] (AF4, F4, FC4), anterior right [AR] (AF8, F6, FC6), posterior left [PL] (CP5, P5, PO7), posterior center–left [PCL] (CP3, P3, PO3), posterior center [PC] (CPz, Pz, POz), posterior center–right [PCR] (CP4, P4, PO4), and posterior right [PR] (CP6, P6, PO8). Based on visual inspection, a time window ranging from 100 to 400 msec was used to analyze the integration of gesture and homonym. A repeated measures ANOVA using Gesture (D, S), ROI (1, 2, 3, 4, 5), and Region (anterior, posterior) as within-subject factors was calculated. Only effects which involve the crucial factor Gesture will be reported.

In the target word analysis, epochs were time-locked to the target word and lasted from 200 msec prior to the target onset to 1000 msec post target. A 200-msec prestimulus baseline was applied. The identical 10 ROIs as in the previous analysis were used. The standard N400 time window ranging from 300 to 500 msec after target word onset was selected to analyze N400 effects. A repeated measures ANOVA using Gesture (D, S), Target word (D, S), ROI (1, 2, 3, 4, 5), and Region (anterior, posterior) as within-subject factors was performed. Only effects which involve the crucial factors Gesture or Target word will be reported. In all statistical analyses, the Greenhouse–Geisser (1959) correction was applied where necessary. In such cases, the uncorrected degrees of freedom (df), the corrected p values, and the correction factor ε are reported. Prior to all statistical analyses, the data were filtered with a high-pass filter of 0.2 Hz. Additionally, a 10-Hz low-pass filter was used for presentation purposes only.
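For orientation, a schematic sketch of the epoching, baseline correction, ROI averaging, and latency-window measurement described above follows. Array layout, function names, and the mean-amplitude measure are illustrative assumptions, not the authors' actual pipeline.

```python
# Schematic sketch (illustrative, not the authors' pipeline) of epoch extraction,
# baseline correction, ROI averaging, and mean amplitude in a latency window.
import numpy as np

SRATE = 500                       # Hz
PRE, POST = 0.2, 1.0              # epoch: -200 msec to +1000 msec around the trigger

def extract_epoch(continuous, trigger_sample):
    """continuous: (n_channels, n_samples) EEG in microvolts."""
    start = trigger_sample - int(PRE * SRATE)
    stop = trigger_sample + int(POST * SRATE)
    epoch = continuous[:, start:stop].astype(float)
    baseline = epoch[:, : int(PRE * SRATE)].mean(axis=1, keepdims=True)
    return epoch - baseline       # subtract the 200-msec prestimulus baseline

def roi_mean(epoch, channel_names, roi_channels):
    """Average one epoch over the channels of a ROI (e.g., AC = AFz, Fz, FCz)."""
    idx = [channel_names.index(ch) for ch in roi_channels]
    return epoch[idx].mean(axis=0)

def window_mean(roi_waveform, t_start, t_end):
    """Mean amplitude in a latency window given in seconds, e.g. 0.3-0.5 s for the N400."""
    a = int((PRE + t_start) * SRATE)
    b = int((PRE + t_end) * SRATE)
    return roi_waveform[a:b].mean()
```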

Behavioral Data

The response accuracy was adequate across the different congruency conditions (congruent gesture–speech pairings: 77%; incongruent gesture–speech pairings: 73%). A significant main effect of congruency [paired t(31) = 2.30, p < .05] indicated that response accuracy was better for the congruent than the incongruent pairings. Congruent pairings also showed a faster RT [congruent: 450 msec; incongruent: 474 msec; paired t(31) = −2.04, p = .05]. Note that because the response occurred with some delay, the RT data should be treated with caution. Overall, the behavioral data suggest that when gesture fragment and speech are congruent, comprehension is enhanced compared to when they are incongruent.

ERP Data: Homonym

In Figure 2A, an early enhanced negativity can be observed when the homonym is preceded by subordinate gesture fragments as compared to dominant gesture fragments. Although this effect seems very early, this negativity is likely to be a member of the N400 family when considering its scalp distribution. The early onset can be explained by the use of the IP as the onset trigger of the averages. A repeated measures ANOVA revealed a main effect of Gesture [F(1, 31) = 17.61, p < .0002], indicating that the integration of a subordinate gesture fragment with the corresponding homonym is more effortful than the integration of a dominant gesture fragment.

ERP Data: Target Word

As can be seen in Figure 2B and C, the ERPs show an increased negativity starting at about 300 msec for incongruent gesture–target word relations (DS, SD) in comparison to the congruent ones (DD, SS). Based on its latency and scalp distribution, the negativity was identified as an N400. The analysis of the 300–500 msec time window showed a significant two-way interaction of Gesture and Target word [F(1, 31) = 16.33, p < .0005], as well as a significant two-way interaction of Target word and Region [F(1, 31) = 4.79, p < .05].

On the basis of the Gesture and Target word interaction, step-down analyses were computed to assess the main effect of Gesture for both target word conditions. At dominant target words, the N400 was larger after a subordinate gesture compared to a dominant gesture [F(1, 31) = 10.14, p < .01]. In contrast, the N400 at subordinate target words was larger when preceded by a dominant gesture [F(1, 31) = 12.16, p < .01]. Thus, an incongruent gesture context elicited a larger N400 at both target word conditions. Yet, the effect was slightly larger for subordinate (Cohenʼs f² = 0.38) than for dominant targets (Cohenʼs f² = 0.32).
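The paper does not state how Cohen's f² was obtained. One common convention derives it from the partial eta squared of the ANOVA effect, which approximately reproduces the values reported above; the formula below is therefore an assumption, not the authors' stated computation.

```latex
% Assumed convention (not stated in the paper): Cohen's f^2 from partial eta squared.
f^2 = \frac{\eta_p^2}{1-\eta_p^2},
\qquad
\eta_p^2 = \frac{F \cdot df_{\text{effect}}}{F \cdot df_{\text{effect}} + df_{\text{error}}}
```

Under this convention, F(1, 31) = 12.16 yields f² ≈ 0.39 and F(1, 31) = 10.14 yields f² ≈ 0.33, close to the reported 0.38 and 0.32.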

Discussion

Experiment 1 clearly shows that gesture fragments presented up to the DP are able to influence speech–gesture integration and are able to disambiguate speech. Before discussing the ERP results, the results observed during the gating pretest merit attention.


Context-guided Gating

The DPs found in the context-guided gating might be considered surprisingly early, given what McNeill has written about the preparation phase (see Introduction). Relative to the gesture phases as determined by our rating, most of the DPs actually occurred before stroke onset, within the preparation phase of the gestures. It therefore seems that the preparation phase already suffices to select the appropriate meaning of a homonym. Although potentially intriguing, we have to be cautious in interpreting this result because there are several methodologically related explanations that may account for such an early effect.

First, it is possible that the way we determined our stroke onset (with the sound turned off) may have resulted in later stroke onsets than a rating conforming entirely to the suggestions of McNeill (1992, p. 375f) would have. McNeill has suggested how to determine the phases of gesture with the sound turned on. This methodological difference makes it difficult to relate our finding to McNeillʼs claims about the preparation phase.

Second, an inherent feature of a gating procedure is the highly repetitive nature of the task. Such repetitions may have induced processing strategies different from those used in real-time speech comprehension. It is also well known from studies on spoken word recognition (e.g., Salasoo & Pisoni, 1985; Grosjean, 1980) that additional contextual information enables participants to identify words earlier than without context. Because iconic gestures are seldom, if ever, produced without context (i.e., accompanying speech), the meaning of a gesture may be accessible rather early. One could speculate that the earliness of meaning comprehension may depend on the degree of contextual constraint. For instance, the participants in our study might have been able to decide upon the correct meaning more easily and faster because they only had to choose between the two different meanings of a homonym; that is, gestures related to Kamm, which means either comb or crest, can be easily discriminated by hand shape. The preparation of the comb video contains the beginning of a one-handed gripping movement, whereas there is an ascending and expanding two-handed movement in the crest video. This is in line with Kita et al. (1998), who argue that hand-internal information such as hand shape or wrist location tends to emerge toward the end of the preparation phase; in other words, the preparation anticipates features of the stroke (McNeill, 1992). Thus, it is not that surprising that the preparation phase informs the recipient what type of stroke phase might be following. It is, however, a novel finding that a recipient can actively interpret and use such preparatory motor activity in a forced-choice situation. It is important to note that this meaning anticipation only seems to be possible within the context of speech. An additional behavioral study, in which nine participants had to guess the meaning of the gestures clipped at the DP without any context, showed that participants were not able to get the correct meaning of the gesture fragments.4 This result confirms that without any context, the meaning of a gesture is rather imprecise (Hadar & Pinchas-Zamir, 2004; Krauss et al., 1991). When a context is given, however, most of our gesture fragments are able to disambiguate by means of displaying solely prestroke information.

ERPs at the Homonym Position

Experiment 1 addressed the question of whether gestures clipped at the DP suffice as disambiguation cues in on-line speech comprehension using a congruency judgment task. The observed ERP effects at the homonym and at the target word position indicate that, indeed, these short gesture fragments can be used for disambiguation, even though there was an asynchrony of 970 msec between the end of a gesture fragment and the corresponding homonym IP. The ERPs elicited at the IP position of the homonym showed a direct influence of gesture type. Subordinate gestures elicited a more negative ERP compared to dominant gestures. Although its onset was very early (probably due to the use of the IP as trigger point), we would like to suggest, on the basis of its scalp distribution, that this component belongs to the N400 family. The data therefore suggest that the integration of the homonym with the subordinate gesture fragment is probably more effortful than the integration with the dominant gesture fragment. A more extended discussion of this effect will be given in the General Discussion. For the moment, it is enough to know that the gesture fragments had a direct and differential impact during the processing of the homonym. The next question relates to whether this impact leads to a disambiguation of the homonym, influencing sentence processing further downstream. Such an effect would indicate that the gesture fragments, indeed, contained disambiguating information.

Figure 2. ERPs as found in Experiment 1. The left panel (A) shows the ERPs time-locked to the identification point of the homonyms. The solid line represents the ERP when the homonym was preceded by a dominant gesture fragment; the dotted line represents the ERP when the homonym was preceded by a subordinate gesture fragment. The middle (B) and right (C) panels represent the ERPs time-locked to the onset of the target word. The solid line represents the cases in which gesture cue and subsequent target word were congruent; the dotted line represents those instances where gesture cue and target word were incongruent.

ERPs at the Target Position

The ERP data on the target word showed clearly that the gesture fragments were used to disambiguate the homonym. When a target word was incongruent with how the gesture fragments disambiguated the homonym, a larger N400 was elicited compared to when targets were congruent with the preceding gesture-driven disambiguation. Interestingly, both types of target words showed this effect, suggesting that the activation of both meanings of a homonym varied reliably as a function of the preceding gesture context. Note, however, that the N400 effect was larger for the subordinate target words, suggesting that they are more sensitive to gesture influence than dominant targets. Such a finding may indicate that gesture fragments are a relatively weak disambiguating context (see Martin, Vu, Kellas, & Metcalf, 1999; Simpson, 1981).

It is important to note that in Experiment 1, participants were explicitly asked to compare the semantic content of a gesture fragment–homonym combination with the subsequent target word in order to solve the task. Thus, the task forced them to actively combine and integrate both sources of information. Due to the large distance between the end of the gesture fragment and the homonym IP (about 970 msec, see Table 1), it is, on the one hand, realistic to assume that gesture–speech integration in this particular case is an effortful memory-related process, because the gestural information has to be actively kept in working memory until the homonym is encountered. On the other hand, there are many suggestions in the literature that speech–gesture integration should occur more or less automatically and, therefore, should be effortless (Özyürek et al., 2007; Kelly et al., 2004). Automatic processes are characterized as being very fast, occurring without awareness and intention, and not tapping into limited-capacity resources (Schneider & Shiffrin, 1977; Shiffrin & Schneider, 1977; Posner & Snyder, 1975). If the integration of the gesture fragments with the homonyms is an automatic process, as suggested for complete gesture–speech integration by McNeill, Cassell, and McCullough (1994), it should be independent of experimental context and task.

To explore the underlying nature of the integration of a gesture fragment with a homonym, we used a more shallow memory task in Experiment 2 and examined whether participants would still use the gesture fragments as disambiguation cues even when the task did not require them to do so. As in Experiment 1, there was an asynchrony between gesture and speech, in that the gesture fragments ended about 970 msec before the IP of the homonyms. The rationale of the task was as follows: After a few trials without a task prompt, participants were asked whether they had seen a certain movement or heard a certain word in the previous video. No reference was made to the potential relationship between gesture and speech in the task instructions. Thus, participants had to pay attention to both gesture and speech, but were not required to actively combine both streams to solve the task. Holle and Gunter (2007), who used the same shallow task to investigate whether the integration of full gestures is automatic, found an N400 effect for both target word conditions. Based on that study, we hypothesized that the shortened gestures used in the present study should also modulate the N400 at the position of the target word under shallow task conditions. Additionally, we also expected an enhanced negativity for the integration of subordinate as compared to dominant gesture fragments at the position of the homonym, as was observed in Experiment 1.

EXPERIMENT 2: ON THE AUTOMATICITY OF THE INTEGRATION OF GESTURE FRAGMENTS INTO SPEECH

Methods

Participants

Thirty-four native German-speaking participants were paid for their participation and signed a written informed consent. Two of them were excluded because of excessive artifacts. The remaining 32 participants (16 women, age range = 21–29 years, mean = 25.6 years) were right-handed (mean laterality coefficient = 93.8), had normal or corrected-to-normal vision, and had no known hearing deficits. None had taken part in any of the previous experiments.

Stimuli

The same stimuli as in Experiment 1 were used.


Procedure

Presentation of the stimuli was identical to Experiment 1. Participants, however, performed a different, more shallow task and received the following instructions: “In this experiment, you will be seeing a number of short videos with sound. During these videos the speaker moves her arms. After some videos, you will be asked whether you have seen a certain movement or heard a certain word in the previous video.”

A visual prompt cue was presented after the offset of each video. After 87.5% of all videos, the prompt cue announced the upcoming trial, that is, no response was required in these trials. After 6.25% of all videos, the prompt cue indicated that participants should prepare for the movement task. A short silent video clip was presented as a probe. The probes consisted of soundless full-length gesture videos. After the offset of each probe video, a question mark prompted the participants to respond whether the probe contained the movement of the previous experimental item. Feedback was given if participants answered incorrectly or if they failed to respond within 2000 msec after the response cue. After the remaining 6.25% of the videos, the prompt cue informed the participants that the word task had to be carried out. Participants had to indicate whether a visually presented probe word had been part of the previous sentence. The probe words were selected from sentence-initial, -middle, and -final positions of the experimental sentence. Response and feedback were identical to the movement task trials.

ERP Recording and Data Analysis

The parameters for the recording, artifact rejection, and analysis were the same as in Experiment 1. The amount of behavioral data obtained in the present experiment is quite small (24 responses overall), with half of them originating from the movement task and the other half from the word task. Therefore, we decided not to use the behavioral data as a rejection criterion for the ERP analyses. Approximately 22% of all trials were rejected from the final analysis of both the homonym as well as the target word position.

Results

Behavioral Data

Overall, participants gave 87% correct answers, indicating that although the task in Experiment 2 was rather shallow, participants nonetheless paid attention to the stimulus material. Performance was less accurate in the movement task (79% correct) than in the word task (96% correct; Wilcoxon signed-rank test; z = −4.72; p < .001).
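The task comparison reported above can be sketched as follows; the per-participant accuracies are placeholders generated to roughly match the reported means, and SciPy returns the signed-rank statistic W rather than the z value reported in the paper.

```python
# Schematic sketch of the reported comparison (not the authors' data or code):
# per-participant accuracies in the movement task vs. the word task are compared
# with a Wilcoxon signed-rank test.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
n_participants = 32
# Placeholder accuracies roughly matching the reported means (79% vs. 96%).
movement_acc = np.clip(rng.normal(0.79, 0.08, n_participants), 0, 1)
word_acc = np.clip(rng.normal(0.96, 0.03, n_participants), 0, 1)

stat, p = wilcoxon(movement_acc, word_acc)
print(f"Wilcoxon signed-rank: W = {stat:.1f}, p = {p:.4f}")
```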

ERP Data: Homonym

Figure 3A shows no visible difference between subordinate and dominant gesture fragments at the homonym position. The corresponding ANOVA indicated no statistically significant differences (all Fs < .53, p > .49).

ERP Data: Target Word

As can be seen in Figure 3B and C, there is barely a visible difference between the congruent and incongruent gesture cues for both target word conditions. The repeated measures ANOVA confirmed this impression by yielding no significant four-way interaction of Gesture, Target word, ROI, and Region [F(4, 124) = 0.32, p > .69; ε = .42], nor any other significant interaction involving the crucial factors of Gesture or Target word (all Fs < 1.28, all ps > .28); that is, there was no significant disambiguating influence of gesture on speech in the data.

Figure 3. ERPs as found in Experiment 2.

Discussion

Experiment 2 dealt with the question of whether gesture fragments are integrated with speech when a shallow task is used. Both at the homonym as well as at the target words, no significant ERP effects were found. Thus, fragmented gestures do not influence the processing of coexpressive ambiguous speech when the task does not explicitly require an integration of gesture and speech. One way to interpret this finding is to suggest that the integration of gesture fragments is not an automatic process. Such a conclusion would, however, contradict the literature that indicates that gesture–speech integration is more or less automatic in nature (McNeill et al., 1994). It is therefore sensible to look more carefully at Experiment 2 and see whether a more parsimonious hypothesis can be formulated. Using the identical experimental setup, Holle and Gunter (2007) found a disambiguating effect of the original full-length gestures under shallow task conditions. One crucial difference between the gestures used by Holle and Gunter and the gesture fragments used here is whether the gesture overlaps with its corresponding co-speech unit (i.e., the homonym). Whereas complete gestures span over a larger amount of time and have a significant temporal overlap with the homonym, no such temporal overlap is present between the gesture fragments and the homonyms (see Figure 1). Remember that due to the clipping procedure, the gesture fragments end, on average, 970 msec prior to the homonym IP. Thus, at the time the gesture fragment ends, there is no coexpressive speech unit with which it can be integrated. When effortful processing is induced by the task (Experiment 1), this time lag does not seem to be problematic. If, however, the task does not explicitly require participants to actively combine gesture and speech as in Experiment 2, the time lag between gesture and speech may be problematic, probably because the minimal amount of information present in the gesture fragments gets lost over time. Thus, an alternative explanation is that automatic integration of gesture fragments does not occur when a gesture and its corresponding speech unit do not have a sufficient amount of temporal overlap. It is important to note that such an alternative explanation is also in accordance with McNeill (1992), who suggested that it is the temporal overlap (i.e., simultaneity between gesture and speech) that enables a rather automatic and immediate integration of gesture and speech. In his semantic synchrony rule, McNeill states that the same “idea unit” (p. 27) must occur simultaneously in both gesture and speech in order to allow proper integration. In other words, he suggests that if gestures and speech are synchronous, they should be integrated in a rather automatic way. So far, however, there has been little empirical work on the effects of gesture–speech synchronization in comprehension (but see Treffner, Peter, & Kleidon, 2008).

In Experiment 3, we explored the temporal overlap hypothesis as formulated above. We synchronized the gesture fragments with the homonyms in such a way that the DPs of the gestures were aligned with the IPs of the homonyms. The rest of the experiment remained exactly the same as in Experiment 2. Thus, again, the shallow task was used. If, as suggested by the temporal overlap hypothesis, synchronization plays a crucial role during speech–gesture integration, one would predict ERP effects similar to those observed in Experiment 1, both in the immediate context of the homonym as well as further downstream at the target word. In contrast, if the integration of gesture fragments and speech is not automatic, gesture should not modulate the ERPs at either sentence position, as was observed in Experiment 2.

EXPERIMENT 3: THE ROLE OF TEMPORAL SYNCHRONY FOR THE INTEGRATION OF GESTURE AND SPEECH

Methods

Participants

Thirty-eight native German-speaking participants were paid for their participation and signed a written informed consent. Six of them were excluded because of excessive artifacts. The remaining 32 participants (15 women, age range = 19–30 years, mean = 25.5 years) were right-handed (mean laterality coefficient = 93.9), had normal or corrected-to-normal vision, had no known hearing deficits, and did not participate in any of the previous experiments.

Stimuli

The 96 gesture videos used in Experiments 1 and 2 constituted the basis for the stimuli of Experiment 3. In order to establish temporal synchrony between a gesture fragment and the corresponding speech unit, the DP of the gesture was temporally aligned with the IP of the homonym, that is, the point in time at which the homonym was clearly recognized by listeners. The IPs had been determined previously using a gating paradigm (see Experiment 1). Interestingly, the onset of the preparation phase of the synchronized gesture fragments still precedes the onset of the homonym by an average of 160 msec. Thus, the gesture onset still precedes the onset of the coexpressive speech unit, as is usually observed in natural conversation (McNeill, 1992; Morrel-Samuels & Krauss, 1992).
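The synchronization manipulation amounts to shifting each gesture fragment so that its DP coincides with the homonym IP. A minimal sketch of that arithmetic, using the mean timing values from Table 1, is given below; the actual alignment was done per item, which is why the sketch yields roughly 150 msec rather than the reported 160-msec average lead of the preparation onset.

```python
# Minimal sketch of the synchronization logic for Experiment 3: shift each gesture
# fragment so that its DP coincides with the homonym's IP. Values below are the
# mean timings from Table 1 (seconds); the actual alignment was done per item.
prep_onset = 1.70          # gesture preparation onset
dp = 2.10                  # gesture disambiguation point
homonym_onset = 2.84
homonym_ip = 3.09          # homonym identification point

shift = homonym_ip - dp                      # how far the fragment is moved (about +0.99 s here)
new_prep_onset = prep_onset + shift          # preparation onset after alignment
lead = homonym_onset - new_prep_onset        # gesture onset still precedes the homonym
print(f"shift = {shift:.2f} s, gesture onset precedes homonym onset by {lead*1000:.0f} msec")
```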

Procedure, ERP Recording, and Data Analysis

The procedure, as well as the parameters for ERP recording, artifact rejection, and analysis, was identical to Experiment 2. As in Experiment 2, behavioral data were not used as a rejection criterion. Overall, 25% of the trials were excluded from further analysis. Based on visual inspection, separate repeated measures ANOVAs with Gesture (D, S), Target word (D, S), ROI (1, 2, 3, 4, 5), and Region (anterior, posterior) as within-subject factors were performed for the time window of the homonym (100 to 400 msec) and the target word (300 to 500 msec). These time windows were identical to those used in Experiments 1 and 2. Additionally, we performed an ANOVA for an earlier time window at the position of the homonym (50 to 150 msec).
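As a rough illustration of how such a four-factor repeated measures ANOVA could be computed on exported mean amplitudes, the sketch below uses pandas and statsmodels. The file name and column names are assumptions made for the example, and statsmodels' AnovaRM does not apply the Greenhouse–Geisser correction reported for the F tests, which would have to be computed separately.

```python
# Minimal sketch of a 2 (Gesture) x 2 (Target word) x 5 (ROI) x 2 (Region)
# within-subject ANOVA on mean ERP amplitudes. File and column names are
# hypothetical placeholders, not part of the original analysis scripts.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Expected long format: one row per subject x Gesture x TargetWord x ROI x
# Region cell, with the mean amplitude (in microvolts) of the analysis window.
df = pd.read_csv("homonym_window_amplitudes.csv")

anova = AnovaRM(
    data=df,
    depvar="amplitude",
    subject="subject",
    within=["Gesture", "TargetWord", "ROI", "Region"],
).fit()
print(anova)  # uncorrected F tests; apply a sphericity correction separately
```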

Results

Behavioral Data

Similar behavioral results as in Experiment 2 were observed. Participants responded correctly in 82% of all test trials. Again, the movement task was carried out less efficiently (74% correct) than the word task (90% correct; Wilcoxon signed-rank test: z = −3.80, p < .001).
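For completeness, a paired Wilcoxon signed-rank test of this kind can be reproduced with scipy as sketched below; the per-participant accuracies are simulated stand-ins, not the data reported here.

```python
# Sketch of the paired task comparison; the per-subject accuracies below are
# simulated placeholders roughly matching the reported means (74% vs. 90%).
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
movement_acc = rng.normal(loc=0.74, scale=0.05, size=32)
word_acc = rng.normal(loc=0.90, scale=0.04, size=32)

stat, p = wilcoxon(movement_acc, word_acc)
print(f"Wilcoxon W = {stat:.1f}, p = {p:.4f}")
```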

ERP Data: Homonym

As can be seen in Figure 4A, an increased negativity is elicited when the homonym is preceded by a subordinate gesture fragment as compared to a dominant one (see Note 5). As in Experiment 1, the earliness of the effect can be explained by the use of the homonym IPs as a trigger for the averages. A repeated measures ANOVA yielded a significant main effect of Gesture [F(1, 31) = 6.09, p < .05], a two-way interaction of Gesture and ROI [F(4, 124) = 8.09, p < .001], as well as a significant three-way interaction of Gesture, Region, and ROI [F(4, 124) = 4.07, p < .05; ε = .46]. These results suggest that the integration of a subordinate gesture fragment with a homonym is more difficult than the integration of a dominant one. Further step-down analyses revealed that the main effect of Gesture was strongest over fronto-central sites [F(1, 31) = 10.46, p < .001].

ERP Data: Target Word

No significant Gesture × Target word interaction was found in the early time window (all Fs < 1.05, all ps > .34). For the N400 time window, however, the ANOVA revealed a significant interaction of Gesture and Target word [F(1, 31) = 7.72, p < .01]. Based on this interaction, the simple main effects of Gesture were tested separately for the two Target word conditions. At subordinate target words, the N400 was significantly larger after a dominant gesture compared to a subordinate one [F(1, 31) = 6.63, p < .05]. No such effect of gesture–target word congruency was found at dominant target words [F(1, 31) = 0.33, p = .57]. Thus, when gesture fragments and speech are synchronized, the integration of both sources of information seems to be more automatic and less effortful, at least for the subordinate word meaning.

GENERAL DISCUSSION

The present set of experiments explored gesture–speech integration and the degree to which integration depends on the temporal synchrony between both domains. In order to enhance the precision in measuring temporal aspects of gesture–speech integration, we presented our participants with gesture fragments. To do so, we first assessed the minimal amount of iconic gesture information needed to reliably disambiguate a homonym, using a context-guided gating task. In Experiment 1, where gesture fragment and homonym were asynchronous and an explicit task was used, the ERPs triggered by homonym IPs revealed a direct influence of gesture during the processing of the ambiguous word. Subordinate gesture fragments elicited a more negative deflection than dominant gesture fragments, indicating that the integration of subordinate gesture fragments with the homonym is more effortful. The ERP data at the target words showed that the gesture fragments were not only integrated with speech but were also used to disambiguate the homonym. When a target word was incongruent with the meaning of the preceding gesture–homonym combination, a larger N400 was elicited than when this meaning was congruent.

Figure 4. ERPs as found in Experiment 3.

In order to explore the nature of the gesture–speech interaction, we used a more shallow task in Experiment 2. If gesture–speech integration is an automatic process, this task manipulation should have resulted in ERP patterns similar to those found in Experiment 1, where participants were explicitly asked to judge the compatibility between gesture and speech. This was, however, not the case, as no significant ERP effects were observed in Experiment 2. One possible interpretation of this negative finding is that gesture–speech integration is not an automatic process when gesture fragments are used. The data of Experiment 3, however, suggest a different reason for the null finding of Experiment 2.

In Experiment 3, we used the shallow task again but also synchronized the gesture fragments and homonyms. This synchrony manipulation led to a robust negativity for the subordinate gestures at the homonym as well as to significant N400 effects at subordinate target words. The combined ERP data therefore suggest that when gesture and speech are in synchrony, their interaction is more or less automatic. When both domains are not in synchrony, effortful gesture-related memory processes are necessary to be able to combine the gesture fragment and speech context in such a way that the homonym is disambiguated correctly.

Gesture–Speech Synchrony

The present series of ERP experiments is, at least to the authors' knowledge, the first to investigate the effect of synchrony or temporal overlap on gesture–speech integration. In particular, our design allowed us both to analyze the direct integration of a gesture fragment at the homonym and to explore the indirect consequence of this integration at a later target word, presented a few words downstream in the sentence. Although widely recognized as one of the crucial factors for gesture–speech comprehension, the temporal aspects of iconic gestures have been understudied so far. In 1992, McNeill stressed the significance of timing for gesture–speech integration by putting forward his semantic synchrony rule. He stated that the same semantic content has to occur simultaneously in gesture and speech in order to allow proper integration. This means that if both sources of information are synchronously present, the information should be integrated, and this integration is suggested to occur in an automatic fashion (McNeill et al., 1994). Our results point in a similar direction. When the iconic gesture fragments were synchronized with their coexpressive speech unit (i.e., the homonym), an effect of immediate iconic gesture–speech integration was found at the position of the homonym (Experiment 3). Subordinate gesture fragments elicited a larger negativity than dominant gesture fragments. This finding is in line with Simpson and Krueger (1991), who showed for ambiguous words presented in a neutral context that the dominant meaning was activated to a stronger degree than the subordinate meaning. Because of the mutual influence of homonym and gesture fragments, we cannot really tell how exactly the integration of iconic gestures and speech works, although a plausible scenario is given below. We suggest that the ERP results at the homonym reflect an effect of word meaning frequency: a gesture fragment related to the more frequent dominant meaning of a homonym can be integrated more easily with the homonym than a gesture fragment related to the less frequent subordinate meaning.

In contrast to the above-mentioned results, no significant ERP effect was found at the homonym when a gesture fragment and the corresponding homonym were asynchronous, thereby lacking temporal overlap (Experiment 2). We assume that only if the gesture fragment and the homonym are synchronous, and thus share a certain amount of temporal overlap, can they be automatically integrated into one single idea unit. If, however, there is no overlap and immediate automatic integration is not feasible, the information within the gesture fragments gets lost over time and cannot be integrated with the homonym. Thus, timing seems to be an important factor for gesture–speech integration. Most likely, it is not absolute temporal synchrony that is crucial, but rather the temporal overlap between gesture and speech. Presumably, there is some kind of time window for iconic gesture–speech integration, as has been found for other types of multimodal integration (e.g., the McGurk effect; Van Wassenhove, Grant, & Poeppel, 2007). Again, more experiments are clearly needed to substantiate such a conjecture.

Iconic Gestures and Memory

The ERP data of Experiment 1 showed that when the task explicitly required an integration of gesture and speech, participants can use effortful memory processes to overcome the integration problems related to nonoverlapping gesture and speech channels. We know from the behavioral test that the gesture fragments themselves are rather meaningless without context. Therefore, they have to be aligned with the homonym to become a single meaningful unit (i.e., related to the dominant or subordinate meaning of the homonym). One possible way to achieve this is to store the gestural content actively in working memory until the corresponding homonym has been encountered. At this point in time, stored gestural information and the corresponding speech unit are synchronous again and can be integrated. The type of memory involved in this process is less likely to be semantic in nature because the behavioral data showed that the gesture fragments are virtually meaningless when presented without a context. We therefore speculate that semantic interpretation of the gesture fragments is delayed until the IP of the homonym has been processed. Until then, the gesture fragments are thought to be stored in a nonsemantic (e.g., movement-based) format in working memory (see Note 6).

Context Strength and Gesture–Speech Integration

As argued above, gesture and speech are integrated at the homonym position, leading to a disambiguation of the homonym that can be measured further downstream in the sentence at the target word. In Experiment 1, this disambiguation was independent of word meaning frequency. In Experiment 3, however, only the subordinate targets showed a clear N400 effect. These results are somewhat puzzling and need an explanation. Previous research on homonym disambiguation has shown that weak contexts only affect the subordinate meaning but not the dominant meaning of a homonym (e.g., Holle & Gunter, 2007; Martin et al., 1999; Simpson, 1981). This is exactly what seems to have happened in Experiment 3, where a shallower task was used. When participants are not pushed by the task to integrate gesture and speech, the meaning of the fragments is treated as a weak context. In Experiment 1, however, the task required the participants to actively combine the information from the two domains. Because the task demands modified the perceived importance of the semantic relationship between gesture and speech, they also changed the weak gesture context into a strong one.

What Happens at the Homonym?—A Potential Scenario for Gesture–Speech Integration

It must be clear by now that the processes taking place at the position of the homonym are quite complex. Let us use Experiment 1 as an example to come up with a plausible scenario. First, the gesture fragment is stored in a nonsemantic memory buffer. At this point in time, the fragment does not yet have a meaning, because there is no suitable context to which it can be related. A meaningful interpretation is only possible after the homonym has been processed, which is the case at the IP. At this point, two things seem to happen in parallel. On the one hand, the homonym activates both the dominant and the subordinate word meaning (cf. Swinney, 1979). On the other hand, the gesture fragment acquires meaning by interacting with the corresponding meaning of the homonym. The meaning of the homonym supported by the gesture is kept active, while the unsupported meaning decays or is actively inhibited, resulting in a disambiguation of the homonym. Gesture–speech integration in our case is thus a complex process of mutual disambiguation between both channels: it is not simply that gesture influences speech or the other way around. Both information channels impact and need each other at the same time in order to disambiguate. This process of mutual influence between gesture and speech, generating a single semantic unit, is the key part of any gesture–speech integration. Identifying the precise timing and structure of this process is, without doubt, one of the major goals of future gesture research. The present results may provide a starting point for this quest.
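Purely to make the proposed sequence of steps concrete, the toy sketch below formalizes the scenario: the fragment sits in a nonsemantic buffer until the homonym's IP, at which point both meanings are activated and the meaning best supported by the buffered gesture features is kept. The data structures, feature sets, and the simple selection rule are our own illustrative simplifications, not a computational model proposed here.

```python
# Toy formalization of the integration scenario described above. All feature
# sets, names, and the winner-take-all rule are illustrative simplifications.
from dataclasses import dataclass

@dataclass
class GestureFragment:
    # Nonsemantic, movement-based content held in working memory until the IP.
    movement_features: frozenset

def disambiguate(homonym_meanings: dict[str, frozenset],
                 fragment: GestureFragment) -> str:
    """At the homonym IP, both meanings are activated in parallel; the meaning
    whose associated features overlap most with the buffered fragment is kept,
    while the unsupported meaning is simply discarded (decay/inhibition)."""
    return max(
        homonym_meanings,
        key=lambda meaning: len(homonym_meanings[meaning]
                                & fragment.movement_features),
    )

# Invented example homonym with two readings (not necessarily from the
# stimulus set) and a fragment whose movement features favor one reading.
meanings = {
    "reading_A": frozenset({"sway", "arm_arc"}),
    "reading_B": frozenset({"bounce", "throw"}),
}
fragment = GestureFragment(movement_features=frozenset({"bounce"}))
print(disambiguate(meanings, fragment))  # -> "reading_B"
```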

Conclusion

The present set of experiments showed that when iconic gesture fragments and speech are in synchrony, their interaction is task-independent and, in this sense, automatic. In contrast, when an iconic gesture is not in synchrony with its related speech unit, more controlled, active memory processes are necessary to be able to combine the gesture fragment with its related speech unit in such a way that the homonym is disambiguated. Thus, when synchrony fails, memory can help.

Acknowledgments

We thank Ina Koch and Kristiane Werrmann for data acquisition, Sven Gutekunst for technical assistance, and Shirley-Ann Rüschemeyer and Jim Parkinson for editing the article from a native speaker perspective. Arjan van Eerden did the stimulus preparation and data collection of Experiment 3 for his bachelor thesis. We thank five anonymous reviewers for their helpful comments.

Reprint requests should be sent to Christian Obermeier, Max-Planck-Institute for Human Cognitive and Brain Sciences, PO Box 500 355, 04303 Leipzig, Germany, or via e-mail: [email protected].

Notes

1. Note that there are also so-called holds (prestroke hold, poststroke hold), whose main purpose is to ensure synchrony between a gesture and the corresponding speech unit. Because there were no such holds in the stimulus material used here, we do not describe those gesture phases in more detail.

2. Because both studies are based on the same stimulus material, the stimulus example in Table 2 for the present study also illustrates the sentences used by Holle and Gunter.

3. The N400 is a negative deflection in the ERP of the electroencephalogram (EEG) with its peak around 400 msec and is hypothesized to reflect semantic processing (Hinojosa, Martin-Loeches, & Rubia, 2001). In particular, the N400 has been shown to be sensitive to the difficulty of integrating a stimulus (e.g., a word, but also a gesture) into a preceding context. The easier this can be done, the smaller the N400 amplitude.

4. The participants were presented with a silent clip of the gesture fragment and had to write down a free proposal of the meaning of the fragment. Only 7% of all gestures were identified correctly (i.e., semantically related to the meaning of the target word or homonym), and again only 7% of these correct responses included the actual target word (i.e., 0.5% overall). The responses were assessed by two raters (interrater reliability = .80).

5. In contrast to Experiments 1 and 2, clear early ERP components can be seen in both gesture conditions in Experiment 3. These components are due to the offset of the gesture fragment and relate to the physical properties of the stimulus (cf. Donchin, Ritter, & McCallum, 1978). For the present purpose, only the negative modulation of the ERP is of importance.

6. An alternative explanation would be that the decontextualized gesture fragments are immediately semantically interpreted, and that the information is then stored in a semantic format in working memory. The explanation for the results of the behavioral pretest would then be that the semantic interpretations elicited by the decontextualized gesture fragments are just too underspecified to reliably elicit a verbal description. However, we believe that this is a less likely explanation of our data.

REFERENCES

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12, 335–359.
Donchin, E., Ritter, W., & McCallum, W. C. (1978). Cognitive psychophysiology: The endogenous components of the ERPs. In E. Callaway, P. Teuting, & S. Koslow (Eds.), Event-related brain potentials in man (pp. 349–441). New York: Academic Press.
Emmorey, K., & Corina, D. (1990). Lexical recognition in sign language: Effects of phonetic structure and morphology. Perceptual and Motor Skills, 71, 1227–1252.
Friederici, A., Gunter, T. C., Hahne, A., & Mauth, K. (2004). The relative timing of syntactic and semantic processes in sentence comprehension. NeuroReport, 15, 165–169.
Gaskell, M. G., & Marslen-Wilson, W. D. (1997). Integrating form and meaning: A distributed model of speech perception. Language and Cognitive Processes, 12, 613–656.
Green, A., Straube, B., Weiss, S., Jansen, A., Willmes, K., Konrad, K., et al. (2009). Neural integration of iconic and unrelated coverbal gestures: A functional MRI study. Human Brain Mapping, 30, 3309–3324.
Greenhouse, S. W., & Geisser, S. (1959). On methods in the analysis of profile data. Psychometrika, 24, 95–112.
Grosjean, F. (1980). Spoken word recognition processes and the gating paradigm. Perception & Psychophysics, 28, 267–283.
Grosjean, F. (1983). How long is the sentence? Prediction and prosody in the on-line processing of language. Linguistics, 21, 501–529.
Grosjean, F. (1996). Gating. Language and Cognitive Processes, 11, 597–604.
Gunter, T. C., Wagner, S., & Friederici, A. D. (2003). Working memory and lexical ambiguity resolution as revealed by ERPs: A difficult case for activation theories. Journal of Cognitive Neuroscience, 15, 643–657.
Hadar, U., & Pinchas-Zamir, L. (2004). The semantic specificity of gesture: Implications for gesture classification and function. Journal of Language and Social Psychology, 23, 204–214.
Hinojosa, J. A., Martin-Loeches, M., & Rubia, F. J. (2001). Event-related potentials and semantics: An overview and an integrative proposal. Brain and Language, 78, 128–139.
Holle, H., & Gunter, T. C. (2007). The role of iconic gestures in speech disambiguation: ERP evidence. Journal of Cognitive Neuroscience, 19, 1175–1192.
Holle, H., Gunter, T. C., Rueschemeyer, S. A., Hennenlotter, A., & Iacoboni, M. (2008). Neural correlates of the processing of co-speech gestures. Neuroimage, 39, 2010–2024.
Holle, H., Obleser, J., Rueschemeyer, S. A., & Gunter, T. C. (2010). Integration of iconic gestures and speech in left superior temporal areas boosts speech comprehension under adverse listening conditions. Neuroimage, 49, 875–884.
Jansen, E., & Povel, D. J. (2004). The processing of chords in tonal melodic sequences. Journal of New Music Research, 33, 31–48.
Kelly, S. D., Kravitz, C., & Hopkins, M. (2004). Neural correlates of bimodal speech and gesture comprehension. Brain and Language, 89, 253–260.
Kelly, S. D., Ward, S., Creigh, P., & Bartolotti, J. (2007). An intentional stance modulates the integration of gesture and speech during comprehension. Brain and Language, 101, 222–233.
Kita, S., van Gijn, I., & van der Hulst, H. (1998). Movement phase in signs and co-speech gestures, and their transcriptions by human coders. Lecture Notes in Computer Science, 1371, 23–35.
Krauss, R. M., Morrel-Samuels, P., & Colasante, C. (1991). Do conversational hand gestures communicate? Journal of Personality and Social Psychology, 61, 743–754.
Lausberg, H., & Kita, S. (2003). The content of the message influences the hand choice in co-speech gestures and in gesturing without speaking. Brain and Language, 86, 57–69.
Lausberg, H., & Slöetjes, H. (2008). Gesture coding with the NGCS–ELAN system. In A. J. Spink, M. R. Ballintijn, N. D. Rogers, F. Grieco, L. W. S. Loijens, L. P. J. J. Noldus, et al. (Eds.), Proceedings of Measuring Behavior 2008, 6th International Conference on Methods and Techniques in Behavioral Research (pp. 176–177). Maastricht: Noldus.
Levelt, W. J., Richardson, G., & la Heij, W. (1985). Pointing and voicing in deictic expressions. Journal of Memory and Language, 24, 133–164.
Marslen-Wilson, W. (1987). Functional parallelism in spoken word-recognition. Cognition, 25, 71–102.
Martin, C., Vu, H., Kellas, G., & Metcalf, K. (1999). Strength of discourse context as a determinant of the subordinate bias effect. Quarterly Journal of Experimental Psychology: Section A, Human Experimental Psychology, 52, 813–839.
McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago: University of Chicago Press.
McNeill, D., Cassell, J., & McCullough, K.-E. (1994). Communicative effects of speech-mismatched gestures. Research on Language and Social Interaction, 27, 223–237.
Morrel-Samuels, P., & Krauss, R. M. (1992). Word familiarity predicts temporal asynchrony of hand gestures and speech. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 615–622.
Oldfield, R. C. (1971). The assessment and analysis of handedness: The Edinburgh inventory. Neuropsychologia, 9, 97–113.
Özyürek, A., Willems, R. M., Kita, S., & Hagoort, P. (2007). On-line integration of semantic information from speech and gesture: Insights from event-related brain potentials. Journal of Cognitive Neuroscience, 19, 605–616.
Posner, M. I., & Snyder, C. R. R. (1975). Attention and cognitive control. In R. L. Solso (Ed.), Information processing and cognition: The Loyola symposium. Hillsdale, NJ: Erlbaum.
Salasoo, A., & Pisoni, B. P. (1985). Interaction of knowledge sources in spoken word identification. Journal of Memory and Language, 24, 210–231.
Schneider, W., & Shiffrin, R. M. (1977). Controlled and automatic human information-processing: 1. Detection, search, and attention. Psychological Review, 84, 1–66.
Shiffrin, R. M., & Schneider, W. (1977). Controlled and automatic human information-processing: 2. Perceptual learning, automatic attending, and a general theory. Psychological Review, 84, 127–190.
Simpson, G. B. (1981). Meaning dominance and semantic context in the processing of lexical ambiguity. Journal of Verbal Learning and Verbal Behavior, 20, 120–136.
Simpson, G. B., & Krueger, M. (1991). Selective access of homograph meanings in sentence context. Journal of Memory and Language, 30, 627–643.
Swinney, D. A. (1979). Lexical access during sentence comprehension: (Re)consideration of context effects. Journal of Verbal Learning and Verbal Behavior, 18, 645–659.
Treffner, P., Peter, M., & Kleidon, M. (2008). Gestures and phases: The dynamics of speech–hand communication. Ecological Psychology, 20, 32–64.
Van Wassenhove, V., Grant, K. W., & Poeppel, D. (2007). Temporal window of integration in auditory–visual speech perception. Neuropsychologia, 45, 598–607.
Willems, R. M., Özyürek, A., & Hagoort, P. (2007). When language meets action: The neural integration of gesture and speech. Cerebral Cortex, 17, 2322–2333.
Wu, Y. C., & Coulson, S. (2005). Meaningful gestures: Electrophysiological indices of iconic gesture comprehension. Psychophysiology, 42, 654–667.

