Published papers - Wiki for iCub and Friends


SIXTH FRAMEWORK PROGRAMME PRIORITY 1

FOCUSING AND INTEGRATING RESEARCH New and Emerging Science and Technology

Project no. 5010 CONTACT

Learning and Development of CONTextual ACTion

Instrument: SPECIFIC TARGETED RESEARCH PROJECT

Thematic Priority: New and Emerging Science and Technology (NEST), Adventure Activities

Published papers

Period covered: from 01/09/2005 to 31/08/2006
Date of preparation: 9/10/2006
Start date of project: 01/09/2005
Duration: 36 months
Project coordinator's name: Laila Craighero, Giulio Sandini
Project coordinator organisation name: University of Genova, DIST, LIRA-lab
Revision 1

1. Arbib, M., Metta, G., van der Smagt, P. Neurorobotics: From Vision to Action. In Handbook of Robotics, Chapter 64. Submitted, July 2006.
2. Fadiga, L., Craighero, L., Olivier, E. (2005). Human motor cortex excitability during the perception of others' action. Current Opinion in Neurobiology, 15:213-8.
3. Fadiga, L., Craighero, L. (2006). Hand actions and speech representation in Broca's area. Cortex, 42:486-90.
4. Fadiga, L., Craighero, L., Roy, A.C. (2006). Broca's area: A speech area? In Y. Grodzinsky & K. Amunts (Eds.), Broca's region. New York: Oxford University Press.
5. Fadiga, L., Craighero, L., Fabbri Destro, M., Finos, L., Cotillon-Williams, N., Smith, A.T., Castiello, U. (in press). Language in shadow. Social Neuroscience.
6. Falck-Ytter, T., Gredebäck, G., & von Hofsten, C. (2006). Infants predict other people's action goals. Nature Neuroscience, 9, 878-879.
7. Fitzpatrick, P., Needham, A., Natale, L., Metta, G. (in press). Shared Challenges in Object Perception for Robots and Infants. Infant and Child Development, special issue 2006.
8. Gredebäck, G., Örnkloo, H., von Hofsten, C. (2006). The development of reactive saccade latencies. Experimental Brain Research, 173(1), 159-164.
9. Gustavsson, L., Marklund, E., Klintfors, E., Lacerda, F. (2006). Directional Hearing in a Humanoid Robot - Evaluation of Microphones Regarding HRTF and Azimuthal Dependence. In Fonetik 2006, pp. 45-48.
10. Hörnstein, J., Lopes, M., Santos-Victor, J., Lacerda, F. (2006). Sound localization for humanoid robots - building audio-motor maps based on the HRTF. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Beijing, China, Oct. 9-15, 2006.
11. Jamone, L., Metta, G., Nori, F., Sandini, G. (2006). James: A Humanoid Robot Acting over an Unstructured World. Accepted at the Humanoids 2006 conference, December 4-6, 2006, Genoa, Italy.
12. Klintfors, E., Lacerda, F. (2006). Potential relevance of audio-visual integration in mammals for computational modeling. In Proceedings of the Ninth International Conference on Spoken Language Processing (Interspeech 2006 - ICSLP), pp. 1403-1406.
13. Lacerda, F. (in press). Ett ekologiskt perspektiv på tidig talspråksutveckling [An ecological perspective on early spoken-language development]. Proceedings of the 10th Nordisk Barnspråkssymposium.
14. Lacerda, F., Sundberg, U. (2006). Proposing ETLA: An Ecological Theory of Language Acquisition. Revista de Estudos Linguísticos da Universidade do Porto.
15. Marklund, E., Lacerda, F. (2006). Infants' ability to extract verbs from continuous speech. In Proceedings of the Ninth International Conference on Spoken Language Processing (Interspeech 2006 - ICSLP), pp. 1360-1362.
16. Metta, G., Sandini, G., Natale, L., Craighero, L., Fadiga, L. (2006). Understanding mirror neurons: A bio-robotic approach. Interaction Studies, 7:197-232.
17. Natale, L., Metta, G., Sandini, G. (2006). Sviluppo sensomotorio in un robot antropomorfo [Sensorimotor development in an anthropomorphic robot]. In Sistemi Intelligenti, La via italiana alla vita artificiale, special issue (1), 95-104.
18. Orabona, F., Metta, G., Sandini, G. (2006). Learning Associative Fields from Natural Images. In the 5th IEEE Computer Society Workshop on Perceptual Organization in Computer Vision, New York City, June 22, 2006, in conjunction with IEEE CVPR.
19. Rizzolatti, G., Craighero, L. (2005). Mirror neuron: a neurological approach to empathy. In J.-P. Changeux, A.R. Damasio, W. Singer and Y. Christen (Eds.), Neurobiology of Human Values.
20. Theuring, C., Gredebäck, G., Hauf, P. (in press). Object processing during a joint gaze following task. European Journal of Developmental Psychology.
21. von Hofsten, C. (in press). Action in Development. Developmental Science.
22. von Hofsten, C., Kochukhova, O., Rosander, K. (in press). Predictive occluder tracking in 4-month-old infants. Developmental Science.


Handbook of Robotics Chapter 64: Neurorobotics: From Vision to Action

Michael Arbib

Computer Science, Neuroscience, and USC Brain Project

University of Southern California

Los Angeles, CA 90089-2520, USA

Giorgio Metta

LIRA-Lab, DIST

University of Genova

Viale Causa, 13, 16145 Genova, Italy

Patrick van der Smagt

DLR/Institute of Robotics and Mechatronics

P.O. Box 1116, 82230 Wessling, Germany

64.1. Introduction
64.2. Neuroethological Inspiration
64.2.1 Optic Flow in Bee & Robot
64.2.2 Visually-Guided Behavior in Frog & Robot
64.2.3 Navigation in Rat & Robot
64.2.4 Schemas and Coordinated Control Programs
64.2.5 Salience and Visual Attention
64.3. The Role of the Cerebellum
64.3.1 The Human Control Loop
64.3.2 Models of Cerebellar Control
64.3.3 Cerebellar Models and Robotics
64.4. The Role of Mirror Systems
64.4.1 Mirror Neurons and the Recognition of Hand Movements
64.4.2 A Bayesian View of the Mirror System
64.4.3 Mirror Neurons and Imitation
64.5. Extroduction
64.6. References


64.1. Introduction

Neurorobotics may be defined as “the design of computational structures for robots inspired by the study of the

nervous systems of humans and other animals”. We note the success of artificial neural networks – networks of

simple computing elements whose connections change with “experience” – as providing a medium for parallel

adaptive computation that has seen application in robot vision systems and controllers, but here we emphasize neural

networks derived from the study of specific neurobiological systems. Neurorobotics has a two-fold aim: creating

better machines which employ the principles of natural neural computation; and using the study of bio-inspired

robots to improve understanding of the functioning of the brain. Chapter 60, Biologically Inspired Robots,

complements our study of “brain design” with work on “body design”, the design of robotic control and actuator

systems based on careful study of the relevant biology.

Walter (1953) described two ‘biologically inspired’ robots, the electromechanical tortoises Machina speculatrix

and M. docilis (though each has wheels rather than legs). M. speculatrix has a steerable photoelectric cell, which

makes it sensitive to light, and an electrical contact, which allows it to respond when it bumps into obstacles. The

photo-receptor rotates until a light of moderate intensity is registered, at which time the organism orients towards the

light and approaches it. However, very bright lights, material obstacles and steep gradients are repellent to the

“tortoise”. The latter stimuli convert the photo-amplifier into an oscillator, which causes alternating movements of

butting and withdrawal, so that the robot pushes small objects out of its way, goes around heavy ones, and avoids

slopes. The “tortoise” has a “hutch”, which contains a bright light. When the machine’s batteries are charged, this

bright light is repellent. When the batteries are low, the light becomes attractive to the machine and the light

continues to exert an attraction until the tortoise enters the hutch, where the machine’s circuitry is temporarily turned

off until the batteries are recharged, at which time the bright hutch light again exerts a negative tropism. The second

robot, M. docilis was produced by grafting onto M. speculatrix a circuit designed to form conditioned reflexes. In

one experiment, Walter connected this circuit to the obstacle-avoiding device in M. speculatrix. Training consisted

of blowing a whistle just before the shell was bumped.

Although Walter’s controllers are simple and not based on neural analysis, they do illustrate the attempt to gain

inspiration from seeking the simplest mechanisms that will yield an interesting class of biologically inspired robot

behaviors, and then showing how different additional mechanisms yield a variety of enriched behaviors. Valentino

Braitenberg’s (1984) book Vehicles is very much in this spirit and has entered the canon of neurorobotics. While

their work provides a historical background for the studies surveyed here, we instead emphasize studies inspired by

the computational neuroscience of the mechanisms serving vision and action in the human and animal brain. We

seek lessons from linking behavior to the analysis of the internal workings of the brain (i) at the relatively high-level

of characterizing the functional roles of specific brain regions (or the functional units of analysis called schemas,

Section 64.2.4), and the behaviors which emerge from the interactions between them, and (ii) at the more detailed

level of models of neural circuitry linked to the data of neuroanatomy and neurophysiology. There are lessons for

neurorobotics to be learned from even finer-scale analysis of the biophysics of individual neurons and the

neurochemistry of synaptic plasticity but these are beyond the scope of this chapter (see Segev & London, 2003, and

Fregnac, 2003, respectively, for entry points into the relevant computational neuroscience).


The plan of this Chapter is as follows: After some selected examples from computational neuroethology, the

computational analysis of neural mechanisms underlying animal behavior, we show how perceptual & motor

schemas and visual attention provide the framework for our action-oriented view of perception, and show the

relevance of the computational neuroscience to robotic implementations (Section 64.2). We then pay particular

attention to two systems of the mammalian brain, the cerebellum and its role in tuning and coordinating actions

(Section 64.3); and the mirror system and its roles in action recognition and imitation (Section 64.4). The

extroduction will then invite readers to explore the many other areas in which neurorobotics offers lessons from

neuroscience to the development of novel robot designs. What follows, then, can be seen as a contribution to the

continuing dialogue between robot behavior and animal and human behavior in which particular emphasis is placed

on the search for the neural underpinnings of vision, visually-guided action, and cerebellar control.

64.2. Neuroethological Inspiration

Biological evolution has yielded a staggering variety of creatures each with brains and bodies adapted to specific

niches. One may thus turn to the neuroethology of specific creatures to gain inspiration for special-purpose robots.

In Section 64.2.1, we will see how researchers have studied bees and flies for inspiration for the design of flying

robots, but have also learned lessons for the visual control of terrestrial robots. In Section 64.2.2, we introduce Rana

computatrix, an evolving model of visuo-motor co-ordination in frog and toad. The name “the frog that computes”

was inspired by Walter’s M. speculatrix and inspired in turn the names of a number of other “species” of

neuroethologically-inspired robots, including Beer’s (1990) computational cockroach Periplaneta computatrix and

Cliff’s (1992) hoverfly Syritta computatrix.

Moreover, we learn not only from the brains of specific creatures but also from comparative analysis of the

brains of diverse creatures, looking for homologous mechanisms as computational variants which may be related to

the different ecological niches of the creatures that contain them. A basic theme of brain evolution is that new

functions often emerge through modulation and coordination of existing structures. In other words, to the extent that

new circuitry may be identified with the new function, it need not be as a module that computes the function

autonomously, but rather as one that can deploy prior resources to achieve the novel functionality. Section 64.2.3

will introduce the role of the rat brain in navigation, while Section 64.2.4 will look at the general framework of

perceptual schemas, motor schemas, and coordinated control programs for a high-level view of the neuroscience and

neurorobotics of vision and action. Finally, Section 64.2.5 will look at the control of visual attention in mammals as

a homologue of orienting behavior in frog and toad. All this sets the stage for our emphasis on the roles of the

cerebellum (Section 64.3) and mirror systems (Section 64.4) in the brains of mammals and their implications for

neurorobotics. We stress that the choice of these two systems is conditioned by our own expertise, and that studies

of many other brain systems also hold great importance for neurorobotics.

64.2.1 Optic Flow in Bee & Robot

Before we turn to vertebrate brains for much of our inspiration for neurorobotics, we briefly sample the rich

literature on insect-inspired research. Among the founding studies in computational neuroethology were a series of

reports from the laboratory of Werner Reichardt in Tübingen which linked the delicate anatomy of the fly’s brain to

the extraction of visual data needed for flight control. More than 40 years ago, Reichardt (1961) published a model

of motion detection inspired by this work that has long been central in discussions of visual motion in both the


neuroscience and robotics literatures. Borst & Dickinson (2003) provide a recent study of continuing biological

research on visual course control in flies. Such work has inspired a large number of robot studies, including those of

van der Smagt & Groen (1997), Liu & Usseglio-Viretta (2001), Ruffier et al. (2003), and Reiser & Dickinson (2003).

Here, however, we look in a little more detail at honeybees. Srinivasan, Zhang & Chahl (2001) continued the

tradition of studying image motion cues in insects by investigating how optic flow (the flow of pattern across the eye

induced by motion relative to the environment) is exploited by honeybees to guide locomotion and navigation. They

analyzed how bees perform a smooth landing on a flat surface: image velocity is held constant as the surface is

approached, thus automatically ensuring that flight speed is close to zero at touchdown. This obviates any need for

explicit knowledge of flight speed or height above the ground. This landing strategy was then implemented in a

robotic gantry to test its applicability to autonomous airborne vehicles. Barron & Srinivasan (2006) investigated the

extent to which ground speed is affected by headwinds. Honey bees were trained to enter a tunnel to forage at a

sucrose feeder placed at its far end (Fig. 1 left). The bees used visual cues to maintain their ground speed by

adjusting their air speed to maintain a constant rate of optic flow, even against headwinds which were, at their

strongest, 50% of a bee's maximum recorded forward velocity.
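To make the landing rule concrete, here is a minimal Python sketch (our own illustration, with arbitrary values for the reference flow, glide slope and time step, not parameters from the bee studies): commanding forward speed from the held optic-flow value makes speed fall in proportion to height.

```python
# Minimal sketch (arbitrary parameters) of the landing rule: hold the image
# angular velocity v / h at a reference value w_ref, so the commanded forward
# speed v = w_ref * h shrinks with height and is near zero at touchdown,
# without the bee ever knowing v or h explicitly.

def land(h=2.0, w_ref=1.5, glide_slope=0.15, dt=0.05):
    """Descend at a fixed glide angle while holding image velocity constant."""
    t = 0.0
    while h > 0.01:                 # stop just above the surface
        v = w_ref * h               # forward speed that keeps v / h == w_ref
        h -= glide_slope * v * dt   # sink rate proportional to forward speed
        t += dt
    return t, v

t, v = land()
print(f"touchdown after {t:.1f} s at forward speed {v:.3f} m/s")
```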

Vladusich et al. (2005) studied the effect of adding goal-defining landmarks. Bees were trained to forage in an

optic-flow-rich tunnel with a landmark positioned directly above the feeder. They searched much more accurately

when both odometric and landmark cues were available than when only odometry was available. When the two cue

sources were set in conflict, by shifting the position of the landmark in the tunnel during tests, bees overwhelmingly

used landmark cues rather than odometry. This, together with other such experiments, suggests that bees can make

use of odometric and landmark cues in a more flexible and dynamic way than previously envisaged.

Fig. 1. Observation of the trajectories of honeybees flying in visually textured tunnels has provided insights into how

bees use optic flow cues to regulate flight speed and estimate distance flown, and balance optic flow in the two eyes to


fly safely through narrow gaps. This information has been used to build autonomously navigating robots. Upper right

shows schematic illustration of a honeybee brain, carrying about a million neurons within about one cubic millimeter.

(Image credits: Left: Science Vol. 287, pp. 851-853, 2000; upper right: “Virtual Atlas of the Honeybee Brain”,

http://www.neurobiologie.fu-berlin.de/beebrain/Bee/VRML/SnapshotCosmoall.jpg and lower right: Research School of

Biological Sciences, Australian National University.)

In earlier studies of bees flying down a tunnel, Srinivasan & Zhang (1997) placed different patterns on the left

and right walls. They found that bees balance the image velocities in the left and right visual fields. This strategy

ensures that bees fly down the middle of the tunnel, without bumping into the side walls, enabling them to negotiate

narrow passages or to fly between obstacles. This strategy has been applied to a corridor-following robot (Fig. 1,

bottom right). By holding constant the average image velocity as seen by the two eyes during flight, the bee avoids

potential collisions, slowing down when it flies through a narrow passage. The movement-sensitive mechanisms

underlying these various behaviors differ qualitatively as well as quantitatively from those that mediate the

optomotor response (e.g., turning to track a pattern of moving stripes) that had been the initial target of investigation

of the Reichardt laboratory. The lesson for robot control is that flight appears to be coordinated by a number of

visuomotor systems acting in concert, and the same lesson can apply to a whole range of tasks which must convert

vision to action. Of course, vision is but one of the sensory systems that play a vital role in insect behavior. Webb

(2001) uses her own work on robot design inspired by the auditory control of behavior in crickets to anchor a far-

ranging assessment of the extent to which robotics can offer good models of animal behaviors.
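The centering strategy can likewise be sketched in a few lines of Python. The controller below is illustrative only (a toy one-dimensional corridor with hypothetical gain and speed values), not the corridor-following robot cited above; it simply steers away from the side whose image motion is larger.

```python
# Illustrative sketch: balance the optic flow seen by the two eyes.  Image
# angular velocity on a wall scales roughly as forward speed / lateral
# distance, so steering away from the side with the larger flow drives the
# agent toward the corridor midline.

def centering_step(x, v, width=1.0, gain=0.5, dt=0.01):
    flow_left = v / max(x, 1e-6)            # flow from the left wall
    flow_right = v / max(width - x, 1e-6)   # flow from the right wall
    # larger flow on the left means the left wall is too close: move right
    return x + gain * (flow_left - flow_right) * dt

x = 0.2                                      # start close to the left wall
for _ in range(2000):
    x = centering_step(x, v=0.3)
print(f"lateral position after 2000 steps: {x:.3f} (midline at 0.500)")
```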

64.2.2. Visually-Guided Behavior in Frog & Robot

“What the frog’s eye tells the frog brain” (Lettvin et al., 1959) treated the frog’s visual system from an

ethological perspective, analyzing circuitry in relation to the animal’s ecological niche to show that different cells in

the retina and the visual midbrain region known as the tectum were specialized for detecting predators and prey.

However, in much visually guided behavior, the animal does not respond to a single stimulus, but rather to some

property of the overall configuration. We thus turn to the question “What does the frog’s eye tell the frog?”,

stressing the embodied nervous system or, perhaps equivalently, an action-oriented view of perception. Consider, for

example, the snapping behavior of frogs confronted with one or more fly-like stimuli. Ingle (1968) found that it is

only in a restricted region around the head of a frog that the presence of a fly-like stimulus elicits a snap, that is, the

frog turns so that its midline is pointed at the stimulus and then lunges forward and captures the prey with its tongue.

There is a larger zone in which the frog merely orients towards the target, and beyond that zone the stimulus elicits

no response at all. When confronted with two “flies” within the snapping zone, either of which is vigorous enough

that alone it could elicit a snapping response, the frog exhibits one of three reactions: it snaps at one of the flies; it

does not snap at all; or it snaps in between at the “average fly”. Didday (1976) offered a simple model of this choice

behavior which may be considered as the prototype for a winner-take-all (WTA) model which receives a variety of

inputs and (under ideal circumstances) suppresses the representation of all but one of them; the one that remains is

the “winner’” which will play the decisive role in further processing. This was the beginning of Rana computatrix

(Arbib, 1987, 1989 for overviews).
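A minimal Python rendering of the winner-take-all idea may help; this is our own simplified iteration, not Didday's original equations, and the inhibition strength and inputs are arbitrary.

```python
# Toy winner-take-all: each unit is inhibited by the total activity of the
# other units until only the strongest "fly" representation survives.

def winner_take_all(inputs, inhibition=0.2, steps=50):
    a = [float(x) for x in inputs]
    for _ in range(steps):
        total = sum(a)
        a = [max(0.0, x - inhibition * (total - x)) for x in a]
    return a

# Three fly-like stimuli of different strengths: only the middle one remains.
print(winner_take_all([0.8, 1.0, 0.6]))
```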

Studies on frog brains and behavior inspired the successful use of potential fields for robot navigation strategies.

Data on the strategies used by frogs to capture prey while avoiding static obstacles (Collett, 1982) grounded the


model by Arbib & House (1987) which linked systems for depth perception to the creation of spatial maps of both

prey and barriers. In one version of their model, they represented the map of prey by a potential field with long-

range attraction and the map of barriers by a potential field with short-range repulsion, and showed that summation

of these fields yielded a field that could guide the frog’s detour around the barrier to catch its prey. Corbacho &

Arbib (1995) later explored a possible role for learning in this behavior. Their model incorporated learning in the

weights between the various potential fields to enable adaptation over trials as observed in the real animals. The

success of the models indicated that frogs use reactive strategies to avoid obstacles while moving to a goal, rather

than employing a planning or cognitive system. Other work (e.g. Cobas and Arbib, 1992) studied how the frog’s

ability to catch prey and avoid obstacles was integrated with its ability to escape from predators. These models

stressed the interaction of the tectum with a variety of other brain regions such as the pretectum (for detecting

predators) and the tegmentum (for implementing motor commands for approach or avoidance).
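The potential-field combination can be illustrated with a short Python sketch. The field shapes and numbers below are ours (Gaussian profiles over candidate headings), not the Arbib & House formulation; the point is only that summing a broad attractive prey field with a narrow repulsive barrier field shifts the peak, and hence the chosen heading, away from the barrier.

```python
import numpy as np

# Prey contribute a broad attractive field over candidate headings, a barrier
# contributes a narrow repulsive field, and the chosen heading is the peak of
# their sum (all shapes and amplitudes here are illustrative).

headings = np.linspace(-90.0, 90.0, 181)      # candidate directions in degrees

def field(center, width, amplitude):
    return amplitude * np.exp(-0.5 * ((headings - center) / width) ** 2)

prey_attraction   = field(center=0.0,  width=40.0, amplitude=1.0)    # long-range
barrier_repulsion = -field(center=10.0, width=8.0,  amplitude=1.5)   # short-range
combined = prey_attraction + barrier_repulsion

chosen = headings[int(np.argmax(combined))]
print(f"prey dead ahead, barrier at +10 deg -> chosen heading {chosen:.0f} deg")
```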

Arkin (1989) showed how to combine a computer vision system with a frog-inspired potential field controller to

create a control system for a mobile robot that could successfully navigate in a fairly structured environment using

camera input. The resultant system thus enriched other roughly contemporaneous applications of potential fields in

path planning with obstacle avoidance for both manipulators and mobile robots (Khatib, 1986; Krogh & Thorpe,

1986). The work on Rana computatrix proceeded at two levels – in terms of biologically realistic neural networks, and in

terms of functional units called schemas which compete and cooperate to determine behavior. Section 64.2.4 will

show how more general behaviors can emerge from the competition and cooperation of perceptual and motor

schemas, as well as more abstract coordinating schemas. Such ideas were, of course, developed independently by

a number of authors, and so entered the robotics literature by various routes, of which the best known may be the

subsumption architecture of Brooks (1986) and the ideas of Braitenberg cited above; while Arkin’s work on

behavior-based robotics (Arkin 1998) is indeed rooted in schema theory. Arkin et al. (2003) present a recent

example of the continuing interaction between robotics and ethology, offering a novel method for creating high-

fidelity models of animal behavior for use in robotic systems based on a behavioral systems approach (i.e. based on

a schema-level model of animal behavior, rather than analysis of biological circuits in animal brains), and describe

how an ethological model of a domestic dog can be implemented with AIBO, the Sony entertainment robot.

64.2.3. Navigation in Rat & Robot

The tectum, the midbrain visual system which determines how the frog turns its whole body towards its prey or

orients it for escape from predators (Section 64.2.2), is homologous with the superior colliculus of the mammalian

midbrain. The rat superior colliculus has been shown to be “frog-like”, mediating approach and avoidance (Dean, et

al., 1989), whereas the best-studied role of the superior colliculus of cat, monkey and human is in the control of

saccades, rapid eye movements to acquire a visual target. Moreover, the superior colliculus can integrate auditory

and somatosensory information into its visual frame (Stein & Meredith, 1993) and this inspired Strosslin et al.

(2002) to use a biologically inspired approach based on the properties of neurons in the superior colliculus to learn

the relation between visual and tactile information in control of a mobile robot platform. More generally, then, the

comparative study of mammalian brains has yielded a rich variety of computational models of importance in

neurorobotics. In this section, we further introduce the study of “mammalian neurorobotics” by looking at studies of

mechanisms of the rat brain for spatial navigation.


The frog’s detour behavior is an example of what O'Keefe & Nadel (1978) called the taxon (behavioral

orientation) system (as in Braitenberg, 1965, a taxis [plural taxes] is an organism's response to a stimulus by

movement in a particular direction). They distinguished this from a system for map-based navigation, and proposed

that the latter resides in the hippocampus, though Guazzelli et al. (1998) qualified this assertion, showing how the

hippocampus may function as part of a cognitive map. The taxon versus map distinction is akin to the distinction

between reactive and deliberative control in robotics (Arkin, 2003). It will be useful to relate taxis to the notion of an

affordance (Gibson, 1966), a feature of an object or environment relevant to action. For example, in picking up an

apple or a ball, the identity of the object may be irrelevant, but the size of the object is crucial. Similarly, if we wish

to push a toy car, recognizing the make of car copied in the toy is irrelevant, whereas it is crucial to recognize the

placement of the wheels to extract the direction in which the car can be readily pushed. Just as a rat may have basic

taxes for approaching food or avoiding a bright light, say, so does it have a wider repertoire of affordances for

possible actions associated with the immediate sensing of its environment. Such affordances include "go straight

ahead" for visual sighting of a corridor, "hide" for a dark hole; "eat" for food as sensed generically; "drink"

similarly; and the various turns afforded by e.g., the sight of the end of the corridor. It also makes rich use of

olfactory cues. In the same way, a robot’s behavior will rely on a host of reactions to local conditions in fulfilling a

plan – e.g., knowing it must go to the end of a corridor, it will nonetheless use local visual cues to avoid hitting

obstacles, or to determine through what angle to turn when reaching a bend in the corridor.


Fig. 2. The TAM-WG model has at its basis a system, TAM, for exploiting affordances. This is elaborated by a system,

WG, which can use a cognitive map to plan paths to targets which are not currently visible. Note that the model

processes two different kinds of sensory inputs. At bottom right are those associated with, e.g., hypothalamic systems

for feeding and drinking, and may provide both incentives and rewards for the animal’s behavior, contributing both to

behavioral choices, and to the reinforcement of certain patterns of behavior. The nucleus accumbens and caudo-


putamen mediate an actor-critic style of reinforcement learning based on hypothalamic drive of the dopamine system.

The sensory inputs at top left are those that allow the animal to sense its relation with the external world, determining

both where it is (the hippocampal place system) as well as affordances for action (the parietal recognition of

affordances can shape the premotor selection of an action). The TAM model focuses on the parietal-premotor reaction

to immediate affordances; the WG (World Graph) model places action selection within the wider context of a cognitive

map. (Adapted from Guazzelli et al., 1998.)

Both normal and hippocampal-lesioned rats can learn to solve a simple T-maze (e.g., learning whether to turn left

or right to find food) in the absence of any consistent environmental cues other than the T-shape of the maze. If

anything, the lesioned animals learn this problem faster than normals. After criterion was reached, probe trials with

an 8-arm radial maze were interspersed with the usual T-trials. Animals from both groups consistently chose the side

to which they were trained on the T-maze. However, many did not choose the 90° arm but preferred either the 45° or

135° arm, suggesting that the rats eventually solved the T-maze by learning to rotate within an egocentric orientation

system at the choice point through approximately 90°. This leads to the hypothesis of an orientation vector being

stored in the animal's brain but does not tell us where or how the orientation vector is stored. One possible model

would employ coarse coding in a linear array of cells, coding for turns from -180° to +180°. From the behavior, one

might expect that only the cells close to the preferred behavioral direction are excited, and that learning "marches"

this peak from the old to the new preferred direction. To "unlearn" -90°, say, the array must reduce the peak there,

while at the same time "building" a new peak at the new direction of +90°. If the old peak has "mass" p(t) and the

new peak has "mass" q(t), then as p(t) declines toward 0 while q(t) increases steadily from 0, the center of mass will

progress from -90° to +90°, fitting the behavioral data.
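The center-of-mass argument is easy to verify numerically; the following few lines of Python (a worked illustration with linear schedules for p(t) and q(t), which are our own choice) show the encoded direction sweeping smoothly from -90° to +90° as the two masses trade off.

```python
# As the "mass" p(t) of the old peak at -90 deg decays and the mass q(t) of
# the new peak at +90 deg grows, the population's centre of mass sweeps
# smoothly between the two directions.

def centre_of_mass(p, q, old=-90.0, new=+90.0):
    return (p * old + q * new) / (p + q)

for t in range(11):
    p = 1.0 - t / 10.0        # old peak shrinking toward 0
    q = t / 10.0              # new peak growing from 0
    print(f"t={t:2d}   p={p:.1f}  q={q:.1f}   centre of mass = {centre_of_mass(p, q):+7.1f} deg")
```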

The determination of movement direction was modeled by "rattification" of the Arbib & House (1987) model of

frog detour behavior. There, prey were represented by excitation coarsely coded across a population, while barriers

were encoded by inhibition whose extent closely matched the retinotopic extent of each barrier. The sum of

excitation was passed through a winner-take-all circuit to yield the choice of movement direction. As a result, the

direction of the gap closest to the prey, rather than the direction of the prey itself, was often chosen for the frog's

initial movement. The same model serves for behavioral orientation once we replace the direction of the prey (frog)

by the direction of the orientation vector (rat), while the barriers correspond to the presence of walls rather than

alleyways.

To approach the issue of how a cognitive map can extend the capability of the affordance system, Guazzelli et al.

extended the Lieblich & Arbib (1982) approach to building a cognitive map as a world graph, a set of nodes

connected by a set of edges, where the nodes represent recognized places or situations, and the links represent ways

of moving from one situation to another. A crucial notion is that a place encountered in different circumstances may

be represented by multiple nodes, but that these nodes may be merged when the similarity between these

circumstances is recognized. They model the process whereby the animal decides where to move next, on the basis

of its current drive state (hunger, thirst, fear, etc.). The emphasis is on spatial maps for guiding locomotion into

regions not necessarily currently visible, rather than retinotopic representations of immediately visible space, and

yields exploration and latent learning without the introduction of an explicit exploratory drive. The model shows (i)

how a route, possibly of many steps, may be chosen that leads to the desired goal; (ii) how short cuts may be chosen,


and (iii), through its account of node-merging, why, in open fields, place cell firing does not seem to depend on

direction.
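A toy world graph can make the idea concrete. The Python sketch below is our own illustration (the places, moves and drives are invented, and the search is a plain breadth-first search rather than the Lieblich & Arbib mechanism); it shows a multi-step route being selected to the nearest node expected to satisfy the current drive.

```python
from collections import deque

# Nodes are recognized places, edges are moves between them, and route
# selection searches for the closest place expected to satisfy the drive.

world = {
    "nest":     {"corridor": "go straight"},
    "corridor": {"nest": "turn back", "junction": "go straight"},
    "junction": {"corridor": "turn back", "water": "turn left", "food": "turn right"},
    "water":    {"junction": "turn back"},
    "food":     {"junction": "turn back"},
}
satisfies = {"hunger": {"food"}, "thirst": {"water"}}

def plan_route(start, drive):
    """Breadth-first search for the nearest place satisfying the drive."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        place, path = frontier.popleft()
        if place in satisfies[drive]:
            return path
        for nxt, move in world[place].items():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [move]))
    return None

print(plan_route("nest", "hunger"))   # ['go straight', 'go straight', 'turn right']
```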

The overall structure and general mode of operation of the complete model is shown in Fig. 2, which gives a

vivid sense of the lessons to be learned by studying not only specific systems of the mammalian brain but also their

patterns of large-scale interaction. This model is but one of many inspired by the data on the role of the

hippocampus and other regions in rat navigation. Here, we just mention as pointers to the wider literature the papers

by Girard et al. (2005) and Meyer et al. (2005) which are part of the Psikharpax project, which is doing for rats what

Rana computatrix did for frogs and toads.

64.2.4 Schemas and Coordinated Control Programs

Schema theory complements neuroscience's well-established terminology for levels of structural analysis (brain

region, neuron, synapse) with a functional vocabulary, a framework for analysis of behavior with no necessary

commitment to hypotheses on the localization of each schema (unit of functional analysis), but which can be linked

to a structural analysis whenever appropriate. Schemas provide a high-level vocabulary which can be shared by

brain theorists, cognitive scientists, connectionists, ethologists, kinesiologists – and roboticists. In particular, schema

theory can provide a distributed programming environment for robotics (see, e.g., the RS [Robot Schemas]

language of Lyons & Arbib, 1989, and supporting architectures for distributed control as in Metta et al., 2006a).

Schema theory becomes specifically relevant to neurorobotics when the schemas are inspired by a model

constrained by data provided by, e.g., human brain mapping, studies of the effects of brain lesions, or

neurophysiology.

A perceptual schema not only determines whether an object or other "domain of interaction" is present in the

environment but can also provide important parameters describing it. The activity level of a perceptual schema signals the credibility of the

hypothesis that what the schema represents is indeed present; whereas other schema parameters represent relevant

properties such as size, location, and motion of the perceived object. Given a perceptual schema, we may need

several schema instances, each suitably tuned, to subserve perception of several instances of its domain, e.g., several

chairs in a room.
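As a concrete, if simplified, reading of this, a perceptual schema instance can be represented as a small record carrying its activity level and its parameters; the Python sketch below is our own illustration (field names and threshold are hypothetical), not an established schema-theory API.

```python
from dataclasses import dataclass

# Each instance of a perceptual schema carries an activity level (confidence
# that its domain is present) plus parameters of that particular instance,
# e.g. one instance per chair currently seen in the room.

@dataclass
class PerceptualSchemaInstance:
    schema: str          # e.g. "chair"
    activity: float      # confidence that this instance is really present
    size: float          # parameters passed on to motor schemas
    location: tuple

chairs = [
    PerceptualSchemaInstance("chair", activity=0.9, size=0.45, location=(1.0, 2.0)),
    PerceptualSchemaInstance("chair", activity=0.4, size=0.50, location=(3.5, 0.8)),
]
# Downstream processing might only trust instances above a confidence threshold.
reliable = [c for c in chairs if c.activity > 0.5]
print(len(reliable), "chair instance(s) confidently perceived")
```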

Motor schemas provide the control systems which can be coordinated to effect a wide variety of actions. The

activity level of a motor schema instance may signal its "degree of readiness" to control some course of action. What

distinguishes schema theory from usual control theory is the transition from emphasizing a few basic controllers

(e.g., for locomotion or arm movement) to a large variety of motor schemas for diverse skills (peeling an apple,

climbing a tree, changing a light bulb), with each motor schema depending on perceptual schemas to supply

information about objects which are targets for interaction. Note the relevance of this for robotics – the robot needs

to know not only what the object is but also how to interact with it. Modern neuroscience (Ungerleider & Mishkin,

1982; Goodale & Milner, 1992) has indeed established that the monkey and human brain each use a dorsal pathway

(via the parietal lobe) for the “how” and a ventral pathway (via inferotemporal cortex) for the “what”.

A coordinated control program interweaves the activation of various perceptual, motor, and coordinating

schemas in accordance with the current task and sensory environment to mediate complex behaviors. Fig. 3 shows

the original coordinated control program (Arbib 1981, inspired by data of Jeannerod & Biguer, 1982). As the hand

moves to grasp a ball, it is preshaped so that when it has almost reached the ball, it is of the right shape and


orientation to enclose some part of the ball prior to gripping it firmly. The outputs of three perceptual schemas are

available for the concurrent activation of two motor schemas, one controlling the arm to transport the hand towards

the object and the other preshaping the hand. Once the hand is preshaped, it is only the completion of the fast initial

phase of hand transport that "wakes up" the final stage of the grasping schema to shape the fingers under control of

tactile feedback. (This model anticipates the much later discovery of perceptual schemas for grasping in a localized

area [AIP] of parietal cortex and motor schemas for grasping in a localized area [F5] of premotor cortex. See Fig. 4.)

The notion of schema is thus recursive – a schema may be analyzed as a coordinated control program of finer

schemas, and so on until such time as a secure foundation of neural specificity is attained.
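A toy scheduling sketch may help to fix ideas. The Python below is our own illustration of the timing relations in Fig. 3 (the step counts are arbitrary): transport and preshape proceed concurrently, and the final enclose stage is activated only once the fast transport phase has completed with the hand already preshaped.

```python
# Arm transport and hand preshape run concurrently; the "enclose" stage is
# activated only when the fast transport phase is done and the hand is
# already preshaped (the activation signal of Fig. 3).

def reach_and_grasp(size, location, fast_phase_steps=6, preshape_steps=4):
    enclose_started = None
    for step in range(1, 12):
        transport_done = step >= fast_phase_steps   # fast phase of arm transport
        preshape_done = step >= preshape_steps      # hand preshaped to object size
        if transport_done and preshape_done and enclose_started is None:
            enclose_started = step                  # activation signal "wakes up" enclose
    return (f"preshape done at step {preshape_steps}, fast transport at "
            f"{fast_phase_steps}; enclose (tactile control) started at step "
            f"{enclose_started} on object of size {size} at {location}")

print(reach_and_grasp(size=0.06, location=(0.4, 0.1, 0.0)))
```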


Fig. 3. Hypothetical coordinated control program for reaching and grasping. The perceptual schemas (at the top)

provide parameters for the motor schemas (at the bottom) for the control of "reaching" (arm transport ≈ hand reaching)

and "grasping" (controlling the hand to conform to the object). Dashed lines indicate activation signals which establish

timing relations between schemas; solid lines indicate transfer of data. (Adapted from Arbib 1981.)

Subsequent work has refined the scheme of Fig. 3. For example, Hoff & Arbib’s (1993) model uses “time needed

for completion” of each of the movements – transporting the hand and preshaping the hand – to explain data on how

the reach to grasp responds to perturbation of target location or size. Moreover, Hoff and Arbib (1992) show how to

embed an optimality principle for arm trajectories into a controller which can use feedback to resist noise and

compensate for target perturbations, and a predictor element to compensate for delays from the periphery. The result

is a feedback system which can act like a feedforward system described by the optimality principle in "familiar"

situations, where the conditions of the desired behavior are not perturbed and accuracy requirements are such that

"normal" errors in execution may be ignored. However, when perturbations must be corrected for or when great

precision is required, feedback plays a crucial role in keeping the behavior close to that desired, taking account of


delays in putting feedback into effect. This integrated view of feedback and feedforward within a single motor

schema seems to us of value for neurorobotics as well as the neuroscience of motor control.

It is standard to distinguish a forward or direct model which represents the path from motor command to motor

output, from the inverse model which models the reverse pathway – i.e., going from a desired motor outcome to a

set of motor commands likely to achieve it. As we have just suggested, the action plan unfolds as if it were

feedforward or open-loop when the actual parameters of the situation match the stored parameters, while a feedback

component is employed to counteract disturbances (current feedback) and to learn from mistakes (learning from

feedback). This is obtained by relying on a forward model that predicts the outcome of the action as it unfolds in

real-time. The accuracy of the forward model can be evaluated by comparing the output generated by the system

with the signals derived from sensory feedback (Miall et al., 1993). Also, delays must be accounted for to address

the different propagation times of the neural pathways carrying the predicted and actual outcome of the action. Note

that the forward model in this case is relatively simple, predicting only the motor output in advance: since motor

commands are generated internally it is easy to imagine a predictor for these signals (known as an efference copy).

The inverse model, on the other hand, is much more complicated since it maps sensory feedback (e.g. vision) back

into motor terms. These concepts will prove important both in our study of the cerebellum (Section 64.3) and mirror

systems (Section 64.4).
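The efference-copy arrangement can be sketched as follows; this is a deliberately simple illustration (a linear one-parameter plant model and a fixed delay queue, both our own choices), not a model from the literature. Each motor command generates a prediction which is queued until the corresponding sensory feedback arrives, and the mismatch is the correction and learning signal.

```python
from collections import deque

# A forward model driven by an efference copy of each motor command; its
# predictions are queued for the loop delay and compared with the actual
# sensory feedback when it finally arrives.

class ForwardModel:
    def __init__(self, gain=1.0, delay_steps=3):
        self.gain = gain
        self.pending = deque([0.0] * delay_steps)   # predictions awaiting feedback

    def predict(self, motor_command):
        prediction = self.gain * motor_command       # simple linear plant model
        self.pending.append(prediction)
        return prediction

    def error(self, delayed_feedback):
        """Compare the oldest outstanding prediction with the actual feedback."""
        return delayed_feedback - self.pending.popleft()

fm = ForwardModel(gain=0.9, delay_steps=3)
for command, feedback in [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 1.0)]:
    fm.predict(command)
    print("prediction error:", round(fm.error(feedback), 2))
```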

64.2.5 Salience and Visual Attention

Discussions of how an animal (or robot) grasps an object assume that the animal or robot is attending to the

relevant object. Thus, whatever the subtlety of processing in the canonical and mirror systems for grasping, its

success rests on the availability of a visual system coupled to an oculomotor control system that bring foveal vision

to bear on objects to set parameters needed for successful interaction. Indeed, directing attention appropriately is a

topic for which there is a great richness of both neurophysiological data and robotic application. In this

neuromorphic model of the bottom-up guidance of attention in primates (Itti & Koch, 2000), the input video stream

is decomposed into eight feature channels at six spatial scales. After surround suppression, only a sparse number of

locations remain active in each map, and all maps are combined into a unique saliency map. This map is scanned by

the focus of attention in order of decreasing saliency through the interaction between a winner-take-all mechanism

(which selects the most salient location) and an inhibition-of-return mechanism (which transiently suppresses

recently attended locations from the saliency map). Because it includes a detailed low-level vision front-end, the

model has been applied not only to laboratory stimuli, but also to a wide variety of natural scenes, predicting a

wealth of data from psychophysical experiments.
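A few lines of Python illustrate the winner-take-all scan with inhibition of return; the saliency values and the suppression factor below are invented for the example, and this is not the Itti & Koch implementation.

```python
import numpy as np

# Repeatedly attend to the most salient location, then suppress it so that
# attention moves on to the next-most-salient one.

saliency = np.array([
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.2],
    [0.1, 0.6, 0.4],
])

def scan(saliency_map, fixations=3, ior_factor=0.1):
    s = saliency_map.copy()
    for _ in range(fixations):
        y, x = np.unravel_index(np.argmax(s), s.shape)   # winner-take-all
        print(f"attend ({y}, {x}) with salience {s[y, x]:.1f}")
        s[y, x] *= ior_factor          # inhibition of return: suppress the winner
    return s

scan(saliency)
```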

When specific objects are searched for, low-level visual processing can be biased not only by the gist (e.g.,

“outdoor suburban scene”) but also by the features of that object. This top-down modulation of bottom-up

processing results in an ability to guide search towards targets of interest (Wolfe, 1994). Task affects eye

movements (Yarbus, 1967), as do training and general expertise. Navalpakkam & Itti (2005) propose a

computational model which emphasizes four aspects that are important in biological vision: determining task-

relevance of an entity, biasing attention for the low-level visual features of desired targets, recognizing these targets

using the same low-level features, and incrementally building a visual map of task-relevance at every scene location.

It attends to the most salient location in the scene, and attempts to recognize the attended object through hierarchical


matching against object representations stored in long-term memory. It updates its working memory with the task-

relevance of the recognized entity and updates a topographic task-relevance map with the location and relevance of

the recognized entity. For example, in one task the model forms a map of likely locations of cars from a video clip

filmed while driving on a highway. Such work illustrates the continuing interaction between models based on visual

neurophysiology and human psychophysics with the tackling of practical robotic applications.

Orabona, Metta, Sandini (2005) implemented an extension of the Itti-Koch model on a humanoid robot with

moving eyes, using log-polar vision as in Sandini and Tagliasco (1980), and changing the feature construction

pyramid by considering proto-object elements (blob-like structures rather than edges). The inhibition of return

mechanism has to take into account a moving frame of reference, the resolution of the fovea is very different from

that at the periphery of the visual field, and head and body movements need to be stabilized. The control of

movement might thus have a relationship with the structure and development of the attention system. Rizzolatti et al.

(1987) proposed a role for the feedback projections from premotor cortex to the parietal lobe, assuming that they

form a tuning signal that dynamically changes visual perception. In practice this can be seen as an implicit attention

system which “selects” sensory information while the action is being prepared and subsequently executed. The early

responses, before action onset, of many premotor and parietal neurons suggest a premotor mechanism of attention

that deserves exploration in further work in neurorobotics.

64.3. The Role of the Cerebellum

Although cerebellar involvement in muscle control was advocated long ago by the Greek physician Galen of Pergamum

(129-216/17 CE), surgeon to the gladiators, it was the publication by Eccles, Ito and Szentágothai (1967) of the first

comprehensive account of the detailed neurophysiology and anatomy of the cerebellum (Ito, 2006) that provided the

inspiration for the Marr-Albus model of cerebellar plasticity (Marr, 1969; Albus, 1971) that is at the heart of most

current modeling of the role of the cerebellum in control of motion and sensing. From a robotics point of view, the

most convincing results are based on Albus’ (1975) Cerebellar Model Articulation Controller (CMAC) model and

subsequent implementations by Miller (e.g., 1994). These models, however, are only remotely based on the structure

of the biological cerebellum. More detailed models are usually only applied to 2-degree-of-freedom robotic

structures, and have not been generalized to real-world applications (Peters & van der Smagt, 2002). The problem

may lie with viewing the cerebellum as a stand-alone dynamics controller. An important observation about the brain

is that schemas are widely distributed, and different aspects of the schemas are computed in different parts of the

brain. Thus, one view is that (i) cerebral cortex has the necessary models for choosing appropriate actions and

getting the general shape of the trajectory assembled to fit the present context, whereas (ii) the cerebellum provides a

side-path which (on the basis of extensive learning of a forward motor model) provides the appropriate corrections

to compensate for control delays, muscle nonlinearities, Coriolis and centrifugal forces occasioned by joint

interactions, and subtle adjustments of motor neuron firing in simultaneously active motor pattern generators to

ensure their smooth coordination. Thus, for example, a patient with cerebellar lesions may be able to move his arm

to successfully reach a target, and to successfully adjust his hand to the size of an object. However, he lacks the

machinery to perform either action both swiftly and accurately, and further lacks the ability to coordinate the timing

of the two subactions. His behavior will thus exhibit “decomposition of movement” – he may first move the hand till

the thumb touches the object, and only then shape the hand appropriately to grasp the object. Thus analysis of how


various components of cerebral cortex interact to support forward and inverse models which determine the “overall

shape of the behavior” must be complemented by analysis of how the cerebellum handles control delays and

nonlinearities to transform a well-articulated plan into graceful, coordinated action. Within this perspective,

cerebellar structure and function will be very helpful in the control of a new class of highly antagonistic robotic

systems as well as in adaptive control.

64.3.1 The Human Control Loop

Lesions and deficits of the cerebellum impair the coordination and timing of movements while introducing

excessive, undesired motion, effects which cannot be compensated for by the cerebral cortex. According to mainstream

models, the cerebellum filters descending motor cortex commands to cope with timing issues and communication

delays of up to 50 ms one-way for arm control. Clearly, closed-loop control with such delays is not viable in

any reasonable setting, unless augmented with an open-loop component, predicting the behavior of the actuator

system. This is where the cerebellum comes into its own. The complexity of the vertebrate musculoskeletal system,

clearly demonstrated by the human arm using a total of 19 muscle groups for planar motion of the elbow and

shoulder alone (Nijhof & Kouwenhoven, 2002), requires a control mechanism able to cope with this complexity,

especially in a setting with long control delays. One cause for this complexity is that animal muscles come in

antagonistic pairs (e.g., flexing versus extending a joint). Antagonistic control of muscle groups leads to energy-

optimal (Damsgaard, Rasmussen, & Christensen, 2000) and intrinsically flexible systems. Contact with stiff or fast-

moving objects requires such flexibility to prevent breakage. In contrast, classical (industrial) robots are stiff, with

limb segments controlled by linear or rotary motors with gear boxes. Even so, most laboratory robotic systems have

passively stiff joints, with active joint flexibility obtainable only by using fast control loops and joint torque

measurement. Although it may be debatable whether such robotic systems require cerebellar-based controllers, the

steady move of robotics towards complete anthropomorphism by mimicking human (hand and arm) kinematics as

well as dynamics as closely as possible requires the search for alternative, neuromorphic control solutions.

Figure 4. Simplified control loop relating cerebellum and cerebral motor cortex in supervising the spinal cord’s control

of the skeleto-muscular system.

Vertebrate motor control involves the cerebral motor cortex, basal ganglia, thalamus, cerebellum, brain stem and

spinal cord. Motor programs, originating in the cortex, are fed into the cerebellum. Combining these with sensory

information arriving through the spinal cord, the cerebellum sends motor commands out to the muscles via the brain stem and

spinal cord, which control muscle length and joint stiffness (Bullock & Contreras-Vidal, 1993). The full control loop is depicted

in Fig. 4 (see Schaal & Schweighofer, 2005, for an overview of robotic vs. brain control loops). The model in Fig. 4


clearly resembles the well-known computed torque model and, when the cerebellum is interpreted as a Smith predictor,

it serves to cope with long delays in the control loop (Miall et al., 1993; van der Smagt & Hirzinger, 2000). It is thus

understood to incorporate a forward model of the skeletomuscular system. Alternative approaches use the

cerebellum as an inverse model (e.g., Ebadzadeh et al, 2005), which however leads to increased complexity and

control loop stability problems.
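The Smith-predictor reading can be sketched in a few lines; the following Python is our own toy example (a first-order plant, a five-step feedback delay and a proportional controller, all hypothetical), not a cerebellar model. Because the internal forward model supplies an undelayed state estimate, the controller remains stable despite the delayed feedback, which is used only to correct model error.

```python
from collections import deque

# Control on the forward model's undelayed prediction; the delayed real
# feedback only corrects model error (zero here, since the model is perfect).

def smith_predictor_step(target, state_estimate, delayed_output, delayed_estimate,
                         kp=0.5):
    model_error = delayed_output - delayed_estimate   # real vs. predicted (both delayed)
    return kp * (target - (state_estimate + model_error))

# Simple first-order plant y' = u, with a 5-step delay on outgoing feedback.
delay, dt, y, y_hat = 5, 0.05, 0.0, 0.0
buf_y, buf_yhat = deque([0.0] * delay), deque([0.0] * delay)
for _ in range(200):
    u = smith_predictor_step(target=1.0, state_estimate=y_hat,
                             delayed_output=buf_y[0], delayed_estimate=buf_yhat[0])
    y += u * dt                  # real plant (what the muscles actually do)
    y_hat += u * dt              # internal forward model prediction
    buf_y.append(y); buf_y.popleft()
    buf_yhat.append(y_hat); buf_yhat.popleft()
print(f"output after 10 s of simulated time: {y:.3f} (target 1.0)")
```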

64.3.2 Models of Cerebellar Control

The cerebellum can be divided into two parts: the cortex and the deep nuclei. There are two systems of fibers

bringing input to both the cortex and the nuclei: the mossy fibers and the climbing fibers. The only output from the

cerebellar cortex comes from cells called Purkinje cells, and they project only to the cerebellar nuclei, where their

effect is inhibitory. This inhibition sculpts the output of the nuclei which (the effect varies from nucleus to nucleus)

may act by modulating activity in the spinal cord, the mid-brain or the cerebral cortex. We now turn to models

which make explicit use of the cellular structure of the cerebellar cortex (Eccles et al., 1967; Ito, 1984; see Fig. 5a).

The human cerebellum has 7-14 million Purkinje cells (PCs), each receiving about 200,000 synapses. Mossy Fibers

(MFs) arise from the spinal cord and brainstem. They synapse onto granule cells and deep cerebellar nuclei. Granule

cells have axons which each project up to form a T with the bars of the Ts forming the Parallel Fibers (PFs). Each

PF synapses on about 200 PCs. The PCs, which are grouped into microzones, inhibit the deep nuclei. PCs with their

target cells in cerebellar nuclei are grouped together in microcomplexes (Ito, 1984). Microcomplexes are defined by

a variety of criteria to serve as the units of analysis of cerebellar influence on specific types of motor activity. The

Climbing Fibers (CF) arise from the inferior olive. Each PC receives synapses from only one CF, but a CF makes

about 300 excitatory synapses on each PC which it contacts. This powerful input alone is enough to fire the PC,

though most PC firing depends on subtle patterns of PF activity. The cerebellar cortex also contains a variety of

inhibitory interneurons. The Basket Cell is activated by PF afferents and makes inhibitory synapses onto PCs. Golgi

Cells receive input from PFs, MFs, and CFs and inhibit granule cells.


Figure 5. (a) Major cells in the cerebellum. (b) Cells in the Marr-Albus model. The granule cells are state encoders,

feeding system state and sensor data into the PC. PC/PF synapses are adjusted using the Widrow-Hoff rule. The outputs

of the PC are steering signals for the robotic system. (c) The APG model, using the same state encoder as (b). (d) The

MPFIM model. A single module corresponds to a group of Purkinje cells: predictor, controller, and responsibility

estimator. The granule cells generate the necessary basis functions of the original information.

The Marr-Albus Model: In the Marr-Albus model (Marr, 1969; Albus, 1971) the cerebellum functions as a

classifier of sensory and motor patterns received through the MFs. Only a small fraction of the parallel fibers (PF)

are active when a Purkinje cell (PC) fires and thus influence the motor neurons. Both Marr and Albus hypothesized

that the error signals for improving PC firing in response to PF, and thus MF input, were provided by the climbing

fibers (CF), since only one CF affects a given PC. However, Marr hypothesized that CF activity would strengthen

the active PF/PC synapses using a Widrow-Hoff learning rule, whereas Albus hypothesized they would weaken

them. This is an important example of a case where computational modeling inspired important experimentation.

Eventually, Masao Ito was able to demonstrate that Albus was correct – the weakening of active synapses is now

known to involve a process called long-term depression (Ito, 1984). However, the rule with weakening of synapses

is still known as the Marr-Albus model, and remains the reference model for studies of synaptic plasticity of cerebellar

cortex. However, both Marr and Albus viewed each PC as functioning as a perceptron whose job it was to control an

elemental movement, contrasting with more plausible models in which PCs serve to modulate the involvement of

microcomplexes (which include cells of the deep nuclei) in motor pattern generators (e.g., the APG model described

below).
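The learning rule at the heart of this debate is compactly expressed in code. The Python sketch below is a Widrow-Hoff-style (delta-rule) rendering written by us, not Marr's or Albus' formulation: the climbing-fibre error changes only those synapses whose parallel fibres were active, and a positive error weakens them, as in LTD.

```python
import numpy as np

# The Purkinje cell is modelled as a linear unit on parallel-fibre input; the
# climbing-fibre error signal changes only the synapses whose parallel fibres
# were active (a positive error weakens them, LTD-like).

rng = np.random.default_rng(0)
n_pf = 20
w = rng.uniform(0.4, 0.6, n_pf)               # PF -> PC synaptic weights
pf = np.zeros(n_pf)
pf[[2, 5, 11, 17]] = 1.0                      # one sparse parallel-fibre pattern
target = 1.0                                  # desired PC output for this pattern

for _ in range(50):
    pc_output = float(w @ pf)
    cf_error = pc_output - target             # climbing-fibre "teaching" signal
    w -= 0.05 * cf_error * pf                 # only active PF synapses change

print(round(float(w @ pf), 3))                # PC output driven toward the target
```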

Since the development of the Marr-Albus model, several cerebellar models have been introduced in which

cerebellar plasticity plays a key role. Limiting our overview to computational models, we will describe (1) the

Cerebellar Model Articulation Controller CMAC, (2) the Adjustable Pattern Generator APG, (3) the Schweighofer-

Arbib model, and (4) the Multiple Paired Forward-Inverse Models (van der Smagt, 1998, 2000).

The Cerebellar Model Articulation Controller CMAC: One of the first well-known computational models of

the cerebellum is the Cerebellar Model Articulation Controller CMAC (Albus, 1975; see Fig. 5b). The algorithm was

based on Albus’ understanding of the cerebellum, but it was not proposed as a biologically plausible model. The

idea has its origins in the BOXES approach, in which for n variables an n-dimensional hypercube stores function


values in a lookup-table. BOXES suffers from the curse of dimensionality: if each variable can be discretized into D different steps, the hypercube has to store D^n function values in memory. Albus assumed that the mossy fibers

provided discretized function values. If the signal on a mossy fiber is in the receptive field of a particular granule

cell, it fires onto a parallel fiber. This mapping of inputs onto binary output variables is often considered to be the

generalization mechanism in CMAC. The learning signals are provided by the climbing fibers.

Albus’ CMAC can be described in terms of a large set of overlapping, multi-dimensional receptive fields with

finite boundaries. Every input vector falls within the range of some local receptive fields. The response of CMAC to

a given input is determined by the average of the responses of the receptive fields excited by that input. Similarly,

the training for a given input vector affects only the parameters of the excited receptive fields.

The organization of the receptive fields of a typical Albus CMAC with a two-dimensional input space can be

described as follows: The set of overlapping receptive fields is divided into C subsets, commonly referred to as

“layers”. Any input vector excites one receptive field from each layer, for a total of C excited receptive fields for

any input. The overlap of the receptive fields produces input generalization, while the o��set of the adjacent layers of

receptive fields produces input quantization. The ratio of the width of each receptive field (input generalization) to

the offset between adjacent layers of receptive fields (input quantization) must be equal to C for all dimensions of

the input space. This organization of the receptive fields guarantees that only a fixed number, C, of receptive fields

is excited by any input.

If a receptive field is excited, its response equals the magnitude of a single adjustable weight specific to that

receptive field. The CMAC output is the average of the weights of the excited receptive fields. If nearby points in

the input space excite the same receptive fields, they produce the same output value. The output only changes when

the input crosses one of the receptive field boundaries. The Albus CMAC thus produces piece-wise constant outputs.

Learning takes place as described above.
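To make the receptive-field arithmetic concrete, here is a minimal Python sketch of an Albus-style CMAC for a one-dimensional input: C offset layers of receptive fields, exactly one excited field per layer, the output taken as the average of the excited weights, and training touching only those weights. The class name, sizes and learning rate are illustrative assumptions, not values from Albus' papers.

import numpy as np

class TinyCMAC:
    # A minimal Albus-style CMAC for a scalar input in [0, 1): C overlapping
    # "layers" of receptive fields, offset from one another; each input excites
    # exactly one field per layer, the output is the average of the excited
    # weights, and training nudges only those weights, giving local
    # generalization and piece-wise constant outputs.
    def __init__(self, n_layers=8, cells_per_layer=32):
        self.C = n_layers
        self.n = cells_per_layer
        self.w = np.zeros((n_layers, cells_per_layer))

    def _excited(self, x):
        offsets = np.arange(self.C) / self.C   # each layer shifted by a fraction of a cell
        return (np.floor(x * self.n + offsets) % self.n).astype(int)

    def predict(self, x):
        return self.w[np.arange(self.C), self._excited(x)].mean()

    def train(self, x, target, lr=0.3):
        idx = self._excited(x)
        self.w[np.arange(self.C), idx] += lr * (target - self.predict(x))

# Fit y = sin(2*pi*x) from random samples; the prediction at x = 0.25
# should end up close to 1.0.
rng = np.random.default_rng(1)
net = TinyCMAC()
for _ in range(5000):
    x = rng.random()
    net.train(x, np.sin(2 * np.pi * x))
print(round(net.predict(0.25), 2))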

CMAC neural networks have been applied in various control situations (see Miller, 1994, for a review), starting

from adaptation of PID control parameters for an industrial robot arm and hand-eye-systems up to biped walking

(Sabourin & Bruneau, 2005).

The Adjustable Pattern Generator APG: The APG model (Houk et al., 1996) got its name because the model

can generate a burst command with adjustable intensity and duration. The APG is based on the same understanding

of the mossy fiber-granule cell-parallel fiber structure as CMAC, using the same state encoder, but differs crucially (Fig. 5c) in the role it assigns to the deep nuclei. In the APG model, each nucleus cell is connected to a motor

cell in a feedback circuit. Activity in the loop is then modulated by Purkinje cell inhibition, a modeling idea

introduced by Arbib et al. (1974).

The learning algorithm determines which of the PF-PC synapses will be updated in order to improve movement

generation performance. This is the traditional credit assignment problem: which synapse must be updated (structural credit assignment), and on the basis of a response issued at what time (temporal credit assignment). The former is solved by the CFs, which are treated as binary signals; for the latter, eligibility traces on the synapses are introduced, serving as a memory of recent activity that determines which synapses are eligible for updates. The motivation for the eligibility signal is this: each firing of a PC will take some time to affect the animal’s movement, and an even


further delay will occur before the CF can signal an error in the movement in which the PC is involved. Thus the

error signal should not affect those PF-PC synapses which are currently active, but should instead act upon those synapses which affected the activity whose error is now being registered – and this is exactly what the eligibility trace is timed to signal.
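A minimal Python sketch of this eligibility-trace bookkeeping (the exponential trace, decay constant and learning rate are assumptions for illustration, not the APG model's actual equations):

import numpy as np

def eligibility_step(trace, pf_activity, decay=0.9):
    # Low-pass filter of recent PF activity at each PF-PC synapse: the trace
    # remembers which synapses were active a short while ago.
    return decay * trace + pf_activity

def delayed_error_update(w, trace, cf_error, lr=0.01):
    # The climbing-fiber error, arriving after a delay, acts on the synapses
    # flagged by their eligibility traces rather than on those active now.
    return w - lr * cf_error * trace

# Toy timeline: PFs fire at t = 0, the CF error only arrives at t = 5,
# by which time the trace still "remembers" the earlier activity.
w = np.zeros(4)
trace = np.zeros(4)
for t in range(6):
    pf = np.array([1.0, 0.0, 1.0, 0.0]) if t == 0 else np.zeros(4)
    trace = eligibility_step(trace, pf)
    if t == 5:
        w = delayed_error_update(w, trace, cf_error=1.0)
print(w)   # only the synapses that were active at t = 0 are (slightly) depressed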

The APG has been applied in a few control situations, e.g., a single muscle-mass system and a simulated two-link

robot arm. Unfortunately these applications do not allow us to judge the performance of the APG scheme itself, because the control task was hidden within the spinal cord and muscle models.

The Schweighofer-Arbib Model: The Schweighofer-Arbib model was introduced in Schweighofer (1995). It

does not use the CMAC state encoder but tries to copy the anatomy of the cerebellum. All cells, fibers and axons in

Fig. 5a are included. Several assumptions are made: (1) there are two types of mossy fibers, one type reflecting the

desired state of the controlled plant and another which carries information on the current state. A mossy fiber

diverges into approximately 16 branches. (2) Granule cells have an average of four dendrites, each of which receives input from different mossy fibers through a synaptic structure called the glomerulus. (3) Three Golgi cells synapse

on a granule cell through the glomerulus and the strength of their influence depends on the simulated geometric

distance between the glomerulus and the Golgi cell. (4) The climbing fiber connection on nuclear cells as well as

deep nuclei is neglected.

Learning in this model depends on directed error information given by the climbing fibers from the inferior olive

(IO). Here, long-term depression occurs when the IO firing rate provides an error signal for an eligible synapse, while compensatory but slower increases in synaptic strength can occur when no error signal is present.

Schweighofer applied the model to explain several acknowledged cerebellar system functions: (1) saccadic eye

movements, (2) two-link limb movement control (Schweighofer et al., 1998a,b), and (3) prism adaptation (Arbib et

al., 1995). Furthermore, control of a simulated human arm was demonstrated.

Multiple Paired Forward-Inverse Models (MPFIM): Building on a long history of cerebellar modeling by

Kawato, Wolpert and Kawato (1998) proposed a novel functional model of the cerebellum which uses multiple

coupled predictors and controllers which are trained for control, each being responsible for a small state space

region. The MPFIM model is based on the indirect/direct model approach by Kawato, as well as on the

microcomplex theory. We noted earlier that a microzone is a group of PCs, while a microcomplex combines the PCs

of a microzone with their target cells in cerebellar nuclei. In MPFIM, a microzone consists of a set of modules

controlling the same degree of freedom and is trained by one particular climbing fiber. The modules in this

microzone compete to control this particular synergy. Inside such a module there are three types of PC which

perform the computations of a forward model, an inverse model or a responsibility predictor, but all receiving the

same input. A single internal model i is considered to be a controller which generates a motor command τ_i and a

predictor which predicts the current acceleration. Each predictor is a forward model of the controlled system, while

each controller contains an inverse model of the system in a region of specialization. The responsibility signal

weights the contribution which this model will make to the overall output of the microzone. Indeed, MPFIM further

assumes that each microzone contains n internal models of situations occurring in the control task. Model i generates a motor command τ_i and estimates its own responsibility r_i. The feedforward motor command τ_ff is the combination of the individual model outputs weighted by their normalized responsibilities: τ_ff = Σ_i r_i τ_i / Σ_i r_i.
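A one-line Python rendering of this combination rule (the numbers in the example are illustrative):

import numpy as np

def mpfim_feedforward(tau, r):
    # Responsibility-weighted mix of the modules' motor commands:
    # tau_ff = sum_i r_i * tau_i / sum_i r_i
    tau = np.asarray(tau, dtype=float)
    r = np.asarray(r, dtype=float)
    return float((r * tau).sum() / r.sum())

# Three modules proposing different commands; the middle one carries almost
# all of the responsibility, so the output (about 0.89) is dominated by it.
print(mpfim_feedforward([0.2, 1.0, -0.5], [0.05, 0.9, 0.05]))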


The PCs are considered to be roughly linear. The MF inputs carry all necessary information including state

information, efference copies of the last motor commands, as well as desired states. Granule cells, and possibly the

inhibitory interneurons as well, nonlinearly transform the state information to provide a rich set of basis functions

through the PFs. A climbing fiber carries a scalar error signal while each Purkinje cell encodes a scalar output –

responsibilities, predictions and controller outputs are all one-dimensional values. MPFIM has been introduced with

different learning methods: its first implementations were done using gradient descent methods; subsequently, EM

batch-learning and hidden Markov chain EM learning have been applied.

Comparison of the Models: Summing up, we can categorize the cerebellar models CMAC, APG,

Schweighofer-Arbib, and MPFIM as follows:

(a) State-encoder driven models: This kind of model assumes that the granule cells are on-off entities which split up the state space. Such models are best suited for, e.g., simple function approximation, and suffer

strongly from the curse of dimensionality.

(b) Cellular-level models: Obviously, the most realistic simulations would be at the cellular level. Unfortunately,

modeling only a few Purkinje cells at realistic conditions is an immense computational challenge, and other relevant

neurons are even less well understood. Still, from the biological point of view this kind of model is the most

important since it allows insight into cerebellar function at the cellular level. The first steps in this direction were taken by the Schweighofer-Arbib model.

(c) Functional models: From the computer science point of view, the most interesting models are based on

functional understanding of the cells. In this case, we obtain only a basic insight into the functions of the parts and apply it as a crude approximation. This kind of approach is very promising, and MPFIM, with its emphasis on the use of responsibility signals to combine models appropriately, provides an interesting example.

64.3.3 Cerebellar Models and Robotics
From the previous discussions, it is clear that a popular view is that the function of the cerebellum within the

motor control loop is to represent a forward model of the skeletomuscular system. As such it predicts the movements

of the body, or rather the perceptually coded (e.g. through muscle spindles, skin-based positional information, and

visual feedback) representation of the movements of the body. With this prediction a fast control loop between

motor cortex and cerebellum can be realized, and motor programs are played before being sent to the spinal cord (cf.

Fig. 4). Proprioceptive feedback is used for adaptation of the motor programs as well as for updating the forward

model stored in the cerebellum. However, the Schweighofer-Arbib model is based on the view that the cerebellum

offers not so much a total forward model of the skeletomuscular system as a forward model of the

difference between the crude model of the skeletomuscular system available to the motor planning circuits of the

cerebral cortex, and the more intricately parameterized forward model of the skeletomuscular system needed to

support fast, graceful movements with minimal use of feedback. This hypothesis is reinforced by the fact that cerebellar lesions do not abolish movement but substantially reduce its quality, since the forward model of the skeletomuscular system is degraded.
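The benefit of such a forward model is easy to demonstrate in a few lines of Python. The sketch below, under purely illustrative assumptions (a first-order linear plant, a proportional controller, and a perfect internal copy of the plant), compares feedback through a six-step sensory delay with control based on the forward model's prediction of the current state; none of this is the specific machinery of the models cited above.

def run(delay, use_forward_model, T=80, a=0.9, b=0.2, k=2.0, target=1.0):
    # Toy first-order plant x[t+1] = a*x[t] + b*u[t], controlled either from a
    # measurement that arrives `delay` steps late or from a forward-model
    # prediction of the current state (here an exact internal copy of the plant).
    xs = [0.0]           # true plant states
    pred = 0.0           # forward-model estimate of the current state
    for t in range(T):
        if use_forward_model:
            seen = pred
        else:
            seen = xs[t - delay] if t >= delay else 0.0   # delayed measurement
        u = k * (target - seen)                           # proportional controller
        xs.append(a * xs[-1] + b * u)                     # true plant
        pred = a * pred + b * u                           # internal model
    return max(abs(target - x) for x in xs[-20:])         # late-phase error

# With the same gain, feedback through the delay rings badly, while acting on
# the prediction settles smoothly (leaving only the usual proportional offset).
print(run(delay=6, use_forward_model=False))
print(run(delay=6, use_forward_model=True))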

As robotic systems move towards their biological counterparts, the control approaches can or must do the same.

There are many lines of research investigating the former part; cf. Chapter 13 “Flexible Arms” and Chapter 60


“Biologically-Inspired Robots”. It should be noted that the drive principle used to move the joints does not

necessarily have a major impact on the outer control loop. Whether McKibben muscles, which are intrinsically

flexible but bulky (van der Smagt et al, 1996), low-dynamics polymer linear actuators, or dc motors with spindles

and added elastic components are used does not affect the control approach at the cerebellar level, but rather at the

motor control level (cf. spinal cord level). Of key importance, however, are the resulting dynamical properties of the

system, which are of course influenced by its actuators.

Passive flexibility at the joints, which is a key feature of muscle systems, is essential for reasons of safety,

stability during contact with the environment, and storage of kinetic energy. As mentioned before, however,

biological systems are immensely complex, requiring large groups of muscles for comparatively simple movements.

A reason for this complexity is the resulting nearly linear behavior, which has been noted for, e.g., muscle activation

with respect to joint stiffness (Osu & Gomi, 1999). By this regularization of the complexity of the skeletomuscular

system, the complexity of the forward model stored in the cerebellum is correspondingly reduced. The whole picture

therefore seems to be that the cerebellum, controlling a piecewise linear skeletomuscular system, incorporates a

forward model thereof to cope with delays in the peripheral nervous system. Consequently, although the

applicability of cerebellar systems to highly nonlinear dynamics control of traditional robots is questionable, the use

of cerebellar systems as forward models appears to be useful in the control of more complex and flexible robotic

systems. The control challenge posed by the currently emerging generation of robots employing antagonistic motor

control therefore opens a new wealth of applications of cerebellar systems.

64.4. The Role of Mirror Systems
Mirror neurons were first discovered in the brain of the macaque monkey – neurons that fire both when the

monkey exhibits a particular grasping action, and when the monkey observes another (monkey or human) perform a

similar grasp (Rizzolatti et al. 1995; Gallese et al., 1996). Since then, human studies have revealed a mirror system

for grasping in the human brain – a mirror system for a class X of actions being a set of regions that are active both

when the human performs some action from X and when the human observes someone else performing an action

from X (see, e.g., Grafton et al., 1996; Rizzolatti et al., 1996; Fadiga, et al., 1999). We have no single neuron studies

proving the reasonable hypothesis that the human mirror system for grasping contains mirror neurons for specific

actions, but most models of the human mirror system for grasping assume that it contains circuitry analogous to the

mirror neuron circuitry of the macaque brain. However, the current consensus is that monkeys have little or no

ability for imitation, whereas the human mirror system plays a key role in our capability for imitation (Iacoboni, et

al., 1999), pantomime, and even language (Arbib & Rizzolatti, 1997; Arbib, 2005). Indeed, it has been suggested

that mirror neurons underlie the motor theory of speech perception of Alvin Liberman et al. (1967) which holds that

speech perception rests on the ability to recognize the motor acts that produce speech sounds.

Section 64.4.1 reviews basic neurophysiological data on mirror neurons in the macaque, and presents both the

FARS model of canonical neurons (unlike mirror neurons, these are active when the monkey executes an action but

not when he observes it) and the MNS model of mirror neurons and their supporting brain regions. Section 64.4.2

then uses Bayes’ rule to offer a new, probabilistic view of the mirror system’s role in action recognition, and

demonstrates the operation of the new model in the context of studies with two robots. Finally, Section 64.4.3


briefly shifts the emphasis of our study of mirror neurons to imitation, which in fact is the area that has most

captured the imagination of roboticists.
64.4.1 Mirror Neurons and the Recognition of Hand Movements

Area F5 in premotor cortex of the macaque contains, among others, neurons which fire when the monkey

executes a specific manual action: e.g., one neuron might fire when the monkey performs a precision pinch, another

when it executes a power grasp. (In discussing neurorobotics, it seems unnecessary to explain in any detail the areas

like F5, AIP and STS described here – they will function as labels for components of functional systems. To fill in

the missing details see, e.g., Rizzolatti et al., 1998, 2001.) A subset of these neurons, the so-called mirror neurons,

also discharge when the monkey observes meaningful hand movements made by the experimenter which are similar

to those whose execution is associated with the firing of the neuron. By contrast, the canonical neurons are those

belonging to the complementary, anatomically segregated subset of grasp-related F5 neurons which fire when the

monkey performs a specific action and also when it sees an object as a possible target of such an action – but do not

fire when the monkey sees another monkey or human perform the action. Finally, F5 contains a large population of

motor neurons which are active when the monkey grasps an object (either with the hand or mouth) but do not

possess any visual response. F5 is clearly a motor area although the details of the muscular activation are abstracted

out – F5 neurons can be effector-independent. By contrast, primary motor cortex (F1) formulates the neural

instructions for lower motor areas and motor neurons.

Moreover, macaque mirror neurons encode transitive actions: they do not fire when the monkey sees the hand movement unless it can also see the object or, more subtly, unless the object, though currently hidden, is appropriately “located” in working memory because it has recently been placed on a surface and then obscured behind a screen

behind which the experimenter is seen to be reaching (Umiltà et al., 2001). All mirror neurons show visual

generalization. They fire when the instrument of the observed action (usually a hand) is large or small, far from or

close to the monkey. They may also fire even when the action instrument has shapes as different as those of a human

or monkey hand. Some neurons respond even when the object is grasped by the mouth. When naive monkeys first

see small objects grasped with a pair of pliers, mirror neurons do not respond, but after extensive training some

precision pinch mirror neurons do show activity also to this new grasp type (Ferrari et al., 2005).

Mirror neurons for grasping have also been found in parietal areas of the macaque brain and, recently, it has been

shown that parietal mirror neurons are sensitive to the context of the observed action, predicting its outcome as a function of contextual cues – e.g., some grasp-related parietal mirror neurons may fire for a grasp that

precedes eating the grasped object while others fire for a grasp that precedes placing the object in a container (see

Fogassi, et al., 2005). In practice the parieto-frontal circuitry seems to encode action execution and simultaneously

action recognition by taking into account a large set of potential candidate actions which are selected on the basis of

a range of cues such as vision of the relation of the effector to the object and certain sounds (when relevant for the

task). Further, feedback connections (frontal to parietal) are thought to be part of a stimulus selection process which

refines the sensory processing by “attending” to stimuli relevant for the ongoing action (Rizzolatti et al., 1987;

recall discussion in Section 64.2.5). Recognition is then supported by the activation of the same circuitry in the

absence of overt movement.


Fig. 6. The original FARS diagram (Fagg & Arbib, 1998) is here modified to show PFC acting on AIP rather than F5.

The idea is that prefrontal cortex uses the IT identification of the object, in concert with task analysis and working

memory, to help AIP select the appropriate affordance from its “menu”.

We clarify these ideas by briefly presenting the FARS model of the canonical F5 neurons and the MNS model of

the F5 mirror neurons. In each case, the F5 neurons function effectively only because of the interaction of F5 with a

wide range of other regions. We have stressed (Section 64.2.3) the distinction between recognition of the category of

an object and recognition of its affordances. The parietal area AIP processes visual information to extract

affordances, in this case properties of the object relevant to grasping it (Taira et al., 1990). AIP and F5 are

reciprocally connected with AIP being more visual and F5 more motoric.

The FARS (Fagg-Arbib-Rizzolatti-Sakata) model (Fagg and Arbib 1998; Fig. 6) embeds F5 canonical neurons in

a larger system. The dorsal stream (which passes through AIP) can only analyze the object as a set of possible

affordances; whereas the ventral stream (via inferotemporal cortex, IT) is able to recognize what the object is. The

latter information is passed to prefrontal cortex (PFC) which can then, on the basis of the current goals of the

organism, bias the choice of affordances appropriate to the task at hand. Neuroanatomical data (as analyzed by

Rizzolatti and Luppino, 2003) suggest that PFC and IT may modulate action selection at the level of parietal cortex.

Fig. 6 gives a partial view of FARS updated to show this modified pathway. The affordance selected by AIP

activates F5 neurons to command the appropriate grip once F5 receives a “go signal” from another region, F6, of

prefrontal cortex. F5 also accepts signals from other PFC areas to respond to working memory and instruction

stimuli in choosing among the available affordances. Note that this same pathway could be implicated in tool use,

bringing in semantic knowledge as well as perceptual attributes to guide the dorsal system (Johnson-Frey, 2004).

With this, we turn to the mirror system. Since grasping a complex object requires careful attention to the motion of, e.g., the fingertips relative to the object, we hold that the primary evolutionary impetus for the mirror system was to

facilitate feedback control of dexterous movement. We now show how parameters relevant to such feedback could

be crucial in enabling the monkey to associate the visual appearance of what it is doing with the task at hand. The

key “side effect” will be that this feedback-serving self-recognition is so structured as to also support recognition of


the action when performed by others – and it is this “recognition of the actions of others” that has created the

greatest interest in mirror neurons and systems.

Figure 7. The MNS (Mirror Neuron System) model (Oztop and Arbib, 2002). Note that this basic mirror system for

grasping crucially links the visual processing of the superior temporal sulcus (STS) to the parietal regions (7b) and

premotor regions (F5) which have been shown to contain mirror neurons for manual actions.

The MNS model of Oztop & Arbib (2002) provides some insight into the anatomy while focusing on the learning

capacities of mirror neurons. Here, the task is to determine whether the shape of the hand and its trajectory are "on

track" to grasp an observed affordance of an object using a known action. The model is organized around the idea

that the AIP → F5canonical pathway emphasized in the FARS model (Fig. 6) is complemented by another pathway

7b → F5mirror. As shown in Fig. 7 (middle diagonal), object features are processed by AIP to extract grasp affordances; these are sent on to the canonical neurons of F5, which choose a particular grasp. Recognizing the location of the object (top diagonal) provides parameters to the motor programming area F4, which

computes the reach. The information about the reach and the grasp is taken by the motor cortex M1 (=F1) to control

the hand and the arm. The rest of the figure provides components that can learn and apply key criteria for activating

a mirror neuron, recognizing that the preshape of the observed hand corresponds to the grasp that the mirror neuron

encodes and is appropriate to the object, and that the hand is moving on an appropriate trajectory. Making crucial

use of input from the superior temporal sulcus (STSa; Perrett et al., 1990; Carey et al., 1997), schemas at bottom left

recognize the shape of the observed hand, and how that hand is moving. Other schemas implement hand-object

spatial relation analysis and check how object affordances relate to hand state. Together with F5 canonical neurons,

this last schema (in parietal area 7b) provides the input to the F5 mirror neurons.

In the MNS model, the hand state was defined as a vector whose components represented the movement of the

wrist relative to the location of the object and of the hand shape relative to the affordances of the object. Oztop and

Arbib showed that an artificial neural network corresponding to PF and F5mirror could be trained to recognize the

grasp type from the hand state trajectory, with correct classification often being achieved well before the hand


reached the object, using the activity of the F5 canonical neurons that command a grasp as the training signal for

recognizing it visually. Crucially, this training prepares the F5 mirror neurons to respond to hand-object relational

trajectories even when the hand is that of the “other” rather than the “self”, because the hand state is based on the view of

movement of a hand relative to the object, and thus only indirectly on the retinal input of seeing hand and object

which can differ greatly between observation of self and other. Bonaiuto et al. (2006) have developed MNS2, a new

version of the MNS model to address data on audio-visual mirror neurons that respond to the sight and sound of

actions with characteristic sounds such as paper tearing and nut cracking (Kohler et al., 2002), and on response of

mirror neurons when the target object was recently visible but is currently hidden (Umiltà et al., 2001). Such

learning models, and the data they address, make clear that mirror neurons are not restricted to recognition of an

innate set of actions but can be recruited to recognize and encode an expanding repertoire of novel actions.

The discussion of this section avoided any reference to imitation (see Section 64.4.3). On the other hand, even without considering imitation, mirror neurons provide a new perspective for tackling the problem of robotic perception by incorporating action (and motor information) into a plausible recognition process. The role of the fronto-parietal system in relating affordances, plans and actions shows the crucial contribution of motor information and embodiment. We argue that this holds lessons for neurorobotics: the richness of the “motor” system should strongly influence what the robot can learn, proceeding autonomously via a process of exploration of the environment rather

than relying overmuch on the intermediary of logic-like formalisms. When recognition exploits the ability to act,

then the breadth of the action space becomes crucially related to the precision, quality, and robustness of the robot’s

perception.

64.4.2 A Bayesian View of the Mirror System
We now show how to cast much that is known about the mirror system into a controller-predictor model (Miall et

al, 1993; Wolpert et al., 2001) and analyze the system in Bayesian terms. As shown by the FARS model, the

decision to initiate a particular grasping action is attained by the convergence in area F5 of several factors including

contextual and object-related information; similarly many factors affect the recognition of an action. All this

depends on learning both direct (from decision to executed action) and inverse models (from observation of an

action to activation of a motor command that could yield it). Similar procedures are well known in the

computational motor control literature (Jordan & Rumelhart, 1992; Kawato et al., 1987). Learning of the affordances

of objects with respect to grasping can also be achieved autonomously by learning from the consequences of

applying many different actions to different parts of different objects.

But how is the decision made to classify an observed behavior as an instance of one action or another? Many

comparisons could be performed in parallel with the model for one action becoming predominantly activated. There

are plausible implementations of this mechanism using a gating network (Demiris & Johnson, 2003; Haruno et al.,

2001). A gating network learns to partition an input space into regions; for each region a different model can be

applied or a set of models can be combined through an appropriate weight function. The design of the gating

network can encourage collaboration between models (e.g. linear combination of models) or competition (choosing

only one model rather than a combination). Oztop et al. (2005) offer a similar approach to the estimation of the

mental states of the observed actor, using some additional circuitry involving the frontal cortex.
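A minimal sketch of such a gating computation in Python: a softmax over the models' prediction errors, with an optional winner-take-all mode. The temperature, the toy numbers, and the exponential weighting are illustrative choices, not the specific gating networks of the cited papers.

import numpy as np

def gate(prediction_errors, temperature=1.0, compete=False):
    # Each internal model is weighted by how well its predictor matched the
    # observation: small prediction error -> large weight (softmax). With
    # compete=True only the best model is selected; otherwise they are blended.
    e = np.asarray(prediction_errors, dtype=float)
    w = np.exp(-e / temperature)
    w /= w.sum()
    if compete:
        hard = np.zeros_like(w)
        hard[np.argmax(w)] = 1.0
        return hard
    return w

outputs = np.array([0.2, 1.0, -0.5])   # per-model motor proposals
errors = np.array([2.0, 0.1, 1.5])     # each model's recent prediction error
print(gate(errors) @ outputs)                  # collaborative blend, about 0.65
print(gate(errors, compete=True) @ outputs)    # pure competition -> 1.0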


We now offer a Bayesian approach to the predictor-controller formulation of the mirror system.

This Bayesian approach views affordances as priors in the action recognition process where the evidence is

conveyed by the visual information of the hand, providing the data for finding the posterior probabilities as mirror-neuron-like responses which automatically activate for the most probable observed action. Recall that the presence of a goal (at least in working memory) is needed to elicit mirror neuron responses in the macaque; we believe it is also particularly important during the ontogenesis of the human mirror system. For example, Woodward

(1998) has shown that even at nine months of age, infants recognized an action as being novel if it was directed

toward a novel object rather than just having different kinematics – showing that the goal is more fundamental

than the enacted trajectory. Similarly, if one sees someone drinking from a coffee mug then one can hypothesize that

a particular action (that one already knows in motor terms) is used to obtain that particular effect. The association

between the canonical response (object-action) and the mirror one (including vision) is made when the observed

consequences (or goal) are recognized as similar in the two cases. Similarity can be evaluated following criteria

ranging from kinematic to social consequences.

Many formulations of recognition tasks are available in the literature (Duda, Hart, & Stork, 2001) besides those

keyed to the study of mirror neurons. Here, however, we focus on the Metta et al. (2006b) Bayesian interpretation of

the recognition of actions. We equate the prior probabilities for actions with the object affordances, that is:

p(A_i | O_k) (1)

where A_i is the i-th action from a motor repertoire of I actions, and O_k is the target object of the grasping action out of

a set of K possible objects. The affordances of an object identify the set of actions that are most likely to be executed

upon it, and consequently the mirror activation of F5 can be thought of as:
p(A_i | F, O_k) (2)

where F are the features obtained by observing the temporal evolution of the action.

This probability can be computed from Bayes’ rule as:
p(A_i | F, O_k) = p(F | A_i, O_k) p(A_i | O_k) (3)

(An irrelevant normalization factor has been neglected so that, strictly speaking, the posterior in (3) is no longer a

probability.) With this, a classifier is constructed by taking the maximum over the possible actions:

Â = arg max_i p(A_i | F, O_k) (5)

Following Lopes & Santos-Victor (2005), Metta et al. assume that the features F along the trajectories are

independent. This is clearly not true for a smooth trajectory linking the movement of the hand and fingers to the

observed action. However, this approximation simplifies the estimation of the likelihoods in equation (3), though later implementations should take into account the dependence across time.
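Under these assumptions the whole recognition step, equations (1)-(5), reduces to a naive-Bayes classifier. The Python sketch below illustrates it with a single one-dimensional hand feature and one Gaussian per (action, object) pair; the array shapes, the names, and the use of a single Gaussian rather than a mixture of Gaussians are simplifying assumptions for illustration, not the cited implementation.

import numpy as np

def classify_action(frames, obj, prior, mu, sigma):
    # Naive-Bayes action recognition in the spirit of Eqs. (1)-(5):
    #   prior[k, i]           affordance-based prior p(A_i | O_k)
    #   mu[k, i], sigma[k, i] parameters of a Gaussian p(f | A_i, O_k) for a
    #                         scalar hand feature f; the frames along the
    #                         trajectory are treated as independent.
    # Returns the index of the most probable action (the arg max in Eq. (5)).
    frames = np.asarray(frames, dtype=float)
    log_post = np.log(prior[obj]).copy()
    for i in range(len(log_post)):
        z = (frames - mu[obj, i]) / sigma[obj, i]
        log_post[i] += np.sum(-0.5 * z**2 - np.log(sigma[obj, i]))
    return int(np.argmax(log_post))

# Two actions, one object: action 0 tends to produce features near 0,
# action 1 near 2; an observed trajectory near 2 is classified as action 1.
prior = np.array([[0.7, 0.3]])
mu = np.array([[0.0, 2.0]])
sigma = np.array([[0.5, 0.5]])
print(classify_action([1.9, 2.2, 2.1], obj=0, prior=prior, mu=mu, sigma=sigma))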

The object recognition stage (i.e., finding O_k) requires as much vision as is needed to determine the probability of the various grasp types being effective; the hand features F correspond to the STS response; and the response of the mirror neurons determines the most probable observed action A_i. We can identify certain circuits in the brain

with the quantities described above:

p(A_i | F, O_k) — Mirror neuron responses, obtained by combining the information as in (3).
p(F | A_i, O_k) — The activity of the F5 motor neurons generating certain motor patterns given the selected action and the target object.
p(A_i | O_k) — Object affordances: the response of the circuit linking AIP and the F5 canonical neurons (AIP → F5).
Visuomotor map — Transformation of the hand-related visual information into motor data: identified with the response of STS → PF/PFG → F5, this is represented in F5 by p(F | A_i, O_k).

However, this table does not exhaust the various computations required in the model: In the learning phase, the

visuo-motor map is learned by a sigmoidal feedforward neural network trained with the backpropagation algorithm

(both visual and motor signals are available); affordances are learned simply by counting the number of occurrences

of actions given the object (visual processing of object features was assumed); and the likelihood was approximated

by a mixture of Gaussians and the parameters learned by the expectation maximization (EM) algorithm. During the

recognition phase, motor information is not available, but is recovered from the visuo-motor map. Fig. 8 shows a

block diagram of the operation of the classifier.

Figure 8: Block diagram of the recognition process. Recognition (mirror neuron activation) is due to the

convergence in F5 of two main contributions: signals from the AIP-F5canonical connections and signals from the

STS-PF/PFG-F5 circuit. In this model, activations are thought of as probabilities and combined using Bayes’ rule.

A comparison (Lopes & Santos-Victor, 2005) was made of system performance (a) when using the output of the

inverse model and thus employing motor features to aid classification during the training phase; and (b) when only

visual data were available for classification. Overall, their interpretation of their results is that by mapping in motor

space they allow the classifier to choose features that are much better suited for performing optimally, which in turn

facilitates generalization. The same is not true in visual space, since a given action may be viewed from different

viewpoints. One may compare this to the viewpoint-invariant hand state adopted in the MNS model – which has

the weakness there of being built in rather than emerging from training.

Another set of experiments was performed on a humanoid robot upper-torso called Cog (Brooks et al, 1999). Cog

has a head, arms, and a moving waist for a total of 22 degrees of freedom but does not have hands. It has instead

simple flippers that could be used to push and prod objects. Fitzpatrick & Metta (2003; Metta, et al., 2006b) were


interested in starting from minimal initial hypotheses yet yielding units with responses similar to mirror neurons.

The robot was programmed to identify suitable cues to start the interaction with the environment and direct its

attention towards potential objects, using an attention system similar to that (Itti & Koch, 2000) described in section

64.2.5. Although the robot could not grasp objects, it could generate useful cues from touch, push and prod actions.

The problem of defining objects reflects the problem of segmenting the visual scene. Criteria such as contrast,

binocular disparity and motion can be applied, but any of them can fail in a given situation. In the present system,

optic flow was used to detect and segment an object after impact with the arm end-point, and the segmentation provided shape, color, and behavior data. Depending on the available motor repertoire, the robot could explore a range of possible object behaviors (affordances) and form an object model which combines both the sensory and the motor properties of the objects the robot had the chance to interact with.

In a typical experiment, the human operator waves an object in front of the robot which reacts by looking at it. If

the object is dropped on the table in front of the robot, a reaching action is initiated, and the robot possibly makes

contact with the object. Vision is used during the reaching and touching movement to guide the robot’s flipper

toward the object, to segment the hand from the object upon contact, and to collect information about the behavior

of the object caused by the application of a certain action (Fitzpatrick, 2003). Unfortunately, interaction of the

robot’s flipper with objects does not result in a wide class of different affordances and so this study focused on the

rolling affordances of a toy car, an orange juice bottle, a ball, and a colored toy cube.

Besides reaching, the robot’s motor repertoire consists of four different stereotyped approach movements covering a

range of directions of about 180° around the object. For each successful trial, the robot “stored” the result of the

segmentation, the object’s principal axis (selected as the representative shape parameter), the action – initially

selected randomly from the set of four approach directions – and the movement of the center of mass of the object

for some hundreds of milliseconds after impact with the flipper was detected. This gives information about the

rolling properties (affordances) of the different objects: e.g., the car tends to roll along its principal axis, the bottle at right angles to its axis. Figure 9 shows the result of collecting about 700 samples of generic poking

actions and estimating the average direction of displacement of the object. Note, for example, that the action labeled

as “backslap” (moving the object with the flipper outward from the robot) consistently gives a visual object motion

upward in the image plane (corresponding to the peak at –100°, 0° being the direction parallel to the image x axis;

the y axis pointing downward). A similar consideration applies to the other actions. Although crude, this

implementation shows that with little pre-existing structure the robot could acquire the crucial elements for building

knowledge of objects in terms of their affordances. Given a sufficient level of abstraction, this implementation is

close to the response of canonical neurons in F5 and their interaction with neurons observed in AIP that respond to

object orientation (Sakata et al., 1997).
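The statistics behind Figure 9 are easy to sketch in Python. The function below accumulates, per (object, action) pair, a histogram of the direction in which the object moved after impact, measured relative to the object's visual principal axis; the trial-tuple layout and the toy numbers are assumptions for illustration, not the actual data format of the experiment.

import numpy as np

def affordance_histograms(trials, n_bins=36):
    # trials: iterable of (object_label, action_label, principal_axis_deg,
    # displacement_direction_deg). Returns, per (object, action), a normalized
    # histogram of the displacement direction relative to the principal axis.
    hists = {}
    edges = np.linspace(-180.0, 180.0, n_bins + 1)
    for obj, act, axis_deg, disp_deg in trials:
        rel = (disp_deg - axis_deg + 180.0) % 360.0 - 180.0   # wrap to [-180, 180)
        h = hists.setdefault((obj, act), np.zeros(n_bins))
        h[int(np.searchsorted(edges, rel, side="right")) - 1] += 1
    return {key: h / h.sum() for key, h in hists.items()}

# A toy car that rolls along its principal axis whatever the poke direction:
trials = [("car", "backslap", 30.0, 32.0), ("car", "backslap", 80.0, 83.0),
          ("car", "push", -10.0, -7.0)]
centers = np.linspace(-180.0, 180.0, 37)[:-1] + 5.0
for key, h in affordance_histograms(trials).items():
    print(key, "peak near", centers[np.argmax(h)], "deg")   # ~0 deg: along the axis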


Figure 9. Top panels: the affordances of two objects: a toy car and a bottle represented as the probability of moving

along a given direction measured with respect to the object principal axis (visual). It is evident that the toy car tends to

move mostly along the direction identified with the principal axis and the orange juice bottle at right angles

to its principal axis. Bottom: the probability of a given action generating a certain object movement. The reference

frame is anchored to the image plane in this case.

64.4.3 Mirror Neurons and Imitation
Fitzpatrick & Metta (2003) also addressed the question of what is further required for interpreting observed actions. Whereas in the previous section the robot identified the motion of the object as the result of a specific action applied to it, here it could backtrack and derive the type of action from the observed motion of the object. It can

further explore what is causing motion and learn about the concept of manipulator in a more general setting. In fact,

the same segmentation procedure mentioned earlier could visually interpret poking actions generated by a human as

well as those generated by the robot. One might argue that observation could be exploited for learning about object

affordances. This is possibly true to the extent that passive vision is reliable and action is not required. The advantage of

the active approach, at least for the robot, is that it allows controlling the amount of information impinging on the

visual sensors by, for instance, controlling the speed and type of action. This strategy might be especially useful

given the limitations of artificial perceptual systems. Thus, observations can be converted into interpreted actions.

The action whose effects are closest to the observed consequences on the object (which we might translate into the

goal of the action) is selected as the most plausible interpretation given the observation. Most importantly, the

interpretation reduces to the interpretation of the “simple” kinematics of the goal and consequences of the action

rather than to understanding the “complex” kinematics of the human manipulator. The robot understands only to the

extent it has learned to act. One might note that a more refined model should probably include visual cues from the


appearance of the manipulator into the interpretation process. Indeed, the hand state that was central to the Oztop-

Arbib model was based on an object-centered view of the hand’s trajectory in a coordinate frame based on the

object’s affordances. The last question to address is whether the robot can imitate the “goal” of a poking action. The

step is indeed small since most of the work is actually in interpreting observations. Imitation was generated by replicating the latest observed human movement with respect to the object and irrespective of its orientation. For example, if the experimenter poked the toy car sideways, the robot imitated him/her by

pushing the car sideways. Starting from this simple experiment, we need to formalize what is required for a system

that has to acquire and deliver imitation.
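A toy Python rendering of this interpret-then-imitate step: pick the action whose learned object-relative effect best matches the observed object displacement, then map that effect back through the object's current orientation to obtain the movement to reproduce. The effect table, the angular representation and all numbers are illustrative assumptions, not the robot's actual implementation.

def interpret_and_imitate(observed_disp_deg, object_axis_deg, effect_table):
    # effect_table maps an action label to the mean displacement direction
    # (relative to the object's principal axis) learned from the robot's own
    # pokes. The observed displacement is expressed in the same object-relative
    # frame, the closest learned effect names the action, and that effect is
    # mapped back into the current frame to give the motion to reproduce.
    def wrap(a):
        return (a + 180.0) % 360.0 - 180.0
    observed_rel = wrap(observed_disp_deg - object_axis_deg)
    action = min(effect_table, key=lambda a: abs(wrap(effect_table[a] - observed_rel)))
    goal_direction = (object_axis_deg + effect_table[action]) % 360.0
    return action, goal_direction

effects = {"pull in": -100.0, "backslap": 100.0, "push left": 10.0, "push right": -170.0}
# The object lies with its axis at 30 deg and is seen moving at 35 deg:
print(interpret_and_imitate(35.0, 30.0, effects))   # ('push left', 40.0)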

Roboticists have been fascinated by the discovery of mirror neurons and the purported link to imitation that

exists in the human nervous system. The literature on the topic extends from models of the monkey’s (non-imitative)

action recognition system (Oztop & Arbib, 2002) to models of the putative role of the mirror system in imitation

(Demiris & Johnson, 2003; Arbib et al., 2000), and in real and virtual robots (Schaal, Ijspeert, & Billard, 2003).

Oztop, Kawato, and Arbib (2006) propose a taxonomy of the models of the mirror system for recognition and imitation, and it is interesting to note how different the computational approaches are that have now been framed as mirror

system models, including recurrent neural networks with parametric bias (Tani et al., 2004), behavior-based modular

networks (Demiris & Johnson, 2003), associative memory based methods (Kuniyoshi et al., 2003), and the use of

multiple direct-inverse models as in the MOSAIC architecture (Wolpert et al., 2003; cf. the Multiple Paired

Forward-Inverse Models of Section 64.3.2).

Following the work of Schaal et al. (2003) and Oztop et al. (2006), we can propose a set of schemas required to

produce imitation:

• determine what to imitate, inferring the goal of the demonstrator

• establish a metric for imitation (correspondence; Nehaniv, 2006)

• map between dissimilar bodies (mapping)

• imitative behavior formation
which are also discussed in greater detail by Nehaniv & Dautenhahn (1998). In practice, computational and

robotic implementations have tackled these problems with different approaches, emphasizing different parts or

specific subproblems of the whole. For example, in the work of Demiris and Hayes (2002), the rehearsal of the

various actions (akin to the aforementioned motor theory of perception) was used to generate hypotheses to be

compared with the actual sensory input. It is remarkable that, more recently, a modified version of this paradigm has been used in comparison with real human transcranial magnetic stimulation (TMS) data.

Ito et al. (2006) (not the Masao Ito of cerebellar fame) take a dynamical systems approach using a recurrent

neural network with parametric bias (RNNPB) and teach a humanoid robot to manipulate certain objects. In this

approach the parametric bias (PB) encodes (tags) certain sensorimotor trajectories. Once learning is complete the

neural network can be used either to recall a given trajectory by setting the PB externally, or to provide only the sensory data as input and observe the PB vector, which in that case represents the recognition of the situation on the basis of the sensory input alone (no motor information being available). It is relatively easy to interpret these two situations as motor generation and observation in a mirror-neuron model.


The problem of building useful mappings between dissimilar bodies (consider a human imitating a bird’s

flapping wings) was tackled by Nehaniv and Dautenhahn (1998) where an algebraic framework for imitation is

described and the correspondence problem formally addressed. Any system implementing imitation should clearly provide a mapping between dissimilar bodies, or even between similar bodies when the kinematics or dynamics differ depending on the context of the imitative action.

Sauser & Billard (2006) modeled the ideomotor principle, according to which observing the behavior of others influences our own performance. The ideomotor principle points directly to one of the core issues of the mirror

system, that is, the fact that watching somebody else’s actions changes something in the activation of the observer,

thus facilitating certain neural pathways. The work in question also gives a model implemented in terms of neural fields (see Sauser & Billard, 2006 for details) and tries to explain the imitative cortical pathways and behavior formation.

64.5. Extroduction
As the foregoing makes clear, robotics has much to learn from neuroscience and much to teach neuroscience. Neurorobotics can learn from the ways in which the brains and bodies of different creatures adapt to diverse ecological niches – as computational neuroethology helps us understand how the brain of a creature has evolved to serve “action-oriented perception”, and the attendant processes of learning, memory, planning and social

interaction.

We have sampled the “design” of just a few subsystems (both functional and structural) in just a few animals –

optic flow in the bee, approach, escape and barrier avoidance in frog & toad, and navigation in the rat, as well as the

control of eye movements in visual attention, the role of the mammalian cerebellum in handling the nonlinearities and

time delays of flexible motor systems, and the mirror systems of primates in action recognition and of humans in

imitation. And there are many more creatures with lessons to offer the roboticist than we can sample here.

Moreover, if we just confine attention to the brains of humans, this Chapter has mentioned at least 7a, 7b, AIP,

Area 46, caudo-putamen, cerebellum, cIPS, F2, F4, F5, hippocampus, hypothalamus, inferotemporal cortex, LIP,

MIP, motor cortex, nucleus accumbens, parietal cortex, prefrontal cortex, premotor cortex, pre-SMA (F6), spinal

cord, STS, and VIP – and it is clear that there are many more details to be understood for each region, and many

more regions whose interactions hold lessons for roboticists. We say this not to depress the reader, but rather to

encourage further exploration of the literature of computational neuroscience and to note that the exchange with

neurorobotics proceeds both ways: neuroscience can inspire novel robotic designs; conversely, robots can be used to

test whether brain models still work when they make the transition from disembodied computer simulation to

meeting the challenge of guiding the interactions of a physically embodied system with the complexities of its

environment.

64.6. References
Albus, J. S., 1971, A theory of cerebellar function, Mathematical Biosciences 10: 25-61.

Albus, J. S., 1975, Data Storage in the Cerebellar Model Articulation Controller (CMAC), Journal of Dynamic

Systems, Measurement and Control, ASME, pp. 228-233.


Arbib, M. A., 1981, Perceptual structures and distributed motor control, in Handbook of Physiology – The Nervous

System II. Motor Control (V. B. Brooks, Ed.), Bethesda, MD: American Physiological Society, pp.1449-1480.

Arbib, M. A. (1987). Levels of modeling of visually guided behavior. Behavioral and Brain Sciences, 10:407-465.

Arbib, M. A., 1989, Visuomotor Coordination: Neural Models and Perceptual Robotics, in Visuomotor

Coordination: Amphibians, Comparisons, Models, and Robots, (J. -P. Ewert and M. A. Arbib, Eds.), Plenum

Press, pp. 121-171.

Arbib, M.A. (2005). From Monkey-like Action Recognition to Human Language: An Evolutionary Framework for

Neurolinguistics (with commentaries and author’s response), Behavioral and Brain Sciences, 28, 105-167.

Arbib, M. A., & House, D. H. (1987). Depth and detours: An essay on visually guided behavior. in M.A. Arbib &

A.R. Hanson (Eds.), Vision, Brain and Cooperative Computation, Bradford Books/ The MIT Press, pp.129-163.

Arbib, M., and Rizzolatti, G., 1997, Neural expectations: a possible evolutionary path from manual skills to

language, Communication and Cognition, 29, 393-424.

Arbib, M. A., Billard, A., Iacoboni, M., & Oztop, E. (2000). Synthetic brain imaging: grasping, mirror neurons and

imitation. Neural Networks, 13(8-9), 975-997.

Arbib, M. A., Boylls, C. C., and Dev, P., 1974, Neural Models of Spatial Perception and the Control of Movement,

in Kybernetik und Bionik/Cybernetics R. Oldenbourg, pp. 216-231.

Arbib, M.A., Schweighofer, N., and Thach, W. T., 1995, Modeling the Cerebellum: From Adaptation to

Coordination, in Motor Control and Sensory-Motor Integration: Issues and Directions, (D. J. Glencross and J. P.

Piek, Eds. ), Amsterdam: North-Holland Elsevier Science, pp. 11-36.

Arkin, R. C. (1998). Behavior-Based Robotics, Cambridge, MA: The MIT Press.

Arkin, R. C. (1989). Motor schema-based mobile robot navigation. International Journal of Robotics Research,

8:92-112.

Arkin, R., Fujita, M., Takagi, T., and Hasegawa, R., 2003, An Ethological and Emotional Basis for Human-Robot

Interaction, Robotics and Autonomous Systems, 42(3-4).

Barron A, Srinivasan MV., 2006, Visual regulation of ground speed and headwind compensation in freely flying

honey bees (Apis mellifera L.). J Exp Biol. 209(Pt 5):978-84.

Beer, R. D., 1990, Intelligence as Adaptive Behavior: An Experiment in Computational Neuroethology, Academic

Press.

Bonaiuto, B., Rosta, E., & Arbib, M.A., 2006, Extending the Mirror Neuron System Model, I. Audible Actions and

Invisible Grasps, Biological Cybernetics (in press).

Borst, A., & Dickinson, M., 2003, Visual Course Control in Flies, in The Handbook of Brain Theory and Neural

Networks, Second Edition, (M.A. Arbib, Ed.), Cambridge, MA: Bradford Books/The MIT Press, pp.1205-1210.

Braitenberg, V., 1965, Taxis, kinesis, decussation, Progress in Brain Research, 17:210-222.

Braitenberg, V., 1984, Vehicles: Experiments in Synthetic Psychology, Bradford Books/The MIT Press.

Brooks, R. A., Breazeal, C. L., Marjanović, M., & Scassellati, B. (1999). The COG project: Building a Humanoid

Robot. In C. L. Nehaniv (Ed.), Computation for Metaphor, Analogy and Agents (Vol. 1562, pp. 52-87):

Springer-Verlag.


Bullock, D., & Contreras-Vidal, J. (1993). How spinal neural networks reduce discrepancies between motor

intention and motor realization. In K. Newell & D. Corcos (Eds.), Variability and motor control (pp. 183–221).

Champaign, Illinois: Human Kinetics Press.

Cliff, D., 2002, Neuroethology, Computational, in The Handbook of Brain Theory and Neural Networks, Second

Edition, (M.A. Arbib, Ed.), Cambridge, MA: Bradford Books/The MIT Press.

Cobas, A., and Arbib, M., 1992, Prey-Catching and Predator-Avoidance in Frog and Toad: Defining the Schemas. J.

Theor. Biol, 157:271-304.

Collett, T. (1982). Do toads plan routes? A study of detour behavior of B. viridis. Journal of Comparative

Physiology [A], 146:261-271.

Corbacho, F.J., and Arbib, M.A., 1995, Learning to Detour, Adaptive Behavior, 4:419-468.

Damsgaard, M., Rasmussen, J., & Christensen, S.T. (2000): Numerical simulation and justification of antagonists in

isometric squatting. In: Prendergast, P.J., Lee, T.C., Carr, A.J.: Proceedings of the 12th Conference of the

European Society of Biomechanics, Royal Academy of Medicine in Ireland.

Dean, P., Redgrave, P. & Westby, G. W. M. 1989 Event or emergency? Two response systems in the mammalian

superior colliculus. Trends Neurosci. 12: 138–147.

Demiris, Y., Hayes, G. (2002). Imitation as a dual-route process featuring predictive and learning components: a

biologically-plausible computational model. In K. Dautenhahn and C. Nehaniv (Eds.), Imitation in Animals and

Artifacts. MIT Press.

Demiris, Y., & Johnson, M. H. (2003). Distributed, predictive perception of actions: a biologically inspired robotics

architecture for imitation and learning. Connection Science, 15(4), 231-243.

Didday, R. L., 1976, A model of visuomotor mechanisms in the frog optic tectum, Mathematical Biosciences,

30:169-180.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification (2nd. ed.). New York: John Wiley and Sons

Inc.

Ebadzadeh, M., Tondu, B., and Darlot, C., 2005, Computation of inverse functions in a model of cerebellar and

reflex pathways allow to control a mobile mechanical segment, Neuroscience 133: 29-49.

Eccles, J. C., Ito, M., and Szentágothai, J., 1967, The Cerebellum as a Neuronal Machine, Springer-Verlag.

Fadiga, L., Buccino, G., Craighero, L., Fogassi, L., Gallese, V., & Pavesi, G. (1999). Corticospinal excitability is

specifically modulated by motor imagery: a magnetic stimulation study. Neuropsychologia, 37, 147-158.

Fagg, A. H., and Arbib, M. A., 1998, Modeling Parietal-Premotor Interactions in Primate Control of Grasping,

Neural Networks, 11:1277-1303.

Ferrari, P. F., Rozzi, S., & Fogassi, L. (2005). Mirror Neurons Responding to Observation of Actions Made with

Tools in Monkey Ventral Premotor Cortex. Journal of Cognitive Neuroscience, 17: 212-226.

Fitzpatrick, P. (2003). First Contact: an active vision approach to segmentation. Paper presented at the IEEE/RSJ

International Conference on Intelligent Robots and Systems (IROS), October 27-31, Las Vegas, Nevada, USA.

Fitzpatrick, P., & Metta, G. (2003). Grounding vision through experimental manipulation. Philosophical

Transactions of the Royal Society: Mathematical, Physical, and Engineering Sciences, 361(1811), 2165-2185.


Fogassi, L., Ferrari, P. F., Gesierich, B., Rozzi, S., Chersi, F., & Rizzolatti, G. (2005). Parietal Lobe: From Action

Organization to Intention Understanding. Science, 308(4), 662-667.

Fregnac, Y., 2003, Hebbian Synaptic Plasticity, in The Handbook of Brain Theory and Neural Networks, Second

Edition, (M.A. Arbib, Ed.), Cambridge, MA: Bradford Books/The MIT Press, pp.5125-522.

Gallese, V., Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119,

593-609.

Gibson, J. J. 1966 The senses considered as perceptual systems. London: Allen & Unwin.

Girard B, Filliat D, Meyer JA, et al., 2005, Integration of navigation and action selection functionalities in a

computational model of cortico-basal-ganglia-thalamo-cortical loops, Adaptive Behavior 13(2):115-130.

Goodale, M. A., and Milner, A. D., 1992, Separate visual pathways for perception and action, Trends in

Neuroscience, 15:20-25.

Guazzelli, A., Corbacho, F. J., Bota, M., and Arbib, M. A., 1998, Affordances, Motivation, and the World Graph

Theory, Adaptive Behavior, 6:435-471.

Grafton, S. T., Arbib, M. A., Fadiga, L., and Rizzolatti, G., 1996, Localization of grasp representations in humans by

PET: 2. Observation compared with imagination. Exp Brain Res. 112:103-111.

Haruno, M., Wolpert, D. M., & Kawato, M. (2001). MOSAIC Model for sensorimotor learning and control. Neural

Computation, 13, 2201-2220.

Hoff, B., and Arbib, M.A., 1992, A model of the effects of speed, accuracy, and perturbation on visually guided

reaching, in Control of Arm Movement in Space: Neurophysiological and Computational Approaches, (R.

Caminiti, P.B. Johnson, and Y. Burnod, Eds.), Experimental Brain Research Series 22, pp.285-306.

Hoff, B. and Arbib, M.A., 1993, Simulation of Interaction of Hand Transport and Preshape During Visually Guided

Reaching to Perturbed Targets, J. Motor Behav. 25: 175-192.

Houk, J. C., Buckingham, J. T., & Barto, A. G., 1996, Models of the cerebellum and motor learning, Behavioral and

Brain Sciences 19(3): 368-383.

Iacoboni, M., Woods, R. P., Brass, M., Bekkering, H., Mazziotta, J. C., & Rizzolatti, G. (1999). Cortical

Mechanisms of Human Imitation. Science, 286, 2526-2528.

Ingle, D., 1968, Visual releasers of prey catching behavior in frogs and toads Brain, Behav., Evol. 1: 500-518.

Ito, M., 1984, The cerebellum and neural control, New York: Raven Press.

Ito, M., 2006, Cerebellar circuitry as a neuronal machine, Progress in Neurobiology 78 (5): 272-303

Ito, M., Noda, K., Hoshino, Y., & Tani, J. (2006). Dynamic and interactive generation of object handling behaviors

by a small humanoid robot using a dynamic neural network model. Neural Networks, 19(2), 323-337.

Itti, L., and Koch, C., 2000 A saliency-based search mechanism for overt and covert shifts of visual attention, Vision

Research, 40:1489-1506.

Jeannerod M., and Biguer, B., 1982, Visuomotor mechanisms in reaching within extra-personal space. In: Advances

in the Analysis of Visual Behavior (D.J. Ingle, R.J.W. Mansfield and M.A. Goodale, Eds.), The MIT Press,

pp.387-409.

Johnson-Frey SH. 2004 The neural bases of complex tool use in humans. Trends in Cognitive Sciences 8:71-78.


Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal teacher. Cognitive

Science, 16(3), 307-354.

Kawato, M., Furukawa, K., & Suzuki, R. (1987). A hierarchical neural network model for control and learning of

voluntary movement. Biological Cybernetics, 57, 169-185.

Khatib, O. (1986). Real-time obstacle avoidance for manipulators and mobile robots. The International Journal of

Robotics Research, 5:90-98.

Kuniyoshi, Y., Yorozu, Y., Inaba, M., & Inoue, H. (2003). From visuomotor self learning to visual imitation - a

neural architecture for humanoid learning. Paper presented at the International conference on Robotics and

Automation, Taipei, Taiwan.

Lettvin, J. Y., Maturana, H., McCulloch, W. S., and Pitts, W. H., 1959, What the frog's eye tells the frog brain,

Proceedings of the IRE, 47:1940-1951.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code.

Psychological Review, 74, 431-461.

Lieblich, I., and Arbib, M.A., 1982, Multiple Representations of Space Underlying Behavior, The Behavioral and

Brain Sciences, 5: 627-659.

Liu, SC, & Usseglio-Viretta, A., 2001, Fly-like visuomotor responses of a robot using aVLSI motion-sensitive

chips. Biol Cybern. 85(6):449-57.

Lopes, M., & Santos-Victor, J. (2005). Visual learning by imitation with motor representations. IEEE Transactions

on Systems, Man, and Cybernetics, Part B Cybernetics, 35(3):438-449.

Lyons, D.M., and Arbib, M.A., 1989, A Formal Model of Computation for Sensory-Based Robotics, IEEE Trans. on

Robotics and Automation, 5:280-293.

Marr, D. A., 1969, A theory of cerebellar cortex, Journal of Physiology 202: 437-470.

Metta, G., Fitzpatrick, P., & Natale, L. (2006a). YARP: yet another robot platform. International Journal on

Advanced Robotics Systems, 3(1):43-48.

Metta, G., Sandini, G., Natale, L., Craighero, L., & Fadiga, L. (2006b). Understanding mirror neurons: a bio-robotic

approach. Interaction Studies, special issue on Epigenetic Robotics, 7(2):197–232.

Meyer JA, Guillot A, Girard B, et al., 2005, The Psikharpax project: towards building an artificial rat, Robotics and

Autonomous Systems 50(4):211-223.

Miall, R.C., Weir, D.J., Wolpert, D.M. and Stein, J.F., 1993, Is the cerebellum a Smith predictor?, Journal of Motor

Behavior 25: 203-216

Miller, W. T., 1994, Real-time application of neural networks for sensor-based control of robots with vision. IEEE

Transactions on Systems, Man, and Cybernetics, vol. 19, pp. 825-831.

Navalpakkam, V., & Itti, L., 2005, Modeling the influence of task on attention, Vision Research 45:205–231

Nehaniv, C.L., 2006, Nine billion correspondence problems, in Imitation and Social Learning in Robots, Humans,

and Animals: Behavioural, Social and Communicative Dimensions (C.L. Nehaniv and K. Dautenhahn, Eds.),

Cambridge: Cambridge University Press.


Nehaniv, C.L., & Dautenhahn, K. (1998). Mapping between dissimilar bodies: Affordances and the algebraic

foundations of imitation. Paper presented at the European Workshop on Learning Robots (EWRL-7), Edinburgh,

UK.

Nijhof, E.-J., Kouwenhoven, E., 2002: Simulation of multijoint arm movements. In: Biomechanics and Neural

Control of Posture and Movement (Winters, J.M., Crago, P.E., Eds.). New York : Springer-Verlag, pp. 363-372.

O'Keefe, J. and Nadel, L., 1978, The Hippocampus as a Cognitive Map, Oxford: Clarendon Press.

Orabona, F., Metta, G., & Sandini, G. (2005, June 25th). Object-based Visual Attention: a Model for a Behaving

Robot. Paper presented at the 3rd International Workshop on Attention and Performance in Computational

Vision, CVPR, San Diego, CA, USA

Osu, R. and Gomi, H., 1999, Multijoint muscle regulation mechanisms examined by measured human arm stiffness

and EMG signals, Journal of Neurophysiology, 81 (4), 1458-1468.

Oztop, E., and Arbib, M.A., 2002, Schema Design and Implementation of the Grasp-Related Mirror Neuron System,

Biological Cybernetics, in press.

Oztop, E., Kawato, M., & Arbib, M. A. (2006). Mirror neurons and imitation: A computationally guided review.

Neural Networks, 19(3), 254-271.

Oztop, E., Wolpert, D. M., & Kawato, M. (2005). Mental state inference using visual control parameters. Cognitive

Brain Research, 22, 129-151.

Perrett, D.I., Hietanen, J.K., Oram, M.W. and Benson, P.J. (1992) Organization and functions of cells in the

macaque temporal cortex. Philosophical Transactions of the Royal Society of London, B, 335: 23-50.

Perrett, D.M. & Puce, A. (2003) Electrophysiology and brain imaging of biological motion. Phil. Trans. R. Soc.

Lond. B 358, 435–445.

Peters, J. and van der Smagt, P., 2002, Searching a scalable approach to cerebellar based control, Applied

Intelligence 17: 11-33.

Reichardt, W. (1961) Autocorrelation, a principle for the evaluation of sensory information by the central nervous

system. In W.A. Rosenblith (ed): Sensory Communication. New York, London: The M.I.T. Press and John Wiley

& Sons, pp. 303-317.

Reiser MB, Dickinson MH, 2003, A test bed for insect-inspired robotic control., Philos Transact A Math Phys Eng

Sci. Oct 15;361(1811):2267-85.

Rizzolatti, G., & Luppino, G. (2003). Grasping movements: visuomotor transformations. In M. A. Arbib (Ed.), The

handbook of brain theory and neural networks (2nd ed). Cambridge, MA: The MIT Press, pp. 501–504.

Rizzolatti, G., Fadiga L., Gallese, V., and Fogassi, L., 1995, Premotor cortex and the recognition of motor actions.

Cogn Brain Res., 3: 131-141.

Rizzolatti, G., Fadiga, L., Matelli, M., Bettinardi, V., Perani, D., and Fazio, F., 1996, Localization of grasp

representations in humans by positron emission tomography: 1. Observation versus execution. Exp Brain Res.,

111:246-252.

Rizzolatti, G., Fogassi L., & Gallese, V. (2001) Neurophysiological mechanisms underlying the understanding and

imitation of action. Nat. Rev. Neurosc. 2, 661-670.


Rizzolatti, G., G. Luppino, & M. Matelli (1998) The organization of the cortical motor system: new concepts.

Electroencephalography and Clinical Neurophysiology, 106:283-296.

Rizzolatti, G., Riggio, L., Dascola, I., & Umiltà, C. (1987). Reorienting attention across the horizontal and vertical

meridians: evidence in favor of a premotor theory of attention. Neuropsychologia, 25, 31-40.

Ruffier, F. Viollet, S. Amic, S. Franceschini, N., 2003, Bio-inspired optical flow circuits for the visual guidance of

micro air vehicles, in: Proceedings of the 2003 International Symposium Circuits and Systems, 2003. ISCAS '03.,

Volume: 3, pp. III-846-III-849.

Sabourin, C. and Bruneau, O., 2005, Robustness of the dynamic walk of a biped robot subjected to disturbing

external forces by using CMAC neural networks, Robotics and Autonomous Systems 51: 81-99

Sakata, H., Taira, M., Kusunoki, M., Murata, A., & Tanaka, Y. (1997). The TINS lecture - The parietal association

cortex in depth perception and visual control of action. Trends in Neurosciences, 20(8), 350-358.

Sandini, G., & Tagliasco, V. (1980). An Anthropomorphic Retina-like Structure for Scene Analysis. Computer

Vision, Graphics and Image Processing, 14(3), 365-372.

Sauser, E.L., & Billard, A. (2006). Parallel and distributed neural models of the ideomotor principle: An

investigation of imitative cortical pathways. Neural Networks, 19:285-298.

Schaal, S. and Schweighofer, N., 2005, Computational motor control in humans and robots, Current Opinion in

Neurobiology 15: 675-682.

Schaal, S., Ijspeert, A. J., & Billard, A. (2003). Computational approaches to motor learning by imitation.

Philosophical Transaction of the Royal Society: Biological Sciences, 358(1431), 537–547.

Schweighofer, N., 1995, Computational models of the cerebellum in the adaptive control of movements, Ph.D.

thesis, University of Southern California.

Schweighofer, N., Arbib, M. A., and Kawato, M., 1998a, Role of the Cerebellum in Reaching Quickly and

Accurately: I. A Functional Anatomical Model of Dynamics Control, European Journal of Neuroscience, 10: 86-

94.

Schweighofer, N., Spoelstra, J., Arbib, M. A., and Kawato, M., 1998b, Role of the Cerebellum in Reaching Quickly

and Accurately: II. A Neural Model of the Intermediate Cerebellum, European Journal of Neuroscience, 10:95-

105.

Segev, I., & London, M., 2003, Dendritic Processing, in The Handbook of Brain Theory and Neural Networks,

Second Edition, (M.A. Arbib, Ed.), Cambridge, MA: Bradford Books/The MIT Press, pp.324-332.

Srinivasan MV & Zhang SW 1997, Visual control of honeybee flight. EXS, 84:95-113.

Srinivasan MV, Zhang S, Chahl JS., 2001, Landing strategies in honeybees, and possible applications to

autonomous airborne vehicles. Biol Bull. 200(2):216-21.

Stein, B.E., and Meredith, M.A., 1993, The Merging of the Senses, Cambridge, MA: The MIT Press.

Strosslin T, Krebser C, Arleo A, Gerstner W, 2002, Combining multimodal sensory input for spatial learning, in

Artificial Neural Networks - ICANN 2002, Lecture Notes In Computer Science, 2415:87-92.

Taira, M., Mine, S., Georgopoulos, A.P., Murata, A., and Sakata, H., 1990, Parietal cortex neurons of the monkey

related to the visual guidance of hand movement, Exp. Brain Res. 83:29-36.


Tani, J., Ito, M., & Sugita, Y. (2004). Self-organization of distributedly represented multiple behavior schemata in a

mirror system: Reviews of robot experiments using RNNPB. Neural Networks, 17(8-9), 1273-1289.

Umiltà MA, Kohler E, Gallese V, Fogassi L, Fadiga L, Keysers C, Rizzolatti G. (2001) I know what you are doing.

A neurophysiological study, Neuron 31(1):155-65.

Ungerleider, L. G., and Mishkin, M., 1982, Two cortical visual systems, in Analysis of Visual Behavior (D. J. Ingle,

M. A. Goodale, and R. J. W. Mansfield, Ed.), Cambridge, MA: The MIT Press, pp.549-586.

van der Smagt, P., 1997, Teaching a robot to see how it moves, Neural Network Perspectives on Cognition and

Adaptive Robotics, (Browne, A, Ed.) pp. 195-219. Institute of Physics Publishing.

van der Smagt, P., 1998, Cerebellar control of robot arms, Connection Science, 10:301-320.

van der Smagt, P., 2000, Benchmarking cerebellar control, Robotics and Autonomous Systems, 32:237-251.

van der Smagt, P. and Groen, F., 1997, Visual feedback in motion, Neural Systems for Robotics, (Omidvar, O. and

van der Smagt, P., Eds.), Academic Press, pp.37-73

van der Smagt, P., Groen, F. and Schulten, K., 1996, Analysis and control of a rubbertuator robot arm, Biological

Cybernetics 75(5): 433-440.

Walter, W.G., 1953, The Living Brain, London: Duckworth (reprinted by Pelican Books, Harmondsworth, 1961).

Wolfe, J M, 1994, Guided search 2.0: a revised model of visual search, Psychonomic Bull Rev, 1:202-238.

Wolpert, D. and Kawato, M., 1998, Multiple paired forward and inverse models for motor control, Neural Networks

11: 1317-1329.

Wolpert, D. M., Doya, K., & Kawato, M. (2003). A unifying computational framework for motor control and social

interaction. Philosophical Transaction of the Royal Society: series B, Biological Sciences, 358(1431), 593-602.

Wolpert, D. M., Ghahramani, Z., & Flanagan, R. J. (2001). Perspectives and problems in motor learning. Trends in

Cognitive Sciences, 5(11), 487-494.

Woodward, A. L. (1998). Infants selectively encode the goal object of an actor's reach. Cognition, 69, 1-34.

Yarbus, A, 1967, Eye Movements and Vision, New York: Plenum Press.

Human motor cortex excitability during the perception of others' action
Luciano Fadiga1, Laila Craighero1 and Etienne Olivier2

Neuroscience research during the past ten years has

fundamentally changed the traditional view of the motor

system. In monkeys, the finding that premotor neurons also

discharge during visual stimulation (visuomotor neurons) raises

new hypotheses about the putative role played by motor

representations in perceptual functions. Among visuomotor

neurons, mirror neurons might be involved in understanding

the actions of others and might, therefore, be crucial in

interindividual communication. Functional brain imaging

studies enabled us to localize the human mirror system, but the

demonstration that the motor cortex dynamically replicates the

observed actions, as if they were executed by the observer, can

only be given by fast and focal measurements of cortical

activity. Transcranial magnetic stimulation enables us to

instantaneously estimate corticospinal excitability, and has

been used to study the human mirror system at work during the

perception of actions performed by other individuals. In the

past ten years several TMS experiments have been performed

investigating the involvement of motor system during others’

action observation. Results suggest that when we observe

another individual acting we strongly ‘resonate’ with his or

her action. In other words, our motor system simulates

underthreshold the observed action in a strictly congruent

fashion. The involved muscles are the same as those used in

the observed action and their activation is temporally strictly

coupled with the dynamics of the observed action.

Addresses
1 Section of Human Physiology, Università di Ferrara, Ferrara, Italy
2 Laboratory of Neurophysiology, Université Catholique de Louvain,

Brussels, Belgium

Corresponding author: Fadiga, Luciano ([email protected])

Current Opinion in Neurobiology 2005, 15:213–218

This review comes from a themed issue on

Cognitive neuroscience

Edited by Angela D Friederici and Leslie G Ungerleider

Available online 17th March 2005

0959-4388/$ – see front matter

© 2005 Elsevier Ltd. All rights reserved.

DOI 10.1016/j.conb.2005.03.013

Introduction
A large amount of evidence suggests that actions are

represented in the brain in a similar way to words in a

vocabulary [1]. Neurophysiological studies of monkey

premotor cortex have established that hand and mouth

goal directed actions are represented in area F5. This


goal-directed encoding is demonstrated by the discrimi-

native behavior of F5 neurons when an action that is

motorically similar to the one effective in triggering

neuronal response is executed in a different context.

For instance, a F5 neuron that responds during hand

grasping will not respond when similar finger movements

are performed with a different purpose, for example,

scratching [2]. Several F5 neurons, in addition to their

motor properties, also respond to visual stimuli. Mirror

neurons form a class of visuomotor neurons that respond

both when the monkey performs goal-directed hand

actions and when it observes other individuals performing

similar actions [3–5].

Prompted by the discovery of monkey mirror neurons

and stimulated by their possible involvement in high-

level cognitive functions, such as understanding others’

behavior and interindividual communication, several

functional brain imaging studies were performed to inves-

tigate whether or not a mirror-neuron system is also

present in the human brain. Results showed that observa-

tion of an action recruits a consistent network of cortical

areas, including the ventral premotor cortex (which

extends posteriorly to the primary motor cortex), the

inferior frontal gyrus, the inferior parietal lobule and

the superior temporal cortex (for recent literature see

Rizzolatti and Craighero [6]). However, brain imaging

studies give us a static picture of the activated areas and

do not enable us to conclude that the observer’s motor

system is dynamically (on-line) replicating the observed

movements. Transcranial magnetic stimulation (TMS)

can be used to measure the corticospinal (CS) excitability

with a relatively high temporal resolution, and has been

used extensively to address this issue.

Here, we review the most recent studies that investigate

using TMS how the human motor cortex reacts to others'

action observation.

Transcranial magnetic stimulation: a tool to measure motor activation during observation of others' actions
Although TMS was originally designed to test the integ-

rity of the CS system by recording a motor evoked

potential (MEP) from a given muscle in response to

primary motor cortex (M1) stimulation, the potential of

TMS to investigate brain functions has proved much

greater. TMS can be used either to inactivate specific

brain regions, by repetitively stimulating the brain to

obtain long lasting inhibition, or to interfere transiently

with its neural activity, by applying a single TMS pulse


[7,8]. This approach enables us to infer the contribution

of the ‘perturbed’ area to the investigated function,

similar to the situation in animal experiments in which

muscimol injections enable us to deduce the role of the

inactivated structure by quantifying the induced beha-

vioral deficit [9,10]. The precision of localization of the

TMS target has also been recently greatly improved by

using frameless stereotaxic methods (for recent literature,

see Noirhomme et al. [11•]), enabling us to target more

precisely those ‘motorically silent’ cortical regions lying

outside M1. In addition to its use in inactivation studies,

TMS can also be used to monitor changes in CS excit-

ability that specifically accompany motor performance

[12], or that are induced by the activity of various brain

regions connected with M1. This can be done by measur-

ing the amplitude of MEPs elicited by TMS under

various experimental conditions (see Figure 1).
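As a concrete illustration of this kind of measurement, the following is a minimal Python sketch (not taken from the reviewed studies; the sampling rate, latency window and all function names are illustrative assumptions) of how the peak-to-peak MEP amplitude could be extracted from a recorded EMG trace and averaged within a condition:

import numpy as np

def mep_peak_to_peak(emg, fs, stim_index, window_ms=(15.0, 50.0)):
    # Peak-to-peak MEP amplitude in a fixed post-stimulus window.
    # emg: 1-D array of EMG samples (e.g. in mV); fs: sampling rate in Hz;
    # stim_index: sample index of the TMS pulse; window_ms: assumed latency
    # window expected to contain the hand-muscle MEP.
    start = stim_index + int(window_ms[0] * fs / 1000.0)
    stop = stim_index + int(window_ms[1] * fs / 1000.0)
    segment = np.asarray(emg[start:stop], dtype=float)
    return float(segment.max() - segment.min())

def condition_mean(trials, fs, stim_index):
    # Mean MEP amplitude over the trials of one experimental condition,
    # e.g. grasping observation versus a control condition.
    return float(np.mean([mep_peak_to_peak(t, fs, stim_index) for t in trials]))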

The first evidence that CS excitability is modulated

during observation of an action was given by our group

ten years ago [13]. TMS was applied to the area of motor

cortex that represents the hand and MEPs were recorded

from contralateral hand muscles (extensor digitorum com-

munis [EDC], flexor digitorum superficialis [FDS], first

dorsal interosseus [FDI], and opponens pollicis [OP])

during observation of transitive (grasping of different

objects) and intransitive (arm elevation) arm–hand move-

ments. During observation of grasping action, the ampli-

tude of MEPs recorded from OP and FDI increased

compared with those observed in the control conditions.

During observation of arm movement the increase was

present in all muscles except OP (Figure 2c).

This experimental outcome raised the question of

whether the muscles that were facilitated during the

Figure 1. Transcranial magnetic stimulation. The fast circulation of a strong electrical current in the coil positioned on the skull induces an electric current in the brain. If the induced current is large enough, underlying cortical neurons are brought over threshold and the descending volley reaches the spinal motoneurons, evoking a MEP detectable by standard electromyography techniques. (Typical parameters: ~3000 V, ~5500 A, ~2.5 T field intensity, ~2 cm activated region, ~100 µs total pulse time, ~2 µs pulse rising time.)

observation of a given action were the same as those

active during its execution. To answer this question,

EMG from hand muscles was recorded during rest, object

grasping and arm lifting movements, and the pattern of

EMG activation replicated exactly that of MEPs elicited

by TMS during action observation. These results have

been successfully replicated and extended by other

groups. Brighina et al. [14] investigated the effect of

the observation of simple intransitive thumb movements

(abduction) and of sequential thumb–finger opposition

movements on left and right motor cortex excitability.

Although some methodological details are absent from

the paper (e.g. whether the presented action was per-

formed by a right or a left hand), the results show an

increase of the CS excitability in both hemispheres that

depended on the complexity of the observed task. More

recently, Gangitano et al. [15] showed the presence of a

strict temporal coupling between the changes in CS

excitability and the dynamics of the observed action.

Indeed, MEPs recorded from the FDI muscle at different

time intervals during passive observation of a pincer

grasping action matched in time the dynamics of the

kinematics of the pinch that characterized the actual

movements. Clark et al. [16•] have recently assessed

the specificity of the CS facilitation induced by action

observation. They recorded MEPs from FDI muscle of

the dominant hand during TMS of contralateral M1 while

participants first, merely observed, second, imagined, or

third, observed to subsequently imitate simple hand

actions. They did not find any statistically significant

difference among the three conditions. However, MEPs

recorded during these three conditions were strongly

facilitated when compared with those recorded during

highly demanding non-motor cognitive tasks (i.e. MEPs

collected during a backwards mental counting task).


Glossary

H-reflex: Hoffmann reflex, its amplitude (as recorded by EMG,

electromyography) depends upon spinal motoneuron excitability, and

it is evoked by stimulating the afferent fibers in peripheral nerves.

F-wave: Centrifugal discharge recorded by EMG and evoked in

motoneurons by antidromic excitation of the motoneuron axon–soma.

Figure 2. Transcranial magnetic stimulation can be used to instantaneously measure the excitability of the CS system. Schematic depiction of the effect of a same intensity TMS on (a) resting and (b) underthreshold depolarized (note the yellow soma) CS neurons. In (c) typical changes of corticospinal excitability during action observation are shown (y-axis: MEP total areas, z-score). Bar histograms describe the effect of hand grasping observation (black bars), arm movement observation (white bars) and control condition (shaded bars) on the TMS-induced MEPs recorded from first dorsal interosseus (FDI) and opponens pollicis (OP) muscles (modified with permission from Fadiga et al. [13]).
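Since the bar histograms in Figure 2 express MEP size as z-scores, a minimal sketch of that normalization (assuming one MEP "total area" value per trial; the function names are illustrative and not from the original study) is:

import numpy as np

def zscore(mep_areas):
    # Normalize a subject's MEP areas (one value per trial) to z-scores,
    # so that conditions can be compared on a common scale across subjects.
    areas = np.asarray(mep_areas, dtype=float)
    return (areas - areas.mean()) / areas.std(ddof=1)

def condition_means(mep_areas, condition_labels):
    # Mean z-scored MEP area per condition (e.g. grasping observation,
    # arm movement observation, control), as plotted in the bar histograms.
    z = zscore(mep_areas)
    labels = np.asarray(condition_labels)
    return {c: float(z[labels == c].mean()) for c in np.unique(labels)}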

There are at least two possible explanations for the

origin of the MEP facilitation induced by action obser-

vation. The first one is that the facilitation of MEPs is

due to the enhancement of M1 excitability produced

through excitatory cortico–cortical connections. Consid-

ering that monkey area F5, where mirror neurons are

located, is premotor cortex that is strongly connected

with M1 [17], one could explain the facilitation of MEPs

induced by action observation as the consequence of a

facilitatory cortico–cortical effect arising from the

human homolog of monkey area F5. The second possi-

ble explanation is that TMS reveals, through the CS


descending volley, a facilitation of motoneurons (MNs)

mediated by parallel pathways (e.g. from the premotor

cortex to the brainstem). Indeed, when MEP amplitude

variation is used as an end-point measure, a change in

M1 excitability cannot be firmly established unless any

modification in MN excitability is ruled out. Further-

more, changes in MEP amplitude caused by a variation

of M1 or MN excitability are undistinguishable and,

unfortunately, techniques that enable us to address this

issue are uncomfortable for subjects (transcranial elec-

trical stimulation, brainstem electrical stimulation),

inapplicable (as in the case of H-reflex for intrinsic hand

muscles), or unreliable because they assess only a small

fraction of the whole MN population (as in the case of

F-wave).

To assess the spinal involvement in action observation-

related MEP facilitation, Baldissera et al. [18] investi-

gated spinal cord excitability during action viewing. To

do this, they measured the amplitude of H-reflex (see

glossary) in finger flexor forearm muscles while subjects

were looking at goal-directed hand actions. Results

showed that although there was a significant modulation

of the H-reflex specifically related to the different phases

of the observed movement, the modulation pattern was

opposite to that occurring at the cortical level. Whereas

modulation of cortical excitability strictly mimics the

seen movements, as if they were performed by the

observer, the spinal cord excitability appears to be reci-

procally modulated. Indeed, the spinal MNs of finger

flexors were facilitated during observation of hand open-

ing (finger extension) but inhibited during observation of

hand closure (finger flexion). This effect was interpreted

as the expression of a mechanism serving to block overt

execution of seen actions. However, other authors failed

in showing specific changes in H-reflex amplitude during

observation of simple finger movements (see below;

[19]).

The more recently developed paired-pulse TMS might

help to determine the cortical or spinal origin of CS

facilitation (see [20]). The rationale of this technique is

the use of a subthreshold conditioning TMS pulse

followed, at various delays, by a supra-threshold TMS

test pulse. Depending on the delay between these two

pulses, it is possible to investigate changes in the excit-

ability of excitatory or inhibitory interneurons within M1

itself. Intracortical inhibition (ICI) is usually observed

for short (1–5 ms) or long (50–200 ms) intervals between

conditioning and test TMS, whereas intracortical


facilitation (ICF) was maximal for 8–20 ms intervals.
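A minimal sketch of how such paired-pulse trials are typically summarized (the interval ranges are those quoted above; the ratio-based summary and all names are illustrative assumptions, not a standard library API):

def paired_pulse_regime(isi_ms):
    # Classify the conditioning-test interstimulus interval into the
    # intracortical inhibition (ICI) or facilitation (ICF) ranges above.
    if 1 <= isi_ms <= 5 or 50 <= isi_ms <= 200:
        return "ICI"
    if 8 <= isi_ms <= 20:
        return "ICF"
    return "unclassified"

def conditioned_test_ratio(conditioned_mep, test_mep):
    # Conditioned/unconditioned MEP amplitude ratio: values below 1 indicate
    # inhibition, values above 1 indicate facilitation.
    return conditioned_mep / test_mep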

Strafella and Paus [21] used this approach to examine

changes in cortical excitability during action observa-

tion. They stimulated the left M1 during rest, observa-

tion of handwriting and observation of arm movements.

MEPs were recorded from the FDI and biceps brachialis

muscles. Results showed that action observation

induced a facilitation of MEP amplitude evoked by

the single test stimulus and led to a decreased ICI at

3 ms interstimulus interval. The authors, therefore,

came to the conclusion that CS facilitation induced by

action observation was attributable to cortico–cortical

facilitating connections.

A series of experiments was set up with the aim of

clarifying the modulation of CS excitability induced

by some peculiar characteristics of the observed action.

In a recent experiment aiming to investigate whether

human mirror neurons are somehow tuned for one’s own

actions or not, Patuzzo et al. [19] showed an increase in

right FDS MEP amplitude associated with a reduction in

ICI, during the observation of both self (pre-recorded

videos representing subjects’ own motor performance)

and non-self finger flexion, as compared with those at

rest. Moreover, these authors failed to observe significant

changes in spinal excitability as tested with H-reflex or

F-wave (see glossary). Maeda et al. [22] investigated the

self–others issue extensively, by using TMS to measure

CS excitability during observation of actions of various

degrees of familiarity. In two conditions, subjects were

watching their own previously recorded hand actions. In

the ‘frequently observed configuration’ the acting fingers

were directed away from the body, and in the ‘not

frequently observed configuration’ the fingers were

directed toward the body. In the remaining two condi-

tions, subjects were watching the previously recorded

hand actions of unknown people both in the frequently

(fingers toward the body) and in the not frequently

(fingers away from the body) configuration. Results

showed that the frequently observed hand actions pro-

duced greater CS excitability with respect to the control

condition, whereas this was not the case for the less

frequently observed hand actions. In a subsequent

experiment, Maeda et al. [23] further investigated the

issue of the orientation of the observed hand. Results

showed that MEP facilitation was greater during obser-

vation of natural hand orientations. Aziz-Zadeh et al. [24]

investigated whether CS facilitation during action obser-

vation is modulated by the laterality of the observed body

part or not, and they found that when TMS was applied

over the left M1, MEPs were larger while observing right

hand actions. Likewise, when TMS was applied over the

right M1, MEPs were larger while observing left hand

actions.

Finally, in a very recent experiment, Gangitano et al. [25••] investigated the reason behind the presence of a


strict temporal coupling between the CS excitability

modulation and the dynamics of an observed reaching–

grasping movement (see Gangitano et al. [15]). Two sets

of visual stimuli were presented in two distinct exper-

iments. The first stimulus was a video-clip showing a

natural reaching–grasping movement. The second video-

clip represented an anomalous movement, in which the

temporal coupling between reaching and grasping com-

ponents was disrupted by changing the time of occur-

rence of the maximal finger aperture. This effect was

realized either by keeping the hand closed throughout the

whole reaching and opening it just in proximity to the

target (Experiment 1) or by substituting part of the

natural finger opening with a sudden movement of clo-

sure (Experiment 2). Whereas in Experiment 1 MEP

modulation was generally absent, the observation of

video-clip presented in Experiment 2 induced a clear,

significant modulation of CS excitability. This modula-

tion, however, was limited to the part of the observed

action preceding the visual perturbation (the sudden

finger closure). These results suggest that any modification

of the canonical plan of an observed action induces a reset

of the mirror system, which therefore stops its activity. It

can be deduced that the mirror system works online to

predict the goal and the outcome of the observed action.

Motor facilitation induced by ‘listening’ to others' actions: a link with speech perception?
Others' actions do not generate only visually perceivable

signals. Action-generated sounds and noises are also very

common in nature. One might expect, therefore, that also

this sensory information, related to a particular action,

could determine motor activation specific for that same

action. Very recently, it has been reported that a fraction

of monkey mirror neurons, in addition to their visual

response, also become active when the monkey listens

to an action-related sound (e.g. breaking of a peanut)

[26••]. It is tempting, therefore, to conclude that mirror

neurons might form a multimodal representation of goal

directed actions involved in action recognition. Experi-

mental evidence shows similar results in humans. Aziz-

Zadeh et al. [27] used TMS to explore whether or not CS

excitability is modulated by listening to action-related

sounds. In their experiment, left and right M1 were

studied. MEPs were recorded from the contralateral

FDI muscle while subjects listened to one of two kinds

of bimanual hand action sounds (typing or tearing paper),

to a bipedal leg action sound (walking) and to a control

sound (thunder). Results showed that sounds associated

with bimanual actions produced greater CS excitability

than sounds associated with leg movements or control

sounds. Moreover, this facilitation was exclusively later-

alized to the left hemisphere.

Sundara et al. [28] tested the possibility that a mirror

mechanism, similar to that found during observation of


actions, would also be present during presentation of

speech gestures both in the visual and in the auditory

modalities. The authors found that visual observation of

speech movement enhanced MEP amplitude specif-

ically in facial muscles involved in production of the

observed speech. By contrast, listening to the sound did

not produce MEP enhancement. This negative result,

as far as speech sounds are concerned, might be

explained by the fact that they recorded MEPs from

muscles (facial) not directly involved in speech sound

production. Indeed, our group [29] demonstrated that

during speech listening there is an increase of MEP

recorded from the listeners’ tongue muscles when the

presented words would strongly involve, when pro-

nounced, tongue movements. Furthermore, the effect

was stronger in the case of words than in the case of

pseudowords, suggesting a possible unspecific facilita-

tion of the motor speech center due to recognition that

the presented material belongs to an extant word. As

suggested by the ‘motor theory of speech perception’

originally proposed by A. Liberman [30], the presence

of a phonetic resonance in the motor speech centers

might subserve speech perception. This theory main-

tains that the ultimate constituents of speech are not

sounds but articulatory gestures that have evolved

exclusively at the service of language. According to

Liberman’s theory, the listener understands the

speaker when his or her articulatory gesture representa-

tions are activated by the listening to verbal sounds.

Also in this line is the study by Watkins et al. [31]

demonstrating that speech perception, either by listen-

ing to speech or by observing speech-related lip move-

ments, enhanced MEPs recorded from the orbicularis

oris muscle of the lips. This increase in motor excit-

ability during speech perception was, however, evident

for left hemisphere stimulation only. In a subsequent

experiment, Watkins and Paus [32••] combined TMS

with positron emission tomography to identify the brain

regions mediating the changes in motor excitability

during speech perception. Results showed that during

auditory perception of speech, the increased size of the

MEP obtained by stimulation over the face representa-

tion of the primary motor cortex correlated with regio-

nal increase of blood flow in the posterior part of the left

inferior frontal gyrus (Broca’s area). To verify the pre-

sence of an evolutionary link between the hand motor

and the language system, as proposed by an influential

theory of the evolution of communication [33], Floel

et al. [34•] examined if, and to what extent, language

activates the hand motor system. They recorded MEPs

from the FDI muscle while subjects underwent three

consecutive experiments attempting to isolate the com-

ponents (articulatory linguistic, auditory, cognitive and

visuospatial attentional) more crucial in activating the

motor system. Results showed that productive and

receptive linguistic tasks only excite the motor cortices

for both hands.


Conclusions
A large body of evidence supports the view that percep-

tion of others’ actions is constantly accompanied by motor

facilitation of the observer’s CS system. This facilitation

is not only present during action observation but also

while listening to action-related sounds and, more inter-

estingly, while listening to speech. Further research is,

however, necessary to investigate if the cytoarchitectonic

homologies linking Broca’s area — and particularly Brod-

mann’s area 44 — to monkey’s area F5, where mirror

neurons have been found [35], might reflect the evolu-

tionary continuity of a communicative system originally

developed in our ancestors.

When considering TMS experiments, one should be

aware of the fact that any change in CS excitability,

even if a spinal contribution is firmly excluded, does

not tell us much about the actual brain structures

underlying the facilitation. Indeed, given the large

number of non-primary motor areas that establish exci-

tatory connections with M1, a change in M1 excitability

could originate from any of these areas. However, the

combination of TMS experiments with brain imaging

studies represents a new powerful method of analysis.

This new potential, together with the technical

improvements of TMS technique (i.e. paired pulse

stimulation, enhanced focalization and use of frameless

stereotaxic systems), is expected to increase our working

knowledge of the complex functions of the human

motor system.

Acknowledgements
This work has been supported by European Commission grants MIRROR and NEUROBOTICS to L Fadiga and L Craighero and by European Science Foundation Origin of Man, Language and Languages, Eurocores and Italian Ministry of Education grants to L Fadiga.

References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:

• of special interest
•• of outstanding interest

1. Rizzolatti G, Fadiga L: Grasping objects and grasping action meanings: the dual role of monkey rostroventral premotor cortex (area F5). In Sensory Guidance of Movement, Novartis Foundation Symposium. Edited by GR Bock, JA Goode. John Wiley and Sons; 1998:81-95.
2. Rizzolatti G, Camarda R, Fogassi L, Gentilucci M, Luppino G, Matelli M: Functional organization of inferior area 6 in the macaque monkey: II. Area F5 and the control of distal movements. Exp Brain Res 1988, 71:491-507.
3. Di Pellegrino G, Fadiga L, Fogassi L, Gallese V, Rizzolatti G: Understanding motor events: a neurophysiological study. Exp Brain Res 1992, 91:176-180.
4. Gallese V, Fadiga L, Fogassi L, Rizzolatti G: Action recognition in the premotor cortex. Brain 1996, 119:593-609.
5. Rizzolatti G, Fadiga L, Gallese V, Fogassi L: Premotor cortex and the recognition of motor actions. Brain Res Cogn Brain Res 1996, 3:131-141.
6. Rizzolatti G, Craighero L: The mirror-neuron system. Annu Rev Neurosci 2004, 27:169-192.


7. Walsh V, Cowey A: Transcranial magnetic stimulation and cognitive neuroscience. Nat Rev Neurosci 2000, 1:73-79.
8. Rafal R: Virtual neurology. Nat Neurosci 2001, 4:862-864.
9. Wardak C, Olivier E, Duhamel JR: Saccadic target selection deficits after lateral intraparietal area inactivation in monkeys. J Neurosci 2002, 22:9877-9884.
10. Wardak C, Olivier E, Duhamel JR: A deficit in covert attention after parietal cortex inactivation in the monkey. Neuron 2004, 42:501-508.
11. • Noirhomme Q, Ferrant M, Vandermeeren Y, Olivier E, Macq B, Cuisenaire O: Registration and real-time visualization of transcranial magnetic stimulation with 3-D MR images. IEEE Trans Biomed Eng 2004, 51:1994-2005.
A methodological study presenting a new system that maps the TMS target directly onto the individual anatomy of the subject. The approach followed by the authors enables accurate and fast processing that makes the method suitable for on-line applications.
12. Lemon RN, Johansson RS, Westling G: Corticospinal control during reach, grasp, and precision lift in man. J Neurosci 1995, 15:6145-6156.
13. Fadiga L, Fogassi L, Pavesi G, Rizzolatti G: Motor facilitation during action observation: a magnetic stimulation study. J Neurophysiol 1995, 73:2608-2611.
14. Brighina F, La Bua V, Oliveri M, Piazza A, Fierro B: Magnetic stimulation study during observation of motor tasks. J Neurol Sci 2000, 174:122-126.
15. Gangitano M, Mottaghy FM, Pascual-Leone A: Phase-specific modulation of cortical motor output during movement observation. Neuroreport 2001, 12:1489-1492.
16. • Clark S, Tremblay F, Ste-Marie D: Differential modulation of corticospinal excitability during observation, mental imagery and imitation of hand actions. Neuropsychologia 2004, 42:105-112.
A TMS study that directly compares various conditions known to activate motor representations in humans.
17. Shimazu H, Maier MA, Cerri G, Kirkwood PA, Lemon RN: Macaque ventral premotor cortex exerts powerful facilitation of motor cortex outputs to upper limb motoneurons. J Neurosci 2004, 24:1200-1211.
18. Baldissera F, Cavallari P, Craighero L, Fadiga L: Modulation of spinal excitability during observation of hand actions in humans. Eur J Neurosci 2001, 13:190-194.
19. Patuzzo S, Fiaschi A, Manganotti P: Modulation of motor cortex excitability in the left hemisphere during action observation: a single- and paired-pulse transcranial magnetic stimulation study of self- and non-self-action observation. Neuropsychologia 2003, 41:1272-1278.
20. Di Lazzaro V, Oliviero A, Pilato F, Saturno E, Dileone M, Mazzone P, Insola A, Tonali PA, Rothwell JC: The physiological basis of transcranial motor cortex stimulation in conscious humans. Clin Neurophysiol 2004, 115:255-266.
21. Strafella AP, Paus T: Modulation of cortical excitability during action observation: a transcranial magnetic stimulation study. Neuroreport 2000, 11:2289-2292.


22. Maeda F, Chang VY, Mazziotta J, Iacoboni M: Experience-dependent modulation of motor corticospinal excitability during action observation. Exp Brain Res 2001, 140:241-244.
23. Maeda F, Kleiner-Fisman G, Pascual-Leone A: Motor facilitation while observing hand actions: specificity of the effect and role of observer's orientation. J Neurophysiol 2002, 87:1329-1335.
24. Aziz-Zadeh L, Maeda F, Zaidel E, Mazziotta J, Iacoboni M: Lateralization in motor facilitation during action observation: a TMS study. Exp Brain Res 2002, 144:127-131.
25. •• Gangitano M, Mottaghy FM, Pascual-Leone A: Modulation of premotor mirror neuron activity during observation of unpredictable grasping movements. Eur J Neurosci 2004, 20:2193-2202.
This is the first investigation of the effects of visual perturbations on the mirror-neuron system in humans. The authors found that an unexpected finger closure during the execution of a grasping movement resets the observer's mirror system and zeroes the previously induced corticospinal facilitation.
26. Kohler E, Keysers C, Umiltà MA, Fogassi L, Gallese V, Rizzolatti G: Hearing sounds, understanding actions: Action representation in mirror neurons. Science 2002, 297:846-848.
27. Aziz-Zadeh L, Iacoboni M, Zaidel E, Wilson S, Mazziotta J: Left hemisphere motor facilitation in response to manual action sounds. Eur J Neurosci 2004, 19:2609-2612.
28. Sundara M, Namasivayam AK, Chen R: Observation-execution matching system for speech: a magnetic stimulation study. Neuroreport 2001, 12:1341-1344.
29. Fadiga L, Craighero L, Buccino G, Rizzolatti G: Speech listening specifically modulates the excitability of tongue muscles: a TMS study. Eur J Neurosci 2002, 15:399-402.
30. Liberman AM, Mattingly IG: The motor theory of speech perception revised. Cognition 1985, 21:1-36.
31. Watkins KE, Strafella AP, Paus T: Seeing and hearing speech excites the motor system involved in speech production. Neuropsychologia 2003, 41:989-994.
32. •• Watkins K, Paus T: Modulation of motor excitability during speech perception: the role of Broca's area. J Cogn Neurosci 2004, 16:978-987.
By using the association of TMS and positron emission tomography the authors investigate the homology of frontal cortex and demonstrate the role of Broca's area in the speech listening-induced motor facilitation.
33. Rizzolatti G, Arbib MA: Language within our grasp. Trends Neurosci 1998, 21:188-194.
34. • Floel A, Ellger T, Breitenstein C, Knecht S: Language perception activates the hand motor cortex: implications for motor theories of speech perception. Eur J Neurosci 2003, 18:704-708.
Here, the authors show the existence of a lexical link between speech and hand movement representation.
35. Petrides M, Pandya DN: Comparative architectonic analysis of the human and the macaque frontal cortex. In Handbook of Neuropsychology Vol. IX. Edited by F Boller, J Grafman. Elsevier; 1994:17-58.


A series of studies on the brain correlates of theverbal function demonstrate the involvement ofBroca’s region (Brodmann’s area – BA 44) duringboth speech generation (see Liotti et al., 1994 forreview) and speech perception (see Papathanassiouet al., 2000 for a review of recent papers). Recently,however, several experiments have shown thatBroca’s area is involved also in very differentcognitive and perceptual tasks, not necessarilyrelated to speech. Brain imaging experiments havehighlighted the possible contribution of BA 44 in“pure” memory processes (Mecklinger et al., 2002;Ranganath et al., 2003), in calculation tasks (Gruberet al., 2001), in harmonic incongruity perception(Maess et al., 2001), in tonal frequencydiscrimination (Muller et al., 2001) and in binoculardisparity (Negawa et al., 2002). Another importantcontribution of BA 44 is certainly found in themotor domain and motor-related processes. Gerlachet al. (2002) found an activation of BA 44 during acategorization task only if performed on artifacts.Kellenbach et al. (2003) found a similar activationwhen subjects were required to answer a questionconcerning the action evoked by manipulableobjects. Several studies reported a significantactivation of BA 44 during execution of graspingand manipulation (Binkofski et al., 1999a, 1999b;Gerardin et al., 2000; Grèzes et al., 2003; Hamzei etal., 2003; Lacquaniti et al., 1997; Matsumura et al.,1996; Nishitani and Hari, 2000). Moreover, theactivation of BA 44 is not restricted to motorexecution but spreads over to motor imagery(Binkofski et al., 2000; Geradin et al., 2000; Grèzesand Decety, 2002).

From a cytoarchitectonical point of view(Petrides and Pandya, 1997), the monkey’s frontalarea which closely resembles human Broca’s regionis a premotor area (area F5 as defined by Matelli et

al., 1985). Single neuron studies (see Rizzolatti etal., 1988) showed that hand and mouth movementsare represented in area F5. The specificity of thegoal seems to be an essential prerequisite inactivating these neurons. The same neurons thatdischarge during grasping, holding, tearing,manipulating, are silent when the monkey performsactions that involve a similar muscular pattern butwith a different goal (i.e., grasping to put away,scratching, grooming, etc.). All F5 neurons sharesimilar motor properties. In addition to their motordischarge, however, a particular class of F5neurons discharges also when the monkey observesanother individual making an action in front of it(‘mirror neurons’; Di Pellegrino et al., 1992;Gallese et al., 1996; Rizzolatti et al., 1996a). Thereis a strict congruence between visual and motorproperties of F5 mirror neurons: e.g., mirrorneurons motorically coding whole hand prehensiondischarge during observation of whole handprehension performed by the experimenter but notduring observation of precision grasp. The mostlikely interpretation for the visual response of thesevisuomotor neurons is that, at least in adultindividuals, there is a close link between action-related visual stimuli and the corresponding actionsthat pertain to monkey’s motor repertoire. Thus,every time the monkey observes the execution ofan action, the related F5 neurons are addressed andthe specific action representation is “automatically”evoked. Under certain circumstances it guides theexecution of the movement, under others, itremains an unexecuted representation of it, thatmight be used to understand what others are doing.

Transcranial magnetic stimulation (TMS)(Fadiga et al., 1995; Strafella and Paus, 2000) andbrain imaging experiments demonstrated that amirror-neuron system is present also in humans:

Cortex, (2006) 42, 486-490

SPECIAL ISSUE: POSITION PAPER

HAND ACTIONS AND SPEECH REPRESENTATION IN BROCA’S AREA

Luciano Fadiga and Laila Craighero

(Department of Biomedical Sciences, Faculty of Medicine, Section of Human Physiology,University of Ferrara, Ferrara, Italy)

ABSTRACT

This paper presents data and theoretical framework supporting a new interpretation of the role played by Broca’s area.Recent brain imaging studies report that, in addition to speech-related activation, Broca’s area is also significantly involvedduring tasks devoid of verbal content. In consideration of the large variety of experimental paradigms inducing Broca’sactivation, here we present some neurophysiological data from the monkey homologue of Brodmann’s areas (BA) 44 and45 aiming to integrate on a common ground these apparently different functions. Finally, we will report electrophysiologicaldata on humans which connect speech perception to the more general framework of other’s action understanding.

Key words: hand action, speech, Broca’s area, mirror neurons, neurophysiology, transcranial magnetic stimulation

when the participants observe actions made byhuman arms or hands, motor cortex becomesfacilitated (this is shown by TMS studies) andcortical activations are present in the ventralpremotor/inferior frontal cortex (Rizzolatti et al.,1996b; Grafton et al., 1996; Decety et al., 1997;Grèzes et al., 1998, 2003; Iacoboni et al., 1999;Decety and Chaminade, 2003). Grèzes et al. (1998)showed that the observation of meaningful but notthat of meaningless hand actions activates the leftinferior frontal gyrus (Broca’s region). Two furtherstudies have shown that observation of meaningfulhand-object interaction is more effective inactivating Broca’s area than observation of nongoal-directed movements (Hamzei et al., 2003;Johnson-Frey et al., 2003). Similar conclusionshave been reached also for mouth movementobservation (Campbell et al., 2001). In addition,direct evidence for an observation/executionmatching system has been recently provided bytwo experiments, one employing functionalmagnetic resonance imaging (fMRI) technique(Iacoboni et al., 1999), the other using event-related magnetoencephalography (MEG) (Nishitaniand Hari, 2000), that directly compared in the samesubjects action observation and action execution.

The evidence that Broca’s area is activatedduring time perception (Schubotz et al., 2000),calculation tasks (Gruber et al., 2001), harmonicincongruity perception (Maess et al., 2001), tonalfrequency discrimination (Muller et al., 2001),prediction of sequential patterns (Schubotz and vonCramon, 2002a) as well as during prediction ofincreasingly complex target motion (Schubotz andvon Cramon, 2002b), suggest that this area couldhave a central role in human representation ofsequential information in several different domains(Lieberman, 1991). This could be crucial for actionunderstanding, allowing the parsing of observedactions on the basis of the predictions of theiroutcomes. Others’ actions do not generate onlyvisually perceivable signals. Action-generatedsounds and noises are also very common in nature.In a very recent experiment Kohler et al. (2002)have found that 13% of the investigated F5neurons discharge both when the monkeyperformed a hand action and when it heard theaction-related sound. Moreover, most of theseneurons discharge also when the monkey observedthe same action demonstrating that these ‘audio-visual mirror neurons’ represent actions,independently of whether they are performed,heard or seen. The presence of an audio-motorresonance in a region that, in humans, is classicallyconsidered a speech-related area, prompts theLiberman’s hypothesis on the mechanism at thebasis of speech perception (motor theory of speechperception; Liberman et al., 1967; Liberman andMattingly, 1985; Liberman and Wahlen, 2000).This theory maintains that the ultimate constituentsof speech are not sounds but articulatory gestures

Hand and speech in Broca’s area 487

that have evolved exclusively at the service oflanguage. Speech perception and speech productionprocesses could thus use a common repertoire ofmotor primitives that, during speech production,are at the basis of articulatory gesture generation,and during speech perception are activated in thelistener as the result of an acoustically evokedmotor “resonance”. According to Liberman’s theory(Liberman et al., 1967; Liberman and Mattingly,1985; Liberman and Wahlen, 2000), the listenerunderstands the speaker when his/her articulatorygestures representations are activated by thelistening to verbal sounds. Although this theory isnot unanimously accepted, it offers a plausiblemodel of an action/perception cycle in the frame ofspeech processing.

To investigate if speech listening activateslistener’s motor representations, Fadiga et al.(2002) administered TMS on cortical tongue motorrepresentation, while subjects were listening tovarious verbal and non-verbal stimuli. Motorevoked potentials (MEPs) were recorded fromsubjects’ tongue muscles. Results showed thatduring listening of words formed by consonantsimplying tongue mobilization (i.e., Italian ‘R’ vs.‘F’) MEPs significantly increased. This indicatesthat when an individual listens to verbal stimuli,his/her speech related motor centers are specificallyactivated. Moreover, words-related facilitation wassignificantly larger than pseudo-words related one.

The presence of “audio-visual” mirror neuronsin the monkey and the presence of “speech-relatedacoustic motor resonance” in humans, suggeststhat, independently of the sensory nature of theperceived stimulus, the mirror-neuron resonantsystem retrieves from the action vocabulary (storedin the frontal cortex) the stimulus-related motorrepresentations. It is however unclear if theactivation of the motor system during speechlistening is causally related to speech perception, orif it is a mere epiphenomenon due, for example, toan automatic compulsion to repeat without any rolein speech processing. One experimental approachto answer this question could be to interfere withspeech perception by applying TMS on speech-related motor areas. Although classical theoriesconsider the inferior frontal gyrus as the “motorcenter” for speech production, cytoarchitectonicalhomologies with monkey area F5, and brainimaging and patients studies (among more recentpublications see Watkins and Paus, 2004; Dronkerset al., 2004, Wilson et al., 2004) suggest that thisregion may play a fundamental role in perceivedspeech processing. Broca’s area was thereforeselected as the best candidate for our study.

In order to investigate a possible role of Broca’s area in speech perception, both at the lexical and at the phonological level (Fadiga et al., 2002, showed that both these speech-related properties influence motor resonance), we selected a priming paradigm. Priming experiments, in general, demonstrate that whenever a word (target) is preceded by a somehow related word (prime), it is processed faster than when it is preceded by an unrelated word. The prime can therefore have either a semantic or a phonological relation with the target. Our starting aim was to test the possibility of modulating this facilitation by interfering with Broca’s activity through TMS. A magnetic stimulus delivered immediately after the listening of the prime, over a functionally related brain region, should impair prime processing, resulting in a modification of the priming effect. In our experiment we used the paradigm by Emmorey (1989), in which subjects are requested to perform a lexical decision on a target preceded by a rhyming or non-rhyming prime. By manipulating the lexical content of both the prime and the target stimuli (Emmorey, 1989, used only word primes), in addition to the rhyming effect we also tested the role of Broca’s area at the lexical level. Single-pulse TMS was administered to Broca’s region in 50% of the trials while subjects performed a lexical decision task on the target. Subjects had to respond by pressing one of two keys with their left index finger. TMS was administered during the 20 msec pause between prime and target acoustic presentation (interstimulus interval, ISI). The click of the stimulator never overlapped with the acoustic stimuli. The pairs of verbal stimuli could pertain to four categories, which differed for the presence of lexical content (words vs. pseudo-words) in the prime and in the target (Table I).

From the data analysis on trials without TMS (see Figure 1) an interesting (and unexpected) finding emerged: the lexical content of the stimuli modulates the phonological priming effect. No priming effect was found in the pseudo-word/pseudo-word condition, in which neither the target nor the prime was an element of the lexicon. In other words, in order to have a phonological effect it is necessary to have access to the lexicon. In trials during which TMS was delivered, a TMS-dependent effect was found only in pairs where the prime was a word and the target was a pseudo-word, and it consisted in the abolition of the phonological priming effect. Thus, TMS on Broca’s area made the word/pseudo-word pairs similar to the pseudo-word/pseudo-word ones.
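
To make the measure explicit for readers less familiar with priming designs, here is a purely illustrative Python sketch (it is not part of the published study, and all reaction times are invented placeholders): the phonological priming effect in a design of this kind is simply the difference between the mean reaction time to not-rhyming targets and the mean reaction time to rhyming targets, computed separately for each prime/target category, with and without TMS.

# Hypothetical mean reaction times (msec); keys are
# (prime-target category, TMS delivered?, rhyming prime?).
mean_rt = {
    ("W-W",   False, True): 820, ("W-W",   False, False): 870,
    ("W-PW",  False, True): 900, ("W-PW",  False, False): 950,
    ("PW-W",  False, True): 840, ("PW-W",  False, False): 890,
    ("PW-PW", False, True): 960, ("PW-PW", False, False): 965,
    ("W-W",   True,  True): 830, ("W-W",   True,  False): 880,
    ("W-PW",  True,  True): 945, ("W-PW",  True,  False): 950,
    ("PW-W",  True,  True): 850, ("PW-W",  True,  False): 895,
    ("PW-PW", True,  True): 958, ("PW-PW", True,  False): 962,
}

def priming_effect(category, tms):
    """Phonological priming effect = RT(not-rhyming) - RT(rhyming)."""
    return mean_rt[(category, tms, False)] - mean_rt[(category, tms, True)]

for category in ("W-W", "W-PW", "PW-W", "PW-PW"):
    for tms in (False, True):
        label = "TMS" if tms else "no TMS"
        print(f"{category:6s} {label:7s} priming effect: {priming_effect(category, tms):3d} msec")

A positive value indicates facilitation by the rhyming prime; in the pattern described in the text, the word/pseudo-word effect would shrink towards zero when TMS is delivered.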

This finding suggests that the stimulation of Broca’s region might have affected the priming effect not because it interferes with phonological processing but because it interferes with the lexical categorization of the prime.

TABLE I

Example of the stimuli used in the experiment

                          Rhyming                             Not-rhyming
Word/word                 zucca (pumpkin) – mucca (cow)       fiume (river) – scuola (school)
Word/pseudo-word          freno (brake) – preno               strada (street) – terto
Pseudo-word/word          losse – tosse (cough)               stali – letto (bed)
Pseudo-word/pseudo-word   polta – solta                       brona – dasta

Fig. 1 – Reaction times (RTs ± standard error of mean, SEM, in msec) for the lexical decision during the phonological priming task without (left panel) and with (right panel) transcranial magnetic stimulation (TMS) administration. White bars: conditions in which prime and target share a rhyme. Black bars: no rhyme. An asterisk on the black bar indicates the presence (p < .05, Newman-Keuls test) of a phonological priming effect (response to the rhyming target faster than response to the not-rhyming target) in the corresponding condition. TMS administration did not influence the accuracy of the participants, which was always close to 100%. W-W: prime-word/target-word; W-PW: prime-word/target-pseudo-word; PW-W: prime-pseudo-word/target-word; PW-PW: prime-pseudo-word/target-pseudo-word.

In support of this interpretation are recent results from Blumstein et al. (2000), who found that Broca’s aphasics display deficits in the facilitation of lexical decision targets by prime words that rhyme with the target. In contrast, Wernicke’s aphasics showed a pattern of results similar to that of normal subjects. Moreover, Milberg et al. (1988), in a phonological distortion study, showed that Broca’s aphasics failed to show semantic priming when the phonological form of the prime stimulus was distorted. The authors interpreted this finding within the framework of the hypothesis that Broca’s aphasics have reduced lexical activation levels (Utman et al., 2001). As a result, while in normal subjects an acoustically degraded input is able to activate the lexical representation, in aphasics it fails to reach a sufficient level of activation. However, there is evidence that Broca’s aphasics have impaired lexical access even in response to intact acoustic inputs (Milberg et al., 1988).

The results of our TMS experiment on phonological priming, together with the data on patients reported above, lead to the conclusion that Broca’s region is not the main structure responsible for the acoustic motor resonance effect shown by Fadiga et al. (2002). This effect was in fact present during listening to both words and pseudo-words and was only partially related to the lexical properties of the heard stimuli. The localization of the premotor area involved in such a “low level” motor resonance will be the subject of our future experimental work.

CONCLUSIONS

In the present paper we discuss the possibility that the activation of Broca’s region during speech processing, more than indicating a specific role of this area, may reflect its general involvement in meaningful action recognition. This possibility is founded on the observation that, in addition to speech-related activation, this area is activated during the observation of meaningful hand or mouth actions. Speech represents a particular case of this general framework: among meaningful actions, phono-articulatory gestures are meaningful actions conveying words. This hypothesis is moreover supported by the observation that Broca’s aphasics, in addition to speech production deficits, show impaired access to the lexicon (although only for some categories of verbal stimuli). The consideration that Broca’s area is the human homologue of the monkey mirror-neuron area opens the possibility that human language may have evolved from an ancient ability to recognize actions performed by others, whether visually or acoustically perceived. Liberman’s intuition (Liberman et al., 1967; Liberman and Mattingly, 1985; Liberman and Whalen, 2000) that the ultimate constituents of speech are not sounds but articulatory gestures that have evolved exclusively at the service of language seems to us a good way to consider speech processing in the more general context of action recognition.

Acknowledgements. The research presented in this paper is supported within the framework of the European Science Foundation EUROCORES program “The Origin of Man, Language and Languages”, by the EC “MIRROR” Contract and by Italian Ministry of Education grants.

REFERENCES

BINKOFSKI F, AMUNTS K, STEPHAN KM, POSSE S, SCHORMANN T, FREUND HJ, ZILLES K and SEITZ RJ. Broca’s region subserves imagery of motion: A combined cytoarchitectonic and fMRI study. Human Brain Mapping, 11: 273-285, 2000.

BINKOFSKI F, BUCCINO G, POSSE S, SEITZ RJ, RIZZOLATTI G and FREUND H-J. A fronto-parietal circuit for object manipulation in man: Evidence from an fMRI study. European Journal of Neuroscience, 11: 3276-3286, 1999a.

BINKOFSKI F, BUCCINO G, STEPHAN KM, RIZZOLATTI G, SEITZ RJ and FREUND H-J. A parieto-premotor network for object manipulation: Evidence from neuroimaging. Experimental Brain Research, 128: 21-31, 1999b.

BLUMSTEIN SE, MILBERG W, BROWN T, HUTCHINSON A, KUROWSKI K and BURTON MW. The mapping from sound structure to the lexicon in aphasia: Evidence from rhyme and repetition priming. Brain and Language, 72: 75-99, 2000.

CAMPBELL R, MACSWEENEY M, SURGULADZE S, CALVERT G, MCGUIRE P, SUCKLING J, BRAMMER MJ and DAVID AS. Cortical substrates for the perception of face actions: An fMRI study of the specificity of activation for seen speech and for meaningless lower-face acts (gurning). Cognitive Brain Research, 12: 233-243, 2001.

DECETY J and CHAMINADE T. Neural correlates of feeling sympathy. Neuropsychologia, 41: 127-138, 2003.

DECETY J, GREZES J, COSTES N, PERANI D, JEANNEROD M, PROCYK E, GRASSI F and FAZIO F. Brain activity during observation of actions: Influence of action content and subject’s strategy. Brain, 120: 1763-1777, 1997.

DI PELLEGRINO G, FADIGA L, FOGASSI L, GALLESE V and RIZZOLATTI G. Understanding motor events: A neurophysiological study. Experimental Brain Research, 91: 176-180, 1992.

DRONKERS NF, WILKINS DP, VAN VALIN RD JR, REDFERN BB and JAEGER JJ. Lesion analysis of the brain areas involved in language comprehension. Cognition, 92: 145-177, 2004.

EMMOREY KD. Auditory morphological priming in the lexicon. Language and Cognitive Processes, 4: 73-92, 1989.

FADIGA L, CRAIGHERO L, BUCCINO G and RIZZOLATTI G. Speech listening specifically modulates the excitability of tongue muscles: A TMS study. European Journal of Neuroscience, 15: 399-402, 2002.

FADIGA L, FOGASSI L, PAVESI G and RIZZOLATTI G. Motor facilitation during action observation: A magnetic stimulation study. Journal of Neurophysiology, 73: 2608-2611, 1995.

GALLESE V, FADIGA L, FOGASSI L and RIZZOLATTI G. Action recognition in the premotor cortex. Brain, 119: 593-609, 1996.

GERARDIN E, SIRIGU A, LEHÉRICY S, POLINE J-B, GAYMARD B, MARSAULT C and AGID Y. Partially overlapping neural networks for real and imagined hand movements. Cerebral Cortex, 10: 1093-1104, 2000.

GERLACH C, LAW I, GADE A and PAULSON OB. The role of action knowledge in the comprehension of artefacts: A PET study. NeuroImage, 15: 143-152, 2002.

GRAFTON ST, ARBIB MA, FADIGA L and RIZZOLATTI G. Localization of grasp representations in humans by PET: 2. Observation compared with imagination. Experimental Brain Research, 112: 103-111, 1996.

GRÈZES J, ARMONY JL, ROWE J and PASSINGHAM RE. Activations related to “mirror” and “canonical” neurones in the human brain: An fMRI study. NeuroImage, 18: 928-937, 2003.

GRÈZES J, COSTES N and DECETY J. Top-down effect of strategy on the perception of human biological motion: A PET investigation. Cognitive Neuropsychology, 15: 553-582, 1998.

GRÈZES J and DECETY J. Does visual perception of object afford action? Evidence from a neuroimaging study. Neuropsychologia, 40: 212-222, 2002.

GRUBER O, INDEFREY P, STEINMETZ H and KLEINSCHMIDT A. Dissociating neural correlates of cognitive components in mental calculation. Cerebral Cortex, 11: 350-359, 2001.

HAMZEI F, RIJNTJES M, DETTMERS C, GLAUCHE V, WEILLER C and BUCHEL C. The human action recognition system and its relationship to Broca’s area: An fMRI study. NeuroImage, 19: 637-644, 2003.

IACOBONI M, WOODS R, BRASS M, BEKKERING H, MAZZIOTTA JC and RIZZOLATTI G. Cortical mechanisms of human imitation. Science, 286: 2526-2528, 1999.

JOHNSON-FREY SH, MALOOF FR, NEWMAN-NORLUND R, FARRER C, INATI S and GRAFTON ST. Actions or hand-object interactions? Human inferior frontal cortex and action observation. Neuron, 39: 1053-1058, 2003.

KELLENBACH ML, BRETT M and PATTERSON K. Actions speak louder than functions: The importance of manipulability and action in tool representation. Journal of Cognitive Neuroscience, 15: 30-45, 2003.

KOHLER E, KEYSERS C, UMILTÀ MA, FOGASSI L, GALLESE V and RIZZOLATTI G. Hearing sounds, understanding actions: Action representation in mirror neurons. Science, 297: 846-848, 2002.

LACQUANITI F, PERANI D, GUIGNON E, BETTINARDI V, CARROZZO M, GRASSI F, ROSSETTI Y and FAZIO F. Visuomotor transformations for reaching to memorized targets: A PET study. NeuroImage, 5: 129-146, 1997.

LIBERMAN AM, COOPER FS, SHANKWEILER DP and STUDDERT-KENNEDY M. Perception of the speech code. Psychological Review, 74: 431-461, 1967.

LIBERMAN AM and MATTINGLY IG. The motor theory of speech perception revised. Cognition, 21: 1-36, 1985.

LIBERMAN AM and WHALEN DH. On the relation of speech to language. Trends in Cognitive Sciences, 4: 187-196, 2000.

LIEBERMAN P. Speech and brain evolution. Behavioral and Brain Sciences, 14: 566-568, 1991.

LIOTTI M, GAY CT and FOX PT. Functional imaging and language: Evidence from positron emission tomography. Journal of Clinical Neurophysiology, 11: 175-190, 1994.

MAESS B, KOELSCH S, GUNTER TC and FRIEDERICI AD. Musical syntax is processed in Broca’s area: An MEG study. Nature Neuroscience, 4: 540-545, 2001.

MATELLI M, LUPPINO G and RIZZOLATTI G. Patterns of cytochrome oxidase activity in the frontal agranular cortex of the macaque monkey. Behavioral Brain Research, 18: 125-137, 1985.

MATSUMURA M, KAWASHIMA R, NAITO E, SATOH K, TAKAHASHI T, YANAGISAWA T and FUKUDA H. Changes in rCBF during grasping in humans examined by PET. NeuroReport, 7: 749-752, 1996.

MECKLINGER A, GRUENEWALD C, BESSON M, MAGNIÉ M-N and VON CRAMON DY. Separable neuronal circuitries for manipulable and non-manipulable objects in working memory. Cerebral Cortex, 12: 1115-1123, 2002.

MILBERG W, BLUMSTEIN S and DWORETZKY B. Phonological processing and lexical access in aphasia. Brain and Language, 34: 279-293, 1988.

MULLER R-A, KLEINHANS N and COURCHESNE E. Broca’s area and the discrimination of frequency transitions: A functional MRI study. Brain and Language, 76: 70-76, 2001.

NEGAWA T, MIZUNO S, HAHASHI T, KUWATA H, TOMIDA M, HOSHI H, ERA S and KUWATA K. M pathway and areas 44 and 45 are involved in stereoscopic recognition based on binocular disparity. Japanese Journal of Physiology, 52: 191-198, 2002.

NISHITANI N and HARI R. Temporal dynamics of cortical representation for action. Proceedings of the National Academy of Sciences, 97: 913-918, 2000.

PAPATHANASSIOU D, ETARD O, MELLET E, ZAGO L, MAZOYER B and TZOURIO-MAZOYER N. A common language network for comprehension and production: A contribution to the definition of language epicenters with PET. NeuroImage, 11: 347-357, 2000.

PETRIDES M and PANDYA DN. Comparative architectonic analysis of the human and the macaque frontal cortex. In Boller F and Grafman J (Eds), Handbook of Neuropsychology (vol. IX). New York: Elsevier, 1997.

RANGANATH C, JOHNSON MK and D’ESPOSITO M. Prefrontal activity associated with working memory and episodic long-term memory. Neuropsychologia, 41: 378-389, 2003.

RIZZOLATTI G, CAMARDA R, FOGASSI L, GENTILUCCI M, LUPPINO G and MATELLI M. Functional organization of inferior area 6 in the macaque monkey: II. Area F5 and the control of distal movements. Experimental Brain Research, 71: 491-507, 1988.

RIZZOLATTI G, FADIGA L, GALLESE V and FOGASSI L. Premotor cortex and the recognition of motor actions. Cognitive Brain Research, 3: 131-141, 1996a.

RIZZOLATTI G, FADIGA L, MATELLI M, BETTINARDI V, PAULESU E, PERANI D and FAZIO F. Localization of grasp representation in humans by PET: 1. Observation versus execution. Experimental Brain Research, 111: 246-252, 1996b.

SCHUBOTZ RI, FRIEDERICI AD and VON CRAMON DY. Time perception and motor timing: A common cortical and subcortical basis revealed by fMRI. NeuroImage, 11: 1-12, 2000.

SCHUBOTZ RI and VON CRAMON DY. Predicting perceptual events activates corresponding motor schemes in lateral premotor cortex: An fMRI study. NeuroImage, 15: 787-796, 2002a.

SCHUBOTZ RI and VON CRAMON DY. A blueprint for target motion: fMRI reveals perceived sequential complexity to modulate premotor cortex. NeuroImage, 16: 920-935, 2002b.

STRAFELLA AP and PAUS T. Modulation of cortical excitability during action observation: A transcranial magnetic stimulation study. NeuroReport, 11: 2289-2292, 2000.

UTMAN JA, BLUMSTEIN SE and SULLIVAN K. Mapping from sound to meaning: Reduced lexical activation in Broca’s aphasics. Brain and Language, 79: 444-472, 2001.

WATKINS K and PAUS T. Modulation of motor excitability during speech perception: The role of Broca’s area. Journal of Cognitive Neuroscience, 16: 978-987, 2004.

WILSON SM, SAYGIN AP, SERENO MI and IACOBONI M. Listening to speech activates motor areas involved in speech production. Nature Neuroscience, 7: 701-702, 2004.

Luciano Fadiga, Faculty of Medicine, Department of Biomedical Sciences, Section of Human Physiology, University of Ferrara, via Fossato di Mortara 17/19, 44100 Ferrara, Italy. e-mail: [email protected]


This chapter is dedicated in memory of Massimo Matelli.

Broca’s region: a speech area?

Luciano Fadiga, Laila Craighero, Alice Roy

Department of Biomedical Sciences, Section of Human Physiology

University of Ferrara - Italy

Introduction

Since its first description in the 19th Century, Broca’s area, largely coincident with

Brodmann areas 44/45 (pars opercularis and pars triangularis of the inferior frontal gyrus

(IFG), see Amunts et al., 1999) has represented one of the most challenging areas of the

human brain. Its discoverer, the French neurologist Paul Broca (1861), strongly claimed that

a normal function of this area is fundamental for the correct functioning of verbal

communication. In his view this area, which he considered a motor area, contains a

“memory” of the movements necessary to articulate words. Several of Broca’s colleagues,

however, argued against his interpretation (e.g. Henry C. Bastian, who considered Broca’s

area a sensory area devoted to tongue proprioception), and perhaps the most famous

among them, Paul Flechsig, postulated for Broca’s area a role neither motor nor sensory

(see Mingazzini, 1913). According to his schema, drawn to explain the paraphasic

expression of sensory aphasia, Broca’s region is driven by Wernicke’s area and, due to its

strong connections to the inferior part of the precentral gyrus (gyrus frontalis ascendens), it

recruits and coordinates the motor elements necessary to produce words articulation. It is

important to stress here that the whole set of knowledge on the function of Broca’s area

possessed by the neurologists of the 19th Century, derived from the study of the correlation

between functional impairment and brain lesions, as assessed post-mortem by

neuropathological investigations.

The neurosurgeon Wilder Penfield was the first who experimentally demonstrated the

involvement of Broca’s region in speech production. By electrically stimulating the frontal

lobe in awake patients undergoing brain surgery for intractable epilepsy, he collected dozens

of cases and firstly reported that the stimulation of the inferior frontal gyrus evoked the arrest

of ongoing speech, although with some individual variability. The coincidence between the

focus of the Penfield effect and the location of the Broca’s area, was a strongly convincing

argument in favour of the motor role of this region (Penfield and Roberts, 1959).


Apart from some pioneering investigations of speech-related evoked potentials (Ertl

and Schafer, 1967; Lelord et al., 1973), the true scientific revolution in the study of verbal

communication was represented by the discovery of brain imaging techniques, such as

positron emission tomography (PET), functional magnetic resonance (fMRI) and

magnetoencephalography (MEG). This is mainly because the evoked potential technique, although

very fast in detecting neuronal responses, was definitely unable to localize brain activation

with enough spatial resolution (the situation has changed now with high-resolution EEG). As

soon as positron emission tomography became available, a series of independent studies on

the brain correlates of the verbal function demonstrated the involvement of Broca’s region

during generation of speech (for a review of early studies see Liotti et al., 1994). At the same

time, however, the finding that Broca’s region was activated also during speech perception

became more and more accepted (see Papathanassiou et al., 2000 for review). This finding

represents in our view the second revolution in the neuroscience of speech. Data coming

from cortical stimulation of collaborating patients undergoing neurosurgery confirmed these

observations. According to Schaffler et al. (1993), the electrical stimulation of Broca’s

area, in addition to the speech production interference originally shown by Penfield, produced

also comprehension deficits, particularly evident in the case of “complex auditory verbal

instructions and visual semantic material”. According to the same group, while Broca’s region

is specifically involved in speech production, Wernicke’s AND Broca’s areas both participate

in speech comprehension (Schaffler et al., 1993). This double-faced role of Broca’s area is

now widely accepted, although with different functional interpretations. An in-depth

discussion of this debate is however beyond the scope of the present chapter, but readers will

find details on it in other contributions to the present book.

The perceptual involvement of Broca’s area does not seem to be restricted to speech.

Since early ‘70s several groups have shown a strict correlation between frontal aphasia and

impairment in gestures/pantomimes recognition (Gainotti and Lemmo, 1976; Duffy and Duffy,

1975; Daniloff et al., 1982; Glosser et al., 1986; Duffy and Duffy, 1981; Bell, 1994). It is often

unclear, however, whether this relationship between aphasia and gestures recognition

deficits is due to Broca’s area lesion only or if it depends on the involvement of other,

possibly parietal, areas. In fact, it is a common observation that aphasic patients are

frequently affected by ideomotor apraxia too (see Goldenberg, 1996).

This chapter will present data and theoretical framework supporting a new

interpretation of the role played by Broca’s area. In a first part we will briefly review a series

of recent brain imaging studies which report, among others, the activation of area 44/45. In

consideration of the large variety of experimental paradigms inducing such activation, we will

make an interpretative effort by presenting neurophysiological data from the monkey


homologue of BA44/BA45. Finally we will report electrophysiological data on humans which

connect speech perception to the more general framework of others’ action understanding.

Does only speech activate Broca’s area?

In this section we will present recent brain imaging experiments reporting Broca’s area

activation. Experiments will be grouped according to the general cognitive function they

explore.

Memory and attention.

Brain imaging studies aiming at identifying the neuronal substrate of the working

memory have repeatedly observed activations of Broca’s area. However, these results

should be taken cautiously, as most of the experimental tasks used verbal stimuli and did not

allow a clear disambiguation of the role of BA44 in memory processes from its well-known

role in language processes. Although Broca’s area is considered the neuronal substrate of the

phonological loop, a component of the working memory system, few memory studies have

highlighted its possible contribution to “pure” memory processes. Mecklinger and

colleagues (2002) recently reported the activation of BA44 during a delayed match-to-sample

task in which subjects were required to match the orientation of non-manipulable objects.

When tested with pictures of manipulable objects the activation shifted caudally to the left

ventral premotor cortex. While in this study BA44 was mainly associated with the encoding

delay, Ranganath and co-workers (2003) failed to demonstrate any preferential activation of

Broca’s area for the encoding or, conversely, the retrieval phases. Furthermore the authors

found the activation of Broca’s area both during a long-term memory task and a working

memory task, questioning thus a specific involvement in the working memory system.

Interestingly, the stimuli used in this experiment were pictures of faces and, as we will review

later, Broca’s area seems particularly responsive to facial stimuli. Moreover, there is a series

of papers by Ricarda Schubotz and Yves von Cramon in which they investigate non-motor

and non-language functions of the premotor cortex (for review see Schubotz and von

Cramon, 2003). According to this work the premotor cortex is also involved in prospective

attention to sensory events and in processing serial prediction tasks.

Arithmetics and calculation.

If the role of Broca’s area in non-verbal memory remains to be confirmed, its potential

participation in calculation tasks is equally confusing. Arithmetic studies face two

problems that could both account for the activation of BA44: working memory sub-

processes and the presence of covert speech (Delazer et al, 2003; Menon et al, 2000;


Rickard et al, 2000). In this respect, the study of Gruber et al. (2001) is interesting, as they

carefully controlled for covert speech but still found an activation of Broca’s area.

Moreover they compared a simple calculation task to a compound one and once again

observed an activation of Broca’s area. Although the verbal working memory hypothesis

cannot be ruled out, these authors nevertheless strengthened the case for an involvement of

Broca’s area in processing meaningful symbolic operations and applying calculation rules.

Music.

Playing with rules seems to be part of the cognitive attributions of BA44, as repeatedly

observed during tasks manipulating syntactic rules. In an elegant study, Maess and

colleagues (2001) have further extended these observations to musical syntax. Indeed, the

predictability of harmonics and the rules underlying musical organization have been compared

to language syntax (Patel et al, 1998; Bharucha and Krumhansl, 1983). By inserting

unexpected harmonics Maess and co-workers (2001) created a sort of musical syntactic

violation. Using magnetoencephalography (MEG), they studied the neuronal counterpart of

hearing a harmonic incongruity; as expected on the basis of a previous experiment, they found an

early right anterior negativity, a component that had already been associated with harmonic

violations (Koelsch et al, 2000). The source of the activity pointed to BA44, bilaterally.

However the story is not that simple: in addition to a participation in high-order processes,

Broca’s area takes part in lower-order processes such as tonal frequency discrimination

(Muller et al, 2001) or binocular disparity (Negawa et al, 2002).

Motor-related functions.

Excluding the linguistic field, another important contribution of BA44 is certainly found

in the motor domain and motor-related processes. Gerlach et al (2002) asked subjects to

perform a categorization task between natural and man-made objects and found an

activation of BA44 extending caudally to BA6 for artefacts only. The authors proposed that

categorization might rely on motor-based knowledge, artefacts being more closely linked with

hand actions than natural objects. Distinguishing between manipulable and non-manipulable

objects, Kellenbach and colleagues (2003) found a stronger activation of BA44 when

subjects were required to answer a question concerning the action evoked by manipulable

objects. However, performing a categorization task or answering a question, even by way

of a button press, is likely to evoke covert speech. Against this criticism are several studies

reporting a significant activation of BA44 during execution of distal movements such as

grasping (Binkofski et al, 1999ab; Gerardin et al, 2000; Grezes et al, 2003; Hamzei et al,

2003; Lacquaniti et al, 1997; Matsumura et al, 1996; Nishitani and Hari, 2000). Moreover, the


activation of BA44 is not restricted to motor execution but extends to motor imagery

(Binkofski et al, 2000; Gerardin et al, 2000; Grezes and Decety, 2002).

How can we solve the puzzle?

In order to understand how the human brain works, usually neuroscientists try to define

which human areas are morphologically closer to electrophysiologically characterized

monkey ones. In our case, to navigate through this impressive amount of experimental data,

we will perform the reverse operation by stepping down the evolutionary scale in order to

examine the functional properties of the homologue of BA44 in our “progenitors”. From a

cytoarchitectonical point of view (Petrides and Pandya, 1997), the monkey’s frontal area

which closely resembles human Broca’s region is an agranular/dysgranular premotor area

(area F5 as defined by Matelli et al. 1985) (see Rizzolatti et al., 2002). We will therefore

examine the functional properties of this area by reporting the results of experiments aiming

to find a behavioral correlate to single neuron responses.

Motor properties of the monkey homologue of human Broca’s area

Area F5 forms the rostral part of inferior area 6 (Fig. 1).

Figure 1. Lateral view of monkey left hemisphere. Area F5 is buried inside the arcuate sulcus (posterior bank)

and emerges on the convexity immediately posterior to it. Area F5 is bidirectionally connected with the inferior

parietal lobule (areas AIP, anterior intraparietal, PF and PFG) and represents the monkey homologue of human

Broca’s area (Petrides and Pandya, 1997). Area F5 also sends some direct connections to hand/mouth


representations of primary motor cortex (area F1) and to the cervical enlargement of the spinal cord. This last

evidence definitely demonstrates its motor nature.

Microstimulation (Hepp-Reymond et al., 1994) and single neuron studies (see Rizzolatti

et al. 1988) show that hand and mouth movements are represented in area F5. The two

representations tend to be spatially segregated: while hand movements are mostly

represented in the dorsal part of area F5, mouth movements are mostly located in its ventral

part. Although not much is known about the functional properties of “mouth” neurons, the

properties of “hand” neurons have been extensively investigated. Rizzolatti et al. (1988)

recorded single neuron activity in monkeys trained to grasp objects of different size and

shape. The specificity of the goal seems to be an essential prerequisite in activating these

neurons. The same neurons that discharge during grasping, holding, tearing, manipulating,

are silent when the monkey performs actions that involve a similar muscular pattern but with

a different goal (e.g. grasping to put away, scratching, grooming, etc.). Further evidence in

favor of such a goal representation is given by F5 neurons that discharge when the monkey

grasps an object with its right hand or with its left hand (Figure 2). This observation suggests

that some F5 premotor neurons are capable of generalizing the goal, independently of the

acting effector. Using the action effective in triggering neuron’s discharge as classification

criterion, F5 neurons can be subdivided into several classes. Among them, the most

common are "grasping", "holding", "tearing", and "manipulating" neurons. Grasping neurons

form the most represented class in area F5. Many of them are selective for a particular type

of prehension such as precision grip, finger prehension, or whole hand prehension. In

addition, some neurons show specificity for different finger configurations, even within the

same grip type. Thus, the prehension of a large spherical object (whole hand prehension,

requiring the opposition of all fingers) is coded by neurons different from those coding the

prehension of a cylinder (still whole hand prehension but performed with the opposition of the

last four fingers and the palm of the hand). The temporal relation between grasping

movement and neuron discharge varies from neuron to neuron. Some neurons become

active during the initial phase of the movement (opening of the hand), some discharge during

hand closure, and others discharge during the entire grasping movement from the beginning

of fingers opening until their contact with the object.


Figure 2. Example of F5 motor neuron. In each panel eight trials are represented (rasters in the uppermost part)

together with the sum histogram (lowermost part). Trials are aligned with the moment at which the monkey

touches the target (vertical line across rasters and histograms). Ordinates: spikes/second; Abscissae: 20 ms bins.


Taken together, the functional properties of F5 neurons suggest that this area stores a

set of motor schemata (Arbib, 1997) or, as it was previously proposed (Rizzolatti and

Gentilucci, 1988), contains a "vocabulary" of motor acts. The "words" composing this

vocabulary are constituted by populations of neurons. Some of them indicate the general

category of an action (hold, grasp, tear, manipulate). Others specify the effectors that are

appropriate for that action. Finally, a third group is concerned with the temporal segmentation

of the actions. What differentiates F5 from the primary motor cortex (M1, BA4) is that while

F5 motor schemata code for goal-directed actions (or fragments of specific actions), the

primary motor cortex represents movements, which are independent of the action

context in which they are used. In comparison with F5, M1 could therefore be defined as a

"vocabulary of movements".
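
Because this deliverable is produced within a humanoid-robotics project, it may be useful to note, as a conceptual sketch only, how such a “vocabulary of motor acts” could be encoded in software (the Python class and field names below are invented for illustration and are not proposed in the chapter): each “motor word” specifies a goal, the means and effector used to achieve it, and the temporal segment of the act it covers, in contrast with an M1-like list of context-free movements.

from dataclasses import dataclass
from typing import Optional

@dataclass
class MotorWord:
    goal: str            # general category of the act: "grasp", "hold", "tear", ...
    grip: Optional[str]  # means, e.g. "precision grip" or "whole hand prehension"
    effector: str        # "hand" or "mouth"
    phase: str           # temporal segment: "opening", "closure", "whole act"

# A few entries of an F5-like vocabulary of goal-directed acts; an M1-like
# "vocabulary of movements" would instead list context-free movements.
f5_vocabulary = [
    MotorWord("grasp", "precision grip", "hand", "closure"),
    MotorWord("grasp", "whole hand prehension", "hand", "whole act"),
    MotorWord("hold", None, "hand", "whole act"),
    MotorWord("tear", None, "hand", "whole act"),
]

def retrieve(goal: str) -> list:
    """Select the motor words matching an observed or intended goal."""
    return [w for w in f5_vocabulary if w.goal == goal]

print(retrieve("grasp"))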

Visuomotor properties of the monkey homologue of human Broca’s area

All F5 neurons share similar motor properties. In addition to their motor discharge,

however, several F5 neurons discharge also to the presentation of visual stimuli (visuomotor

neurons). Two radically different categories of visuomotor neurons are present in area F5:

Neurons of the first category discharge when the monkey observes graspable objects

(“canonical” F5 neurons, Rizzolatti et al., 1988; Rizzolatti and Fadiga, 1998). Neurons of the

second category discharge when the monkey observes another individual making an action

in front of it (di Pellegrino et al., 1992, Gallese et al., 1996; Rizzolatti et al., 1996a). Because of these


peculiar “resonant” properties, neurons belonging to the second category have been named

“mirror” neurons (Gallese et al., 1996). The two categories of F5 neurons are located in two

different sub-regions of area F5: canonical neurons are mainly found in that sector of area F5

buried inside the arcuate sulcus, whereas mirror neurons are almost exclusively located in

the cortical convexity of F5.

Recently, the visual responses of canonical neurons have been re-examined using a

formal behavioral paradigm, which made it possible to test the response separately during object

observation, during the waiting phase between object presentation and movement onset,

and during movement execution (Murata et al., 1997). The results showed that among the

canonical neurons recorded in area F5, two thirds were selective for one or a few specific

objects. When visual and motor properties of F5 object observation neurons are compared, it

becomes clear that there is a strict congruence between the two types of responses.

Neurons that become active when the monkey observes small size objects, discharge also

during precision grip. On the contrary, neurons selectively active when the monkey looks at a

large object, discharge also during actions directed towards large objects (e.g. whole hand

prehension). The most likely interpretation for visual discharge in these visuomotor neurons

is that, at least in adult individuals, there is a close link between the most common 3D stimuli

and the actions necessary to interact with them. Thus, every time a graspable object is

visually presented, the related F5 neurons are addressed and the action is "automatically"

evoked. Under certain circumstances, it guides the execution of the movement, under others,

it remains an unexecuted representation of it, that might be used also for semantic

knowledge.

Mirror neurons, that become active when the monkey acts on an object and when it

observes another monkey or the experimenter making a similar goal-directed action, appear

to be identical to canonical neurons in terms of motor properties, but they radically differ from

them as far as visual properties are concerned (Rizzolatti and Fadiga, 1998). In order to be

triggered by visual stimuli, mirror neurons require an interaction between a biological effector

(hand or mouth) and an object. The sight of an object alone, that of an agent mimicking an

action, and that of an individual making intransitive (non-object-directed) gestures are all

ineffective. The significance of the object for the monkey has no obvious influence on the mirror neuron

response. Grasping a piece of food or a geometric solid produces responses of the same

intensity.

Mirror neurons show a large degree of generalization. Largely different visual stimuli,

but representing the same action, are equally effective. For example, the same grasping

mirror neuron that responds to a human hand grasping an object, responds also when the

grasping hand is that of a monkey. Similarly, the response is, typically, not affected if the

action is done near or far from the monkey, in spite of the fact that the size of the observed


hand is obviously different in the two conditions. It is also of little importance for neuron

activation if the observed action is eventually rewarded. The discharge is of the same

intensity if the experimenter grasps the food and gives it to the recorded monkey or to

another monkey, introduced in the experimental room. The observed actions which most

commonly activate mirror neurons are grasping, placing, manipulating, holding.

Most mirror neurons respond selectively to only one type of action (e.g. grasping).

Some are highly specific, coding not only the type of action, but also how that action is

executed. They fire, for example, during observation of grasping movements, but only when

the object is grasped with the index finger and the thumb. Typically, mirror neurons show

congruence between the observed and executed action. According to the type of congruence

they exhibit, mirror neurons have been subdivided into “strictly congruent” and “broadly

congruent” neurons (Gallese et al. 1996). Mirror neurons in which the effective observed and

effective executed actions correspond in terms of goal (e.g. grasping) and means for

reaching the goal (e.g. precision grip) have been classed as “strictly congruent”. They

represent about one third of F5 mirror neurons. Mirror neurons that, in order to be triggered,

do not require the observation of exactly the same action that they code motorically, have

been classed as “broadly congruent”. They represent about two-thirds of F5 mirror neurons.

The early studies of mirror neurons concerned essentially the upper sector of F5 where

hand actions are mostly represented. Recently, a study was carried out on the properties of

neurons located in the lateral part of F5 (Ferrari et al. 2003), where, in contrast, most

neurons are related to mouth actions. The results showed that about 25% of studied neurons

have mirror properties. According to the visual stimuli effective in triggering the neurons, two

classes of mouth mirror neurons were distinguished: ingestive and communicative mirror

neurons. Ingestive mirror neurons respond to the observation of actions related to ingestive

functions, such as grasping food with the mouth, breaking it, or sucking. Neurons of this

class form about 80% of the recorded mouth mirror neurons. Virtually all

ingestive mirror neurons show a good correspondence between the effective observed and

the effective executed action. In about one third of them, the effective observed and

executed actions are virtually identical (strictly congruent neurons); in the remaining ones the

effective observed and executed actions are similar or functionally related (broadly congruent

neurons). More intriguing are the properties of the communicative mirror neurons. The most

effective observed action is for them a communicative gesture such as, for example, lip

smacking. However, like the ingestive mirror neurons, they discharge strongly when the

monkey actively performs an ingestive action.

It seems plausible that the visual responses of both canonical and mirror neurons

address the same motor vocabulary, the words of which constitute the monkey motor

repertoire. What is different is the way in which “motor words” are selected: in the case of


canonical neurons they are selected by object observation, in the case of mirror neurons by

the sight of an action. Thus, in the case of canonical neurons, vision of graspable objects

activates the motor representations more appropriate to interact with those objects. In the

case of mirror neurons, objects alone are no longer sufficient to evoke a premotor discharge:

what is necessary is a visual stimulus depicting a goal-directed hand action, in which both

an acting hand and a target must be present.

Summarizing the evidence presented above, the monkey precursor of human Broca’s

area is a premotor area, representing hand and mouth goal-directed actions, provided with

strong visual inputs coming from the inferior parietal lobule. These visual inputs originate in

distinct parietal areas and convey distinct visual information: 1) object-related information,

used by canonical neurons to motorically categorize objects and to organize the hand-object

interaction; 2) action-related information driving the response of mirror neurons during

observation of action made by others.

Which human areas are activated by action observation?

The existence of a mirror-neuron system in humans was first demonstrated by

electrophysiological experiments. The first, pioneering demonstration of a “visuomotor

resonance” was reported by Gastaut and Bert (1954) and Cohen-Seat et al. (1954).

These authors showed that the observation of actions made by humans exerts a

desynchronizing effect on the EEG recorded over motor areas, similar to that exerted by

actual movements. Recently, more specific evidence in favor of the existence of a human

mirror system arose from transcranial magnetic stimulation (TMS) studies of cortical

excitability. Fadiga et al. (1995) stimulated the left motor cortex of normal subjects using

TMS while they were observing meaningless intransitive arm movements as well as hand

grasping movements performed by an experimenter. Motor evoked potentials (MEPs) were

recorded from various arm and hand muscles. The rationale of the experiment was the

following: if the mere observation of the hand and arm movements facilitates the motor

system, this facilitation should determine an increase of MEPs recorded from hand and arm

muscles. The results confirmed the hypothesis. A selective increase of motor evoked

potentials was found in those muscles that the subjects would have used for producing the

observed movements. Additionally, this experiment demonstrated that both goal-directed and

intransitive arm movements were capable of evoking the motor facilitation. More recently

Strafella and Paus (2000) supported these findings and demonstrated the cortical nature of

the facilitation.

The electrophysiological experiments described above, while fundamental in showing

that action observation elicits a specific, coherent activation of the motor system, do not allow the


localization of the areas involved in the phenomenon. The first data on the anatomical

localization of the human mirror-neuron system have been therefore obtained using brain-

imaging techniques. PET and fMRI experiments, carried out by various groups,

demonstrated that when the participants observed actions made by human arms or hands,

activations were present in the ventral premotor/inferior frontal cortex (Rizzolatti et al. 1996b;

Grafton et al. 1996; Decety et al. 1997; Grèzes et al. 1998; Iacoboni et al. 1999, Decety and

Chaminade, 2003; Grèzes et al. 2003). As already mentioned for the TMS experiments by

Fadiga et al. (1995), both transitive (goal-directed) and intransitive meaningless gestures

activate the mirror-neuron system in humans. Grèzes et al. (1998) investigated whether the

same areas became active in the two conditions. Normal human volunteers were instructed

to observe meaningful or meaningless actions. The results confirmed that the observation of

meaningful hand actions activates the left inferior frontal gyrus (Broca’s region), the left

inferior parietal lobule plus various occipital and inferotemporal areas. An activation of the left

precentral gyrus was also found. During meaningless gesture observation there was no

Broca’s region activation. Furthermore, in comparison with meaningful action observations,

an increase was found in activation of the right posterior parietal lobule. More recently, two

further studies have shown that a meaningful hand-object interaction more than pure

movement observation, is effective in triggering Broca’s area activation (Hamzei et al, 2003;

Johnson-Frey et al, 2003). Similar conclusions have been reached also for mouth movement

observation (Campbell et al, 2001).

In all early brain imaging experiments, the participants observed actions made by

hands or arms. Recently, experiments were carried out to learn whether the mirror system coded

actions made by other effectors. Buccino et al. (2001) instructed participants to observe

actions made by mouth, foot as well as by hand. The observed actions were: biting an apple,

reaching and grasping a ball or a small cup, and kicking a ball or pushing a brake (object-

related actions). Similar actions but non object-related (such as chewing) were also tested.

The results showed that: (1) During observation of non object-related mouth actions

(chewing), activation was present in area 6 and in Broca’s area on both sides, with a more

anterior activation (BA45) in the right hemisphere. During object-related action (biting), the

pattern of premotor activation was similar to that found during non object-related action. In

addition, two activation foci were found in the parietal lobe. (2) During the observation of non

object-related hand/arm actions there was a bilateral activation of area 6 that was located

dorsally to that found during mouth movement observations. During the observation of

object-related arm/hand actions (reaching-to-grasp-movements) there was a bilateral

activation of premotor cortex plus an activation site in Broca’s area. As in the case of

observation of mouth movements, two activation foci were present in the parietal lobe. (3)

During the observation of non object-related foot actions there was an activation of a dorsal


sector of area 6. During the observation of object-related actions there was, as in the

condition without an object, an activation of a dorsal sector of area 6. In addition, there was an

activation of the posterior part of the parietal lobe. Three main conclusions can be

drawn from these data. First, the mirror system is not limited to hand movements. Second,

actions performed with different effectors are represented in different regions of the premotor

cortex (somatotopy). Third, in agreement with previous data by Grèzes et al. (1998) and

Iacoboni et al. (1999), the parietal lobe is part of the human mirror system and is

strongly involved when an individual observes object-directed actions.

The experiments reviewed in this section tested subjects during action observation

only. Therefore, the conclusion that activated frontal areas such as Broca’s region have

mirror properties was an indirect one, based on their premotor nature and, in the case

of Broca’s area, by its homology with monkey’s premotor area F5. However, if one looks at

the results of the brain imaging experiments reviewed in the second section of this chapter

(Motor-related functions of Broca’s area), it appears clearly that Broca’s area is an area

becoming active not only during speech generation, but also during real and imagined hand

movements. If one combines these two observations it appears that Broca’s area might be

the highest-order motor region where an observation/execution matching occurs. Direct

evidence for an observation/execution matching system was recently provided by two

experiments, one employing fMRI technique (Iacoboni et al 1999), the other using event-

related MEG (Nishitani and Hari, 2000).

Iacoboni et al. (1999) instructed normal human volunteers to observe and imitate a

finger movement and to perform the same movement after a spatial or a symbolic cue

(observation/execution tasks). In another series of trials, the same participants were asked to

observe the same stimuli presented in the observation/execution tasks, but without giving

any response to them (observation tasks). The results showed that activation during imitation

was significantly stronger than in the other two observation/execution tasks in three cortical

areas: left inferior frontal cortex, right anterior parietal region, and right parietal operculum.

The first two areas were active also during observation tasks, while the parietal operculum

became active during observation/execution conditions only.

Nishitani and Hari (2000) addressed the same issue using event-related

neuromagnetic recordings. In their experiments, normal human participants were requested,

in different conditions, to grasp a manipulandum, to observe the same movement performed

by an experimenter, and, finally, to observe and simultaneously replicate the observed

action. Their results showed that during execution, there was an early activation in the left

inferior frontal cortex (Broca’s area) with a response peak appearing approximately 250 ms

before the touch of the target. This activation was followed within 100-200 ms by activation of

the left precentral motor area and 150-250 ms later by activation of the right one. During


observation and during imitation, pattern and sequence of frontal activations were similar to

those found during execution, but the frontal activations were preceded by an occipital

activation due to visual stimulation occurring in the former conditions.

More recently, two studies of Koski and colleagues (2002, 2003) have made further

steps toward understanding the imitation mechanism and its relation to BA44. First,

(Koski et al., 2002) they compared the activity evoked by imitation and observation of finger

movements in the presence or conversely the absence of an explicit goal. They found that

the presence of a goal increases the activity observed in BA44 for the imitation task only. They

concluded that imitation may represent a behavior tuned to replicate the goal of an action.

Later, the same authors (Koski et al., 2003) investigated the potential difference between

anatomical (actor and imitator both move the right hand) and specular imitation (the actor

moves the left hand, the imitator moves the right hand as in a mirror). They demonstrated

that the activation of Broca’s area was present only during specular imitation. In sum, a

growing number of studies have established the decisive role of Broca’s area in distal and

facial motor functions such as movement execution, observation, simulation and imitation,

further emphasizing the fundamental importance of the goal of the action.

What links hand actions with speech?

Others’ actions do not generate only visually perceivable signals. Action-generated

sounds and noises are also very common in nature. One could therefore expect that

this sensory information too, related to a particular action, could determine a motor activation

specific for that same action. A very recent neurophysiological experiment addressed this

point. Kohler and colleagues (2002) investigated whether there are neurons in area F5 that

discharge when the monkey makes a specific hand action and also when it hears the

corresponding action-related sounds. The experimental hypothesis started from the observation

that a large number of object-related actions (e.g. breaking a peanut) can be recognized by a

particular sound. The authors found that 13% of the investigated neurons discharge both

when the monkey performed a hand action and when it heard the action-related sound.

Moreover, most of these neurons discharge also when the monkey observed the same

action, demonstrating that these ‘audio-visual mirror neurons’ represent actions

independently of whether they are performed, heard or seen.

The presence of an audio-motor resonance in a region that, in humans, is classically

considered a speech-related area, evokes Liberman’s hypothesis on the mechanism at

the basis of speech perception (motor theory of speech perception, Liberman et al., 1967;

Liberman and Mattingly, 1985; Liberman and Whalen, 2000). The motor theory of speech

perception maintains that the ultimate constituents of speech are not sounds, but articulatory


gestures that have evolved exclusively at the service of language. A cognitive translation into

phonology is not necessary because the articulatory gestures are phonologic in nature.

Furthermore, speech perception and speech production processes use a common repertoire

of motor primitives that, during speech production, are at the basis of articulatory gesture

generation, while during speech perception, are activated in the listener as the result of an

acoustically evoked motor “resonance”. Thus, sounds conveying verbal communication are

the vehicle of motor representations (articulatory gestures) shared by both the speaker and

the listener, on which speech perception could be based. In other words, the listener

understands the speaker when his/her representations of articulatory gestures are activated by

verbal sounds.

Fadiga et al. (2002), in a TMS experiment based on the paradigm used in 1995 (Fadiga

et al. 1995), tested for the presence in humans of a system that motorically “resonates” when

the individuals listen to verbal stimuli. Normal subjects were requested to attend to an

acoustically presented randomized sequence of disyllabic words, disyllabic pseudo-words

and bitonal sounds of equivalent intensity and duration. Words and pseudo-words were

selected according to a consonant-vowel-consonant-consonant-vowel (cvccv) scheme. The

embedded consonants in the middle of words and of pseudo-words were either a double ‘f’

(labiodental fricative consonant that, when pronounced, requires slight tongue tip

mobilization) or a double ‘r’ (lingua-palatal fricative consonant that, when pronounced,

requires strong tongue tip mobilization). Bitonal sounds, lasting about the same time as

verbal stimuli and replicating their intonation pattern, were used as a control. The excitability

of motor cortex in correspondence of tongue movements representation was assessed by

using single pulse TMS and by recording MEPs from the anterior tongue muscles. The TMS

stimuli were applied synchronously with the double consonant of presented verbal stimuli

(words and pseudo-words) and in the middle of the bitonal sounds. Results (see Figure 3)

showed that during speech listening there is an increase of motor evoked potentials recorded

from the listeners’ tongue muscles when the listened word strongly involves tongue

movements, indicating that when an individual listens to verbal stimuli his/her speech related

motor centers are specifically activated. Moreover, the word-related facilitation was significantly

larger than the pseudo-word-related one.


Figure 3. Average value (+ SEM) of intrasubject normalized MEPs total areas for each condition. Data

from all subjects; `rr' and `ff' refer to verbal stimuli containing a double lingua-palatal fricative consonant `r', and

containing a double labio-dental fricative consonant `f', respectively.
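
As a reading aid only (this is not the authors’ analysis code, and every number below is invented), the following Python sketch illustrates the kind of computation summarized in Figure 3: each subject’s MEP total areas are normalized by that subject’s mean across conditions, and the normalized values are then averaged across subjects together with their standard error of the mean.

from statistics import mean, stdev
from math import sqrt

# Hypothetical MEP total areas (arbitrary units) per subject and condition.
raw = {
    "S1": {"word_rr": 1.9, "word_ff": 1.2, "pseudo_rr": 1.6, "pseudo_ff": 1.1, "bitonal": 1.0},
    "S2": {"word_rr": 2.4, "word_ff": 1.5, "pseudo_rr": 2.0, "pseudo_ff": 1.4, "bitonal": 1.3},
    "S3": {"word_rr": 1.5, "word_ff": 1.0, "pseudo_rr": 1.3, "pseudo_ff": 0.9, "bitonal": 0.8},
}

# Intrasubject normalization: divide each area by that subject's mean across conditions.
normalized = {}
for subject, areas in raw.items():
    subject_mean = mean(areas.values())
    normalized[subject] = {cond: area / subject_mean for cond, area in areas.items()}

# Per-condition average and standard error of the mean across subjects.
for cond in ("word_rr", "word_ff", "pseudo_rr", "pseudo_ff", "bitonal"):
    values = [normalized[s][cond] for s in normalized]
    sem = stdev(values) / sqrt(len(values))
    print(f"{cond:10s} mean = {mean(values):.2f}  SEM = {sem:.2f}")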

These results indicate that the passive listening to words that would involve tongue

mobilization (when pronounced) induces an automatic facilitation of the listener’s motor

cortex. Furthermore, the effect is stronger in the case of words than in the case of pseudo-

words, suggesting a possible unspecific facilitation of the motor speech center due to the

recognition that the presented material corresponds to an existing word.

The presence of “audio-visual” mirror neurons in the monkey and the presence of

“speech-related acoustic motor resonance” in humans indicate that, independently of the

sensory nature of the perceived stimulus, the mirror-resonant system retrieves from the action

vocabulary (stored in the frontal cortex) the stimulus-related motor representations. It is

however unclear if the activation of the motor system during speech listening could be

interpreted in terms of an involvement of motor representations in speech processing and,

perhaps, perception. Studies of cortical stimulation during neurosurgery and clinical data

from frontal aphasics suggest that this is the case (see above). However, all these studies

report that comprehension deficits become evident only in the processing of complex
sentences or the execution of complex commands. Single words (particularly nouns) are

almost always correctly understood. To verify this observation, we applied repetitive TMS
(rTMS, which functionally blocks the stimulated area for hundreds of milliseconds) over speech-
related premotor centers during single-word listening (Fadiga et al., unpublished
observation). TMS was delivered over a site 2 cm anterior to the hot-spot of the hand motor

representation, as assessed during mapping sessions performed on individual subjects. At

the end of each trial, participants were required to identify the word they had heard in a list
presented on a computer screen. Data analysis showed that rTMS was ineffective in
perturbing subjects' performance. As expected, subjects were perfectly able to report the
word they had heard independently of the presence or absence of the stimulation, of the
duration of the stimulation itself and of the moment at which the stimulus was delivered with

respect to the beginning of the presented word. If one accepts that Broca’s region is not

concerned with single-word perception but at the same time considers that this area has
classically been regarded as the brain center most involved in phonological processing (at
least in production), a possible contradiction emerges. In order to investigate more rigorously

the perceptual role of Broca's area, we therefore decided to use an experimental paradigm
that is very sensitive in detecting a possible phonological impairment following Broca's area

inactivation. It should be noted, however, that although phonology, among the various speech
attributes, is the most strictly related to the motor (phono-articulatory) aspects of speech, this
does not mean that other speech attributes, such as lexicon and syntax, are totally unrelated

to motor representations. It is well known, in fact, that several Broca’s aphasic patients

present differential deficits in understanding nouns (less impaired) and verbs (more impaired,

particularly in the case of action verbs) (see Bak et al. 2001).

With this aim, we applied rTMS over Broca's region during a phoneme discrimination task
in order to see whether rTMS-induced inhibition was able to produce a specific "deafness" for the
phonological characteristics of the presented stimuli. Subjects were instructed to carefully

listen to a sequence of acoustically presented pseudo-words and to categorize the stimuli

according to their phonological characteristics by pressing one of four different switches

(Craighero, Fadiga and Haggard, unpublished data). Stimuli consisted of 80 pseudo-words

subdivided into four different categories. Categories were formed according to the phonetic

sequence of the middle part of the stimulus which could be “dada”, “data”, “tada” or “tata”

(e.g., pipedadali, pelemodatalu, mamipotadame, pimatatape). Subjects had to press the button
corresponding to the stimulus category as quickly as possible. Participants' left hemisphere was

magnetically stimulated in three different regions by using rTMS: a) the focus of the tongue

motor representation, b) a region 2 centimeters more anterior (ventral premotor/inferior

frontal cortex), c) a region 2 centimeters more posterior (somatosensory cortex). Repetitive

transcranial magnetic stimulation was delivered at a frequency of 20 Hz, time-locked either to
the 2nd critical formant, to the 1st and 2nd critical formants, or to the whole critical portion of
the presented word. The hypothesis was that rTMS delivered over the speech-related premotor
cortex, by temporarily inhibiting the resonance system, should induce slower reaction times and a
significantly higher number of errors in the discrimination task with respect to the sessions in

which other cortical regions were stimulated. Results, however, showed no difference

between the performances obtained in the different experimental conditions (which
also included a control condition without rTMS).
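To make the logic of this null result concrete, here is a hedged Python sketch of the kind of comparison involved: reaction times from the three stimulation sites plus the no-rTMS control entered into a one-way ANOVA. The numbers are simulated and the data layout is an assumption, not the authors' actual analysis.

```python
import numpy as np
from scipy import stats

# Simulated reaction times (ms) in the four-choice phoneme-order task, one array
# per condition; the condition labels are taken from the text, the values are not.
rng = np.random.default_rng(1)
rts = {
    "tongue motor cortex": rng.normal(850, 90, 40),
    "ventral premotor/inferior frontal": rng.normal(855, 90, 40),
    "somatosensory cortex": rng.normal(845, 90, 40),
    "no rTMS control": rng.normal(848, 90, 40),
}

# One-way ANOVA across stimulation conditions; a non-significant F test
# corresponds to the "no difference between conditions" outcome reported above.
f_val, p_val = stats.f_oneway(*rts.values())
print(f"F = {f_val:.2f}, p = {p_val:.3f}")
```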

A possible interpretation of the absence of any interference effect on phonological
perception could be that the discrimination task we used does not, in fact, involve
phonological perception. The task could be considered a mere discrimination of the serial
order of two different (not necessarily phonological) elements: the sound "TA" and the sound
"DA". It is possible that subjects solve the task in the same way they would for two different
tones, possibly involving structures other than Broca's area. Another possible interpretation is
that the stimuli used (pseudo-words) are treated by the brain as non-speech stimuli, because
they are semantically meaningless.

In order to be sure of testing subjects with a task that necessarily involves phonological
perception, we decided to use a paradigm classically considered a phonological one: the
"phonological priming" task. The phonological priming effect refers to the fact that a target word is
recognized faster when it is preceded by a prime word sharing its last syllable with it
(rhyming effect; Emmorey, 1989).

In a single pulse TMS experiment we therefore stimulated participants’ inferior frontal

cortex while they were performing a phonological priming task. Subjects were instructed to

carefully listen to a sequence of acoustically presented pairs of verbal stimuli (disyllabic

‘cvcv’ or ‘cvccv’ words and pseudo-words) in which final phonological overlap was present

(rhyme prime) or, conversely, not present. Subjects were requested to make a lexical

decision on the second stimulus (target) by pressing with the index or the middle finger one

button if the target was a word and another button if the target was a pseudo-word.

The pairs of verbal stimuli belonged to four categories, which differed in the presence
of lexical content in the prime and in the target:

- prime-word/target-word (W-W)

- prime-word/target-pseudoword (W-PW)

- prime-pseudoword/target-word (PW-W)

- prime-pseudoword/target-pseudoword (PW-PW).

Each category contained both rhyming and non-rhyming pairs. In some randomly

selected trials, we administered single-pulse TMS over left BA44 (Broca's

region, localized by using “Neurocompass”, a frameless stereotactic system built in our

laboratory) during the interval (20 ms) between prime and target stimuli.


In trials without TMS, the three main results are: (i) a strong and statistically significant

facilitation (phonological priming effect) when W-W, W-PW, PW-W pairs are presented; (ii)

no phonological priming effect when the PW-PW pair is presented; (iii) faster responses

when the target is a word rather than a pseudoword (both in W-W and PW-W).
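The priming effect described here can be expressed as the difference between mean RTs for non-rhyming and rhyming pairs within each category. The sketch below illustrates that computation on fabricated RTs chosen only to mimic the qualitative pattern in Figure 4; it is not the authors' analysis code.

```python
import numpy as np

# Fabricated trial-level RTs (ms); True/False marks whether prime and target rhymed.
rng = np.random.default_rng(2)
rts = {
    ("W-W", True): rng.normal(1000, 80, 30),   ("W-W", False): rng.normal(1080, 80, 30),
    ("W-PW", True): rng.normal(1080, 80, 30),  ("W-PW", False): rng.normal(1150, 80, 30),
    ("PW-W", True): rng.normal(1010, 80, 30),  ("PW-W", False): rng.normal(1090, 80, 30),
    ("PW-PW", True): rng.normal(1140, 80, 30), ("PW-PW", False): rng.normal(1145, 80, 30),
}

# Priming effect = mean RT for non-rhyming pairs minus mean RT for rhyming pairs.
for pair in ("W-W", "W-PW", "PW-W", "PW-PW"):
    effect = rts[(pair, False)].mean() - rts[(pair, True)].mean()
    print(f"{pair}: phonological priming effect = {effect:.0f} ms")
```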

Figure 4. Reaction times (RTs) in milliseconds (+ SEM) relative to the lexical decision during a

phonological priming task. Dotted line: phonological overlap present. Solid line: phonological overlap absent. W-

W, prime-word/target-word; W-PW, prime-word/target-pseudoword; PW-W, prime-pseudoword/target-word; PW-

PW, prime-pseudoword/target-pseudoword.

An interesting finding emerges from the analysis of these results: the presence or

absence of lexical content modulates the presence of the phonological priming effect. When
neither the target nor the prime has access to the lexicon (PW-PW pairs), the presence of
the rhyme does not facilitate the recognition of the target. In other words, in order to have a
phonological effect it is necessary to have access to the lexicon.

In trials during which TMS was delivered, only W-PW pairs were affected by brain
stimulation: the W-PW pairs behaved exactly like the PW-PW ones. This finding suggests
that the stimulation of Broca's region might have affected the lexical property of the
prime. As a consequence, the impossibility of accessing the lexicon results in the
absence of the phonological effect. According to our interpretation, the TMS-related effect is

absent in the W-W and PW-W pairs because of the presence of a meaningful (W) target.

[Figure 4 chart: mean RTs (approximately 950-1200 ms) for the W-W, W-PW, PW-W and PW-PW pairs, plotted separately for the priming and non-priming conditions.]


Being aware that a possible criticism of our data is that the task implied a lexical
decision, we replicated the experiment by asking subjects to detect whether the final vowel of the
target stimulus was 'A' or 'O'. Despite the absence of any lexicon-directed attention, the
results were exactly the same as in the case of the lexical decision paradigm. Further
experiments are now being carried out in order to reinforce this observation. In particular, brain
regions other than Broca's area will be stimulated in order to test the specificity of the

observed effect.

Conclusions

The present chapter reviews some literature and presents some experimental
results showing that, in addition to speech-related tasks, Broca's area is also significantly
involved in tasks devoid of verbal content. If in some cases one could advance the criticism
that internal speech might have been at the origin of this activation, in other cases this
possibility is ruled out by appropriate controls. As frequently happens in biology, when too
many interpretations are proposed for an anatomic-functional correlate, one should strongly
doubt each of them. We started from this skeptical point of view by making an attempt to
correlate what has been found with brain imaging in humans with what is known in monkeys at
the single-neuron level. The behavioral conditions triggering the response of neurons recorded in

the monkey area most closely related to human Broca's area (ventral premotor area F5)
are: 1) execution of grasping actions with the hand or with the mouth; 2) observation of
graspable objects; 3) observation of hand/mouth actions performed by other individuals; 4)
hearing of sounds produced during manipulative actions. The experimental evidence suggests
that, in order to activate F5 neurons, executed/observed/heard actions must all be goal-
directed. Does the cytoarchitectonic homology linking monkey area F5 with Broca's area
correspond to some functional homology? Does human Broca's area discharge during
hand/mouth action execution/observation/hearing too? Does it make a difference, in terms of
Broca's activation, whether observed actions are meaningful (goal-directed) or meaningless? A

positive answer to these questions comes from fMRI experiments in which human subjects

are requested to execute goal-directed actions and are presented with the same visual

stimuli effective in triggering F5 neurons' responses (graspable objects or tools and actions
performed by other individuals). Results show that in both cases Broca's area becomes
significantly active. Finally, it is interesting to note that the observation of meaningless
movements, while it strongly activates human area 6 (bilaterally), is definitely less effective in
activating Broca's region.

It has been suggested that a motor resonant system, such as that formed by mirror
neurons, might have provided a neural substrate for interindividual communication (see Rizzolatti
and Arbib, 1998). According to this view, speech may have evolved from a hand/mouth

gestural communicative system. A complete theory on the origin of speech is however well

beyond the scope of this chapter. Our aim in writing it was to suggest to the reader some
stimulating starting points and to make an attempt to reconcile two streams of research
that start from very different positions: the study of speech representation in humans and
the study of hand action representation in monkeys. These two approaches converge on a common
target: the premotor region of the inferior frontal gyrus, where Paul Broca first localized his
"frontal speech area". The data presented in the last section of this chapter go in this
direction and, although they represent only an experimental starting point, we think that they
already allow some preliminary considerations and speculations.

The first important point is that the temporary inactivation of Broca’s region during

phonological tasks is ineffective in perturbing subjects' performance. This result argues definitely
against a "pure" phonological role of Broca's region. The interpretation we favor is that it is
impossible to dissociate phonology from the lexicon at the level of Broca's area because there "exist" only
words. In other words, phonologically relevant stimuli are matched against a repertoire of words
and not against individually meaningless "phoneme assemblies". Consequently, the motor
resonance of the tongue representation revealed by TMS during speech listening (Fadiga et al.
2002) is probably a mixed phenomenon that should involve cortical regions other than
Broca's area (possibly BA6), since this "acoustically evoked mirror effect" is independent of

the meaning of the presented stimuli. The motor representation of hand/mouth actions

present in Broca’s area derives from an ancient execution/observation (hearing) matching

system, already present in monkeys. As a consequence, forms of communication other than
the verbal one, although expressions of a residual ancient mechanism, should exert a
significant effect on Broca's area, benefiting from its twofold involvement with motor goals:
during the execution of one's own actions and during the perception of others'. We will investigate

this topic in the near future by using brain-imaging techniques.

The intimate motor nature of Broca’s region cannot be neglected when interpreting the

results of experiments testing hypotheses apparently far from pure motor tasks. The

hypothesis we suggest here (being aware of its purely speculative nature) is that the original
role played by this region in generating/extracting action meanings (by organizing/interpreting
motor sequences of individually meaningless movements) might have been generalized
during evolution, giving this area a new capability: the capability to deal with meanings
(and rules) that share with the motor system similar hierarchical and sequential structures,
harmonized by a general, supramodal "syntax".


Acknowledgements

The research presented in this chapter is supported in the framework of the European Science

Foundation EUROCORES program “The Origin of Man, Language and Languages”, by EC

“MIRROR” Contract and by Italian Ministry of Education Grants. We deeply thank G. Spidalieri for

his fruitful comments on early versions of the manuscript.

References

Amunts K, Schleicher A, Burgel U, Mohlberg H, Uylings HB, Zilles K (1999) Broca's

region revisited: cytoarchitecture and intersubject variability. Journal of Comparative

Neurology;412:319-341.

Arbib MA (1997) From visual affordances in monkey parietal cortex to hippocampo-

parietal interactions underlying rat navigation. Philos Trans R Soc Lond B Biol Sci;352:1429-

1436.

Bak TH, O'Donovan DG, Xuereb JH, Boniface S, Hodges JR. (2001) Selective

impairment of verb processing associated with pathological changes in Brodmann areas 44

and 45 in the motor neurone disease-dementia-aphasia syndrome. Brain;124:103-120.

Bell BD (1994) Pantomime recognition impairment in aphasia: an analysis of error

types. Brain Lang.;47:269-278.

Bharucha J and Krumhansl C (1983) The representation of harmonic structure in

music: hierarchies of stability as a function of context. Cognition;13:63-102.

Binkofski F, Amunts K, Stephan KM, Posse S, Schormann T, Freund HJ, Zilles K, Seitz

RJ (2000) Broca's region subserves imagery of motion: a combined cytoarchitectonic and

fMRI study. Human Brain Mapping;11:273-285.

Binkofski F, Buccino G, Posse S, Seitz RJ, Rizzolatti G, Freund H-J (1999a) A fronto-

parietal circuit for object manipulation in man: evidence from an fMRI study. European

Journal of Neuroscience;11:3276-3286.

Binkofski F, Buccino G, Stephan KM, Rizzolatti G, Seitz RJ, Freund H-J (1999b) A

parieto-premotor network for object manipulation: evidence from neuroimaging. Experimental

Brain Research;128:21-31.

Broca PP (1861) Loss of Speech, Chronic Softening and Partial Destruction of the

Anterior Left Lobe of the Brain. Bulletin de la Société Anthropologique;2:235-238.

Buccino G, Binkofski F, Fink GR, Fadiga L, Fogassi L, Gallese V, Seitz RJ, Zilles K,

Rizzolatti G, Freund HJ (2001). Action observation activates premotor and parietal areas in a

somatotopic manner: an fMRI study. European Journal of Neuroscience;13:400-404.


Campbell R, MacSweeney M, Surguladze S, Calvert G, McGuire P, Suckling J,

Brammer MJ, David AS (2001) Cortical substrates for the perception of face actions: an fMRI

study of the specificity of activation for seen speech and for meaningless lower-face acts

(gurning). Cognitive Brain Research;12:233-243.

Cohen-Seat G, Gastaut H, Faure J and Heuyer G. (1954) Etudes expérimentales de

l'activité nerveuse pendant la projection cinématographique. Revue Internationale de

Filmologie;5:7-64.

Daniloff JK, Noll JD, Fristoe M, Lloyd LL (1982) Gesture recognition in patients with

aphasia. Journal of Speech and Hearing Disorders;47:43-49.

Decety J, Chaminade T (2003) Neural correlates of feeling sympathy.

Neuropsychologia;41:127-138.

Decety J, Grezes J, Costes N, Perani D, Jeannerod M, Procyk E, Grassi F, Fazio F

(1997) Brain activity during observation of actions: Influence of action content and subject's

strategy. Brain;120:1763-1777.

Delazer M, Domahs F, Bartha L, Brenneis A, Lochy A, Trieb T, Benke T (2003)

Learning complex arithmetic: an fMRI study. Cognitive Brain Research;18:76-88.

Di Pellegrino G, Fadiga L, Fogassi L, Gallese V, Rizzolatti G (1992) Understanding

motor events: a neurophysiological study. Experimental Brain Research;91:176-180.

Duffy RJ and Duffy JR (1975) Pantomime recognition in aphasics. Journal of Speech

and Hearing Research;18:115-132.

Duffy RJ and Duffy JR (1981) Three studies of deficits in pantomimic expression and

pantomimic recognition in aphasia. Journal of Speech and Hearing Research;24:70-84.

Emmorey KD (1989) Auditory morphological priming in the lexicon. Language and

Cognitive Processes;4:73-92.

Ertl J and Schafer EW (1967) Cortical activity preceding speech. Life Sciences;6:473-

479.

Fadiga L, Craighero L, Buccino G, Rizzolatti G (2002) Speech listening specifically

modulates the excitability of tongue muscles: a TMS study. European Journal of

Neuroscience;15:399-402.

Fadiga L, Fogassi L, Pavesi G, Rizzolatti G (1995). Motor facilitation during action

observation: A magnetic stimulation study. Journal of Neurophysiology;73:2608-2611.

Ferrari PF, Gallese V, Rizzolatti G, Fogassi L (2003) Mirror neurons responding to the

observation of ingestive and communicative mouth actions in the monkey ventral premotor

cortex. European Journal of Neuroscience;17:1703-1714.

Gainotti G, Lemmo MS (1976) Comprehension of symbolic gestures in aphasia. Brain

and Language;3:451-460.


Gallese V, Fadiga L, Fogassi L, Rizzolatti G (1996) Action recognition in the premotor

cortex. Brain;119:593-609.

Gastaut HJ and Bert J. (1954) EEG changes during cinematographic presentation.

Electroencephalography and Clinical Neurophysiology;6:433-444.

Gerardin E, Sirigu A, Lehéricy S, Poline J-B, Gaymard B, Marsault C, Agid Y (2000)

Partially overlapping neural networks for real and imagined hand movements. Cerebral

Cortex;10:1093-1104.

Gerlach C, Law I, Gade A, Paulson OB (2002) The role of action knowledge in the

comprehension of artefacts: a PET study. Neuroimage;15:143-152.

Glosser G, Wiener M, Kaplan E (1986) Communicative gestures in aphasia. Brain and

Language;27:345-359.

Goldenberg G (1996) Defective imitation of gestures in patients with damage in the left

or right hemispheres. Journal of Neurology, Neurosurgery and Psychiatry;61:176-180.

Grafton ST, Arbib MA, Fadiga L, Rizzolatti G (1996) Localization of grasp

representations in humans by PET: 2. Observation compared with imagination. Experimental

Brain Research;112:103-111.

Grezes J, Armony JL, Rowe J, Passingham RE (2003) Activations related to "mirror"

and "canonical" neurones in the human brain: an fMRI study. Neuroimage;18:928-937.

Grezes J, Costes N, Decety J (1998) Top-down effect of strategy on the perception of

human biological motion: a PET investigation. Cognitive Neuropsychology;15:553-582.

Grezes J, Decety J (2002) Does visual perception of object afford action? Evidence

from a neuroimaging study. Neuropsychologia;40:212-222.

Gruber O, Indefrey P, Steinmetz H, Kleinschmidt A (2001) Dissociating neural

correlates of cognitive components in mental calculation. Cerebral Cortex;11:350-359.

Hamzei F, Rijntjes M, Dettmers C, Glauche V, Weiller C, Buchel C (2003) The human

action recognition system and its relationship to Broca's area: an fMRI study.

Neuroimage;19:637-644.

Hepp-Reymond MC, Husler EJ, Maier MA, Qi HX (1994) Force-related neuronal

activity in two regions of the primate ventral premotor cortex. Canadian Journal of Physiology

and Pharmacology;72:571-579.

Iacoboni M, Woods R, Brass M, Bekkering H, Mazziotta JC, Rizzolatti G (1999) Cortical

mechanisms of human imitation. Science;286:2526-2528.

Johnson-Frey SH, Maloof FR, Newman-Norlund R, Farrer C, Inati S, Grafton ST (2003)

Actions or hand-object interactions? Human inferior frontal cortex and action observation.

Neuron;39:1053-1058.


Kellenbach ML, Brett M, Patterson K (2003) Actions speak louder than functions: the

importance of manipulability and action in tool representation. Journal of Cognitive

Neuroscience;15:30-45.

Koelsch S, Gunter T, Friederici AD, Schroger E (2000) Brain indices of music

processing: "non-musicians" are musical. Journal of Cognitive Neuroscience;12:520-541.

Kohler E, Keysers C, Umiltà MA, Fogassi L, Gallese V, Rizzolatti G (2002) Hearing

sounds, understanding actions: Action representation in mirror neurons. Science;297:846-

848.

Koski L, Iacoboni M, Dubeau M-C, Woods RP, Mazziotta JC (2003) Modulation of

cortical activity during different imitative behaviors. Journal of Neurophysiology;89:460-471.

Koski L, Wohlschlager A, Bekkering H, Woods RP, Dubeau M-C, Mazziotta JC,

Iacoboni M (2002) Modulation of motor and premotor activity during imitation of target-

directed actions. Cerebral Cortex;12:847-855.

Lacquaniti F, Perani D, Guignon E, Bettinardi V, Carrozzo M, Grassi F, Rossetti Y,

Fazio F (1997) Visuomotor transformations for reaching to memorized targets: a PET study.

Neuroimage;5:129-146.

Lelord G, Laffont F, Sauvage D, Jusseaume P (1973) Evoked slow activities in man

following voluntary movement and articulated speech. Electroencephalography Clinical

Neurophysiology;35:113-124.

Liberman AM and Mattingly IG (1985) The motor theory of speech perception revised.

Cognition;21:1-36.

Liberman AM and Whalen DH (2000) On the relation of speech to language. Trends in
Cognitive Sciences;4:187-196.

Liberman AM, Cooper FS, Shankweiler DP, Studdert-Kennedy M (1967) Perception of

the speech code. Psychological Review;74:431-461.

Liotti M, Gay CT, Fox PT (1994) Functional imaging and language: evidence from

positron emission tomography. Journal of Clinical Neurophysiology;11:175-190.

Maess B, Koelsch S, Gunter TC, Friederici AD (2001) Musical syntax is processed in

Broca's area: an MEG study. Nature Neuroscience;4:540-545.

Matelli M, Luppino G, Rizzolatti G. (1985) Patterns of cytochrome oxidase activity in the

frontal agranular cortex of macaque monkey. Behavioral Brain Research, 18:125-137.

Matsumura M, Kawashima R, Naito E, Satoh K, Takahashi T, Yanagisawa T, Fukuda H

(1996) Changes in rCBF during grasping in humans examined by PET. NeuroReport;7:749-

752.

Mecklinger A, Gruenewald C, Besson M, Magnié M-N, Von Cramon Y (2002)

Separable neuronal circuitries for manipulable and non-manipulable objects in working

memory. Cerebral Cortex;12:1115-1123.


Menon V, Rivera SM, White CD, Glover GH, Reiss AL (2000) Dissociating prefrontal

and parietal cortex activation during arithmetic processing. Neuroimage;12:357-365.

Mingazzini G (1913) Lezioni di Anatomia neurologica. Roma.

Muller R-A, Kleinhans N, Courchesne E (2001) Broca's area and the discrimination of

frequency transitions: a functional MRI study. Brain and Language;76:70-76.

Murata A, Fadiga L, Fogassi L, Gallese V, Raos V, Rizzolatti G (1997) Object

representation in the ventral premotor cortex (area F5) of the monkey. Journal of

Neurophysiology;78:2226-2230.

Negawa T, Mizuno S, Hayashi T, Kuwata H, Tomida M, Hoshi H, Era S, Kuwata K

(2002) M pathway and areas 44 and 45 are involved in stereoscopic recognition based on

binocular disparity. Japanese Journal of Physiology;52:191-198.

Nishitani N, Hari R (2000) Temporal dynamics of cortical representation for action.

Proceedings of the National Academy of Sciences;97:913-918.

Papathanassiou D, Etard O, Mellet E, Zago L, Mazoyer B, and Tzourio-Mazoyer N.

(2000) A common language network for comprehension and production: a contribution to the

definition of language epicenters with PET. Neuroimage;11:347-357.

Patel AD, Gibson E, Ratner J, Besson M, Holcomb P (1998) Processing syntactic

relations in language and music: an event-related potential study. Journal of Cognitive

Neuroscience;10:717-733.

Penfield W and Roberts L (1959) Speech and brain mechanisms. Princeton, NJ:

Princeton University Press.

Petrides M and Pandya DN. (1997) Comparative architectonic analysis of the human

and the macaque frontal cortex. In Handbook of Neuropsychology, ed. F Boller, J Grafman,

pp. 17–58. New York: Elsevier. Vol. IX.

Ranganath C, Johnson M, D'esposito M (2003) Prefrontal activity associated with

working memory and episodic long-term memory. Neuropsychologia;41:378-389.

Rickard TC, Romero SG, Basso G, Wharton C, Flitman S, Grafman J (2000) The

calculating brain:an fMRI study. Neuropsychologia;38:325-335.

Rizzolatti G and Arbib MA (1998) Language within our grasp. Trends in

Neuroscience;21:188-194.

Rizzolatti G and Fadiga L (1998) Grasping objects and grasping action meanings: the

dual role of monkey rostroventral premotor cortex (area F5). In: G. R. Bock & J. A. Goode

(Eds.), Sensory Guidance of Movement, Novartis Foundation Symposium (pp. 81-103).

Chichester: John Wiley and Sons.

Rizzolatti G and Gentilucci M (1988) Motor and visual-motor functions of the premotor

cortex. In: P. Rakic & W. Singer (Eds.), Neurobiology of Neocortex (pp.269-284). Chichester:

John Wiley and Sons.


Rizzolatti G, Camarda R, Fogassi L, Gentilucci M, Luppino G, Matelli M (1988).

Functional organization of inferior area 6 in the macaque monkey: II. Area F5 and the control

of distal movements. Experimental Brain Research;71:491–507.

Rizzolatti G, Fadiga L, Gallese V, Fogassi L (1996a) Premotor cortex and the

recognition of motor actions. Cognitive Brain Research;3:131-141.

Rizzolatti G, Fadiga L, Matelli M, Bettinardi V, Paulesu E, Perani D, Fazio F (1996b)

Localization of grasp representation in humans by PET: 1. Observation versus execution.

Experimental Brain Research;111:246-252.

Rizzolatti G, Fogassi L and Gallese V (2003) Motor and cognitive functions of the

ventral premotor cortex. Current Opinion in Neurobiology;12:149-154.

Schaffler L, Luders HO, Dinner DS, Lesser RP and Chelune GJ. (1993)

Comprehension deficits elicited by electrical stimulation of Broca’s area. Brain;116:695–715.

Schubotz RI and von Cramon DY (2003) Functional-anatomical concepts of human

premotor cortex: evidence from fMRI and PET studies. Neuroimage;20, Suppl 1:120-131.

Strafella AP and Paus T (2000) Modulation of cortical excitability during action

observation: a transcranial magnetic stimulation study. NeuroReport;11:2289-2292.


Language in shadow

Luciano Fadiga

University of Ferrara, Ferrara, Italy and Italian Institute of Technology, Genova, Italy

Laila Craighero, Maddalena Fabbri Destro, and Livio Finos

University of Ferrara, Ferrara, Italy

Nathalie Cotillon-Williams and Andrew T. Smith

University of London, London, UK

Umberto Castiello

University of London, London, UK and University of Padova, Padova, Italy

The recent finding that Broca's area, the motor center for speech, is activated during action observation lends support to the idea that human language may have evolved from neural substrates already involved in gesture recognition. Although fascinating, this hypothesis has sparse demonstration because while observing actions of others we may evoke some internal, verbal description of the observed scene. Here we present fMRI evidence that the involvement of Broca's area during action observation is genuine. Observation of meaningful hand shadows resembling moving animals induces a bilateral activation of frontal language areas. This activation survives the subtraction of activation by semantically equivalent stimuli, as well as by meaningless hand movements. Our results demonstrate that Broca's area plays a role in interpreting actions of others. It might act as a motor-assembly system, which links and interprets motor sequences for both speech and hand gestures.

INTRODUCTION

Several theories have been proposed to explain the origins of human language. They can be grouped into two main categories. According to "classical" theories, language is a peculiarly human ability based on a neural substrate newly developed for the purpose (Chomsky, 1966; Pinker, 1994). Theories of the second "evolutionary" category consider human language as the evolutionary refinement of an implicit communication system, already present in lower primates, based on a set of hand/mouth goal-directed action representations (Armstrong, Stokoe, & Wilcox, 1995; Corballis, 2002; Rizzolatti & Arbib, 1998). The classical view is supported by the existence in humans of a cortical network of areas that become active during verbal communication. This network includes the temporal-parietal junction (Wernicke's area), commonly thought to be involved in sensory processing of speech, and the inferior frontal gyrus (Broca's area), classically considered to be the speech motor center.

Correspondence should be addressed to: Luciano Fadiga, Department of Biomedical Sciences and Advanced Therapies, Section

of Human Physiology, University of Ferrara, via Fossato di Mortara 17/19, I-44100 Ferrara, Italy. E-mail: [email protected]

This work was supported by EC grants ROBOT-CUB, NEUROBOTICS and CONTACT to LF and LC by European Science

Foundation (OMLL Eurocores) and by Italian Ministry of Education grants to LF, and by a Research Strategy Fund to UC.

We thank Attilio Mina for helpful bibliographic suggestions on animal hand shadows, Carlo Truzzi for creating the hand shadows

stimuli, and Rosario Canto for technical support.

© 2006 Psychology Press, an imprint of the Taylor & Francis Group, an informa business

SOCIAL NEUROSCIENCE, 2006, 00 (00), 1-13


www.psypress.com/socialneuroscience DOI:10.1080/17470910600976430


However, the existence of areas exclusively devoted to language has increasingly been challenged by experimental evidence showing that Broca's area and its homologue in the right hemisphere also become active during the observation of hand/mouth actions performed by other individuals (Aziz-Zadeh, Koski, Zaidel, Mazziotta, & Iacoboni, 2006; Decety et al., 1997; Decety & Chaminade, 2003; Grafton, Arbib, Fadiga, & Rizzolatti, 1996; Grezes, Costes, & Decety, 1998; Grezes, Armony, Rowe, & Passingham, 2003; Iacoboni, Woods, Brass, Bekkering, Mazziotta, & Rizzolatti, 1999; Rizzolatti et al., 1996). This finding seems to favor the evolutionary hypothesis, supporting the idea that the linguistic processing that characterizes Broca's area may be closely related to gesture processing. Moreover, comparative cytoarchitectonic studies have shown a similarity between human Broca's area and monkey area F5 (Matelli, Luppino, & Rizzolatti, 1985; Petrides, 2006; Petrides & Pandya, 1997; von Bonin & Bailey, 1947), a premotor cortical region that contains neurons discharging both when the monkey acts on objects and when it sees similar actions being made by other individuals (Di Pellegrino, Fadiga, Fogassi, Gallese, & Rizzolatti, 1992; Gallese, Fadiga, Fogassi, & Rizzolatti, 1996; Rizzolatti, Fadiga, Gallese, & Fogassi, 1996). These neurons, called "mirror neurons," may allow the understanding of actions made by others and might provide a neural substrate for an implicit communication system in animals. It has been proposed that this primitive "gestural" communication may be at the root of the evolution of human language (Rizzolatti & Arbib, 1998). Although fascinating, this theoretical framework can be challenged by invoking an alternative interpretation, more conservative and fitting with classical theories of the origin of language. According to this view, humans are automatically compelled to covertly verbalize what they observe. The involvement of Broca's area during action observation may therefore reflect inner, sub-vocal speech generation (Grezes & Decety, 2001). Clearly, an experiment determining which of these two interpretations is correct could provide a fundamental insight into the origins of language.

We investigated here the possibility that Broca's area and its homologue in the right hemisphere become specifically active during the observation of a particular category of hand gestures: hand shadows representing animals opening their mouths. These stimuli have been selected for two main reasons. First, by using these stimuli it was possible to design an fMRI experiment in which any activation due to covert verbalization could be removed by subtraction: the activation evoked while observing videos representing stimuli belonging to the same semantic set (i.e., real animals opening their mouths), and expected to elicit similar covert verbalization, could be subtracted from activity elicited by hand shadows of animals opening their mouths. Residual activation in Broca's area after this critical subtraction would demonstrate the involvement of "speech-related" frontal areas in processing meaningful hand gestures. Second, hand shadows only implicitly "contain" the hand creating them. Thus they are interesting stimuli that might be used to answer the question of how detailed a hand gesture must be in order to activate the mirror-neuron system. The results we present here support the idea that Broca's area is specifically involved during meaningful action observation and that this activation is independent of any internal verbal description of the seen scene. Moreover, they demonstrate that the mirror-neuron system becomes active even if the pictorial details of the moving hand are not explicitly visible. In the case of our stimuli, the brain "sees" the performing hand also behind the appearance.

METHODS

Participants were 10 healthy volunteers (6 females and 4 males; age range 19-32, mean 23). All had normal or corrected vision, no past neurological or psychiatric history and no structural brain abnormality. Informed consent was obtained according to procedures approved by the Royal Holloway Ethics Committee. Throughout the experiment, subjects performed the same task, which was to carefully observe the stimuli, which were back projected onto a screen visible through a mirror mounted on the MRI head coil (visual angle approximately 15° × 20°). Stimuli were of six types: (1) movies of actual human hands performing meaningless movements; (2) movies of the shadows of human hands representing animals opening their mouths; (3) movies of real animals opening their mouths; plus three further movies representing a sequence of still images taken from the previously described three videos. Hand movements were performed by a professional shadow artist. All stimuli were


enclosed in a rectangular frame (Figure 1), in a 640 × 480 pixel array, and were shown in grey scale. Each movie lasted 15 seconds, and contained a sequence of 7 different moving/static stimuli (e.g., dog, cow, pig, bird, etc., all opening their mouths). The experiment was conducted as a series of scanning sessions, each lasting 4 minutes. Each session contained eight blocks. In each session two different movie types were presented in an alternated order (see Figure 1, bottom). Each subject completed six sessions. The order of sessions was varied randomly across subjects. The six sessions contrasted the following pairs of movie types: (1) C1 = moving animal hand shadows, C2 = static animal hand shadows; (2) C1 = moving real hands, C2 = static real hands; (3) C1 = moving real animals, C2 = static real animals; (4) C1 = moving animal hand shadows, C2 = moving real animals; (5) C1 = moving animal hand shadows, C2 = moving real hands; (6) C1 = moving real hands, C2 = moving real animals. Whole-brain fMRI data were acquired on a 3T scanner (Siemens Trio) equipped with an RF volume head coil. Functional images were obtained with a gradient echo-planar T2* sequence using blood oxygenation level-dependent (BOLD) contrast, each comprising a full-brain volume of 48 contiguous axial slices (3 mm thickness, 3 × 3 mm in-plane voxel size). Volumes were acquired continuously with a repetition time (TR) of 3 seconds. A total of 80 scans were acquired for each participant in a single session (4 minutes), with the first 2 volumes subsequently discarded to allow for T1 equilibration effects. Functional MRI data were analyzed using statistical parametric mapping software (SPM2, Wellcome Department of Cognitive Neurology, London). Individual scans were realigned, spatially normalized and transformed into a standard stereotaxic space, and spatially smoothed by a 6 mm FWHM Gaussian kernel, using standard SPM methods. A high-pass temporal filter (cut-off 120 seconds) was applied to the time series. Considering the relatively low number of participants, a high-threshold, corrected, fixed-effects analysis was first performed for each experimental condition. Pixels were identified as significantly activated if P < .001 (FDR corrected for multiple comparisons) and the cluster size exceeded 20 voxels. The activated voxels surviving this procedure were superimposed on the standard SPM2 inflated brain (Figure 1). Clusters of activation were anatomically characterized according to their centers of mass activity with the
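As a rough illustration of the block-design analysis just described (and not a reconstruction of the authors' SPM2 pipeline), the sketch below builds a boxcar regressor for one 4-minute session of 80 scans at TR = 3 s, convolves it with a canonical double-gamma HRF, and adds discrete-cosine drift regressors for the 120 s high-pass cut-off. The 30 s block length is an assumption consistent with the "eight blocks" per session mentioned above.

```python
import numpy as np
from scipy.stats import gamma

TR, n_scans = 3.0, 80                      # one 4-minute session, as described above
frame_times = np.arange(n_scans) * TR      # acquisition times in seconds

# Boxcar for the alternating C1/C2 design: assumed 8 blocks of 30 s (C1 = 1, C2 = 0).
boxcar = ((frame_times // 30).astype(int) % 2 == 0).astype(float)

# Canonical double-gamma HRF (peak ~6 s, undershoot ~16 s), sampled at the TR.
t = np.arange(0, 32, TR)
hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
hrf /= hrf.sum()

task_regressor = np.convolve(boxcar, hrf)[:n_scans]

# Discrete-cosine drift set implementing the 120 s high-pass cut-off.
cutoff = 120.0
n_basis = int(np.floor(2 * n_scans * TR / cutoff)) + 1
drift = np.array([np.cos(np.pi * k * (np.arange(n_scans) + 0.5) / n_scans)
                  for k in range(n_basis)]).T

# Single-session design matrix: task regressor plus drift/constant columns.
X = np.column_stack([task_regressor, drift])
print(X.shape)        # (80, 1 + n_basis)
```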

aid of Talairach co-ordinates (Talairach & Tournoux, 1988), of the Muenster T2T converter (http://neurologie.uni-muenster.de/T2T/t2tconv/conv3d.html), and by taking into account the prominent sulcal landmarks. Furthermore, as far as Broca's region is concerned, a hypothesis-driven analysis was performed for sessions (3)-(6). In this analysis, a more restrictive statistical criterion was used (group analysis on individual subjects' analyses, small volume correction approach, cluster size > 20 voxels). Only significant voxels (P < .005) within the most permissive border of cytoarchitectonically defined probability maps (Amunts, Schleicher, Burgel, Mohlberg, Uylings, & Zilles, 1999) were considered. This last analysis was performed with the aid of the Anatomy SPM toolbox (Eickhoff et al., 2005). Subjects' lips were video-monitored during the whole scanning procedure. No speech-related muscle activity was detectable during video presentation. The absence of speech-related motor activity during video presentation was assessed in a pilot experiment, on a set of different subjects, looking at the same videos presented in the scanner while electromyography of tongue muscles was recorded according to the technique used by Fadiga, Craighero, Buccino, and Rizzolatti (2002).
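The Talairach co-ordinates reported later in Table 1 were obtained with the Muenster T2T converter. As a sanity check only, the widely used Brett piecewise-affine approximation (not the tool used by the authors) reproduces the tabulated values closely, as the sketch below shows for the right BA 44 focus.

```python
def mni2tal(x, y, z):
    """Approximate MNI -> Talairach conversion (Brett's piecewise affine).
    This is NOT the Muenster T2T converter used in the paper, only a close proxy."""
    x_t = 0.9900 * x
    if z >= 0:
        y_t = 0.9688 * y + 0.0460 * z
        z_t = -0.0485 * y + 0.9189 * z
    else:
        y_t = 0.9688 * y + 0.0420 * z
        z_t = -0.0485 * y + 0.8390 * z
    return tuple(int(round(v)) for v in (x_t, y_t, z_t))

# Right BA 44 focus for animal hand shadows: MNI (62, 8, 24) is listed in Table 1
# as TAL (61, 9, 22); the approximation reproduces these values.
print(mni2tal(62, 8, 24))   # -> (61, 9, 22)
```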

RESULTS

During the fMRI scanning volunteers observed videos representing: (1) the shadows of human hands depicting animals opening and closing their mouths; (2) human hands executing sequences of meaningless finger movements; or (3) real animals opening their mouths. Brain activations were compared between pairs of conditions in a block design. In addition (4, 5, 6), each condition was contrasted with a "static" condition, in which the same stimuli presented in the movie were shown as static pictures (e.g., stills of animals presented for the same time as the corresponding videos). The comparison of the first three "moving" conditions with each corresponding "static" one controls for nonspecific activations and emphasizes the action component of the gesture. Figure 1 shows, superimposed, the results of the moving vs. static contrasts for the animal hand shadows and real animals conditions (red and green spots, respectively). In addition to largely overlapping occipito-parietal activations, a specific differential activation emerged in the anterior


part of the brain. Animal hand shadows strongly activated left parietal cortex, pre- and post-central gyri (bilaterally), and, more interestingly for the purpose of the present study, the bilateral inferior frontal gyrus (BA 44 and 45). Conversely, the only frontal activation reaching significance in the moving vs. static contrast for real animals was located in bilateral BA 6, close to the premotor activation shown in an fMRI experiment by Buccino et al. (2004) when subjects observed mouth actions performed by monkeys and dogs. This location may therefore correspond to a premotor region where a species-independent mirror-neuron system for mouth actions is present in humans. A discussion of non-frontal activations is beyond the scope of the present paper; however, the above-threshold activation foci are listed in Table 1. The results shown in Figure 1 on one side seem to rule out the possibility that the inferior frontal activity induced by action viewing is due to covert speech; on the other side they indicate that the shadows of animals opening their mouths, although clearly depicting animals and not hands, convey implicit information about the human being moving her hand in creating them. Indeed, they evoke an activation pattern superimposable on that evoked by hand action observation (Buccino et al., 2001; Grafton et al., 1996; Grezes et al., 2003; see Figure 1 and Table 1). To interpret this result, it may be important to stress the peculiar nature of hand shadows: although they are created by moving hands, the professionalism of the artist creating them is such that the hand is never obvious. Nevertheless, the mirror-neuron system is activated. The possibility we favor is that the brain "sees" the hand behind the shadow. This possibility is supported by recent data demonstrating that monkey mirror neurons become active even if the final part of the grasping movement is performed behind a screen (Umilta et al., 2001). Consequently, the human mirror system (or at least part of it) seems to act more as an active interpreter than as a passive perceiver.

The bilateral activation of the inferior frontal gyrus shown in Figure 1 during observation of animal hand shadows cannot yet be attributed to covert verbalization. This is because it survives the subtraction of still images representing the same stimuli presented in the moving condition, which might also evoke internal naming. It could be argued, however, that videos of moving animals and animal shadows are dynamic and richer in details than their static controls, and might more powerfully evoke a semantic representation of the observed scene, but this cannot be stated with confidence. We therefore made a direct comparison between moving animal hand shadows and moving real animals. We narrowed the region of interest from the whole brain (as in Figure 1) to bilateral BA 44, the main target of our study. This hypothesis-driven analysis was performed by looking at voxels within the most permissive borders of the probabilistic map of this area provided by Amunts et al. (1999), taking as significance threshold a p value of .005 (random effects analysis). The results of this comparison are shown in Figure 2B. As already suggested by Figure 1 and Table 1, right and (more interestingly) left frontal clusters survived this subtraction. The first one was located in right BA 44 (X = 58, Y = 12, Z = 24), an area known to be involved during observation of biological action, either meaningful or meaningless (Grezes et al., 1998; Iacoboni et al., 1999). The second one is symmetrically positioned on the left side (X = -50, Y = 4, Z = 22). Finally, two additional clusters were present in Broca's region. One was more posterior, in that part of the inferior frontal gyrus classically considered as speech related (X = -58, Y = 12, Z = 14; pars opercularis), and one more anterior, within area 45 according to the Talairach and Tournoux atlas (1988) (X = -45, Y = 32, Z = 14; pars triangularis). This finding agrees with our hypothesis and demonstrates that the activation of Broca's area during action understanding is independent of internal verbalization: if an individual is compelled to verbalize internally when a hand shadow representing an animal is presented, the same individual should also verbalize during the observation of real animals.
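Conceptually, the hypothesis-driven analysis restricts the search for significant voxels to the cytoarchitectonic probability map of area 44. The toy sketch below illustrates that masking step on fabricated arrays; the real analysis used SPM's small-volume correction and the Anatomy toolbox, not this code.

```python
import numpy as np

# Toy stand-ins for the group T-map and the area 44 probability map (percent values);
# both would normally be volumes on the same standard-space grid.
rng = np.random.default_rng(3)
t_map = rng.normal(0.0, 1.5, size=(4, 4, 4))
prob_ba44 = rng.integers(0, 101, size=(4, 4, 4))

t_threshold = 3.0            # stands in for the voxel-level P < .005 criterion
roi_mask = prob_ba44 > 0     # "most permissive border" of the probability map

significant = (t_map > t_threshold) & roi_mask
print(np.argwhere(significant))   # indices of voxels surviving the ROI-restricted test
```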

The finding that Broca's area involvement during observation of hand shadows is not

Figure 1. Cortical activation pattern during observation of animal hand shadows and real animals. Significantly activated voxels (P < .001, fixed effects analysis) in the moving animal shadows and moving real animals conditions after subtraction of the static controls. Activity related to animal shadows (red clusters) is superimposed on that from real animals (green clusters). Those brain regions activated during both tasks are depicted in yellow. In the lowermost part of the figure the experimental time-course for each contrast is shown (i.e., C1, moving; C2, static). Note the almost complete absence of frontal activation for real animals in comparison to animal shadows, which bilaterally activate the inferior frontal gyrus (arrows).


TABLE 1

Montreal Neurological Institute (MNI) and Talairach (TAL) co-ordinates and T-values of the foci activated during observation of

moving animal hand shadows, real hands, and real animals, after subtraction of static conditions

MNI x y z    TAL x y z    T-value
Animal hand shadows
Inferior frontal gyrus
BA 44
R 62 8 24    61 9 22    4.71
L -62 8 24    -61 9 22    4.58
BA 45
R 58 22 14    57 22 12    3.37
Precentral gyrus
BA 6
R 54 0 36    53 2 33    5.29
  60 2 34    59 4 31    4.90
L -60 -2 36    -59 0 33    5.09
  -60 0 18    -59 1 17    3.56
  -60 -12 38    -59 -10 36    5.24
BA 4
R 54 -18 36    53 -16 34    7.97
Postcentral gyrus
BA 40
L -58 -22 16    -57 -21 16    5.92
BA 3
L -32 -38 52    -32 -34 50    6.53
BA 2
R 32 -40 66    32 -36 63    5.18
Superior parietal lobule
BA 7
R 26 -48 64    26 -44 61    6.24
Cuneus
BA 18
R 16 -100 6    16 -97 10    10.43
  20 -96 8    20 -93 12    12.36
L -16 -104 2    -16 -101 7    5.09
Middle occipital gyrus
BA 18
R 42 -90 8    42 -87 12    7.78
L -28 -96 2    -28 -93 6    6.95
BA 19
L -40 -86 0    -40 -83 4    7.28
Insula
BA 13
R 54 -40 20    54 -38 20    6.35
Middle temporal gyrus
BA 37
R 54 -70 2    53 -68 5    10.95
Temporal fusiform gyrus
BA 37
R 42 -44 -16    42 -43 -11    7.52
L -44 -44 -16    -44 -43 -11    6.44
Cerebellum
R 10 -80 -44    8 -80 -42    6.53
L -8 -78 -44    -8 -77 -33    5.61
Amygdala
R 22 -2 -22    22 -3 -18    3.85
L -18 -2 -22    -18 -3 -18    4.15


Real hands
Inferior frontal gyrus
BA 44
R 62 16 28    61 17 25    5.67
L -60 10 26    -60 12 23    3.64
Middle frontal gyrus
BA 6
R 34 -6 62    34 -3 57    5.30
L -24 -8 52    -24 -5 48    4.18
Superior frontal gyrus
BA 10
R 4 58 28    4 58 23    6.22
L -26 54 -2    -27 52 -4    4.14
Postcentral gyrus
BA 3
R 52 -20 40    51 -18 38    4.71
BA 7
R 28 -50 66    28 -46 63    12.50
BA 40
R 46 -32 54    46 -29 51    4.58
L -66 -22 14    -65 -21 14    3.37
Inferior parietal lobule
BA 40
R 62 -30 28    61 -28 27    4.02
  32 -42 58    32 -38 55    8.23
L -50 -30 32    -50 -28 31    5.09
  -34 -42 58    -34 -38 55    6.51
Superior parietal lobule
BA 7
L -32 -56 62    -33 -51 60    4.94
Cuneus
BA 18
L -20 -94 6    -20 -91 10    7.13
Middle occipital gyrus
BA 19
L -54 -76 2    -53 -74 6    7.42
  -36 -62 14    -36 -59 16    7.60
Lingual gyrus
BA 17
R 10 -96 -8    10 -93 -2    5.74
Inferior occipital gyrus
BA 18
R 30 -94 -12    30 -92 -6    6.10
L -26 -96 -8    -26 -93 -2    6.69
Middle temporal gyrus
BA 37
R 52 -68 2    51 -66 5    7.78
Temporal fusiform gyrus
BA 37
R 46 -42 -18    46 -41 -13    5.45
L -46 -44 -16    -46 -43 -11    4.59
Cerebellum
R 20 -82 -28    20 -81 -20    4.11
L -28 -56 -50    -28 -56 -39    5.10
Parahippocampal gyrus
BA 34
R 30 6 -18    30 5 -15    6.75
Insula
R 52 -22 18    51 -20 18    6.05


Globus pallidus
R 16 2 2    16 2 2    4.10
L -16 0 -2    -16 0 -2    3.59
Real animals
Precentral gyrus
BA 4
L -62 -14 38    -61 -12 36    6.61
BA 6
R 36 -2 34    36 0 31    5.93
  32 -12 52    32 -9 48    4.48
L -30 0 38    -30 2 35    10.43
Middle frontal gyrus
BA 47
L -46 48 -2    -46 46 -4    3.92
Superior frontal gyrus
BA 6
R 30 -8 72    30 -4 67    3.66
BA 8
R 20 30 56    20 32 50    6.13
Postcentral gyrus
BA 2
L -38 -40 70    -38 -36 66    4.56
BA 3
R 34 -36 54    34 -32 51    3.26
L -18 -44 76    -18 -39 72    3.35
Precuneus
BA 7
L -8 -54 46    -8 -50 45    6.08
Middle occipital gyrus
BA 18
R 20 -98 14    20 -94 18    11.75
L -20 -94 10    -20 -91 14    14.91
BA 19
R 38 -80 2    38 -77 6    7.21
L -40 -86 -2    -40 -83 2    9.88
  -54 -76 2    -53 -74 6    8.43
Lingual gyrus
BA 18
R 14 -86 -14    14 -84 -8    5.40
L -6 -88 -10    -6 -86 -4    5.24
Limbic lobe-uncus
BA 28
L -28 8 -24    -28 7 -21    7.00
Superior temporal gyrus
BA 41
L -46 -40 12    -46 -38 13    3.77
Middle temporal gyrus
BA 21
L -54 -24 -8    -53 -24 -6    3.54
BA 37
R 52 -70 4    51 -68 7    8.03
Temporal fusiform gyrus
BA 37
R 44 -42 -20    44 -41 -15    6.50
L -46 -50 -20    -46 -49 -14    4.66
Cerebellum
R 46 -58 -32    46 -58 -24    3.55
L -48 -58 -32    -48 -58 -24    3.78


explainable in terms of internal speech suggests that, in agreement with other data in the literature, this area may play a crucial role in action understanding. However, there remains the fact that Broca's area is known as a speech area and its involvement during overt and covert speech production has clearly been demonstrated recently (Palmer, Rosen, Ojemann, Buckner, Kelley, & Petersen, 2001). The hypothesis we favor is that this area participates in verbal communication because it represents the product of the evolutionary development of a precursor already present in monkeys: the mirror-neuron area F5, that portion of the ventral premotor cortex where hand/mouth actions are represented (see Petrides, 2006). Accordingly, in agreement with the characteristics of area F5 neurons, Broca's area should respond much better to goal-directed actions than to simple, meaningless movements. To test this possibility, we performed two further comparisons: (1) the observation of human hands performing meaningless finger movements versus the observation of moving real animals opening their mouths, to determine how much of the pars opercularis activation was due to the observation of meaningless hand movements; (2) the observation of animal hand shadows versus the observation of meaningless finger movements, to pit the presence of hands against the presence of meaning. The results are shown in Figure 2, C and D, respectively. After comparison (1), although the more dorsal, bilateral BA 44 activation was still present (Figure 2C, left: X = -56, Y = 10, Z = 26; right: X = 58, Y = 10, Z = 24), no voxels above significance were located in the pars opercularis of Broca's area. This demonstrates that finger movements per se do not specifically activate that part of Broca's area. In contrast, after comparison (2) a significant activation was present in the left pars opercularis (Figure 2D; X = -58, Y = 6, Z = 4), demonstrating the involvement of Broca's area pars opercularis in processing actions of others, particularly when meaningful and thus, implicitly, communicative.

DISCUSSION

The finding that animal hand shadows, but not real animals or meaningless finger movements, activate that part of Broca's region most intimately involved in verbal communication supports a similarity between these stimuli and spoken words. Animal hand shadows are formed by meaningless finger movements combined to evoke a meaning in the observer through the shape appearing on a screen. Thus, when one looks at them, the representation of an animal opening its mouth is evoked. Words that form sentences are formed by individually meaningless movements (phonoarticulatory acts), which, appropriately combined and segmented, convey meanings and representations. Does this twofold involvement of Broca's area reflect a specific role played by it in decoding actions, and particularly communicative ones? A positive answer to this question arises, in our view, from the finding that when observation of meaningless finger movements is subtracted from observation of animal hand shadows, an activation of the left pars opercularis persists.

The activation of Broca's area during gestural communication has already been shown in deaf signers, both during production and perception. This was interpreted as a vicarious involvement of Broca's area because of its verbal specialization (Horwitz et al., 2003). In other terms, according to this interpretation, Broca's area is activated because, by signing, deaf people express linguistic concepts. In our study

TABLE 1 (Continued)
Amygdala
L -34 -6 -16    -34 -6 -13    3.70
Globus Pallidus
L -24 -10 -4    -24 -10 -3    4.93
Anterior Cingulate
BA 24
L -4 36 8    -4 35 6    4.70
Note: BA = Brodmann area; R = right hemisphere; L = left hemisphere; x, y, z = co-ordinates.


participants were presented with communicative hand gestures but, in contrast to studies investigating deaf people, the gestures were non-symbolic even if able to address in an unambiguous way a specific concept (e.g., a barking dog). Thus, we show here that the involvement of Broca's region cannot be explained in terms of linguistic decoding of the gesture meaning. Conversely, the results indicate that Broca's region is involved in the understanding of communicative gestures.


How can this "perceptual" function be reconciled with the universally accepted motor role of Broca's area for speech? One possible interpretation is that Broca's area, due to its premotor origin, is involved in the assembly of meaningless sequences of action units (whether finger or phonoarticulatory movements) into meaningful representations. This elaboration process may proceed in two directions. In production, Broca's area recruits movement units to generate words/hand actions. In perception, Broca's area, being the human homologue of monkey area F5, addresses the vocabulary of speech/hand actions, which forms the template for action recognition. Our hypothesis is that, in origin, the precursor of Broca's area was involved in generating/extracting action meanings by organizing/interpreting motor sequences in terms of goals. Subsequently, this ability might have been generalized during the evolution that gave this area the capability to deal with meanings (and rules), which share similar hierarchical and sequential structures with the motor system (Fadiga, Craighero, & Roy, 2006).

This proposal is in agreement with fMRI investigations indicating that Broca's area is not always activated during speech listening. Wilson, Saygin, Sereno, and Iacoboni (2004) recently carried out an fMRI study in which subjects (1) passively listened to monosyllables and (2) produced the same speech sounds. Results showed a substantial bilateral overlap between regions activated during the two conditions, mainly in the superior part of ventral premotor cortex. Conversely, the activation of Broca's region was present only in some of the studied subjects, in our view because the task did not require any meaning extraction. This interpretation is in line with brain imaging studies indicating that, in speech comprehension, Broca's area is mainly activated during processing of syntactic aspects (Bookheimer, 2002). Luria (1966) had already noticed that Broca's area patients made comprehension errors in syntactically complex sentences such as passive constructions.

Finally, data coming from cortical stimulation of collaborating patients undergoing neurosurgery showed that electrical stimulation of Broca's area produced comprehension deficits, particularly evident in the case of "complex auditory verbal instructions and visual semantic material" (Schaffler, Luders, Dinner, Lesser, & Chelune, 1993). The data of the present experiment, together with the evidence presented above, are in agreement with those theories on the origins of human language that consider it as the evolutionary refinement of an implicit communication system based on hand/mouth goal-directed action representations (Armstrong et al., 1995; Corballis, 2002; Rizzolatti & Arbib, 1998). This possibility finds further support in a recent experiment based on the analysis of brain MRIs of three great ape species (Pan troglodytes, Pan paniscus and Gorilla gorilla), showing that the extension of BA 44 is larger in the left hemisphere than in the right. While a similar asymmetry in humans has been correlated with language dominance (Cantalupo & Hopkins, 2001), this hypothesis does not fit in the case of apes. It might be, however, indicative of an initial specialization of BA 44 for communication. In fact, in captive great apes manual gestures are both referential and intentional, and are preferentially produced by the right hand. Moreover, this right-hand bias is consistently greater when gesturing is accompanied by vocalization (Hopkins & Leavens, 1998).

In conclusion, our results support a common origin for human speech and gestural communication in non-human primates. It has been proposed that the development of human speech is a consequence of the fact that the precursor of Broca's area was endowed, before the emergence of speech, with a gestural recognition system (Rizzolatti & Arbib, 1998).

Figure 2. Results of the analysis focused on bilateral area 44. (A) Cytoarchitectonically defined probability map of the location of left and right area 44, drawn on the Colin27 T1 standard brain on the basis of the Juelich-MNI database (27). The white cross superimposed on each brain indicates the origin of the co-ordinates system (x = y = z = 0). The correspondence between colors and percent probability is given by the upper color bar. (B), (C) and (D), significant voxels (P < .005, random effects analysis) falling inside area 44, as defined by the probability map shown in (A), in the three contrasts indicated in the Figure. Color bar: T-values. Note the similar pattern of right hemisphere activation in (B) and (C), the similar location of the posterior-dorsal activation of the left hemisphere in (B) and (C), and the two additional foci in the pars opercularis and pars triangularis of Broca's area in (B). Note, in (D), the survival of the activation in the pars opercularis after subtraction of real hands from animal hand shadows. When the reverse contrasts were tested (real animals vs. either animal shadows or real hands), the results failed to show any significant activation within area 44.


Here we have taken a step forward, empirically showing for the first time that human Broca's area is not an exclusive "speech" center but, most probably, a motor assembly center in which communicative gestures, whether linguistic or otherwise, are assembled and decoded (Fadiga et al., 2006). It still remains unclear whether hand/speech motor representations are mapped in this area according to a somatotopic organization, or whether Broca's area works in a supramodal way, by dealing with effector-independent motor rules.


REFERENCES

Amunts, K., Schleicher, A., Burgel, U., Mohlberg, H., Uylings, H. B., & Zilles, K. (1999). Broca's region revisited: Cytoarchitecture and intersubject variability. Journal of Comparative Neurology, 412, 319–341.

Armstrong, A. C., Stokoe, W. C., & Wilcox, S. E. (1995). Gesture and the nature of language. Cambridge, UK: Cambridge University Press.

Aziz-Zadeh, L., Koski, L., Zaidel, E., Mazziotta, J., & Iacoboni, M. (2006). Lateralization of the human mirror neuron system. Journal of Neuroscience, 26(11), 2964–2970.

Bookheimer, S. (2002). Functional MRI of language: New approaches to understanding the cortical organization of semantic processing. Annual Review of Neuroscience, 25, 151–188.

Buccino, G., Binkofski, F., Fink, G. R., Fadiga, L., Fogassi, L., Gallese, V., et al. (2001). Action observation activates premotor and parietal areas in a somatotopic manner: An fMRI study. European Journal of Neuroscience, 13, 400–404.

Buccino, G., Lui, F., Canessa, N., Patteri, I., Lagravinese, G., Benuzzi, F., et al. (2004). Neural circuits involved in the recognition of actions performed by nonconspecifics: An fMRI study. Journal of Cognitive Neuroscience, 16, 114–126.

Cantalupo, C., & Hopkins, W. D. (2001). Asymmetric Broca's area in great apes. Nature, 414(6863), 505.

Chomsky, N. (1966). Cartesian linguistics. New York: Harper & Row.

Corballis, M. C. (2002). From hand to mouth: The origins of language. Princeton, NJ: Princeton University Press.

Decety, J., & Chaminade, T. (2003). Neural correlates of feeling sympathy. Neuropsychologia, 41, 127–138.

Decety, J., Grezes, J., Costes, N., Perani, D., Jeannerod, M., Procyk, E., et al. (1997). Brain activity during observation of actions: Influence of action content and subject's strategy. Brain, 120, 1763–1777.

Di Pellegrino, G., Fadiga, L., Fogassi, L., Gallese, V., & Rizzolatti, G. (1992). Understanding motor events: A neurophysiological study. Experimental Brain Research, 91, 176–180.

Eickhoff, S. B., Stephan, K. E., Mohlberg, H., Grefkes, C., Fink, G. R., Amunts, K., et al. (2005). A new SPM toolbox for combining probabilistic cytoarchitectonic maps and functional imaging data. Neuroimage, 25, 1325–1335.

Fadiga, L., Craighero, L., & Roy, A. C. (2006). Broca's area: A speech area? In Y. Grodzinsky & K. Amunts (Eds.), Broca's region. New York: Oxford University Press.

Fadiga, L., Craighero, L., Buccino, G., & Rizzolatti, G. (2002). Speech listening specifically modulates the excitability of tongue muscles: A TMS study. European Journal of Neuroscience, 15, 399–402.

Gallese, V., Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119, 593–609.

Grafton, S. T., Arbib, M. A., Fadiga, L., & Rizzolatti, G. (1996). Localization of grasp representations in humans by PET: 2. Observation compared with imagination. Experimental Brain Research, 112, 103–111.

Grezes, J., & Decety, J. (2001). Functional anatomy of execution, mental simulation, observation, and verb generation of actions: A meta-analysis. Human Brain Mapping, 12, 1–19.

Grezes, J., Armony, J. L., Rowe, J., & Passingham, R. E. (2003). Activations related to "mirror" and "canonical" neurones in the human brain: An fMRI study. Neuroimage, 18, 928–937.

Grezes, J., Costes, N., & Decety, J. (1998). Top-down effect of strategy on the perception of human biological motion: A PET investigation. Cognitive Neuropsychology, 15, 553–582.

Hopkins, W. D., & Leavens, D. A. (1998). Hand use and gestural communication in chimpanzees (Pan troglodytes). Journal of Comparative Psychology, 112(1), 95–99.

Horwitz, B., Amunts, K., Bhattacharyya, R., Patkin, D., Jeffries, K., Zilles, K., et al. (2003). Activation of Broca's area during the production of spoken and signed language: A combined cytoarchitectonic mapping and PET analysis. Neuropsychologia, 41, 1868–1876.

Iacoboni, M., Woods, R. P., Brass, M., Bekkering, H., Mazziotta, J. C., & Rizzolatti, G. (1999). Cortical mechanisms of human imitation. Science, 286, 2526–2528.

Luria, A. (1966). The higher cortical function in man. New York: Basic Books.

Matelli, M., Luppino, G., & Rizzolatti, G. (1985). Patterns of cytochrome oxidase activity in the frontal agranular cortex of macaque monkey. Behavioral Brain Research, 18, 125–137.

Palmer, E. D., Rosen, H. J., Ojemann, J. G., Buckner, R. L., Kelley, W. M., & Petersen, S. E. (2001). An event-related fMRI study of overt and covert word stem completion. Neuroimage, 14, 182–193.

Petrides, M. (2006). Broca's area in the human and nonhuman primate brain. In Y. Grodzinsky & K. Amunts (Eds.), Broca's region. New York: Oxford University Press.

Petrides, M., & Pandya, D. N. (1997). In F. Boller & J. Grafman (Eds.), Handbook of neuropsychology (Vol. IX, pp. 17–58). New York: Elsevier.

Pinker, S. (1994). The language instinct: How the mind creates language. New York: William Morrow.


Rizzolatti, G., & Arbib, M. A. (1998). Language within our grasp. Trends in Neuroscience, 21, 188–194.

Rizzolatti, G., Fadiga, L., Gallese, V., & Fogassi, L. (1996). Premotor cortex and the recognition of motor actions. Cognitive Brain Research, 3, 131–141.

Rizzolatti, G., Fadiga, L., Matelli, M., Bettinardi, V., Paulesu, E., Perani, D., et al. (1996). Localization of grasp representations in human by PET: 1. Observation versus execution. Experimental Brain Research, 111, 246–252.

Schaffler, L., Luders, H. O., Dinner, D. S., Lesser, R. P., & Chelune, G. J. (1993). Comprehension deficits elicited by electrical stimulation of Broca's area. Brain, 116, 695–715.

Talairach, J., & Tournoux, P. (1988). Co-planar stereotactic atlas of the human brain. New York: Georg Thieme Verlag.

Umilta, M. A., Kohler, E., Gallese, V., Fogassi, L., Fadiga, L., Keysers, C., et al. (2001). I know what you are doing: A neurophysiological study. Neuron, 31, 155–165.

Von Bonin, G., & Bailey, P. (1947). The neocortex of Macaca mulatta. Urbana, IL: University of Illinois Press.

Wilson, S. M., Saygin, A. P., Sereno, M. I., & Iacoboni, M. (2004). Listening to speech activates motor areas involved in speech production. Nature Neuroscience, 7(7), 701–702.


Infants predict other people's action goals

Terje Falck-Ytter, Gustaf Gredebäck & Claes von Hofsten

Do infants come to understand other people's actions through a mirror neuron system that maps an observed action onto motor representations of that action? We demonstrate that a specialized system for action perception guides proactive goal-directed eye movements in 12-month-old but not in 6-month-old infants, providing direct support for this view. The activation of this system requires observing an interaction between the hand of the agent and an object.

Neurophysiological1,2 and brain-imaging3 studies indicate that a mirror neuron system (MNS) mediates action understanding in human adults and monkeys. Mirror neurons were first discovered in the premotor area (F5) of the macaque, and they respond both when the animal performs a particular object-related action and when the animal observes another individual perform a similar action. Thus, mirror neurons mediate a direct matching process, in which observed actions are mapped onto motor representations of that action4,5.

When performing visually guided actions, action plans encode proactive goal-directed eye movements, which are crucial for planning and control6,7. Adults also perform such eye movements when the observed actions are performed by others8, indicating that action plans guide the oculomotor system when people are observing others' actions as well. As the MNS mediates such matching processes1,3,4, there is a direct link between MNS activity and proactive goal-directed eye movements during action observation8.

Recently, a strong claim has been presented about the role of the MNS in human ontogeny. According to the MNS hypothesis of social cognition, the MNS constitutes a basis for important social competences such as imitation, 'theory of mind' and communication by means of gesture and language4,9. If this hypothesis is correct, the MNS should be functional simultaneously with or before the infant's development of such competencies, which emerge around 8–12 months of life10. Furthermore, according to the MNS theory, proactive goal-directed eye movements during the observation of actions reflect the fact that the observer maps the observed actions onto the motor representations of those actions8. This implies that the development of such eye movements is dependent on action development. Therefore, infants are not expected to predict others' action goals before they can perform the actions themselves. Infants begin to master the action shown in our study at around 7–9 months of life11. Thus, if the MNS is a basis of early social cognition, proactive goal-directed eye movements should be present at 12 but not at 6 months.

Although habituation studies indicate that young infants distinguish between means and ends when observing actions12,13, no one has tested the critical question of when infants come to predict the goals of others' actions. Using a gaze-recording technique, we tested the MNS hypothesis by comparing the eye movements of adults (n = 11), 12-month-old infants (n = 11) and 6-month-old infants (n = 11) during video presentations showing nine identical trials in which three toys were moved by an actor's hand into a bucket (Test 1). We compared gaze behavior, consisting of (i) timing (ms) of gaze arrival at the goal (the bucket) relative to the arrival of the moving target and (ii) ratio of looking time in the goal area to total looking time in combined goal and trajectory areas during object movement (see Supplementary Methods online and Fig. 1a).
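For concreteness, the two gaze measures just described could be computed along the following lines. This is a hypothetical Python sketch (the function names and AOI labelling scheme are our own assumptions, not the authors' analysis code).

    import numpy as np

    def gaze_lead_time(gaze_t_at_goal, target_t_at_goal):
        # Timing (ms) of gaze arrival at the goal relative to target arrival;
        # positive values mean the gaze reached the goal before the moving target.
        return target_t_at_goal - gaze_t_at_goal

    def goal_looking_ratio(aoi_labels):
        # Ratio of looking time in the goal AOI to looking time in goal plus
        # trajectory AOIs; aoi_labels holds one AOI label per gaze sample.
        labels = np.asarray(aoi_labels)
        goal = np.sum(labels == "goal")
        trajectory = np.sum(labels == "trajectory")
        return goal / float(goal + trajectory)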

In adults, the MNS is only activated when someone is seeing an agent perform actions, not when objects move alone8. However, according to the teleological stance theory14, seeing a human agent is not necessary for infants to ascribe goal-directedness to observed events. This theory states that objects that are clearly directed toward a goal and move rationally within the constraints of the situation are perceived as goal-directed by 12-month-old infants. This implies that seeing an interaction between the actor's hand and the toys is not necessary for eliciting proactive, goal-directed eye movements. By comparing the gaze performance of adults (n = 33) and 12-month-olds (n = 33) in three conditions, we evaluated this alternative hypothesis (Test 2). The first condition, 'human agent', was identical to the one in Test 1 (same data used in both tests). In the 'self-propelled' condition (Fig. 1a), the motion was identical to that in the human agent condition except that no hand moved the toys. We also included a 'mechanical motion' condition (Fig. 1b).

Figure 1 Sample pictures of stimulus videos. (a) Stimulus in the human agent and self-propelled conditions with areas of interest (AOIs; black rectangles) and trajectories for each object (colored lines) superimposed. Left AOI was labeled "goal AOI," right AOI was labeled "object AOI," and upper AOI was labeled "trajectory AOI." (b) Stimulus in the mechanical motion condition.

Received 22 February; accepted 26 May; published online 18 June 2006; doi:10.1038/nn1729

Department of Psychology, Uppsala University, Box 1225, 751 42 Uppsala, Sweden. Correspondence should be addressed to T.F.-Y. ([email protected]).


We assigned subjects randomly to the conditions, and sample sizes were equal.

Parents and adult subjects provided written consent according to the guidelines specified by the Ethical Committee at Uppsala University, and the study was conducted in accordance with the standards specified in the 1964 Declaration of Helsinki. Preliminary analyses of the data distributions confirmed normality and homogeneity of variance. We used statistical tests (one-way ANOVAs and single-sample t-tests) with two-tailed probabilities (α = 0.05) unless otherwise stated.
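As an illustration only, the reported test types could be reproduced with standard library routines; the data below are synthetic placeholders, not the study's measurements.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    adults = rng.normal(250, 80, 11)      # placeholder gaze lead times (ms), n = 11
    twelve_mo = rng.normal(150, 90, 11)
    six_mo = rng.normal(-90, 70, 11)

    F, p = stats.f_oneway(adults, twelve_mo, six_mo)   # one-way ANOVA across age groups
    t, p_adults = stats.ttest_1samp(adults, 0.0)       # two-tailed test: did gaze lead the target (> 0 ms)?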

In Test 1, we found that in the human agent condition, there was a strong effect of age on predictive eye movements to the action goal (F(2,30) = 19.845, P < 0.001; Fig. 2a and Supplementary Table 1 online). Post hoc testing (Bonferroni) showed that there was no significant difference between the adults and the 12-month-olds, whereas the differences between both these groups and the 6-month-olds were significant (P < 0.001). Furthermore, both adults (t(10) = 5.498, P < 0.001) and 12-month-olds (t(10) = 2.425, P = 0.036) shifted their gaze to the goal of the action before the hand arrived, whereas 6-month-olds shifted their gaze to the goal after the hand arrived (t(10) = 3.165, P = 0.01). The two predictive groups were further analyzed to check for learning effects using Bonferroni-corrected paired-samples t-tests. There was no effect of trial number (n = 9) on timing. From the first to the second action of each trial, gaze lead times increased from 247 to 465 ms in adults (t(10) = 3.186, P = 0.01) and from –18 to 334 ms in 12-month-olds (t(10) = 3.516, P = 0.006; for further details see Supplementary Table 2 online).

In the human agent condition, the subjects in the different age groups distributed their fixations differently across the movement trajectory (F(2,30) = 12.015, P < 0.001; Fig. 2b and Supplementary Table 3 online). Post hoc testing (Bonferroni) showed that there was no significant difference between the adults and the 12-month-olds, whereas both these groups differed from the 6-month-olds (P < 0.001). Adults (t(10) = 8.073, P < 0.001) and 12-month-olds (t(10) = 8.625, P < 0.001) looked significantly longer at the goal area during target movement than would be expected if subjects tracked the target, whereas the 6-month-olds did not (see Supplementary Methods online).

In Test 2, we found that in adults and 12-month-olds, predictive eye movements were only activated by a human hand moving the objects (F(2,30) = 7.637, P = 0.002 and F(2,30) = 7.180, P = 0.003, respectively; Fig. 2a). Post hoc testing (Bonferroni) showed that for adults, the human agent condition was significantly different from the self-propelled (P = 0.004) and mechanical motion (P = 0.009) conditions. 12-month-olds showed the same pattern (P = 0.019 and 0.004, respectively). Gaze did not arrive significantly ahead of the moving object in the two control conditions for adults and in the self-propelled condition for the 12-month-olds. The gaze of the 12-month-olds arrived at the goal significantly after the object in the mechanical motion condition (t(10) = 4.253, P = 0.002).

Spatial distribution of fixations was different between the conditions (Fig. 2b) in both adults (F(2,30) = 17.782, P < 0.001) and 12-month-olds (F(2,30) = 36.055, P < 0.001). Post hoc testing (Bonferroni) showed that both adults and 12-month-olds were more goal-directed in the human agent condition than in the self-propelled and mechanical motion conditions (P < 0.001 in all cases). Furthermore, in the two control conditions, the distributions of fixations for both adults and 12-month-olds did not differ significantly from what would be expected if subjects tracked the objects.

The present study supports the view that action understanding in adults results from a mirror neuron system that maps an observed action onto motor representations of that action1,3,5,8. More importantly, we have demonstrated that when observing actions, 12-month-old infants focus on goals in the same way as adults do, whereas 6-month-olds do not (Test 1). The performance of the 6-month-olds cannot originate from a general inability to predict future events with their gaze, because 6-month-olds predict the reappearance of temporarily occluded objects15. Finally, in terms of proactive goal-directed eye movements, we found no support for the teleological stance theory claiming that 12-month-old infants perceive self-propelled objects as goal-directed (Test 2).

In conclusion, proactive goal-directed eye movements seem to require observing an interaction between the hand of an agent and an object, supporting the mirror neuron account in general. Furthermore, infants develop such gaze behavior during the very limited time frame derived from the MNS hypothesis of social cognition. During the second half of their first year, infants come to predict others' actions. The mirror neuron system is likely to mediate this process.

Note: Supplementary information is available on the Nature Neuroscience website.

ACKNOWLEDGMENTS
This research was supported by grants to C.v.H. from the Tercentennial Fund of the Bank of Sweden (J2004-0511), McDonnell Foundation (21002089), and European Union (EU004370: robot-cub).

COMPETING INTERESTS STATEMENT
The authors declare that they have no competing financial interests.

Published online at http://www.nature.com/natureneuroscience

Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/

1. Fadiga, L., Fogassi, L., Pavesi, G. & Rizzolatti, G. J. Neurophysiol. 73, 2608–2611 (1995).
2. Kohler, E. et al. Science 297, 846–848 (2002).
3. Buccino, G. et al. Eur. J. Neurosci. 13, 400–404 (2001).
4. Rizzolatti, G. & Craighero, L. Annu. Rev. Neurosci. 27, 169–192 (2004).
5. Keysers, C. & Perrett, D.I. Trends Cogn. Sci. 8, 501–507 (2004).
6. Land, M.F. & Furneaux, S. Phil. Trans. R. Soc. Lond. B 352, 1231–1239 (1997).
7. Johansson, R.S., Westling, G.R., Backstrom, A. & Flanagan, J.R. J. Neurosci. 21, 6917–6932 (2001).
8. Flanagan, J.R. & Johansson, R.S. Nature 424, 769–771 (2003).
9. Gallese, V., Keysers, C. & Rizzolatti, G. Trends Cogn. Sci. 8, 396–403 (2004).
10. Tomasello, M. The Cultural Origins of Human Cognition (Harvard Univ. Press, London, 1999).
11. Bruner, J.S. in Mechanisms of Motor Skill Development (ed. Connolly, K.J.) 63–91 (Academic, London, 1970).
12. Sommerville, J.A., Woodward, A.L. & Needham, A. Cognition 96, B1–B11 (2005).
13. Luo, Y. & Baillargeon, R. Psychol. Sci. 16, 601–608 (2005).
14. Gergely, G. & Csibra, G. Trends Cogn. Sci. 7, 287–292 (2003).
15. Rosander, K. & von Hofsten, C. Cognition 91, 1–22 (2004).

Figure 2 Gaze performance during observation of actions and moving objects. Statistics (means and s.e.m.) are based on all data points for adults (left), 12-month-old infants (middle) and 6-month-old infants (right), respectively. (a) Timing (ms) of gaze arrival at the goal relative to the arrival of the moving target. Target arrival is represented by the horizontal line at 0 ms. Positive values correspond to early arrival of gaze at the goal area. (b) Ratios of looking time at the goal area to total looking time in both goal and trajectory areas during object movement. The horizontal line at 0.2 represents the ratio expected if subjects tracked the moving target. HA = human agent; SM = self-propelled; MM = mechanical motion.


Shared Challenges in Object Perception for Robots and Infants†

Paul Fitzpatrick∗, Amy Needham∗∗, Lorenzo Natale∗∗∗, Giorgio Metta∗

∗ LIRA-Lab, DIST, University of Genova, Viale F. Causa 13, 16145 Genova, Italy
∗∗ Duke University, 9 Flowers Drive, Durham, NC 27798, USA
∗∗∗ MIT CSAIL, 32 Vassar St, Cambridge, MA 02139, USA

Abstract

Robots and humans receive partial, fragmentary hints about the world’s state through their respective

sensors. In this paper, we focus on some fundamental problems in perception that have attracted the

attention of researchers in both robotics and infant development: object segregation, intermodal integration, and the role of embodiment. We concentrate on identifying points of contact between the two

fields, and also important questions identified in one field and not yet addressed in the other. For object

segregation, both fields have examined the idea of using “key events” where perception is in some way

simplified and the infant or robot acquires knowledge that can be exploited at other times. We examine

this parallel research in some detail. We propose that the identification of the key events themselves

constitutes a point of contact between the fields. And although the specific algorithms used in robots

are not easy to relate to infant development, the overall “algorithmic skeleton” formed by the set of

algorithms needed to identify and exploit key events may in fact form a basis for mutual dialogue.

Keywords: Infant development, robotics, object segregation, intermodal integration, embodiment.

Address correspondence to the first author, Paul Fitzpatrick, at:

email: [email protected]; phone: +39-010-3532946; fax: +39-010-3532144; address: LIRA-Lab, DIST, University of Genova, Viale F. Causa 13, 16145 Genova, Italy

†Due to time constraints, the final draft of this paper was not seen by all authors at the time of submission. Since the size limit on the paper required much review material to be discarded and arguments to be shortened, any factual errors or misrepresentations introduced are the responsibility of the first author, and should not be attributed to any other author.

1. Introduction

Imagine if your body’s sensory experience were presented to you as column after column of numbers. One

number might represent the amount of light hitting a particular photoreceptor, another might be related to the

pressure on a tiny patch of skin. Imagine further if you could only control your body by putting numbers in a

spreadsheet, with different numbers controlling different muscles and organs in different ways.

This is how a robot experiences the world. It is also a (crude) model of how humans experience the world.

Of course, our sensing and actuation are not encoded as numbers in the same sense, but aspects of the world

and our bodies are transformed to and from internal signals that, in themselves, bear no trace of the signals’

origin. For example, a neuron firing selectively to a red stimulus is not itself necessarily red. The feasibility

of telepresence and virtual reality shows that this kind of model is not completely wrong-headed; if we place

digital transducers between a human and the world, they can still function well.

Understanding how to build robots requires understanding in a very concrete and detailed way how it is

possible to sense and respond to the world. Does this match the concerns of psychology? For work concerned

with modeling phenomena that exist at a high level of abstraction, or deeply rooted in culture, history, and

biology, the answer is perhaps “no”. But for immediate perception of the environment, the answer must be

at least partially “yes”, because the problems of perception are so great that intrinsic, physical, non-cultural,

non-biological, non-historical constraints must play a role. We expect that there will be commonality between

how infants and successful robots operate at the information-processing level, because natural environments

are of mixed, inconstant observability – there are properties of the environment that can be perceived easily

under some circumstances and with great difficulty (or not at all) under others. This network of opportunities

and frustrations should place limits on information processing that apply both to humans and robots with

human-like sensors.

In this paper, we focus on early perceptual development in infants. The perceptual judgements infants make

change over time, showing a changing sensitivity to various cues. This progression may be at least partially

due to knowledge gained from experience. We identify opportunities that can be exploited by both robots and

infants to perceive properties of their environment that cannot be directly perceived in other circumstances.

We review some of what is known of how robots and infants can exploit such opportunities to learn to make

reasonable inferences about object properties that are not directly given in the display through correlations

with observable properties, grounding those inferences in prior experience. The topics we focus on are object

segregation, intermodal integration, and embodiment.

2. Object segregation

[Figure 1 about here.]

The world around us has structure, and to an adult appears to be made up of more-or-less well-defined

objects. Perceiving the world this way sounds trivial, but from an engineering perspective, it is heart-breakingly

complex. As Spelke wrote in 1990:

... the ability to organize unexpected, cluttered, and changing arrays into objects is mysterious: so

mysterious that no existing mechanical vision system can accomplish this task in any general manner.

(Spelke, 1990)

This is still true today. This ability to assign boundaries to objects in visually presented scenes (called “object

segregation” in psychology or “object segmentation” in engineering) cannot yet be successfully automated for

arbitrary object sets in unconstrained environments (see Figure 1).

There has been algorithmic progress; for example, given local measures of similarity between neighboring elements of a visual scene, a globally appropriate set of boundaries can be inferred in efficient and well-founded

ways (see for example Felzenszwalb and Huttenlocher (2004)). Just as importantly, there is also a growing

awareness of the importance of collecting and exploiting empirical knowledge about the statistical combinations

of materials, shapes, lighting, and viewpoints that actually occur in our world (see for example Martin et al.

(2004)). Of course, such knowledge can only be captured and used effectively because of algorithmic advances

in machine learning, but the knowledge itself is not specified by an algorithm.

Empirical, non-algorithmic knowledge of this kind now plays a key role in machine perception tasks of

all sorts. For example, face detection took a step forward with Viola and Jones (2004); the success of this

work was due both to algorithmic innovation and better exploitation of knowledge (features learned from 5000

hand-labelled face examples). Automatic speech recognition is successful largely because of the

collection and exploitation of extensive corpuses of clearly labelled phoneme or phoneme-pair examples that

cover well the domain of utterances to be recognized.

These two examples clarify ways “knowledge” can play a role in machine perception. The bulk of the

“knowledge” used in such systems takes the form of labelled examples – examples of input from the sensors (a

vector of numbers), and the corresponding desired output interpretation (another vector of numbers). More-

or-less general purpose machine learning algorithms can then approximate the mapping from sensor input to

desired output interpretation based on the examples (called the training set), and apply that approximation

to novel situations (called the test set). Generally, this approximation will be very poor unless we transform

the sensory input in a manner that highlights properties that the programmer believes may be relevant. This

transformation is called preprocessing and feature selection. There is a great deal of infrastructure wrapped

around the learning system to break the problem down into tractable parts and apply the results of learning

back to the original problem. For this paper, we will group all this infrastructure and call it the algorithmic

skeleton. This set of carefully interlocking algorithms is designed so that, when fed with appropriate training

data, it fleshes out into a functional system. Without the algorithmic skeleton, there would be no way to make

sense of the training data, and without the data, perception would be crude and uninformed.
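The "labelled examples, preprocessing, learned mapping" pattern described above can be made concrete with a toy sketch; the data are synthetic and the particular preprocessor and learner are arbitrary choices of ours, not something prescribed by this paper.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 16))        # sensor-derived feature vectors (training set)
    y_train = (X_train[:, 0] > 0).astype(int)   # desired output interpretation (labels)

    model = make_pipeline(StandardScaler(), LogisticRegression())  # preprocessing + learner
    model.fit(X_train, y_train)                     # flesh out the skeleton with training data
    print(model.predict(rng.normal(size=(5, 16))))  # apply the approximation to novel input (test set)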

What, then, is the right algorithmic skeleton for object segregation? What set of algorithms, coupled with

what kind of training data, would lead to best performance? We review suggestive work in infant development

research and robotics.

[Figure 2 about here.]

2.1 Segregation skills in infants

By 4 to 5 months of age, infants can parse simple displays like the one in Figure 2 into units, based on something

like a subset of static Gestalt principles – see for example Needham (1998, 2000). Initial studies indicated that

infants use a collection of features to parse the displays (Needham, 1998; Needham and Baillargeon, 1997, 1998);

subsequent studies suggested that object shape is the key feature that young infants use to identify boundaries

between adjacent objects (Needham, 1999). Compared to adult judgements, we would expect such strategies

to lead to many incorrect parsings, but they will also provide reasonable best guess interpretations of uniform

objects in complex displays.

Infants do not come prepared from birth to segregate objects into units that match adult judgement. It

appears that infants learn over time how object features can be used to predict object boundaries. More than

twenty years ago, Kellman and Spelke (1983) suggested that infants may be born with knowledge about solid,

three-dimensional objects and that this knowledge could help them interpret portions of a moving object as

connected to other portions that were moving in unison. This assertion was put to the test by Slater and his

colleagues (Slater et al., 1990), a test that resulted in a new conception of the neonate’s visual world. Rather

than interpreting common motion as a cue to object unity, neonates appeared to interpret the visible portions

of a partly occluded object as clearly separate from each other, even when undergoing common motion. This

finding was important because it revealed one way in which learning likely changes how infants interpret their

visual world.

Although segregating adjacent objects presents a very similar kind of perceptual problem ("are these surfaces

connected or not”), the critical components of success might be quite different. Early work with adjacent objects

indicated that at 3 months of age, infants tend to group all touching surfaces into a single unit (Kestenbaum

et al., 1987). Subsequent experiments have revealed that soon after this point in development, infants begin to

analyze the perceptual differences between adjacent surfaces and segregate surfaces with different features (but

not those with similar features) into separate units (Needham, 2000). 8.5-month-old infants have been shown to

also use information about specific objects or classes of objects to guide their judgement (Needham, Cantlon,

& Ormsby, 2005).

It might be that extensive amounts of experience are required to ‘train up’ this system. However, it might

also be that infants learn on the basis of relatively few exposures to key events (Baillargeon, 1999). This

possibility was investigated within the context of object segregation by asking how infants’ parsing of a display

would be altered by a brief prior exposure to one of the objects in the test display.

In this paradigm, a test display was used that was ambiguous to 4.5-month-old infants who had no prior

experience with the display. Prior experience was then given that would help disambiguate the display for

infants. This experience consisted of a brief prior exposure (visual only) to a portion of the test display. If

infants used this prior experience to help them parse the test display, they should see the display as two separate

objects and look reliably longer when they moved as a whole than when they moved separately. Alternatively,

if the prior experience was ineffective in altering infants’ interpretation of the display, they should look about

equally at the display, just as the infants in the initial study with no particular prior experience did (Needham

and Baillargeon, 1998). Prior experiences with either portion of the test display were effective in facilitating

infants’ parsing of the test display.

2.2 Segregation skills in robots

This idea that exposure to key events could influence segregation is intuitive, and evidently operative in infants.

Yet it is not generally studied or used in mechanical systems for object segregation. In this section, we attempt

to reformulate robotics work by the authors in these terms.

For object segregation in robotics, we will interpret “key events” as moments in the robot’s experience where

the true boundary of an object can be reliably inferred. They offer an opportunity to determine correlates of

the boundary that can be detected outside of the limited context of the key events themselves. Thus, with an

appropriate algorithmic skeleton, information learned during key events can be applied more broadly.

Key events in infants include seeing an object in isolation or seeing objects in relative motion. In the authors’

work, algorithmic skeletons have been developed for exploiting constrained analogues of these situations.

In Natale et al. (2005b), a very simple key event is used to learn about objects – holding an object up to the

face. The robot can be handed an object or happen to grasp it, and will then hold it up close to its cameras. This

gives a good view of its surface features, allowing the robot to do some learning and later correctly segregate the

object out from the background visually even when out of its grasp (see Figure 3). This is similar to an isolated

presentation of an object, as in Needham’s experiments. In real environments, true isolation is very unlikely,

and actively moving an object so that it dominates the scene can be beneficial.
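One simple way to exploit such a close-up "key event" is to learn a colour-histogram model of the held object and later back-project it onto cluttered views. The following is only an illustrative sketch under that assumption (file names are placeholders); it is not the method of Natale et al. (2005b).

    import cv2

    # Key event: the object dominates the view, so its colour statistics can be learned.
    closeup = cv2.cvtColor(cv2.imread("object_closeup.png"), cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([closeup], [0, 1], None, [30, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

    # Later, outside the key event: find pixels whose colour matches the learned model.
    scene = cv2.cvtColor(cv2.imread("cluttered_scene.png"), cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([scene], [0, 1], hist, [0, 180, 0, 256], 1)
    _, mask = cv2.threshold(backproj, 50, 255, cv2.THRESH_BINARY)  # rough segregation mask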

In Fitzpatrick and Metta (2003), the “key event” used is hitting an object with the hand/arm. This is a

constrained form of relative object motion. In the real world, all sorts of strange motions happen which can

be hard to parse, so it is simpler, at least to begin with, to focus on situations the robot can initiate and at

least partially control. Motion caused by body impact has some technical advantages; the impactor (the arm)

is understood and can be tracked, and since the moment and place of impact can be detected quite precisely,

unrelated motion in the scene can be largely filtered out.

The algorithmic skeleton in (Fitzpatrick and Metta, 2003) processes views of the arm moving, detects

impacts, and outputs boundary estimates of whatever the arm collides with based on motion. These boundaries,

and what they contain, are used as training data for another algorithm, whose purpose is to estimate boundaries

from visual appearance when motion information is not available. See Fitzpatrick (2003) for technical details.

As a basic overview, the classes of algorithms involved are these:

1. Behavior system: an algorithm that drives the robot’s behavior, so that it is likely to hit things. This system

has a broader agenda than just this specific outcome.

2. Key event detection: an algorithm that detects the key event of when the arm/hand hits an object.

3. Training data extraction: an algorithm that, within a key event, extracts boundary information from object

motion caused by hitting.

4. Machine learning: an algorithm that discovers features predictive of boundaries that can be extracted in

other situations without motion (in this case edge and color combinations).

5. Application of learning: an algorithm that actually uses those features to predict boundaries. This must

be integrated with the very first algorithm, to influence the robot’s behavior in useful ways. In terms of

observable behavior, the robot’s ability to attend and fixate specific objects increases, since they become

segregated from the background.

This skeleton gets fleshed out through learning, once the robot actually starts hitting objects and extracting

specific features predictive of the boundaries of specific objects. A set of different algorithms performing

analogous roles is used for the Natale et al. (2005b) example. At the algorithmic level, the technical concerns

in either case are hard to relate to infant development, although in some cases there are correspondences. But

at the skeletal level, the concerns are coming closer to those of infant development.
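To make the skeleton above concrete, a stripped-down sketch of steps 2–4 might look as follows. This is our own schematic illustration under simplifying assumptions (frame differencing for the motion cue, a trivial mean-colour appearance model), not the original implementation.

    import numpy as np

    def boundary_from_motion(frame_before, frame_after, thresh=25):
        # Around a detected impact, treat large inter-frame differences as the
        # moving (poked) object: a motion-based boundary estimate.
        diff = np.abs(frame_after.astype(int) - frame_before.astype(int)).sum(axis=2)
        return diff > thresh

    def learn_appearance(frame, mask):
        # Use the motion-derived mask as training data for an appearance model
        # (here just the mean colour of the segmented region).
        return frame[mask].mean(axis=0)

    def predict_boundary(frame, appearance, tol=40):
        # Apply the learned appearance model when no motion cue is available.
        dist = np.linalg.norm(frame.astype(int) - appearance, axis=2)
        return dist < tol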

[Figure 3 about here.]

2.3 Specificity of knowledge gained from experience

In the robotic learning examples in the previous section (Fitzpatrick, 2003; Natale et al., 2005b), information

learned by the robot is intended to be specific to one particular object. The specificity could be varied algorithmically, by adding or removing parts of a feature's "identity". Too much specificity, and the feature will not be

recognized in another context. Too little, and it will be “hallucinated” everywhere.

We return to Needham’s experiments, which probed the question of generalization in the same experimental

scenario described in Section 2.1.

When changes were introduced between the object seen during familiarization and that seen as part of the

test display, an unexpected pattern emerged. Nearly any change in the object’s features introduced between

familiarization and test prevented infants from benefiting from this prior experience. So, even when infants saw

a blue box with yellow squares prior to testing, and the box used in testing had white squares but was otherwise

identical, they did not apply this prior experience to the parsing of the test display. However, infants did benefit

from the prior exposure when the change was not in the features of the object but rather in its orientation (Needham, 2001). A change in the orientation of the box from horizontal to vertical led to the facilitation in parsing seen in some prior experiments. Thus, infants even as young as 4.5 to 5 months of age know that to

probe whether they have seen an object before, they must attend to the object’s features rather than its spatial

orientation (Needham, 2001).

These results also support two additional conclusions. First, infants’ object representations include detailed

information about the object’s features. Because infants’ application of their prior experience to the parsing of

the test display was so dependent on something close to an exact match between the features, one must conclude that a highly detailed representation is formed on the initial exposure and maintained during the inter-trial interval. Because these features are remembered and used in the absence of the initial item and in the presence

of a different item, this is strong evidence for infants’ representational abilities. Secondly, 4.5-month-old infants

are conservative generalizers – they do not extend information from one object to another very readily. But

would they extend information from a group of objects to a new object that is a member of that group?

2.4 Generalization of knowledge gained from experience

This question was investigated by Needham et al. (2005) in a study using the same test display and a similar

procedure as in Needham (2001). Infants were given prior experiences with collections of objects, no one of

which was an effective cue to the composition of the test display when seen prior to testing. A set of three similar

objects seen simultaneously prior to test did facilitate 4.5-month-old infants' segregation of the test display. But

no subset of these three objects seen prior to testing facilitated infants’ segregation of the test display. Also,

not just any three objects functioned in this way – sets that had no variation within them or that were too

different from the relevant test item provided no facilitation. Thus, experience with multiple objects that are

varied but that are similar to the target item is important to infants’ transfer of their experience to the target

display.

This finding was brought into the “real” world by investigating infants’ parsing of a test display consisting

of a novel key ring (Needham et al., submitted). According to a strict application of organizational principles

using object features, the display should be seen as composed of (at least) two separate objects – the keys on

one side of the screen and the separate ring on the other side. However, to the extent that infants recognize the

display as a member of a familiar category – key rings – they should group the keys and ring into a single unit

that should move as a whole. The findings indicate that by 8.5 months of age, infants parse the display into a

single unit, expecting the keys and ring to move together. Younger infants do not see the display as a single unit,

and instead parse the keys and ring into separate units. Infants of both ages parsed an altered display, in which

the identifiable portions of the key ring were hidden by patterned covers, as composed of two separate units.

Together, these findings provide evidence that the studies of controlled prior exposure described in the previous

section are consistent with the process as it occurs under natural circumstances. Infants’ ordinary experiences

present them with multiple similar exemplars of key rings, and these exposures build a representation that can

then be applied to novel (and yet similar) instances of the key ring category, altering the interpretation that

would come from feature-based principles alone.

These results from infant development suggest a path for robotics to follow. There is currently no serious

robotics work to point to in this area, despite its importance. Robotics work in this area could potentially aid

infant psychologists since there is a strong theoretical framework in machine learning for issues of generalization.

3. Intermodal integration

We have talked about “key events” in which object boundaries are easier to perceive. In general, the ease with

which any particular object property can be estimated varies from situation to situation. Robots and infants

can exploit the easy times to learn statistical correlates that are available in less clear-cut situations. For

example, cross-modal signals are a particularly rich source of correlates, and have been investigated in robotics

and machine perception. Most events have components that are accessible through different senses: A bouncing

ball can be seen as well as heard; the position of the observer’s own hand can be seen and felt as it moves

through the visual field. Although these perceptual experiences are clearly separate from each other, composing

separate ‘channels’, we also recognize meaningful correspondences between the input from these channels. How

these channels are related in humans is not entirely clear. Different approaches to the development of intermodal

perception posit that infants’ sensory experiences are (a) unified at birth and must be differentiated from each

other over development, or (b) separated at birth and must be linked through repeated pairings. Although the

time frame over which either of these processes would occur has not been well defined, research findings do

suggest that intermodal correspondences are detected early in development.

On what basis do infants detect these correspondences? Some of the earliest work on this topic revealed

that even newborn infants look for the source of a sound (Butterworth and Castillo, 1976) and by 4 months

of age have specific expectations about what they should see when they find the source of the sound (Spelke,

1976). More recent investigations of infants’ auditory-visual correspondences have identified important roles for

synchrony and other amodal properties of objects – properties that can be detected across multiple perceptual

modalities. An impact (e.g., a ball bouncing) provides amodal information because the sound of the ball hitting

the surface is coincident with a sudden change in direction of the ball’s path of motion. Some researchers

have argued (Bahrick) that detection and use of amodal object properties serves to bootstrap the use of more

idiosyncratic properties (e.g., the kind of sound made by an object when it hits a surface). Bahrick & Lickliter

have shown that babies (and bobwhite quail) learn better and faster from multimodal stimulation (see their

Intersensory Redundancy Hypothesis (Bahrick and Lickliter, 2000)).

In robotics, amodal properties such as location have been used – for example, sound localization can aid

visual detection of a talking person. Timing has also been used. Prince and Hollich (2005) develop specific models of audio-visual synchrony detection and evaluate their compatibility with infant performance. Arsenio and

Fitzpatrick (2005) exploit the specific timing cue of regular repetition to form correspondences across sensor

modalities. From an engineering perspective, the redundant information supplied by repetition makes this form

of timing information easier to detect reliably than synchrony of a single event in the presence of background

activity. However, in the model of Lewkowicz (2000), the ordering during infant development is opposite. This

suggests that either the engineers just haven't found a good enough algorithm, or that regular repetition just doesn't happen often enough to be important.
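As a toy illustration of repetition-based binding (ours, not the algorithm of Arsenio and Fitzpatrick, 2005), one can check whether an audio energy envelope and the motion energy of an image region share the same rhythm; all signals below are synthetic stand-ins for real sensor streams.

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(0, 5, 0.02)                                # common 50 Hz time base
    audio_energy = np.abs(np.sin(2 * np.pi * 2.0 * t))       # a 2 Hz "banging" sound
    motion_energy = np.abs(np.sin(2 * np.pi * 2.0 * t + 0.3)) + 0.1 * rng.random(t.size)

    def normalized_correlation(a, b):
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        return float(np.mean(a * b))

    # A high score suggests the sound and the visual motion belong to the same event.
    print(normalized_correlation(audio_energy, motion_energy))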

4. The role of embodiment

The study of perception in biological systems cannot neglect the role of the body in the generation of the sensory

information reaching the brain. The morphology of the body affects perception in numerous ways. In primates,

for example, the visual information that reaches the visual cortex has a non-uniform distribution, due to the

peculiar arrangement and size of the photoreceptors in the retina. In many species the outer ears and the head

filter the sound in a way that facilitates auditory localization. The intrinsic elasticity of the muscles acts as a

low pass filter during the interaction between the limbs and the environment.

Through the body the brain performs actions to explore the environment and collect information about

its properties and rules. In their experiments with human subjects, Lederman and Klatzky (Lederman and

Klatzky, 1987) have identified a set of stereotyped hand movements (exploratory procedures) that adults use when

haptically exploring objects to determine properties like weight, shape, texture and temperature. Lederman

and Klatzky show that each property can be associated with a preferential exploratory procedure which is, if not required, at least optimal for its identification.

These observations support the theory that motor development and the body play an important role in

perceptual development in infancy (Bushnell and Boudreau, 1993). Proper control of at least the head, the arm

and the hand is required before infants can reliably and repetitively engage in interaction with objects. During

the first months of life the inability of infants to perform skillful movements with the hand would prevent them

from haptically exploring the environment and perceiving properties of objects like weight, volume, hardness and

shape. But, even more surprisingly, motor development could affect the developmental course of object visual

perception (like three dimensional shape). Further support to this theory comes from the recent experiment by

Needham and colleagues (A. Needham and Peterman, 2002), where the ability of pre-reaching infants to grasp

objects was artificially anticipated by means of mittens with palms covered with velcro that stuck to some toys

prepared by the experimenters. The results showed that those infants whose grasping ability had been enhanced by the glove were more interested in objects than a reference group of the same age that developed 'normally'. This suggests that, although artificial, the boost in motor development produced by the glove brought forward the infants' interest in objects.

Exploiting actions for learning and perception requires the ability to match actions with the agents that

caused them. The sense of agency (Jeannerod, 2002) gives humans the sense of ownership of their actions and

implies the existence of an internal representation of the body. Although some sort of self-recognition is already

present at birth, at least in the form of a simple hand-eye coordination (van der Meer et al., 1995), it is during

the first months of development that infants learn to recognize their body as a separate entity acting in the

world (Rochat and Striano, 2000). It is believed that to develop this ability infants exploit correlations across

different sensorial channels (combined double touch/correlation between proprioception and vision).

In robotics we have the possibility to study the link between action and perception, and its implications

on the realization of artificial systems. Robots, like infants, can exploit the physical interaction with the

environment to enrich and control their sensorial experience. However these abilities do not come for free. Very

much like an infant, the robot must first learn to identify and control its body, so that the interaction with the

environment is meaningful and, at least to a certain extent, safe. Indeed, motor control is challenging especially

when it involves the physical interaction between the robot and the world.

Inspired by the developmental psychology literature, roboticists have begun to investigate the problem of self-

recognition in robotics (Gold and Scassellati, 2005; Metta and Fitzpatrick, 2003; Natale et al., 2005b; Yoshikawa

et al., 2003). Although different in several respects, in all these works the robot looks for intermodal similarities

and invariances to identify its body from the rest of the world. In the work of Yoshikawa (Yoshikawa et al.,

2003) the rationale is that for any given posture the body of the robot is invariant with respect to the rest

of the world. The correlation between visual information and proprioceptive feedback is learned by a neural

network which is trained to predict the position of the arms in the visual field. Gold and Scassellati (Gold

and Scassellati, 2005) solve the self-recognition problem by exploiting knowledge of the time elapsing between

the actions of the robot and the associated sensorial feedback. In the work of Metta and Fitzpatrick (Metta

and Fitzpatrick, 2003) and Natale et al. (Natale et al., 2005b) actions are instead used to generate visual

motion with a known pattern. Similarities between the proprioceptive and visual flow are then sought in order to visually identify the hand of the robot. Periodicity in this case enhances and simplifies the identification. The robot learns a

multimodal representation of its hand that allows a robust identification in the visual field.
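A minimal caricature of this idea (our own, not any of the cited implementations): the image region whose motion energy best correlates with the robot's own arm speed is taken to be the hand; the signals here are synthetic.

    import numpy as np

    T, n_regions = 500, 6
    rng = np.random.default_rng(1)
    arm_speed = np.abs(np.sin(np.linspace(0, 20, T)))     # proprioceptive signal (own motion)
    motion = rng.random((n_regions, T))                   # visual motion energy per image region
    motion[2] = 0.8 * arm_speed + 0.2 * rng.random(T)     # region 2 happens to contain the hand

    scores = [np.corrcoef(arm_speed, motion[i])[0, 1] for i in range(n_regions)]
    print("hand is most likely in region", int(np.argmax(scores)))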

In our experience with robots we identified three scenarios in which the body proved to be useful in solving

perceptual tasks:

1. direct exploration: the body in this case is the interface to extract information about the objects. For

example in (Natale et al., 2004) haptic information was employed to distinguish objects of different shapes, a

task that would be much more difficult if performed visually. In (Torres-Jara et al., 2005) the robot learned

to recognize a few objects by using the sound they generate upon contact with the fingers.

2. controlled exploration: use the body to perform actions to simplify perception. The robot can deliberately

generate redundant information by performing periodic actions in the environment.

3. the body as a reference frame: during action the hand is the place where important events are most likely

to occur. The ability to direct the attention of the robot towards the hand is particularly helpful during

learning; in (Natale et al., 2005b) we show how this ability allows the robot to learn a visual model of the

objects it manages to grasp by simply inspecting the hand when touch is detected on the palm (see Figure 3).

In similar situations the same behavior could allow the robot to direct the gaze to the hand if something

unexpected touches it. Eye-hand coordination seems thus important to establish a link between different

sensory channels like touch and vision.
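
As an illustration of the third scenario, the sketch below shows a hypothetical touch handler (not the Babybot code; the sensor and image interfaces are invented for the example) that stores the image patch surrounding the known hand position whenever contact is detected on the palm:

    import numpy as np

    def on_palm_touch(image, hand_xy, patch_size=48, template_bank=None):
        """When the palm touch sensor fires, store the image patch around the hand.

        image         : (H, W, 3) camera frame, assumed to already be gazing at the hand
        hand_xy       : (row, col) position of the hand in the image
        template_bank : list accumulating appearance templates of grasped objects
        """
        if template_bank is None:
            template_bank = []
        r, c = hand_xy
        h = patch_size // 2
        patch = image[max(r - h, 0):r + h, max(c - h, 0):c + h].copy()
        template_bank.append(patch)      # later matched against new views to segment the object
        return template_bank

    # Usage with synthetic data: one touch event adds one template.
    frame = (np.random.rand(240, 320, 3) * 255).astype(np.uint8)
    bank = on_palm_touch(frame, hand_xy=(120, 200))
    print(len(bank), bank[0].shape)      # 1 (48, 48, 3)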

5. Conclusions

In the field of humanoid robotics, researchers have a special respect and admiration for the abilities of infants.

They watch their newborn children with particular interest, and their spouses have to constantly be alert for the

tell-tale signs of them running an ad-hoc experiment. It can be depressing to compare the outcome of a five-year,

multi-million-euro/dollar/yen project with what an infant can do after four months. Infants are so clearly doing

what we want robots to do; is there any way to learn from research on infant development? Conversely, can

infant development research be illuminated by the struggles faced by robotics, perhaps pointing out fundamental

perceptual difficulties taken for granted? Is there a way to create a model of development which applies both

to infants and robots? It seems possible that similar sensorial constraints and opportunities will mold both

the unfolding of an infant’s sensitivities to different cues, and the organization of the set of algorithms used by

robots to achieve sophisticated perception. So, at least at the level of identifying “key events” and mutually

reinforcing cues, a shared model is possible. Of course, there is a lot that would not fit in the model, and this

is as it should be. It would be solely concerned with the class of functional, information-driven constraints. We

have not in this paper developed such a model; that would be premature. We have instead identified some points

of connection that could grow into much more. We hope the paper will serve as one more link in the growing

contact between the fields.

References

Needham, A., Barrett, T. B., and Peterman, K. (2002). A pick-me-up for infants’ exploratory skills: Early simulated

experiences reaching for objects using ’sticky mittens’ enhances young infants’ object exploration skills. Infant

Behavior and Development, 25:279–295.

Arsenio, A. M. and Fitzpatrick, P. M. (2005). Exploiting amodal cues for robot perception. International

Journal of Humanoid Robotics, 2(2):125–143.

Bahrick, L. E. and Lickliter, R. (2000). Intersensory redundancy guides attentional selectivity and perceptual

learning in infancy. Developmental Psychology, 36:190–201.

Bushnell, E. and Boudreau, J. (1993). Motor development and the mind: the potential role of motor abilities

as a determinant of aspects of perceptual development. Child Development, 64(4):1005–1021.

Butterworth, G. and Castillo, M. (1976). Coordination of auditory and visual space in newborn human infants.

Perception, 5(2):155–160.

Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International

Journal of Computer Vision, 59(2):167–181.

Fitzpatrick, P. (2003). Object Lesson: discovering and learning to recognize objects. In Proceedings of the 3rd

International IEEE/RAS Conference on Humanoid Robots, Karlsruhe, Germany.

Fitzpatrick, P. and Metta, G. (2003). Grounding vision through experimental manipulation. Philosophical

Transactions of the Royal Society: Mathematical, Physical, and Engineering Sciences, 361(1811):2165–2185.

Gold, K. and Scassellati, B. (2005). Learning about the self and others through contingency. In Developmental

Robotics AAAI Spring Symposium, Stanford, CA.

Jeannerod, M. (2002). The mechanism of self-recognition in humans. Behavioural Brain Research, 142:1–15.

Kellman, P. J. and Spelke, E. S. (1983). Perception of partly occluded objects in infancy. Cognitive Psychology,

15:483–524.

Kestenbaum, R., Termine, N., and Spelke, E. S. (1987). Perception of objects and object boundaries by three-

month-old infants. British Journal of Developmental Psychology, 5:367–383.

Lederman, S. J. and Klatzky, R. L. (1987). Hand movements: A window into haptic object recognition. Cognitive

Psychology, 19(3):342–368.

Lewkowicz, D. J. (2000). The development of intersensory temporal perception: an epigenetic sys-

tems/limitations view. Psych. Bull., 126:281–308.

Martin, D., Fowlkes, C., and Malik, J. (2004). Learning to detect natural image boundaries using local bright-

ness, color and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):530–549.

Metta, G. and Fitzpatrick, P. (2003). Early integration of vision and manipulation. Adaptive Behavior,

11(2):109–128.

Natale, L., Metta, G., and Sandini, G. (2004). Learning haptic representation of objects. In International

Conference on Intelligent Manipulation and Grasping, Genoa, Italy.

Natale, L., Metta, G., and Sandini, G. (2005a). A developmental approach to grasping. In Developmental

Robotics AAAI Spring Symposium, Stanford, CA.

Natale, L., Orabona, F., Metta, G., and Sandini, G. (2005b). Exploring the world through grasping: a devel-

opmental approach. In Proceedings of the 6th CIRA Symposium, Espoo, Finland.

Needham, A. (1998). Infants’ use of featural information in the segregation of stationary objects. Infant Behavior

and Development, 21:47–76.

Needham, A. (1999). The role of shape in 4-month-old infants’ segregation of adjacent objects. Infant Behavior

and Development, 22:161–178.

Needham, A. (2000). Improvements in object exploration skills may facilitate the development of object segre-

gation in early infancy. Journal of Cognition and Development, 1:131–156.

Needham, A. (2001). Object recognition and object segregation in 4.5-month-old infants. Journal of Experi-

mental Child Psychology, 78(1):3–22.

Needham, A. and Baillargeon, R. (1997). Object segregation in 8-month-old infants. Cognition, 62:121–149.

Needham, A. and Baillargeon, R. (1998). Effects of prior experience in 4.5-month-old infants’ object segregation.

Infant Behavior and Development, 21:1–24.

Needham, A., Dueker, G., and Lockhead, G. (2005). Infants’ formation and use of categories to segregate

objects. Cognition, 94(3):215–240.

Prince, C. G. and Hollich, G. J. (2005). Synching models with infants: a perceptual-level model of infant

audio-visual synchrony detection. Journal of Cognitive Systems Research, 6:205–228.

Rochat, P. and Striano, T. (2000). Perceived self in infancy. Infant Behavior and Development, 23:513–530.

Slater, A., Morison, V., Somers, M., Mattock, A., Brown, E., and Taylor, D. (1990). Newborn and older infants’

perception of partly occluded objects. Infant Behavior and Development, 13(1):33–49.

Spelke, E. S. (1976). Infants’ intermodal perception of events. Cognitive Psychology, 8:553–560.

Spelke, E. S. (1990). Principles of object perception. Cognitive Science, 14:29–56.

Torres-Jara, E., Natale, L., and Fitzpatrick, P. (2005). Tapping into touch. In Fifth International Workshop on

Epigenetic Robotics (forthcoming), Nara, Japan. Lund University Cognitive Studies.

van der Meer, A., van der Weel, F., and Lee, D. (1995). The functional significance of arm movements in

neonates. Science, 267:693–695.

Viola, P. and Jones, M. (2004). Robust real-time object detection. International Journal of Computer Vision,

57(2):137–154.

Yoshikawa, Y., Hosoda, K., and Asada, M. (2003). Does the invariance in multi-modalities represent the body

scheme? A case study with vision and proprioception. In 2nd International Symposium on Adaptive Motion

of Animals and Machines, Kyoto, Japan.

List of Figures

1 Object segregation is difficult for machines

2 Object segregation is not always well-defined

3 Exploitation of key moments in robotics

Figure 1: An example from Martin et al. (2004), to highlight the difficulties of bottom-up segmentation. For the

image shown on the left, humans see the definite boundaries shown in white in the middle image. The best machine

segmentation of a set of algorithms gives the result shown on the right – a mess. This seems a very difficult scene to

segment without having some training at least for the specific kinds of materials in the scene.

Figure 2: Object segregation is not necessarily well-defined. On the left, there is a simple scenario, taken from Needham

(2001), showing a rectangle attached to a yellow tube. Two plausible ways to segregate this scene are shown in the

middle, depending on whether the tube and rectangle make up a single object. For comparison, automatically acquired

boundaries are shown on the right, produced using the algorithm in Felzenszwalb and Huttenlocher (2004). This algorithm

does image segmentation, seeking to produce regions that correspond to whole objects (such as the yellow tube) or at

least to object parts (such as the blue rectangle and the small white patches on its surface, and various parts of the

background). Ideally, regions that extend across object boundaries are avoided. Image segmentation is less ambitious

than object segregation, and allows context information to be factored in as a higher level process operating on a region

level rather than pixel level.

Figure 3: The upper row shows object segregation by the robot “Babybot” based on prior experience. The robot explores

the visual appearances of an object that it has grasped; the information collected in this way is used later on to segment

the object (Natale et al., 2005b). Left: the robot. Middle: the robot’s view when holding up an object. Right: later

segmentation of the object. The lower row shows the robot “Cog” detecting object boundaries experimentally by poking

(Fitzpatrick, 2003). During object motion, it finds features of the object that contrast with other objects, and that are

stable with respect to certain geometric transformations. These features are then used to jointly detect and segment

the object in future views. Left: the robot. Middle: segmentations of a poked object. Right: later segmentation of the

object on a similarly-colored table.

RESEARCH ARTICLE

Gustaf Gredebäck · Helena Örnkloo · Claes von Hofsten

The development of reactive saccade latencies

Received: 17 October 2005 / Accepted: 19 January 2006 / Published online: 18 February 2006 / © Springer-Verlag 2006

Abstract Saccadic reaction time (SRT) of 4-, 6- and 8-month-old infants was measured during tracking of abruptly changing trajectories, using a longitudinal design. SRTs decreased from 595 ms (SE=30) at 4 months of age to 442 ms (SE=13) at 8 months of age. In addition, SRTs were lower during high velocities (comparing 4.5 and 9°/s) and vertical (compared to horizontal) saccades.

Keywords Infant · Saccade · Reaction time · SRT · Tracking · ASL

Introduction

Newborn children are unable to discriminate motion direction (Atkinson 2000) or track objects with smooth pursuit. They can, however, perform saccadic eye movements (Aslin 1986). This means that newborns can redirect gaze swiftly to new locations within the visual field. As infants grow older they become more proficient in combining saccades, smooth pursuit, and head movements to track moving targets (von Hofsten 2004).

Several studies have investigated the dynamics of saccade development, often during continuous tracking of sinusoidal trajectories (Gredebäck et al. 2005; Richards and Holley 1999; Phillips et al. 1997; Rosander and von Hofsten 2002; von Hofsten and Rosander 1996, 1997). In such tasks saccadic performance is most often described in terms of frequency and amplitude of saccades. Latencies of individual saccades cannot be reported since saccade initiation is not triggered by a specific external event but occurs as a consequence of insufficient smooth pursuit or in anticipation of future events (Leigh and Zee 1999).

Within this paradigm numerous studies have reported that infants produce more saccades as task demands increase. von Hofsten and Rosander (1996) demonstrated that infants produced more saccades as the frequency of the stimulus motion increased (~1 saccade/s at 0.1 Hz and ~1.75 saccades/s at 0.3 Hz). Similar effects of frequency, as well as amplitude and velocity, have been reported elsewhere (Gredebäck et al. 2005; Phillips et al. 1997; von Hofsten and Rosander 1997). In a similar fashion, increasing the target frequency, amplitude, or velocity will enhance the amplitude (von Hofsten and Rosander 1997) and velocity (Phillips et al. 1997) of initiated saccades.

In addition, reliance on saccadic tracking appears to change with age and to be different for horizontal and vertical trajectories. Comparative studies of children's and adults' saccadic performance indicate that adult levels are not reached until children are between 10 and 12 years old (Yang et al. 2002). In younger infants saccade frequency has been reported to drop from ~1.75 saccades/s at 6.5 weeks to ~0.75 saccades/s at 12 weeks of age in response to a small target, whereas larger targets resulted in fewer saccades/s (target size ranged from 2.5 to 35°; Rosander and von Hofsten 2002). von Hofsten and Rosander (1997) also reported a decrease in saccade amplitude between 2 and 3 months of age (average saccade amplitude = 4.4° at 2 months and 3.4° at 3 months¹) whereas Richards and Holley (1999) described an increase in saccadic amplitude between 2 and 6.5 months of age (from ~8° to ~16°) and a shift from smooth tracking to saccade reliance at high velocities. The latter study presented infants with a vertical bar (2° horizontal × 6° vertical extension) moving on both horizontal and vertical trajectories.

G. Gredebäck · H. Örnkloo · C. von Hofsten
Department of Psychology, Uppsala University, Box 1225, 75142 Uppsala, Sweden
E-mail: [email protected]
Fax: +46-18-4712123

¹ Reported means are calculated from the group means presented in Table 2 of von Hofsten and Rosander (1997).

Exp Brain Res (2006) 173: 159–164
DOI 10.1007/s00221-006-0376-z

Another study that investigated two-dimensional tracking in older infants reported an increase in saccade frequency between 6 months (from ~0 saccades/s at 0.1 Hz to ~0.5 saccades/s at 0.4 Hz) and 12 months (~0.1 saccades/s at 0.1 Hz to ~1.2 saccades/s at 0.4 Hz) of age (Gredebäck et al. 2005). The same study also reported more vertical than horizontal saccades.

Some of the above-mentioned effects converge, namely those that describe an increased reliance on saccades with increased task demands (defined by increased target velocity, amplitude, and frequency). Other results are less consistent, particularly effects of age and dimensionality (comparing horizontal and vertical tracking). Given the composite nature of the task (to use saccades, smooth pursuit, and head movements to keep gaze on target) it is difficult to evaluate whether these effects can be attributed to properties of the saccade system or if they arise as compensatory actions governed by, for example, changes in smooth pursuit proficiency.

Alternative approaches, in which infants are presented with sequences of pictures that alternate between two locations, engage the saccadic system in a more direct manner and enable reports of individual saccade latencies. In a series of studies, Canfield et al. (1997) investigated 2- to 12-month-old infants' saccadic reaction time (SRT) to static stimuli that alternated between different locations. In this study infants were presented with different sequences of left–right located pictures. These sequences ranged from highly predictable ones where the pictures alternated between two positions (L–R), to sequences that included repetitions (L–L–R), and to irregular presentations without a set sequence. Their conclusion was that infants' average SRT decreased from 440 ms at 2 months to 285 ms at 12 months of age (see also Rensink et al. 2000).

Substantially longer latencies for reactive saccades have been reported elsewhere (Aslin and Salapatek 1975; Bronson 1982). Bronson (1982) examined SRTs to peripheral targets and found that the SRTs equaled 1 s at 2 months of age and 0.5 s at 5 months of age. Aslin and Salapatek (1975) measured the latency and amplitude of 1- and 2-month-old infants' reactive saccades. They found that the SRT in both age groups depended on the size of the target offset. At 1 month of age, the median SRT ranged from 800 ms (10° offset) to 1,480 ms (30° offset). One month later a significant decrease was observed, with corresponding SRTs being 480 and 1,280 ms. It thus appears that SRTs vary substantially over different tasks.

Measuring SRTs to repetitive stimuli, as described above, induces its own set of confounds. Richards (2000) demonstrated that priming effects emerge between 3 and 6 months of age in response to repetitive presentations of target locations. In addition, Wentworth and Haith (1992) reported that predictable cues were followed by a higher degree of anticipation and speedier reactions. As such, there appear to be two slightly different effects that both affect SRTs in response to repetitive static stimuli: priming effects and an increased tendency to predict the next stimulus location. These effects might have lowered the average SRT in some studies.

The current study attempts to bridge the gap between these two paradigms by presenting 4-, 6-, and 8-month-old infants with targets that move on linear trajectories that suddenly alter their direction of motion. This gives us the opportunity to observe horizontal and vertical saccades during ongoing tracking [in accordance with the paradigm used by Gredebäck et al. (2005), von Hofsten and Rosander (1997) and Richards and Holley (1999)] and to measure SRTs of individual saccades [in accordance with the paradigm used by Canfield et al. (1997) and Bronson (1982)]. Abruptly changing the trajectory makes it possible to evaluate individual saccades fairly independently of smooth pursuit performance, and since each trajectory is unique the risk of priming effects and predictive saccades is kept to a minimum.

Similar tasks have been successfully carried out with adult participants. In an influential study by Engel et al. (1999), adult SRTs were measured during continuous tracking of linear trajectories that suddenly altered their direction of motion. The target moved with one of two velocities (15 or 30°/s) and changed trajectory at a random position near the center of the screen to one of 11 alternative trajectories. In their study adult SRTs averaged 197 ms (SD=28 ms).

Method

Participants

Sixteen infants participated in this longitudinal study (10 males and 6 females). They visited the lab at 4 months (126±4 days), 6 months (178±4 days), and 8 months of age (234±8 days). Participants were contacted by mail based on birth records; before sending out the letters these were also checked by the health authorities to ensure that only families with healthy infants were contacted. The study was approved by the ethics committee at the Research Council in the Humanities and Social Sciences and therefore in accordance with the ethical standards specified in the 1964 Declaration of Helsinki. Before each session participating families were informed about the purpose of the study and signed a consent form. As compensation for their participation each family received either two movie tickets or eight bus tickets with a total value of ~20 €.

Apparatus

Infants' gaze was measured with an infrared corneal reflection technique (ASL, Bedford, MA, USA) in combination with a magnetic head tracker (Flock of Birds, Ascension, Burlington, VT, USA). The ASL 504 calculates gaze from the reflection of near infrared light in the pupil and on the cornea (sampling frequency 60 Hz, precision 0.5°, accuracy <1°). Head position was calculated from changes in an electromagnetic field generated by a magnet and detected by a small sensor


(miniBird, 18×8×8 mm) attached to the Flock of Birds control unit with a cable. The ASL 504 camera moved with two degrees of freedom (azimuth and elevation) and was accessible via remote control and keyboard. During slow translations of the head and torso an auto correction function rotated the camera to the center of the eye. During fast translations and prolonged blinking, the ASL 504 utilized information from the Flock of Birds to relocate the eye.

Infants were presented with a 3D ‘happy face’ that moved over a blue colored screen. Its motion was controlled by two servomotors (one for horizontal and one for vertical motion). The motors controlled the target using two magnets, one attached to the engines on the back of the screen and the other to the target at the front of the screen. Both magnets were covered in cloth to reduce friction and ensure a minimum of movement related noise. A similar device has previously been used to produce moving objects on a screen (Hespos et al. 2002; von Hofsten and Spelke 1985; von Hofsten et al. 1998). The steel sheet covered an area of 1 m2, of which 0.42 m2 could be used to produce stimulus motions. This presentation device was located 128 cm in front of the experimental booth. The side panels between the screen and the experimental booth were covered with thick black fabric to reduce external light sources.

Infants were seated inside a semi-enclosed experimental booth (106×122×204 cm). The stimulus was visible through an opening (30×30 cm) in one of the short walls of this booth; its lower edge was 100 cm from the floor. A shelf below the opening (82 cm from the floor) held the ASL 504 camera unit. Outside the experimental booth a loudspeaker was located on each side of the opening. On the backside of the booth a Plexiglas frame held the Flock of Birds magnet. The miniBird sensor cable entered the booth through an opening in the same wall below the Flock of Birds magnet. Light-blocking curtains covered the entrance to the experimental booth to eliminate outside light from entering the booth.

Visual stimuli

A combination of a bell, a bright colored face, and a red light attached to a rod was used to attract the infants' attention during calibration. This stimulus was moved back and forth between the upper left and lower right corners of the screen during calibration. To test calibration quality this target moved around the screen, stopping at each of its four corners.

The experimental stimulus consisted of a head (yellow), a nose (blue), a painted mouth (red), and two glowing eyes (red LEDs) under black eyebrows. Underneath the target and above the magnet was a paper circle (yellow and black) that effectively doubled the size of the target (radius = 1 visual degree). The current setup only used the lower half of the display screen (35×65 cm).

The target moved with one of two velocities (4.5 and 9°/s) starting from the upper right or the lower left corner of the screen. The initial trajectory was always diagonal and directed towards the center of the stimulus presentation area. From its starting position the target moved on the same diagonal trajectory for either 8.9 or 12.4 visual degrees; thereafter the target turned to a straight vertical or horizontal trajectory. Trajectories starting at the upper right corner of the screen shifted to either a linear upward or rightward motion. Stimuli starting in the lower left corner turned to either a leftward or downward motion. The distance between the turning point and the end location of the trajectory was 4.6, 6.4, 7.6, or 10.6 visual degrees (see Fig. 1).

After each trial the target moved to the new starting location of the next presentation. The next trial did not start until the infant attentively focused on the target. All in all, the target moved on 16 unique trajectories (2 starting locations × 2 velocities × 2 turning locations × 2 post-turning trajectories; horizontal and vertical) that were presented to each infant only once.
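
For clarity, the 2 × 2 × 2 × 2 design can be written out explicitly; the short sketch below is only an illustration of the counting, not software used in the study:

    from itertools import product

    starts = ["upper right", "lower left"]        # starting corner of the screen
    velocities = [4.5, 9.0]                       # degrees of visual angle per second
    turn_points = [8.9, 12.4]                     # diagonal distance travelled before the turn
    post_turn = ["horizontal", "vertical"]        # direction of motion after the turn

    trajectories = list(product(starts, velocities, turn_points, post_turn))
    assert len(trajectories) == 16                # each infant saw every combination once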

Procedure

Upon arrival at the lab parents were briefed about the purpose of the study. They were provided with a basic description of how the eye and head trackers worked, after which a consent form was signed. One of the parents was seated inside the experimental booth, facing the stimuli presentation device. Infants were fastened in an infant car safety-seat that was placed on the parent's lap facing the same direction. During the entire experimental procedure parents were encouraged to talk and sing to their infants. As a whole this procedure provided a safe and reassuring atmosphere for the infants with both parental closeness and an unobstructed visual field. Once the infant and parent were comfortably seated infants were dressed in a baby cap that held the miniBird. This head tracker was placed above the infants' right eye.

After this initial procedure, a number of steps were taken before the actual stimuli presentations could begin. The distance between the miniBird and the infant's eye as well as thresholds for cornea and pupil detection had to be set. This was followed by a two-point calibration procedure (upper left and lower right corner of the screen) and a four-point calibration test (each of the four corners of the screen). If the eye tracker did not locate the fixation inside the extended calibration stimulus at each of the four corners of the screen the entire calibration procedure was redone.

This entire procedure rarely took more than a minute. With the exception of the calibration procedure, all above-mentioned preparations were accompanied by a real-life puppet show, intended to focus the infants' attention forward. For a more thorough description of the general methods used, see Gredebäck and von Hofsten (2004).


During the stimuli presentations each of the 16 trajectories was presented to the infants in a randomized order. Lack of attention to any stimulus resulted in an additional presentation of that stimulus after the entire series. Each presentation lasted less than 5 s and all presentations together rarely took more than 10 min.

Data analysis

Saccades were identified as eye movements with velocities of 30°/s or more. They were separately analyzed for vertical and horizontal dimensions. SRTs were based on the onset of the first saccade in the direction of the target's new trajectory following its directional shift (in either the vertical or horizontal direction) with an amplitude extending at least 1°.

The inclusion criteria limited the analyzed dataset to those trials in which target-related smooth pursuit both preceded and succeeded the saccade. By this criterion we reduced the number of non-target related saccades and reduced noise.

Only horizontal components were analyzed for horizontal stimuli and vertical components for vertical stimuli. A multiple regression analysis of SRTs was performed with age (4, 6, and 8 months of age), velocity (4.5 and 9°/s), and dimension (horizontal and vertical saccades) as regressors. In the results, the mean SRT is reported, not the median. This is based on recommendations by Miller (1988) who argues against the median as a central measurement for small positively skewed samples, particularly if groups differ in size.
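
The stated criteria (velocity of at least 30°/s, amplitude of at least 1°, first saccade after the turn) translate into a simple detection rule. The sketch below is only a rough illustration of that rule applied to a 60 Hz gaze trace along the relevant dimension; it is not the analysis code used in the study, and the onset definition is simplified:

    import numpy as np

    def saccadic_reaction_time(gaze_deg, turn_index, fs=60.0, vel_thresh=30.0, min_amp=1.0):
        """Estimate SRT (ms) from a 1-D gaze trace, in degrees, sampled at fs Hz.

        gaze_deg   : gaze position along the dimension of the target's new trajectory
        turn_index : sample at which the target changed direction
        """
        velocity = np.gradient(gaze_deg) * fs              # deg/s
        fast = np.abs(velocity) >= vel_thresh              # candidate saccade samples
        i = turn_index
        while i < len(gaze_deg) - 1:
            if fast[i]:
                j = i
                while j < len(gaze_deg) - 1 and fast[j]:   # extent of this saccade
                    j += 1
                if abs(gaze_deg[j] - gaze_deg[i]) >= min_amp:
                    return (i - turn_index) / fs * 1000.0  # onset latency in ms
                i = j
            i += 1
        return None                                        # no qualifying saccade found

    # Synthetic example: gaze jumps 5 degrees 30 samples (500 ms) after the target turn.
    trace = np.zeros(180)
    trace[90:] += 5.0
    print(saccadic_reaction_time(trace, turn_index=60))    # about 483 ms (onset of the detected jump)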

Results

One infant was excluded from the analysis due to lack of attention at both 4 and 6 months of age. Among the remaining 15 participants the independent variables (age, velocity, and dimension) were significantly associated with SRT, adjusted r² = 0.21, F(4,142) = 13.87, P < 0.00001. All regressors made significant individual contributions; SRTs decreased from 595 ms at 4 months of age (SE=30; n=76) to 485 ms at 6 months of age (SE=23; n=105) and 442 ms at 8 months of age (SE=13; n=136), β = -0.27, t(142) = 12.4, P < 0.00001.

In addition, both higher target velocities, β = -0.31, t(142) = -4.3, P < 0.00005, and vertical saccades, β = -0.22, t(142) = -3.0, P < 0.005, are related to lower saccadic latencies. These effects are visible in Fig. 2a, b. Figure 2c displays individual SRTs at each age. The correlation between SRTs at 4 and 6 months equaled 0.51; this correlation decreased to 0.22 when comparing 4- and 8-month-old infants and to -0.15 when comparing 6- and 8-month-old infants.

Discussion

When infants track a moving target that abruptly changes its direction of motion they produce a corrective saccade in order to fixate the target and continue smooth pursuit tracking. The current paper describes how the ability to reorient gaze in this way develops between 4 and 8 months of age, with a focus on saccadic latency. All in all, SRTs decreased as infants got older. In addition, both the velocity of the target and whether infants performed a horizontal or vertical saccade influenced SRT. Each of these effects will be discussed separately below, starting with age differences.

Age differences

Tracking studies have reported many changes in saccade usage as infants grow older. Some studies have reported a decrease in frequency and amplitude of saccades during the first few months of life (von Hofsten and Rosander 1997; Rosander and von Hofsten 2002). Others have reported the opposite results (Gredebäck et al. 2005; Richards and Holley 1999). It is difficult to relate these diverse findings to the current study. It is perhaps more fruitful to relate the current age effects to those found by Canfield et al. (1997). In their study SRTs decreased from 400 ms at 2 months to 285 ms at 12 months. Both of these values are lower than the average performance reported above (595 ms at 4 months and 442 ms at 8 months).

Fig. 1 Trajectories presented to the infants; 1–8 mark the endpoints of the motion and s the starting points of the motion

It is possible that the reported differences can be attributed to differences in the predictability of the targets. The current study presented infants with 16 novel trajectories without repetition whereas Canfield et al. presented pictures that alternated between two locations (L–R). During highly predictable presentations the location of the next picture could be determined from the previous locations (L–R). Even during irregular sequences any random saccade would have had a 0.5 success rate, because there were only two stimulus locations.

In fact, a number of studies have reported that repetitions decrease SRT. Richards (2000) demonstrated that attention to locations other than those fixated (as measured by saccade response facilitation) emerges between 3 and 6 months of age. In a study by Johnson et al. (1994) SRTs decreased with experience in 4-month-old infants. Their interpretation was based on a facilitation of covertly attended locations, places that held cues for future target appearances. Wentworth and Haith (1992) demonstrated that a picture, which strongly predicted the location of the next picture, was followed by a higher rate of anticipation and speedier reactions than demonstrated at baseline in 2- and 3-month-old infants. Likewise, Canfield and Haith (1991) found similar effects; they reported that infants at 3 months of age decrease their SRT over presentations.

This difference between repetitive and non-repetitive stimuli provides a good illustration of infants' predictive abilities and how important it is for researchers to relate their findings to infants' expectations and the development of predictive mechanisms.

Horizontal and vertical saccade latencies

The current study reports that infants' SRTs are lower for vertical than horizontal saccades. Richards and Holley (1999) reported more horizontal than vertical saccades but used a non-uniform stimulus that could account for differences between vertical and horizontal dimensions. Few other studies have reported on vertical and horizontal tracking in infancy. Gredebäck et al. (2003) measured gaze in response to circular trajectories. Nine-month-old infants displayed a higher gain (with higher variance) and larger lags with vertical gaze. Gredebäck et al. (2005) reported similar effects. In this study, infants also produced more vertical than horizontal saccades. These saccades were responsible for 41% of gain in vertical components of a circular trajectory, compared to 21% for horizontal components; see Collewijn and Tamminga (1984) for similar effects in adults.

The argument has been brought forward that differences in the timing of both gaze and smooth pursuit tracking are due to differences in experience. According to this argument, (1) horizontal motion is more common than vertical, (2) infants produce more smooth horizontal tracking as a consequence thereof, and (3) this increased experience gives rise to enhanced performance (Gredebäck et al. 2003; Gredebäck and von Hofsten 2004).

The same argument could possibly be applied to a higher reliance on saccades during vertical tracking. (1) We know that an infant's smooth vertical tracking is inferior to smooth horizontal tracking, (2) this results in a high dependency on vertical saccades (Gredebäck et al. 2005). (3) This increased experience might give rise to enhanced performance, as defined by facilitated SRTs.

Fig. 2 Saccadic latency at each age (4, 6, and 8 months) divided into (a) horizontal (closed circles) and vertical (open squares) corrective saccades, (b) target velocity (4.5°/s = closed circles and 9°/s = open squares), and (c) individual data (average saccadic latency at each age)

If this (admittedly speculative) argument is valid then the current results provide a nice illustration of the importance of experience to ocular performance. It also highlights the strong functional dependency that each component of gaze tracking (in this case smooth pursuit and saccades) has on the development and execution of other components.

Velocity

It is difficult to discuss velocity-related effects since numerous variables by definition co-vary with different velocities. It is highly likely that there exists an optimal tracking velocity and that performance decreases on both sides of this point (curvilinear effect). Finding shorter latencies at 9°/s (compared to 4.5°/s) could indicate that the higher velocity is closer to this optimal velocity.

On the other hand, it is equally likely that this difference is due to changes in the attractiveness or salience of the stimuli. In fact, Hainline et al. (1984) demonstrated that the attentional value of presented stimuli has a clear effect on saccade dynamics. Regardless of which of these accounts produced the high-velocity proficiency found in the current paper, we can conclude that it is essential to include multiple velocities in studies of infants' saccadic latencies.

References

Atkinson J (2000) The developing visual brain. Oxford Medical Publication, Oxford

Aslin R (1986) Anatomical constraints on oculomotor development: implications for infant perception. In: Yonas A (ed) The Minnesota symposium on child psychology, Hillsdale NJ

Aslin RN, Salapatek P (1975) Saccadic localization of visual targets by the very young human infant. Percept Psychophys 17(3):293–302

Bronson GW (1982) The scanning patterns of human infants: implication for visual learning. Ablex Publishing, Norwood NJ

Canfield RL, Haith MM (1991) Young infants' visual expectations for symmetric and asymmetric stimulus sequences. Dev Psychol 27(2):198–208

Canfield RL, Smith EG, Brezsnyak MP, Snow KL (1997) Information processing through the first year of life: a longitudinal study using the visual expectation paradigm. Monogr Soc Res Child Dev 62(2):1–145

Collewijn H, Tamminga EP (1984) Human smooth and saccadic eye movements during voluntary pursuit of different target motions on different backgrounds. J Physiol 351:217–250

Engel KC, Anderson JH, Soechting JF (1999) Oculomotor tracking in two dimensions. J Neurophysiol 81(4):1597–1602

Gredebäck G, von Hofsten C (2004) Infants' evolving representation of moving objects between 6 and 12 months of age. Infancy 6(2):165–184

Gredebäck G, von Hofsten C, Boudreau JP (2003) Infants' visual tracking of continuous circular motion under conditions of occlusion and non-occlusion. Infant Behav Dev 25:161–182

Gredebäck G, von Hofsten C, Karlsson J, Aus K (2005) The development of two-dimensional tracking: a longitudinal study of circular pursuit. Exp Brain Res 163(2):204–213

Hainline L, Turkel J, Abramov I, Lemerise E, Harris CM (1984) Characteristics of saccades in human infants. Vis Res 24(12):1771–1780

Hespos SJ, von Hofsten C, Spelke ES, Gredebäck G (2002) Object representation and predictive reaching: evidence for continuity from infants and adults. Poster presented at ICIS, Toronto

von Hofsten C (2004) An action perspective on motor development. Trends Cogn Sci 8:266–272

von Hofsten C, Rosander K (1996) The development of gaze control and predictive tracking in young infants. Vis Res 36:81–96

von Hofsten C, Rosander K (1997) Development of smooth pursuit tracking in young infants. Vis Res 37:1799–1810

von Hofsten C, Spelke ES (1985) Object perception and object-directed reaching in infancy. J Exp Psychol Gen 114:198–212

von Hofsten C, Vishton P, Spelke ES, Feng Q, Rosander K (1998) Predictive action in infancy: tracking and reaching for moving objects. Cognition 67:255–285

Johnson MH, Posner MI, Rothbart MK (1994) Facilitation of saccades towards a covertly attended location in early infancy. Psychol Sci 5(2):90–93

Leigh RJ, Zee DS (1999) The neurology of eye movements, 3rd edn. Oxford University Press, New York

Miller J (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform 14(3):539–543

Phillips JO, Finocchio DV, Ong L, Fuchs AF (1997) Smooth pursuit in 1–4-month-old human infants. Vis Res 37:3009–3020

Rensink RA, Chawarska K, Betts S (2000) The development of visual expectations in the first year. Child Dev 71(5):1191–1204

Richards JE (2000) Localizing the development of covert attention in infants with scalp event-related potentials. Dev Psychol 36(1):91–108

Rosander K, von Hofsten C (2002) Development of gaze tracking of small and large objects. Exp Brain Res 146:257–264

Richards JE, Holley FB (1999) Infants' attention and the development of smooth pursuit tracking. Dev Psychol 35:856–867

Wentworth N, Haith MM (1992) Event-specific expectations of 2- and 3-month-old infants. Dev Psychol 28(5):842–850

Yang Q, Bucci MP, Kapoula Z (2002) The latency of saccades, vergence, and combined eye movements in children and adults. Invest Ophthalmol Vis Sci 43:2939–2949


Directional Hearing in a Humanoid Robot - Evaluation of Microphones Regarding HRTF and Azimuthal Dependence

Lisa Gustavsson, Ellen Marklund, Eeva Klintfors and Francisco Lacerda
Department of Linguistics/Phonetics, Stockholm University
{lisag|ellen|eevak|frasse}@ling.su.se

Abstract As a first step of implementing directional hearing in a humanoid robot two types of microphones were evaluated regarding HRTF (head related transfer function) and azimuthal dependence. The sound level difference between a signal from the right ear and the left ear is one of the cues humans use to localize a sound source. In the same way this process could be applied in robotics where the sound level difference between a signal from the right microphone and the left microphone is calculated for orienting towards a sound source. The microphones were attached as ears on the robot-head and tested regarding frequency response with logarithmic sweep-tones at azimuth angles in 45º increments around the head. The directional type of microphone was more sensitive to azimuth and head shadow and probably more suitable for directional hearing in the robot.

1 Introduction

As part of the CONTACT project1 a microphone evaluation regarding head related transfer function (HRTF) and azimuthal2 dependence was carried out as a first step in implementing directional hearing in a humanoid robot (see Figure 1). Sound pressure level at the robot ears (microphones) as a function of frequency and azimuth in the horizontal plane was studied.

The hearing system in humans has many features that together enable fairly good spatial perception of sound, such as timing differences between left and right ear in the arrival of a signal (interaural time difference), the cavities of the pinnae that enhance certain frequencies depending on direction and the neural processing of these two perceived signals (Pickles, 1988). The shape of the outer ears is indeed of great importance in localization of a sound source, but as a first step of implementing directional hearing in a robot, we want to start by investigating the effect of a spherical head shape between the two microphones and the angle in relation to the sound source. So this study was done with reference to the interaural level difference (ILD)3 between two ears (microphones, no outer ears) in the sound signal that is caused by the distance between the ears and HRTF or head shadowing effects (Gelfand, 1998). This means that the ear furthest away from the sound source will to some extent be blocked by the head in such a way that the shorter wavelengths (higher frequencies) are reflected by the head (Feddersen et al., 1957). Such frequency-dependent differences in intensity associated with different sound source locations will be used as an indication to the robot to turn its head in the horizontal plane. The principle here is to make the robot look in the direction that minimizes the ILD4. Two types of microphones, mounted on the robot head, were tested regarding frequency response at azimuth angles in 45º increments from the sound source (Shaw & Vaillancourt, 1985; Shaw, 1974).

1 "Learning and development of Contextual Action", European Union NEST project 5010
2 Azimuth = angles around the head
3 The abbreviation IID can also be found in the literature and stands for Interaural Intensity Difference.
4 This is done using a perturbation technique. The robot's head orientation is incrementally changed in order to detect the direction associated with a minimum of ILD.
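
To illustrate the principle (a sketch only, with an assumed sample rate, cut-off frequency and gain; it is not the implementation used on the robot), the ILD can be computed from the high-frequency energy of the two microphone signals and plugged into the perturbation scheme described in footnote 4:

    import numpy as np

    def ild_db(left, right, fs=16000, highpass_hz=1500):
        """Interaural level difference (right minus left) in dB, using only the
        higher frequencies, where the head shadow has the largest effect."""
        def band_rms(x):
            spec = np.fft.rfft(x)
            freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
            spec[freqs < highpass_hz] = 0.0                 # discard low frequencies
            y = np.fft.irfft(spec, n=len(x))
            return np.sqrt(np.mean(y ** 2)) + 1e-12
        return 20.0 * np.log10(band_rms(right) / band_rms(left))

    def orient_step(head_azimuth_deg, left, right, gain_deg_per_db=2.0):
        """One perturbation step: turn the head toward the louder side, in proportion
        to the measured ILD, so that repeated steps drive the ILD toward zero."""
        return head_azimuth_deg + gain_deg_per_db * ild_db(left, right)

    # Check: a 4 kHz tone that is twice as strong on the right gives an ILD of about 6 dB.
    fs = 16000
    t = np.arange(fs // 10) / fs
    tone = np.sin(2 * np.pi * 4000 * t)
    print(round(ild_db(0.5 * tone, tone, fs), 1))           # 6.0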

The study reported in this paper was carried out by the CONTACT vision group (Computer Vision and Robotics Lab, IST Lisbon) and the CONTACT speech group (Phonetics Lab, Stockholm University) assisted by Hassan Djamshidpey and Peter Branderud. The tests were performed in the anechoic chamber at Stockholm University in December 2005.

2 Method

The microphones evaluated in this study were wired Lavalier microphones of the Microflex MX100 model by Shure. These microphones were chosen because they are small electret condenser microphones designed for speech and vocal pickup. The two types tested were omni-directional (360º) and directional (cardioid 130º). The frequency response is 50 to 17000 Hz and the max SPL is 116 dB (omni-directional) and 124 dB (directional), with an s/n ratio of 73 dB (omni-directional) and 66 dB (directional). The robotic head was developed at the Computer Vision and Robotics Lab, IST Lisbon (Beira et al., 2006).

Figure 1. Robot head.

2.1 Setup

The experimental setup is illustrated in Figure 2a. The robot-head is attached to a couple of ball bearing arms (imagined to correspond to a neck) on a box (containing the motor for driving head and neck movements). The microphones were attached and tilted by about 30 degrees towards the front, with play-dough in the holes made in the skull for the future ears of the robot. The wires run through the head and out to the external amplifier. The sound source was a Brüel&Kjær 4215, Artificial Voice Loud Speaker, located 90 cm away from the head in the horizontal plane (Figures 2a and 2b). A reference microphone was connected to the loudspeaker for audio level compression (300 dB/sec).

Figure 2a and 2b. a) Wiring diagram of experimental setup (left). b) Azimuth angles in relation to robot head and loudspeaker (right).

2.2 Test

Sweep-tones were presented through the loudspeaker at azimuth angles in 45º increments obtained by turning the robot-head (Figure 2b). The frequency range of the tone was between 20 Hz5 and 20 kHz with a logarithmic sweep control and writing speed of 160 mm/sec (approximate averaging time 0.03 sec). The signal response of the microphones was registered and printed in dB/Hz diagrams (using Brüel&Kjær 2307, Printer/Level Recorder) and a back-up recording was made on a hard-drive. The dB values as a function of frequency were also plotted in Excel diagrams for a better overview of superimposed curves of different azimuths (and for presentation in this paper).

5 Because the compression was unstable up until about 200 Hz, the data below 200 Hz will not be reported here. Furthermore, the lower frequencies are not affected that much in terms of ILD.

3 Results

The best overall frequency response of both microphones was at angles 0º, -45º and -90º, that is, where the (right) microphone is to some extent directed towards the sound source. The sound level decreases as the microphone is turned away from the direction of the sound source. The omni-directional microphones have an overall more robust frequency response than the directional microphones. As expected, the difference in sound level between the azimuth angles is most pronounced at higher frequencies, since the head as a blockage will have a larger impact on shorter wavelengths than on longer wavelengths. An example of ILD for the directional microphones is shown in Figure 3, where sound level as a function of frequency is plotted for the ear near the sound source and the ear furthest away from the sound source at azimuth 45º, 90º and 135º. While the difference ideally should be zero6 at azimuth 0 it is well above 15 dB at many higher frequencies at azimuth 45º, 90º and 135º.


Figure 3. Signal response of the directional microphones. Sound level as a function of frequency and azimuth 45º, 90º and 135º for the ear near the sound source and the ear furthest away from the sound source.

4 Discussion

In line with the technical description of the microphones, our results show that the directional microphones are more sensitive to azimuth than the omnidirectional microphones and will probably make the implementation of sound source localization easier. Also, disturbing sounds from motors and fans inside the robot's head might be picked up more easily by an omnidirectional microphone. A directional type of microphone would therefore be our choice of ears for the robot. However, decisions like this are not made without some hesitation since we do not want to manipulate the signal response in the robot hearing mechanism beyond what we find motivated in terms of the human physiology of hearing. Deciding upon what kind of pickup angle the microphones should have forces us to consider what implications a narrow versus a wide pickup angle will have in further implementations of the robotic hearing. At this moment we see no problems with a narrow angle but if problems arise we can of course switch to wide angle cartridges.

The reasoning in this study holds for locating a sound source only to a certain extent. By calculating the ILD the robot will be able to orient towards a sound source in the frontal horizontal plane. But if the sound source is located straight behind the robot the ILD would also equal zero, and according to the robot's calculations it is then facing the sound source. Such front-back errors are in fact seen also in humans, since there are no physiological attributes of the ear that in a straightforward manner differentiate signals from the front and rear. Many animals have the ability to localize a sound source by wiggling their ears; humans can instead move themselves or the head to explore the sound source direction (Wightman & Kistler, 1999). As mentioned earlier the outer ear is, however, of great importance for locating a sound source: the shape of the pinnae does enhance sound from the front in certain ways, but it takes practice to make use of such cues. In the same way the shape of the pinnae can be of importance for locating sound sources in the median plane (Gardner & Gardner, 1973; Musicant & Butler, 1984). Subtle movements of the head, experience of sound reflections in different acoustic settings and learning how to use pinna-related cues are some solutions to the front-back-up-down ambiguity that could be adopted also by the robot. We should not forget, though, that humans always use multiple sources of information for on-line problem solving, and this is most probably the case also when locating sound sources. When we hear a sound there is usually an event or an object that caused that sound, a sound source that we easily could spot with our eyes. So the next question we need to ask is how important vision is in localizing sound sources, or in the process of learning how to trace sound sources with our ears, and how vision can be used in the implementation of directional hearing in the robot.

6 A zero difference in sound level at all frequencies between the two ears requires that the physical surroundings at both sides of the head are absolutely equal.

5 Concluding remarks

Directional hearing is only one of the many aspects of human information processing that we have to consider when mimicking human behaviour in an embodied robot system. In this paper we have discussed how the head has an impact on the intensity of signals at different frequencies and how this principle can be used also for sound source localization in robotics. The signal responses of two types of microphones were tested regarding HRTF at different azimuths as a first step of implementing directional hearing in a humanoid robot. The next steps are designing outer ears and formalizing the processes of directional hearing for implementation and on-line evaluations (Hörnstein et al., 2006).

References

Beira, R., Lopes, M., Miguel, C., Santos-Victor, J., Bernardino, A., Metta, G. et al. (in press). Design of the robot-cub (iCub) head. IEEE ICRA.

Feddersen, W. E., Sandel, T. T., Teas, D. C., & Jeffress, L. A. (1957). Localization of High-Frequency Tones. The Journal of the Acoustical Society of America, 29, 988-991.

Gardner, M. B. & Gardner, R. S. (1973). Problem of localization in the median plane: effect of pinnae cavity occlusion. The Journal of the Acoustical Society of America, 53, 400-408.

Gelfand, S. (1998). An introduction to psychological and physiological acoustics. New York: Marcel Dekker, Inc.

Hörnstein, J., Lopes, M., & Santos-Victor, J. (2006). Sound localization for humanoid robots - building audio-motor maps based on the HRTF. CONTACT project report.

Musicant, A. D. & Butler, R. A. (1984). The influence of pinnae-based spectral cues on sound localization. The Journal of the Acoustical Society of America, 75, 1195-1200.

Pickles, J. O. (1988). An Introduction to the Physiology of Hearing (Second ed.). London: Academic Press.

Shaw, E. A. G. (1974). Transformation of sound pressure level from the free field to the eardrum in the horizontal plane. The Journal of the Acoustical Society of America, 56.

Shaw, E. A. G. & Vaillancourt, M. M. (1985). Transformation of sound-pressure level from the free field to the eardrum presented in numerical form. The Journal of the Acoustical Society of America, 78, 1120-1123.

Wightman, F. L. & Kistler, D. J. (1999). Resolution of front-back ambiguity in spatial hearing by listener and source movement. The Journal of the Acoustical Society of America, 105, 2841-2853.

Sound localization for humanoid robots - building audio-motor maps based on the HRTF

Jonas Hörnstein, Manuel Lopes, José Santos-Victor
Instituto de Sistemas e Robótica
Instituto Superior Técnico
Lisboa, Portugal
Email: {jhornstein,macl,jasv}@isr.ist.utl.pt

Francisco Lacerda
Department of Linguistics
Stockholm University
Stockholm, Sweden
Email: [email protected]

Abstract— Being able to localize the origin of a sound is important for our capability to interact with the environment. Humans can localize a sound source in both the horizontal and vertical plane with only two ears, using the head related transfer function HRTF, or more specifically features like interaural time difference ITD, interaural level difference ILD, and notches in the frequency spectra. In robotics notches have been left out since they are considered complex and difficult to use. As they are the main cue for humans' ability to estimate the elevation of the sound source, this has to be compensated for by adding more microphones or very large and asymmetric ears. In this paper, we present a novel method to extract the notches that makes it possible to accurately estimate the location of a sound source in both the horizontal and vertical plane using only two microphones and human-like ears. We suggest the use of simple spiral-shaped ears that have similar properties to the human ears and make it easy to calculate the position of the notches. Finally we show how the robot can learn its HRTF and build audio-motor maps using supervised learning, and how it can automatically update its map using vision and compensate for changes in the HRTF due to changes to the ears or the environment.

I. INTRODUCTION

Sound plays an important role in directing humans' attention to events in their ecological setting. The human ability to locate sound sources in potentially dangerous situations, like an approaching car, or locating and paying attention to a speaker in social interaction settings, is a very important component of human behaviour. In designing a humanoid robot that is expected to mimic human behaviour, the implementation of a human-like sound location capability as a source of integrated information is therefore an important goal.

Humans are able to locate sound sources in both the horizontal and vertical plane by exploring acoustic information conveyed by the auditory system, but in a robot that uses two simple microphones as ears there is not enough information to do the same. Typically the robot would be able to calculate or learn the position of the sound source in the plane of the microphones, i.e. the azimuth, which usually corresponds to the horizontal plane. This can be done by calculating the difference in time between the signal reaching the left and the right microphone respectively. This is called the interaural time difference (ITD), or the interaural phase difference (IPD) if we have a continuous sound signal and calculate the phase difference of the signal from the two

microphones by cross-correlation. However, the ITD/IPD does not give any information about the elevation of the sound source. Furthermore it cannot tell whether a sound comes from the front or the back of the head. In robotics this is usually solved by adding more microphones. The SIG robot [1] [2] has four microphones even though two are mainly used to filter the sound caused by the motors and the tracking is mainly done in the horizontal plane. In [3] eight microphones are used, and in [4][5] a whole array of microphones is used to estimate the location of the sound.

While adding more microphones simplifies the task of sound localization, humans and other animals manage to localize the sound with only two ears. This comes from the fact that the form of our head and ears changes the sound as a function of the location of the sound source, a phenomenon known as the head related transfer function (HRTF). The HRTF describes how the free field sound is changed before it hits the eardrum, and is a function H(f, θ, φ) of the frequency, f, the horizontal angle, θ, and the vertical angle, φ, between the ears and the sound source. The IPD is one important part of the HRTF. Another important part is that the level of the sound is higher when the sound is directed straight into the ear compared to sound coming from the sides or behind. Many animals, like cats, have the possibility to turn their ears around in order to get a better estimate of the localization of the sound source. Even without turning the ears, it is possible to estimate the location of the sound by calculating the difference in level intensity between the two ears. This is referred to as the interaural level difference (ILD). However, if the ears are positioned on each side of the head as for humans, ILD will mainly give us information about which side of the head the sound source is located on, i.e. information about the azimuth which we already have from the ITD/IPD. In order to get new information from the ILD we have to create an asymmetry in the vertical plane rather than in the horizontal. This can be done by putting the ears on top of the head and letting one ear be pointing up while the other is pointing forwards as done in [6]. The problem with this approach is that a big asymmetry is needed to get an acceptable precision, and the ILD of human-like ears does not give sufficient information about the elevation of the sound source.
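
As a minimal sketch of the ITD cue just described (illustrative only; the 0.14 m microphone spacing matches the head diameter given in Section 2, while the sample rate and the simple far-field model are assumptions made for the example), the azimuth can be estimated from the lag that maximizes the cross-correlation of the two microphone signals:

    import numpy as np

    def itd_to_azimuth(left, right, fs=44100, mic_distance=0.14, c=343.0):
        """Estimate azimuth (degrees) from the interaural time difference.

        The lag that maximizes the cross-correlation between the two microphone
        signals gives the ITD; a far-field model then gives ITD = (d / c) * sin(azimuth)."""
        corr = np.correlate(left, right, mode="full")
        lag = np.argmax(corr) - (len(right) - 1)            # positive lag: the right microphone leads
        itd = lag / fs
        s = np.clip(itd * c / mic_distance, -1.0, 1.0)
        return float(np.degrees(np.arcsin(s)))

    # Synthetic check: delay the left channel by the ITD expected for a source 30 degrees
    # to the right (about 9 samples at 44.1 kHz for a 14 cm microphone spacing).
    fs, d, c = 44100, 0.14, 343.0
    delay = int(round(d / c * np.sin(np.radians(30.0)) * fs))
    sig = np.random.randn(4096)
    left = np.concatenate([np.zeros(delay), sig])
    right = np.concatenate([sig, np.zeros(delay)])
    print(round(itd_to_azimuth(left, right, fs, d, c), 1))  # close to 30.0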

For humans it has been found that the main cue for


estimating the elevation of the sound source comes fromresonances and cancellation (notches) of certain frequenciesdue to the pinna and concha of the ear. This phenomenonhas been quite well studied in humans both in neuroscienceand in the field of audio reproduction for creating 3D-stereosound [7][8][9][10][11][12][13], but has often been left out inrobotics due to the complex nature of the frequency responseand the difficulty to extract the notches. In this paper wesuggest a simple and effective way of extracting the notchesfrom the frequency response and we show how a robot canuse information about ITD/IPD, ILD, and notches in order toaccurately estimate the location of the sound source in bothvertical and horizontal space.

Knowing the form of the head and ears it is possible to calculate the relationship between the features (ITD, ILD, and the frequencies of the notches) and the position of the sound source, or even estimate the complete HRTF. However, here we are only interested in the relationship between the features and the position. Alternatively we can obtain the relationship by measuring the value of the features for some known positions of the sound source and letting the robot learn the maps. Since the HRTF changes if there are some changes to the ears or microphones, or if some object, for example a hat, is put close to the ears, it is important to be able to update the maps. Indeed, although human ears undergo big changes from birth to adulthood, humans are capable of adapting their auditory maps to compensate for the acoustic consequences of the anatomical changes. It has been shown that vision is an important cue for updating the maps [14], and it can also be used as a means for the robot to update its maps [6].

In the rest of this paper, Section 2 shows how we can design a simple head and ears that give an HRTF similar to that of the human head and ears. In Section 3 we discuss how to extract features such as ITD/IPD, ILD, and notches from the signals provided by the ears. In Section 4 we show how the robot can use the features to learn its audio-motor map. In Section 5 we show some experimental results. Conclusions and directions for future work are given in Section 6.

II. DESIGN OF HEAD AND EARS

In this section we describe the design of a robot's head and ears with a human-like HRTF, and hence with acoustic properties for the ITD, ILD, and frequency notches similar to those observed in humans. The HRTF depends on the form of both the head and the ears. The ITD/IPD depends on the distance between the ears, and the ILD is primarily dependent on the form of the head and to a lesser extent also on the form of the ears, while the peaks and notches in the frequency response are mainly related to the form of the ears.

For the sake of calculating the HRTF, a human head can be modeled by a spheroid [15][16]. The head used in this work is the iCub head, which is close enough to a human head to expect the same acoustic properties. The detailed design of the head is described in [17], but here we can simply consider it a sphere with a diameter of 14 cm.

The ears are more complex. Each of the robot's ears was built from a microphone placed on the surface of the head and a reflector simulating the pinna/concha, as will be described in detail below. The shape of human ears differs substantially between individuals, but a database with HRTFs for different individuals [18] provides some general information on the frequency spectra created for various positions of the sound source by human ears. Obviously, one way to create ears for a humanoid robot would be simply to copy the shape of a pair of human ears. That way we can be sure that they will have similar properties. However, we want to find a shape that is easier to model and produce while preserving the main acoustic characteristics of the human ear. The most important property of the pinna/concha, for the purpose of locating the sound source, is to give different frequency responses for different elevation angles. We will be looking for notches in the frequency spectra, created by interference between the incident waves, reaching the microphone directly, and their reflections off the artificial concha, and we want the notches to be produced at different frequencies for different elevations. A notch is created when a quarter of the wavelength of the sound, λ, (plus any multiple of λ/2) is equal to the distance, d, between the concha and the microphone:

n·λ/2 + λ/4 = d    (n = 0, 1, 2, ...)

For these wavelengths, the sound wave that reaches the microphone directly is cancelled by the wave reflected by the concha. Hence the frequency spectrum will have notches at the corresponding frequencies:

f = c/λ = (2·n + 1)·c / (4·d)    (1)

c = speed of sound ≈ 340 m/s    (2)
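
As a quick sanity check of Eq. (1), the short Python sketch below (illustrative code, not from the paper) computes the expected notch frequencies for a given concha-to-microphone distance; with d ≈ 3 cm, roughly the frontal distance of the spiral-shaped ear described next, it gives a first notch near 2800 Hz.

```python
import numpy as np

C_SOUND = 340.0  # speed of sound (m/s), as in Eq. (2)

def notch_frequencies(d, n_max=4):
    """First n_max+1 notch frequencies (Hz) for a concha-microphone distance d (m)."""
    n = np.arange(n_max + 1)
    return (2 * n + 1) * C_SOUND / (4 * d)

# d = 3 cm, roughly the frontal distance of the spiral used in the paper
print(notch_frequencies(0.03))  # approx. [2833, 8500, 14167, 19833, 25500] Hz
```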

To get the notches at different frequencies for all elevations we want an ear shape that has a different distance between the microphone and the ear for each elevation. Lopez-Poveda and Meddis suggest the use of a spiral shape to model human ears and simulate the HRTF [19]. In a spiral, the distance between the microphone, placed in the center of the spiral, and the ear increases linearly with the angle. We can therefore expect the position of the notches in the frequency response to also change linearly with the elevation of the sound source.

We used a spiral with the distance to the center varying from 2 cm at the bottom to 4 cm at the top, Figure 1. That should give us the first notch at around 2800 Hz for sound coming straight from the front, with the frequency increasing linearly as the elevation angle increases, Figure 2. When the free field sound is white noise, as in the figure, it is easy to find the notches directly in the frequency spectrum of either ear. However, sound like spoken language will have its own maxima and minima in the frequency spectrum depending on what is said. It is not clear how humans separate what is said from where it is said [20]. One hypothesis is that we perform a binaural comparison of the spectral patterns, as has also been suggested for owls [21]. Both humans and owls have small asymmetries between the left and right ear that can give excellent cues to vertical localization in the higher frequencies.

Fig. 1. Pinna and concha of a human ear (right), and the artificial pinna/concha (left)

Fig. 2. Example of the HRTF for a sound source at a) 50 degrees above, b) front, and c) 50 degrees below

These relatively small asymmetries, which provide different spectral behaviour between the ears, should not be confused with the large asymmetries needed to give any substantial difference in the ILD. Here we only need the difference in distance between the microphone and the ear for the right and left ear to be enough to separate the spectral notches. In the optimal case we would like the right ear to give a maximum at the same frequency where the left ear has a notch, and hence amplify that notch. This can be done by choosing the distance for the right ear, dr, as:

dr(φ) = 2(mr + 1) / (2·nl + 1) · dl(φ)

where mr = maxima number for the right ear, nl = notch number for the left ear, and dl = distance between the microphone and the ear for the left ear.

If we, for example, want to detect the third notch of the left ear, and want the right ear to have its second maximum at the same frequency, we should choose the distance between the microphone and the ear for the right ear as:

dr(φ) = 2(2 + 1) / (2·3 + 1) · dl(φ) = 6/7 · dl(φ)

In the case of two identical ears we cannot have a maximum of the right ear at the same place where the left ear has a notch for all elevations. The best we can do is to choose the angle between the ears so that the right ear has a maximum for the wanted notch when the sound comes from the front. In the specific case of the ears in Figure 1 the optimal angle becomes 18 degrees.

The ears were constructed from a thin metal band which was easy to bend into a spiral shape. Metal also has the property that it almost completely reflects the sound, which makes it a good material for making ears. The metal band was then fixed onto a piece of thick paper with holes in the center for the microphone. For the first experiments presented in this paper we used professional omnidirectional condenser microphones fixed directly to the head, pointing sideways/forward at an angle of 45 degrees with the tip of the microphone approximately 7 mm outside the head. We then changed to cheap web-microphones pointing straight to the side without any notable effect on the results. The final setup is shown in Figure 3.

Fig. 3. iCub with ears

III. FEATURES

As discussed in the introduction, humans mainly depend on three different features for localizing the sound source: the interaural time difference ITD, the interaural level difference ILD, and the notches in the frequency response of each ear. In this section we show how to extract the features given the signals sl(t) and sr(t) from the left and right ear respectively.

The first step is to sample the sound. We sample the sound for 1/10 s using the sample frequency Fs = 44100 Hz. This gives us k = 4410 samples. We then calculate the mean energy as the sum of the squares of the samples divided by the number of samples:

Σk ( sl²(k) + sr²(k) ) / k.

A simple threshold value is used to decide if the sound has enough energy to make it meaningful to try to extract the features and to localize the sound source. The calculation of the individual features is explained below.
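
A minimal sketch of this energy gate, assuming NumPy arrays of raw samples; the threshold value is an assumption to be tuned on the actual microphones, not a figure from the paper.

```python
import numpy as np

FS = 44100               # sampling frequency (Hz)
K = FS // 10             # 4410 samples = 0.1 s of sound
ENERGY_THRESHOLD = 1e-4  # assumed value, to be tuned on the real microphones

def has_enough_energy(sl, sr, threshold=ENERGY_THRESHOLD):
    """sl, sr: arrays of K samples from the left and right microphones."""
    mean_energy = np.sum(sl**2 + sr**2) / len(sl)
    return mean_energy > threshold
```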

A. Interaural time difference, ITD

The interaural phase difference is calculated by computing the cross-correlation between the signals arriving at the left and right ear/microphone. If the signals have the same shape we can expect to find a peak in the cross-correlation at the number of samples that corresponds to the interaural time difference, i.e. the difference in time at which the signal arrives at the two microphones. We can easily find this by searching for the maximum of the cross-correlation function. Knowing the sampling frequency Fs and the number of samples n that corresponds to the maximum of the cross-correlation function, we can calculate the interaural time difference as:

ITD = n / Fs

If the distance to the sound source is big enough in comparison to the distance between the ears, l, we can approximate the incoming wave front with a straight line, and the difference in distance ∆l traveled by the wave for the left and right ear can easily be calculated as:

∆l = l sin(θ)

Fig. 4. Interaural time difference

where θ is the horizontal angle between the head's midsagittal plane and the sound source, Figure 4. Knowing that the distance traveled is equal to the time multiplied by the velocity of sound, we can now express the angle directly as a function of the ITD:

θ = arcsin( ITD·c / l )

However, for the sake of controlling the robot we are not interested in the exact formula, since we want the robot to be able to learn the relationship between the ITD and the angle rather than hard-coding this into the robot. The important thing is that there exists a relationship that we will be able to learn. We therefore measured the ITD for a number of different angles in an anechoic room, Figure 5.
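
The following sketch illustrates, under the assumptions of equal-length NumPy signal buffers and a far-field source, how the ITD can be extracted by cross-correlation and converted to an azimuth with the arcsin formula above; the paper itself prefers to learn the ITD-to-angle map rather than use the closed form.

```python
import numpy as np

FS = 44100           # sampling frequency (Hz)
EAR_DISTANCE = 0.14  # assumed microphone separation (m), the head diameter
C_SOUND = 340.0      # speed of sound (m/s)

def estimate_itd(sl, sr, fs=FS):
    """Interaural time difference (s) from the peak of the cross-correlation."""
    xcorr = np.correlate(sl, sr, mode="full")
    lag = np.argmax(xcorr) - (len(sr) - 1)   # lag in samples at the peak
    return lag / fs

def itd_to_azimuth(itd, l=EAR_DISTANCE, c=C_SOUND):
    """Far-field azimuth (rad) from the closed-form relation above."""
    return np.arcsin(np.clip(itd * c / l, -1.0, 1.0))
```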

Fig. 5. ITD for different positions of the sound source. Axes: azimuth (rad), elevation (rad), ITD time (ms).

B. Interaural level difference, ILD

The interaural level difference, ILD, is calculated as a function of the average power of the sound signals reaching the left and right ear:

ILD = 10·log10( Σk sl²(k) / Σk sr²(k) )

Sometimes the ILD is calculated from the frequency response rather than directly from the temporal signal. It is easy to go from the temporal signal to the frequency response by applying a fast Fourier transform (FFT). The reason for working with the frequency response instead of the temporal signal is that it makes it easy to apply a high-pass, low-pass, or band-pass filter to the signal before calculating its average power. Different frequencies have different properties. Low frequencies typically pass more easily through the head and ears, while higher frequencies tend to be reflected and their intensity more reduced. One type of filtering that is often used is dBA, which corresponds to the type of filtering that goes on in human ears and which mainly takes into account the frequencies between 1000 Hz and 5000 Hz. In [6] a band-pass filter between 3-10 kHz has been used, which gives a better calculation of the ILD. Different types of heads and ears may benefit from enhancing different frequencies. Here we calculate the ILD directly from the temporal signal, which is equivalent to considering all frequencies. The response for a sound source placed at different angles from the head is shown in Figure 6.
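
A time-domain ILD computation along the lines described above might look as follows (illustrative sketch, not the authors' code; the small epsilon guarding against silent frames is an addition):

```python
import numpy as np

def estimate_ild(sl, sr, eps=1e-12):
    """Interaural level difference (dB) between the left and right signals."""
    # ratio of the total signal powers; eps guards against silent frames
    return 10.0 * np.log10((np.sum(sl**2) + eps) / (np.sum(sr**2) + eps))
```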

Fig. 6. ILD for different positions of the sound source. Axes: azimuth (rad), elevation (rad), level difference (dB).

C. Spectral notches

Existing methods for extracting spectral notches, such as [22][23][24], focus on finding the notches in spectral diagrams obtained in anechoic chambers using white noise, such as the diagrams presented in Figure 2. For a humanoid robot that has to be able to turn towards any type of sound these methods are not suitable. In this section we suggest a novel method to extract the frequency notches that is reasonably fast and simple to implement while giving better accuracy for calculating the elevation of the sound source than methods based on the ILD. The method makes use of the fact that we have a slight asymmetry between the ears and has the following steps:

1) Calculate the power spectral density for each ear
2) Calculate the interaural spectral difference
3) Fit a curve to the resulting difference
4) Find the minima of the fitted curve

To calculate the power spectra we use the Welch method [25]. Typical results for the power spectral density, Hl(f) and Hr(f), for the left and right ear respectively are shown in Figure 7. As can be seen, the notches disappear in the complex spectrum of the sound, which makes it very hard to extract them directly from the power spectra. To get rid of the maxima and minima caused by the form of the free field sound, i.e. what is said, we calculate the interaural spectral difference as:

∆H(f) = 10·log10 Hl(f) − 10·log10 Hr(f) = 10·log10( Hl(f) / Hr(f) )

Finally we fit a 12-degree polynomial to the interaural spectral difference. The interaural spectral difference and the fitted polynomial are shown in Figure 7. From the fitted function it is easy to extract the minima. In this work we have chosen to use the first minimum above 5000 Hz as a feature. In this specific setup that corresponds to the second notch. However, the position of the notches depends on the design of the ears, and for other types of ears other frequencies will have to be used.
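
The four steps can be sketched as below, assuming SciPy's Welch estimator for step 1; the 12-degree polynomial and the 5 kHz threshold are taken from the text, while the frequency normalization before the fit is an implementation detail added here for numerical stability.

```python
import numpy as np
from scipy.signal import welch

FS = 44100  # sampling frequency (Hz)

def extract_notch(sl, sr, fs=FS, degree=12, f_min=5000.0):
    """Return the frequency (Hz) of the first minimum of the fitted
    interaural spectral difference above f_min, or None if none is found."""
    # 1) Welch power spectral density of each ear
    f, Hl = welch(sl, fs=fs)
    _, Hr = welch(sr, fs=fs)
    # 2) interaural spectral difference in dB (epsilon avoids log of zero)
    dH = 10.0 * np.log10((Hl + 1e-20) / (Hr + 1e-20))
    # 3) fit a 12-degree polynomial; frequency is normalized to [0, 1]
    #    before fitting to keep the fit well conditioned
    fn = f / f[-1]
    fitted = np.polyval(np.polyfit(fn, dH, degree), fn)
    # 4) first local minimum of the fitted curve above f_min
    for i in range(1, len(f) - 1):
        if f[i] > f_min and fitted[i] < fitted[i - 1] and fitted[i] <= fitted[i + 1]:
            return f[i]
    return None
```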

Fig. 7. Above: the power spectra for the left ear (solid line) and right ear (dotted line). Note that the spectra are shown in dB, i.e. 10·log10(Hx(f)); the vertical lines represent the expected frequencies of the notches. Below: the interaural spectral difference (dotted line) and the fitted curve (solid line). Frequency axis in Hz, level in dB.

As seen in Figure 7, the minima more or less correspond to the expected frequencies of the notches for the left ear. This is because we carefully designed the ears so that the notches from the two ears would not interfere with each other. In this case we could actually calculate the relationship between the frequency of the notch and the position of the sound source. However, in the general case it is better to let the robot learn the HRTF than to actually calculate it, since the position of the notches is critically affected by small changes in the shape of the conchas or the acoustic environment. Also, if we learn the relationship rather than calculating it, we do not have to worry about the fact that the minima that we find do not directly correspond to the notches, as long as they change with the elevation of the sound source. In Figure 8 we show the extracted feature with the sound source placed at a number of different positions in relation to the head.

Fig. 8. Notch frequencies for different positions of the sound source. Axes: azimuth (rad), elevation (rad), notch frequency (Hz).

IV. AUDIO-MOTOR MAPS

In this section we are going to map sound features to the corresponding 3D localization of sound sources. We present solutions to: i) learn this map, ii) use it to control a robot toward a target, and iii) improve its quality online.

As seen before, differences in head size or ear shape, or even a hat on the robot, can dramatically change the frequency response of the microphones. The coordinate transformation from the ears to the vision and the control motors is difficult to calibrate. Because of this, the system should be able to adapt to these variations. Moreover, the exact solution we have for the pan localization is only valid when the sound source is in the same plane as the receptors; by having a learning mechanism we can combine the vertical and horizontal information to obtain a better localization.

A map m from the sound features C to their localization written in head spherical coordinates SH can be used to direct the head toward a sound source. This map can be represented by SH = m(C). A simple way to move the head toward the target is to move the head pan and tilt by an increment ∆θ equal to the position of the sound source, i.e. ∆θ = SH.

Although the function is not linear, if we restrict ourselves to the space of motion of the head it can be considered as such; we can observe this by noting that the feature surfaces in Figures 5 and 8 are almost planar. Because of this, the nonlinear function can be approximated by a linear one:

SH = MC

This simpler model allows faster and more robust learning. To estimate the value of M, a linear regression with the standard error criterion was selected:

M = arg minM Σi ‖∆θi − M Ci‖²    (3)

This solution gives an offline batch estimate; for online estimation a Broyden update rule [26] was used. This rule is very robust, fast, has just one parameter and only keeps in memory the current estimate of M. Its structure is as follows:

M(t + 1) = M(t) + α ( ∆θ − M(t)C ) Cᵀ / (CᵀC)    (4)

where α is the learning rate. This method is suited to a supervised learning setting, where the positions corresponding to a certain sound are given. This is not the case for an autonomous system. A robot needs an automatic feedback mechanism to learn the audio-motor relation autonomously. Vision can provide such information in a robust and non-intrusive way. As the goal of this robot is to interact with humans, the test sound will be produced by humans, and so the robot knows the visual appearance of the sound source. A face detection algorithm based on [27] and [28] was used, and Figure 9 presents the result of the algorithm. After hearing a sound the robot moves to the position given by the map. If the human head is not centered in the image, then a visual servoing behavior is activated in order to bring the observed face to the image center; for this the robot is controlled with the following rule:

θ̇ = J⁺H Fp

where θ̇ is the velocity for the head motors, J⁺H is the pseudo-inverse of the head Jacobian and Fp is the desired motion of the face in the image. When the face is in the center of the image the audio-motor map can be updated using the learning rule of Eq. 4. Table I presents the final algorithm.

TABLE I
ALGORITHM FOR AUTONOMOUSLY LEARNING AN AUDIO-MOTOR MAP BY INTERACTING WITH A HUMAN

1) listen to sound
2) move head toward the sound using the map
3) locate human face in the image
4) if face not close enough to the center:
   a) do a visual servoing loop to center the face
   b) update the map
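
A possible reading of Table I in code form is sketched below; the batch least-squares of Eq. (3), the Broyden rule of Eq. (4) and the loop structure follow the text, while the callbacks wrapping the microphones, head motors and face detector (listen, move_head, face_offset, center_face) and the centering tolerance are hypothetical placeholders, not part of the paper's code.

```python
import numpy as np

def batch_estimate(thetas, features):
    """Least-squares solution of Eq. (3): thetas (N x 2), features (N x 2)."""
    X, *_ = np.linalg.lstsq(features, thetas, rcond=None)
    return X.T                     # M maps a feature vector to (pan, tilt)

def broyden_update(M, delta_theta, C, alpha=0.5):
    """One step of Eq. (4); alpha is the learning rate."""
    C = C.reshape(-1, 1)
    residual = delta_theta.reshape(-1, 1) - M @ C
    return M + alpha * residual @ C.T / (C.T @ C).item()

def learning_loop(M, listen, move_head, face_offset, center_face, n_trials=30):
    """Hypothetical online loop following Table I."""
    for _ in range(n_trials):
        C = listen()                        # 1) sound features, e.g. (ITD, notch)
        delta_theta = M @ C
        move_head(delta_theta)              # 2) move toward the mapped position
        if np.linalg.norm(face_offset()) > 0.05:   # 3-4) face not centered
            correction = center_face()      # 4a) visual servoing correction (rad)
            M = broyden_update(M, delta_theta + correction, C)  # 4b) update map
    return M
```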

Fig. 9. Result of the face detection algorithm used as a feedback signal for the sound localization algorithm.

V. EXPERIMENTAL RESULTS

We acquired a dataset in a silent room with a white-noise sound source located 1.5 m from the robot. We recorded 1 second of sound in 132 different head positions θ by moving the head with its own motors. A set of features, consisting of the ITD and the notch frequency evaluated on 0.1 second of sound, was computed from these data. Figure 8 shows the resulting feature surfaces after averaging them over 11 samples. This is used as the training dataset.

A second dataset was created to test the learning method. The procedure was similar to the previous one, but the sound source was replaced by a human voice. This was done because the system should operate in a human-robot interaction, and also to evaluate the generalization properties of the method.

The map was then estimated with the optimization problem in Eq. 3. Figure 10 presents the reconstruction error by showing, for each head position, the corresponding error in reconstruction. We can see that the worst case corresponds to the joint limits of the head, but the error is always less than 0.1 rad, which is very small. As a comparison we can note that 0.1 rad is the apparent size of an adult human face when seen from 1.5 m. The error increase near the limits is due to the nonlinearity of the features being approximated by a linear model; however, with this small error the computational efficiency and robustness make us choose this model.

Finally we have done a test in a real-world environment and interaction. The test was done in our offices: after hearing a sound the head moves to the corresponding position. The previously learned function was used as a bootstrap. The map quality always guaranteed that the sound source was located in the camera images, even though the map was learned neither in the same environment nor considering the eye-neck coordinate transformation. In order to improve the results even further we followed the steps of the algorithm presented in Table I. Figure 11 presents the evolution of the error during the experiment. The error represents the difference, in radians, between the position mapped by the model and the real localization of the person. We can see that this error decreased towards less than one degree.

Fig. 10. Audio-motor map reconstruction error for each head position (top: error in pan, middle: error in tilt, bottom: total error). Axes: pan, tilt, error (rad).

VI. CONCLUSION

In this paper we have presented a novel method for estimating the location of a sound source using only two ears. The method is inspired by the way humans estimate the location from features such as the interaural time difference ITD, the interaural level difference ILD, and spectral notches. We suggest the use of spiral-formed ears of slightly different size, or with different inclination, to make it easy to extract the notches. The spiral form also makes it easy to mathematically derive the HRTF for the robot, thus making it possible to simulate the features and/or build controllers based on the HRTF.

Fig. 11. Convergence rate of the audio-motor map learning algorithm when running online with feedback given by a face detector. Axes: trial, error (rad); curves for pan and tilt.

We have also shown that the robot can learn the HRTF either by supervised learning or by using vision. Initial audio-motor maps, either calculated or learnt, combined with an online vision-based update of the maps, are suggested for the control of the robot. This makes it possible for the robot to compensate for small changes in the HRTF caused by dislocations of the ears, exchange of microphones, or placement of objects like a hat close to the ear.

The suggested method has good accuracy within the possible movements of the head used in the experiments. The error in the estimated azimuth and elevation is less than 0.1 radians for all angles, and less than 0.02 radians for the center position.

The method is especially suitable for humanoid robots where we want ears that both look like human ears and perform like them. The precision in the estimate of the position is more than enough for typical applications for a humanoid, like turning towards the sound.

Future work includes resolving front-back ambiguities and learning the HRTF for wider angles than the movements of the head, which demands other approximations of the audio-motor map than linear.

ACKNOWLEDGMENT

EU - Contact, FCT, Lisa Gustavsson, Eeva Klintfors, Ellen Marklund, Peter Branderud, Hassan Djamshidpey

REFERENCES

[1] H. Okuno, K. Nakadai, and H. Kitano, "Social interaction of humanoid robot based on audio-visual tracking," in Proc. of 18th Intern. Conf. on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEA/AIE-2002), June 2002.

[2] H. Okuno, "Human-robot interaction through real-time auditory and visual multiple-talker tracking," in IROS, 2001.

[3] J. Valin, F. Michaud, J. Rouat, and D. Létourneau, "Robust sound source localization using a microphone array on a mobile robot," in Proceedings of the International Conference on Intelligent Robots and Systems, 2003.

[4] D. Sturim and H. Silverman, "Tracking multiple talkers using microphone-array measurements," Journal of the Acoustical Society of America, (in preparation), 1996. [Online]. Available: citeseer.csail.mit.edu/sturim96tracking.html

[5] K. Guentchev and J. Weng, "Learning based three dimensional sound localization using a compact non-coplanar array of microphones," in AAAI Spring Symp. on Int. Env., Stanford CA, March 1998.

[6] L. Natale, G. Metta, and G. Sandini, "Development of auditory-evoked reflexes: Visuo-acoustic cues integration in a binocular head," Robotics and Autonomous Systems, vol. 39, pp. 87–106, 2002.

[7] V. Algazi, R. Duda, R. Morrison, and D. Thompson, "Structural composition and decomposition of HRTFs," in Proc. IEEE WASPAA'01, New Paltz, NY, 2001, pp. 103–106.

[8] B. Gardner and K. Martin, "HRTF measurements of a KEMAR dummy-head microphone," MIT Media Lab Perceptual Computing, Tech. Rep. 280, May 1994.

[9] S. Hwang, Y. Park, and Y. Park, "Sound source localization using HRTF database," in ICCAS2005, Gyenggi-Do, Korea, June 2005.

[10] L. Savioja, J. Houpaniemi, T. Huotilainen, and T. Takala, "Real-time virtual audio reality," in Proceedings of ICMC, 1986, pp. 107–110.

[11] J. Huopaniemi, L. Savioja, and T. Takala, "DIVA virtual audio reality system," in Proc. Int. Conf. Auditory Display (ICAD'96), Palo Alto, California, Nov 1996, pp. 111–116.

[12] C. Avendano, R. Duda, and V. Algazi, "Modeling the contralateral HRTF," in Proc. AES 16th International Conference on Spatial Sound Reproduction, Rovaniemi, Finland, 1999, pp. 313–318.

[13] B. C. J. Moore, "Interference effects and phase sensitivity in hearing," Philosophical Transactions: Mathematical, Physical and Engineering Sciences, vol. 360, no. 1794, pp. 833–858, May 2002.

[14] M. Zwiers, A. V. Opstal, and J. Cruysberg, "A spatial hearing deficit in early-blind humans," J. Neurosci., vol. 21, 2001.

[15] V. Algazi, C. Avendano, and R. Duda, "Estimation of a spherical-head model from anthropometry," J. Aud. Eng. Soc., 2001.

[16] R. Duda, C. Avendano, and V. Algazi, "An adaptable ellipsoidal head model for the interaural time difference," in Proc. ICASSP, 1999. [Online]. Available: citeseer.csail.mit.edu/duda99adaptable.html

[17] R. Beira, M. Lopes, M. Praça, J. Santos-Victor, A. Bernardino, G. Metta, F. Becchi, and R. Saltaren, "Design of the robot-cub (iCub) head," in IEEE ICRA, 2006, to appear.

[18] V. Algazi, R. Duda, and D. Thompson, "The CIPIC HRTF database," in IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, NY, USA, 2001.

[19] E. A. Lopez-Poveda and R. Meddis, "Sound diffraction and reflections in the human concha," J. Acoust. Soc. Amer., vol. 100, pp. 3248–3259, 1996.

[20] C. Jin, A. Corderoy, S. Carlile, and A. van Schaik, "Contrasting monaural and interaural spectral cues for human sound localization," J. Acoust. Soc. Am., vol. 115, no. 6, pp. 3124–3141, June 2004.

[21] R. A. Norberg, "Skull asymmetry, ear structure and function, and auditory localization in Tengmalm's owl, Aegolius funereus (Linné)," Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, vol. 282, no. 991, pp. 325–410, Mar 1978.

[22] S. G. R. M. A. Ramírez, "Extracting and modeling approximated pinna-related transfer functions from HRTF data," in Proceedings of ICAD 05 - Eleventh Meeting of the International Conference on Auditory Display, Limerick, Ireland, July 2005.

[23] V. C. Raykar, R. Duraiswami, L. Davis, and B. Yegnanarayana, "Extracting significant features from the HRTF," in Proceedings of the 2003 International Conference on Auditory Display, Boston, MA, USA, July 2003.

[24] V. C. Raykar and R. Duraiswami, "Extracting the frequencies of the pinna spectral notches in measured head related impulse responses," J. Acoust. Soc. Am., vol. 118, no. 1, pp. 364–374, July 2005.

[25] P. Welch, "The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms," IEEE Trans. Audio Electroacoust., vol. AU-15, pp. 70–73, June 1967.

[26] R. Fletcher, Practical Methods of Optimization, 2nd ed. Chichester, 1987.

[27] P. Viola and M. J. Jones, "Rapid object detection using a boosted cascade of simple features," in IEEE CVPR, 2001.

[28] R. Lienhart and J. Maydt, "An extended set of Haar-like features for rapid object detection," in IEEE ICIP, Sep 2002, pp. 900–903.

James: A Humanoid Robot Acting over an Unstructured World

Lorenzo Jamone, Giorgio Metta, Francesco Nori and Giulio Sandini
LIRA-Lab, DIST, University of Genoa, Italy
{lorejam,pasa,iron,giulio}@liralab.it

Abstract— The recent trend of humanoid robotics research has been deeply influenced by concepts such as embodiment, embodied interaction and emergence. In our view, these concepts, besides shaping the controller, should guide the very design process of modern humanoid robotic platforms. In this paper, we discuss how these principles have been applied to the design of a humanoid robot called James. James has been designed by considering an object manipulation scenario and by explicitly taking into account embodiment, interaction and the exploitation of smart design solutions. The robot is equipped with moving eyes, neck, arm and hand, and a rich set of sensors, enabling proprioceptive, kinesthetic, tactile and visual sensing. A great deal of effort has been devoted to the design of the hand and touch sensors. Experiments, e.g. tactile object classification, have been performed to validate the quality of the robot's perceptual capabilities.

I. INTRODUCTION

Flexible and adaptive behavior is not only the result of a smart controller but also of the careful design of the body, the sensors, and the actuators. This has been shown to be true both in artificial and biological systems [1], [2], [3]. Specifically, this consideration has underlined the necessity of providing robots with a physical structure suited for interacting with an unknown external world. Within this framework, learning is the result of the continuous interaction with the environment and not the outcome of abstract reasoning. Knowledge about the external unstructured world is incrementally developed and becomes detailed only after extensive exploration. Following these new ideas, we are bound to consider actions as the main tool for cognitive development, see for example [3]. When implementing such a general principle, two issues become apparent. On one hand, actions require the existence of a physical body (embodiment, [2]), which 'safely' interacts with a continuously changing, dynamic, unstructured and, in one word, "real" environment. On the other hand, the necessity of understanding the consequences of actions requires fine perceptual capabilities, i.e. the ability of retrieving information about the world (exteroception) and about the body (proprioception). Therefore, action and perception need to be considered as a unique tool for the development of the behavior of the robot, mutually affecting each other in a self-maintaining process of co-development [1]. Clearly, there is a direct relationship between the complexity of the action system and the quality of the data that the robot can have access to by means of action [3]. At the same time, actions can be improved because of the improvement of the perceptual system through learning and adaptation.

II. OVERVIEW OF THE PAPER

In the previous section we have discussed the importance of three fundamental concepts in humanoid robotics: embodiment, action, and perception. In the light of these ideas, the design and realization of a robot has become a research topic in itself. This paper focuses on the design of the humanoid robot called James (see Figure 1), primarily reporting the design choices which have been suggested/required by seriously considering embodiment, action, and perception. Some of the design aspects are related to the fact that the robot is intended to operate in the context of object manipulation, which will be the test-bed on which these new principles will be validated. James is a 22-DOF torso with moving eyes and neck, an arm and a highly anthropomorphic hand. In the next sections we cover the robot design, its actuation, and sensorization. Experiments are reported.

Fig. 1. The humanoid robot James.

The structure of the following sections has to be understood in terms of the three principles mentioned above: embodiment, action, and perception.

- Section III deals with the body, i.e. the physical structure of the robot. The robot structure is similar to that of humans in size, number of DOFs and range of movements. Such a complex structure is, in our view, mandatory when studying complex tasks like manipulation. Specific attention has been devoted to the structure of the hand, which, through a necessary simplification, retains most of the joints of a human hand. The complexity of the structure makes modeling a challenging problem. Therefore, the platform seems to be ideal for exploring the expressive potentialities of the principles described above.

- Section IV deals with the actuation system. As already pointed out, 'safe' interaction with the external world is a fundamental process. Therefore, when designing the robot, we decided to make it intrinsically compliant, so that it can act safely in an unknown and unstructured world without damaging itself or the objects around it. This has been achieved by an accurate choice of the actuation system. Specifically, movements are made intrinsically compliant by using plastic belts, stainless-steel tendons which transmit torques from motors to joints, and springs at critical locations along the tendons. A great deal of attention has been focused on the actuation of the hand. Since it is impossible to actuate all the 17 joints independently1, an under-actuated mechanism was designed. Roughly speaking, some of the joints are weakly coupled in such a way that blocking one of them does not prevent the movement of the remaining joints.

- Section V deals with sensors. Once again, we should stress the fact that sensors are fundamental to retrieve information about the external world and about the body of the robot. Perception is provided by visual, kinesthetic, proprioceptive and tactile sensors, some of which have been created specifically for James. We will describe the design and realization of a novel silicone-made tactile sensor. Touch is indeed critical in manipulation tasks.

Furthermore, to prove the crucial role of tactile sensing and proprioception in getting information about grasped objects, we describe a set of classification experiments on objects with different shape and/or softness.

III. PHYSICAL STRUCTURE

James has 22 DOFs, actuated by a total of 23 motors, whose torque is transmitted to the joints by belts and stainless-steel tendons. The head is equipped with two eyes, which can pan and tilt independently (4 DOFs), and is mounted on a 3-DOF neck, which allows the movement of the head as needed in the 3D rotational space. The arm has 7 DOFs: three of them are located in the shoulder, one in the elbow and three in the wrist. The hand has five fingers. Each of them has three joints (flexion/extension of the distal, middle and proximal phalanxes). Two additional degrees of freedom are represented by the thumb opposition and by the coordinated abduction/adduction of four fingers (index, middle, ring, and little finger). Therefore, the hand has a total of 17 joints. Their actuation will be described later in Section IV.

1 All the motors for actuating the hand are located in the arm and forearm. Therefore, it was impossible to have one motor for each degree of freedom of the hand.

Fig. 2. On the left, a picture of the tendon actuation; wires slide inside flexible tubes. On the right, a picture of the tendon-spring actuation.

The overall size of James is that of a ten-year-old boy, with the appropriate proportions, for a total weight of about 8 kg: 2 kg for the head, 4 kg for the torso and 2 kg for the arm and hand together (the robot parts are made of aluminum and Ergal).

IV. ACTUATION SYSTEM

This section describes how compliance has guided the design of the actuation system. Special attention has been devoted to the hand actuation, which is crucial in manipulation tasks.

A. Overall actuation structure

The robot is equipped with 23 rotary motors (Faulhaber), most of which directly drive a single DOF. Exceptions are in the neck, where three motors are employed to roll and tilt the head, and in the shoulder, where three motors are coupled to actuate three consecutive rotations. Transmission of the torques generated by the motors is obtained through plastic toothed belts and stainless-steel tendons; this solution is particularly useful in designing the hand since it allows locating most of the hand actuators in the wrist and forearm rather than in the hand itself, where strict size and weight constraints are present. Furthermore, tendon actuation gives a noteworthy compliance to the system2, easing the interaction with an unknown and unstructured world. Indeed, the elasticity of the transmission prevents damage to the external world and to the robot itself, especially in the presence of undesired and unexpected conditions, which are usual in the "real" world: small positional errors when the hand is very close to the target, obstacles in the path of the desired movements, strokes inflicted by external agents. Extra intrinsic compliance has been added by means of springs in series with the tendons, to further help the coupling/decoupling of the fingers (see Figure 2).

B. Shoulder actuation structure

The design of the tendon-driven shoulder was developed with the main intent of allowing a wide range of movements. The current design consists of three successive rotations corresponding to pitch, yaw and roll respectively (see Figure 3 for details).

2 Stainless-steel tendons are extremely elastic, especially when the applied tension exceeds a certain threshold.

Fig. 3. The picture shows the three degrees of freedom of the shoulder. Notice in particular how the yaw rotation is obtained by a double rotation around two parallel axes.

Fig. 4. Picture of the shoulder. Notice the two mechanically coupled yaw joints. This specific design allows a good range of yaw rotation (more than 180 degrees).

The three motors that actuate these rotations are located in the torso3. Actuation is achieved by exploiting the design of tendons and pulleys. Special care has been taken in designing the abduction (yaw) movement. The abduction rotation is divided along two mechanically coupled joints (see Figure 4). The two joints correspond to a sequence of two rotations around two parallel axes. A tight mechanical coupling forces the two angles of rotation to be equal. This 'ad hoc' solution allows the arm to gain an impressive range of movement (pitch ≈ 360°, yaw ≥ 180°, roll ≈ 180°) and to perform special tasks (e.g. positioning the hand behind the head) at the expense of a multiplication of the required torque by a similar amount.

C. Head actuation structure

The head structure has a total of 7 degrees of freedom, actuated by 8 motors. Four of these motors are used to actuate the pan and tilt movements of the two independent eyes (see Figure 5 for a scheme of the tendon actuation).

3 This design is evidently non-standard. Standard manipulators (e.g. the Unimate Puma) have the shoulder motors in a serial configuration, with a single motor directly actuating a single degree of freedom. In the Puma example, a pure pitch/yaw/roll rotation can be obtained by simply moving one motor and keeping the others fixed. On James, instead, shoulder rotations are the result of the coordinated movement of the three motors. The motor positions (θ1, θ2, θ3) are related to the pitch, yaw and roll rotations (θp, θy, θr) by a lower triangular matrix:

[θp]   [ 1  0  0] [θ1]
[θy] = [-1  1  0] [θ2]
[θr]   [ 1  2  1] [θ3]
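
For illustration only (not the robot's control code), the coupling matrix above can be used to convert between motor positions and pitch/yaw/roll, for example to see that a pure yaw rotation requires a coordinated motion of more than one motor:

```python
import numpy as np

# lower triangular coupling matrix from the footnote above
A = np.array([[ 1.0, 0.0, 0.0],
              [-1.0, 1.0, 0.0],
              [ 1.0, 2.0, 1.0]])

def motors_to_pyr(theta_motors):
    """Motor positions (theta1, theta2, theta3) -> (pitch, yaw, roll)."""
    return A @ np.asarray(theta_motors, dtype=float)

def pyr_to_motors(theta_pyr):
    """Inverse mapping, solving the triangular system."""
    return np.linalg.solve(A, np.asarray(theta_pyr, dtype=float))

# a pure yaw rotation of 0.5 rad needs theta2 and theta3 to move together
print(pyr_to_motors([0.0, 0.5, 0.0]))  # -> [ 0.   0.5 -1. ]
```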

Fig. 5. The left picture shows the tendon-driven eye. The two tendons are actuated by two motors. The first motor moves the vertical tendon (tilt motion). The second motor moves the horizontal tendon (pan motion). The right figure sketches the actuation scheme.

Fig. 6. The two mechanically coupled joints of the shoulder allow the arm to have the maximum range of the abduction (yaw) movement.

One motor directly actuates the head pan. The remaining three motors actuate the head's two additional rotations: tilt and roll. These two rotations are achieved with an unconventional actuation system (see Figure 6). Each motor pulls a tendon; the tension of the three tendons determines the equilibrium configuration of the spring on which the head is mounted. The structure has an implicit compliance, but it can become fairly stiff when needed by pulling the three tendons simultaneously.

D. Hand actuation structure

The main constraint in designing the hand is represented by the number of motors which can be embedded in the arm, given that there is very little space in the hand itself. The current solution uses 8 motors to actuate the hand. Note that these motors are insufficient to drive every joint independently. After extensive studies, we ended up with a solution based on coupling some of the joints with springs. This solution does not prevent movement of the coupled joints when one of them is blocked (for example by an obstacle) and therefore results in an intrinsic compliance. Details of the actuation for each finger are given in the following.

- Thumb. The flexion/extension of the proximal and middle phalanxes is actuated by a single motor. However, the movement is not tightly coupled: the actuation system allows movement of one phalanx if the other is blocked. Two more motors are used to directly actuate the distal phalanx and the opposition movement.

- Index finger. The flexion/extension of the proximal phalanx is actuated by a single motor. One more motor is used to move the two more distal phalanxes.

- Middle, ring and little fingers. A single motor is used to move the distal phalanxes of the three fingers (flexion/extension). Once again, blocking one of the fingers does not prevent the others from moving. One additional motor actuates the proximal phalanxes and, similarly, the movement transfers to the other fingers when one of them is blocked.

Finally, an additional motor (for a total of 8 motors) is used to actuate the abduction/adduction of four fingers (index, middle, ring and little fingers). In this case, the fingers are tightly coupled and blocking one of the fingers blocks the others.

E. Motor control strategy

Low-level motor control is distributed over 12 C-programmable DSP cards, some of which are embedded in the robot's arm and head. Communication between the cards and an external cluster of PCs is achieved by the use of a CAN-BUS connection. Robot movement is controlled by the user with position/velocity commands, which are processed by the DSPs, which in turn generate trajectories appropriately. Most of the motors are controlled with a standard PID control, except for the shoulder, neck and eye motors, where mechanical constraints require the use of more sophisticated and coordinated control strategies. These specific control strategies fall outside the scope of this paper.
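
As a generic illustration of the standard PID position control mentioned above (the gains, sample time and loop structure below are assumptions for the sketch, not values or code from the robot's DSP cards):

```python
class PID:
    """Textbook PID controller; one instance per controlled joint."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        """One control step: return the command for the current sample."""
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# example: a 1 ms position loop driving one joint toward 0.3 rad
controller = PID(kp=10.0, ki=0.5, kd=0.05, dt=0.001)
command = controller.step(setpoint=0.3, measurement=0.0)
```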

V. SENSORS

The sensory system allows James to gather information about its own body and about the external world. The robot is equipped with vision, proprioception, kinesthetic and tactile inputs. As previously mentioned, the robot's perceptual capabilities have been designed in line with the necessity of using it for manipulation.

Vision is provided by two digital CCD cameras (PointGrey Dragonfly remote head), located in the eyeballs, with the controlling electronics mounted inside the head and connected via FireWire to the external PCs.

The proprioceptive and kinesthetic senses are achieved through position sensors. Besides the magnetic incremental encoders connected to all motors, absolute-position sensors have been mounted on the shoulder to ease calibration, in the fingers (in every phalanx, for a total of 15) and in the two motors that drive the abduction of the fingers and of the thumb. In particular, the hand position sensors have been designed and built expressly for this robot, combining magnets and Hall-effect sensors. In order to understand the sensor's structure and functioning, consider one of the flexion joints of the fingers, between a pair of links (phalanxes). One metal ring holds a set of magnets and moves with the more distal of the two links. The Hall-effect sensor is mounted on the first link. The magnets are laid out so as to generate a specific magnetic field sensed by the Hall-effect sensor, monotonically increasing when the finger moves from a fully-bent position to a fully-extended one: from the sensor output, the information about the flexion angle is derived. Furthermore, a 3-axis inertial sensor (Intersense iCube2) has been mounted on top of the head, to emulate the vestibular system.
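
Because the sensor output varies monotonically with the flexion angle, a simple calibration table recorded at known postures is enough to recover the angle; the sketch below is hypothetical, with invented calibration values, and is not the calibration procedure reported in the paper.

```python
import numpy as np

# invented calibration pairs: raw Hall-effect readings and flexion angles (deg)
CAL_READINGS = np.array([120.0, 300.0, 520.0, 760.0, 980.0])
CAL_ANGLES   = np.array([ 90.0,  70.0,  45.0,  20.0,   0.0])

def flexion_angle(reading):
    """Interpolate the flexion angle from a raw Hall-effect reading."""
    # np.interp expects increasing x; readings grow as the finger extends
    return np.interp(reading, CAL_READINGS, CAL_ANGLES)

print(flexion_angle(640.0))  # 32.5 degrees, halfway between the two nearest points
```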

Tactile information is extracted from sensors which have been specifically designed and developed for James, and have been realized using a two-part silicone elastomer (Sylgard 186), a Miniature Ratiometric Linear Hall Effect Sensor (Honeywell, mod. SS495A) and a little cylindrical magnet (grade N35; dim. 2x1.5 mm).

Fig. 7. Description of the sensing method.

The sensor structure is shown in Figure 7: any external pressure on the silicone surface causes a movement of the magnet and therefore a change in the magnetic field sensed by the Hall-effect sensor, which indirectly measures the normal component of the force that has generated the pressure. An air gap between the magnet and the Hall-effect sensor increases the sensor's sensitivity while keeping the structure robust enough. Two kinds of sensors have been realized, with a different geometric structure depending on the mounting location: phalangeal-sensors for some of the phalanxes and fingertip-sensors for the distal ends of each finger (Figure 8). The former have one sensing element, which can measure only a limited range of forces, even if with a high sensitivity. In practice, the response of the phalangeal sensors can be regarded as an on/off response. The fingertip-sensors, instead, have two sensing elements capable of sensing a larger range of stimuli.

In order to realize the silicone parts, two different molds have been used (Figure 9 and Figure 10), employing the same building procedure. Two different covers for each mold can be seen in the figures: cover A, with little cylindrical bulges to produce suitable holes for the magnets, and cover B, with rabbets on some edges to create the already mentioned air gaps. The molds are first filled with viscous silicone (a mixture of the two parts, base and curing agent), covered with cover A and left to dry under progressive heating.

Fig. 8. On the left, the fingertip-sensor, with its two sensing elements. On the right, the phalangeal-sensor, with a unique sensing element.

This latter process lasts 24 hours at room temperature and then, in the oven, 4 hours at 65°C, 1 hour at 100°C and 20 minutes at 150°C. Afterward, the magnets are inserted, some more silicone is poured, and the molds are covered with cover B and left to dry again. When ready, the silicone parts are fixed on the fingers with a sealing silicone, next to the Hall-effect sensors previously glued in the desired positions.

Fig. 9. Mold and covers for the fingertip-sensor.

Fig. 10. Mold and covers for the phalangeal-sensor.

The fingertip-sensor characteristic curve is reported in Figure 11, where the thicker line is the mean value over a series of several measurements and the vertical bars are the standard deviation. The reasonably low standard deviation is due to the robustness of the mechanical structure, which tolerates pressure applied at different positions and from different directions, even with substantially different intensities.

The non-linear response of the sensor is a feature, because of its similarity with the log-shaped response curve observed in humans, whose sensitivity decreases with the stimulus intensity [4]. Furthermore, the minimum force intensity detected by the sensor (less than 10 grams for fingertip-sensors and about 1-2 grams for phalangeal-sensors) is very low.

Fig. 11. Fingertip-sensor’s characteristic curve.

This aspect has been pursued as the main feature of the device during the design and development of its mechanical structure. Finally, the soft silicone provides an intrinsic compliance and increases the friction cone of the grasping forces. The silicone adapts its shape to that of the touched object without being subject to plastic deformations, easing the interaction with unknown environments. 12 tactile sensors have been mounted on James's hand: 5 fingertip-sensors, one for each finger, and 7 phalangeal-sensors, two on the thumb, ring finger and middle finger, and one on the index finger, as shown in Figure 12.

Fig. 12. James's anthropomorphic hand equipped with the 12 tactile sensors.

All 32 Hall-effect sensors of the hand (employed in tactile and proprioceptive sensing) are connected to an acquisition board, mounted on the back of the hand and interfaced to the CAN-BUS like the DSP-based control boards. The acquisition card is based on a PIC18F448 microcontroller, an ADC and multiplexer, and a 40-cable connector, which carries the 32 signals and provides the 5 Volt DC power supply to the sensors. To wire such a considerable number of devices in such a little space, with a reasonable resiliency (required because of the constant interaction of the hand with the external environment), we have chosen a very thin stainless-steel cable, coated in Teflon, with a 0.23 mm external diameter. Moreover, in order to further increase system robustness, the cables are grouped into silicone catheters along their route between the different sensors and toward the acquisition card.

VI. EXPERIMENTS

Experiments have been carried out to validate the design and the quality of the proprioceptive signals for gathering useful information about grasped objects. Proprioception can be further combined with touch in object clustering and recognition, showing a notable improvement. The importance of proprioception in object clustering and recognition has been proved both in human subjects [5] and in artificial systems [6]; our aim here is to show how tactile information improves the system performance, providing a richer haptic perception of grasped objects. A Self Organizing Map (SOM) has been employed to cluster and display the data. The object representation is simply given by the data acquired by the proprioceptive and tactile sensors of the hand, which provide information about object shape and softness.

The experiment is divided into three stages: data gathering, SOM training and SOM testing. During the first stage, six different objects (bottle, ball, mobile phone, scarf, foam blob and furry toy) have been brought, metaphorically, to the robot's attention. James performed the same pre-programmed action of grasping and releasing the objects several times. A simple motor grasping strategy turned out to be effective because of the overall design of the hand. Data provided by the proprioceptive and tactile sensors have been acquired every 50 ms, recorded, and organized in 25,000 32-dimensional vectors, each representing the haptic configuration of the hand at a specific time instant. Afterward, the data have been fed for 50 epochs to the SOM, which is formed, in our case, by a main layer (Kohonen layer) of 225 neurons, organized in a 15x15 bi-dimensional grid with hexagonal topology, and an input layer of 32 neurons, which receives the recorded vectors of hand configurations. During this training stage, the main layer neurons self-organize their positions according to the distribution of the input vectors, creating separate areas of neurons responding to similar stimuli.
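
For readers unfamiliar with the technique, a toy SOM of the size described above can be sketched as follows; this is a minimal stand-in (square grid instead of hexagonal topology, random data instead of the recorded hand vectors), not the software used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID, DIM = 15, 32                      # 15x15 map, 32-dimensional inputs
weights = rng.random((GRID, GRID, DIM))
coords = np.stack(np.meshgrid(np.arange(GRID), np.arange(GRID),
                              indexing="ij"), axis=-1)

def train(data, epochs=50, lr0=0.5, sigma0=4.0):
    """Classic SOM training: find the best matching unit for each input and
    pull its neighbourhood toward the input with decaying rate and radius."""
    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs)
        sigma = sigma0 * (1.0 - epoch / epochs) + 0.5
        for x in data:
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)   # best matching unit
            grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))[..., None]
            weights[:] = weights + lr * h * (x - weights)

# random stand-in for the recorded 32-dimensional hand-configuration vectors
train(rng.random((1000, DIM)), epochs=5)
```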

Finally, the neural network has been tested with both dynamic and static inputs, to see how the grasping action is represented and to check the classification and recognition capabilities of the trained system. Results are shown in Figure 13 and Figure 14.

Figure 13 shows the state of the SOM main layer after training. Each small diagram accounts for the values of the corresponding weights of the neuron in that particular position. It is worth noting that nearby neurons show similar curves, which gradually change moving across the network. The path indicated by the black arrows represents the temporal sequence of the neurons activated during a ball grasping. The neurons lie on a continuous path, because of the gradual changes in the sensor outputs during a continuous movement.

Fig. 13. Path of neural activations on the main layer's bi-dimensional grid, due to the grasping of a ball.

Fig. 14. Results of object classification. The left picture shows the SOM response to an empty, open hand. The picture on the right shows the SOM response to six different objects.

Figure 14 shows the response to static inputs: the spikes on both graphs indicate the number of activations of the neuron in that position after the presentation of inputs describing the haptic configuration of the hand when grasping a particular object or when free and open. It is worth noting that different objects are well classified and similar objects lie in adjacent regions.

Another experiment has been carried out to prove the special importance of tactile stimuli, focusing on situations where the use of proprioception alone may lead to ambiguous results. A first SOM has been trained only with proprioceptive information. A second SOM was instead trained with both tactile and proprioceptive inputs. Figure 15 demonstrates that tactile sensing improves the system performance: indeed, the grasp of a scarf and that of a bottle are only distinguished by the second neural network. In fact, with these two objects the final hand posture is very similar, as a consequence of the fact that the two objects have approximately the same shape. Therefore, the only way to distinguish the two objects is by the use of tactile information.

Fig. 15. Comparison between the use of proprioception alone, on the left, and the combined use of proprioceptive and tactile information, on the right.

The two objects have indeed different material characteristics (softness) and therefore only the use of touch can sense their difference.

VII. FUTURE WORK

Future work is already planned, concerning both the arm and the hand control. In particular, adaptive modular control and the exploration of grasping strategies will be investigated.

Remarkably, Bizzi and Mussa-Ivaldi [7] have presented interesting experimental evidence supporting the idea that biological sensorimotor control is partitioned and organized in a set of modular structures. At the same time, Shadmehr [8] and Brashers-Krug [9] have shown the extreme adaptability of the human motor control system. So far, adaptation has been proved to depend on performance errors (see [8]) and context-related sensory information (see [10]). Following these results, a dynamic control law which consists of the linear combination of a minimum set of motion primitives has been examined. Interestingly, the way of combining these primitives can be parametrized on the actual context of movement [11]. Even more remarkably, primitives can be combined in approximately the same way forces combine in nature. Adaptation (i.e. by means of combining primitives with different mixing coefficients) can arise both from error-based updating of parameters (where the error represents the difference between the desired and measured position and velocity of the joints) and from previously learned context/parameter associations (even combining some of them in a new and unexperienced context can be seen as the combination of some experienced contexts).

As far as the hand movement is concerned, the grasping action used in the experiment described in this paper, although effective, is still very simple: the fingers are just driven to a generic closed posture, wrapping the object by means of the elastic actuation system. Following cues from existing examples [12], future development includes the implementation of a system for

choosing the best grasping action among a series of basic motor synergies, combining some of them so as to maximize the number of stimulated tactile sensors and hence obtain a more comprehensive and effective object perception. Of course, in order to show this ability to evaluate and improve performance (i.e. to move the grasped object toward the unstimulated sensors), a topological map of the sensors within the hand is needed. Following the developmental approach, we want this map to be learned autonomously by the robot through random interactions with the environment. It is reasonable to assume [13] that this could be achieved by analyzing the correlation between channels in the recorded data provided by touch and proprioception, exploiting multisensory integration as suggested by general neuroscience results [14].
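
In the spirit of [13] (though not its actual algorithm), a minimal sketch of how such a map could be bootstrapped is to correlate tactile channels recorded during random interaction and to treat strongly correlated channels as neighbours; the channel count, synthetic data and threshold below are assumptions.

import numpy as np

# Sketch: infer a rough topological map of tactile sensors from the
# correlation of their signals during random interaction with objects.
# The number of channels, the data and the threshold are assumptions.

rng = np.random.default_rng(1)

n_sensors, n_samples = 12, 5000
positions = np.linspace(0.0, 1.0, n_sensors)          # hidden "true" layout

# Synthetic recording: each contact event activates nearby sensors together.
data = np.zeros((n_samples, n_sensors))
for t in range(n_samples):
    centre = rng.uniform(0.0, 1.0)
    data[t] = np.exp(-((positions - centre) ** 2) / 0.01) + 0.05 * rng.standard_normal(n_sensors)

corr = np.corrcoef(data.T)                             # channel-by-channel correlation

# Sensors whose signals co-vary strongly are declared neighbours.
threshold = 0.6
neighbours = {i: [j for j in range(n_sensors)
                  if j != i and corr[i, j] > threshold]
              for i in range(n_sensors)}

for i, nbs in neighbours.items():
    print(f"sensor {i}: neighbours {nbs}")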

VIII. CONCLUSIONS

We have described James, a 22-DOF humanoid robot equipped with moving eyes and neck, a 7-DOF arm and a highly anthropomorphic hand. The design of James has been guided by the concepts of embodiment, material compliance and embodied interaction. Since James is intended to operate in a manipulation scenario, special care has been taken in the design of novel silicone-made tactile sensors. These sensors have been designed, realized, tested and integrated within James's anthropomorphic hand. To prove the importance of proprioceptive and tactile information, experiments on the classification and recognition of grasped objects have been carried out using a Self-Organizing Map to cluster and visualize the data. As expected, the results have shown the improvements offered by tactile sensing in building object representations, enabling the system to classify and recognize objects according to their softness and elasticity. Shape information provided by the measured posture of the grasping hand is indeed not sufficient to discriminate two objects with similar size but different softness. Finally, the future development of efficient and flexible grasping strategies has been outlined.

ACKNOWLEDGMENT

The work presented in this paper has been supported by the ROBOTCUB project, funded by the European Commission through Unit E5 "Cognition". Moreover, it has been partially supported by NEUROBOTICS, a European FP6 project (IST-2003-511492).

REFERENCES

[1] G. Sandini, G. Metta, and D. Vernon, "RobotCub: An open framework for research in embodied cognition," in IEEE-RAS/RSJ International Conference on Humanoid Robots (Humanoids 2004), 2004.

[2] R. Pfeifer and C. Scheier, "Representation in natural and artificial agents: a perspective from embodied cognitive science," Natural Organisms, Artificial Organisms and Their Brain, pp. 480–503, 1998.

[3] P. Fitzpatrick, G. Metta, L. Natale, S. Rao, and G. Sandini, "Learning about objects through action: Initial steps towards artificial cognition," in IEEE International Conference on Robotics and Automation (ICRA), Taipei, Taiwan, 2003.

[4] J. Wolfe, K. Kluender, and D. Levi, Sensation and Perception. Sinauer Associates Incorporated, 2005.

[5] M. Jeannerod, The Cognitive Neuroscience of Action. Blackwell Science, 1997.

[6] L. Natale, G. Metta, and G. Sandini, "Learning haptic representation of objects," in International Conference on Intelligent Manipulation and Grasping, Genoa, Italy, 2004.

[7] F. A. Mussa-Ivaldi and E. Bizzi, "Motor learning through the combination of primitives," Philosophical Transactions of the Royal Society: Biological Sciences, vol. 355, pp. 1755–1769, 2000.

[8] R. Shadmehr and F. A. Mussa-Ivaldi, "Adaptive representation of dynamics during learning of a motor task," Journal of Neuroscience, vol. 14, no. 5, pp. 3208–3224, 1994.

[9] T. Brashers-Krug, R. Shadmehr, and E. Bizzi, "Consolidation in human motor memory," Nature, vol. 382, pp. 252–255, 1996.

[10] M. Shelhamer, D. Robinson, and H. Tan, "Context-specific gain switching in the human vestibuloocular reflex," Annals of the New York Academy of Sciences, vol. 656, no. 5, pp. 889–891, 1991.

[11] F. Nori, G. Metta, L. Jamone, and G. Sandini, "Adaptive combination of motor primitives," in Workshop on Motor Development, part of AISB'06: Adaptation in Artificial and Biological Systems, University of Bristol, Bristol, England, 2006.

[12] J. Fernandez and I. Walker, "Biologically inspired robot grasping using genetic programming," in IEEE International Conference on Robotics & Automation, 1998.

[13] L. Olsson, C. Nehaniv, and D. Polani, "Discovering motion flow by temporal-informational correlations in sensors," in Epigenetic Robotics 2005, 2005.

[14] E. Kandel, J. Schwartz, and T. Jessell, Principles of Neural Science, 4th ed. McGraw-Hill, 2000.

Potential relevance of audio-visual integration in mammals for computational modeling

Eeva Klintfors and Francisco Lacerda

Department of Linguistics, Stockholm University, Stockholm, Sweden

[email protected]

ABSTRACT

The purpose of this study was to examine typically developing infants' integration of audio-visual sensory information as a fundamental process involved in early word learning. One hundred sixty pre-linguistic children were randomly assigned to watch one of four counterbalanced versions of audio-visual video sequences. The infants' eye-movements were recorded and their looking behavior was analyzed throughout three repetitions of exposure and test phases. The results indicate that the infants were able to learn the covariance between the shapes and colors of arbitrary geometrical objects and the nonsense words corresponding to them. Implications of audio-visual integration in infants and in non-human animals for modeling within speech recognition systems, neural networks and robotics are discussed. Index Terms: language acquisition, audio-visual, modeling

1. INTRODUCTION

This study is part of the "MILLE" project (Modeling Interactive Language Learning, supported by the Bank of Sweden Tercentenary Foundation), an interdisciplinary research collaboration between three research groups – the Department of Linguistics, Stockholm University (SU, Sweden), the Department of Psychology, Carnegie Mellon University (CMU, USA) and the Department of Speech, Music and Hearing, Royal Institute of Technology (KTH, Sweden). The general goals of the collaboration are to investigate fundamental processes involved in language acquisition and to develop computational models to simulate these processes.

As a first step towards these goals, perceptual experiments with human infants were performed at the SU. These experiments were designed to investigate infants’ early word acquisition as an implicit learning process based on integration of multi-sensory (audio-visual) information. The current study specifically provides data on integration of arbitrary visual geometrical objects and “nonsense” words corresponding to attributes of these objects.

2. BACKGROUND

2.1 Research in mammals

As a theoretical starting point for the current study it is assumed that infants do not have a priori specified linguistic knowledge and that the acquisition of the ambient language is guided by general perception and memory processes. These general-purpose mechanisms are assumed to lead to linguistic structure through the learning of implicit regularities available in the ambient language. Thus, as opposed to the belief that initial innate guidance is a prerequisite for language acquisition [1], [2], the proposed Ecological Theory of Language Acquisition (ETLA) suggests that the early phases of the language acquisition process are an emergent consequence of the interplay between the infant and its linguistic environment [3]. To be sure, observations based on implicit regularities may indeed lead to wrong assumptions and, just like other experience-based learning, this type of trial-and-error use of words tends initially to create situated knowledge [4].

To be able to extract and organize implicit sensory information available from several modalities is an important ability for an organism's success in its environment. Implicit learning processes are presumed to occur also in non-human mammals – a hypothesis currently being investigated by our collaborators at CMU. Explicitly concerning audio-visual integration in non-human animals, Ghazanfar & Logothetis [5] showed, using a preferential-looking technique, that rhesus monkeys (Macaca mulatta) were able to recognize the auditory-visual correspondence between their conspecific vocalizations ("coo" or "threat" calls) and the appropriate facial gestures (small mouth opening/protruding lips or big mouth opening/no lip protrusion). Just like the perception of human speech, the perception of monkey calls may thereby be enhanced by a combination of auditory signals and visual facial expressions. Bimodal perception in animals was viewed by the authors as an evolutionary precursor of humans' ability to make the multimodal associations necessary for speech perception. Such studies on non-human mammals are obviously important to examine questions that are impossible, unethical, or extremely difficult to answer with human listeners [6], like the pre- and post-operative comparisons of rhesus monkeys' performance in auditory-visual (e.g. noise-shape pairs) association tasks supporting the notion that the left prefrontal cortex plays a central role in the integration of auditory and visual information [7].

2.2 Modeling within speech recognition systems, neural networks and robotics

After about one year's exposure to their ambient language, children typically start to speak in order to interact with their environment. Despite its complexity, children soon learn the linguistic principles of their ambient language. However, this seemingly simple task is not easily transferred to formalized knowledge about language acquisition, nor is it easily integrated within speech recognition or operational models. Part of the problem that speech recognition systems battle with is presumably caused by the focus on the speech signal as the primary component of the speech communication process. Within natural speech communication, speech is only one, albeit crucial, part of the process, and the latest systems have started to make use of multimodal information to improve their communication efficiency. As an example, Salvi [8] analyzed the behavior of Incremental Model-Based Clustering on infant-directed speech data in order to describe the acquisition of phonetic classes by an infant. Salvi analyzed the effects of two factors within the model, namely the number of coefficients describing the signal and the frame length of the incremental clustering. His results showed that, despite a varying number of clusters, the classifications obtained were essentially consistent. In addition to the model by Salvi, the current collaboration with KTH further aims to develop a computational model able to handle information received from at least two different sensory dimensions.

We have recently reported [9] that two neural network models trained to process encoded video materials, which had earlier been tested on a group of adult subjects in a simple inference task (similar to the task in the current study), performed well on both classification and generalization tasks. The task of the first type of architecture was simply to associate the colors and shapes of the visual objects with the words corresponding to these two attributes, i.e. the model merely reproduced its input at the output level. The second architecture attempted to simulate the adult subjects' ability to apply their just-learned concepts to new contexts. The performance of the model was tested with novel data not previously seen by the network (either new colors or new shapes). The performance of both network models was robust and mimicked the adults' results well. Since the outcome of a neural network depends on the peculiarities of the input coding, its architecture and the specific training constraints, the type of neural network models described here would presumably not mimic the behavior of the infants in the current study well enough. One reason for a poor match would probably be the unlimited memory of the network, which enables it to process data without restriction, as compared with infants' restricted memory capacities. Also, neural networks, often using a sigmoid activation function for their nodes, are non-linear regression models in which small differences in input value may cause large differences in the computed output. Hence, a better way of modeling infant behavior would be to take the effects of memory constraints into account when predicting audio-visual integration.
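
For concreteness, the first type of architecture can be caricatured as a linear associative mapping from coded visual attributes to word labels; the coding, training set and least-squares learning rule below are illustrative assumptions, not the networks reported in [9], but the toy shows the same classification-plus-generalization behaviour.

import numpy as np

# Toy associative mapping from coded visual attributes to word labels
# (the coding, data and training method are illustrative assumptions).

colors = ["red", "yellow"]
shapes = ["cube", "ball"]
words  = ["nela", "lame", "dulle", "bimma"]   # nela=red, lame=yellow, dulle=cube, bimma=ball

def encode(color, shape):
    """One-hot coding of an object's colour and shape."""
    v = np.zeros(len(colors) + len(shapes))
    v[colors.index(color)] = 1.0
    v[len(colors) + shapes.index(shape)] = 1.0
    return v

def target(color, shape):
    """Multi-hot coding of the two words naming the object."""
    t = np.zeros(len(words))
    t[words.index({"red": "nela", "yellow": "lame"}[color])] = 1.0
    t[words.index({"cube": "dulle", "ball": "bimma"}[shape])] = 1.0
    return t

# Train on three of the four colour/shape combinations...
train = [("red", "cube"), ("red", "ball"), ("yellow", "cube")]
X = np.array([encode(c, s) for c, s in train])
Y = np.array([target(c, s) for c, s in train])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # linear associative weights

# ...and test generalisation on the held-out combination.
out = encode("yellow", "ball") @ W
named = [words[i] for i in np.argsort(out)[-2:]]
print("yellow ball ->", sorted(named))          # expected: ['bimma', 'lame']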

In addition to the development of communicative abilities leading to a more sophisticated use of language, children quickly learn to interact with their environment through observation and imitation of manipulative gestures. To explore whether these motor abilities develop independently, or whether fundamentally similar mechanisms are involved in the development of the perception and production of speech and of manipulative gestures, results from the current study and from similar experimental studies are tested within an embodied humanoid robot system. The hypothesis builds on the recent finding that action representations can be addressed not only during execution but also by the perception of action-related stimuli. Experiments with monkeys have shown that mirror neurons in premotor cortex (area F5) discharge both when a monkey makes a specific action and when it observes another individual making a similar action [10], [11] that is meaningful for the monkey. A mirror system is proposed to exist also in humans [12], and F5, the area for hand movements, is suggested to be the monkey homolog of Broca's area, commonly thought of as an area for speech, in the human brain [13]. Both F5 and Broca's area have the neural structures for controlling oro-laryngeal, oro-facial, and brachio-manual movements [14]. Further studies in neuroscience suggest that there are parallels between the mechanisms underlying actions such as manipulating, tearing, or putting an object on a plate, the mechanisms underlying actions recognized by their sound, and the mechanisms underlying speech systems [15].

For our research group at SU, the importance of modeling language acquisition lies unquestionably in the possibility of experimentally manipulating learning processes on the basis of experimental achievements and theoretical hypotheses formulated by us and by the other Neuroscience and Child-development partners.

3. METHOD

3.1 Subjects and procedure

The subjects were 160 Swedish infants randomly selected from the National Swedish address register (SPAR) on the basis of age and geographical criteria. Thirty-one infants were excluded from the study due to interrupted recordings (infant crying or technical problems). The remaining 129 infants were divided into two age groups: 26 boys and 27 girls in age-group I (age range 90-135 days, mean age 120 days) and 42 boys and 34 girls in age-group II (age range 136-180 days, mean age 156 days). The subjects were randomly assigned to watch one of four counterbalanced film sequences. The subjects and their parents were not paid to participate in the study. The infant was seated in a safe baby-chair or on the parent's lap at approximately 60 cm from the screen. The parent listened to masking music through soundproof headphones during the whole session. The infants' eye-movements were recorded with a Tobii (1750, 17" TFT) eye-tracking system using low-intensity infra-red light. The gaze estimation frequency was 50 Hz, and the accuracy 0.5 degrees. The software used for data storage was ClearView 2.2.0. The data was analyzed in Mathematica 5.0 and SPSS 14.0.

3.2 Materials

The films' structure was: BASELINE (20 s), EXPOSURE1 (25 s), TEST1 (20 s), EXPOSURE2 (25 s), TEST2 (20 s), EXPOSURE3 (25 s), and TEST3 (20 s). The image used to measure infants' pre-exposure bias in BASELINE is shown in Figure 1. During BASELINE the audio played a lullaby to catch the infant's attention towards the screen.

The elements used in the EXPOSURE phases are shown in Figure 2. Each of the elements was shown in 6 s long film sequences. Each object moved smoothly across the screen while the audio played 2 × repetitions of two-word phrases, such as nela dulle (red cube), nela bimma (red ball), lame dulle (yellow cube), lame bimma (yellow ball), implicitly referring to the color and shape of the object. The two-word phrases were read aloud by a female speaker of an artificial language in an infant-directed speech style. The "nonsense" words were constructed according to the phonotactic and morphotactic rules of Swedish. However, the prosody of the phrases did not mimic the Swedish prosody of two-word phrases – the words were instead pronounced as if they occurred in isolation, without sentence accent on either of the words.

During TEST1, TEST2 and TEST3 the image shown in Figure 1 appeared again while questions such as Vur bu skrett dulle? (Have you seen the cube?) or Vur bu skrett nela? (Have you seen the red one?) were asked. The questions were naturally produced using question intonation.

In each one of the four counterbalanced versions of the film there was one question on shape of an object and another on color of an object. In the test phases the target shapes appeared in new colors and the target colors appeared in new shapes – as compared with the shape-color combinations during exposure phases – in order to address generalization of learned associations into new contexts.

4. RESULTS

We predicted that if infants are capable of extracting the objects' underlying properties, their looking times towards the relevant target color (Red or Yellow) and target shape (Cube or Ball) of an object will increase from BASELINE (pre-exposure bias) to TEST1-3 (post-exposure).
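
The dependent measure can be illustrated with a small sketch that converts raw gaze samples into the proportion of looking time per screen quadrant; the sample format, screen resolution and toy data below are assumptions, not the actual ClearView/Mathematica/SPSS pipeline used by the authors.

import numpy as np

# Sketch: normalised looking time per screen quadrant from gaze samples.
# Sample format, screen size and phase windows are assumptions.

SCREEN_W, SCREEN_H = 1280, 1024          # assumed display resolution
RATE = 50.0                              # Hz, as reported for the eye tracker

def quadrant(x, y):
    """Map a gaze point to UL, UR, LL or LR."""
    horiz = "L" if x < SCREEN_W / 2 else "R"
    vert = "U" if y < SCREEN_H / 2 else "L"
    return vert + horiz                   # e.g. "UL"

def looking_proportions(samples):
    """samples: iterable of (x, y) gaze points; NaN means gaze was lost."""
    counts = {"UL": 0, "UR": 0, "LL": 0, "LR": 0}
    valid = 0
    for x, y in samples:
        if np.isnan(x) or np.isnan(y):
            continue
        counts[quadrant(x, y)] += 1
        valid += 1
    return {q: c / valid for q, c in counts.items()} if valid else counts

# Toy usage: compare a baseline phase with a test phase for the "UL" target.
rng = np.random.default_rng(2)
baseline = rng.uniform([0, 0], [SCREEN_W, SCREEN_H], size=(int(20 * RATE), 2))
test = np.vstack([rng.uniform([0, 0], [SCREEN_W / 2, SCREEN_H / 2], size=(600, 2)),
                  rng.uniform([0, 0], [SCREEN_W, SCREEN_H], size=(400, 2))])
print("baseline UL:", round(looking_proportions(baseline)["UL"], 2))
print("test UL:    ", round(looking_proportions(test)["UL"], 2))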

Figure 1: Split-screen displaying four objects shown during the BASELINE along with a lullaby to measure infants' spontaneous looking bias before the three repetitions of exposure and subsequent test phases.

After presentation (EXPOSURE1-3) of the objects (Figure 2), the split-screen was displayed again during TEST1-3 along with questions such as Vur bu skrett dulle? (Have you seen the cube?) and Vur bu skrett nela? (Have you seen the red one?). The films were counterbalanced regarding the choice of words corresponding to the colors/shapes of the objects.

Figure 2: An illustration of the elements shown during EXPOSURE1-3. Each object moved smoothly across the screen (in 6 s long film sequences) while the audio played 2 × repetitions of two-word phrases, such as nela dulle (red cube), nela bimma (red ball), lame dulle (yellow cube), lame bimma (yellow ball), implicitly referring to the color and shape of the object. The films were counterbalanced regarding the presentation order of the objects during exposure.

The results (Figure 3) showed increased looking times towards Red upper-left (UL), Cube upper-right (UR), Ball lower-left (LL) and Yellow lower-right (LR) from BASELINE to TEST1-3. The increments in looking time were larger in response to target shapes (Cube and Ball) as compared with target colors (Red and Yellow). This was in particular true for age-group II. Furthermore, repeated measures ANOVA on BASELINE to TEST1-3 contrasts revealed significant differences for Red, Cube and Ball.

The looking behavior of age-group II – as indicated by the error bars – was more stable than the looking behavior of age-group I. An analysis of BASELINE to TEST1-3 contrasts with age as a factor revealed an overall age-group tendency for Red (F = 4.106, d.f. = 18, P < 0.058), indicating disparity in the looking behavior of infants in age-groups I and II.

There were no significant interactions between the times spent looking at the target and the choice of words corresponding to the colors and shapes of the objects, or the object position (left or right) on the split-screen. Infants did, however, look longer towards the upper quadrants (target Red F = 37.564, d.f. = 54, P < 0.0005 and target Cube F = 26.758, d.f. = 63, P < 0.0005) than towards the lower quadrants.

Figure 3: The attribute targets were Red (UL), Cube (UR), Ball (LL) and Yellow (LR). The bars in each quadrant, from left to right, respectively indicate: pre-exposure bias (normalized looking time at the future target during BASELINE) and post-exposure target dominance (normalized looking time at the target during TEST1-3) for age-groups I and II respectively.

5. DISCUSSION

Whereas early studies on language acquisition in infants were focused on the production and perception of isolated speech sounds [16], recent experimental studies have addressed the structure of perceptual categories [17] and word extraction from connected speech [18], [19], [20], [21]. The current study further expands the scope of these investigations by addressing the emergence of referential function, integrating

the recognition of patterns in the auditory input with the recognition of patterns in the visual input. In addition, our theoretical outline views early language acquisition as a consequence of general sensory and memory processes. Through these processes auditory representations of sound sequences are linked to co-occurring sensory stimuli, and since spoken language is used to refer to objects and actions in the world, the implicit correlation between hearing words relating to objects and seeing (or feeling, smelling or otherwise perceiving) the referents can be expected to underlie the acquisition of spoken language.

The materials in the current study were, as opposed to synthetic stimuli, naturally produced two-word phrases, whose linguistic meaning emerges from their situated implicit reference to the shapes and colors of the visual geometrical objects. The adjective-noun pattern in the two-word phrases followed the Swedish syntactic pattern. This situation, in which four new words were mapped onto a known grammatical structure, is of course different from the challenge faced by a pre-linguistic child, but simplifying the experimental materials (to only two shapes and two colors) was necessary to reveal the essential process of audio-visual integration during a short exposure (about 1 minute) in young infants (< 6 months). Also, because success in the test phases required generalization of the learned associations to new visual contexts, finding the correct visual target cannot be seen as a simple "translation process".

The age range of the subjects was 90-135 days (age-group I) and 136-180 days (age-group II). These age groups were selected to investigate the extent of age-related differences in the infants' capacity to learn the covariance between the two different types of attribute targets (shapes and colors) and the words corresponding to them. Indeed, despite the fact that color words typically appear only after about ten months, these young infants showed some evidence of being able to pick up the color references implicit in the current experiment. Thus, although infants in this age range may not have the capacity to handle the concept of color as such, they at least demonstrate, as observed in other mammals, an underlying capacity to associate recurrent visual dimensions with their implicit acoustic labels.

6. ACKNOWLEDGEMENTS Research supported by the Swedish Research Council (421-2001-4876), the Bank of Sweden Tercentenary Foundation (K2003-0867) and Birgit & Gad Rausing’s Foundation.

7. REFERENCES

[1] N. Chomsky, Language and mind. New York: Harcourt Brace Jovanovich, 1968.

[2] S. Pinker, The Language Instinct: How the Mind Creates Language, 1st ed. New York: William Morrow and Company, Inc., 1994, pp. 15-472.

[3] F. Lacerda, E. Klintfors, L. Gustavsson, L. Lagerkvist, E. Marklund, and U. Sundberg, "Ecological Theory of Language Acquisition," in Epigenetics and Robotics 2004, Genova: Epirob 2004, 2004.

[4] E. J. Gibson and A. D. Pick, An Ecological Approach to Perceptual Learning and Development. Oxford University Press US, 2006.

[5] A. A. Ghazanfar and N. K. Logothetis, "Neuroperception: Facial expressions linked to monkey calls," Nature, vol. 423, pp. 937-938, 2003.

[6] K. R. Kluender, A. J. Lotto, and L. L. Holt, "Contributions of nonhuman animal models to understanding human speech perception," in Listening to Speech: An Auditory Perspective, S. Greenberg and W. Ainsworth, Eds. New York, NY: Oxford University Press, 2005.

[7] D. Gaffan and S. Harrison, "Auditory-visual associations, hemispheric specialization and temporal-frontal interaction in the rhesus monkey," Brain, vol. 114, no. 5, pp. 2133-2144, 1991.

[8] G. Salvi, "Ecological Language Acquisition via Incremental Model-Based Clustering," in InterSpeech'2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, 2005.

[9] F. Lacerda, E. Klintfors, and L. Gustavsson, "Multi-sensory information as an improvement for communication systems efficiency," in Fonetik 2005, Gothenburg, 2005, pp. 83-86.

[10] G. Rizzolatti and M. A. Arbib, "Language within our grasp," Trends in Neurosciences, vol. 21, no. 5, pp. 188-194, May 1998.

[11] C. Keysers, E. Kohler, M. A. Umilta, L. Nanetti, L. Fogassi, and V. Gallese, "Audiovisual mirror neurons and action recognition," Experimental Brain Research, vol. 4, pp. 628-636, 2003.

[12] L. Fadiga, L. Fogassi, G. Pavesi, and G. Rizzolatti, "Motor Facilitation During Action Observation - A Magnetic Stimulation Study," Journal of Neurophysiology, vol. 73, no. 6, pp. 2608-2611, June 1995.

[13] G. Rizzolatti and M. A. Arbib, "Language within our grasp," Trends in Neurosciences, vol. 21, no. 5, pp. 188-194, May 1998.

[14] G. Rizzolatti and M. A. Arbib, "Language within our grasp," Trends in Neurosciences, vol. 21, no. 5, pp. 188-194, May 1998.

[15] E. Kohler, C. Keysers, M. A. Umilta, L. Fogassi, V. Gallese, and G. Rizzolatti, "Hearing sounds, understanding actions: action representation in mirror neurons," Science, vol. 297, no. 5582, pp. 846-848, Aug. 2002.

[16] P. D. Eimas, E. R. Siqueland, P. Jusczyk, and J. Vigorito, "Speech perception in infants," Science, vol. 171, pp. 303-306, 1971.

[17] P. Kuhl, K. Williams, F. Lacerda, K. N. Stevens, and B. Lindblom, "Linguistic experience alters phonetic perception in infants by 6 months of age," Science, vol. 255, no. 5044, pp. 606-608, Jan. 1992.

[18] P. Jusczyk, "How infants begin to extract words from speech," Trends in Cognitive Sciences, vol. 3, no. 9, pp. 323-328, Sept. 1999.

[19] P. Jusczyk and R. N. Aslin, "Infants' Detection of the Sound Patterns of Words in Fluent Speech," Cognitive Psychology, vol. 29, no. 1, pp. 1-23, Aug. 1995.

[20] P. Jusczyk and E. Hohne, "Infants' Memory for Spoken Words," Science, vol. 277, no. 5334, pp. 1984-1986, Sept. 1997.

[21] J. R. Saffran, R. N. Aslin, and E. Newport, "Statistical learning by 8-month old infants," Science, vol. 274, pp. 1926-1928, 1996.

Ett ekologiskt perspektiv på tidig talspråksutveckling (An ecological perspective on early spoken language development)

Francisco Lacerda

Department of Linguistics, Stockholm University

To appear in 10th Nordisk Barnspråkssymposium (Gävle, November 2005)

The beginning of language…

Imagine that you are a newcomer to an Earth-like planet whose inhabitants also use language to communicate with each other. Imagine, moreover, that on top of everything you have forgotten everything you knew, including your own language, during your long journey. In practice you are like an infant arriving in a spaceship that is soon discovered by the planet's inhabitants. You are taken care of and, to begin with, you can use your sense organs to perceive sound, light, weight, roughness, and so on. You are generally curious about what happens around you, but at the same time you have no expectations about, or knowledge of, your surroundings. What preconditions are needed for you to be able to learn this new language?

If you only have access to the sounds that the inhabitants use when they communicate with each other (you can neither see nor interact with the inhabitants), it will most likely be impossible for you to work out what these sounds stand for. Since you do not even know that the sounds can be used for communicative purposes, let alone that the sound sequences you hear can be regarded as strings of word-like elements, you will at first perceive these sound sequences as unstructured and isolated acoustic events. To begin with you will even mix up these communicative acoustic events with other acoustic events that carry no communicative meaning. In your naive view of the acoustic properties of the surrounding world, their most salient property is the very contrast with the silence you experience when there is not enough acoustic energy to stimulate your auditory system. The risk is that you will at first confuse all kinds of sounds, communicative and non-communicative, because the first thing you notice is simply that there are sounds that somehow deviate from the background. In this way, and without being aware of it, you achieve a basic binary categorization that exploits sound intensity. In the next step you begin to notice differences that also exist among the sounds belonging to the category "sound", and once again you sort these into two different categories according to other acoustic dimensions. Note that categorization, in itself, is neither right nor wrong. It is not adapted to what the sounds might mean, but merely reflects a way of organizing acoustic impressions depending on attention, which in turn is influenced by the observations themselves. In parallel with the growth in the number of sound categories, the discovery of sequential relations between the categories can also be observed, for instance that a certain sound category is often followed by another. On the whole, one can say that your exposure to the sounds on this foreign planet leads you to learn to categorize these sounds according to different acoustic criteria, and also to learn to expect that certain sound categories follow certain others. In spite of this, you can still hardly understand how these sounds are used for communicative purposes. In fact, "communicative purpose" is not even a problem for you, since in this scenario you only hear sounds that, from your perspective, are not related to anything else.

The situation changes radically, however, if you, unlike in the situation above, also have access to at least one other sensory impression while you hear the sounds around you. Assume, for simplicity, that the same kind of categorization process is applied to the stimuli represented by the new sensory dimensions. The decisive difference is that the events in your surroundings are no longer one-dimensional (represented only by their acoustic dimension). Each event is now represented by a point in a multidimensional space, where the position of the point at every moment corresponds directly to the running status of your combined sensory impressions. Note, however, that while the transition from one sensory dimension to several sensory dimensions is of decisive importance for our argument, the number of dimensions of the multisensory representation space is not. The principled reasoning carried out in two dimensions can be generalized directly to multidimensional representation spaces. This is because language is, by definition, built on the use of some kind of symbols that refer to something other than the symbol itself, which implies that there must be at least two elements that are related to each other. These elements usually belong to different sensory dimensions, such as the auditory and the visual, but can in principle consist of very general relations between different dimensions, provided that the relations are known among the individuals who use this language to communicate with each other. In our new scenario we can thus imagine that you naively observe both the auditory and the visual sensory impressions around you on this foreign planet. The question is whether you now have better prospects than before of learning the inhabitants' language.

In the new situation you can both hear what the inhabitants say and, at the same time, see what they do. Since we assume that the inhabitants have a language based on sound symbols, we can take for granted that these symbols will have definite (but to you unknown) relations to the visual objects in this "two-sensory" world. You do not, however, know what these audio-visual relations look like, nor even that such relations exist at all. What is required for you, in spite of this, to work out how the inhabitants use their language?

If we assume that you also apply to the visual dimension the same categorization principle as within the auditory domain, you will at first have only a handful of categories to keep track of. Sound versus non-sound, combined with "visible" versus non-visible objects, creates four possible classes in this two-dimensional space. With two available dimensions, however, the number of audio-visual categories will grow quickly, since the total number of possible categories is always given by the product of the number of distinct categories along each dimension, i.e.

$Tot\_kat = \prod_{i=1}^{N} kat_i$,

where $kat_i$ stands for the number of categories along dimension $i$ and $Tot\_kat$ represents the number of possible categories when all $N$ dimensions are used (in this example $N = 2$). Since the inhabitants use sounds to refer to visual objects, your originally binary categorization along these two dimensions will lead you to discover that sounds often occur together with visible objects, whereas non-sound goes together with non-visible objects. These relations are of course not deterministic, but they reflect a general relation between the inhabitants' sound symbols and the visual objects to which the symbols refer. The essential point here is that, through observing the concrete cases in which this relation manifests itself, you will link the category "sound" with the category "visible object" without even knowing which linguistic principles are being used. The fact that these two categories often appear together reflects, in itself, the inhabitants' linguistic principles. As in the one-dimensional case, your categorizations will here too turn out to be divisible into further categories. Realizing that finer subdivisions can be made along these dimensions means, at the same time, that you can in fact discover narrower audio-visual relations. Without necessarily gaining insight into the actual linguistic relation between the sound symbols and their referents, your observation of these coupled phenomena means that in practice you act as if you mastered the linguistic process. And once you have discovered that there are systematic relations between sounds and visual objects in some cases, it is natural to generalize the observation and begin a hypothesis-driven search for further auditory-visual relations. As the process is described here, you will go from an exemplar-based representation of observed audio-visual relations to an active and principled insight that this kind of relation exists. You have, in other words, gained insight into how the inhabitants use sounds to refer to visual objects, and it was possible to arrive at this insight precisely because you had access to a two-dimensional representation of your surroundings.

An important conclusion of this galactic exercise is that multisensory observation of linguistic scenarios can be considered to contain enough information for the discovery of linguistic symbolic reference to be derived spontaneously, since agents who use a language to communicate with each other exploit (relatively) stable symbolic relations between different sensory dimensions (Minsky, 1985).
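
As a toy illustration of this co-occurrence argument (invented for illustration, not part of the original essay), one can tabulate how often the binary auditory category "sound" coincides with the binary visual category "visible object" and read the association off the contingency table.

import numpy as np

# Toy contingency table for the thought experiment above: binary auditory
# events (sound / no sound) against binary visual events (object / none).
# The event stream is invented for illustration.

rng = np.random.default_rng(3)

n = 10000
obj = rng.random(n) < 0.3                       # an object is visible 30% of the time
# sound tends to co-occur with a visible object, but not deterministically
sound = np.where(obj, rng.random(n) < 0.8, rng.random(n) < 0.1)

table = np.array([[np.sum(sound & obj),  np.sum(sound & ~obj)],
                  [np.sum(~sound & obj), np.sum(~sound & ~obj)]])
print("contingency table (rows: sound/no sound, cols: object/no object):")
print(table)

# Pointwise mutual information of the (sound, object) cell as a simple
# measure of association between the two categories.
p_s, p_o = sound.mean(), obj.mean()
p_so = (sound & obj).mean()
print("PMI(sound, object) =", round(np.log2(p_so / (p_s * p_o)), 2))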

An ecological theory of early spoken language acquisition

Against the background of the thought exercise above, let us now try to formulate what we will call an ecological theory of early language acquisition (ETLA, Ecological Theory of Language Acquisition). Our ambition is to be able to explain how the infant's early spoken language acquisition can be an unintentional, emergent consequence of the interplay between the infant's general biological prerequisites and its typical ecological context, and to create a testable ETLA model that formalizes our theoretical assumptions (Lacerda et al., 2004).

ETLA's starting point is that early spoken language acquisition arises through the interplay between the infant's non-language-specific biological prerequisites and the child's linguistic environment. The infant's biological prerequisites really encompass all anatomical and physiological aspects as well as all sensory dimensions, but the main features of the theory are best illustrated from a more limited and manageable perspective. We will therefore restrict ourselves mainly to considering the anatomical configuration of the infant's vocal tract and its possibilities of movement, together with the child's auditory and visual representational capacity. We assume that the child has a general memory capacity that makes it possible to recognize stimuli it has been exposed to before. For the moment, however, we ignore the child's capacity for action, which in practice leads the child to reveal its interpretations of the surrounding world by acting on its expectations and interests. These active actions will turn out to play an important role in spoken language development, and we shall therefore return to this aspect. Regarding the child's linguistic environment, we must consider so-called "infant-directed speech", which in our context will turn out to be an essential component of the child's early spoken language development. Admittedly, infant-directed speech is only a small part of all the speech an infant gets to hear (van der Weijer, 1999), but it is qualitatively so different from the other speech signals that it is reasonable to ascribe a special status to infant-directed speech in early spoken language development. We begin by taking infant-directed speech as our starting point and then return to the question of how representative this speech style is in relation to the other speech styles the child also gets to hear.

Our explicit initial assumptions are the following:

• The infant

o has a general multisensory memory capacity (recognizes recurring stimuli and has proprioceptive memory)

o has a general ability to discriminate and represent sounds

o has a general ability to discriminate and represent visual stimuli

o has a vocal tract whose anatomy differs (in its proportions) from the adult's

o readily produces sounds with the speech apparatus

• The child's linguistic environment

o Adults use infant-directed speech when they talk to the infant

Assumptions

General memory capacity
At the perceptual level we know that infants show the ability to recognize objects and sounds they have encountered before. Since this recognition ability is exploited in habituation/dishabituation techniques that work even with newborns, one can assume that it most likely involves very basic and general recognition processes. This kind of general recognition ability also plays an important role for the organism's survival in its ecological niche, since it enables the organism, among other things, to return to previously favourable environments or to avoid those that created unpleasant memories.

Discrimination and representation of sounds
This ability is well documented, and various experiments with newborns demonstrate that it is present from birth. Newborn infants have been shown to be able to discriminate essentially all the linguistically relevant sound contrasts they have been tested on. More generally, frequency discrimination and the hearing threshold have been shown to follow the adult capacity fairly closely. The absolute hearing threshold is, however, somewhat elevated relative to adults' when measured with behavioural methods, which may reflect the infant's inability to concentrate on detecting a tone whose intensity during some time intervals may lie below the hearing threshold. Unlike a cognitively more mature subject, the infant cannot follow instructions to attend to the faintest detectable sound and signal this by pressing a button.

Discrimination and representation of visual stimuli
Here too it turns out that infants have sufficiently good visual discrimination and representation abilities from at least about 2 months of age. We can therefore assume that infants of that age observe visual scenes in roughly the same way as adults do.

Anatomical differences between the infant's vocal tract and the adult's
The infant's vocal tract is not a scaled-down version of the adult's. The infant's pharynx is comparatively much shorter than the adult's, and the proportions between the length of the pharynx and the length of the oral cavity change dramatically during the infant's first year of life. The pharynx does continue to grow even after infancy, but at a considerably slower rate than during the infant's first year.

Infant babbling during the first year of life
Infants vocalize and seem to enjoy exploring the limits of their vocalizations. Despite strong individual preferences, infants' babbling development generally begins with back vocalizations that are gradually "moved forward" in the mouth. This developmental trend can be regarded as creating a certain structure in the child's path towards the ambient language.

Infant-directed speech
This speech style is characterized by large modulations of the fundamental frequency, relatively simplified utterance structures, short utterances and frequent repetitions.

Context-dependent learning

Infant-directed speech
Under typical conditions infants are exposed to spoken language immediately after birth, and the newborn usually shows a noticeable interest in the surrounding spoken language, especially the mother's speech (Spence & de Casper, 1987; de Casper & Spence, 1986; de Casper & Fifer, 1980). This spontaneous interest in the ambient language has often been interpreted as a sign of an innate, genetically determined disposition to learn the ambient language, but we propose that this interest, too, has a reasonable ecological explanation. Since the fetus's auditory system matures around the fifth month of pregnancy, the fetus can hear external sounds transmitted through the pregnant woman's tissues, as well as the mother's speech sounds transmitted directly, via the woman's anatomical structures, from the vibration of the vocal folds and the resonance cavities associated with speech production itself to the amniotic fluid in which the fetus develops. The fact that the fetus is exposed to sounds already before birth implies a kind of familiarity with, above all, the prosodic aspects of the mother's speech, which explains the newborn's preference for the surrounding spoken language as a consequence of general memory processes, without the need to invoke language-specific innate predispositions.

The typical characteristics of so-called infant-directed speech can also be regarded as a consequence of the fetus's prenatal exposure to the mother's speech. The strictly phonetic starting point is that the newborn infant shows signs of recognizing the prosodic patterns used by the mother and, to a large extent, by other speakers of the native language. The child's recognition expresses itself as orientation towards the speaker who uses the familiar prosodic patterns, while the child's interest in the speech is also noticed by the speaker. Since in this scenario it is primarily the prosody itself that the newborn is interested in, the speaker responds by accentuating precisely those phonetic features that the newborn seems to react to. The child's interest in the speech is maintained, and the speaker is intuitively led to emphasize the phonetic features that served to maintain contact with the newborn. The interaction between the newborn and an (adult) speaker of course involves important emotional aspects that are not considered here, but the emotional charge surrounding this kind of interaction probably plays an important role in confirming and stabilizing the child's interest in this infant-directed speech. The combined result of the infant's interplay with the speaker in this ecological context is thus an emphasis on the phonetic features that tend to maintain the interplay, i.e. primarily the prosodic features that are recognized by the child and contribute to maintaining the communicative exchange. Figure 1 shows the F0 contours produced by a mother when she talked to her 3-month-old infant (left) and when she talked to an adult, during the same recording session.

Figure 1. Waveforms and F0 contours from the same speaker producing infant-directed speech (left, 3-month-old child) and adult-directed speech (right). The upper part of the Y axis shows the amplitude of the signal, the lower part of the Y axis shows F0 frequency in cents (1 octave = 1200 cents), and time is shown in seconds (different time scales).

A further phonetic consequence of the speaker's emphasis on prosodic features is that sonorant sounds will play the leading role in this context, since it is this type of sound that can most easily carry the F0 modulations that capture the child's interest. If we furthermore assume that the young infant has no linguistic grounds for taking an interest in, or focusing on, one or another category of speech sounds, it seems reasonable to expect that the infant's attention to different types of speech sounds is governed primarily by relatively coarse acoustic criteria, such as loudness. According to this hypothesis, it is thus sounds whose intensity rises above the rest of the acoustic environment that the infant attends to; but how can the child distinguish relevant speech sounds from other, roughly equally loud, acoustic phenomena in the surroundings?

The principal answer to this question is that the infant cannot. There is thus a risk that salient sounds which capture the child's attention have no linguistic significance whatsoever. The child will surely attend to some instrument being played, or to sounds from some piece of equipment, and would in principle be misled by these "meaningless" but sufficiently loud sounds from the environment. But if the small child's ecological environment is taken into account, this risk appears considerably smaller than one might at first fear. On the one hand, the fetus's prenatal auditory experience creates a disposition to focus precisely on acoustic signals whose prosodic properties resemble the mother's speech. This in itself means that speech-like signals will generally be regarded as more familiar than other acoustic signals lacking the prosodic features of speech. In addition, typical adult-child communication involving the speech signal takes place in an ecological context characterized by closeness between the speaker and the child and a relatively quiet acoustic background. When these two aspects are taken into account, the physical intensity of the speech signal suddenly becomes an important variable which, despite its generic and otherwise not particularly relevant linguistic significance, can lay the ground for the child's first steps towards the ambient language. According to this hypothesis, sonorants should be the speech sounds that most easily capture the child's attention, since these sounds typically have relatively high acoustic energy. The small child's early interest in modulations of the fundamental frequency of the speech signal also leads to sonorants implicitly ending up in the child's spontaneous auditory focus, since they clearly convey precisely the F0 variations the child is interested in. If spectral aspects are taken into account, the picture of the "most auditorily salient" speech sounds may change somewhat, since fricatives can contrast with the spectral characteristics of the acoustic surroundings. Fricatives are not particularly energy-rich speech sounds compared with sonorants, but they often contain high-frequency acoustic energy, which means that their spectra generally deviate from the sonorants' characteristic mid- and low-frequency spectral energy. Because fricatives deliver acoustic energy in frequency bands that the more low-frequency sonorants leave relatively free, fricatives can still create sufficient auditory contrast relative to sonorants and thus stand out despite their lower overall acoustic energy.

The infant's perception of speech sounds
The reasoning above may seem somewhat unnecessary against the background of studies, during the 1970s and 1980s, of infants' ability to discriminate between different speech sounds. The results of these studies point to the infant's ability to keep apart most of the speech sound contrasts they have been tested on (Jusczyk et al., 1993; Bertoncini et al., 1988; Mehler et al., 1988; Jusczyk & Derrah, 1987), and it is then tempting to conclude that the infant should have no problem attending to the same sound contrasts when they occur in continuous speech in the infant's environment. In reality, however, there is a considerable risk that discrimination results from perception experiments with single sound contrasts cannot be transferred to continuous speech. In a perception experiment the speech sounds are often presented in isolation and under optimal acoustic conditions, whereas in ecologically relevant continuous speech the same speech sounds are affected by the phonetic context, while the child's focus on precisely these contrasts is disturbed by other competing speech sounds. In other words, the results of these early speech perception studies with infants convey a kind of principled ability to discriminate linguistically relevant segmental contrasts, but it is not given that the infant will manage to exploit this principled ability to derive the linguistic function of these contrasts. Studies of how infants' ability to discriminate different vowel and consonant contrasts develops in fact indicate that these contrasts are treated in a way that changes with the child's maturation. The combined message of these studies is that the infant's discrimination of isolated speech sounds goes from predominantly acoustically conditioned discrimination to a more linguistically based discrimination as the child matures. Contrasts that were initially discriminated, presumably because the sounds were simply acoustically different, gradually begin to be ignored because they are in fact "linguistically irrelevant" allophonic variations in the child's ambient language. Thus, the "lost" discrimination ability reveals a process of linguistic maturation which may mean that the child is beginning to understand which the linguistically relevant acoustic features are.

Broadly speaking, the scenario sketched above suggests that general auditory and memory processes can lead the young infant to focus on certain categories of speech sounds. The starting point of this ecological theory of early spoken language acquisition is precisely that the infant's interest in speech, and presumably in certain categories of speech sounds, arises as a consequence of the ecological context in which adult-child interaction takes place, rather than because of specialized language-oriented skills that the newborn comes equipped with. The next question is how linguistic functions can be derived from the newborn's general interest in the surrounding speech, especially the mother's speech. In other words, is there enough linguistic information in the speech signal typically directed to young infants to derive basic linguistic functions at all?

Deriving linguistic structure from infant-directed speech
At the Department of Linguistics, Stockholm University, our group for studies of early spoken language development investigates how the speech that adults direct to their infants may be adapted to the child's age, or to what the adults perceive as the child's level of maturity. Our preliminary results clearly indicate that the mother's speech style changes depending on the child's age. As expected, the mothers use variations in F0 to capture the young infant's attention, but in addition the mothers also tend to use a highly repetitive speech style when addressing their approximately 3-month-old infants. These repetitions, however, appear to decrease sharply as the child approaches 12 months of age.

Figure 2a illustrates how the infant-directed speech of the previous figure may be perceived from the infant's naive perspective. Since the infant is assumed not yet to have learned what language is, the transcription represents the chunks of sound heard by the child. In this representation there are therefore no spaces between the words, because according to the hypothesis words do not yet exist in the child's linguistic world. According to the principle sketched above, what exists from the infant's perspective are sound sequences with sufficiently high intensity to be attended to and to contrast with the rest of the acoustic background. In this illustration the spaces represent only pauses that delimit the holistic sound sequences.

According to ETLA, linguistic structure can be derived from these holistic, unanalyzed sound sequences, because the adult repeats chunks of sound and because these chunks are often integrated into longer sound sequences. ETLA's model applied to this material shows precisely that relatively simple recognition processes lead to the discovery of sound strings that are reused and combined in different sequences. In other words, the holistic sound sequences the infant is exposed to can be organized in terms of recursive combinatorial use of parts of the original sound strings. Figures 2b and 2c show a number of possible "lexical candidates" that can be derived from this material, provided the infant were to store all the sound sequences in memory and discover all the repetitions and re-uses of them. That assumption is, in itself, unrealistic, since the infant's attention probably fluctuates while it is exposed to the speech material used in the illustration, but it still shows the potential of ETLA's principle.

(2a) hej, skavilekamedleksakidag, skavigöradet,ha, skavileka, tittahär, kommerduihågdenhärmyran,hardusett, titta, nukommermyran, dumpadumpadump,hej, hejsamyran, kommerduihågdenhärmyranmaria,ådenhärmyransomdublevalldelesräddförförragången, ja,mariablevalldelesräddfördenhärmyran, ädemyranoskar,myranoskaräde, tittavafinmyranä, åvamångaarmarhanhar,hardusettvamångaarmarmaria, hejhejmaria, hallå,hallå, hallåhallå, hejhej, ojojsågladsäjerOskar,ämariagladidag, ämariagladidag, ha, hejmariasäjermyran

(2b) ämariagladidag, ha, hallå, hardusett, hej, hejhej, kommerduihågdenhärmyran, skavileka, titta

(2c) medleksakidag, här, samyran, maria, vafinmyranä, åvamångaarmarnr, vamångaarmarmaria, maria, mariasäjermyran

Figure 2. (a) Transcription (in Swedish) of a stretch of speech directed to a 3-month-old infant. Words uttered in sequence are represented here without spaces between them. The commas represent pauses in the mother's speech. (b) Lexical candidates after a first matching of repeated chunks. (c) Further lexical candidates remaining in the chunks after subtraction of the lexical candidates obtained in (b).

Another unrealistic assumption in the example above is, moreover, that the infant would only have access to the acoustic signal.

When the visual information is also taken into account, in combination with a more limited memory capacity, the results immediately become more realistic. A model that combines the infant's visual focus and also includes a memory limitation, simulating that the infant can only hold the last five utterances produced at any given moment in short-term memory, yields more realistic results. The results now point to a small number of lexical candidates which, in return, effectively bind the object's visual representation to the utterances that contain the object's name. This kind of audiovisual repetition is assumed, according to ETLA, to form the basis for the further discovery of linguistic functions, in particular the association between an object's visual representation and its name, represented by a sound sequence. The number of lexical candidates that can be derived from speech directed to older children decreases dramatically as the child begins to show signs of knowing the names of certain objects or events. The adult appears to replace the repetitive speaking style with more content-oriented speech that takes the child's emerging linguistic concepts as its starting point. In adult-directed speech, the repetitions that occur in infant-directed speech (at 2-3 months) hardly seem to occur at all.
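A correspondingly minimal sketch of the audiovisual variant is given below. Each utterance is assumed to arrive paired with a label for the object in the infant's visual focus, and only the last five utterances are kept in "short-term memory", following the description above; the object labels, the scoring and the helper function are illustrative assumptions.

from collections import Counter, deque
from difflib import SequenceMatcher

def longest_common_chunk(a, b):
    # Longest substring shared by two utterances.
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def audiovisual_candidates(stream, window=5, min_len=4):
    # `stream` is a sequence of (utterance, visual_focus) pairs. A chunk is
    # credited to an object each time it recurs, within the memory window,
    # while that object is in focus.
    memory = deque(maxlen=window)
    counts = Counter()
    for utterance, focus in stream:
        for previous in memory:
            chunk = longest_common_chunk(utterance, previous)
            if len(chunk) >= min_len:
                counts[(chunk, focus)] += 1
        memory.append(utterance)
    return counts

# Hypothetical pairing of figure 2a utterances with the attended object:
stream = [("tittavafinmyranä", "ant"), ("ädemyranoskar", "ant"),
          ("hejhejmaria", "face"), ("nukommermyran", "ant")]
print(audiovisual_candidates(stream).most_common(3))  # ('myran', 'ant') dominates

The point of the sketch is only that the combination of recurrence and shared visual focus concentrates the candidate list on object-name pairings, as described above.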

The infant's own actions and their interpretation

Although the young infant is entirely dependent on its environment for survival, it is far from passive. The infant can call for the adult's attention by generating actions of its own, such as screaming, crying, vocalising, gesturing or smiling, depending on the infant's developmental level. Through this kind of action the infant can initiate interaction with, or respond to, its environment. The infant's ability to generate and control actions, albeit to a limited extent, is in itself an important component of early spoken language development as viewed by ETLA. It means that the infant can, in principle, exchange information with its environment, which in itself is the basis of the early spoken language development that ETLA sets out to explain.

The infant's possibilities for differentiated and well-controlled vocalisations and motor actions are, however, limited, which is often seen as an obstacle to the young child's early spoken language development. From ETLA's perspective, however, these initial limitations of the infant's vocal and motor search space are regarded as an advantage for the child, since this initially limited repertoire of articulatory possibilities actually means that the infant's search space is implicitly structured and somewhat simplified. The anatomical differences between the newborn infant's vocal tract and the adult's vocal tract offer a good example of how the acoustic-articulatory search space can in fact be structured by such initial "limitations".

In newborn infants the glottis is positioned very high, and the epiglottis can often be seen making contact with the uvula when the young child opens its mouth wide. In other words, the pharyngeal region of a newborn infant is so short that the infant's vocal tract can be regarded as consisting of almost only the oral cavity, compared with the adult's. A direct acoustic consequence of this is that when the infant opens and closes the lower jaw while vocalising, essentially only F1 is affected, whereas all the higher formants tend to remain relatively unaffected by this articulatory movement (Lacerda, 2003). In practice this means that the young infant's vowel production is effectively limited to contrasts along the open-closed dimension. During the infant's first year of life, however, a dramatic lengthening of the child's pharynx takes place, as the position of the glottis relative to the cervical vertebrae drops markedly towards the end of the first year. In general, the disproportionate placement of the various articulatory structures in the infant's and the adult's vocal tracts means that infants and adults must, in principle, use different articulatory solutions to produce phonetically equivalent speech sounds. The same reasoning can be used to explain the dominance of pharyngeal sounds in early babbling. The reason that young infants are perceived to produce these pharyngeal sounds (Roug et al., 1989; figure 3) may simply be that the infant articulates "more common" velar or palatal sounds which, however, are perceived as pharyngeal because the place of articulation, owing to the short pharynx, lies relatively much further back in the infant than in the adult.

When the infant's "limitations" of acoustic-articulatory possibilities are combined with adults' spontaneous interpretations of babbling (Lacerda & Ichijima, 1995) and the infant's tendency to perceive sound contrasts conveyed by F1 more easily than those conveyed by F2 (Lacerda, 1992b; Lacerda, 1992a), all these components converge towards a picture of early speech perception development in which initial asymmetries can be regarded as helping the infant to focus, in the first instance, on a subset of the available dimensions in this enormous search space. The positive significance of initial limitations of the search possibilities has previously been demonstrated in work with neural networks, and there is every reason to take this message seriously in ecologically grounded theories of spoken language development.

[Figure 3: plot of % occurrence (0-100) of the places of articulation glottal, velar/uvular, dental/alveolar and bilabial against age in months (0-20).]

Figure 3. Typical babbling development for a Swedish child. Note the initial dominance of glottal articulations, which can be attributed to the infant's anatomical preconditions. Data from Roug et al. (1989).

Testing the learning model with the help of a humanoid robot

Creating realistic models of early spoken language development requires extensive integration of a range of different processes that run in parallel and interact with one another. It is a matter, for example, of modelling how general-purpose production and perception mechanisms can interact in order to derive linguistic information from the interaction with the environment. It is also a matter of integrating these mechanisms into a model that can generate actions on the basis of a kind of general background activity that drives the system towards new situations, which in turn may offer new learning opportunities for the system.

In order to study different general learning principles and how they interact with one another, the ecological model of early spoken language learning is currently being tested with the help of a humanoid robot, BabyBot, which is intended to demonstrate parallels between early language development and motor development.

We look forward with excitement to seeing how BabyBot, our surrogate galactic traveller, will manage its spoken language development under different experimental conditions!

_________________

This review article was produced within the framework of the projects "Språkets begynnelse" (Vetenskapsrådet), "MILLE: Modelling Interactive Language Learning" (Riksbankens Jubileumsfond) and "CONTACT: Learning and Development of Contextual Action" (NEST, EU).

References

Bertoncini, J., Bijeljac-Babic, R., Kennedy, L., Jusczyk, P., & Mehler, J. (1988). An investigation of young infants' perceptual representation of speech sounds. Journal of Experimental Psychology: General, 117, 21-33.

de Casper, A. & Fifer, W. P. (1980). On human bonding: Newborns prefer their mothers' voices. Science, 208, 1174-1176.

de Casper, A. & Spence, M. (1986). Prenatal maternal speech influences newborns' perception of speech sounds. Infant Behavior and Development, 9, 133-150.

Jusczyk, P. & Derrah, C. (1987). Representation of speech sounds by young infants. Developmental Psychology, 23, 648-654.

Jusczyk, P., Friederici, A. D., Wessels, J. M. I., Svenkerud, V. Y., & Jusczyk, A. M. (1993). Infants' Sensitivity to the Sound Patterns of Native Language Words. Journal of Memory and Language, 32, 402-420.

Lacerda, F. (1992a). Does Young Infants Vowel Perception Favor High-Low Contrasts. International Journal of Psychology, 27, 58-59.

Lacerda, F. (1992b). Young Infants' Discrimination of Confusable Speech Signals. In M.E.H.Schouten (Ed.), THE AUDITORY PROCESSING OF SPEECH: FROM SOUNDS TO WORDS (pp. 229-238). Berlin, Federal Republic of Germany: Mouton de Gruyter.

Lacerda, F. (2003). Phonology: An emergent consequence of memory constraints and sensory input. Reading and Writing: An Interdisciplinary Journal, 16, 41-59.

Lacerda, F. & Ichijima, T. (1995). Adult judgements of infant vocalizations. In K. Ellenius & P. Branderud (Eds.), (pp. 142-145). Stockholm: KTH and Stockholm University.

Lacerda, F., Klintfors, E., Gustavsson, L., Lagerkvist, L., Marklund, E., & Sundberg, U. (2004). Ecological Theory of Language Acquisition. In Genova: Epirob 2004.

Mehler, J., Jusczyk, P., Lambertz, G., Halsted, N., Bertoncini, J., & Amiel-Tison, C. (1988). A precursor of language acquisition in young infants. Cognition, 29, 143-178.

Minsky, M. (1985). Why intelligent aliens will be intelligible. In E.Regis (Ed.), Extraterrestrials (pp. 117-128). Cambridge: Cambridge Univ. Press.

Roug, L., Landberg, I., & Lundberg, L. J. (1989). Phonetic development in early infancy: a study of four Swedish children during the first eighteen months of life. Journal of Child Language, 16, 19-40.

Spence, M. J. & de Casper, A. (1987). Prenatal experience with low-frequency maternal-voice sounds influence neonatal perception of maternal voice samples. Infant Behavior and Development, 10, 133-142.

van der Weijer, J. (1999). Language Input for Word Discovery. MPI Series in Psycholinguistics.


Proposing ETLA: An Ecological Theory of Language Acquisition

Revista de Estudos Linguísticos da Universidade do Porto

Francisco Lacerda and Ulla Sundberg
Department of Linguistics, Stockholm University
SE-106 91 Stockholm, Sweden
[email protected]


Invited paper for the opening number of "Revista de Linguistica da Univ. do Porto", Portugal


ABSTRACT

An ecological approach to early language acquisition is presented in this article. The general view is that the ability for language communication must have arisen as an evolutionary adaptation to the representational needs of Homo sapiens, and that roughly the same process is observed in language acquisition, although under different ecological settings. It is argued that the basic principles of human language communication are observed even in non-human species and that it is possible to account for the emergence of an initial linguistic referential function on the basis of general-purpose perceptual, production and memory mechanisms, provided that the language learner interacts with its ecological environment. A simple computational model of how early language learning may be initiated in today's human infants is proposed.

INTRODUCTION

The ability for language communication is a unique human trait differentiating us from all other species, including those most closely related in genetic terms. In spite of the morphological and molecular proximity to closely related species in the family Hominidae, like Pongo (orangutan), Gorilla (gorilla) and Pan (chimpanzees) (Futuyama, 1998: p730), it is obvious that only Homo (human), and most likely only Homo sapiens, has evolved the faculty of language. In evolutionary terms, the capacity for language communication appears to be a relatively recent invention that nowadays is "rediscovered" by typical human infants. From this broad perspective it may be valuable to investigate parallels between the phylogenetic and ontogenetic components of the process of language communication, while keeping in mind that Ernst Haeckel's (1834-1919) biogenetic law, "ontogeny recapitulates phylogeny" (1866), is far too simplistic to be taken in a strict sense (Gould, 1977). This article opens with a broad look at the evolutionary aspects presumably associated with the discovery of language as a communication system. This evolutionary perspective will focus on general notions regarding the emergence of the language communication ability in the Homo's ecological system. In the remainder of the article, an ecologically inspired developmental perspective on early language acquisition in human infants will be presented, highlighting potential similarities and differences between the developmental and evolutionary accounts.

From the phylogenetic perspective, language communication, in the sense that it has today, may have emerged about 200 to 300 thousand years ago (Gärdenfors, 2000) with the advent of Homo sapiens1. Until then it is assumed that hominids might have communicated mainly by calls and gestures. Derek Bickerton (Bickerton, 1990), for instance, speculates that a protolanguage, assigning referential meaning to arbitrary vocalizations, may have been used already by Homo erectus, about 1.5 Mya to 0.5 Mya (Futuyama, 1998: p731). Such a protolanguage must have been essentially a "referential lexicon" and somewhat of a "phylogenetic precursor of true language that is recapitulated in the child (…), and can be elicited by training from the chimpanzee" (Knight et al., 2000: p4). While this protolanguage may have consisted of both vocalizations and gestures used referentially, the anatomy of the jaw and skull in earlier hominids must necessarily have constrained the range of differentiated vocalizations that might have been used referentially. Considerations on the relative positions of the skull, jaw and vertebral column in early hominids clearly indicate significant changes in the mobility of those structures, from Australopithecus afarensis (4.0-2.5 Mya) to Homo sapiens (0.3 Mya), changes that must have had important consequences for the capacity to produce differentiated vocalizations. An important aspect of these changes is the lowering of

1 "Most hominid fossils from about 0.3 Mya onward, as well as some African fossils about 0.4 My old, are referred to as Homo sapiens." (Futuyama, 1998)


the pharynx throughout the evolutionary process. This lowering of the pharynx is also observed on the ontogenetic time scale. Newborn human infants have at birth a very short pharynx, with high placement of their vocal folds, but they undergo a rather dramatic lengthening of the pharynx within the first years of life, in particular within about the first 6 months of life. As will be discussed later in this article, this short pharynx in relation to the oral cavity has some important phonetic consequences that can be summarized as a reduction in the control that the infant may have over the acoustic properties of its vocalizations. From the acoustic-articulatory point of view, there are essentially no differences between the expected acoustic-articulatory relationships in the infant and in the adult, since it seems that the infant's vocal tract in this case can be considered a downscaled version of an Australopithecus afarensis adult's. A more difficult question is to determine at which point of the evolutionary process the laryngeal position reached the placement it has nowadays, but the general consensus is that Homo erectus probably marks the transition from high to low laryngeal positions (Deacon, 1997: p56). However, in evolutionary terms the emergence of a new anatomical trait does not directly lead to the long-term consequences that are found retrospectively. Changes are typically much more local, opportunistic and less dramatic than what they appear to be in the back-mirror perspective, reflecting Nature's tinkering rather than design or goal-oriented strategies (Jacob, 1982), and Homo erectus probably did not engage in speech-like communication just to explore the new emerging anatomical possibility. A more plausible view is then that Homo erectus might have expanded the number of vocalizations used to signal objects and events available in the ecological environment, nevertheless without exploring the potential combinatorial and recursive power of the vocalizations' and gestures' inventories. Indeed, as recently pointed out by Nishimura and colleagues (Nishimura et al., ), the descent of the larynx is observed also in chimpanzees, leading to changes in the proportions of the vocal tract during infancy, just like in humans, and the authors argue that the descent of the larynx may have evolved in a common ancestor and for reasons that did not have to do with the ability to produce differentiated vocalizations that might be advantageously used in speech-like communication. In particular with regard to gestures, it is also plausible that Homo erectus might not have used gestures in the typically integrated and recursive way in which they spontaneously develop in today's Homo sapiens' sign language (Schick et al., 2006), although their gestural anatomic and physiologic capacity for sign language must have been about the same. Thus, in this line of reasoning, the question of how today's language communication ability arose is not settled by the lowering of the larynx, per se, and the added differential acoustic capacity that it confers to vocalizations.

Since communication using sounds or gestures does not leave material fossil evidence, it is necessary to speculate how vocalizations might have evolved into today's spoken language and to draw on useful parallels based on the general communicative behaviour among related hominid species. Situation-specific calls used by primates, such as the differentiated alarm calls observed in Old World monkeys like the vervet monkeys (Cercopithecus aethiops) (Cheney & Seyfarth, 1990), may in fact be seen as an early precursor of humanoid communication. To be sure, the calls used by the vervet monkeys to signal different kinds of predators and prompt the group to take appropriate action are rather primitive as compared to the symbolic functions of modern human language communication, but they surely demonstrate a multi-sensory associative process from which further linguistic referential function may emerge. The ability to learn sound-meaning relationships is present in primates at a variety of levels: animals tend to learn the meaning of different alarm calls, even those used by other species, if those calls provide them with meaningful information, and they can even modify their vocalizations to introduce novel components in the established sound-meaning coding system (Gil-da-Costa et al., 2003; Hauser, 1992; Hauser, 1989; Hauser & Wrangham, 1987). But using multi-sensory associations does not pay off in the long run


under the pressure of increasingly large representational demands, and once the need for representation of objects or events in the ecological setting reaches a critical mass of about 50 items, sorting out those items in terms of rules becomes a much more efficient strategy (Nowak et al., 2000; Nowak et al., 2001). In line with this, the driving force towards linguistic representation and the emergence of syntactic structure appears to come from increased complexity in the representation of events in the species' ecological environment, and again there are plausible evolutionary accounts for such an increase in representational complexity with the advent of Homo erectus. One line of speculation is linked to bipedalism. Bipedal locomotion is definitely not exclusive to humans (avian species have also discovered it and use it efficiently, as in the case of the ostrich) but it provided humanoids with the rather unique capacity of using the arms for foraging and effectively carrying food over larger distances. Again, mobility, per se, does not account for the need to increase representational power (migratory birds travel over large distances and must solve navigation problems and yet have not developed the faculty of language because of that) but it adds to the complexity of the species' ecological setting, posing increasing demands on representational capacity. Another type of navigational need that has been suggested as potentially demanding more sophisticated representation and communication abilities is the more abstract navigation in social groups (Tomasello & Carpenter, 2005) and the abstract demands raised by the establishment of "social contracts", as Terrence Deacon pointed out in his 1997 book (Deacon, 1997). To be sure, abstract representation capacity can be observed, at least to a certain extent, even in non-human primates (Savage-Rumbaugh et al., 1993; Tomasello et al., 1993), but they nevertheless fall short of human language as far as syntax and combinatorial ability are concerned.

The general notion conveyed by this overview of the possible evolution of language communication is that the capacity for language communication must have emerged as a gradual adaptation, building on available anatomic and physiologic structures whose functionality enabled small representational advantages in the humanoids' increasingly complex ecological contexts. Of course, because evolution is not goal-oriented, the successive adaptations must be seen more as accidental and local consequences of a general "arms-race" process that, although not aimed at language communication, still got there, as can be observed in retrospect. Evolutionary processes are typically like this, lacking purpose or road-map but conveying a strong feeling of purposeful design when looked upon in the back-mirror. William Paley's proposal of the "watchmaker argument" is an example of how deep-rooted the notion of intelligent design can be when viewing evolutionary history in retrospect (and knowing its endpoint from the very beginning of the reasoning). Yet it is well known that complex or effective results do not necessarily have to be achieved by dramatic or complex design; rather, they are often the consequence of deceptively simple interactions that eventually produce relatively stable long-term consequences (Enquist & Ghirlanda, 2005; Dawkins, 1998; Dawkins, 1987). The problem is accepting non-teleological explanations when the final results appear to be so overwhelmingly clear in their message of an underlying essential order. But how could that be otherwise? How could the current state of the evolutionary process not look like a perfectly designed adaptation to the present ecological settings if non-adapted behaviour or structures confer disadvantages that undermine the species' survival? Obviously the potential advantages that small and purposeless anatomic and physiological changes might confer are strictly linked to the ecological context in which they occur. Suppose, for instance, that our recently discovered ancestor, Tiktaalik (Daeschler et al., 2006), the missing link between fish and tetrapods, had appeared now rather than during the Devonian-Carboniferous Period, between 409 Mya and 354 Mya (Futuyama, 1998): How long would Tiktaalik have survived in today's ecological settings? Probably not very long. The bony fins that the animal successfully used as


rudimentary legs conveyed a significant advantage in an ecological context where land life was limited to the immediate proximity of water and there were therefore no land competitors. Under these circumstances, the ability to move on the water bank just long enough to search for fish that might have been trapped in shallow waters was obviously a big advantage, but it would have made the animal an easy prey nowadays, unless it had evolved shielding, poisoning or stealth strategies to compensate for its low mobility. In other words, the potential success of traits emerging at some point in the evolutionary process is intimately dependent on the ecological context in which they appear, suggesting that even a rudimentary ability to convey information via vocalizations may have offered small but systematic and significant advantages to certain groups of hominids. Indeed, an important aspect of this ecological context is that the innovation presented by even a rudimentary discovery of representational principles is likely to spread through cultural evolution processes (Enquist et al., 2002; Kamo et al., 2002; Enquist & Ghirlanda, 1998), propagating the innovation to individuals that may be upstream, downstream or peers in the species' genetic lineage. From this perspective language evolution must be seen as the combined result of both cultural and genetic evolutionary processes.

The ability to use vocal or sign symbols to represent events or objects in the ecological context was probably common at least within several hominid species (Johansson, 2005), but its recursive use must have been discovered by Homo sapiens, whose brain capacity developed (Jones et al., 1992) in an "arms-race" with the increasingly complex symbolic representational demands. From this evolutionary perspective, the language acquisition process observed in today's infants can be seen as a process that in a sense repeats the species' evolutionary story, keeping in mind that it starts off from an ecological context where language communication per se does not have to be re-discovered and is profusely used by humans in the infant's immediate environment. Of course, since both the early Homo sapiens and today's infants adjusted and adjust to the existing ecological context, the dramatic contrast between the communicative settings of their respective ecological scenarios must account for much of the observed differences in outcome. The challenge here is to account for how this language acquisition process may unfold as a consequence of the general interaction between the infant and its environment.

THE INFANT'S PRE-CONDITIONS FOR SPEECH COMMUNICATION

To appreciate the biological setting for language acquisition it is appropriate to take a closer look at some of the infant's capabilities at the onset of life.

Among the range of abilities displayed by the young infant, perceiving and producing sounds are traditionally considered to be the most relevant to the language acquisition process. These capabilities are not the only determinants of the language acquisition process (language acquisition demands multi-sensory context information, as will be argued below) but they certainly play an important role in the shaping of the individual developmental path. If there are biologically determined general propensities to perceive or generate sounds in particular ways, these propensities are likely to contribute significant developmental biases that ought to become apparent components of language acquisition. Such biological biases are dynamic, plastic components of the language acquisition process that will necessarily influence and be influenced by the process itself (Sundberg, 1998), but here the focus will be on just some production and perception biases.

Production constraints

To estimate the infant's production strategies, one may use the evidence provided by infant vocalizations in order to derive the underlying articulatory gestures. However, the acoustic analysis of high-pitched vocalizations is not a trivial task, since high fundamental frequencies


effectively reduce the resolution of the available spectral representations. An additional difficulty is caused by the anatomic differences between the infant's and the adult's vocal tracts. The vocal tract of the newborn infant is not a downscaled version of the adult vocal tract. One of the most conspicuous departures from proportionality with the adult vocal tract is the exceedingly short pharyngeal tract in relation to the oral cavity observed in the newborn infant (Fort & Manfredi, 1998). If the infant's vocal tract anatomy were proportionally the same as the adult's, the expected spectral characteristics of the infant's utterances would be essentially the same as the adult's, although proportionally shifted to higher frequencies. However, this is not the case, in particular during early infancy, and therefore articulatory gestures involving analogous anatomic and physiologic structures in the adult and the young infant will tend to result in acoustic outputs that are not linearly related to each other. To be sure, if the adult vocal tract were simply proportionally larger than the infant's, equivalent articulatory gestures would generally tend to result in proportional acoustic results.2 Conversely, due to the lack of articulatory proportionality, when anatomically and physiologically equivalent actions are applied to each of the vocal tracts, the adult's and the infant's vocal tracts will simply acquire different articulatory configurations. The fact that analogous articulatory gestures in young infants and adults lead to different acoustic consequences raises the question of the alleged phonetic equivalence between the infant's and adult's speech sound production. In fact it is not trivial to equate production data from infants with their supposed adult counterparts, since phonetic equivalence will lead to different answers if addressed in articulatory, acoustic or auditory terms.

To gain some insight into this issue, a crude articulatory model of the infant's vocal tract was used to calculate the resonance frequencies associated with different articulatory configurations. The model allows a virtual up-scaling of the infant's vocal tract model, to "match" a typical adult length, enabling the "infant's formant data" to be plotted directly on the adult formant domain. With this model the acoustic consequences of vocalizations produced while swinging the jaw to create a stereotypical opening and closing gesture could be estimated for combinations of larynx and vocal tract length corresponding to an infant, a child and an adult speaker (Lacerda, 2003). An important outcome of this model is that the vowel sounds produced by the "young infant" vocalizing while opening and closing the jaw are differentiated almost exclusively in terms of their F1 values. Variation in F2 appears only once the larynx length increases towards adult proportions (Robb & Cacace, 1995). Equivalent opening and closing gestures in the adult introduce some variation in F2, in particular when the jaw is wide open, because the ramus tends to push back and squeeze the pharynx. Since the infant's pharynx is much shorter, an infant producing the same swinging gesture will not generate appreciable variation in F2. To be able to modulate F2 in a similar way, the infant would have to pull the tongue body upwards toward the velum, probably by contracting the palatoglossus and the styloglossus muscles. For this configuration the infant's vocalization would have approximately the same formant structure as an adult [ɑ] vowel. To produce more front vowels, the infant would have to create a constriction in a vocal tract region at a distance of about 70% of the full vocal tract length from the glottis, as predicted by the Acoustical Theory of Speech Production (Fant, 1960; Stevens, 1998).
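The scaling logic behind this virtual up-scaling can be illustrated with the simplest possible approximation: a uniform tube closed at the glottis and open at the lips. The tube lengths used below (roughly 8 cm for a newborn and 17 cm for an adult male) are typical textbook values, not the parameters of the model referred to above.

F_n = \frac{(2n-1)\,c}{4L}, \qquad n = 1, 2, 3, \ldots

\text{adult } (L \approx 0.17\ \mathrm{m}):\; F_1 \approx 515\ \mathrm{Hz},\; F_2 \approx 1540\ \mathrm{Hz}
\text{newborn } (L \approx 0.08\ \mathrm{m}):\; F_1 \approx 1090\ \mathrm{Hz},\; F_2 \approx 3280\ \mathrm{Hz}

with c ≈ 350 m/s. "Up-scaling" the infant tract to adult length thus amounts to multiplying the infant formants by L_infant/L_adult ≈ 0.47; the disproportionately short infant pharynx is precisely what makes the real mapping deviate from this simple proportional rescaling.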

A clear implication of the anatomical disproportion between infants and adults is that early infant babbling sounds in general cannot be directly interpreted in adult articulatory phonetic terms. Of course, acoustic-articulatory relations are problematic even in the adult case, because the acoustic-articulatory mapping is not biunivocal, as demonstrated by the

2 Departures from this acoustic proportionality would, however, be observed in situations where the cross-sectional area of the infant's proportional vocal tract would turn out to be under absolute critical values associated with turbulent or laminar flow through constrictions.


everyday examples of individual compensatory strategies and bite-block experiments (Gay et al., 1981; Gay et al., 1980; Lindblom et al., 1977; Lane et al., 2005), but the problem is even more pertinent in the case of the infant's production of speech sounds (Menard et al., 2004). Indeed, in addition to the non-proportional anatomic transformation, the infant, in contrast with the adult speaker, does not necessarily have underlying phonological targets, implying that infant vocalizations in general ought to be taken at the face value of the acoustic output rather than interpreted in terms of phonetically motivated articulatory gestures. Disregarding for a moment experimental results suggesting that the young infant has the ability to imitate certain gestures (Meltzoff & Moore, 1983; Meltzoff & Moore, 1977) and even appears to be able to relate articulatory gestures with their acoustic outputs (Kuhl & Meltzoff, 1996; Kuhl & Meltzoff, 1988; Kuhl & Meltzoff, 1984; Kuhl & Meltzoff, 1982; Meltzoff & Borton, 1979), the assumption here is that the infant is initially not even attempting to aim at an adult target sound. Thus in a situation of spontaneous babbling, where the infant is not presented with a "target" sound produced by a model, the phonetic content of the infant's productions must be strongly determined by the infant's articulatory biases and preferences, while the infant's productions (which often are rather diffuse in phonetic terms) will be perceived according to the expectations of the adult listener. As a consequence, the infant's vocalizations in a normal communicative setting are prone to be influenced from the very beginning by the adult's knowledge of and expectations about the language. This type of interpretative circularity is indeed an integral component of the very speech communication process and must be taken into account in studies of language development as well as in linguistic analyses.

Consider, for instance, the Swedish infants’ preferences for different articulatory positions illustrated in figure 1, redrawn here after data from Roug, Landberg and Lundberg (Roug et al., 1989).


[Figure 1: four panels, one per infant, each plotting % occurrence (0-100) of the places of articulation glottal, velar/uvular, dental/alveolar and bilabial against age in months (0-20).]

Figure 1. Occurrence of different places of articulation in babbling produced by four Swedish infants. Original data from Roug et al. (1989).

In spite of the individual preferences, the overall trend of these data indicates a preference for velar and pharyngeal places of articulation during the first months of life. Pharyngeal places of articulation are typologically less frequent than places of articulation involving contact somewhere between the lips and the velum, but they were nevertheless dominant in these infants' early babbling. Although the bias towards pharyngeal places of articulation is likely to be a consequence of the vocal tract anatomy, where the young infant's extremely short pharynx leads to pharyngeal-like formant patterns when the infant's tongue actually makes contact with the velum or palate, the infant's productions are actually perceived by the adult as being produced far back in the vocal tract, which translated into adult proportions means pharyngeal articulations. In a sense there is nothing wrong with this adult interpretation. It is correct from the acoustic perspective; the problem is that it is mapped onto an adult vocal tract that is not an up-scaled version of the infant's. Thus, because of this non-proportional mapping, acoustic or articulatory equivalence between sounds produced by the infant and the adult necessarily leads to different results and may lead to an overestimation of the backness of the consonantal sounds produced by the infant early in life (Menard et al., 2004).

Interactive constraints

All living systems interact in one way or another with their environment. At any given time the state of a living system is therefore the result of the system's history of interaction with its ecological context. Humans are no exception. They are simultaneously "victims" and actors in their ecological setting. They are the living result of long-term cross-interactions between internal genetic factors, the organism's life history and external ambient factors. Among the multivariable interactions observed in natural systems, language development is certainly an


outstanding example of endogenous plasticity and context interactivity. As pointed out above, the infant’s language acquisition process must be seen as an interactive process between the infant and its environment. In principle, in this mutual interaction perspective, both the infant and its environment are likely to be affected by the interaction. However, because the infant is born in a linguistic community that has already developed crystallized conventions for the use of sound symbols in communication, the newcomer has in fact little power against the community’s organized structure. Obviously, the conventions used by the community are not themselves static but this aspect has very limited bearing on the general view of language acquisition. Not only do the users of the established language outnumber the infant but also the community language represents a coherent and ecologically integrated communication system, whose organization the individual newcomer can hardly threaten. In line with this, a crucial determinant of the infant’s initial language development must be given by the feedback that the community in general and the immediate caregivers, in particular, provide. Thus, if ambient speakers implicitly (and rightly) assume that the infant will eventually learn to master the community’s language, they are likely to interpret the infant’s utterances within the framework of their own language. In other words, adult speakers will tend to assign phonetic value to the infant’s vocalizations, relating them to the pool of speech sounds used in the established ambient language and the adult will tend to reward and encourage the infant if some sort of match between the infant’s productions and the adult expectations is detected.

To address this question, Lacerda and Ichijima (1995) investigated adult spontaneous interpretations of infant vocalizations. The infant vocalizations were a random selection of 96 babbled vowel-like utterances obtained from a longitudinal recording of two Japanese infants at 17, 34 and 78 weeks of age. The subjects were 12 Swedish students attending the second term of Phonetics3, who were requested to estimate the tongue positions that the infant might have used to produce the utterances. The subjects indicated their estimates in a crude 5×5 matrix representing the tongue body's frontness and height parameters. In an attempt to simulate spontaneous adult responses in a natural adult-infant interaction situation, the students' judgements had to be given within a short time window to encourage responses on the basis of their first impressions. The answer forms consisted of a series of response matrices, one for each stimulus. The subjects' task was to mark in the matrices the cell that best represented tongue height and frontness for each presented utterance. If all the subjects agreed on a specific height and frontness level for a given stimulus, the overall matrix for that stimulus would contain 12 response units in the cell representing those coordinates and zero in all the other cells.

3 Although students at this level are not strictly speaking naïve listeners, they represented a good enough balance between non-trained adults and the ability to express crude phonetic dimensions.
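As a concrete illustration of this scoring scheme, the sketch below accumulates one 5×5 count matrix per stimulus; the data layout and the crude agreement index are assumptions for illustration, not the analysis actually performed in the study.

import numpy as np

def accumulate_responses(judgements, n_stimuli, grid=5):
    # `judgements` is an iterable of (stimulus_index, height_cell, frontness_cell)
    # tuples, one per listener response. Perfect agreement among 12 listeners
    # would put 12 counts in a single cell of that stimulus' matrix.
    matrices = np.zeros((n_stimuli, grid, grid), dtype=int)
    for stim, height, frontness in judgements:
        matrices[stim, height, frontness] += 1
    return matrices

# Toy example: three listeners judging stimulus 0.
responses = [(0, 4, 1), (0, 4, 1), (0, 3, 1)]
m = accumulate_responses(responses, n_stimuli=1)
print(m[0])                     # counts per (height, frontness) cell
print(m[0].max() / m[0].sum())  # crude agreement index: 2/3 here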


Figure 2. The panels show level contours for agreement in adult spontaneous judgements of height and frontness of an infant's babbling. The left panel shows babbling produced at 17-34 weeks of age; the right panel babbling produced at 78 weeks of age. High response levels indicate that the adult subjects could agree upon the underlying articulatory gesture associated with the vowel-like utterance. The y-axis indicates the height dimension on an arbitrary scale, where 0 corresponds to a maximally closed vowel and 5 to a maximally open one. The x-axis represents frontness, with 0 corresponding to an extreme front and 5 to an extreme back position of the tongue.

When the individual judgements were accumulated within each of the cells of the response matrices, two age-dependent patterns of frequently occurring responses emerged. For babbling utterances coming from the samples at 17 and 34 weeks of age, the judgements were scattered over a wide range along the vowel height dimension, whereas the range of variation along the frontness dimension was rather limited. In contrast, the judgements of the samples from the 78-week-olds displayed an expanded range of variation along both the frontness and the height dimensions. In other words, whereas the earlier babbling samples were interpreted as reflecting the use of mainly the height dimension, the responses to the later babbling suggest a tendency towards an adult-like expansion of the vowel space. These results are illustrated in the two panels of figure 2.

Admittedly, this type of result depends both on the acoustic nature of the infant utterances and on the adult's auditory interpretations. In fact the results might reflect an adult judgement bias, some kind of vocalization preference on the part of the infant, or a mix of these biases. One possibility would be that the infant actually produces a variety of vowel-like utterances uniformly scanning the entire domain of the available articulatory space. In this case, the pattern of the results from the earlier babbling would reflect the adults' inability to consistently estimate the frontness dimension. This would be an adult bias that might be referred to as a sort of "phonological filter" that the adult has developed as a result of language experience. Another possibility would be that adults actually could pinpoint the infant's articulatory gestures but that the young infant simply does not produce vowel-like sounds that are differentiated along the frontness dimension.


Whatever the underlying reason for the observed response patterns, they reflect an actual interactive situation where adults and infants meet each other, linked by phonetics. Indeed, the adult and the infant are intimately connected with each other in the language communication process and the actual causes of the observed response patterns cannot be disentangled without the independent information provided by general theoretical models. At this point an acoustic-articulatory model of the infant vocal tract is a useful research tool because it provides some insight on the potential acoustic domain of the infant’s articulatory gestures. Interestingly, as implied by the acoustic-articulatory considerations addressed in the former section, the infant’s short pharyngeal length clearly curtails acoustic variance in F2, which is the acoustic correlate of the frontness dimension. Indeed, the notion that young infants might be mainly exploring the height dimension of the vowel space was corroborated in a follow-up listening test where a group of four trained phoneticians was asked to make narrow phonetic transcriptions of the babbling materials that had been evaluated by the students. The phoneticians carried out their task in an individual self-paced fashion. They were allowed to listen repeatedly to the stimuli, without response-time constraints, and requested to rely on the IPA symbols, with any necessary diacritics, to characterize the vowel-like utterances as well as possible. However, contrary to the typical procedures used in phonetic transcriptions of babbling, these phoneticians were not allowed to reach consensus decisions nor were they aware that their colleagues had been assigned the same task. The phoneticians’ data were subsequently mapped onto the 5×5 matrices that had been used by the students by imposing the 5×5 matrix on the IPA vowel chart, assuming that the “corners” of the vowel quadrilateral would fit on the corners of the domain defined by the matrix. The results of the phoneticians’ IPA transcriptions and the subsequent mapping procedure essentially corroborated the overall pattern of the students. The consistency between the judgements by the students and by the trained phoneticians discloses a provocative agreement between the two groups’ spontaneous interpretations of infant babbling. Once again, it is not possible to resolve the issue of whether this agreement is a consequence of a strong phonological filter or of acoustic-articulatory constraints in the infant’s sound production. The data can also be interpreted as suggesting that the infant’s utterances are articulatorily so undifferentiated that only main aspects of their production can consistently be agreed upon by a panel of listeners. Apparently, it is only the height dimension that is differentiated enough to generate consistent variance in the adult estimates. To be sure, it cannot be excluded that all that the young infant is doing during these vocalizations is to open and close the mouth keeping the tongue essentially locked to the lower jaw (Davis & Lindblom, 2001; MacNeilage & Davis, 2000a; MacNeilage & Davis, 2000b; Davis & MacNeilage, 1995). This account is supported by the theoretical evidence from the previous acoustic-articulatory model using short pharyngeal length (Lacerda, 2003) and suggests that the spontaneous adult judgements of the infant’s babbling reflect rather accurately the major features of the infant’s actual phonetic output.

In summary, the acoustic-articulatory data, the adult perception data and the acoustic-articulatory model all provide consistent support for the notion that the infant initially uses mainly the high-low dimension of the vowel space. The full exploration of the vowel space, including the use of the front-back dimension, comes at a later age.

Clearly, the interactive constraints affect much more than just vowel perception. As stated above, the infant and the adult interacting with each other create a particular setting of mutual adjustments to reach the overall common goal of maintaining speech communication (cf. Jakobson’s (1968) phatic function of speech communication), where adults modify their speaking styles in response to their own assumptions on the infant’s communication needs and intentions. These speech and language adjustments involve modifications at all linguistic levels, like frequent prosodic and lexical repetitions, expanded intonation contours and longer


pauses between utterances. In addition to these supra-segmental modifications, there is evidence of more detailed modifications observed at the segmental level. For instance, when addressing 3-month-olds, adults seem to expand their vowel space in semantic morphemes (van der Weijer, 1999), as demonstrated by cross-linguistic data from mothers speaking American English, Russian and Swedish (Kuhl et al., 1997). The adult production of consonants is also affected in adult-infant communication. The voiced/voiceless distinction expressed by VOT (Voice Onset Time) seems to be systematically changed as a function of the infant's age, while preserving the main pattern of durational relations that occur in the adult language4. In infant-directed speech (IDS) to 3-month-olds, the VOT is significantly shorter than in corresponding utterances in adult-directed speech (ADS), leading to an increased overlap of the voiced and voiceless stops (Sundberg & Lacerda, 1999). In contrast, as suggested by preliminary results (Sundberg & Lacerda, under review), VOT in IDS to 12-month-olds seems to provide a higher differentiation between voiced and voiceless stops than is typically observed in ADS. Within the framework of the communication setting, such adult phonetic and linguistic modifications can be interpreted as an indication that the adult adjusts to the communicative demands created by the situation. Putting together the suprasegmental, the VOT and the vowel formant data, a picture of adjustments to the infant's needs emerges. Exploring the significance of variations in F0, the adult may modulate the infant's attention during the adult-infant interaction in order to keep the infant at an adequate state of alertness. When the adult-infant communication link is established, the adult seems to intuitively offer speech containing clear examples of language-specific vowel qualities embedded in a rich emotional content. Because F0 modulation is essentially conveyed by the vowels, their phonetic enhancement is quite natural, while VOT distinctions may receive less phonetic specification at this stage. Although the perceptual relevance of vowel enhancement and VOT reduction has not been experimentally assessed from the infant's perspective, a plausible interpretation is that the adult intuitively guides the infant towards language.

One possibility is that the adult opportunistically explores the infant’s general auditory preference for F0 modulations. The adult adds some excitement to the infant perception by spicing the F0 contours with “crispy” consonants that by contrast enhance the vowel segments (Stern et al., 1983). This patterning of the speech signal is likely to draw the infant’s attention to the vocalic segments. A further indication of the adult’s adaptation to the perceived demands of the infant is provided by the change in phonetic strategy that is observed when adults address infants at about 12 months of age. Triggered by the clear signs of language development provided by the first word productions and proto-conversations, the adult introduces phonetic explicitness also to the stop-consonants, as suggested by data being processed at Stockholm University’s Phonetics laboratory. Obviously, in this scenario, the communication process between the adult and the infant cannot be equated to an adult-to-adult conversation. Rather, the adult appears to use language as an efficient and interactive playing tool, where speech sounds are the toys themselves. Through this adaptive use of the language, the adult explicitly provides the infant with proper linguistic information wrapped in the playful and efficient setting of the adult-infant interaction. The infant is spontaneously guided by the adult’s clear cues towards the general conventions of the speech communication act, like phonetic markers of turn-taking such as F0 declination, final lengthening and pause duration (Bruce, 1982).

Yet another conspicuous aspect of interaction constraints is the frequent use of repetitive patterns in infant-directed speech. As evidenced by a number of cross-language studies, adults interacting with infants tend to repeat target words and phrases as well as intonation patterns4 (Papousek & Hwang, 1991; Papousek & Papousek, 1989; Fernald et al., 1989). Stockholm University's experimental data on mother-infant interactions clearly indicate that adults are extremely persistent in using, for example, words and phrases in a highly repetitive manner when addressing young infants. In the context of memory processes and brain plasticity this is likely to be a powerful strategy, as predicted by the language acquisition model that generates emergent patterns as a direct consequence of the interaction between memory decay and exposure frequency (Lacerda et al., 2004a).

4 This is true at least for Swedish, where it has been observed that the complementary quantity distinctions were preserved in IDS to 3-month-olds.
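The interaction between memory decay and exposure frequency can be caricatured with a leaky accumulator: each exposure adds to a trace that otherwise decays exponentially, so only items repeated often enough within the decay window survive. The decay constant and the example values below are arbitrary illustrative choices, not parameters from Lacerda et al. (2004a).

import math

def trace_strength(exposure_times, now, tau=30.0):
    # Strength of a memory trace at time `now` (in seconds): each exposure
    # contributes a unit impulse that decays exponentially with time constant
    # `tau`, so frequent exposures accumulate while isolated ones fade away.
    return sum(math.exp(-(now - t) / tau) for t in exposure_times if t <= now)

# A word repeated every 10 s builds up a strong trace; a word heard only
# once, 100 s ago, has essentially faded.
print(trace_strength([0, 10, 20, 30], now=30))  # ≈ 2.6
print(trace_strength([0], now=100))             # ≈ 0.04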

To take the infant's perspective, it is mandatory to discuss the impact that the adult's phonetic modifications and repetition patterns may have on the infant.

Perceptual constraints

The newborn infant is exposed to a wide variety of information from the onset of its post-natal life. The infant's contact with its external environment is mediated by the information from all the sensory input channels, in continuous interaction with the infant's endogenous system. Clearly, the infant's initial perception of the world has to rely on the range of the physical dimensions in the environment that the infant's sensory system can represent.5 From the current point of view, because the infant is not assumed to be endowed with specific language-learning capabilities, the infant's perceptual capabilities at birth can be expected to be important determinants of the infant's development. The infant's auditory capability, in particular, is likely to be an important, though not sufficient, pre-requisite for the development of the speech communication ability, so let us try to get a picture of the infant's initial auditory capacity by relating it to the adult's.

From the onset of its post-natal life, the infant seems to have an overall frequency response curve that is essentially similar to the adult’s, though shifted upwards by about 10 dB (Werner, 1992). Also the quality factors of the infant’s tuning curves are comparable to the adult’s, at least in the low frequency region up to about 1 kHz. Given this resemblance, the infant’s and the adult’s auditory systems may be expected to mediate similar sensory representations of the speech signals, implying that differences in behavioural response patterns to speech stimuli may be attributed to higher-level integrative factors rather than peripheral psychoacoustic constraints.

Infant speech discrimination studies involving isolated speech sounds typically demonstrate that young infants are able to discriminate a wide variety of contrasts, virtually all the speech sound contrasts that they have been tested with. In fact, even 4-day-old infants have been shown to discriminate between CV bursts as short as 20 ms (Bertoncini et al., 1987), suggesting that the newborn infant is equipped with the necessary processing mechanisms to differentiate between bilabial, dentoalveolar and velar stop consonants. These and similar results from speech discrimination experiments (Eimas, 1974) with young infants demonstrate that there is enough acoustic information to discriminate the stimuli. This is likely to be a pre-requisite for linguistic development, but discrimination ability, by itself, is clearly not enough. In fact, discrimination alone is likely to generate a non-functional overcapacity of separating sounds on the basis of their acoustic details alone. Linguistically relevant categories explore similarities among speech sounds that go beyond the immediate acoustic characteristics, as is the case with allophonic variation or the same vowel uttered by female or male speakers. Such

5 An organism’s sensory system can be seen as a filter attending to a limited range within each of the environment’s physical dimensions. The human visual system, for instance, is adapted to represent the range of electromagnetic radiation with frequencies between the infrared and ultra-violet and the auditory system reacts to changes in the atmospheric pressure falling within a limited range of amplitude and frequency whereas other species pick up other ranges of these physical dimensions. Given the differences in the representation ranges, the “reality” available to the different species is likely to be different too.


sounds are easily discriminable on a purely acoustic basis but are obviously linguistically equated by competent speakers. Thus, a relevant question that has to be addressed by the model concerns the early handling of phonetic variance underlying the formation of linguistically equivalent classes, and how the infant's initial discrimination ability relates to potential initial structure in the infant's perceptual organisation. Parallel to the main trend of the experimental results pointing to a general ability to discriminate among speech sound contrasts, a remarkable asymmetry in the discrimination of vowel contrasts was observed in Stockholm University's Phonetics laboratory. Experiments addressing the young infants' ability to discriminate [ɑ] vs. [a] and [ɑ] vs. [ʌ] indicated that the latter contrast was more consistently discriminated than the former (Lacerda, 1992b; Lacerda, 1992a; Lacerda & Sundberg, 1996). The stimuli were synthetic vowels that differed only in their F1 or F2 values6, reflecting a contrast in sonority (i.e. the vowel height dimension, F1) or in chromaticity (i.e. the vowel frontness dimension, F2). The contrasts were conveyed by equal shifts in F1 or F2, expressed in Bark. The results were obtained from two different age ranges and subject groups, using age-adequate techniques. The younger infants, around 3 months of age, were tested using the High Amplitude Sucking technique (Eimas et al., 1971) and the older subjects, 6 to 7 months old, were tested with the Head-Turn technique (Kuhl, 1985). In both cases the outcome was that discrimination of the contrasts involving differences in the F1 dimension was more successful than for the corresponding contrasts conveyed by F2 differences, in spite of the fact that equal steps in Bark actually mean larger differences in F2 than in F1 when expressed in Hz. In principle this discrimination advantage for the F1 contrasts might be attributed to the concomitant intensity differences that are associated with changes in F1 frequency, but intensity was also strictly controlled in follow-up experiments using a parallel synthesis technique (where overall intensity is not dependent on formant frequency) and yet the F1 advantage in discrimination performance persisted. Thus, these experiments suggest that the infant's perceptual space for vowels is asymmetric in terms of height and frontness contrasts, with a positive discrimination bias towards height.
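To see why equal Bark steps correspond to larger Hertz differences in the F2 region than in the F1 region, one commonly used approximation of the Bark scale can be inverted; the particular formula below is an assumption for illustration, since the text does not state which conversion was used in the experiments.

z(f) \approx \frac{26.81\,f}{1960 + f} - 0.53, \qquad f(z) = \frac{1960\,(z + 0.53)}{26.28 - z} \quad (f \text{ in Hz})

A step of one Bark upwards from 500 Hz (a typical F1 value) ends near 620 Hz, a difference of roughly 120 Hz, whereas the same one-Bark step from 1500 Hz (a typical F2 value) ends near 1740 Hz, a difference of roughly 240 Hz.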

Putting the infant in its global context of production, perception and interaction, there seems to be an inescapable pattern of asymmetry that tends to enhance vowel contrasts along the height dimension. As stated above, data from infant vowel production, adult-infant interaction and infant vowel perception all converge towards a pattern of dominance of the height dimension in early language acquisition. These results are also consistent with typological data from natural vowel systems (Maddieson & Emmorey, 1985; Maddieson, 1980; Liljencrants & Lindblom, 1972). To the extent that infant speech perception and the adult’s interpretation of babbling provide an indication of the biases underlying language development in general, vowel height may be expected to play a dominant role in the organization of vowel systems. In fact, this is in good agreement with the typological data showing that vowel height seems to be the first dimension to be explored on its own in vowel systems of increasing complexity, whereas frontness contrasts usually are accompanied by rounding gestures, as if to underline the frontness distinction (Liljencrants & Lindblom, 1972).

In summary, the overall message provided by the speech perception experiments with infants is indeed compatible with the notion that the infant starts off with a general auditory process that gains linguistic content in the course of the language acquisition process. Young infants are reportedly good at discriminating speech sound contrasts, but their discrimination can largely be accounted for by sensitivity to acoustic differences per se, not necessarily linked to underlying linguistic strategies. A large body of speech perception studies in which infants

6 F1 and F2 refer to the first and second resonance frequencies of the vocal tract, i.e. formants. The first two formants are usually enough to specify the main characteristics of a vowel’s phonetic quality.


were tested on discrimination of both native and non-native speech sound contrasts indicates a progressive attention focus towards sound contrasts that are relevant in the ambient language. With respect to vowel perception, for instance, it was observed that 6-month-old infants tend to display a vowel discrimination behaviour that seems to have been influenced by their exposure to the ambient language (Kuhl et al., 1992). Infants at this age show a higher tolerance to allophonic variation for the vowels occurring in their native language than for non-native vowels, a phenomenon often referred to as the “perceptual magnet effect” (Iverson & Kuhl, 2000; Iverson & Kuhl, 1995; Kuhl, 1991; Lotto et al., 1998). This process seems to be accompanied by an increasing attention focus on the ambient language and, by about 10 months of age, infants may no longer be able to discriminate foreign vowel contrasts (Polka & Werker, 1994). The perception of consonantal contrasts also appears to follow a similar developmental path, although the development is shifted upwards in age. Whereas by about 6 to 7 months of age no particular differences in the discrimination ability for native and non-native consonantal contrasts have been observed, at 12 months of age infants seem to be more focused on the native than on the non-native consonant contrasts (Werker & Tees, 1992; Best et al., 2001; Werker & Logan, 1985; Tees & Werker, 1984; Werker & Tees, 1983; Werker et al., 1981). Taken together, these experimental results indicate that exposure to the ambient language shifts the infant’s focus from a general to a more differentiated and language-bound discrimination ability (Polka & Werker, 1994; Polka & Bohn, 2003).

Many of the speech discrimination studies assessing the infant’s early capabilities have been carried out using isolated speech sounds. Isolated speech sounds and their underlying phonemic representations, as portrayed in linguistic theories, are useful for logical and formal descriptions of language. Nevertheless, the phonemic concept is not obviously connected to the speech sounds that supposedly materialize it. Phonemes are idealizations that capture the essential contrastive function in language and are therefore not immediately available to the young language learner. What the infant is exposed to are strings of interwoven speech sounds, with all the concomitant co-articulation effects and non-canonical aspects affecting all levels of connected natural speech. Isolated speech sounds are rare in natural interaction, raising problems for the linguistic interpretation of many infant speech discrimination experiments, in particular when single speech sound tokens are taken to represent a phonemic category. In particular, the lack of variance in the stimuli presented to the infant is likely to severely limit the ecological validity of the results, but the linguistically relevant issue is to find out how the infant, handling natural variance, nevertheless homes in on an adult-compatible linguistic representation. The challenge is to account for language acquisition building on the fuzziness and variability that are characteristic of the infant’s natural environment.

About a decade ago, infant speech perception studies started to address this issue using a variety of experimental approaches. For example, in an attempt to assess the significance of repetitive structures embedded in a continuous speech signal, Saffran and colleagues (Saffran et al., 1996) presented 8-month-old infants with sequences of concatenated CV-sequences drawn from a set of four basic CV-sequences. The conditional (transitional) probabilities of the concatenated CV-sequences were manipulated to ensure that certain CV pairs would occur with higher probability than others. This was done to reflect natural language structure, in which syllable sequences within words tend to have higher transitional probabilities than syllable sequences across words. After a 2-minute exposure to this type of material, the infants showed a significant preference for the pseudo-words formed by syllables of high transitional probability, suggesting that they had been able to pick up implicit statistical properties of the speech material (Saffran & Thiessen, 2003; Saffran, 2003; Saffran, 2002; Seidenberg et al., 2002). Also a number of studies carried out by the late Peter Jusczyk and


colleagues suggest that infants are sensitive to high-frequency words in their ambient language. The group reported, for instance, that four-month-old infants were sensitive to their own names, to which they are exposed with high frequency (Mandel & Jusczyk, 1996; Mandel et al., 1996), and also that nine-month-olds, in contrast with 7½-month-olds, were able to pick up high-frequency words from the speech stream of a story-telling session (Johnson & Jusczyk, 2001; Mattys & Jusczyk, 2001; Houston et al., 2000; Nazzi et al., 2000; Jusczyk, 1999).

These experiments represent different scenarios of language exposure. In the experimental set-up of Saffran et al., only the acoustic information provided by synthetic utterances was available to the infants, who nevertheless could use the statistical regularities to structure the continuous sound streams in spite of the limited exposure and sparse information load of the signal. Jusczyk’s work provides a closer match to a natural language acquisition setting since, in addition to the audio signal, the infants also had access to picture books providing visual support for the story they were exposed to. In comparison with Saffran and colleagues’ set-up, Mandel and Jusczyk’s experiment clearly offered a much richer linguistic environment due to the variance in the speech material that the infants were exposed to (Mandel & Jusczyk, 1996). By itself, the audio signal available to Mandel and Jusczyk’s subjects during the training sessions is not nearly as explicit as in Saffran’s set-up. However, the total amount of exposure (with daily exposure sessions carried out for about two weeks) was far more extensive in Jusczyk’s experiment. In addition to this longer exposure, the infants were also encouraged to look at a picture book, which may have been a critical component for the positive outcome of the experiment. Indeed, in line with the ideas expressed in this article, linguistic meaning emerges from the co-varying multi-sensory information available during exposure – in Mandel and Jusczyk’s case, the naturalistic speech co-varying with the visual information.
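As an illustration of the transitional-probability manipulation described above, the sketch below computes within-word and across-word syllable transition probabilities for a stream of concatenated pseudo-words. The pseudo-words, the naive two-character syllabification and the stream length are illustrative assumptions, not the actual stimuli.

import random
from collections import Counter

# Sketch: transitional probabilities between adjacent syllables in a continuous stream.
random.seed(0)
pseudo_words = ["tupiro", "golabu", "bidaku", "padoti"]   # hypothetical tri-syllabic pseudo-words

def syllabify(word):
    return [word[i:i + 2] for i in range(0, len(word), 2)]

stream = []
for _ in range(300):                          # concatenate pseudo-words in random order
    stream.extend(syllabify(random.choice(pseudo_words)))

pair_counts = Counter(zip(stream, stream[1:]))
first_counts = Counter(stream[:-1])

def transitional_probability(s1, s2):
    return pair_counts[(s1, s2)] / first_counts[s1]

print(transitional_probability("tu", "pi"))   # within-word transition: 1.0
print(transitional_probability("ro", "go"))   # across-word transition: roughly 0.25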

To study the significance of co-varying multi-sensory information, a series of experiments designed to create learning situations from controlled multi-sensory information was recently started. To assess the infant’s ability to link acoustic and visual information in a linguistically relevant way, a variant of the visual preference technique has been used. The preliminary results suggest that 8-month-old infants are able to establish linguistic categories, such as nouns, from exposure to variable but consistent audio-visual information (Lacerda et al.).

MODELLING THE INFANT IN AN ECOLOGICAL SETTING

Background and overview of an Ecological Theory of Language Acquisition

This section presents a general model of a system capable of learning linguistic referential functions from its exposure to multi-sensory information (Lacerda et al., 2004a). By itself, the model has a wider and more abstract scope than what has traditionally been considered as language learning. Rather than focusing on the acoustic signal per se as the main determinant of the acquisition of spoken language, it is here suggested that language acquisition should be addressed as a particular case of a general process capturing relations between different sensory dimensions. In this view linguistic information is implicitly available in the infant’s multi-sensory ecological context, and is derivable from the implicit relationships between auditory sensory information and other sensory information reaching the infant. Early language acquisition therefore becomes a particular case of a more general process of detection of contingencies available in the sensory representation space.

Two initial assumptions are made in this model: one is that there is multi-sensory information available to the system, and the other is that the system has a general capacity for storing the incoming sensory information. This latter assumption does not mean that the system will permanently store information, or even store all the incoming information.


In the case of the infant, this input is thought to consist of all the visual, auditory, olfactory, gustatory, tactile as well as kinaesthetic information. Maintaining life requires continuous interaction between the living system and its environment, as in complex organisms’ basic life-supporting functions like breathing and eating. To succeed, the biological system must be able to handle information available in its environment and use it to acquire the necessary life-supporting resources. In this sense, information obviously refers to a vast range of physical and chemical properties of the organism’s immediate environment (like the harshness and structure of surfaces with which the organism makes contact, chemical properties of the environment, pressure gradients, electromagnetic radiation, etc.) as well as information conveyed by potential relationships between these elements, like the underlying relation between sounds of speech and the visual properties of the objects that the speech signal might refer to. To be sure, living organisms tend to specialize in processing quite a limited range of the physical and biochemical diversity potentially available in their environments, i.e. different systems specialize in more or less different segments of the available environment, exploiting the potential advantages of focusing on specific ecological niches. There are numerous examples, like bats exploiting echolocation to survive in the ecological niche left open by their daylight competitors, or specialized insects, like Agonum quadripunctatum or Denticollis borealis (Nylin, 2006), that only emerge in the special environment created by the aftermath of forest fires. In these very general terms, the external environment is represented by changes in the system itself, changes that may in turn act as responses and generate further interaction with the environment, and the species’ evolutionary history has prompted the organism to select which segment of the environment it represents.

The infant, too, must be regarded as a biological system integrated in and interacting with its environment. The infant’s environment is represented by sensory information conveyed by the sensory transducers interfacing with the environment, i.e. the changes that the environment variables induce in the specialized sensory transducers. To be sure, the infant’s sensory system’s response to the environment variables is not time-independent. Typically, as the infant learns and develops, its responses will change not only as a consequence of the exposure to external stimuli but also as a consequence of the very changes in internal, global variables that are induced by that exposure. Given that the infant is continuously exposed to parallel sensory input simultaneously available from different sensory modalities, there is the potential for combining this multimodal information into different layers of interaction. The hippocampus, for instance, is thought to be a structure capable of integrating different sources of sensory information in implicit memory representations (Degonda et al., 2005). However, the perspective in this article is more functional and somewhat abstract. Sensory inputs are seen as n-dimensional representations of the external variables, and such representations may be continuously passed to a general-purpose memory that will retain at least part of that information. The concept of memory is also very general. It is a general-purpose memory space on which continuous sensory input is represented by changes in the detail-state of the organism, changes that can be modelled as Hebbian learning (Munakata & Pfaffly, 2004). The sensory input is continuously mapped in this memory space in a purposeless and automatic way, but representations that are not maintained tend to fade out with time.
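A minimal sketch of this kind of general-purpose memory is given below, assuming a simple Hebbian co-occurrence update combined with exponential decay; the discretization, learning rate and decay constant are illustrative assumptions rather than the authors’ implementation.

import numpy as np

# Minimal sketch: Hebbian co-occurrence memory over two discretized sensory
# dimensions, with exponential decay so that unrefreshed traces fade out.
N_AUDIO, N_VISUAL = 50, 50          # illustrative discretization of two sensory dimensions
memory = np.zeros((N_AUDIO, N_VISUAL))
LEARNING_RATE = 1.0                 # illustrative value
DECAY = 0.999                       # per-event decay factor (illustrative value)

def present(audio_idx, visual_idx):
    """Store one multi-sensory event: decay all traces, strengthen the co-occurring pair."""
    global memory
    memory *= DECAY
    memory[audio_idx, visual_idx] += LEARNING_RATE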

The infant’s ecological settings

In its ecological setting, the infant is inevitably exposed to huge amounts of sensory input that at first sight may appear to be unstructured. At a closer look, however, the infant’s immediate world is indeed highly structured in the sense that the status of the parallel sensory inputs implicitly reflects aspects of the structure of the objects and events that are being perceived (Gibson & Pick, 2000). Language is part of this and the infant’s early exposure to it is indeed also highly context-bound and structured. In typical infant-adult interactions, the speech used


by the adult is attention-catching, circumstantial and repetitive, making this sort of speech attractive for the infant, context-bound and easily related to external objects and actors. Under these circumstances, constantly storing general sensory input under exposure to the statistical regularities embedded in the external world must lead to the emergence of the underlying structure, because memory decay effectively filters out infrequent exposures. So far, this formulation may seem rather close to classic Skinnerian reasoning and exposed to the classical criticism calling on the “poverty of the stimulus” argument. However, taking into account the very variance of the input, it is possible to turn the argument on its head and instead use that variance as a resource for language learning.

From a global and long-term perspective language obviously builds on regularities at many levels, reflecting the conventionalized communication system with which individuals can share references to objects and actions in their common external, as well as internal and imaginary, worlds. On a short-term basis, the statistical stability of those regularities may not always be apparent, especially for the newborn. However, lacking long-term experience with language and knowledge of the world, the newborn is happily unaware of the potential problems that linguists assume infants to have. In other words, early language acquisition is not seen as a problem since the infant presumably does not have a teleological perspective. The infant may very well converge to adult language as a result of a parsimonious and “low-key tinkering process” that, under the pressure of local contingencies, leads to disclosure of the implicit structure of language. In fact, just because language is always used in a context – particularly at the early stages of infant-adult interactions, where language tends to focus on the proximate context – there are plenty of systematic relations between the acoustic manifestations of language and their referents.

Sketching an ecological setting of early language acquisition

General background

The present sketch of the language acquisition model assumes that sensory input is represented by values in an n-dimensional space, where each dimension arbitrarily corresponds to a single sensory dimension. Obviously, in this generic model, sensory dimensions like auditory input may themselves be further represented by m-dimensional spaces to accommodate different relevant auditory dimensions, but for this sketch a general and principled description will be sufficient. These external stimuli are mapped by the sensory system into the internal representation space. The mapping process does not involve any explicit interpretation of the input stimulus. It is viewed essentially as a sensory map affected by sensory limitations and the stochastic representation noise inherent to the system. This means that rather than directly converting the input into coordinates on the representation space, the sensory system adds a certain amount of noise to the input, reflecting the stochastic nature of neuronal activity. As a consequence of the added noise, physically identical external stimuli will tend to be mapped onto overlapping activity distributions centred at neighbouring rather than identical coordinates on the representation space. At this stage it is assumed that the neuronal noise is uncorrelated with the input and that it has zero mean value, an assumption that essentially means that the noise’s long-term effect can be disregarded.
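The mapping step just described can be sketched in a few lines; the dimensionality and noise level below are arbitrary illustrative choices, not values taken from the model.

import numpy as np

# Sketch: map an external stimulus onto the internal representation space by
# adding zero-mean Gaussian noise; no interpretation of the stimulus is involved.
rng = np.random.default_rng(0)
N_DIMS = 2           # e.g. one auditory and one visual dimension (illustrative)
NOISE_SD = 3.0       # stochastic representation noise, zero mean (illustrative)

def map_to_representation(stimulus):
    return np.asarray(stimulus, dtype=float) + rng.normal(0.0, NOISE_SD, size=N_DIMS)

# Physically identical stimuli land at neighbouring, not identical, coordinates:
print(map_to_representation([40.0, 70.0]))
print(map_to_representation([40.0, 70.0]))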

To calibrate the proposed model, taking into consideration its ecological context it is convenient to start out with an estimate of the magnitude and range of variation available in the model’s n-dimensional sensory space. To create a manageable model, a 2-dimensional space, supposedly representing the auditory and visual (A-V) sensory inputs7, was chosen

7 This implies a linearization of the auditory and visual components, meaning that each possible auditory spectrum is assigned a unique point along the “auditory axis” and a similar representation for the visual stimuli, a transformation that indeed does not really capture the natural dimensionality of the auditory and visual spaces.


here. Activity on a location of the representational space corresponding to these two dimensions is a manifestation of co-occurrences of rather specific auditory and visual stimulations that are stochastically associated with the location’s coordinates. Uncorrelated occurrences of auditory and visual sensory inputs will tend to scatter the corresponding representation activities. As a consequence, uncorrelated or random A-V activity tends not to build up at specific locations and does not contribute to structuring the representation space because activity will be smeared over a wide region. Looking at the sensory input from the perspective of the representational space, this lack of heightened localized activity means that there is no systematic relation between the input stimuli. But in the context of early language acquisition, uncorrelated A-V representations are not much more than a simple academic abstraction. Because language is used in a coherent way with its referents, ecologically relevant language acquisition settings are much more likely to provide correlated audio-visual information than not. In terms of the model, this will tend to raise the activity level in specific locations in the representation space. Put in these terms, a crucial question concerns the probability of hitting approximately the same location in the representation space, when the sensory input is mediating random, uncorrelated events.

However, the notion of underlying sensory contingencies is not new. Behaviourists have attempted to account for learning in terms of reinforced stimulus-response correlations and have been confronted with the “poverty of the stimulus” argument. It is therefore convenient to take a look at the dimensionality of this A-V search space in order to estimate the number of possible discriminable events in the auditory space that may be created by all possible combinations of discriminable intensity levels and frequency bands. To simplify the calculations, aspects like lateral suppression and masking effects that might affect the ability to discriminate between some of the theoretical sounds in this set are not taken into account. In reality, those effects would slightly reduce the number of distinct sounds, but that is largely compensated for by the rather conservative assumptions on discriminable frequency and intensity steps. On the intensity dimension, it is assumed that level differences of at least 5 dB can be discriminated, which yields approximately 17 steps along the intensity dimension throughout the audible spectrum, taking into account the frequency dependency of the hearing threshold. The frequency range between 100 Hz and 4000 Hz is assumed to be discriminable also in 17 steps, corresponding approximately to 17 Bark scale intervals. It is assumed that the linguistically relevant sounds can be represented by their crude energy distribution along these 17 Bark bands, a rather conservative frequency range from which speech sounds like many fricatives tend to be excluded.

Viewing the auditory stimuli as instances of 17-band spectra with 17 possible intensity steps per band leads to an estimate of 17^17 (≈ 10^20) potentially discriminable spectral contours. This is, of course, a crude estimate whose main goal is to map possible acoustic stimuli onto a one-dimensional linear representation, but it nevertheless gives a feeling for the range of possible spectral variation of relevant acoustic stimuli. An important consequence of estimating the domain of acoustic variation is that it gives an indication of the likelihood of hitting a given point in this space within a plausible time-window. 10^20 is an amazingly large number. To get a feeling for its magnitude it may be imagined that each of those possible 10^20 spectra is represented by a cell in a human body. With this analogy, it would be necessary to have about 10 million individuals, weighing 100 kg each, in order to obtain the 10^20 cells corresponding to this crude estimate of the acoustic search space8. In such a space, hitting a given neighbourhood twice is an extremely significant event. Indeed, exploring the cell analogy, imagine that all those 10 million people pick their travel destinations on the earth at random.

8 The calculation is based on the assumption that there are approximately 100 million cells per gram of biological tissue, relatively independently of the tissue type (Werner, 2006).
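A few lines of arithmetic reproduce the order-of-magnitude figures above; the 5 dB and 1 Bark step sizes, and the 10^8 cells-per-gram figure, are the assumptions stated in the text.

# Sketch: order-of-magnitude check of the acoustic search-space estimate above.
intensity_steps = 17                  # ~5 dB steps across the audible dynamic range
frequency_bands = 17                  # ~1 Bark bands between 100 Hz and 4000 Hz
spectra = intensity_steps ** frequency_bands
print(f"17^17 = {spectra:.1e}")       # on the order of 10^20 discriminable spectral contours

# Cell analogy, using the text's round figure of 10^20 spectra and the assumption
# of about 10^8 cells per gram of tissue (Werner, 2006):
cells_per_person = 1e8 * 100e3        # cells per gram times grams in a 100 kg person
print(f"{1e20 / cells_per_person:.0e} people needed to supply one cell per spectrum")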


Chances are vanishingly small that the same two people will happen to meet twice on those trips. Therefore, if one happens to meet the same neighbour on two trips, it is reasonable to view that second meeting as a very significant event, potentially suggesting that one is being followed by that neighbour. But even if it is not exactly the same neighbour that is met twice but just one of the nearest neighbours, the assumption of random trips suggests that even this is highly significant, given the vanishingly small probability of that recurring random event. In other words, because the acoustic search space is so vast, recurrent events are extremely unlikely to arise at random and are therefore highly significant if they occur. Adding another sensory dimension to the auditory space, like the visual dimension, enhances the significance of recurrent correlated audio-visual events even more. Given the extremely low probabilities of repeated audio-visual events generated by random associations, such repeated instances of correlated audio-visual events are even more unlikely than when the auditory dimension was considered alone. Thus, under ecologically relevant conditions an organism can “take for granted” these highly significant events and interpret them as systematic associations. In other words, events recurring a couple of times in this audio-visual space are so unlikely to happen at random that the organism can take them for granted. Of course such a generalization is not “safe” in absolute terms, since the organism “jumps to conclusions” that are not clearly supported by the available data. However, the generalization risks involved in such “hasty” conclusions provide the organism with useful power in structuring its immediate environment, and humans do seem to be prone to jumping to conclusions regarding potential relationships between different observed phenomena (Dawkins, 1998). Such a generalization power has to be captured by the language acquisition model because it introduces a very important qualitative discontinuity in the interactive process between the organism and its environment. The over-generalization provides the organism with a hypothesis about the structure of the environment, a hypothesis that is taken for granted and used as an axiomatic building block in the on-going interaction with the organism’s context. In other words, the organism generates a hypothesis about its relation with the environment and uses it as a framework to structure future interactions, although it does not necessarily have full information about the potential correctness of that decision (Bechara et al., 1997).9 George Kelly’s Theory of Personality and his psychology of personal “constructs” (Kelly, 1963) also provide an interesting insight into the potential early mechanisms of learning, because they can be used to draw parallels between the associations that individuals learn to view as unquestionable truths, playing a fundamental role in their personality traits, and the type of fundamental and unquestioned links that individuals learn to establish between the sounds of words and the objects (or actions) they refer to. Indeed, just as in the establishment of constructs that are accepted as self-evident truths, although they may be based on a few observations and sometimes wrong generalizations that become integrated in the individual’s personality traits, early language acquisition is likely to be full of similarly obvious “constructs” linking sound sequences to the events they are likely to refer to.
Such “constructs” in the domain of language acquisition are sound-meaning relations that are simply assumed to be unquestionable truths and, just because of that, are useful as obvious building blocks in the emerging speech communication process.

The significance of co-occurrences

The domain of possible audio-visual variance is, as mentioned above, enormous and is therefore not practical to deal with directly.

9 According to these authors, frontal lobe injured patients lacked the ability to “jump to conclusions” until enough data were gathered to support a less risky conclusion. Although the particular mechanism of non-conscious somatic markers proposed by the authors has subsequently been questioned (Maia & McClelland, 2004) and is the object of further discussion (Bechara et al., 2005), their study still illustrates the healthy subjects’ propensity to integrate whatever information they have available in order to reach a quick and correct decision.


To simulate a realistic learning situation, the model was scaled down for computational purposes, although reducing the domain of the audio-visual space also affects the likelihood of hitting specific coordinates in that domain. Thus, to maintain realistic proportions between the domain and the likelihood of random hits, it is necessary to downscale the time dimension as well. This means that reducing the number of cells in the audio-visual space must lead to a proportional reduction in the number of events processed by the model.

The data used to calibrate the model come from a 12-minute mother-infant session. The session was transcribed and the absolute and relative frequencies of word types and tokens were calculated. The total number of word types throughout the session was 172, including play and hesitation utterances. The relative and absolute numbers of the top 80% word occurrences are shown in figure 3.


A typical aspect of this kind of token distribution is the high-frequency use of a very limited number of tokens. Of the 960 words used during the session, no more than 17 words (about 1.7%) account for 475 uses (i.e. nearly 50%), as listed in the table below. This distribution is somewhat similar to the observations of word use in adult speech communication, but the proportions of item re-use are rather different. In a typical 3-minute sample of adult-directed speech recorded from a radio interview with a Swedish politician, about 10% of the items accounted for the same 50% of word use, indicating therefore a much wider variation in the lexical items than in the case of infant-directed speech. Thus, the high repetition rate of a limited number of words embedded in slightly different linguistic and prosodic frames is likely to provide the infant with a highly efficient exposure to linguistic material from which the phonetic core of the tokens emerges.
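The kind of frequency analysis behind figure 3 and table 1 can be sketched as below; the token list is a short stand-in for the transcribed session, not the actual data.

from collections import Counter

# Sketch: type frequencies and cumulative coverage, as in table 1.
# `tokens` stands in for the transcribed session (960 word tokens in the original).
tokens = "du va har puss du titta du myran hej du va titta myran du".split()

counts = Counter(tokens)
total = len(tokens)
cumulative = 0
print("word      freq  cum.freq  percent  cum.percent")
for word, freq in counts.most_common():
    cumulative += freq
    print(f"{word:<10}{freq:>4}{cumulative:>10}{100 * freq / total:>8.1f}%{100 * cumulative / total:>12.1f}%")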

[Figure 3 (bar chart): token frequencies and percentages of the top 80% word tokens in the infant-directed speech sample; the horizontal axis lists the word tokens, the vertical axes show frequency (0-60) and percentage (0.0-6.0).]

Figure 3. Relative frequencies and percentages of the top 80% forms used in a 12-minute Infant-Directed Speech session.

Word      Word translation   Frequency   Cumulative frequency   Percent   Cumulative percent
du        you                53          53                     5.5%      5.5%
va        was, how           43          96                     4.5%      10.0%
har       has                40          136                    4.2%      14.2%
puss      kiss               36          172                    3.8%      17.9%
Maria     Maria              30          202                    3.1%      21.0%
myran     ant                30          232                    3.1%      24.2%
är        is                 29          261                    3.0%      27.2%
sett      seen               28          289                    2.9%      30.1%
brum      (play sound)       27          316                    2.8%      32.9%
hej       hi                 25          341                    2.6%      35.5%
titta     look               23          364                    2.4%      37.9%
den       the                20          384                    2.1%      40.0%
ja        yes                20          404                    2.1%      42.1%
de        they, them         19          423                    2.0%      44.1%
då        then               19          442                    2.0%      46.0%
goddag    hello              17          459                    1.8%      47.8%
här       here               16          475                    1.7%      49.5%
säjer     says               16          491                    1.7%      51.1%

Table 1. Tokens accounting for 50% of the occurrences in a 12-minute Infant-Directed Speech session.


A more detailed analysis of the material from the infant-directed speech, segmented into 60 s chunks, discloses an even more focused structure in the linguistic content of the infant-directed speech. Indeed, during the first and the second minutes of the session the mother introduces a toy that she refers to as “myra” (Eng. ant) and the target word “myra” is repeated over and over again, as long as the infant’s visual focus of attention is directed towards the toy. During the first minute the mother produces 122 word tokens, within which there are 10 occurrences of the target word, 8 occurrences of the attention-modulating interjection “hej” (hi), 7 occurrences of the infant’s name “Maria” and 5 occurrences each of the words “du”, “här” and “är” (you, here and is/are). In other words, only six word types account for about 1/3 of the total number of words during the first minute. Using the binomial distribution to estimate the probability that 10 occurrences of the same word type would be produced within this 60 s time window, under the assumption that all types have equal probabilities of being produced, yields a probability of p<0.000235. Obviously, the probability of having one of the 47 types being repeated by chance 10 times among the 122 produced tokens is extremely low. Therefore the fact that such repetitions actually occur in infant-directed speech is an extremely significant event, whose importance can be taken for granted. Figure 4 displays the probabilities of different numbers of random occurrences of a given type for this first minute of the session. This pattern of significant repetitions is maintained throughout the session. For the second minute, for instance, the probability of random token repetitions associated with the target word is still as low as p<0.000984, for 28 types, 105 tokens and 11 repetitions of the most frequent word.
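The style of calculation can be reproduced with a few lines of code. The sketch below assumes, as the text does, that each type is equally likely on every token, so that the number of occurrences of a given type is binomially distributed; whether the quoted p-values refer to the point probability or to the tail probability is not stated, so both are computed.

from math import comb

def binom_pmf(n, p, k):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(n, p, i) for i in range(k, n + 1))

# First minute: 122 tokens, 47 types assumed equally likely, 10 repetitions of the target word.
print(binom_pmf(122, 1 / 47, 10), binom_tail(122, 1 / 47, 10))
# Second minute: 105 tokens, 28 types, 11 repetitions of the most frequent word.
print(binom_pmf(105, 1 / 28, 11), binom_tail(105, 1 / 28, 11))
# Either way, the probabilities are vanishingly small compared with chance expectations.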

One might object to this kind of technical analysis as being too simplistic an exercise to account for a situation that is complex, very natural and familiar to many people. However, the value of the current approach is precisely that such rudimentary assumptions do expose essential aspects of the linguistic structure involved in the early language acquisition process. From a naturalistic point of view, repetitions occur as a consequence of the introduction of a new object that the mother intends to present to the infant. But from a human communication point of view, referring to objects in the shared external world is exactly one of the essential functions of language, particularly at its early stages of development.10 This is also the message from the probability estimates. If words were used at random it would be unlikely to observe the number of token repetitions that the infant is faced with in natural situations. Thus, it is the very obvious fact that language is not based on random use that underlines the significance of token repetitions. This is, of course, true both for adult-directed and for infant-directed speech, but the number of repetitions occurring in infant-directed speech is more unlikely than that observed in adult-directed speech, and this may very well trigger the infant’s initial focus on the audio-visual contingencies that are thought to underlie the early language acquisition process. Indeed, these repetitions are not solely an auditory phenomenon and, because they tend to be highly correlated with other sensory inputs, such as the visual and often the tactile input, they build up salient multi-sensory representations in the environment shared by the mother-infant dyad. In other words, recurrent audio-visual contingencies against the background of overall possible sensory variation are so unlikely to occur at random that a few repeated co-occurrences in this multi-sensory information domain can be treated as “sure indications” of the outside world’s structure. However, the current theoretical model does not propose that the infant actually seeks this type of correlation in order to learn its ambient language. In fact, it is assumed that detection of those important contingencies may initially be underpinned by general-purpose memory processes, provided the processes’ time-windows are long enough to store some of these repeated events. Given the probability of

10 It goes without saying that the function of spoken language also includes other complex aspects such as behaviour regulation and information exchange.


repetitions in the early infant-directed speech, time windows of a few seconds will initially be enough to detect some of the recurrent audio-visual patterns in infant-directed speech.

[Figure 4 (plot): probability of random token repetitions from a pool of 47 types and 122 trials; x-axis: number of repetitions (0-20), y-axis: probability (0.0000-0.3000), with the observed repetition count marked as significant.]

Figure 4. Probability of random token repetitions considering a pool of 47 types and 122 trials.

While the audio-visual contingencies described above may trigger the early language acquisition process in infants, it is obvious that adults’ speech does not consist of series of isolated words like the ones depicted above, but rather of chains of coarticulated speech sounds that the young language learner has to deal with in order to learn its ambient language. Dealing with chunks of speech sounds that are essentially continuous and not properly separated into word-like sequences raises a new challenge for the language acquisition process: how can words emerge from a continuous speech signal when the young language learner is assumed to lack linguistic insights?

An ecological theory of early language acquisition

What is the potential significance of repetitions in the speech material that the infant is exposed to? In order to have an ecologically relevant processing time frame, it is assumed that the young infant makes a holistic auditory assessment of the speech signal. This means that in the absence of linguistic knowledge the infant will partition the incoming speech signal mainly on the basis of its acoustic level contour. In other words, the infant is assumed to perceive the signal as a non-segmented train of on and off sequences of sounds where silences in between utterances are the only delimiters. To mimic this situation, the utterances produced by one mother were represented as a series of character strings corresponding to the utterances’ orthographic transcriptions, but without considering any between-word boundaries that were not associated with an actual pause in the speech material (Lacerda et al., 2004a; Lacerda et al., 2004b). The actual model input, corresponding to the first 60 s of mother-infant interaction, was represented as shown below, where the commas simply represent pauses


between successive utterances. The timeline of the original recording is kept in this sequence (figure 5).

hej, skavilekamedleksakidag, skavigöradet, ha, skavileka, tittahär, kommerduihågdenhärmyran, hardusett, titta, nukommermyran, dumpadumpadump, hej, hejsamyran, kommerduihågdenhärmyranmaria, ådenhärmyransomdublevalldelesräddförförragången, ja, mariablevalldelesräddfördenhärmyran, ädemyranoskar, myranoskaräde, tittavafinmyranä, åvamångaarmarhanhar, hardusettvamångaarmarmaria, hejhejmaria, hallå, hallå, hallåhallå, hejhej, ojojsågladsäjerOskar, ämariagladidag, ämariagladidag, ha, hejmariasäjermyran

Figure 5. Transcript of the input submitted to the model. The strings represent utterances separated by pauses. Words are concatenated if produced without a pause in between.

Once a pause is detected, the on-going recording of the utterance is stopped and the utterance is stored unanalyzed in memory. The new utterance is then compared with previously stored utterances on a purely holistic basis, in order to find a possible match between the new

utterance and those already stored. The search for a possible match is performed by considering two utterances at a time, one being the newly stored utterance and the other an utterance drawn from the set already stored in memory. The shortest utterance in the pair is taken as a pattern reference and the other sequence is searched for a partial match with the pattern defined by the reference. If a match is found (figure 6), the common portion is assumed to be significant, in accordance with the principles presented in the previous section, and gets its level of memory activity increased. The rationale here is that in the huge audio-visual space the likelihood of randomly finding two similar strings is vanishingly small, as discussed above, which means that repeated items can immediately be taken as good lexical candidates. The first loop of matches is displayed in figure 6, where n1 and n2 refer to the items being compared and the pairs in curly brackets indicate the positions of the matched elements in n2. The result of this process yields the activated items that are displayed in figure 7. These items now become part of the model’s “lexical inventory”.

n1=5 n2=2 -> {{1, 9}}   n1=8 n2=4 -> {{1, 2}}

Items in this lexical inventory are now matched against both new and old (unanalyzed but stored in memory) input utterances. The effect of this procedure is that the items in the lexical inventory can now be treated as (temporarily) acquired and therefore removed from the input utterances, thereby exposing the non-analyzed portions of the utterances. Applying this procedure to the original material leads to the non-analyzed chunks listed below. The procedure can be applied recursively to the list of non-analyzed chunks and in this case a new lexical item (“maria”) can be derived.

n1=9 n2=6 -> {{1, 5}}   n1=12 n2=1 -> {{1, 3}}   n1=13 n2=12 -> {{1, 3}}   n1=13 n2=1 -> {{1, 3}}   n1=14 n2=7 -> {{1, 23}}   n1=20 n2=9 -> {{1, 5}}   n1=21 n2=4 -> {{14, 15}, {17, 18}}   n1=22 n2=8 -> {{1, 9}}   n1=22 n2=4 -> {{1, 2}}   n1=23 n2=12 -> {{1, 3}, {4, 6}}

Figure 6. Example of identification of recurrent patterns: n1 and n2 indicate the positions in the transcript displayed in figure 5 of the utterances being compared. The numbers within curly brackets specify the substrings of n2 that match the content of n1.


Raw matches across utterances: skavileka, ha, titta, hej, hej, hej, kommerduihågdenhärmyran, titta, ha, hardusett, ha, hej, hej, ha, hallå, ha, hallå, hallå, ha, hejhej, hej, hej, ämariagladidag, ha, ha, ha, ha, ha, ha, ha, hej, hej

Independent lexical items: ämariagladidag, ha, hallå, hardusett, hej, hejhej, kommerduihågdenhärmyran, skavileka, titta

Figure 7. Raw matches and unique lexical items detected in the utterances of figure 5.

Further runs of the procedure on this limited material do not add to the number of stored lexical candidates. In a more realistic version of the model, the representations of both the audio input and the lexical candidates would have to be affected by memory decay, but that is disregarded in the present example, as if the memory span were long enough to represent the 60 s of input without appreciable degradation.
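The matching procedure described above can be sketched as follows. The text does not specify the exact matching criterion, so the sketch below uses the longest common substring between two utterances (with a minimum length to avoid trivial matches) as the shared, “significant” portion; the minimum length and the excerpt of utterances are illustrative assumptions.

# Sketch of the holistic matching procedure: repeated portions of utterances become
# lexical candidates, which are then stripped from the stored utterances to expose
# residual, not-yet-analyzed chunks.
utterances = ["hej", "skavilekamedleksakidag", "skavigöradet", "ha", "skavileka",
              "tittahär", "kommerduihågdenhärmyran", "hardusett", "titta"]  # excerpt of figure 5

def longest_common_substring(a, b):
    ref, other = (a, b) if len(a) <= len(b) else (b, a)   # shorter utterance is the pattern reference
    best = ""
    for i in range(len(ref)):
        for j in range(i + len(best) + 1, len(ref) + 1):
            if ref[i:j] in other:
                best = ref[i:j]
            else:
                break
    return best

lexicon = set()
for k, new_utt in enumerate(utterances):
    for old_utt in utterances[:k]:                 # compare the new utterance with stored ones
        match = longest_common_substring(new_utt, old_utt)
        if len(match) >= 2:                        # minimum match length (illustrative)
            lexicon.add(match)                     # repeated portion becomes a lexical candidate
print(sorted(lexicon))

# Second pass: strip acquired items from the utterances, exposing residual chunks.
residuals = list(utterances)
for item in sorted(lexicon, key=len, reverse=True):
    residuals = [u.replace(item, " ") for u in residuals]
print(residuals)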

The procedure used for the simulated audio input can also be applied to the visual input. The visual component can be thought of as a series of images entering a short-term visual memory. This series of images is associated with a series of visual representations that will

tend to overlap to the extent that there are common elements throughout the series.11 Thus, the overlapping regions of the visual representations will implicitly yield information on recognizable visual patterns. And, just as in the case of the audio input, recurring visual patterns can be treated as significant elements, given the low likelihood that the same visual input would be repeated at random. This is illustrated in the following example.

medleksakidag, här, samyran, maria, vafinmyranä, åvamångaarmarnr, vamångaarmarmaria, maria, mariasäjermyran

Figure 8. Additional lexical candidates identified in a second iteration, using the lexical items identified in the first iteration.

Working on the audio-visual space

In this example it is assumed that the auditory and the visual dimensions can be linearized and arbitrarily represented by 100 steps along each of the dimensions.12 A series of 5000 possible audio-visual events was created by combining uniform random distributions for both the audio and the visual dimensions. To mimic real-life audio-visual presentations of objects, the probability of three arbitrary but correlated audio-visual events was increased. Figure 9 illustrates this generated audio-visual space, where the three areas with heightened dot density reflect the referential use of the auditory information.

11 This is obviously only a description of the principle by which visual similarity may be detected. The computational processing required to establish matches between similar images has ultimately to be expressed in terms of convolutions, rotations, translations and scaling transformations, but at this point the focus is on the very general principles of visual processing, not on their explicit mathematical implementation. 12 This is, of course, a microscopic domain in comparison with the actual physical world but it has the advantage of allowing the computations to be completed before retirement.


[Figure 9: scatter plot of the simulated audio-visual events over the 100 × 100 representation space; both axes run from 0 to 100.]

Figure 9. Illustration of random A-V occurrences containing three recurrent A-V associations. One of the axes represents the visual dimension and the other represents the auditory dimension.

As the sensory input is mapped onto the representation space, it is affected by both sensory smearing and memory decay. Memory decay reflects a decreasing activity in the representation space as a function of time. In this model memory is simply a volatile storage of mapped exemplars associated with the multi-sensory synchronic inputs.13 New representations in this space add activity to the previously existing activity profile. In this representation space, overlapping neighbouring distributions convey an implicit similarity measure, which makes sensory smearing crucial for the establishment of similarity patterns. The smearing value is thus critical for the model output. Too narrow a smearing value leads to an overestimation of the number of categories in the representation space, whereas too broad a smearing spreads information in a meaningless way all over the representation space.14 In this model smearing is represented by two-dimensional Gaussian distributions centred on the incoming auditory-visual stimulus values. In the present example, based on 5000 simulated events, the memory decay was set so that a level of 30% of the initial activity would be reached after about 2000 time units (events). The sensory smearing was set to 30 sensory units in each of the two dimensions. Figure 10 displays the activity landscape after exposure to the 5000 stimuli shown above. The height of the hills in this landscape reflects both the cumulative effect of similar sensory inputs and proximity in time (memory decay implies that older representations tend to vanish).
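The simulation just described can be sketched in a few lines of code. The three boosted coordinate pairs, the proportion of correlated events and the per-event decay formula below are illustrative assumptions filling in details the text does not specify; only the 100 × 100 space, the 5000 events, the 30-unit smearing and the 30%-after-2000-events decay come from the text.

import numpy as np

# Sketch: 5000 audio-visual events on a 100 x 100 space, three recurrent (correlated)
# A-V pairs, Gaussian sensory smearing and exponential memory decay.
rng = np.random.default_rng(1)
SIZE, N_EVENTS, SMEAR = 100, 5000, 30.0
DECAY = 0.3 ** (1.0 / 2000.0)                     # ~30% of the activity remains after 2000 events
boosted_pairs = [(20, 70), (50, 30), (80, 80)]    # recurrent A-V associations (assumed coordinates)

a_grid, v_grid = np.meshgrid(np.arange(SIZE), np.arange(SIZE), indexing="ij")
landscape = np.zeros((SIZE, SIZE))

for _ in range(N_EVENTS):
    if rng.random() < 0.15:                       # correlated, "referential" event (assumed rate)
        a, v = boosted_pairs[rng.integers(len(boosted_pairs))]
    else:                                         # uncorrelated background event
        a, v = rng.integers(SIZE), rng.integers(SIZE)
    landscape *= DECAY                            # memory decay: older traces fade
    landscape += np.exp(-((a_grid - a) ** 2 + (v_grid - v) ** 2) / (2 * SMEAR ** 2))  # smeared trace

landscape /= landscape.sum()                      # normalise for display
print(landscape.max(), np.unravel_index(landscape.argmax(), landscape.shape))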

13 Synchrony does not introduce a lack of generality since asynchronic but systematically related events can be mapped onto synchronic representations simply by introducing adequate time delays. 14 The ability to organize information into equivalence classes is undoubtedly a central aspect of the organism’s information handling capacities. Narrow discrimination ability may, by itself, become an obstacle to the information structuring process. Some of the language and learning disabilities observed in children may in fact be attributed to inability to disregard details in the incoming sensory information.


[Figure 10: 3-D surface plot of memory activity over the 100 × 100 audio-visual space; both horizontal axes run from 0 to 100 and the activity axis runs up to about 0.0001.]

Figure 10. Illustration of the memory activity created by the A-V stimuli simulated in figure 9, taking into account memory decay.

The model generates a representational space in which correlated external information tends to cluster into denser clouds, as a result of memory decay interacting with the frequency of co-occurrence of the different sensory inputs. In the context of language acquisition, it may be expected that the initial regularities conveyed by the use of frequent expressions to refer to objects or actors in the infant’s immediate neighbourhood will suffice to generate stable enough correlations between the recurrent sounds of the expressions and the referents they are associated with. Such stable representations form specific lexical candidates, both words and lexical phrases, that somehow correspond to events or objects in the linguistic scene. Obviously these “words” do not necessarily correspond to established adult forms. They are simply relatively stable and non-analyzed sound sequences, like the ones picked up by the first model above, that can be associated with other sensory inputs. Once this holistic correspondence between sound sequences and their referents is discovered, it is possible for the infant to focus on and explore similar regularities, thereby bringing potential structure to the speech signal that the infant is exposed to. In summary, the current model starts off with a crude associative process but it can rapidly evolve towards an active exploration and crystallization of linguistic regularities. Applied to actual language acquisition settings, the model suggests that the infant quickly becomes an active component of the language discovery process. Language acquisition starts by capitalizing on general sensory processing and memory mechanisms, which tend to capture sensory regularities in the typical use of language. Recurrent consistency in the structure of the input stimuli leads to clustering in the representation space. The sensory relationships implicit in these clusters can subsequently be used in cognitive and explorative processes that, although not designed to aim at language, converge on it as a result of playful and internally rewarding actions (Locke, 1996).

CONCLUSION

The question of whether language communication in general, and language development in particular, has to rely on specialized language-learning mechanisms opened this article, and it was argued that both may perhaps be seen as unfolding from general biological and ecological principles. Resolving the issue of specialized versus general mechanisms is virtually impossible if based on empirical data alone. All living organisms are the result of their own cumulative ontogenetic history, reflecting the combined influences of biological, ecological and interactive components.


To pave the way for the view that language acquisition may largely be accounted for by general principles, this article started with a short overview and discussion of possible evolutionary scenarios leading to the emergence of language communication. From the evolutionary perspective, the ability to communicate using symbols does not seem to be exclusive to Homo sapiens. A number of other species also represent relevant events of their ecological environment by using codes like sounds, mimicry or gestures, but they still fall short of human language because they seem to lack the capacity to use those symbols recursively (Hauser et al., 2002). In some sense, the process of early language acquisition can be seen as a modern revival of the evolutionary origins of language communication, and from this perspective the study of early language acquisition may provide unique insights into how intelligent beings learn to pick up the regularities of symbolic representation available in their immediate ecological environment. There are, of course, very important differences between the modern situation of language acquisition and the evolution of language communication itself. One of those differences is the fact that infants now are born into an ecological context where language is already established and coherently and profusely used in their natural environment. Another difference may be that Homo sapiens have evolved a propensity to attend spontaneously to relations between events and their sensory representations, a propensity that leads them to attend to rules linking events. Recent discoveries on mirror neurons (Rizzolatti, 2005; Iacoboni et al., 2005; Rizzolatti & Craighero, 2004; Rizzolatti et al., 2000), for instance, indicate that primates perceive meaningful actions performed by others by referring to the motor activation that would be generated if they were to perform the same action that they are perceiving and understanding. Activation of mirror neurons seems to be a requirement for understanding action (Falck-Ytter et al., 2006) and may also be relevant for the perception of speech (Liberman & Mattingly, 1985; Liberman et al., 1967).

The general importance of implicit information and our ability to extract embedded rules from experience with recurrent patterns has recently been demonstrated in experiments exploring more or less complex computer games with hidden principles (Holt & Wade, 2005). The situation created by these games is structurally similar to the typical adult-infant interaction situations described above, with the addition that in actual adult-infant interaction the adult tends to help the infant by responding on-line to the infant’s focus of attention, making the “hidden” rules much more explicit than in a game-like situation. At the same time, the complexity of the ecological scenario of adult-infant interaction is potentially much larger than what is often achieved in a computer game situation. But in this respect the infant’s own perceptual and production limitations will actually help to constrain the vast search space that it has to deal with. This is why some of the experimental evidence on infant speech perception, speech production and the interaction between adults and infants was reviewed here. The combined message from these three areas of research converges towards an image of the infant as being exposed to, and processing, its ambient language in a differentiated fashion. The recurrent asymmetry favouring high-low distinctions in infant perception and production, along with the adult’s interpretation and reinforcement of that asymmetry, provides a relevant bias for the infant’s future language development, as the acoustic signal may effectively be structured along this dominant dimension already in the early stages of the language acquisition process. However, going from exposure to speech signals to the acquisition of linguistic structure is not possible using acoustic information alone. An emerging linguistic structure must necessarily rely on the very essence of the speech communication process, i.e. the link between the acoustic nature of speech and other sensory dimensions. Since these aspects are necessarily interwoven in real-life data, a simplified model of the early stages of the language acquisition process was created to study the impact of simple assumptions. An important avenue for further model work is to investigate the basic constraints enabling the emergence of grammatical structure. By systematic control of the


model constraints and its linguistic input, it may be possible to gain some insight into the necessary and sufficient conditions for higher levels of linguistic development. At any rate, future mismatches between the model predictions and empirical data will hopefully lead to a better understanding of the real language acquisition process.

ACKNOWLEDGEMENTS

This work was supported by project grants from The Swedish Research Council, The Bank of Sweden Tercentenary Foundation (MILLE, K2003-0867) and the EU NEST programme (CONTACT, 5010). The first author was also supported by a grant from the Faculty of Humanities, Stockholm University, and the second author received additional support from The Anna Ahlström and Ellen Terserus Foundation. The authors are also indebted to the senior members of the staff at the Dept of Linguistics for their critical and helpful comments on earlier versions of this manuscript.

REFERENCES

Bechara, A., Damasio, H., Tranel, D., & Damasio, A. R. (2005). The Iowa Gambling Task and the somatic marker hypothesis: some questions and answers. Trends Cogn Sci., 9, 159-162.
Bechara, A., Damasio, H., Tranel, D., & Damasio, A. R. (1997). Deciding Advantageously Before Knowing the Advantageous Strategy. Science, 275, 1293-1295.
Bertoncini, J., Bijeljac-Babic, R., Blumstein, S. E., & Mehler, J. (1987). Discrimination in neonates of very short CVs. J.Acoust.Soc.Am., 82, 31-37.
Best, C. T., McRoberts, G. W., & Goodell, E. (2001). Discrimination of non-native consonant contrasts varying in perceptual assimilation to the listener's native phonological system. J.Acoust.Soc.Am., 109, 775-794.
Bickerton, D. (1990). Language & species. Chicago: University of Chicago Press.
Bruce, G. (1982). Textual aspects of prosody in Swedish. Phonetica, 39, 274-287.
Cheney, D. L. & Seyfarth, R. M. (1990). How monkeys see the world: inside the mind of another species. Chicago: University of Chicago Press.
Daeschler, E. B., Shubin, N. H., & Jenkins, F. A. (2006). A Devonian tetrapod-like fish and the evolution of the tetrapod body plan. Nature, 440, 757-763.
Davis, B. L. & Lindblom, B. (2001). Phonetic Variability in Baby Talk and Development of Vowel Categories. In F. Lacerda, C. von Hofsten, & M. Heimann (Eds.), Emerging Cognitive Abilities in Early Infancy (pp. 135-171). Mahwah, New Jersey: Lawrence Erlbaum Associates.
Davis, B. L. & MacNeilage, P. (1995). The articulatory basis of babbling. Journal of Speech and Hearing Research, 0000, 0000.
Dawkins, R. (1987). The Blind Watchmaker. New York: W.W. Norton & Company.
Dawkins, R. (1998). Unweaving the Rainbow. London: The Penguin Group.
Deacon, T. W. (1997). The symbolic species: the co-evolution of language and the brain. (1st ed.) New York: W.W. Norton.
Degonda, N., Mondadori, C. R., Bosshardt, S., Schmidt, C. F., Boesiger, P., Nitsch, R. M. et al. (2005). Implicit associative learning engages the hippocampus and interacts with explicit associative learning. Neuron, 46, 505-520.
Eimas, P. D. (1974). Auditory and linguistic processing of cues for place of articulation by infants. Perception and Psychophysics, 16, 513-521.
Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. Science, 171, 303-306.
Enquist, M., Arak, A., Ghirlanda, S., & Wachtmeister, C. A. (2002). Spectacular phenomena and limits to rationality in genetic and cultural evolution. Philos.Trans.R.Soc.Lond B Biol.Sci., 357, 1585-1594.
Enquist, M. & Ghirlanda, S. (1998). Evolutionary biology. The secrets of faces. Nature, 394, 826-827.
Enquist, M. & Ghirlanda, S. (2005). Neural Networks and Animal Behavior. Princeton and Oxford: Princeton University Press.
Falck-Ytter, T., Gredebäck, G., & von Hofsten, C. (2006). Infants predict other people's action goals. Nat.Neurosci., 9, 878-879.
Fant, G. (1960). Acoustic theory of speech production. The Hague, Netherlands: Mouton.
Fernald, A., Taeschner, T., Dunn, J., Papousek, M., de Boysson-Bardies, B., & Fukui, I. (1989). A cross-language study of prosodic modifications in mothers' and fathers' speech to preverbal infants. J.Child Lang, 16, 477-501.
Fort, A. & Manfredi, C. (1998). Acoustic analysis of newborn infant cry signals. Med.Eng Phys., 20, 432-442.
Futuyama, D. J. (1998). Evolutionary Biology. Sunderland, Massachusetts: Sinauer Associates, Inc.
Gärdenfors, P. (2000). Hur Homo blev sapiens: om tänkandets evolution. Nora: Bokförlaget Nya Doxa.
Gay, T., Lindblom, B., & Lubker, J. (1980). The production of bite block vowels: Acoustic equivalence by selective compensation. The Journal of the Acoustical Society of America, 68, S31.
Gay, T., Lindblom, B., & Lubker, J. (1981). Production of bite-block vowels: Acoustic equivalence by selective compensation. The Journal of the Acoustical Society of America, 69, 802-810.
Gibson, E. J. & Pick, A. D. (2000). An Ecological Approach to Perceptual Learning and Development. Oxford: Oxford University Press.
Gil-da-Costa, R., Palleroni, A., Hauser, M. D., Touchton, J., & Kelley, J. P. (2003). Rapid acquisition of an alarm response by a neotropical primate to a newly introduced avian predator. Proc.Biol.Sci., 270, 605-610.
Gould, S. J. (1977). Ontogeny and phylogeny. Cambridge, Mass: Belknap Press of Harvard University Press.
Hauser, M. D. (1989). Ontogenetic changes in the comprehension and production of Vervet monkey (Cercopithecus aethiops) vocalizations. Journal of Comparative Psychology, 103, 149-158.
Hauser, M. D. (1992). Articulatory and social factors influence the acoustic structure of rhesus monkey vocalizations: a learned mode of production? J.Acoust.Soc.Am., 91, 2175-2179.
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The Faculty of Language: What Is It, Who Has It, and How Did It Evolve? Science, 298, 1569-1579.
Hauser, M. D. & Wrangham, R. W. (1987). Manipulation of food calls in captive chimpanzees. A preliminary report. Folia Primatol.(Basel), 48, 207-210.
Holt, L. L. & Wade, T. (2005). Categorization of spectrally complex non-invariant auditory stimuli in a computer game task. JASA, p. 1038.
Houston, D., Jusczyk, P., Kuijpers, C., Coolen, R., & Cutler, A. (2000). Cross-language word segmentation by 9-month-olds. Psychon.Bull.Rev., 7, 504-509.
Iacoboni, M., Molnar-Szakacs, I., Gallese, V., Buccino, G., Mazziotta, J. C., & Rizzolatti, G. (2005). Grasping the intentions of others with one's own mirror neuron system. PLoS.Biol., 3, e79.
Iverson, P. & Kuhl, P. (1995). Mapping the perceptual magnet effect for speech using signal detection theory and multidimensional scaling. J.Acoust.Soc.Am., 97, 553-562.
Iverson, P. & Kuhl, P. (2000). Perceptual magnet and phoneme boundary effects in speech perception: do they arise from a common mechanism? Percept.Psychophys., 62, 874-886.
Jacob, F. (1982). The possible and the actual. New York: Pantheon.
Jakobson, R. (1968). Child language, aphasia and phonological universals. The Hague: Mouton.
Johansson, S. (2005). Origins of language: constraints on hypotheses. Amsterdam; Philadelphia, PA: John Benjamins Publishing Company.
Johnson, E. K. & Jusczyk, P. (2001). Word segmentation by 8-month-olds: when speech cues count more than statistics. J.Mem.Lang., 44, 548-567.
Jones, S., Martin, R., & Pilbeam, D. (1992). The Cambridge Encyclopedia of Human Evolution. Cambridge: Cambridge University Press.
Jusczyk, P. (1999). How infants begin to extract words from speech. Trends Cogn Sci., 3, 323-328.
Kamo, M., Ghirlanda, S., & Enquist, M. (2002). The evolution of signal form: effects of learned versus inherited recognition. Proc.Biol.Sci., 269, 1765-1771.
Kelly, G. A. (1963). A theory of personality: The psychology of personal constructs. New York: W. W. Norton.
Knight, C., Studdert-Kennedy, M., & Hurford, J. R. (2000). Language: A Darwinian adaptation? In C. Knight (Ed.), Evolutionary Emergence of Language: Social Function and the Origins of Linguistic Form (pp. 1-15). New York: Cambridge University Press.
Kuhl, P. (1991). Human adults and human infants show a perceptual magnet effect for the prototypes of speech categories, monkeys do not. Perception and Psychophysics, 50, 93-107.
Kuhl, P., Andruski, J. E., Chistovich, I. A., Chistovich, L. A., Kozhevnikova, E. V., Ryskina, V. L. et al. (1997). Cross-language analysis of phonetic units in language addressed to infants. Science, 277, 684-686.
Kuhl, P. & Meltzoff, A. N. (1982). The bimodal perception of speech in infancy. Science, 218, 1138-1141.
Kuhl, P. & Meltzoff, A. N. (1984). The intermodal representation of speech in infants. Infant Behavior and Development, 7, 361-381.
Kuhl, P. & Meltzoff, A. N. (1988). The bimodal perception of speech in infancy. Science, 218, 1138-1141.
Kuhl, P. & Meltzoff, A. N. (1996). Infant vocalizations in response to speech: vocal imitation and developmental change. J.Acoust.Soc.Am., 100, 2425-2438.
Kuhl, P., Williams, K., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255, 606-608.
Kuhl, P. K. (1985). Categorization of speech by infants. In J. Mehler & R. Fox (Eds.), Neonate Cognition: Beyond the Blooming Buzzing Confusion (pp. 231-262). Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Lacerda, F. (1992a). Does Young Infants Vowel Perception Favor High-Low Contrasts? International Journal of Psychology, 27, 58-59.
Lacerda, F. (1992b). Young Infants' Discrimination of Confusable Speech Signals. In M. E. H. Schouten (Ed.), The Auditory Processing of Speech: From Sounds to Words (pp. 229-238). Berlin, Federal Republic of Germany: Mouton de Gruyter.
Lacerda, F. (2003). Phonology: An emergent consequence of memory constraints and sensory input. Reading and Writing: An Interdisciplinary Journal, 16, 41-59.
Lacerda, F. & Ichijima, T. (1995). Adult judgements of infant vocalizations. In K. Ellenius & P. Branderud (Eds.), (pp. 142-145). Stockholm: KTH and Stockholm University.
Lacerda, F., Klintfors, E., Gustavsson, L., Lagerkvist, L., Marklund, E., & Sundberg, U. (2004a). Ecological Theory of Language Acquisition. Epirob 2004, Genova.
Lacerda, F., Marklund, E., Lagerkvist, L., Gustavsson, L., Klintfors, E., & Sundberg, U. (2004b). On the linguistic implications of context-bound adult-infant interactions. Epirob 2004, Genova.
Lacerda, F. & Sundberg, U. (1996). Linguistic strategies in the first 6-months of life. JASA, p. 2574.
Lacerda, F., Sundberg, U., Klintfors, E., & Gustavsson, L. (forthc.). Infants learn nouns from audio-visual contingencies. Unpublished work.
Lane, H., Denny, M., Guenther, F. H., Matthies, M. L., Menard, L., Perkell, J. S. et al. (2005). Effects of bite blocks and hearing status on vowel production. J.Acoust.Soc.Am., 118, 1636-1646.
Liberman, A., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
Liberman, A. & Mattingly, I. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.
Liljencrants, J. & Lindblom, B. (1972). Numerical simulation of vowel quality systems: The role of perceptual contrast. Language, 48, 839-862.
Lindblom, B., Lubker, J., & Gay, T. (1977). Formant frequencies of some fixed-mandible vowels and a model of speech motor programming by predictive simulation. The Journal of the Acoustical Society of America, 62, S15.
Locke, J. L. (1996). Why do infants begin to talk? Language as an unintended consequence. J.Child Lang, 23, 251-268.
Lotto, A. J., Kluender, K. R., & Holt, L. L. (1998). Depolarizing the perceptual magnet effect. J.Acoust.Soc.Am., 103, 3648-3655.
MacNeilage, P. & Davis, B. L. (2000a). Deriving speech from nonspeech: a view from ontogeny. Phonetica, 57(2-4), 284-296.
MacNeilage, P. & Davis, B. L. (2000b). On the Origin of Internal Structure of Word Forms. Science, 288, 527-531.
Maddieson, I. (1980). Phonological generalizations from the UCLA Phonological Segment Inventory Database. UCLA Working Papers in Phonetics, 50, 57-68.
Maddieson, I. & Emmorey, K. (1985). Relationship between semivowels and vowels: cross-linguistic investigations of acoustic difference and coarticulation. Phonetica, 42, 163-174.
Maia, T. V. & McClelland, J. L. (2004). A reexamination of the evidence for the somatic marker hypothesis: what participants really know in the Iowa gambling task. Proc.Natl Acad.Sci.U.S.A, 101, 16075-16080.
Mandel, D. & Jusczyk, P. (1996). When do infants respond to their names in sentences? Infant Behavior and Development, 19, 598.
Mandel, D., Kemler Nelson, D. G., & Jusczyk, P. (1996). Infants remember the order of words in a spoken sentence. Cognitive Development, 11, 181-196.
Mattys, S. L. & Jusczyk, P. (2001). Do infants segment words or recurring contiguous patterns? J.Exp.Psychol.Hum.Percept.Perform., 27, 644-655.
Meltzoff, A. N. & Borton, R. W. (1979). Intermodal matching by human neonates. Nature, 282, 403-404.
Meltzoff, A. N. & Moore, M. K. (1977). Imitation of facial and manual gestures by human neonates. Science, 198, 75-78.
Meltzoff, A. N. & Moore, M. K. (1983). Newborn infants imitate adult facial gestures. Child Development, 54, 702-709.
Menard, L., Schwartz, J. L., & Boe, L. J. (2004). Role of vocal tract morphology in speech development: perceptual targets and sensorimotor maps for synthesized French vowels from birth to adulthood. J.Speech Lang Hear.Res., 47, 1059-1080.
Munakata, Y. & Pfaffly, J. (2004). Hebbian learning and development. Dev.Sci., 7, 141-148.
Nazzi, T., Jusczyk, P., & Johnson, E. K. (2000). Language Discrimination by English-Learning 5-Month-Olds: Effects of Rhythm and Familiarity. Journal of Memory and Language, 43, 1-19.
Nishimura, T., Mikami, A., Suzuki, J., & Matsuzawa, T. (in press). Descent of the hyoid in chimpanzees: evolution of face flattening and speech. Journal of Human Evolution.
Nowak, M. A., Komarova, N., & Niyogi, P. (2001). Evolution of Universal Grammar. Science, 291, 114-118.
Nowak, M. A., Plotkin, J. B., & Jansen, V. A. A. (2000). The evolution of syntactic communication. Nature, 404, 495-498.
Nylin, S. (6-7-2006). Personal communication.
Papousek, M. & Hwang, S. F. C. (1991). Tone and intonation in Mandarin babytalk to presyllabic infants: Comparison with registers of adult conversation and foreign language instruction. Applied Psycholinguistics, 12, 481-504.
Papousek, M. & Papousek, H. (1989). Forms and functions of vocal matching in interactions between mothers and their precanonical infants. First Language, 9, 137-158.
Polka, L. & Werker, J. F. (1994). Developmental changes in perception of nonnative vowel contrasts. J.Exp.Psychol.Hum.Percept.Perform., 20, 421-435.
Polka, L. & Bohn, O. S. (2003). Asymmetries in vowel perception. Speech Communication, 41, 221-231.
Rizzolatti, G. (2005). The mirror neuron system and its function in humans. Anat.Embryol.(Berl), 210, 419-421.
Rizzolatti, G. & Craighero, L. (2004). The mirror-neuron system. Annu.Rev.Neurosci., 27, 169-192.
Rizzolatti, G., Fogassi, L., & Gallese, V. (2000). Mirror neurons: Intentionality detectors? International Journal of Psychology, 35, 205.
Robb, M. P. & Cacace, A. T. (1995). Estimation of formant frequencies in infant cry. Int.J.Pediatr.Otorhinolaryngol., 32, 57-67.
Roug, L., Landberg, I., & Lundberg, L. J. (1989). Phonetic development in early infancy: a study of four Swedish children during the first eighteen months of life. J.Child Lang, 16, 19-40.
Saffran, J. R. (2002). Constraints on statistical language learning. J.Mem.Lang., 47, 172-196.
Saffran, J. R., Aslin, R. N., & Newport, E. (1996). Statistical learning by 8-month old infants. Science, 274, 1926-1928.
Saffran, J. R. & Thiessen, E. D. (2003). Pattern induction by infant language learners. Dev.Psychol., 39, 484-494.
Saffran, J. R. (2003). Statistical language learning: mechanisms and constraints. Current Directions in Psychological Science, 12, 110-114.
Savage-Rumbaugh, E. S., Murphy, J., Sevcik, R. A., Brakke, K. E., Williams, S. L., & Rumbaugh, D. M. (1993). Language comprehension in ape and child. Monogr Soc.Res.Child Dev., 58, 1-222.
Schick, B. S., Marschark, M., & Spencer, P. E. (2006). Advances in the sign language development of deaf children. Oxford: Oxford University Press.
Seidenberg, M. S., MacDonald, M. C., & Saffran, J. R. (2002). Does grammar start where statistics stop? Science, 298, 553-554.
Stern, D. N., Spieker, S., Barnett, R. K., & MacKain, K. (1983). The prosody of maternal speech: Infant age and context related changes. Journal of Child Language, 10, 1-15.
Stevens, K. N. (1998). Acoustic Phonetics. Cambridge, Mass.: MIT Press.
Sundberg, U. (1998). Mother tongue - Phonetic Aspects of Infant-Directed Speech. Stockholm University.
Sundberg, U. & Lacerda, F. (1999). Voice onset time in speech to infants and adults. Phonetica, 56, 186-199.
Tees, R. C. & Werker, J. F. (1984). Perceptual flexibility: maintenance or recovery of the ability to discriminate non-native speech sounds. Can.J.Psychol., 38, 579-590.
Tomasello, M. & Carpenter, M. (2005). The emergence of social cognition in three young chimpanzees. Monogr Soc.Res.Child Dev., 70, vii-132.
Tomasello, M., Savage-Rumbaugh, S., & Kruger, A. C. (1993). Imitative learning of actions on objects by children, chimpanzees, and enculturated chimpanzees. Child Dev., 64, 1688-1705.
van der Weijer, J. (1999). Language Input for Word Discovery. MPI Series in Psycholinguistics.
Werker, J. F., Gilbert, J. H. V., Humphrey, K., & Tees, R. C. (1981). Developmental aspects of cross-language speech perception. Child Development, 52, 349-355.
Werker, J. F. & Logan, J. S. (1985). Cross-language evidence for three factors in speech perception. Perception and Psychophysics, 37, 35-44.
Werker, J. F. & Tees, R. C. (1983). Developmental changes across childhood in the perception of non-native speech sounds. Canadian Journal of Psychology, 37, 278-296.
Werker, J. F. & Tees, R. C. (1992). The organization and reorganization of human speech perception. Annu.Rev.Neurosci., 15, 377-402.
Werner, L. (1992). Interpreting Developmental Psychoacoustics. In L. Werner & E. Rubel (Eds.), Developmental Psychoacoustics (1st ed., pp. 47-88). Washington: American Psychological Association.
Werner, S. (3-15-2006). Cellular density in human tissues. Personal communication.


Infants’ ability to extract verbs from continuous speech

Ellen Marklund and Francisco Lacerda

Department of Linguistics, Stockholm University, Stockholm, Sweden

[email protected]

ABSTRACT

Early language acquisition is a result of the infant's general associative and memory processes in combination with its ecological surroundings. Extracting a part of a continuous speech signal and associating it with, for instance, an object is made possible by the structure provided by the characteristically repetitive Infant Directed Speech. Parents adjust the way they speak to their infant based on the response they are given, which in turn depends on the infant's age and cognitive development.

It seems probable that the ability to extract lexical candidates referring to visually presented actions develops later than the ability to extract lexical candidates referring to visually presented objects, since actions are more abstract and involve a temporal dimension.

Using the Visual Preference Paradigm, the ability to extract lexical candidates referring to actions was studied in infants at the age of 4 to 8 months. The results suggest that while the ability at this age is not greatly apparent, it seems to increase slightly with age.

1. INTRODUCTION

This study is a part of the MILLE project (Modeling Interactive Language Learning), which is a collaboration between the Department of Linguistics at Stockholm University (Sweden), the Department of Speech, Music and Hearing at the Royal Institute of Technology (Sweden) and the Department of Psychology at Carnegie Mellon University (PA, USA). The project's aim is to investigate the processes behind early language acquisition and to implement them in computational models.

1.1 Infant directed speech

According to the Ecological Theory of Language Acquisition (ETLA), early language acquisition can be seen as an emergent consequence of the multi-modal interaction between the infant and its surroundings. Parents verbally interpret the world to their infant, and by doing so they provide the structure necessary for linguistic references and systems to emerge [1].

Infant Directed Speech (IDS) differs from Adult Directed Speech (ADS) mainly in that it is more repetitive and has greater pitch variations. IDS varies with the age of the infant, as the parent adjusts to the cognitive and communicative development of the infant and adapts the language accordingly [2].

At a very early age the main goal of speaking is general social interaction – to maintain the communication link with the infant – as opposed to ADS and later IDS, where the goal is mainly to convey information of some sort [3]. Recent studies at our lab show that early IDS is also highly focused on what is in the field of view of the infant; if the infant's attention shifts, the parent tends to follow the infant's lead and focus on the new center of attention (report in preparation).

1.2 Modeling early language acquisition

The principles of ETLA will be implemented in a humanoid robot, in an effort to develop a system that acquires language in a manner comparable to human language acquisition.

The system's learning will at some stage depend on IDS input. Some of the questions that need to be answered in order to accomplish this are therefore in which way the parent adjusts the IDS to the infant's age, and on which cues those adjustments are based. To answer those questions, the infant's word-learning behavior needs to be understood.

1.3 Learning lexical labels

To a naive listener – for example an infant – speech is not perceived as a combination of words building up sentences, but rather as a continuous stream of sounds with no apparent boundaries. According to ETLA, early word learning is based on general multi-sensory association and memory processes, which in combination with the structured characteristics of IDS provide opportunities for lexical candidates – smaller chunks of speech referring to, for example, an object – to emerge and be extracted from a continuous speech signal.

Studies have shown that infants at an early age are able to integrate multi-modal information, extract lexical candidates from a flow of continuous speech, and learn to associate those with objects [4]; other studies show that they are capable of extracting and associating lexical candidates with attributes, such as colors and shapes [5].
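The association process described above can be illustrated with a toy sketch; this is not the implementation used in the MILLE models, and the syllable n-gram representation, the function names and the demo data are assumptions introduced only for illustration (Python is used for compactness). Each utterance is paired with the object currently in view, and recurrent syllable chunks that reliably co-occur with the same object emerge as lexical candidates.

# Illustrative sketch (not the authors' implementation): recurrent chunks of
# "continuous speech" (here, syllable n-grams) that reliably co-occur with a
# visible object emerge as lexical candidates for that object.
from collections import defaultdict
from itertools import chain

def ngrams(syllables, n_max=3):
    return chain.from_iterable(
        (tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1))
        for n in range(1, n_max + 1))

def associate(utterances):
    """utterances: list of (syllable list, visible object label) pairs."""
    counts = defaultdict(lambda: defaultdict(int))   # chunk -> object -> count
    totals = defaultdict(int)                        # chunk -> total count
    for syllables, obj in utterances:
        for chunk in set(ngrams(syllables)):
            counts[chunk][obj] += 1
            totals[chunk] += 1
    # A chunk becomes a lexical candidate for the object it co-occurs with most.
    return {chunk: max(objs, key=objs.get)
            for chunk, objs in counts.items() if totals[chunk] >= 2}

demo = [(["ti", "ta", "bo", "lu"], "ball"), (["bo", "lu", "ne", "ki"], "ball"),
        (["ma", "su", "bo", "lu"], "ball"), (["ne", "ki", "da"], "doll")]
print(associate(demo)[("bo", "lu")])   # -> 'ball'

In this toy setting the repetitiveness of IDS is what makes the chunk-object pairing statistically visible, which is the point the text makes about the structure IDS provides.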

Regarding lexical candidates referring to actions or events – verb-like lexical candidates – Echols has, by directing infants' attention with labeled objects or events, shown that infants of 9 months generally focus more on objects when an event is labeled, while older infants to a greater extent focus on events or actions [6].

[Paper presented at InterSpeech 2006, September 17-21, Pittsburgh, pp. 1360-1362.]

The goal of this experiment is to study how infants’ ability to associate visually perceived actions with linguistic references differs with age, in order to gain further understanding of the development of general cognitive processes and abilities essential for language acquisition.

2. METHOD

The method used (Visual Preference Procedure) is a variation of the Intermodal Preferential Looking Paradigm, a method originally developed by Spelke [7]. The subjects were shown a film with corresponding visual and auditory stimuli while their eye movements were recorded. The looking time for each sequence of the film and for each quadrant of the screen was calculated and analysed using Mathematica 5.0 and SPSS 14.0.
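As an illustration of the looking-time measure (the actual analysis was carried out in Mathematica 5.0 and SPSS 14.0), the sketch below accumulates looking time per screen quadrant from timestamped gaze samples; the sample format, field order and screen resolution are assumptions, not details taken from the paper.

# Illustrative Python sketch: accumulate looking time per screen quadrant
# from timestamped gaze samples (t in seconds, x/y in pixels).
def quadrant(x, y, width=1280, height=1024):
    if x < 0 or y < 0 or x >= width or y >= height:
        return None                      # gaze off screen or tracking lost
    col = 0 if x < width / 2 else 1
    row = 0 if y < height / 2 else 1
    return row * 2 + col                 # 0..3: top-left, top-right, bottom-left, bottom-right

def looking_times(samples):
    """samples: list of (t_seconds, x, y), assumed sorted by time."""
    totals = {q: 0.0 for q in range(4)}
    for (t0, x, y), (t1, _, _) in zip(samples, samples[1:]):
        q = quadrant(x, y)
        if q is not None:
            totals[q] += t1 - t0         # credit the inter-sample interval to the quadrant
    return totals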

2.1 Subjects

The subjects were 55 Swedish infants between 4 and 8 months old, randomly selected from the National Swedish address register (SPAR) on the basis of age and geographical criteria.

During the recording of the eye-movements, the infant was seated in the parent’s lap or in an infant chair close to the parent. The parent wore Active Noise Reduction headphones, listening to music.

The 23 infants whose total looking time did not exceed 30% of the duration of the film were not considered sufficiently exposed to the stimuli; hence those recordings were excluded, leaving a total of 32 recording sessions.

2.2 Stimuli

The auditory stimuli of the films were nonsense words read by a Swedish female speaker and concatenated into three-word phrases. The films had three sequences each: a baseline, an exposure and a test. The baseline showed a split screen with four different scenes in which cartoon figures were moving towards or away from each other in silence (Figure 1).

Figure 1: In each scene of the baseline, one cartoon figure was moving towards or away from another, stationary figure.

The exposure phase showed eight subsequent scenes similar to those in the baseline, where the direction of the target movements was balanced regarding actor and direction on the screen (Figure 2). In each of the scenes only two cartoon figures were involved – a kangaroo (K) and an elephant (E).

Figure 2: In the eight subsequent scenes of the exposure, only two cartoon figures were involved – a kangaroo and an elephant.

During the exposure, three-word nonsense phrases were played referring to the objects and actions involved. The target word, the one referring to the action, always held the medial position (Table 1). The phrases were repeated three times during each scene.

Scene   Audio film 1        Audio film 2        Video   Movement
1       nima bysa pobi      fugga pobi bysa     E K     away
2       pobi fugga nima     bysa nima fugga     K E     towards
3       nima bysa pobi      fugga pobi bysa     K E     away
4       nima fugga pobi     fugga nima bysa     K E     towards
5       nima fugga pobi     fugga nima bysa     E K     towards
6       pobi bysa nima      bysa pobi fugga     K E     away
7       pobi fugga nima     bysa nima fugga     E K     towards
8       pobi bysa nima      bysa pobi fugga     K E     away

Table 1: The auditory stimuli and the visual balancing of the eight scenes in the exposure phase.

The test was visually identical to the baseline, while the nonsense word referring to the action was repeated twice (Table 2).

Film    Audio   Target
1       bysa    away
2       nima    towards

Table 2: Auditory stimuli of the test phase; the word referring to the target movement was repeated twice.

2.3 Eye-tracking

A Tobii eye tracker integrated with a 17" TFT monitor was used to record the eye movements of the infants. Near infra-red light-emitting diodes (NIR-LEDs) and a high-resolution camera with a large field of view are mounted on the screen. The light from the NIR-LEDs is reflected on the surface of the subject's eye and captured by the camera, as is the position of the pupil. During calibration the subject is encouraged to look at pre-specified points on the screen while the system measures the pupil's position in relation to the reflections, providing data that the system uses to calculate the gaze point on the screen during the recording. The software used for data storage was ClearView 2.5.1.
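The calibration principle can be sketched as follows; this is only a generic illustration of the idea (the Tobii system's own algorithm is proprietary and considerably more elaborate): a least-squares polynomial mapping is fitted from the pupil-to-reflection offset to the known on-screen calibration points, and that mapping is then used to estimate the gaze point during recording.

# Generic illustration of the calibration idea, not the Tobii algorithm:
# fit a quadratic least-squares mapping from the pupil-minus-reflection
# offset to screen coordinates, using the known calibration targets.
import numpy as np

def features(dx, dy):
    return np.array([1.0, dx, dy, dx * dy, dx * dx, dy * dy])

def fit_calibration(offsets, screen_points):
    """offsets: Nx2 pupil-reflection offsets; screen_points: Nx2 gaze targets."""
    A = np.array([features(dx, dy) for dx, dy in offsets])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(screen_points), rcond=None)
    return coeffs                                   # 6x2 coefficient matrix

def gaze_point(coeffs, dx, dy):
    return features(dx, dy) @ coeffs                # estimated (x, y) on screen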

3. RESULTS AND DISCUSSION

The results showed a slight increase with age in looking time at the target compared to the non-target (Figure 3). The regression coefficient is R = 0.261, p < 0.148.

Figure 3: The y-axis shows the normalized difference (gain) in percentage of looking time towards the target, with the non-target as baseline. The x-axis is age in days.
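Under one plausible reading of this measure (the exact normalization is not spelled out in the text, so the formula below is an assumption), the analysis amounts to computing, per infant, the percentage gain in looking time towards the target relative to the non-target and regressing that gain on age in days.

# Sketch of the reported analysis under an assumed definition of the gain
# measure; the authors' exact normalization may differ.
import numpy as np

def target_gain(t_target, t_nontarget):
    return 100.0 * (t_target - t_nontarget) / t_nontarget   # percent gain over baseline

def regress_on_age(ages_days, gains):
    r = np.corrcoef(ages_days, gains)[0, 1]                  # Pearson correlation
    slope, intercept = np.polyfit(ages_days, gains, 1)       # linear trend with age
    return r, slope, intercept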

The results suggest that the infants’ ability to extract a part of a continuous stream of speech and associate it with an action increases over age.

To expand this study to encompass older infants would be of interest in order to determine if there is a period where the ability increases drastically; hypothetically between 9 and 13 months of age, which would be in accordance with Echols’ results [6] and might indicate a leap in cognitive development.

One could also consider replicating this study using less complex actions as targets, so that the infant does not have to grasp the concept of changes in spatial relationship between different agents in order to understand the event.

4. ACKNOWLEDGEMENTS

The research was financed by grants from The Bank of Sweden Tercentenary Foundation and the European Commission.

5. REFERENCES

[1] F. Lacerda, E. Klintfors, L. Gustavsson, L. Lagerkvist, E. Marklund, and U. Sundberg, "Ecological Theory of Language Acquisition", Proceedings of the 4th International Workshop on Epigenetic Robotics, pp. 147-148, Genoa, 2004.

[2] U. Sundberg, "Mother Tongue: Phonetic aspects of infant-directed speech" PERILUS XXI, Institutionen för lingvistik, Stockholms Universitet, 1998.

[3] F. Lacerda, E. Marklund, L. Lagerkvist, L. Gustavsson, E. Klintfors, and U. Sundberg, "On the linguistic implications of context-bound adult-infant interactions", Proceedings of the 4th International Workshop on Epigenetic Robotics, pp. 147-148, Genoa, 2004.

[4] L. Gustavsson, U. Sundberg, E. Klintfors, E. Marklund, L. Lagerkvist, and F. Lacerda, "Integration of audio-visual information in 8-month-old infants", Proceedings of the 4th International Workshop on Epigenetic Robotics, pp. 149-150, Genoa, 2004.

[5] E. Klintfors and F. Lacerda, "Potential relevance of audio-visual integration in mammals for computational modeling" This conference, 2006.

[6] C. H. Echols, "Attentional predispositions and linguistic sensitivity in the acquisition of object words," Paper presented at the Biennial Meeting of the Society for Research in Child Development, New Orleans, LA, 1993.

[7] E. S. Spelke and C. J. Owsley, "Intermodal exploration and knowledge in infancy," Infant Behavior and Development, vol. 2, pp. 13-24, 1979.

Interaction Studies 7:2 (2006), 97–23. ISSN 1572–0373 / e-ISSN 1572–0381. © John Benjamins Publishing Company

Understanding mirror neurons: A bio-robotic approach

Giorgio Metta¹, Giulio Sandini¹, Lorenzo Natale¹, Laila Craighero² and Luciano Fadiga²
¹LIRA-Lab, DIST, University of Genoa / ²Neurolab, Section of Human Physiology, University of Ferrara

This paper reports on our investigation of action understanding in the brain. We review recent results of the neurophysiology of the mirror system in the monkey. Based on these observations we propose a model of this brain system, which is responsible for action recognition. The link between object affordances and action understanding is considered. To support our hypothesis we describe two experiments in which some aspects of the model have been implemented. In the first experiment an action recognition system is trained using data recorded from human movements. In the second experiment, the model is partially implemented on a humanoid robot which learns to mimic simple actions performed by a human subject on different objects. These experiments show that motor information can have a significant role in action interpretation and that a mirror-like representation can be developed autonomously as a result of the interaction between an individual and the environment.

Keywords: mirror neurons, neurophysiology, robotics, action recognition

1. Introduction

Animals continuously act on objects, interact with other individuals, clean their fur or scratch their skin and, in fact, actions represent the only way they have to manifest their desires and goals. However, actions do not constitute a semantic category such as trees, objects, people or buildings: the best way to describe a complex act to someone else is to demonstrate it directly (Jeannerod, 1988). This is not true for objects such as trees or buildings that we describe by using size, weight, color, texture, etc. In other words we describe 'things' by using visual categories and 'actions' by using motor categories. Actions are defined as 'actions' because they are external, physical expressions of our intentions. It is true that often actions are the response to external contingencies and/or stimuli but it is also certainly true that — at least in the case of human beings — actions can be generated on the basis of internal aims and goals; they are possibly symbolic and not related to immediate needs. Typical examples of this last category include most communicative actions.

Perhaps one of the first attempts of modeling perception and action as a whole was started decades ago by Alvin Liberman who initiated the construction of a 'speech understanding' machine (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman & Mattingly, 1985; Liberman & Wahlen, 2000). As one can easily imagine, the first effort of Liberman's team was directed at analyzing the acoustic characteristics of spoken words, to investigate whether the same speech event, as uttered by different subjects in different contexts, possessed any common invariant phonetic percept.1 Soon Liberman and his colleagues realized that speech recognition on the basis of acoustic cues alone was beyond reach with the limited computational power available at that time. Somewhat stimulated by the negative result, they put forward the hypothesis that the ultimate constituents of speech (the events of speech) are not sounds but rather articulatory gestures that have evolved exclusively at the service of language. As Liberman states:

"A result in all cases is that there is not, first, a cognitive representation of the proximal pattern that is modality-general, followed by a translation to a particular distal property; rather, perception of the distal property is immediate, which is to say that the module2 has done all the hard work (Liberman & Mattingly, 1985, page 7)".

This elegant idea was, however, strongly debated at the time, mostly because it was difficult to test and validation through implementation on a computer system was impossible; in fact, only recently has the theory gained support from experimental evidence (Fadiga, Craighero, Buccino, & Rizzolatti, 2002; Kerzel & Bekkering, 2000).

Why is it that, normally, humans can visually recognize actions (or, acoustically, speech) with a recognition rate of about 99–100%? Why doesn't the inter-subject variability typical of motor behavior pose a problem for the brain while it is troublesome for machines? Sadly, if we had to rank speech recognition software by human standards, even our best computers would be regarded at the level of an aphasic patient. One possible alternative is for Liberman to be right and that speech perception and speech production use a common repertoire of motor primitives that during production are at the basis of the generation of articulatory gestures, and during perception are activated in the listener as the result of an acoustically-evoked motor "resonance" (Fadiga, Craighero, Buccino, & Rizzolatti, 2002; Wilson, Saygin, Sereno, & Iacoboni, 2004).

Perhaps it is the case that if the acoustic modality were replaced, for example, by vision then this principle would still hold. In both cases, the brain requires a "resonant" system that matches the observed/heard actions to the observer/listener motor repertoire. It is interesting also to note that an animal equipped with an empathic system of this sort would be able to automatically "predict", to some extent, the future development of somebody else's action on the basis of the onset of the action and the implicit knowledge of its dynamics. Recent neurophysiological experiments show that such a motor resonant system indeed exists in the monkey's brain (Gallese, Fadiga, Fogassi, & Rizzolatti, 1996). Most interesting, this system is located in a premotor area where neurons not only discharge during action execution but to specific visual cues as well.

The remainder of the paper is organized to lead the reader from the basic understanding of the physiology of the mirror system, to a formulation of a model whose components are in agreement with what we know about the neural response of the rostroventral premotor area (area F5), and finally to the presentation of a set of two robotic experiments which elucidate several aspects of the model. The model presented in this paper is used to lay down knowledge about the mirror system rather than being the subject of direct neural network modelling. The two experiments cover different parts of the model but there is not a full implementation as such yet. The goal of this paper is to present these results, mainly robotic, in a common context.

2. Physiological properties of monkey rostroventral premotor area (F5)

Area F5 forms the rostral part of inferior premotor area 6 (Figure 1). Electrical microstimulation and single neuron recordings show that F5 neurons discharge during planning/execution of hand and mouth movements. The two representations tend to be spatially segregated, with hand movements mostly represented in the dorsal part of F5, whereas mouth movements are mostly located in its ventral part. Although not much is known about the functional properties of "mouth" neurons, the properties of "hand" neurons have been extensively investigated.


2.1 Motor neurons

Rizzolatti and colleagues (Rizzolatti et al., 1988) found that most of the hand-related neurons discharge during goal-directed actions such as grasping, manipulating, tearing, and holding. Interestingly, they do not discharge during finger and hand movements similar to those effective in triggering them, when made with other purposes (e.g., scratching, pushing away). Furthermore, many F5 neurons are active during movements that have an identical goal regardless of the effector used to attain them. Many grasping neurons discharge in association with a particular type of grasp. Most of them are selective for one of the three most common monkey grasps: precision grip, finger prehension, and whole hand grasping. Sometimes, there is also specificity within the same general type of grip. For instance, within the whole hand grasping, the prehension of a sphere is coded by neurons different from those coding the prehension of a cylinder. The study of the temporal relation between the neural discharge and the grasping movement showed a variety of behaviors. Some F5 neurons discharge during the whole action they code; some are active during the opening of the fingers, some during finger closure, and others only after the contact with the object. A typical example of a grasping neuron is shown in Figure 2. In particular, this neuron fires during precision grip (Figure 2, top) but not during whole hand grasping (Figure 2, bottom). Note that the neuron discharges both when the animal grasps with its right hand and when the animal grasps with its left hand.

Figure 2.

Figure 1. (Brain areas labelled in the figure: F1, S1, PF, PFG, PG, AIP, F4, F5.)


Taken together, these data suggest that area F5 forms a repository (a "vocabulary") of motor actions. The "words" of the vocabulary are represented by populations of neurons. Each indicates a particular motor action or an aspect of it. Some indicate a complete action in general terms (e.g., take, hold, and tear). Others specify how objects must be grasped, held, or torn (e.g., precision grip, finger prehension, and whole hand prehension). Finally, some of them subdivide the action in smaller segments (e.g., fingers flexion or extension).
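The "vocabulary" metaphor can be rendered schematically as a data structure; the field names and example entries below are purely illustrative assumptions, not a claim about how F5 codes actions.

# Schematic rendering of the F5 "motor vocabulary" idea; every name here is
# illustrative. Each entry stands for a population of neurons coding an action,
# a way of grasping, or a smaller segment of the act.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MotorWord:
    action: str                                  # e.g. "grasp", "hold", "tear"
    grasp_type: Optional[str] = None             # e.g. "precision grip"
    segments: List[str] = field(default_factory=list)  # e.g. ["finger opening", "finger closure"]

vocabulary = [
    MotorWord("grasp", "precision grip", ["finger opening", "finger closure"]),
    MotorWord("grasp", "whole hand prehension"),
    MotorWord("hold"),
    MotorWord("tear"),
]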

2.2 Visuomotor neurons

Some F5 neurons, in addition to their motor discharge, respond also to the presentation of visual stimuli. F5 visuomotor neurons pertain to two completely different categories. Neurons of the first category discharge when the monkey observes graspable objects ("canonical" F5 neurons; Murata et al., 1997; Rizzolatti et al., 1988; Rizzolatti & Fadiga, 1998). Neurons of the second category discharge when the monkey observes another individual making an action in front of it (Di Pellegrino, Fadiga, Fogassi, Gallese, & Rizzolatti, 1992; Gallese, Fadiga, Fogassi, & Rizzolatti, 1996; Rizzolatti, Fadiga, Gallese, & Fogassi, 1996). For these peculiar "resonant" properties, neurons belonging to the second category have been named "mirror" neurons (Gallese, Fadiga, Fogassi, & Rizzolatti, 1996).

The two categories of F5 neurons are located in two different sub-regions of area F5: "canonical" neurons are mainly found in that sector of area F5 buried inside the arcuate sulcus, whereas "mirror" neurons are almost exclusively located in the cortical convexity of F5 (see Figure 1).

2.2.1 Canonical neurons

Recently, the visual responses of F5 "canonical" neurons have been re-examined using a formal behavioral paradigm, which allowed testing the response related to object observation both during the waiting phase between object presentation and movement onset and during movement execution (Murata et al., 1997). The results showed that a high percentage of the tested neurons, in addition to the "traditional" motor response, responded also to the visual presentation of 3D graspable objects. Among these visuomotor neurons, two thirds were selective for one or a few specific objects.

Figure 3A shows the responses of an F5 visually selective neuron. While observation and grasping of a ring produced strong responses, responses to the other objects were modest (sphere) or virtually absent (cylinder). Figure 3B (object fixation) shows the behavior of the same neuron of Figure 3A during the fixation of the same objects. In this condition the objects were presented as during the task in Figure 2A, but grasping was not allowed and, at the go-signal, the monkey had simply to release a key. Note that, in this condition, the object is totally irrelevant for task execution, which only requires the detection of the go-signal. Nevertheless, the neuron strongly discharged at the presentation of the preferred object. To recapitulate, when visual and motor properties of F5 neurons are compared, it becomes clear that there is a strict congruence between the two types of responses. Neurons that are activated when the monkey observes small sized objects discharge also during precision grip. In contrast, neurons selectively active when the monkey looks at large objects discharge also during actions directed towards large objects (e.g. whole hand prehension).

Figure 3.

2.2.2 Mirror neurons

Mirror neurons are F5 visuomotor neurons that activate both when the monkey acts on an object and when it observes another monkey or the experimenter making a similar goal-directed action (Di Pellegrino, Fadiga, Fogassi, Gallese, & Rizzolatti, 1992; Gallese, Fadiga, Fogassi, & Rizzolatti, 1996). Recently, mirror neurons have been found also in area PF of the inferior parietal lobule, which is bidirectionally connected with area F5 (Fogassi, Gallese, Fadiga, & Rizzolatti, 1998). Therefore, mirror neurons seem to be identical to canonical neurons in terms of motor properties, but they radically differ from the canonical neurons as far as visual properties are concerned (Rizzolatti & Fadiga, 1998). The visual stimuli most effective in evoking mirror neuron discharge are actions in which the experimenter's hand or mouth interacts with objects. The mere presentation of objects or food is ineffective in evoking mirror neuron discharge. Similarly, actions made by tools, even when conceptually identical to those made by hands (e.g. grasping with pliers), do not activate the neurons or activate them very weakly. The observed actions which most often activate mirror neurons are grasping, placing, manipulating, and holding. Most mirror neurons respond selectively to only one type of action (e.g. grasping). Some are highly specific, coding not only the type of action, but also how that action is executed. They fire, for example, during observation of grasping movements, but only when the object is grasped with the index finger and the thumb.

Typically, mirror neurons show congruence between the observed and executed action. This congruence can be extremely precise: that is, the effective motor action (e.g. precision grip) coincides with the action that, when seen, triggers the neurons (e.g. precision grip). For other neurons the congruence is somehow weaker: the motor requirements (e.g. precision grip) are usually stricter than the visual ones (any type of hand grasping). One representative of the highly congruent mirror neurons is shown in Figure 4.

Figure 4.

3. A model of area F5 and the mirror system

The results summarized in the previous sections tell us of the central role of F5 in the control and recognition of manipulative actions: the common interpretation proposed by (Luppino & Rizzolatti, 2000) and (Fagg & Arbib, 1998) considers F5 a part of a larger circuit comprising various areas in the parietal lobe (a large reciprocal connection with AIP), indirectly from STs (Perrett, Mistlin, Harries, & Chitty, 1990) and other premotor and frontal areas (Luppino & Rizzolatti, 2000). Certainly F5 is strongly involved in the generation and control of action, indirectly through F1, and directly by projecting to motor and medullar interneurons in the spinal cord (an in-depth study is described in (Shimazu, Maier, Cerri, Kirkwood, & Lemon, 2004)). A good survey of the physiology of the frontal motor cortex is given in (Rizzolatti & Luppino, 2001).

A graphical representation of the F5 circuitry is shown in Figure 5. In particular, we can note how the response of F5 is constructed of various elements, these being the elaboration of object affordances (canonical F5 neurons and AIP), of the visual appearance of the hand occurring in the Superior Temporal sulcus region (STs), and of the timing and synchronization of the action (Luppino & Rizzolatti, 2000). Parallel to F5-AIP, we find the circuit formed by F4-VIP that has been shown to correlate with the control of reaching. Further, a degree of coordination between these circuits is needed since manipulation requires both a transport and a grasping ability (Jeannerod, Arbib, Rizzolatti, & Sakata, 1995).

In practice, the parieto-frontal circuit can be seen as the transformation of the visual information about objects into its motor counterpart, and with the addition of the Inferior Temporal (IT) and Superior Temporal Sulcus (STs) areas, this description is completed with the semantic/identity of objects and the state of the hand. This description is also in good agreement with previous models such as the FARS (Fagg, Arbib, Rizzolatti, and Sakata) model and MN1 (Mirror Neuron model 1) (Fagg & Arbib, 1998). Our model is different in trying to explicitly identify a developmental route to mirror neurons by looking at how these different transformations could be learned without posing implausible constraints.

The model of area F5 we propose here revolves around two concepts that are likely related to the development of this unique area of the brain. The mirror system has a double role, being both a controller (70% of the neurons of F5 are pure motor neurons) and a "classifier" system (being activated by the sight of specific grasping actions). If we then pose the problem in terms of understanding how such a neural system might actually autonomously develop (be shaped and learned by/through experience during ontogenesis), the role of canonical neurons — and in general that of contextual information specifying the goal of the action — has to be reconsidered. Since purely motor, canonical, and mirror neurons are found together in F5, it is very plausible that local connections determine part of the activation of F5.

3.1 Controller–predictor formulation

For explanatory purposes, the description of our model of the mirror system can be divided in two parts. The first part describes what happens in the actor's brain, the second what happens in the observer's brain when watching the actor (or another individual). As we will see, the same structures are used both when acting and when observing an action.

Figure 5.


We consider first what happens from the actor's point of view (see Figure 6): in her/his perspective, the decision to undertake a particular grasping action is attained by the convergence in area F5 of many factors including context and object related information. The presence of the object and of contextual information bias the activation of a specific motor plan among many potentially relevant plans stored in F5. The one which is most fit to the context is then enacted through the activation of a population of motor neurons. The motor plan specifies the goal of the motor system in motoric terms and, although not detailed here, we can imagine that it also includes temporal information. Contextual information is represented by the activation of F5's canonical neurons and by additional signals from parietal (AIP for instance) and other frontal areas (mesial or dorsal area 6) as in other models of motor control systems (Fagg & Arbib, 1998; Haruno, Wolpert, & Kawato, 2001; Oztop & Arbib, 2002). For example, the contextual information required to switch from one model to another (determining eventually the grasp type) is represented exactly by the detection of affordances performed by AIP-F5.

With reference to Figure 6, our model hypothesizes that the intention to grasp is initially "described" in the frontal areas of the brain in some internal reference frame and then transformed into the motor plan by an appropriate controller in premotor cortex (F5). The action plan unfolds mostly open loop (i.e. without employing feedback). A form of feedback (closed loop) is required though to counteract disturbances and to learn from mistakes. This is obtained by relying on a forward or direct model that predicts the outcome of the action as it unfolds in real-time. The output of the forward model can be compared with the signals derived from sensory feedback, and differences accounted for (the cerebellum is believed to have a role in this (Miall, Weir, Wolpert, & Stein, 1993; Wolpert & Miall, 1996)). A delay module is included in the model to take into account the different propagation times of the neural pathways carrying the predicted and actual outcome of the action. Note that the forward model is relatively simple, predicting only the motor output in advance: since motor commands are generated internally it is easy to imagine a predictor for these signals. The inverse model (indicated with VMM for Visuo-Motor Map), on the other hand, is much more complicated since it maps sensory feedback (vision mainly) back into motor terms. Visual feedback clearly includes both the hand-related information (STs response) and the object information (AIP, IT, F5 canonical). Finally the predicted and the sensed signals arising from the motor act are compared and their difference (feedback error) sent back to the controller.
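A minimal sketch of this predictor-delay-comparison arrangement, under the simplifying assumptions that prediction and feedback are scalar signals and that the conduction delay is a fixed number of time steps, is given below; forward_model is a stand-in for whatever predictor is learned, and none of the names come from the paper.

# Minimal sketch (not the authors' implementation) of the forward-model loop:
# each motor command is passed through a forward model, the prediction is held
# in a delay line until the corresponding sensory feedback arrives, and the
# mismatch is returned as a feedback-error signal for the controller.
from collections import deque

class ForwardModelLoop:
    def __init__(self, forward_model, feedback_delay_steps):
        self.forward_model = forward_model
        self.delay = feedback_delay_steps
        self.predictions = deque()

    def step(self, motor_command, sensory_feedback):
        # predict the sensory outcome of the command issued now
        self.predictions.append(self.forward_model(motor_command))
        if len(self.predictions) <= self.delay:
            return 0.0                            # no prediction old enough to compare yet
        predicted = self.predictions.popleft()    # prediction made `delay` steps ago
        return sensory_feedback - predicted       # feedback error sent back to the controller

As the next paragraph notes, this error can be used in two ways: on-line compensation and slower learning of the models themselves.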


There are two ways of using the mismatch between the planned and actual action: (i) compensate on the fly by means of a feedback controller, and (ii) adjust over longer periods of time through learning (Kawato, Furukawa, & Suzuki, 1987).

The output of area F5 finally activates the motor neurons in the spinal cord (directly or indirectly through medullar circuits) to produce the desired action. This is indicated in the schematics by a connection to appropriate muscular synergies representing the spinal cord circuits.

Learning of the direct and inverse models can be carried out during ontogenesis by a procedure of self-observation and exploration of the state space of the system: grossly speaking, simply by "detecting" the sensorial consequences of motor commands — examples of similar procedures are well known in the literature of computational motor control (Jordan & Rumelhart, 1992; Kawato, Furukawa, & Suzuki, 1987; Wolpert, 1997). Learning of the affordances of objects with respect to grasping can also be achieved autonomously by a trial and error procedure, which explores the consequences of many different actions of the agent's motor repertoire (different grasp types) applied to different objects. This includes things such as discovering that small objects are optimally grasped by a pinch or precision grip, while big and heavy objects require a power grasp.
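A hedged sketch of such a self-exploration ("babbling") procedure is given below: random motor commands are executed, the visual consequence of each is stored, and the inverse mapping (the VMM) is approximated by nearest-neighbour lookup. execute_and_observe, the number of degrees of freedom and the command range are assumptions, not the robot interface used in the experiments reported later.

# Illustrative sketch of learning a Visuo-Motor Map by self-exploration:
# issue random motor commands, record what each looks like, and invert the
# mapping by nearest-neighbour lookup over the collected pairs.
import random

def babble(execute_and_observe, n_samples=1000, dof=5):
    samples = []                                     # list of (visual, motor) pairs
    for _ in range(n_samples):
        motor = [random.uniform(-1.0, 1.0) for _ in range(dof)]
        visual = execute_and_observe(motor)          # e.g. observed hand position/posture
        samples.append((visual, motor))
    return samples

def vmm_lookup(samples, observed_visual):
    """Inverse model: return the motor command whose visual outcome is closest."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, observed_visual))
    return min(samples, key=lambda s: dist(s[0]))[1]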

Figure 6.

In the observer situation (see Figure 7) motor and proprioceptive information is not directly available. The only readily available information is vision or sound. The central assumption of our model is that the structure of F5 could be co-opted in recognizing the observed actions by transforming visual cues into motor information as before. In practice, the inverse model is accessed by visual information and, since the observer is not acting herself, visual information is directly reaching in parallel the sensorimotor primitives in F5. Only some of them are actually activated because of the "filtering" effect of the canonical neurons and other contextual information (possibly at a higher level, knowledge of the actor, etc.). A successive filtering is carried out by considering the actual visual evidence of the action being watched (implausible hand postures should be weighed less than plausible ones). This procedure could then be used to recognize the action by measuring the most active motor primitive. It is important to note that canonical neurons are known to fire when the monkey is fixating an object irrespective of the actual context (tests were performed with the object behind a transparent screen or with the monkey restrained to the chair).

Figure 7.

Comparison is theoretically done, in parallel, across all the active motor primitives (actions); the actual brain circuitry is likely to be different, with visual information setting the various F5 populations to certain equilibrium states. The net effect can be imagined as that of many comparisons being performed in parallel and one motor primitive resulting predominantly activated (plausible implementations of this mechanism by means of a gating network are described in (Y. Demiris & Johnson, 2003; Haruno, Wolpert, & Kawato, 2001)). While the predictor–controller formulation is somewhat well-established and several variants have been described in the literature (Wolpert, Ghahramani, & Flanagan, 2001), the evidence on the mirror system we reviewed earlier seems to support the idea that many factors, including the affordances of the target object, determine the recognition and interpretation of the observed action.

The action recognition system we have just outlined can be interpreted by following a Bayesian approach (Lopes & Santos-Victor, 2005). The details of the formulation are reported in Appendix I.
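A generic Bayesian reading of this recognition scheme (the paper's own formulation is given in its Appendix I, which is not reproduced here, so the structure below is only an assumption) combines an affordance prior over actions, supplied by the canonical/object pathway, with a visual likelihood of the observed hand evidence, and reports the most strongly activated primitive.

# Generic Bayesian sketch of recognition-by-motor-primitives; the prior table,
# likelihood function and return format are illustrative assumptions.
def recognize(affordance_prior, visual_likelihood, observed_object, hand_evidence):
    """affordance_prior[obj][action] -> prior; visual_likelihood(action, evidence) -> float."""
    posterior = {}
    for action, prior in affordance_prior[observed_object].items():
        # affordance prior (canonical pathway) weighted by the visual evidence
        posterior[action] = prior * visual_likelihood(action, hand_evidence)
    z = sum(posterior.values()) or 1.0
    best = max(posterior, key=posterior.get)          # "most active motor primitive"
    return best, {a: p / z for a, p in posterior.items()}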

3.2 Ontogenesis of mirror neurons

The presence of a goal is fundamental to elicit mirror neuron responses (Gallese, Fadiga, Fogassi, & Rizzolatti, 1996) and we believe it is also particularly important during the ontogenesis of the mirror system. Supporting evidence is described in the work of Woodward and colleagues (Woodward, 1998) who have shown that the identity of the target is specifically encoded during reaching and grasping movements: in particular, already at nine months of age, infants recognized as novel an action directed toward a novel object rather than an action with a different kinematics, thus showing that the goal is more fundamental than the enacted trajectory.

The Bayesian interpretation we detail in Appendix I is substantially a supervised learning model. To relax the hypothesis of having to "supervise" the machine during training by indicating which action is which, we need to remind ourselves of the significance of the evidence on mirror neurons. First of all, it is plausible that the 'canonical' representation is acquired by self exploration and manipulation of a large set of different objects. F5 canonical neurons represent an association between objects' physical properties and the actions they afford: e.g. a small object affords a precision grip, or a coffee mug affords being grasped by the handle. This understanding of object properties and the goal of actions is what can be subsequently factored in while disambiguating visual information. There are at least two levels of reasoning: i) certain actions are more likely to be applied to a particular object — that is, probabilities can be estimated linking each action to every object, and ii) objects are used to perform actions — e.g. the coffee mug is used to drink coffee. Clearly, we tend to use actions that proved to lead to certain results or, in other words, we trace backward the link between action and effects: to obtain the effects, apply the same action that earlier led to those effects.

Bearing this in mind, when observing some other individual's actions, our understanding can be framed in terms of what we already know about actions. In short, if I see someone drinking from a coffee mug then I can hypothesize that a particular action (that I already know in motor terms) is used to obtain that particular effect (of drinking). This link between mirror neurons and the goal of the motor act is clearly present in the neural responses observed in the monkey. It is a plausible way of autonomously learning a mirror representation. The learning problem is still a supervised3 one but the information can now be collected autonomously through a procedure of exploration of the environment. The association between the canonical response (object-action) and the mirror one (including vision) is made when the observed consequences (or goal) are recognized as similar in the two cases — self or the other individual acting. Similarity can be evaluated following different criteria ranging from kinematic (e.g. the object moving along a certain trajectory) to very abstract (e.g. social consequences such as in speech).


Finally, the Visuo-Motor Map (VMM), as already mentioned, can also be learned through a procedure of self-exploration. Motor commands and correlated visual information are two quantities that are readily available to the developing infant. It is easy to imagine a procedure that learns the inverse model on the basis of this information.

4. A machine with hands

The simplest way of confirming the hypothesis that motor gestures are the basis of action recognition, as amply discussed in the previous sections, is to equip a computer with the means of “acting” on objects, collect visual and motor data, and build a recognition system that embeds some of the principles of operation that we identified in our model (see Figure 8). In particular, the hypothesis we would like to test is whether the extra information available during learning (e.g. kinesthetic and tactile) can improve and simplify the recognition of the same actions when they are just observed: i.e. when only visual information is available. Given the current limitations of robotic systems, the simplest way to provide “motor awareness” to a machine is by recording grasping actions of human subjects from multiple sources of information including joint angles, spatial position of the hand/fingers, vision, and touch. In this sense we speak of a machine (the recognition system) with hands.

For this purpose we assembled a computerized system composed of a cyber glove (CyberGlove by Immersion), a pair of CCD cameras (Watek 202D), a magnetic tracker (Flock of Birds, Ascension), and two touch sensors (FSR). Data was sampled at frame rate, synchronized, and stored to disk by a Pentium-class PC. The cyber glove has 22 sensors and allows recording the kinematics of the hand at up to 112 Hz. The tracker was mounted on the wrist and provides the position and orientation of the hand in space with respect to a base frame. The two touch sensors were mounted on the thumb and index finger to detect the moment of contact with the object. The cameras were mounted at an appropriate distance with respect to their focal length to acquire the execution of the whole grasping action with the maximum possible resolution.

The glove is lightweight and does not limit the movement of the arm and hand in any way, as long as the subject sits not too far from the glove interface. Data recording was carried out with the subject sitting comfortably in front of a table and performing grasping actions naturally toward objects placed approximately at the center of the table. Data recording and storage were carried out through a custom-designed application; Matlab was employed for post-processing.


We collected a large data set and processing was then performed off-line. The selected grasping types approximately followed Napier’s taxonomy (Napier, 1956) and for our purpose they were limited to only three types: power grasp (cylindrical), power grasp (spherical), and precision grip. Since the goal was to investigate to what extent the system could learn invariances across different grasping types by relying on motor information for classification, the experiment included gathering data from a multiplicity of viewpoints. The database contains objects which afford several grasp types, to assure that recognition cannot rely exclusively on extracting object features. Rather, according to our model, recognition is supposed to be a confluence of object recognition with visual analysis of the hand. Two exemplar grasp types are shown in Figure 9: on the left panel a precision grip using all fingers; on the right one a two-finger precision grip.

A set of three objects was employed in all our experiments: a small glass ball, a rectangular solid which affords multiple grasps, and a large sphere requiring a power grasp. Each grasping action was recorded from six different subjects (right handed, age 23–29, male/female equally distributed), and moving

Figure 8.


the cameras to 12 different locations around the subject, including two different elevations with respect to the table top, which amounts to 168 sequences per subject. Each sequence contains images of the scene from the two cameras synchronized with the cyber glove and the magnetic tracker data. This is the data set used for building the Bayesian classifier outlined in Appendix I (Lopes & Santos-Victor, 2005).

The visual features were extracted from pre-processed image data. The hand was segmented from the images through a simple color segmentation algorithm. The bounding box of the segmented region was then used as a reference to map the view of the hand to a standard reference size. The orientation of the color blob in the image was also used to rotate the hand to a standard orientation. This data set was then filtered through Principal Component Analysis (PCA) by maintaining only a limited set of eigenvectors corresponding to the first 2 to 15 largest eigenvalues.
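
As an illustration of this pre-processing pipeline, the sketch below segments the hand by color, normalizes the patch in size and orientation, and projects it onto a few principal components. It is only a minimal sketch: the OpenCV color thresholds, the 32×32 patch size and the use of scikit-learn's PCA are assumptions, not the original implementation.

```python
# Sketch of the visual pre-processing described above: segment the hand by
# color, crop and rescale it to a standard size, rotate it to a standard
# orientation, then project onto the first few principal components.
import numpy as np
import cv2                                   # OpenCV, assumed available
from sklearn.decomposition import PCA

def extract_hand_patch(frame_bgr, size=32):
    """Segment the hand by a (hypothetical) skin-color range, normalize it."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 40, 60), (25, 255, 255))     # illustrative range
    x, y, w, h = cv2.boundingRect(mask)                      # bounding box
    patch = cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)[y:y+h, x:x+w]
    # estimate blob orientation from second-order moments and de-rotate
    m = cv2.moments(mask[y:y+h, x:x+w], binaryImage=True)
    angle = 0.5 * np.degrees(np.arctan2(2 * m["mu11"],
                                        m["mu20"] - m["mu02"] + 1e-9))
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    patch = cv2.warpAffine(patch, rot, (w, h))
    return cv2.resize(cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY), (size, size))

def fit_visual_pca(frames, n_components=10):
    """Stack normalized hand patches and keep the first few eigenvectors."""
    data = np.array([extract_hand_patch(f).ravel() for f in frames], dtype=float)
    return PCA(n_components=n_components).fit(data)
    # pca.transform(patch.ravel()[None, :]) then gives the visual feature vector
```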

One possibility to test the influence of motor information in learning action recognition is to contrast the situation where motor-kinesthetic information is available in addition to visual information with the control situation where only visual information is available.

The first experiment used the output of the VMM as shown in Section 3.1 and thus employed motor features for classification. The VMM was approximated from data by using a simple backpropagation neural network with

Figure 9.


sigmoidal units. The input of the VMM was the projection of the images onto the space spanned by the first N PCA vectors; the output was the vector of joint angles acquired from the data glove.
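
A minimal sketch of how such a visuo-motor map could be approximated is given below, using scikit-learn's MLPRegressor with logistic (sigmoidal) units as a stand-in for the original backpropagation network; the hidden-layer size and training settings are illustrative.

```python
# Sketch of the Visuo-Motor Map (VMM): a small feed-forward network mapping the
# PCA coefficients of the hand image to the 22 joint angles of the glove.
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_vmm(visual_features, joint_angles, hidden=30):
    """visual_features: (n_samples, n_pca); joint_angles: (n_samples, 22)."""
    vmm = MLPRegressor(hidden_layer_sizes=(hidden,),
                       activation="logistic",      # sigmoidal units
                       solver="adam",
                       max_iter=2000,
                       random_state=0)
    vmm.fit(visual_features, joint_angles)
    return vmm

# At recognition time the VMM turns purely visual observations into estimated
# motor features, so the classifier can always operate in motor space:
#   motor_features = vmm.predict(visual_pca.transform(hand_patches))
```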

As a control, a second experiment employed the same procedure but without the VMM, so that classification was performed in visual space. The results of the two experiments are reported in Table 1.

The experiments were set up by dividing the database into two parts, training the classifier on one half and testing on the other. The number of training sequences is different since we chose to train the classifier with the maximum available data, but from all points of view only in the control experiment. The clearest result of this experiment is that classification in motor space is easier, and thus the classifier performs better on the test set. Also, the number of Gaussians required by the EM algorithm to approximate the likelihood as per equation (1) within a given precision is smaller for the experiment than for the control (1–2 vs. 5–7): that is, the distribution of data is more “regular” in motor space than in visual space. This is to be expected since the variation of the visual appearance of the hand is larger and depends strongly on the point of view, while the sequence of joint angles tends to be the same across repetitions of the same action. It is also clear that in the experiment the classifier is much less concerned with the variation of the data, since this variation has been taken out by the VMM.
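
To make the “number of Gaussians within a given precision” concrete, the sketch below fits mixtures of increasing size to the features of one class and stops when adding components no longer pays off; scikit-learn's GaussianMixture and the BIC-based stopping rule are assumptions standing in for the EM procedure used in the experiments.

```python
# Sketch: fit mixtures of increasing size to the feature vectors of one
# (action, object) class and keep the smallest mixture that still improves
# the fit substantially. GaussianMixture/BIC stand in for the EM procedure.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_likelihood(features, max_components=8, tol=10.0):
    """features: (n_samples, n_dims) for a single (action, object) class."""
    best, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=0).fit(features)
        bic = gmm.bic(features)
        if bic < best_bic - tol:   # keep adding components only while it pays off
            best, best_bic = gmm, bic
        else:
            break
    return best    # best.score_samples(x) gives log p(F | action, object)
```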

Overall, our interpretation of these results is that by mapping into motor space we are allowing the classifier to choose features that are much better suited for performing optimally, which in turn facilitates generalization. The same is not true in visual space.

Table 1. Summary of the results of the experiments. Left: motor information is available to the classifier. Right: only visual information is used by the classifier. Training was always performed with all available sequences after partitioning the data into equally sized training and test sets.

                                              Experiment (motor space)   Control (visual space)
Training
  # of sequences                              24 (+ VMM)                 64
  # of points of view                         1                          4
  Classification rate (on the training set)   98%                        97%
Test
  # of sequences                              96                         32
  # of points of view                         4                          4
  Classification rate                         97%                        80%


5. Robotic experiment

Following the insight that it might be important to uncover the sequence of developmental events that moves either the machine or humans to a motoric representation of observed actions, we set forth to implement a complete experiment on a humanoid robot called Cog (Brooks, Breazeal, Marjanović, & Scassellati, 1999). This is an upper-torso human-shaped robot with 22 degrees of freedom distributed along the head, arms and torso (Figure 10). It lacks hands; it has instead simple flippers that it could use to push and prod objects. It cannot move from its stand, so the objects it interacted with had to be presented to the robot by a human experimenter. The robot is controlled by a distributed parallel control system based on a real-time operating system and running on a set of Pentium-based computers. The robot is equipped with cameras (for vision), gyroscopes simulating the human vestibular system, and joint sensors providing information about the position and torque exerted at each joint.

The aim of experimenting on the humanoid robot was to show that a mirror neuron-like representation could be acquired by simply relying on the information exchanged during the robot-environment interaction. This proof of concept can be used to analyze the gross features of our model or to reveal

Figure 10.


any lacunae in it. We were especially interested in determining a plausible sequence that, starting from minimal initial hypotheses, steers the system toward the construction of units with responses similar to mirror neurons. The resulting developmental pathway should gently move the robot through probing different levels of the causal structure of the environment. Table 2 shows four levels of this causal structure and some intuition about the areas of the brain related to these functions. It is important to note how the complexity of causation evolves from strict synchrony to more delayed effects, which become more difficult to identify and learn from. Naturally, this is not to say that the brain develops following this step-like progression. Rather, brain development is thought to be fluid, messy, and above all dynamic (Thelen & Smith, 1998); the identification of “developmental levels” here nonetheless simplifies our comprehension of the mechanisms of learning and development.

The first level in Table 2 suggests that learning to reach for externally identified objects requires the identification of a direct causal chain linking the generation of action to its immediate and direct visual consequences. Clearly, in humans the development of full-blown reaching also requires the simultaneous development of visual acuity, binocular vision and, as suggested by Bertenthal and von Hofsten (Bertenthal & von Hofsten, 1998), the proper support of the body, freeing the hand and arm from their supporting role.

Only when reaching has developed can the interaction between the hand and the external world start generating useful and reliable responses from touch and grasp. This new source of information simultaneously requires

Table 2. Degrees of causal indirection, brain areas, and function in the brain (this table has been compiled from the data in Rizzolatti & Luppino, 2001).

Level | Nature of causation | Brain areas | Function and behavior | Time profile
1 | Direct causal chain | VC-(VIP/7b)-F4-F1 | Reaching | Strict synchrony
2 | One level of indirection | VC-AIP-F5-F1 | Object affordances, grasping (rolling in this experiment) | Fast onset upon contact, potential for delayed effects
3 | Complex causation involving multiple causal chains | VC-AIP-F5-F1-STs-IT | Mirror neurons, mimicry | Arbitrarily delayed onset and effects
4 | Complex causation involving multiple instances of manipulative acts | STs-TE-TEO-F5-AIP | Object recognition | Arbitrarily delayed onset and effects


new means of detecting causally connected events, since the initiation of an action causes certain delayed effects. The payoff is particularly rich, since interaction with objects leads to the formation of a “well-defined” concept of objecthood — this, in robotics, is a tricky concept, as discussed for example in (Metta & Fitzpatrick, 2003).

It is then interesting to study whether this same knowledge about objects and about the interaction between the hand and objects could be exploited in interpreting actions performed by others. This leads us to the next level of causal understanding, where the delay between the acquisition of object knowledge and the exploitation of this knowledge when observing someone else might be very large. If any neural unit is active in these two situations (both when acting and when observing) then it can be regarded in all respects as a “mirror” unit.

Finally, we mention object recognition as belonging to an even higher level of causal understanding, where object identity is constructed by repetitive exposure to and manipulation of the same object. In the following experiments we concentrate on steps 2 and 3, assuming step 1 is already functional. We shall not discuss step 4 any further since it is beyond the scope of this paper. The robot also possesses, and we are not going to enter much into the details here, some basic attention capabilities that allow it to select relevant objects in the environment and track them if they move, binocular disparity which is used to control vergence and estimate distances, and enough motor control abilities to reach for an object. In a typical experiment, the human operator waves an object in front of the robot, which reacts by looking at it; if the object is dropped on the table, a reaching action is initiated, and the robot possibly makes contact with the object. Vision is used during the reaching and touching movement to guide the robot’s flipper toward the object, to segment the hand from the object upon contact, and to collect information about the behavior of the object caused by the application of a certain action.
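
The behavioral loop just described (attend to a waved object, track it, reach when it is dropped, detect contact, measure the effect) can be summarized as a small state machine. The sketch below is purely illustrative: all the perception and motor primitives (salient_object_found, track_object, etc.) are hypothetical callables, not part of the actual Cog software.

```python
# Illustrative state machine for one poking trial: attend to a waved object,
# track it, reach when it is dropped on the table, detect contact, and record
# the effect. All robot.* primitives are hypothetical.
from enum import Enum, auto

class State(Enum):
    ATTEND = auto()
    TRACK = auto()
    REACH = auto()
    MEASURE = auto()

def poking_trial(robot, max_steps=1000):
    state = State.ATTEND
    for _ in range(max_steps):
        if state is State.ATTEND:
            if robot.salient_object_found():
                state = State.TRACK
        elif state is State.TRACK:
            robot.track_object()
            if robot.object_on_table():
                state = State.REACH
        elif state is State.REACH:
            robot.step_reach_toward_object()          # visually guided flipper motion
            if robot.contact_detected():
                state = State.MEASURE
        elif state is State.MEASURE:
            return robot.segment_and_measure_effect()  # segmentation + object motion
    return None                                        # trial failed / timed out
```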

6. Learning object affordances

Since the robot does not have hands, it cannot really grasp objects from the table. Nonetheless there are other actions that can be employed in exploring the physical properties of objects. Touching, poking, prodding, and sweeping form a nice class of actions that can be used for this purpose. The sequence of images acquired during reaching for the object, the moment of impact, and the effects of the action are measured following the approach of Fitzpatrick (Fitzpatrick,


2003a). An example of the quality of segmentation obtained is shown in Figure 11. Clearly, having identified the object boundaries allows measuring any visual feature of the object, such as color, shape, texture, etc.

Unfortunately, the interaction of the robot’s flipper (the manipulator endpoint) with objects does not result in a wide class of different affordances. In practice the only possibility was to employ objects that show a characteristic behavior depending on how they are approached. This possibility is offered by rolling affordances: in our experiments we used a toy car, an orange juice bottle, a ball, and a colored toy cube.

The robot’s motor repertoire, besides reaching, consists of four different stereotyped approach movements covering a range of directions of about 180 degrees around the object.

The experiment consisted in repetitively presenting each of the four objects to the robot. During this stage other objects were also presented at random; the experiment ran for several days, and sometimes people walked by the robot and managed to make it poke (and segment) the most disparate objects. For each successful trial the robot “stored” the result of the segmentation, the object’s principal axis (selected as a representative shape parameter), the action — initially selected randomly from the set of four approach directions — and the movement of the center of mass of the object for some hundreds of milliseconds after the impact was detected. We grouped (clustered) data belonging to the same object by employing a color-based clustering technique similar to that of Schiele and Crowley (Schiele & Crowley, 2000). In fact, in our experiments the toy car was mostly yellow in color, the ball violet, the bottle orange, etc. In different situations the requirements for the visual clustering might change and more sophisticated algorithms could be used (Fitzpatrick, 2003b).

Figure 12 shows the results of the clustering, segmentation, and examination of the object behavior procedure. We plotted here an estimation of the probability of observing object motion relative to the object’s own principal axis. Intuitively, this gives information about the rolling properties of the different objects: e.g. the car tends to roll along its principal axis, the bottle at

Figure 11.


a right angle with respect to its axis. The training set for producing the graphs in Figure 12 consisted of about 100 poking actions per object. This “description” of objects is fine in visual terms but does not really bear any potential for action, since it does not yet contain information about what action to take if the robot happens to see one of the objects.

For the purpose of generating actions, a description of the geometry of poking is required. This can easily be obtained by collecting many samples of generic poking actions and estimating the average direction of displacement of the object. Figure 13 shows the histograms of the direction of movement averaged for each possible action. About 700 samples were used to produce the four plots. Note, for example, that the action labeled as “backslap” (moving the object with the flipper outward from the robot) consistently gives a visual object motion upward in the image plane (corresponding to the peak at −100 degrees, 0 degrees being the direction parallel to the image x axis, the y axis pointing downward). A similar consideration applies to the other actions.
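
A sketch of how the two descriptions could be accumulated from the stored trials is given below: a per-object histogram of motion direction relative to the principal axis (the rolling affordance of Figure 12) and a per-action histogram of image-plane displacement direction (the geometry of poking of Figure 13). The trial record layout and the 10-degree bins are assumptions.

```python
# Sketch: accumulate (i) per-object histograms of motion direction relative to
# the object's principal axis (rolling affordance) and (ii) per-action
# histograms of absolute image-plane displacement direction (poking geometry).
import numpy as np
from collections import defaultdict

BINS = np.arange(-180, 181, 10)            # 10-degree bins (assumption)

def wrap_deg(a):
    """Wrap an angle in degrees into [-180, 180)."""
    return (a + 180.0) % 360.0 - 180.0

def motion_angle_deg(displacement):
    dx, dy = displacement
    return np.degrees(np.arctan2(dy, dx))

def build_histograms(trials):
    """trials: iterable of dicts with keys 'object', 'action',
    'principal_axis_deg' and 'displacement' = (dx, dy) of the center of mass."""
    per_object = defaultdict(list)   # motion direction relative to principal axis
    per_action = defaultdict(list)   # absolute image-plane motion direction
    for t in trials:
        motion = motion_angle_deg(t["displacement"])
        per_object[t["object"]].append(wrap_deg(motion - t["principal_axis_deg"]))
        per_action[t["action"]].append(motion)
    affordances = {o: np.histogram(v, bins=BINS, density=True)[0]
                   for o, v in per_object.items()}
    action_geometry = {a: np.histogram(v, bins=BINS, density=True)[0]
                       for a, v in per_action.items()}
    return affordances, action_geometry
```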

Having built this, the first interesting question is whether this information (summarized collectively in Figure 12 and Figure 13) can be re-used when acting to generate anything useful, showing exploitation of the object affordances. In fact, it is now possible to make the robot “optimally” poke (i.e. by selecting an action that causes maximum displacement) an observed and known object. In practice the same color clustering procedure is used for localizing

Figure 12.


and recognizing the object, to determine its orientation on the table and its affordance, and finally to select the action that is most likely to elicit the principal affordance (rolling).
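
A minimal sketch of this action selection step, reusing the histograms of the previous sketch, is given below; summarizing each histogram by its most populated bin is an assumption made for brevity.

```python
# Sketch: choose the approach direction most likely to make a known object
# roll, given its observed orientation on the table.
import numpy as np

def circular_mode_deg(hist, bins=np.arange(-180, 181, 10)):
    """Direction (bin center, degrees) of the most populated histogram bin."""
    centers = (bins[:-1] + bins[1:]) / 2.0
    return centers[np.argmax(hist)]

def select_poke(obj_name, observed_axis_deg, affordances, action_geometry):
    # preferred rolling direction relative to the principal axis (e.g. ~0 deg for the car)
    rel_roll = circular_mode_deg(affordances[obj_name])
    desired = (observed_axis_deg + rel_roll + 180.0) % 360.0 - 180.0
    # pick the action whose average effect is closest to the desired direction
    def err(action):
        produced = circular_mode_deg(action_geometry[action])
        return abs((produced - desired + 180.0) % 360.0 - 180.0)
    return min(action_geometry, key=err)
```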

A simple qualitative test of the performance determined that out of 100 trials the robot made 15 mistakes. Further analysis showed that 12 of the 15 mistakes were due to poor control of reaching (e.g. the flipper touched the object too early, bringing it outside the field of view), and only three to a wrong estimate of the orientation.

Although crude, this implementation shows that with little pre-existing structure the robot could acquire the crucial elements for building knowledge of objects in terms of their affordances. Given a sufficient level of abstraction, our implementation is close to the response of canonical neurons in F5 and their interaction with neurons observed in AIP that respond to object orientation (Sakata, Taira, Kusunoki, Murata, & Tanaka, 1997). Another interesting question is whether knowledge about object-directed actions can be reused in interpreting observed actions performed, perhaps, by a human experimenter. This leads directly to the question of how mirror neurons can be developed from the interaction of canonical neurons and some additional processing.

Figure 13.


To link with the concept of feedback from the action system, here, after the actual action has unfolded, the robot applied exactly the same procedure employed to learn the object affordances in order to measure the error between the planned and the executed action. This feedback signal could then be exploited to incrementally update the internal model of the affordances, and it is fairly similar to the feedback signal identified in our conceptual model in Section 3 (Figure 6).
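
A sketch of how such a feedback signal could be folded back into the affordance model is given below; the exponential-average update rule and the learning rate are assumptions.

```python
# Sketch of the feedback loop: after acting, compare the measured effect with
# the prediction and fold the new observation back into the affordance
# histogram of that object.
import numpy as np

BINS = np.arange(-180, 181, 10)

def prediction_error_deg(predicted_deg, measured_deg):
    """Angular error between planned and executed effect (the feedback signal)."""
    return abs((measured_deg - predicted_deg + 180.0) % 360.0 - 180.0)

def update_affordance(affordance_hist, observed_rel_direction_deg, rate=0.05):
    """affordance_hist: normalized histogram over BINS for one object."""
    new_obs = np.histogram([observed_rel_direction_deg], bins=BINS)[0].astype(float)
    updated = (1.0 - rate) * affordance_hist + rate * new_obs
    return updated / updated.sum()       # keep it a probability distribution
```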

7. Developing mirror neurons

In answering the question of what is further required for interpreting observed actions, we could reason backward through the chain of causality employed in the previous section. Whereas the robot identified the motion of the object because of a certain action applied to it, here it could backtrack and derive the type of action from the observed motion of the object. It can further explore what is causing motion and learn about the concept of a manipulator in a more general setting (Fitzpatrick & Metta, 2003).

In fact, the same segmentation procedure cited in Section 6 could visually interpret poking actions generated by a human as well as those generated by the robot. One might argue that observation could be exploited for learning about object affordances. This is possibly true to the extent that passive vision is reliable and action is not required. In the architecture proposed by Demiris and Hayes (J. Demiris & Hayes, 2002), the authors make a distinction between active and passive imitation. In the former case the agent is already capable of imitating the observed action and recognizes it by simulating it (a set of inverse and forward models are simultaneously activated). In the latter case the agent passively observes the action and memorizes the set of postures that characterizes it. This view is not in contrast with our approach. In Demiris and Hayes, in fact, the system converts the available visual information and estimates the posture of the demonstrator (joint angles). Learning this transformation during ontogenesis is indeed an active process which requires access to both visual and proprioceptive information. The advantage of the active approach, at least for the robot, is that it allows controlling the amount of information impinging on the visual sensors by, for instance, controlling the speed and type of action. This strategy might be especially useful given the limitations of artificial perceptual systems.

Thus, observations can be converted into interpreted actions. The action whose effects are closest to the observed consequences on the object (which


we might translate into the goal of the action) is selected as the most plausible interpretation given the observation. Most importantly, the interpretation reduces to interpreting the “simple” kinematics of the goal and consequences of the action, rather than to understanding the “complex” kinematics of the human manipulator. The robot understands only to the extent it has learned to act.
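
A minimal sketch of this interpretation-by-effects is given below: the observed displacement of the object is explained by the robot action whose learned average effect is closest to it, without modelling the human hand at all. The action_geometry histograms are the hypothetical per-action descriptions built earlier.

```python
# Sketch of interpretation "in motor terms": explain the observed object
# displacement with the robot's own action whose learned effect is closest.
import numpy as np

def interpret_observed_poke(observed_displacement, action_geometry,
                            bins=np.arange(-180, 181, 10)):
    centers = (bins[:-1] + bins[1:]) / 2.0
    observed = np.degrees(np.arctan2(observed_displacement[1],
                                     observed_displacement[0]))
    def err(action):
        produced = centers[np.argmax(action_geometry[action])]
        return abs((produced - observed + 180.0) % 360.0 - 180.0)
    return min(action_geometry, key=err)    # label from the robot's own repertoire
```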

One might note that a refined model should probably include visual cues from the appearance of the manipulator in the interpretation process. This is possibly true for the case of manipulation with real hands, where the configuration of the fingers might be important. Given our experimental setup, the sole causal relationship was instantiated between the approach/poking direction and the object behavior; consequently there was no apparent benefit in including additional visual cues.

The last question we propose to address is whether the robot can imitate the “goal” of a poking action. The step is indeed small, since most of the work is actually in interpreting observations. Imitation was generated by replicating the latest observed human movement with respect to the object and irrespective of the object’s orientation. For example, if the experimenter poked the toy car sideways, the robot imitated him/her by pushing the car sideways. Figure 14 shows an extended mimicry experiment with different situations originated by playing with a single object.

In humans there is now considerable evidence that a similar strict interaction of visual and motor information is at the basis of action understanding at many levels and, exchanging vision for audition, it applies unchanged to speech (Fadiga, Craighero, Buccino, & Rizzolatti, 2002). This implementation, besides serving as a sanity check of our current understanding of the mirror system, provides hints that learning of mirror neurons can be carried out by a process of autonomous development.

However, these results have to be considered at the appropriate level of abstraction, and comparing them too closely to neural structure might even be misleading: this implementation was simply not meant to reproduce closely the neural substrate (the neural implementation) of imitation. Robotics, we believe, might serve as a reference point from which to investigate the biological solution to the same problem — and although it cannot provide the answers, it can at least suggest useful questions.


8. Conclusions

This paper put forward a model of the functioning of the mirror system which considers at each stage plausible unsupervised learning mechanisms. In addition, the results from our experiments seem to confirm two aspects of the proposed model: first, that motor information plays a role in the recognition process — as would follow from the hypothesis that feedback signals are implicated in recognition — and, second, that a mirror-like representation can be developed autonomously on the basis of the interaction between an individual and the environment.

Other authors have proposed biologically inspired architectures for imitation and learning in robotics (Y. Demiris & Johnson, 2003; Y. Demiris & Khadhouri, 2005; Haruno, Wolpert, & Kawato, 2001). In (J. Demiris & Hayes, 2002) a set of forward-inverse models coupled with a selection mechanism is employed to control a dynamical simulation of a humanoid robot which learns to perform and imitate gestures. A similar architecture is employed on a planar robot manipulator (Simmons & Demiris, 2004) and on a mobile robot with a gripper (Johnson & Demiris, 2005).

Although with some differences, in all these examples the forward-inverse model pairs are activated both when the robot is performing a motor task and when the robot is attending to the demonstrator. During the demonstration of an action, the perceptual input from the scene is fed in parallel to all the inverse models. In turn, the output of the inverse models is sent to the forward models, which act as simulators (at the same time the output of the inverse models is inhibited, so no action is actually executed by the robot). The idea is that the predictions of all the forward models are compared to the next state of the demonstrator and the resulting error is taken as a confidence measure to select the appropriate action for imitation. In these respects, these various implementations are very similar to the one presented in this paper. Each inverse-forward model pair provides a hypothesis for recognizing the observed action. The confidence measure that is accumulated during action observation can be interpreted as an estimate of the probability that that hypothesis is true, given the perceptual input. More recently the same architecture has been extended to cope with the obvious differences in human versus robot morphology, and applied to an object manipulation task (Johnson & Demiris, 2005). In this case the set of coupled internal inverse and forward models is organized hierarchically, from the higher level, concerned with actions and goals in abstract terms, to the lower level, concerned with the motoric consequences of each action.


The MOSAIC model by Haruno, Wolpert and Kawato (Haruno, Wolpert, & Kawato, 2001) is also close to our model in the sense of explicitly considering the likelihood and priors in deciding which of the many forward-inverse pairs to activate for controlling the robot. In this case no explicit imitation schema is considered, but the context is properly accounted for by the selection mechanism (a Hidden Markov Model).

Our proposal of the functioning of the mirror system includes aspects of both solutions by considering the forward-inverse mapping and the contribution of the contextual information. In our model, and in contrast with Demiris’ architecture, the goal of the action is explicitly taken into account. As we discussed in Section 2, the mirror system is activated only when a goal-directed action is generated or observed. Recognizing the goal is also the key aspect of the learning process, and subsequently it works as a prior to bias recognition by filtering out actions that are not applicable or simply less likely to be executed, given the current context. This is related to what happens in the (Johnson & Demiris, 2005) paper, where the architecture filters out the actions that cannot be executed. In our model, however, the affordances of a given object are specified in terms of the probabilities of all actions that can be performed on the object (estimated from experience). These probabilities do not only filter out impossible actions, but bias the recognition toward the action that is more plausible given the particular object. The actions considered in this paper are all transitive (i.e. directed towards objects), consistently with the neurophysiological view of the mirror system. In the same paper (Johnson & Demiris, 2005), moreover, the recognition of the action is performed by comparing the output of the forward models to the state of the demonstrator. In our model this is always performed in motor space. In the experiments reported in this paper we show that this simplifies the classification problem and leads to better generalization.

The outcome of a first set of experiments using the data set collected with the cyber glove setup has shown that there are at least two advantages when the action classification is performed in motor rather than visual space: i) a simpler classifier, since the classification or clustering is much simpler in motor space, and ii) better generalization, since motor information is invariant to changes of the point of view. Some of these aspects are further discussed in (Lopes & Santos-Victor, 2005).


Figure 14.

The robotic experiment shows, on the other hand, that indeed only minimal initial skills are required for learning a mirror neuron representation. In practice, we only had to assume reaching, to guarantee interaction with objects, and a method to visually measure the results of this interaction. Surely, this is a gross simplification in many respects since, for example, aspects of the development of grasping per se were not considered at this stage and aspects of agency were neglected (the robot was not measuring the posture and behavior of the human manipulator). Nonetheless, this shows that, in principle, the acquisition of the mirror neuron structure is an almost natural outcome of the development of a control system for grasping. Also, we have put forward a plausible sequence of learning phases involving the interaction between canonical and mirror neurons. This, we believe, is well in accordance with the evidence gathered by neurophysiology. In conclusion, we have embarked on an investigation that is somewhat similar to Liberman’s speech recognition attempts cited earlier. Perhaps, also this time, the mutual rapprochement of the neural and engineering sciences might lead to a better understanding of brain functions.

Acknowledgments

The research described in this paper is supported by the EU projects MIRROR (IST-2000-28159) and ROBOTCUB (IST-2004-004370). The authors wish to thank Claes von Hofsten, Kerstin Rosander, José Santos-Victor, Manuel Cabido-Lopes, Alexandre Bernardino, Matteo Schenatti, and Paul Fitzpatrick for the stimulating discussions on the topic of this paper. The authors also wish to thank the anonymous reviewers for their comments on an early version of this manuscript.

Notes

1. (Liberman & Mattingly, 1985)

2. The module of language perception.

3. In supervised learning, training information is previously labeled by an external teacher or a set of correct examples is available.


References

Bertenthal, B., & von Hofsten, C. (1998). Eye, Head and Trunk Control: the Foundation for Manual Development. Neuroscience and Biobehavioral Reviews, 22(4), 515–520.

Brooks, R. A., Breazeal, C. L., Marjanović, M., & Scassellati, B. (1999). The COG project: Building a Humanoid Robot. In C. L. Nehaniv (Ed.), Computation for Metaphor, Analogy and Agents (Lecture Notes in Artificial Intelligence, Vol. 1562, pp. 52–87). Springer-Verlag.

Demiris, J., & Hayes, G. (2002). Imitation as a Dual-Route Process Featuring Predictive and Learning Components: A Biologically-Plausible Computational Model. In K. Dautenhahn & C. L. Nehaniv (Eds.), Imitation in Animals and Artifacts (pp. 327–361). Cambridge, MA, USA: MIT Press.

Demiris, Y., & Johnson, M. H. (2003). Distributed, predictive perception of actions: a biologically inspired robotics architecture for imitation and learning. Connection Science, 15(4), 231–243.

Demiris, Y., & Khadhouri, B. (2005). Hierarchical Attentive Multiple Models for Execution and Recognition (HAMMER). Robotics and Autonomous Systems (in press).

Di Pellegrino, G., Fadiga, L., Fogassi, L., Gallese, V., & Rizzolatti, G. (1992). Understanding motor events: a neurophysiological study. Experimental Brain Research, 91(1), 176–180.

Fadiga, L., Craighero, L., Buccino, G., & Rizzolatti, G. (2002). Speech listening specifically modulates the excitability of tongue muscles: a TMS study. European Journal of Neuroscience, 15(2), 399–402.

Fagg, A. H., & Arbib, M. A. (1998). Modeling parietal-premotor interaction in primate control of grasping. Neural Networks, 11(7–8), 1277–1303.

Fitzpatrick, P. (2003a, October 27–31). First Contact: an active vision approach to segmentation. In proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2161–2166 Vol. 3, Las Vegas, Nevada, USA.

Fitzpatrick, P. (2003b). From First Contact to Close Encounters: A developmentally deep perceptual system for a humanoid robot. Unpublished PhD thesis, MIT, Cambridge, MA.

Fitzpatrick, P., & Metta, G. (2003). Grounding vision through experimental manipulation. Philosophical Transactions of the Royal Society: Mathematical, Physical, and Engineering Sciences, 361(1811), 2165–2185.

Fogassi, L., Gallese, V., Fadiga, L., & Rizzolatti, G. (1998). Neurons responding to the sight of goal directed hand/arm actions in the parietal area PF (7b) of the macaque monkey. Society of Neuroscience Abstracts, 24, 257.255.

Gallese, V., Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119, 593–609.

Haruno, M., Wolpert, D. M., & Kawato, M. (2001). MOSAIC Model for sensorimotor learning and control. Neural Computation, 13, 2201–2220.

Jeannerod, M. (1988). The Neural and Behavioural Organization of Goal-Directed Movements (Vol. 15). Oxford: Clarendon Press.

Jeannerod, M., Arbib, M. A., Rizzolatti, G., & Sakata, H. (1995). Grasping objects: the cortical mechanisms of visuomotor transformation. Trends in Neurosciences, 18(7), 314–320.


Johnson, M., & Demiris, Y. (2005). Hierarchies of Coupled Inverse and Forward Models for Abstraction in Robot Action Planning, Recognition and Imitation. In proceedings of the AISB 2005 Symposium on Imitation in Animals and Artifacts, 69–76, Hertfordshire, UK.

Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3), 307–354.

Kawato, M., Furukawa, K., & Suzuki, R. (1987). A hierarchical neural network model for control and learning of voluntary movement. Biological Cybernetics, 57, 169–185.

Kerzel, D., & Bekkering, H. (2000). Motor activation from visible speech: Evidence from stimulus response compatibility. Journal of Experimental Psychology: Human Perception and Performance, 26(2), 634–647.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431–461.

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21(1), 1–36.


Liberman, A. M., & Whalen, D. H. (2000). On the relation of speech to language. Trends in Cognitive Sciences, 4(5), 187–196.

Lopes, M., & Santos-Victor, J. (2005). Visual learning by imitation with motor representations. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(3), 438–449.

Luppino, G., & Rizzolatti, G. (2000). The organization of the frontal motor cortex. News in Physiological Sciences, 15, 219–224.

Metta, G., & Fitzpatrick, P. (2003). Early Integration of Vision and Manipulation. Adaptive Behavior, 11(2), 109–128.

Miall, R. C., Weir, D. J., Wolpert, D. M., & Stein, J. F. (1993). Is the cerebellum a Smith predictor? Journal of Motor Behavior, 25(3), 203–216.

Murata, A., Fadiga, L., Fogassi, L., Gallese, V., Raos, V., & Rizzolatti, G. (1997). Object representation in the ventral premotor cortex (area F5) of the monkey. Journal of Neurophysiology, 78, 2226–2230.

Napier, J. (1956). The prehensile movement of the human hand. Journal of Bone and Joint Surgery, 38B(4), 902–913.

Oztop, E., & Arbib, M. A. (2002). Schema design and implementation of the grasp-related mirror neuron system. Biological Cybernetics, 87, 116–140.

Perrett, D. I., Mistlin, A. J., Harries, M. H., & Chitty, A. J. (1990). Understanding the visual appearance and consequence of hand action. In M. A. Goodale (Ed.), Vision and action: the control of grasping (pp. 163–180). Norwood (NJ): Ablex.

Rizzolatti, G., Camarda, R., Fogassi, L., Gentilucci, M., Luppino, G., & Matelli, M. (1988). Functional organization of inferior area 6 in the macaque monkey: II. Area F5 and the control of distal movements. Experimental Brain Research, 71(3), 491–507.

Rizzolatti, G., & Fadiga, L. (1998). Grasping objects and grasping action meanings: the dual role of monkey rostroventral premotor cortex (area F5). In G. R. Bock & J. A. Goode (Eds.), Sensory Guidance of Movement, Novartis Foundation Symposium (pp. 81–103). Chichester: John Wiley and Sons.


Rizzolatti, G., Fadiga, L., Gallese, V., & Fogassi, L. (1996). Premotor cortex and the recognition of motor actions. Cognitive Brain Research, 3(2), 131–141.

Rizzolatti, G., & Luppino, G. (2001). The cortical motor system. Neuron, 31, 889–901.

Sakata, H., Taira, M., Kusunoki, M., Murata, A., & Tanaka, Y. (1997). The TINS lecture — The parietal association cortex in depth perception and visual control of action. Trends in Neurosciences, 20(8), 350–358.

Schiele, B., & Crowley, J. L. (2000). Recognition without correspondence using multidimensional receptive field histograms. International Journal of Computer Vision, 36(1), 31–50.

Shimazu, H., Maier, M. A., Cerri, G., Kirkwood, P. A., & Lemon, R. N. (2004). Macaque ventral premotor cortex exerts powerful facilitation of motor cortex outputs to limb motoneurons. The Journal of Neuroscience, 24(5), 1200–1211.

Simmons, G., & Demiris, Y. (2004). Imitation of Human Demonstration Using A Biologically Inspired Modular Optimal Control Scheme. In proceedings of the IEEE-RAS/RSJ International Conference on Humanoid Robots, 215–234, Los Angeles, USA.

Thelen, E., & Smith, L. B. (1998). A Dynamic Systems Approach to the Development of Cognition and Action (3rd ed.). Cambridge, MA: MIT Press.

Wilson, S. M., Saygin, A. P., Sereno, M. I., & Iacoboni, M. (2004). Listening to speech activates motor areas involved in speech production. Nature Neuroscience, 7, 701–702.

Wolpert, D. M. (1997). Computational approaches to motor control. Trends in Cognitive Sciences, 1(6), 209–216.

Wolpert, D. M., Ghahramani, Z., & Flanagan, R. J. (2001). Perspectives and problems in motor learning. Trends in Cognitive Sciences, 5(11), 487–494.

Wolpert, D. M., & Miall, R. C. (1996). Forward models for physiological motor control. Neural Networks, 9(8), 1265–1279.

Woodward, A. L. (1998). Infants selectively encode the goal object of an actor’s reach. Cognition, 69, 1–34.

Authors’ addresses

Giorgio Metta
LIRA-Lab, University of Genoa
Viale Causa, 13 — 16145 — Italy
Email: [email protected]

Giulio Sandini
LIRA-Lab, University of Genoa
Viale Causa, 13 — 16145 — Italy
Email: [email protected]

Lorenzo Natale
LIRA-Lab, University of Genoa
Viale Causa, 13 — 16145 — Italy
Email: [email protected]

Laila Craighero
Faculty of Medicine — D.S.B.T.A. Section of Human Physiology
Via Fossato di Mortara 17/19 — 44100 Ferrara — Italy
Email: [email protected]

Luciano Fadiga
Faculty of Medicine — D.S.B.T.A. Section of Human Physiology
Via Fossato di Mortara 17/19 — 44100 Ferrara — Italy
Email: [email protected]


About the authors

Giorgio Metta: Assistant Professor at the University of Genoa, where he teaches courses on Anthropomorphic Robotics and Operating Systems. Ph.D. in Electronic Engineering in 2000. Postdoctoral associate at MIT, AI-Lab, from 2001 to 2002. Since 1993 at LIRA-Lab, where he developed various robotic platforms with the aim of implementing bioinspired models of sensorimotor control. For more information: http://pasa.liralab.it and http://www.robotcub.org.

Giulio Sandini: Full Professor of bioengineering at the University of Genova, where he teaches the course “Natural and Artificial Intelligent Systems”. Degree in Bioengineering at the University of Genova; Assistant Professor at the Scuola Normale Superiore in Pisa, studying aspects of visual processing in cortical neurons and visual perception in human adults and children. Visiting Research Associate at Harvard Medical School in Boston working on electrical brain activity mapping. In 1990 he founded the LIRA-Lab (www.liralab.it), investigating the field of Computational and Cognitive Neuroscience and Robotics with the objective of understanding the neural mechanisms of human sensorimotor coordination and cognitive development by realizing anthropomorphic artificial systems such as humanoids.

Lorenzo Natale was born in Genoa, Italy, in 1975. He received the M.S. degree in Electronic Engineering in 2000 and the Ph.D. in Robotics in 2004, both from the University of Genoa, Italy. He is currently a Postdoctoral Associate at MIT, Computer Science and Artificial Intelligence Laboratory. His research involves the study of sensorimotor development and motor control in artificial and natural systems.

Laila Craighero: Born in 1968, degree in Psychology, Ph.D. in Neuroscience. Assistant Professor of Human Physiology at the University of Ferrara. Scientific interests: neural mechanisms underlying the orienting of visuospatial attention. Experience in transcranial magnetic stimulation and in electrophysiological investigation in monkeys (single-neuron recordings). Among her contributions, in the framework of the premotor theory of attention, is the investigation of the role played by motor preparation in attentional orienting. More recently, she collaborated on the study of motor cortex and spinal cord excitability during observation of actions made by other individuals (‘mirror’ and ‘canonical’ neurons). Laila Craighero is co-investigator in E.C.-funded projects on action understanding and sensorimotor coordination.

Luciano Fadiga: Born in 1961. M.D., Ph.D. in Neuroscience, Full Professor of Human Physiology at the University of Ferrara. He has a consolidated experience in electrophysiological investigation in monkeys (single-neuron recordings) and humans (transcranial magnetic stimulation, study of spinal excitability and brain imaging). Among his contributions are the discovery of monkey mirror neurons, in collaboration with the team led by G. Rizzolatti at the University of Parma, and the first demonstration by TMS that a mirror system also exists in humans. Other fields of his research concern visual attention and speech processing. He is responsible for a research unit funded by the European Commission for the study of the brain mechanisms at the basis of action understanding.


Appendix I

The role of the mirror system during action recognition can be framed in a Bayesian context (Lopes & Santos-Victor, 2005). In the Bayesian formulation the posterior probabilities can be written as:

Posterior = (Likelihood × Prior) / Normalization (1)

and the classification is performed by looking at the maximum of the posterior probabilities with respect to the set of available actions. The normalization factor can be neglected since it does not influence the determination of the maximum. We equated the prior probabilities with the object affordances and estimated them by counting the occurrences of the grasp types vs. the object type. In this interpretation, in fact, the affordances of an object identify the set of actions that are most likely to be executed. The likelihood was approximated by a mixture of Gaussians whose number and parameters were estimated with the Expectation Maximization (EM) algorithm.

More specifically, the mirror activation of F5 can be thought of as:

p(Ai | F, Ok) (2)

where Ai is the ith action from a motor repertoire of I actions, F are the features determining the activation of this particular action, and Ok is the target object of the grasping action out of a set of K possible objects. This probability can be computed from Bayes’ rule as:

p(Ai | F, Ok) = p(F | Ai, Ok) p(Ai | Ok) / p(F | Ok) (3)

and a classifier can be constructed by taking the maximum over the possible actions as follows:

Â = arg maxi p(Ai | F, Ok) (4)

The term p(F | Ok) can be discarded since it does not influence the maximization. For the other terms we can give the following interpretation:

p(Ai | F, Ok) Mirror neuron responses, obtained by a combination of the information as in equation (1).

p(F | Ai, Ok) The activity of the F5 motor neurons generating certain motor patterns given the selected action and the target object.

p(Ai | Ok) Canonical neuron responses, that is, the probability of invoking a certain action given an object.

The feature vector F can also be considered over a certain time period and thus we should apply the following substitution to equations 1 to 3:

F = Ft, …, Ft−N (5)

which takes into account the temporal evolution of the action.

The posterior probability distribution can be estimated using a naive approach, assuming independence between the observations at different time instants. The justification for this


assumption is that recognition does not necessarily require the accurate modelling of the probability density functions. We have:

p(Ai | Ft, …, Ft−N, Ok) = ∏j=0..N [ p(Ft−j | Ai, Ok) p(Ai | Ok) / p(Ft−j | Ok) ] (6)

This clearly does not have to be close to the actual brain responses, but it was considered since it simplifies computation compared to using the full joint probabilities.
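
A compact sketch of the classifier of equations (1)–(6) is given below: affordance priors p(Ai | Ok) estimated by counting, per-(action, object) Gaussian-mixture likelihoods, and a naive product (sum of logs) over the recent feature history. scikit-learn's GaussianMixture stands in for the EM procedure, and the fixed number of mixture components is an assumption.

```python
# Sketch of the classifier of equations (1)-(6): affordance priors estimated by
# counting, per-(action, object) Gaussian-mixture likelihoods fitted with EM,
# and a naive product (sum of logs) over the recent feature history.
import numpy as np
from collections import Counter
from sklearn.mixture import GaussianMixture

class MirrorClassifier:
    def __init__(self, n_components=2):
        self.n_components = n_components
        self.priors = {}        # (action, object) -> p(A_i | O_k)
        self.likelihoods = {}   # (action, object) -> fitted GaussianMixture

    def fit(self, features, actions, objects):
        """features: (n_samples, n_dims) array; actions/objects: per-sample labels."""
        counts = Counter(zip(actions, objects))
        per_object = Counter(objects)
        for (a, o), c in counts.items():
            self.priors[(a, o)] = c / per_object[o]          # affordance prior
            idx = [i for i, (ai, oi) in enumerate(zip(actions, objects))
                   if (ai, oi) == (a, o)]
            self.likelihoods[(a, o)] = GaussianMixture(
                n_components=self.n_components, covariance_type="full",
                random_state=0).fit(features[idx])
        return self

    def classify(self, feature_history, obj):
        """feature_history: (N, n_dims) recent features F_t, ..., F_{t-N}."""
        best_action, best_score = None, -np.inf
        for (a, o), gmm in self.likelihoods.items():
            if o != obj:
                continue
            # naive product over time, in log space; p(F | O_k) is omitted,
            # as in the text, since it does not change the maximization
            log_post = (gmm.score_samples(feature_history).sum()
                        + len(feature_history) * np.log(self.priors[(a, o)]))
            if log_post > best_score:
                best_action, best_score = a, log_post
        return best_action
```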

Sistemi Intelligenti, vol. XVIII, no. 1, April 2006

Lorenzo Natale, Giorgio Metta, Giulio Sandini

SENSORIMOTOR DEVELOPMENT IN A HUMANOID ROBOT

1. Introduction

In artificial intelligent systems, the representation of the world is often determined a priori during the design phase. This type of approach has led to systems capable of efficiently solving the problems they were designed for, but which fail in more general contexts. Despite recent technological progress, we have not yet managed to build intelligent systems able to cope with the complexity and extreme variability of the real world. A characteristic common to all biological systems, by contrast, is the ability to adapt to different situations and environments. In humans especially, the representation of the world is not defined a priori but is built autonomously by the individual through the physical interaction between body and environment. This representation is extremely rich and includes information coming from the exteroceptive sensory organs (such as vision, smell and hearing) and the proprioceptive ones (muscle spindles, vestibular system). The first years of life are the period of greatest change in an individual's cognitive and sensorimotor capabilities. During this period motor, perceptual and cognitive abilities mature while, from a physical point of view, the body changes dramatically. Although these aspects are still under study, it is possible to identify a set of principles and rules that govern proper development and learning in children. In our research we have tried to replicate some of these principles in an anthropomorphic robotic system. In this article we present an overview of the results obtained and discuss the possible implications of this research for the study of intelligent systems.


2. Development in biological systems

Neuroscience and cognitive science seem to agree on the importance of motor activity and on its role in perception (Arbib et al. 2000; Jeannerod 1997; Milner and Goodale 1995; Rizzolatti and Gentilucci 1988). This is partly due to the recent discovery, in the premotor cortical areas, of neurons whose activity is modulated by visual information (Fogassi et al. 1996; Graziano et al. 1997). The same conclusion is supported by various studies in developmental psychology (Konczak and Dichgans 1997; Streri 1993; von Hofsten et al. 1998). This is not surprising if one considers that some properties of the environment can be perceived directly only through physical interaction with the environment itself. For example, the shape of an object, its compliance or the nature of its surface are qualities that are best perceived through touch. The acquisition of sensory information is therefore almost always an active process. In humans — and in all animals endowed with a fovea — the eyes are in continuous motion and explore the environment in order to acquire visual information efficiently. Other sensory modalities, such as touch, require genuine exploration strategies involving the interaction between body and environment. Motor activity therefore plays a fundamental role in proper perceptual development in humans (Bushnell and Boudreau 1993). One possible hypothesis is that perceptual and motor development progress hand in hand. For example, the ability to visually perceive object properties such as volume, hardness and texture seems not to emerge before 6-9 months, while the ability to discriminate three-dimensional shapes develops only around 12-15 months of age. It is not surprising that this timing matches that of motor development, considering that the exploration of objects, and the consequent ability to perceive certain of their properties, requires a high degree of movement control. The experiments of Lederman and Klatzky (1987) showed how, in adults, the determination of certain object properties is accompanied by specific exploratory actions. In children, during development, the inability to perform a certain action could therefore prevent the perception of the corresponding property. In the first months of life the motor capabilities of newborns are limited. The coordination and control of arm movements are not very accurate; only in rare cases, and almost by chance, does the infant manage to make contact with objects and grasp them (von Hofsten 1982). Around 3 months these abilities improve and prehension movements become more reliable. In this period, however, objects are almost always grasped with the open hand, exploiting the opposition between fingers and palm (power grasp). This simple interaction with objects is nonetheless sufficient to make accessible properties such as


temperature, size/volume and hardness. Finally, around 6-9 months, infants begin to manipulate objects with a certain skill and with different strategies that include the use of the fingers in different configurations (von Hofsten and Ronnqvist 1993). At this point in development, the greater control over object exploration allows infants to begin to perceive more complex properties such as the shape and roughness of objects. Further evidence in support of this theory can be found in the work of Needham et al. (2002), in which the ability of some 3-month-old infants to grasp objects was artificially enhanced by means of a special glove covered with Velcro. The glove considerably increases the possibility of grasping and exploring some objects previously prepared by the researchers (these objects were covered at the edges with the same type of Velcro). The results show that the infants undergoing the test display a greater interest in objects than a reference group. This interest manifests itself, among other things, in prolonged visual exploration, and it persists even in the absence of the glove. It is as if the anticipated development of the motor ability, artificially induced by the glove, brought about a greater interest in the visual exploration of objects. The identification of regularities in the sensory flow, and thus of certain constant properties of the environment, is a process that characterizes sensorimotor development and is particularly interesting because it allows us to predict the events we observe (Rosander and von Hofsten 2004; von Hofsten et al. 1998). Predictive capabilities lead to a further increase in the ability to interact with the environment but, perhaps more importantly, they provide a key to interpreting the events around us. Around 9 months of age, the ability to interact with objects in a repetitive and accurate manner makes it possible to extend cognitive capabilities further. Grasping actions on objects make it possible to build a multisensory representation that appropriately combines visual, tactile and kinesthetic experience.

3. Babybot, a robot that grows

Motivated by these observations, we built an anthropomorphic robot with some of the characteristics required to study sensorimotor development. The robot, called Babybot, consists of a head, an arm and an anthropomorphic hand. From the sensory point of view the robot is equipped with stereoscopic vision, an acoustic system, a vestibular system and tactile sensors (more details are reported in fig. 1). The development of the robot follows a path that, for simplicity, can be divided into three phases: learning a representation of its own



body, learning to interact with objects, and learning to interpret events. During the first phase the robot explores the sensory and motor capabilities of its own body and builds an appropriate internal representation of it. In the second phase exploration shifts toward the external world: the robot learns to control its body so as to perform actions to interact with the surrounding objects. An example in this sense is the ability to move the arm so as to touch or grasp an object with the hand. In the third phase the robot uses its knowledge of the regularities of the external world to learn to predict the consequences of its own actions on the environment. The experience accumulated by the robot can be used at this stage to interpret observed events on the basis of those generated by the robot itself (Fitzpatrick et al. 2003). In the following we report some experiments concerning the first two phases.

Fig. 1. The Babybot. The setup used consists of an anthropomorphic robot called Babybot (Metta 2000; Natale 2004) with a total of 18 degrees of freedom. From the sensory point of view the robot is equipped with a pair of CCD cameras, two microphones, an inertial sensor, and tactile and proprioceptive sensors (the latter consisting of encoders providing information on the joint positions).


4. The body

The execution of a movement requires transforming the sensory information corresponding to the desired state1 of the system into the sequence of commands needed to execute the movement itself. In some cases this transformation is rather simple. Consider, for example, the human oculomotor system and the problem of controlling the eye muscles to direct the gaze toward a visually identified object. In this case the visual information is already expressed in a reference frame that is somehow “compatible” with the motor system. The eye control system must bring the object to the center of the retina or, in other words, zero the corresponding retinal error. The problem can be solved with a closed-loop controller, and learning in this case reduces to a simple estimation of the controller gains. The situation becomes more complicated, however, if the object is not visible and its position is identified thanks to the sound it produces (e.g. an object falling outside the visual field or in the absence of light). In this case the coordinate conversion starts from the space in which the position of the sound event is represented (e.g. the phase difference between the sound perceived by the right ear and that perceived by the left) and it is not possible to use a closed-loop controller, since the position of the eyes does not influence the auditory perception. A more complex transformation is therefore needed. In some cases the conversions between reference frames can be represented as a generic transformation Δq = f(s), where s is the sensory signal and Δq the motor command needed to execute the task under consideration. The function f(·) encapsulates knowledge of both the internal parameters of the robot and the external environment and is called an inverse transformation (or inverse model), since it converts a sensory percept into the motor command needed to obtain it. In some cases the inverse function is not defined; for example, in the case of a redundant2 arm there is no function that, given the position of the arm's end point (the hand), provides a unique set of joint values to realize it (in these cases the mechanisms used by the brain to solve the problem are not fully known; one hypothesis can be found in Jordan and Rumelhart 1992). The opposite problem concerns the prediction of the sensory consequences of a given motor action. In this case the transformation is called direct (a forward model) and can be written as a function s = f(Δq) that converts a motor command Δq into the consequent sensory effect

¹ By state we mean here both the body and the external environment.
² A mechanical system is said to be redundant when it has more degrees of freedom than the minimum required to perform a given movement.
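As a rough illustration of the kind of inverse map discussed above, the sketch below fits Δq = f(s) from simulated (sound cue, motor command) pairs with a small least-squares regressor. It is only a minimal sketch on made-up toy data, not the Babybot implementation; the cue, the "true" command and the polynomial features are all hypothetical choices.

```python
import numpy as np

# Minimal sketch: learn an inverse map Dq = f(s) from an auditory cue s
# (e.g. an interaural phase difference) to the eye-motor command Dq that
# would centre the sound source. Training pairs would come from exploratory
# movements on the real robot; here they are simulated.

rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=200)                     # auditory cue (toy)
dq = 30.0 * s + 2.0 * s**3 + rng.normal(0, 0.5, 200)     # "true" command + noise

# Fit f(.) as a small polynomial regressor (any function approximator would do).
X = np.vander(s, N=4, increasing=True)                   # columns [1, s, s^2, s^3]
w, *_ = np.linalg.lstsq(X, dq, rcond=None)

def inverse_model(sound_cue: float) -> float:
    """Predicted motor command that should foveate the heard source."""
    return float((np.vander([sound_cue], N=4, increasing=True) @ w)[0])

print(inverse_model(0.3))
```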


In other words, the forward model s = f(Δq) provides a prediction of the state of the system following a motor command. The interesting property of forward models is that they are always defined, since every action has a unique consequence. In these cases the ultimate goal of learning reduces to a generic function-approximation problem, which can be solved with neural networks, optimization methods or, sometimes, a simple look-up table. Unfortunately, it is not always straightforward to devise a procedure that allows the system to autonomously collect the examples needed to learn the forward and inverse transformations. One possible solution is to generate random movements and measure their effects on the sensory system (in the robot, a similar procedure is used by Metta (2000) for eye movements and by Natale (2004) to compensate for the effect of gravity on the robot's arm). Another solution replaces random movements with commands generated by a simple proportional closed-loop controller (Gomi and Kawato 1993). In both cases the system uses an initialization procedure (random movements or a closed-loop controller) that allows the robot to move and, at the same time, to start learning. In the Babybot we addressed the problem of acquiring inverse models for the control of eye movements from visual, inertial and auditory information (Metta 2000; Natale et al. 2002; Panerai et al. 2002). Another possible learning procedure searches for similarities between different sensory modalities. This mechanism was used, for example, to acquire a model of the robot's body. The problem, in practice, consists in visually identifying the robot's hand and distinguishing it from the rest of the environment. In the solution adopted, the robot was programmed to perform, when needed, hand actions that simplify the processing of the visual data. In this case learning takes place during an exploration phase in which the robot performs periodic movements with its hand, while a dedicated module looks for periodic movements within the proprioceptive and visual signals. If periodic movements are identified in both sensory channels, the oscillation period is extracted and compared. If the two periods match, the image points that passed the test are selected as belonging to the hand. The procedure makes it possible to collect examples for a supervised learning problem and to train a forward model that predicts the position of the hand in the visual field from proprioceptive information alone (the feedback signal from the motor encoders). Figure 2 shows the result of learning the sensorimotor transformation that allows the hand to be localized.
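The periodicity test described above can be sketched as follows. This is a simplified, simulated stand-in (sine waves in place of real proprioceptive and per-pixel image-motion signals, an FFT peak in place of the robot's period detector), not the actual Babybot code; once the hand pixels are found this way, the (joint angles, image position) pairs would serve as training examples for the forward model.

```python
import numpy as np

# Minimal sketch of the periodicity test used to find the hand in the image:
# the robot waves its hand, then pixels whose motion has the same dominant
# period as the proprioceptive (joint) signal are labelled as "hand".

fs, t = 30.0, np.arange(0, 4, 1 / 30.0)          # 30 Hz sampling, 4 s of data
joint = np.sin(2 * np.pi * 1.0 * t)              # proprioceptive signal, 1 Hz waving

def dominant_period(x, fs):
    """Period (s) of the strongest non-DC frequency component of x."""
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    return 1.0 / freqs[np.argmax(spectrum[1:]) + 1]

# Per-pixel motion signals: one pixel moving with the hand, one with clutter.
hand_pixel    = 0.8 * np.sin(2 * np.pi * 1.0 * t + 0.4)
clutter_pixel = 0.8 * np.sin(2 * np.pi * 2.7 * t)

p_joint = dominant_period(joint, fs)
for name, sig in [("hand", hand_pixel), ("clutter", clutter_pixel)]:
    match = abs(dominant_period(sig, fs) - p_joint) < 0.1
    print(name, "matches hand period:", match)
```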


5. Movement control

In the previous section we described several learning mechanisms that allow the robot to learn some forward and inverse models of its own body and of the environment. In general, an inverse model solves a planning or control problem, while a forward model makes it possible to anticipate the consequence of a given action. The distinction, however, is not so sharp: a forward model can also be used to solve an inverse problem (Jordan and Rumelhart 1992). Consider, for example, the problem of controlling the hand to reach an object. The problem can be formulated in terms of an inverse function linking the position of the object in the image to the corresponding motor command needed to move the arm. In joint space the motor command corresponds to the arm posture that minimizes the distance between the hand and the target, and it can alternatively be found with an iterative procedure that uses the forward function described in the previous section (see the Appendix for details). An example of the use of this technique to reach an object is shown in figure 2.

Fig. 2. Left: learning hand localization. The dark cross marks the position of the hand, while the ellipse approximates its shape and orientation. Right: reaching a hypothetical target at the centre of the image. T0 and T1 mark the beginning and the end of the trajectory, respectively.

6. Conclusions

In this article we have presented an approach to building a humanoid robot inspired by the early phases of human sensorimotor development. The ability to manipulate the surrounding environment is considered one of the crucial aspects of the development of intelligence in human beings. In robotics, the study of manipulation has traditionally been aimed at the mere execution of assembly or assistance tasks. Controlling the interaction between robot and environment is one of the most difficult aspects of robotics, but at the same time its strength, because it allows the system to autonomously explore the environment and the link between motor activity and perception. We have seen how these aspects lie at the basis of the mechanisms that allow humans to adapt to the environment and to understand its structure and rules. For these reasons we believe that the approach presented here offers not only the opportunity to build more adaptable robots, but also a new way of approaching the study of intelligence in artificial systems.

Appendix

Mathematically, the problem of controlling the arm to reach an object can be formulated as the minimization of the following cost function:

min_q (f(q) − x)²

where x is the position of the target and f(q) the position of the arm (hand), computed using its forward model. The value of q that solves the above problem is the desired solution and can be determined iteratively, for example with the gradient descent method:

q_{t+1} = q_t − k · (f(q_t) − x) ∇f(q_t)

where k is a positive constant that sets the descent rate (the gradient of the cost with respect to q is proportional to (f(q) − x) ∇f(q), so the update moves q against it) and ∇f(q) is the gradient of the forward model:

∇f(q) = ∂f(q) / ∂q

The gradient can be computed analytically or replaced by a local estimate. This technique does not require an explicit inversion and, as such, can be applied even when the arm is redundant or in the presence of singularities.
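As an illustration, the sketch below applies this iterative scheme to a hypothetical planar two-link arm standing in for the learned forward model; the link lengths, gain and target are made-up values, and the gradient is estimated numerically as suggested above.

```python
import numpy as np

# Minimal sketch of the iterative reaching scheme of the Appendix: the joint
# configuration q is updated by gradient descent on (f(q) - x)^2, where f is
# a forward model, here a planar 2-link arm used as a stand-in.

L1, L2 = 0.3, 0.25                        # hypothetical link lengths (m)

def f(q):
    """Forward model: joint angles -> hand position in the plane."""
    return np.array([
        L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
        L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1]),
    ])

def jacobian(q, eps=1e-5):
    """Local numerical estimate of the gradient of the forward model."""
    J = np.zeros((2, 2))
    for i in range(2):
        dq = np.zeros(2); dq[i] = eps
        J[:, i] = (f(q + dq) - f(q - dq)) / (2 * eps)
    return J

def reach(target, q0, k=0.5, steps=200):
    q = np.array(q0, dtype=float)
    for _ in range(steps):
        err = f(q) - target
        q -= k * jacobian(q).T @ err       # q_{t+1} = q_t - k (f(q) - x) grad f(q)
    return q

q_final = reach(target=np.array([0.3, 0.2]), q0=[0.4, 0.6])
print(q_final, f(q_final))
```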


References

Arbib M.A., Billard A., Iacoboni M. and Oztop E. (2000), Synthetic brain imaging: grasping, mirror neurons and imitation, in «Neural Networks», 13, pp. 975-997.

Bushnell E. and Boudreau J. (1993), Motor development and the mind: the potential role of motor abilities as a determinant of aspects of perceptual development, in «Child Development», 64, pp. 1005-1021.

Fitzpatrick P., Metta G., Natale L., Rao S. and Sandini G. (2003), Learning about objects through action: Initial steps towards artificial cognition, in IEEE International Conference on Robotics and Automation (ICRA 2003), Taipei, Taiwan.

Fogassi L., Gallese V., Fadiga L., Luppino G., Matelli M. and Rizzolatti G. (1996), Coding of peripersonal space in inferior premotor cortex (area F4), in «Journal of Neurophysiology», 76, pp. 141-157.

Gomi H. and Kawato M. (1993), Neural network control for a closed-loop system using feedback-error-learning, in «Neural Networks», 6, pp. 933-946.

Graziano M.S.A., Hu X. and Gross C.G. (1997), Visuo-spatial properties of ventral premotor cortex, in «Journal of Neurophysiology», 77, pp. 2268-2292.

Jeannerod M. (1997), The cognitive neuroscience of action, Cambridge, Mass. - Oxford, Blackwell Publishers.

Jordan M.I. and Rumelhart D.E. (1992), Forward models: Supervised learning with a distal teacher, in «Cognitive Science», 16, pp. 307-354.

Konczak J. and Dichgans J. (1997), Goal-directed reaching: development toward stereotypic arm kinematics in the first three years of life, in «Experimental Brain Research», 117, pp. 346-354.

Lederman S.J. and Klatzky R.L. (1987), Hand movements: A window into haptic object recognition, in «Cognitive Psychology», 19, pp. 342-368.

Metta G. (2000), Babyrobot: A study into sensori-motor development, PhD Thesis, LIRA-Lab, DIST, University of Genoa.

Milner D.A. and Goodale M.A. (1995), The visual brain in action, Oxford, Oxford University Press.

Natale L. (2004), Linking action to perception in a humanoid robot: A developmental approach to grasping, PhD Thesis, DIST, University of Genoa.

Natale L., Metta G. and Sandini G. (2002), Development of auditory-evoked reflexes: Visuo-acoustic cues integration in a binocular head, in «Robotics and Autonomous Systems», 39, pp. 87-106.

Needham A., Barret T. and Peterman K. (2002), A pick-me-up for infants' exploratory skills: Early simulated experiences reaching for objects using «sticky mittens» enhances young infants' object exploration skills, in «Infant Behavior & Development», 25, pp. 279-295.

Panerai F., Metta G. and Sandini G. (2002), Learning stabilization reflexes in robots with moving eyes, in «Neurocomputing», 48, pp. 323-337.

Rizzolatti G. and Gentilucci M. (1988), Motor and visual-motor functions of the premotor cortex, in Neurobiology of neocortex, Chichester, Wiley, pp. 269-284.

Rosander K. and von Hofsten C. (2004), Infants' emerging ability to represent object motion, in «Cognition», 91, pp. 1-22.


Streri A. (1993), Seeing, reaching, touching: The relations between vision and touch in infancy, Cambridge, Mass., MIT Press.

von Hofsten C. (1982), Eye-hand coordination in newborns, in «Developmental Psychology», 18, pp. 450-461.

von Hofsten C. and Ronnqvist L. (1993), The structuring of neonatal arm movements, in «Child Development», 64, pp. 1046-1057.

von Hofsten C., Vishton P., Spelke E.S., Feng Q. and Rosander K. (1998), Predictive action in infancy: Tracking and reaching for moving objects, in «Cognition», 67, pp. 255-285.

Lorenzo Natale, MIT CSAIL, 32 Vassar Street, Cambridge, Mass., 02139, USA and LIRA-Lab, DIST, Università di Genova, Viale Causa 13, 16145 Genova. E-mail: [email protected]
Giorgio Metta and Giulio Sandini, LIRA-Lab, DIST, Università di Genova, Viale Causa 13, 16145 Genova. E-mail: [email protected] / [email protected]

The research described in this article is funded in part by the European Commission through the projects RobotCub FP6-IST-004370 and CONTACT NEST-5010.


Learning Association Fields from Natural Images

Francesco Orabona, Giorgio Metta and Giulio Sandini
LIRA-Lab, DIST

University of Genoa, Genoa, ITALY 16145
{bremen,pasa,giulio}@liralab.it

Abstract

Previous studies have shown that it is possible to learn certain properties of the responses of the neurons of the visual cortex, as for example the receptive fields of complex and simple cells, through the analysis of the statistics of natural images and by employing principles of efficient signal encoding from information theory. Here we want to go further and consider how the output signals of 'complex cells' are correlated and which information is likely to be grouped together. We want to learn 'association fields', which are a mechanism to integrate the output of filters with different preferred orientation, in particular to link together and enhance contours. We used static natural images as training set and the tensor notation to express the learned fields. Finally we tested these association fields in a computer model to measure their performance.

1. Introduction

The goal of perceptual grouping in computer vision is to organize visual primitives into higher-level primitives, thus explicitly representing the structure contained in the data. The idea of perceptual grouping for computer vision has its roots in the well-known work of the Gestalt psychologists back at the beginning of the last century, who described, among other things, the ability of the human visual system to organize parts of the retinal stimulus into "Gestalten", that is, into organized structures. They formulated a number of so-called Gestalt laws (proximity, common fate, good continuation, closure, etc.) that are believed to govern our perception. It is logical to ask if these laws are present in the statistics of the world.

On the other hand, it has long been hypothesized that the early visual system is adapted to the input statistics [1]. Such an adaptation is thought to be the result of the joint work of evolution and learning during development. Neurons, acting as coincidence detectors, can discover and use regularities in the incoming flow of sensory information, which eventually represent the Gestalt laws. It has been proposed, for example, that the mechanism that links together the elements of a contour is rooted in our biology, with neurons with lateral and feedback connections implementing these laws.

There is a large body of literature about computational modeling of various parts of the visual cortex, starting from the assumption that certain principles guide the neural code ([21] for a review). In this view it is important to understand why the neural code is as it is. Bell and Sejnowski [2], for example, demonstrated that it is possible to learn receptive fields similar to those of simple cells starting from natural images. In particular, they demonstrated that it is possible to reproduce these receptive fields hypothesizing the sparsity and independence of the neural code. In spite of this, there is very little literature on learning an entire hierarchy of features, that is, not only the first layer, and possibly starting from these initial receptive fields.

A step in the construction of this hierarchy is the use of 'association fields' [5]. In the literature, these fields are often hand-coded and employed in many different models with the aim of reproducing human performance in contour integration. These fields are supposed to resemble the pattern of excitatory and inhibitory lateral connections between different orientation detector neurons as found, for instance, by Schmidt et al. [19]. In fact, Schmidt has shown that cells with an orientation preference in area 17 of the cat are preferentially linked to iso-oriented cells. Furthermore, the coupling strength decreases with the difference in the preferred orientation of pre- and post-synaptic cell. Models typically consider variations of the co-circular approach [8, 9, 13], that is, two oriented elements are part of the same curve if they are tangent to the same circle. Others [22] have considered exponential curves instead of circles, obtaining similar results.

Our question is whether it is possible to learn these association fields from the statistics of natural images. Different authors have used different approaches: using a database of tagged images [4, 6], using motion as an implicit tagger [18] or hypothesizing certain coding properties of the cortical layer [10].

Figure 1. Example image of the dataset.

Our approach is similar to the one of Sigman et al. [20], which uses images as the sole input. Further, we aim to obtain precise association fields, useful to link contours in a computer model.

The rest of the paper is organized as follows: section 2 contains a description of the method. Section 3 describes a first set of experimental results and a method to overcome problems due to the non-uniform distribution of the image statistics. In section 4 we show the fields computed with this last modification, and finally in sections 5 and 6 we show the performance of the fields in edge detection on a database of natural images and we draw some conclusions.

2. Learning from images

We assume the existence of a first layer that simulates the behavior of the complex cells; in this paper we do not address the issue of how to learn them, since we are interested in the next level of the hierarchy. Using the output of this layer we want to estimate the mean activity around points with a given orientation. For example, it is likely that if a certain image position contains a horizontal orientation, then the adjacent pixels on the same line would be points with an almost horizontal orientation.

To have a precise representation of the orientations and at the same time something mathematically convenient, we have chosen to use the tensor notation. Second order symmetric tensors can capture the information about the first order differential geometry of an image. Each tensor describes both the orientation of an edge and its confidence at each point. The tensor can be visualized as an ellipse, whose major axis represents the estimated tangential direction and the difference between the major and minor axis the confidence of this estimate. Hence a point on a line will be associated with a thin ellipse, while a corner with a circle. Consequently, given the orientation of a reference pixel, we estimate the mean tensor associated with the surrounding pixels. The use of the tensor notation gives us the possibility to exactly estimate the preferred orientation at each point of the field and also to quantify its strength and confidence.

We have chosen to learn a separate association field for each possible orientation. This is done for two main reasons:

• It is possible to find differences between the association fields. For example, it is possible to verify that the association field for the orientation of 0 degrees is stronger than that of 45 degrees.

• For applications of computer vision, considering the discrete nature of digital images, it is better to separate the masks for each orientation, instead of combining the data in a single mask that has to be rotated, leading to sampling problems. The rotation can be done safely only if there is a mathematical formula that represents the field, while on the other hand we are inferring the field numerically.

We have chosen to learn 8 association fields, one for each discretized orientation. The extension of the fields is chosen to be 41x41 pixels taken around each point. It should be noted that even if we quantized the orientation of the (central) reference pixel to classify the fields, the information about the remaining pixels in the neighborhood was not quantized, differently from [6, 20]. There is neither a threshold nor a pre-specified number of bins for discretization, and thus we obtain a precise representation of the association field.
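A minimal sketch of this accumulation step is given below, assuming that a per-pixel tensor map (as defined in section 2.2) and an edge mask are already available for each training image; the array names and the simple rounding used to quantize the central orientation are illustrative choices of ours, not the authors' code.

```python
import numpy as np

# Sketch of the field-learning loop: eight accumulators, one per quantized
# orientation of the central pixel, average the 41x41 tensor neighbourhoods
# of all edge pixels over the training images.

N_ORI, HALF = 8, 20                      # 8 fields, 41x41 patches
fields = np.zeros((N_ORI, 2 * HALF + 1, 2 * HALF + 1, 2, 2))
counts = np.zeros(N_ORI)

def main_orientation(T):
    """Angle (rad, in [0, pi)) of the principal eigenvector of a 2x2 tensor."""
    w, v = np.linalg.eigh(T)
    e1 = v[:, np.argmax(w)]
    return np.arctan2(e1[1], e1[0]) % np.pi

def accumulate(tensor_map, edge_mask):
    """Add the 41x41 tensor neighbourhoods of all edge pixels to the fields."""
    H, W = edge_mask.shape
    for y in range(HALF, H - HALF):
        for x in range(HALF, W - HALF):
            if not edge_mask[y, x]:
                continue
            b = int(round(main_orientation(tensor_map[y, x]) / np.pi * N_ORI)) % N_ORI
            fields[b] += tensor_map[y - HALF:y + HALF + 1, x - HALF:x + HALF + 1]
            counts[b] += 1

# After all images, the association field for each orientation is the mean patch:
# mean_fields = fields / counts[:, None, None, None, None]
```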

Images used for the experiments were taken from the publicly available database (Berkeley Segmentation Database [15]), which consists of 300 color images of 321x481 and 481x321 pixels; 200 of them were converted to black and white and used to learn the fields, collecting 41x41 patches; an example image from the dataset is shown in figure 1.

Figure 2. Complex cells output to the image in figure 1 for the 0 degrees filter of formula (1).

2.1. Feature extraction stage

There are several models of the complex cells of V1, but we have chosen to use the classic energy model [16] on the intensity channel. The response is calculated as:

E_θ = √[ (I ∗ f_θ^e)² + (I ∗ f_θ^o)² ]   (1)

where f_θ^e and f_θ^o are a quadrature pair of even- and odd-symmetric filters at orientation θ. Our even-symmetric filter is a Gaussian second derivative, and the corresponding odd-symmetric one is its Hilbert transform. In figure 2 there is an example of the output of the complex cells model for the 0 degrees orientation.
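The following sketch implements formula (1) for a single orientation, assuming separable filters built from a 1-D Gaussian second derivative and its Hilbert transform (obtained with scipy.signal.hilbert). It is a simplified stand-in for the authors' exact filter bank, with hypothetical parameter choices.

```python
import numpy as np
from scipy.signal import hilbert, fftconvolve

# Minimal sketch of the energy model of formula (1) for the 0-degree channel,
# using separable filters: the even filter is a Gaussian second derivative
# across the edge, the odd one its Hilbert transform.

sigma = 2.0
x = np.arange(-4 * sigma, 4 * sigma + 1)
g = np.exp(-x**2 / (2 * sigma**2))
g /= g.sum()
even_1d = (x**2 / sigma**4 - 1 / sigma**2) * g       # second derivative of Gaussian
odd_1d = np.imag(hilbert(even_1d))                    # quadrature (odd) counterpart

f_even = np.outer(even_1d, g)                         # 2D separable even filter
f_odd = np.outer(odd_1d, g)                           # 2D separable odd filter

def complex_cell_energy(image):
    """E_0 = sqrt((I*f_e)^2 + (I*f_o)^2) for the 0-degree channel."""
    r_e = fftconvolve(image, f_even, mode="same")
    r_o = fftconvolve(image, f_odd, mode="same")
    return np.sqrt(r_e**2 + r_o**2)

# img = ...  # grayscale image as a 2D numpy array
# E0 = complex_cell_energy(img)
```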

Then the edges are thinned using a standard non-maximum suppression algorithm. This is equivalent to finding edges with a Laplacian of Gaussian and zero crossing. The outputs of these filters are used to construct our local tensor representation.

2.2. Tensors

In practice a second order tensor is denoted by a 2x2 matrix of values:

T = [ a11  a12
      a21  a22 ]   (2)

It is constructed by direct summation of three quadrature filter pair output magnitudes as in [11]:

T = Σ_{k=1..3} E_{θk} ( (4/3) n_k^T n_k − (1/3) I )   (3)

where E_{θk} is the filter output as calculated in (1), I is the 2x2 identity matrix and the filter directions n_k are:

n_1 = (1, 0),   n_2 = (1/2, √3/2),   n_3 = (−1/2, √3/2)   (4)
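A direct transcription of formulas (3) and (4) for a single pixel might look like the sketch below; the example energies are made up, and the eigen-decomposition anticipates the interpretation given in the next paragraph.

```python
import numpy as np

# Minimal sketch of formula (3): build the local structure tensor of one pixel
# from the three complex-cell energies E_k computed at the directions n_k of
# formula (4).

n = np.array([
    [1.0, 0.0],
    [0.5, np.sqrt(3) / 2],
    [-0.5, np.sqrt(3) / 2],
])
I2 = np.eye(2)

def local_tensor(E):
    """E = (E_1, E_2, E_3): filter output magnitudes for the three directions."""
    T = np.zeros((2, 2))
    for Ek, nk in zip(E, n):
        T += Ek * (4.0 / 3.0 * np.outer(nk, nk) - I2 / 3.0)
    return T

T = local_tensor([0.9, 0.3, 0.2])                 # made-up energies
lam, vec = np.linalg.eigh(T)                      # eigenvalues, ascending order
lam2, lam1 = lam                                  # lam1 >= lam2
orientation = np.arctan2(vec[1, -1], vec[0, -1])  # direction of main eigenvector
confidence = lam1 - lam2                          # "edgeness" of the pixel
print(orientation, confidence)
```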

The greatest eigenvalue λ1 and its corresponding eigenvector e1 of the tensor associated with a pixel represent, respectively, the strength and the direction of the main orientation. The second eigenvalue λ2 and its eigenvector e2 have the same meaning for the orthogonal orientation. The difference λ1 − λ2 is proportional to the likelihood that a pixel contains a distinct orientation.

3. Preliminary results

We have run our test only for a single scale, choosing the σ of the Gaussian filters equal to 2, since preliminary tests have shown that a similar version of the fields is obtained with other scales as well. Two of the obtained fields are shown in figures 3 and 4. It is clear that they are somewhat corrupted by the presence of horizontal and vertical orientations in any of the considered neighborhoods and by the fact that in each image patch there are edges that do not pass through the central pixel. On the other hand, we want to learn association fields for curves that do pass through the central pixel. Geisler et al. [6] used a human-labeled database of images to infer the likelihood of finding edges with a certain orientation relative to the reference point. On the other hand, Sigman et al. [20], using only relative orientations and not absolute ones, could not have seen this problem. In our case we want to use unlabeled data to demonstrate that it is possible to learn from raw images and, as mentioned earlier, we do not want to consider only the relative orientations, but rather a different field for each orientation. We believe that this is the same problem that Prodohl et al. [18] experienced using static images: the learned fields supported collinearity in the horizontal and vertical orientations but hardly in the oblique ones. They solved this problem using motion to implicitly tag only the important edges inside each patch.

Figure 3. Main directions for the association field for the orientation of 0 degrees in the central pixel.

Figure 4. Main directions for the association field for the orientation of 67.5 degrees in the central pixel.

3.1. The path across a pixel

The neural way to solve the problem shown earlier is thought to be the synchrony of the firing between nearby neurons: if stimuli co-occur, then the neurons synchronize [7]. Inspired by this, we considered in each patch only the pixels that belong to a curve that goes through the central pixel. In this way the gathered data will contain only information about curves connected to the central pixel. Note that we select curves inside each patch, not inside the entire image. The simple algorithm used to select the pixels in each patch is the following (a minimal sketch in code is given right after the list):

1. put central pixel of the patch in a list;

2. tag first pixel in the list and remove it from the list. Put surrounding pixels that are active (non-zero) in the list;

3. if the list is empty quit otherwise go to 2.
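The sketch below writes this selection step as a plain breadth-first flood fill over the active pixels of a patch. The 8-neighbourhood connectivity is our assumption, since the paper does not state which connectivity is used.

```python
import numpy as np
from collections import deque

# Keep only the active (non-zero) pixels of a patch that are connected,
# through other active pixels, to the central pixel.

def select_curve_through_center(patch):
    """Boolean mask of active pixels connected (8-neighbourhood) to the centre."""
    H, W = patch.shape
    cy, cx = H // 2, W // 2
    selected = np.zeros_like(patch, dtype=bool)
    if patch[cy, cx] == 0:
        return selected                   # no curve goes through the centre
    queue = deque([(cy, cx)])
    selected[cy, cx] = True
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < H and 0 <= nx < W
                        and patch[ny, nx] != 0 and not selected[ny, nx]):
                    selected[ny, nx] = True
                    queue.append((ny, nx))
    return selected
```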

With this procedure we remove the influence of horizontal and vertical edges, which are more frequent in the images and are not removed by the process of averaging. On the other hand, we are losing some information, for example about parallel lines, which in any case should not be useful for the enhancement of contours. Note that this method is completely parameter free; we are not selecting the curves following some specific criterion, we are just pruning some kind of noise from the training set. It is important to note that this method will still learn the bias towards horizontal and vertical edges present in natural images [3], but it will not be biased to learn only these statistics, as in Prodohl et al. [18] when using static images.

4. Results

We tested the modified procedure on the database of natural images and also on random images (results not shown), to verify that the results were not an artifact due to the method.

In figures 5 and 6 there are, respectively, the main orientations, their strengths (eigenvalues) and the strengths in the orthogonal directions of the mean estimated tensors for the orientation of 0 degrees of the central pixel. The same is shown in figures 7 and 8 for 67.5 degrees. The structure of the obtained association field closely resembles the fields proposed by others based on collinearity and co-circularity. We note that the size of the long-range connections far exceeds the size of the classical receptive field. We note also that the noisier regions in the orientation correspond to very small eigenvalues, so they do not influence the final result very much.

While all the fields have the same trend, there is a clear difference in the decay of the strength of the fields. To see this we have considered only the values along the direction of the orientation in the center, normalizing the maximum values to one. Figure 9 shows this decay. It is clear that the fields for horizontal and vertical edges have a wider support, confirming the results of Sigman et al. [20].

Figure 5. Main directions for the association field for the orientation of 0 degrees in the central pixel, with the modified approach.

Figure 6. Difference between the two eigenvalues of the association field of figure 5.

Figure 7. Main directions for the association field for the orientation of 67.5 degrees, with the modified approach.


Figure 8. Difference between the two eigenvalues of the association field of figure 7.

Figure 9. Comparison of the decay for the various orientations (0, 22.5, 45, 67.5 and 90 degrees). On the y axis are the first eigenvalues normalized to a maximum of 1; on the x axis is the distance from the reference point along the main field direction.

5. Using the fields

The obtained fields can be used with any existing model of contour enhancement, but to test them we have used the tensor voting scheme proposed by Guy and Medioni [9]. The choice is somewhat natural considering the fact that the obtained fields are already tensors. In the tensor voting framework points communicate with each other in order to refine and derive the most preferred orientation information. Differently from the original tensor voting algorithm, we do not have to choose the right scale of the fields [12], since it is implicit in the learnt fields. We compared the performance of the tensor voting algorithm using the learned fields versus the simple output of the complex cell layer, using the Berkeley Segmentation Database and the methodology proposed by Martin et al. [14, 15]. We can see the results in figure 10: there is a clear improvement using tensor voting with the learned association fields instead of just using the simulated outputs of the complex cells alone.
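A much-simplified sketch of this voting step is given below, assuming the mean fields learned earlier and a per-pixel saliency and orientation-bin map from the complex-cell layer. It only stamps each pixel's learned field, weighted by that pixel's saliency, into an accumulator and reads the refined saliency from the accumulated tensors; the refinements of the full tensor voting framework are omitted.

```python
import numpy as np

# Simplified voting with the learned association fields: every edge pixel
# "votes" by adding the mean field of its orientation bin, scaled by its own
# saliency; the refined saliency is lambda1 - lambda2 of the summed tensor.

HALF = 20   # the learned fields are 41x41 patches of 2x2 tensors

def vote(saliency, ori_bin, mean_fields):
    H, W = saliency.shape
    acc = np.zeros((H, W, 2, 2))
    for y in range(HALF, H - HALF):
        for x in range(HALF, W - HALF):
            if saliency[y, x] <= 0:
                continue
            acc[y - HALF:y + HALF + 1, x - HALF:x + HALF + 1] += (
                saliency[y, x] * mean_fields[ori_bin[y, x]]
            )
    w = np.linalg.eigvalsh(acc)           # eigenvalues per pixel, ascending
    return w[..., 1] - w[..., 0]          # refined edge saliency map
```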

Figure 10. Precision-recall comparison between tensor voting with the learned fields (PG label) and the complex cell layer alone (OE label).

Figure 11. Test image contours using the complex cell layer alone.

Figure 12. Test image contours using tensor voting with the learned fields.

Examples of the results on the test image of figure 1, after the non-maximum suppression procedure, are shown in figures 11 and 12.

6. Conclusion

Several authors have studied the mutual dependencies of simulated complex cell responses to natural images. The main result from these studies is that these responses are not independent and that they are highly correlated when they are arranged collinearly or on a common circle. In the present paper we have presented a method to learn precise association fields from natural images. A bio-inspired procedure to get rid of the non-uniform distribution of orientations is used, without the need of a tagged database of images [4, 6], the use of motion [18] or supposing the cortical signals sparse and independent [10]. The learned fields were used in a computer model, using the tensor voting method, and the results were compared using a database of human tagged images, which helps in providing clear numerical results.

However, the problem of learning useful complex features from natural images may in any case find a limit beyond these contour enhancement networks. In fact, the usefulness of a feature is not directly related to image statistics but supposes the existence of an embodied agent acting in the natural environment, not just perceiving it. In this sense, in the future we would like to link strategies like the one used by Natale et al. [17] with the approach of the current paper, connecting the first stages of unsupervised learning, which reduce the dimensionality of the inputs, to later stages of supervised learning that extract features useful for a given task.

Acknowledgment

This work was supported by the EU projects RobotCub (IST-2004-004370) and CONTACT (NEST-5010). The authors would like to thank Harm Aarts for the discussion on the earlier version of the manuscript.

References

[1] H. B. Barlow. Possible principles underlying the transformations of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217-234. MIT Press, 1961.

[2] A. J. Bell and T. J. Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37:3327-3338, 1997.

[3] D. M. Coppola, H. R. Purves, A. N. McCoy, and D. Purves. The distribution of oriented contours in the real world. PNAS, 95:4002-4006, 1998.

[4] J. H. Elder and R. M. Goldberg. Ecological statistics of gestalt laws for the perceptual organization of contours. Journal of Vision, 2:324-353, 2002.

[5] D. J. Field, A. Hayes, and R. F. Hess. Contour integration by the human visual system: evidence for a local "association field". Vision Research, 33(2):173-193, 1993.

[6] W. S. Geisler, J. S. Perry, B. J. Super, and D. P. Gallogly. Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41:711-724, 2001.

[7] C. M. Gray, P. Konig, A. K. Engel, and W. Singer. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338:334-336, 1989.

[8] S. Grossberg and E. Mingolla. Neural dynamics of perceptual grouping: textures, boundaries, and emergent segmentations. Perception & Psychophysics, 38:141-171, 1985.

[9] G. Guy and G. Medioni. Inferring global perceptual contours from local features. Int. J. of Computer Vision, 20:113-133, 1996.

[10] P. O. Hoyer and A. Hyvarinen. A multilayer sparse coding network learns contour coding from natural images. Vision Research, 42(12):1593-1605, 2002.

[11] H. Knutsson. Representing local structure using tensors. In Proc. 6th Scandinavian Conference on Image Analysis, pages 244-251, Oulu, Finland, 1989.

[12] M.-S. Lee and G. Medioni. Grouping ., -, →, θ, into regions, curves and junctions. Journal of Computer Vision and Image Understanding, 76(1):54-69, 1999.

[13] Z. Li. A neural model of contour integration in the primary visual cortex. Neural Computation, 10:903-940, 1998.

[14] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):530-549, 2004.

[15] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416-423, July 2001.

[16] M. Morrone and D. Burr. Feature detection in human vision: A phase dependent energy model. Proc. Royal Soc. of London B, 235:221-245, 1988.

[17] L. Natale, F. Orabona, G. Metta, and G. Sandini. Exploring the world through grasping: a developmental approach. In Proc. 6th CIRA Symposium, pages 27-30, June 2005.

[18] C. Prodohl, R. P. Wurtz, and C. von der Malsburg. Learning the gestalt rule of collinearity from object motion. Neural Computation, 15:1865-1896, 2003.

[19] K. Schmidt, R. Goebel, S. Lowel, and W. Singer. The perceptual grouping criterion of collinearity is reflected by anisotropies of connections in the primary visual cortex. European Journal of Neuroscience, 5(9):1083-1084, 1997.

[20] M. Sigman, G. A. Cecchi, C. D. Gilbert, and M. O. Magnasco. On a common circle: Natural scenes and gestalt rules. PNAS, 98(4):1935-1940, 2001.

[21] E. Simoncelli and B. Olshausen. Natural image statistics and neural representation. Annual Review of Neuroscience, 24:1193-1216, 2001.

[22] V. Vonikakis, A. Gasteratos, and I. Andreadis. Enhancement of perceptually salient contours using a parallel artificial cortical network. Biological Cybernetics, 94:194-214, 2006.

Mirror neuron: a neurological approach to empathy

Giacomo Rizzolatti¹,³ and Laila Craighero²

1 Dipartimento di Neuroscienze, Sezione di Fisiologia, via Volturno, 3, Università di Parma, 43100, Parma, Italy;

2 Dip. SBTA, Sezione di Fisiologia Umana, via Fossato di Mortara, 17/19, Università di Ferrara, 44100 Ferrara, Italy;

3 Corresponding author: Giacomo Rizzolatti, e-mail: [email protected]

Changeux et al. Neurobiology of Human Values © Springer-Verlag Berlin Heidelberg 2005

Summary

Humans are an exquisitely social species. Our survival and success depend critically on our ability to thrive in complex social situations. But how do we understand others? What are the mechanisms underlying this capacity?

In the present essay we discuss a general neural mechanism ("mirror mechanism") that enables individuals to understand the meaning of actions done by others, their intentions, and their emotions, through activation of internal representations coding motorically the observed actions and emotions.

In the first part of the essay we will show that the mirror mechanism for "cold" actions, those devoid of emotional content, is localized in parieto-frontal cortical circuits. These circuits become active both when we do an action and when we observe another individual doing the same action. Their activity allows the observer to understand the "what" of an action.

We will show, then, that a "chained" organization of motor acts plus the mirror mechanism enables the observer to understand the intention behind an action (the "why" of an action) by observing the first motor act of that action.

Finally, we will discuss some recent data showing that the mirror mechanism localized in other centers, like the insula, enables the observer to understand the emotions of others. We will conclude by briefly discussing whether these biological data allow inferences about moral behavior.

Introduction

“How selfish soever man may be supposed, there are evidently some principles in his nature, which interest him in the fortune of others, and render their happiness necessary to him, though he derives nothing from it except the pleasure of seeing it.” This famous sentence by Adam Smith (1759), which so nicely describes our empathic relation with others, contains two distinct concepts.


The first is that individuals are endowed with a mechanism that makes them share the "fortunes" of others. By observing others, we enter in an "empathic" relation with them. This empathic relation concerns not only the emotions that others feel but also their actions. "The mob, when they are gazing at a dancer on the slack rope, naturally writhe and twist and balance their own bodies, as they see him do, and as they feel that they themselves must do if in his situation." (Smith 1759).

The second idea is that, because of our empathy with others, we are compelled to desire their happiness. If others are unhappy, we are also unhappy, because the other’s unhappiness somehow intrudes into us.

According to Smith, the way in which we enter into empathic relation may be voluntary (“As we have no immediate experience of what other men feel, we can form no idea of the manner in which they are affected, but by conceiving what we ourselves should feel in the like situation”), but also, as shown in the above-cited example of the “dancer,” may be automatically triggered by the observation of the behavior of others.

The aim of this essay is to discuss the existence of a neural mechanism, resembling that described by Adam Smith, that puts the individual in empathic contact with others. This mechanism – the mirror mechanism – enables the observer to understand the actions of others, the intention behind their actions, and their feelings. In a short "coda," we will discuss the validity of the second idea of Adam Smith, the obligatory link between our happiness and that of others.

Action understanding

Humans are social beings. They spend a large part of their time observing others and trying to understand what they are doing and why. Not only humans but also apes and monkeys have a strongly developed interest in others.

How are actions recognized? The traditional view is that action recognition is based exclusively on the visual system. The understanding of an action done by another individual depends on the activity of the higher order visual areas and, in particular, of those of the superior temporal sulcus, where there are neurons selectively activated by biological motions (Perrett et al. 1989; Carey et al. 1997; Allison et al. 2000; Puce and Perrett 2003).

Another hypothesis is that an action is recognized when the observed action activates, in the observer's brain, an analogous motor representation. The observer does not execute that action, because control mechanisms prevent its overt occurrence, but the evoked motor representation ("motor knowledge") allows him to understand the meaning of what he saw (Rizzolatti et al. 2001).

It is important to note that the two hypotheses are not in contraposition. Rather, they describe two different ways in which an action may be understood. The "visual" hypothesis describes a "third person" relation between the observer and the observed action. The action, albeit recognized in its general meaning, is not understood in all its implications, because it does not enter into the semantic motor network of the observing individual as well as in his/her private knowledge of what doing that action means. "Visual" understanding is similar to that which a robot, able to differentiate one action from another, may have, or that humans have when they see a bird flying or a dog barking (see below). In contrast, the "motor" hypothesis describes the "first person" understanding of what the individual is seeing. The observed action enters into the observer's motor representation and recalls his/her similar experiences when doing that action. It is an empathic recognition that makes the observer share the experience of the action agent.

Mirror Neuron System and Action Understanding

The strongest evidence in favor of the "motor hypothesis" is represented by mirror neurons. These neurons, originally found in the monkey ventral premotor cortex (area F5), are active both when the monkey does a particular action and when it observes another individual doing a similar action. The mirror neurons do not respond to object presentation. Similarly, they do not respond to the sight of an agent mimicking actions or performing non-object-directed gestures. Mirror neurons have also been described in the monkey parietal lobe (Fogassi et al. 1998; Gallese et al. 2002).

The mirror mechanism appears to be a mechanism particularly well suited for imitation. Imitation, however, appeared only late in evolution. Monkeys, which have a well-developed mirror system, lack this capacity and even apes have it only in a rather rudimentary form (see Tomasello and Call 1997; Visalberghi and Fragaszy 2002).

The properties of monkey mirror neurons also indicate that this system did not initially evolve for imitation. Mirror neurons typically show a good congruence between the visual actions they respond to and the motor responses they code, yet only in a minority of them do the effective observed and effective executed actions correspond in terms of both goal and means for reaching the goal. Most of them code the goal of the action (e.g., grasping) but not the way in which the observed action is done. These neurons are, therefore, of little use for imitation in the proper sense, that is, the capacity to imitate an action as it has been performed (Rizzolatti and Craighero 2004).

Summing up, it is very likely that the faculty for imitation developed on top of the mirror system. However, its initial basic function was not imitation but enabling an individual to understand actions done by others (see Rizzolatti et al. 2001).

Evidence in favor of the notion that mirror neurons are involved in action understanding comes from a recent series of studies in which mirror neurons were tested in experimental conditions in which the monkey could understand the meaning of an occurring action but had no visual information about it. The rationale of these studies was the following: if mirror neurons mediate action understanding, they should become active when the meaning of the observed action is understood, even in the absence of visual information.

The results showed that this is the case. In one study, F5 mirror neurons were recorded while the monkey was observing a "noisy" action (e.g., ripping a piece of paper) and then was presented with the same noise without seeing the action. The results showed that a large number of mirror neurons, responsive to the observation of noisy actions, also responded to the presentation of the sound proper of that action, alone. Responses to white noise or to the sound of other actions were absent or much weaker than responses to the preferred action. Neurons responding selectively to specific action sounds were named "audio-visual" mirror neurons (Kohler et al. 2002).

In another study, mirror neurons were tested by introducing a screen between the monkey and the location of an object (Umiltà et al. 2001). The idea underlying the experiment was that, if mirror neurons are involved in action understanding, they should also discharge in conditions in which the monkey does not see the occurring action but has sufficient clues to create a mental representation of what the experimenter does. The monkeys were tested in four conditions: 1) the experimenter is grasping an object; 2) the experimenter is miming grasping; and 3) and 4), the monkey observes the actions of 1) and 2) but the final critical part of them (hand-object interaction) is hidden by a screen. It is known that mirror neurons typically do not fire during the observation of mimed actions. At the beginning of each "hidden" trial, the monkey was shown whether there was or was not an object behind the screen. Thus, in the hidden condition, the monkey "knew" that the object was present behind the screen and could mentally represent the action.

The results showed that more than half of the tested neurons discharged in the hidden condition, thus indicating that the monkey was able to understand the goal of the action even in the absence of the visual aspect describing the action.

In conclusion, both the experiments show that the activity of mirror neurons correlates with action understanding. The visual features of the observed actions are necessary to trigger mirror neurons only insomuch as they allow the understanding of the observed actions. If action comprehension is possible on other bases, mirror neurons signal the action even in the absence of visual stimuli.

The Mirror Neuron System in Humans

Evidence, based on single neuron recordings, of the existence of mirror neurons in humans is lacking. Their existence is, however, indicated by EEG and MEG studies, TMS experiments, and brain imaging studies (see Rizzolatti and Craighero 2004). For the sake of space, we will review here only a tiny fraction of these studies.

MEG and EEG studies showed that the desynchronization of the motor cortex observed during active movements was also present during the observation of actions done by others (Hari et al. 1998; Cochin et al. 1999). Recently, desynchronization of cortical rhythms was found in functionally delimited language and hand motor areas in a patient with implanted subdural electrodes, both during observation and execution of finger movements (Tremblay et al. 2004).

TMS studies showed that the observation of actions done by others determines an increase of corticospinal excitability with respect to the control conditions. This increase concerned specifically those muscles that the individuals use for producing the observed movements (e.g., Fadiga et al. 1995; Strafella and Paus 2000; Gangitano et al. 2001, 2004).


Brain imaging studies allowed the localization of the cortical areas forming the human mirror neuron system. They showed that the observation of actions done by others activates, besides visual areas, two cortical regions whose function is classically considered to be fundamentally or predominantly a motor one: the inferior parietal lobule (area PF/PFG), and the lower part of the precentral gyrus (ventral premotor cortex) plus the posterior part of the inferior frontal gyrus (IFG) (Rizzolatti et al. 1996; Grafton et al. 1996; Grèzes et al. 1998, 2003; Iacoboni et al. 1999, 2001; Nishitani and Hari 2000, 2002; Buccino et al. 2001; Koski et al. 2002, 2003; Manthey et al. 2003; Johnson-Frey et al. 2003). These two regions form the core of the mirror neuron system in humans.

Recently, an fMRI experiment was carried out to see which type of observed actions is recognized using the mirror neuron system (Buccino et al. 2004). Video clips showing silent mouth actions done by humans, monkeys and dogs were presented to normal volunteers. Two types of actions were shown: biting and oral communicative actions (speech reading, lip-smacking, barking). Static images of the same actions were presented as a control.

The results showed that the observation of biting, regardless of whether done by a man, a monkey or a dog, determined the same two activation foci in the inferior parietal lobule, and in the pars opercularis of IFG and the adjacent precentral gyrus. Speech reading activated the left pars opercularis of IFG, whereas the observation of lip smacking activated a small focus in the right and left pars opercularis of IFG. Most interestingly, the observation of barking did not produce any mirror neuron system activation.

These results strongly support the notion mentioned above that actions done by other individuals can be recognized through different mechanisms. Actions belonging to the motor repertoire of the observer are mapped on his/her motor system. Actions that do not belong to this repertoire do not excite the motor system of the observer and appear to be recognized essentially on a visual basis. Thus, in the first case, the motor activation translates the visual experience into an empathic, first person knowledge, whereas this knowledge is lacking in the second case.

Intention understanding

The data discussed above indicate that the premotor and parietal cortices of primates contain a mechanism that allows individuals to understand the actions of others. Typically, an individual observing an action done by another person not only understands what that person is doing, but also why he/she is doing it. Let us imagine a boy grasping a mug. There are many reasons why the boy could grasp it, but the observer usually is able to infer why he did it. For example, if the boy grasped the cup by the handle, it is likely that he wants to drink the coffee, while if he grasped it by the top it is more likely that he wants to place it in a new location.

The issue of whether intention comprehension (the "why" of an observed action) could be mediated by mirror neurons has been recently addressed in a study in which the motor and visual properties of mirror neurons of the inferior parietal lobule (IPL) were investigated (Fogassi et al., manuscript in preparation).

IPL neurons that discharge during active grasping were selected. Subsequently, their motor activity was studied in two main conditions. In the first, the monkey grasped a piece of food located in front of it and brought it to its mouth (eating condition). In the second, the monkey grasped an object and placed it into a container (placing condition).

The results showed that the large majority of IPL grasping neurons (about 65% of them) were significantly influenced by the action in which the grasping was embedded. Examples are shown in Figure 1. Neurons coding grasping for eating were much more common than neurons coding grasping for placing, with a ratio of two to one.

Studies in humans showed that the kinematics of the first motor act of an action is influenced by the subsequent motor acts of that action (see Jeannerod 1988). The recordings of the reaching-to-grasp kinematics of the monkeys in the two experimental conditions described above confirmed these findings. The reaching-to-grasp movement followed by arm flexion (bringing the food to the mouth) was faster than the same movement followed by arm abduction (placing the food into the container). To control for whether the differential discharge of grasping neurons in the eating and placing conditions was due to this difference in kinematics rather than to the action goal, a new placing condition was introduced (see Fig. 1). In this condition the monkey had to grasp a piece of food and place it into a container located near its mouth. Thus, the new placing condition was identical in terms of goal to the original one but required, after grasping, arm flexion rather than arm abduction. The kinematics analysis of the reaching-to-grasp movement showed that the wrist peak velocity was fastest in the new placing condition, intermediate in the eating condition, and slowest in the original placing condition.

Neuron activity showed that, regardless of arm kinematics, the neuron selectivity remained unmodified. Neurons selective for placing in the far container showed the same selectivity for placing in the container near the monkey's mouth. Thus, it is the goal of the action that determines the motor selectivity of IPL neurons in coding a given motor act, rather than factors related to movement kinematics.

As in the premotor cortex, there are neurons in IPL that are endowed with mirror properties, discharging both during the observation and execution of the same motor act (Fogassi et al. 1998; Gallese et al. 2002). To see whether these neurons also discharge differentially during the observation of the same motor act but embedded in different actions, their visual properties were tested in the same two conditions as those used for studying their motor properties. The actions were performed by an experimenter in front of the monkey. In one condition, the monkey observed the experimenter grasping a piece of food and bringing it to his mouth; in the other, the monkey observed the experimenter placing an object into the container.

Fig. 1. A. Lateral view of the monkey brain showing the sector of IPL (gray shading) from which the neurons were recorded; cs, central sulcus; ips, inferior parietal sulcus. B. Schematic drawing illustrating the apparatus and the paradigm used for the motor task. Left: starting position of the task. A screen prevents the monkey from seeing the target. Right: after the screen is removed, the monkey could release the hand from the starting position, reach and grasp the object to bring it to its mouth (Condition I), or to place it into a container located near the target (II) or near its mouth (III). C. Activity of three IPL neurons during grasping in Conditions I and II. Rasters and histograms are synchronized with the moment when the monkey touched the object to be grasped. Red bars indicate the moment when the monkey released its hand from the starting position. Green bars indicate the moment when the monkey touched the container. Abscissa: time, bin = 20 ms; ordinate: discharge frequency. (From Fogassi et al., in preparation)

The results showed that more than two-thirds of IPL neurons were differentially activated during the observation of grasping in the placing and eating conditions. Examples are shown in Figure 2. Neurons responding to the observation of grasping for eating were more represented than neurons responding to the observation of grasping for placing, again with a two to one ratio.

A comparison between neuron selectivity during the execution of a motor act and motor act observation showed that the great majority of neurons (84%) have the same specificity during grasping execution and grasping observation. Thus, a mirror neuron whose discharge was higher during the observation of grasping for eating than during the observation of grasping for placing also had a higher discharge when the monkey grasped for eating than when it grasped for placing. The same was true for neurons selective for grasping for placing.

The motor and visual organization of IPL, just described, is of great interest for two reasons. First, it indicates that motor actions are organized in the parietal cortex in specific chains of motor acts; second, it strongly suggests that this chained organization might constitute the neural basis for understanding the intentions of others.

Evidence for the existence of action chains in IPL comes not only from the activation of neurons coding the same motor act in one condition and not in another, but also from the organization of IPL neurons' receptive fields. This organization shows that there is a predictive facilitatory link between subsequent motor acts. To give an example, there are IPL neurons that respond to the passive flexion of the forearm, have tactile receptive fields on the mouth, and in addition discharge during mouth grasping (Ferrari et al., manuscript in preparation). These neurons facilitated mouth opening when an object touched the mouth, but also when the monkey grasped it, producing a contact between the object and the hand's tactile receptive field. Recently, several examples of predictive chained organization in IPL have been described by Yokochi et al. (2003). If one considers that a fundamental aspect of action execution is its fluidity, the link between the subsequent motor acts forming an action and the specificity of the neurons coding them appears to be an optimal solution for executing an action without intervening pauses between the individual motor acts forming it.

The presence of a chained motor organization of IPL neurons has deep implications for intention understanding. The interpretation of the functional role of mirror neurons, as described above, was that of action understanding. A motor act done by another individual is recognized when this act triggers the same set of neurons that are active during the execution of that act. The action-related IPL mirror neurons allow one to extend this concept. These neurons discriminate one motor act from another, thus activating a motor act chain that codes the final goal of the action. In this way the observing individual may re-enact internally the observed action and thus predict its goal. In this way, the observer can "read" the intention of the acting individual.

This intention-reading interpretation predicts that, in addition to mirror neurons that fire during the execution and observation of the same motor act ("classical mirror neurons"), there should be neurons that are visually triggered by a given motor act but discharge during the execution not of the same motor act, but of another one that is functionally related to the former and part of the same action chain. Neurons of this type have been previously described both in F5 (Di Pellegrino et al. 1992) and in IPL (Gallese et al. 2002) and referred to as "logically related" mirror neurons. These "logic" mirror neurons were never theoretically discussed because their functional role was not clear. The findings just discussed allow us not only to account for their occurrence but also to indicate their necessity, if the chain organization is at the basis of intention understanding.

While the mechanism of intention understanding just described appears rather simple, it is more difficult to specify how the selection of a particular chain occurs. After all, what the observer sees is just a hand grasping a piece of food or an object.

There are various factors that may determine this selection. The first is the context in which the action is executed. In the study described above, the clue for possible understanding of the intention of the acting experimenter was either the presence of the container (placing condition) or its absence (eating condition). The second factor that may intervene in chain selection is the type of object that the experimenter grasped. Typically, food is grasped in order to be eaten. Thus, the observation of a motor act directed towards food is more likely to trigger grasping-for-eating neurons than neurons that code grasping for other purposes. This food-eating association is, of course, not mandatory but could be modified by other factors.

One of these factors is the standard repetition of an action. Another is, as mentioned before, the context in which the action is performed. Context and object type were found to interact in some neurons. For example, some neurons that selectively discharged during the observation of grasping for eating also discharged, although weakly, during the observation of grasping for placing when the object to be placed was food, but not when it was a solid. It was as if the eating chain was activated, although slightly, by food in spite of the presence of a contextual clue indicating that placing was the most likely action. A few neurons, instead of showing an intermediate discharge when the nature of the stimulus (food) and the context conflicted, decreased their discharge with time when the same action was repeated. It was as if the activity of the placing chain progressively inhibited the activity of the neurons of the eating chain.

Understanding “other minds” constitutes a special domain of cognition. Developmental studies clearly show that this cognitive faculty has various components and that there are various steps through which infants acquire it (see Saxe et al. 2004). Brain imaging studies also tend to indicate the possibility of an involvement of many areas in this function (Blakemore and Decety 2001; Frith and Frith 2003; Gallagher and Frith 2003).


Given the complexity of the problem, it would be naive to claim that the mechanism described in the present study is the mechanism at the basis of mind reading. Yet, the present data show for the first time a neural mechanism through which an important aspect of mind reading, understanding the intention of others, may be solved.

Emotion understanding

Up to now we have dealt with the neural mechanisms that enable individuals to understand “cold actions,” that is, actions without any obvious emotional content. In social life, however, equally important, and maybe even more so, is the capacity to decipher emotions. Which mechanisms enable us to understand what others feel? Is there a mirror mechanism for emotions similar to that for cold action understanding?

It is reasonable to postulate that, as for action understanding, there are two basic mechanisms for emotion understanding that are conceptually different from one another. The first consists in a cognitive elaboration of the sensory aspects of others' emotional behaviors. The other consists in a direct mapping of the sensory aspects of the observed emotional behavior onto the motor structures that determine, in the observer, the experience of the observed emotion.

Fig. 2. Visual responses of IPL mirror neurons during the observation of grasping-to-eat and grasping-to-place done by an experimenter. Conditions as in Figure 1. (From Fogassi et al., in preparation)

These two ways of recognizing emotions are experientially radically different. With the first, the observer understands the emotions expressed by others but does not feel them. He deduces them. A certain facial or body pattern means fear, another happiness, and that is it. No emotional involvement. The case is different for the sensory-motor mapping mechanism. In this case, the recognition occurs because the observed emotion triggers the feeling of the same emotion in the observing person. It is a first-person recognition. The emotion of the other penetrates the emotional life of the observer, evoking in him/her not only the observed emotion but also related emotional states and nuances of similar experiences.

As for cold action, our interest in this essay lies in the mechanisms underlying the direct sensory-motor mapping. For the sake of space, we will review data on one emotion only – disgust – for which rich empirical evidence has recently been acquired.

Disgust is a very basic emotion whose expression has an important survival value for conspecifics. In its most basic, primitive form (“core disgust”; Rozin et al. 2000), disgust indicates that something (e.g., food) that the individual tastes or smells is bad and, most likely, dangerous. Because of its strong communicative value, disgust is an ideal emotion for testing the direct mapping hypothesis.

Brain imaging studies showed that when an individual is exposed to disgusting odors or tastes, there is an intense activation of two structures: the amygdala and the insula (Augustine 1996; Royet et al. 2003; Small et al. 2003; Zald et al. 1998; Zald and Pardo 2000). The amygdala is a heterogeneous structure formed by several subnuclei. Functionally, these subnuclei form two major groups: the corticomedial group and the basolateral group. The former, phylogenetically more ancient, is related to the olfactory modality. It is likely that it is the signal increase in the corticomedial group that is responsible for the amygdala activation in response to disgusting stimuli.

Similarly to the amygdala, the insula is a heterogeneous structure. Anatomical connections revealed two main functional subdivisions: an anterior “visceral” sector and a multimodal posterior sector (Mesulam and Mufson 1982). The anterior sector receives a rich input from olfactory and gustatory centers. In addition, the anterior insula receives an important input from the inferotemporal lobe, where, in the monkey, neurons have been found to respond to the sight of faces (Gross et al. 1972; Tanaka 1996). Recent data demonstrated that the insula is the main cortical target of interoceptive afferents (Craig 2002). Thus, the insula is not only the primary cortical area for chemical exteroception (e.g., taste and olfaction) but also for the interoceptive state of the body (“body state representation”).

The insula is not an exclusively sensory area. In both monkeys and humans, electrical stimulation of the insula produces body movements (Kaada et al. 1949; Penfield and Faulk 1955; Frontera 1956; Showers and Lauer 1961; Krolak-Salmon et al. 2003). These movements, unlike those evoked by stimulation of classical motor areas, are typically accompanied by autonomic and viscero-motor responses.

Functional imaging studies in humans showed that, as in the monkey, the anterior insula receives, in addition to olfactory and gustatory stimuli, higher order visual information. Observation of disgusted facial expressions produces a signal increase in the anterior insula (Phillips et al. 1997, 1998; Sprengelmeyer et al. 1998; Schienle et al. 2002).

Recently, Wicker et al. (2003) carried out an fMRI study in which they tested whether the same insula sites that show a signal increase during the experience of disgust also show a signal increase during the observation of facial expressions of disgust.

The study consisted of olfactory and visual runs. In the olfactory runs, individuals inhaled disgusting and pleasant odorants. In the visual runs, the same participants viewed video clips of individuals smelling a glass containing disgusting, pleasant and neutral odorants and expressing their emotions.

Disgusting odorants produced, as expected, a very strong signal increase in the amygdala and in the insula, with a right prevalence. In the amygdala, activation was also observed with pleasant odorants, with a clear overlap between the activations obtained with disgusting and pleasant odorants. In the insula, pleasant odorants produced a relatively weak activation located in a posterior part of the right insula, whereas disgusting odorants activated the anterior sector bilaterally. The results of the visual runs showed signal increases in various cortical and subcortical centers but not in the amygdala. The insula (anterior part, left side) was activated only during the observation of disgust.

The most important result of the study was the demonstration that precisely the same sector within the anterior insula that was activated by the exposure to disgusting odorants was also activated by the observation of disgust in others (Fig. 3). These data strongly suggest that the insula contains neural populations that become active both when the participants experience disgust and when they see it in others.

Fig. 3. Illustration of the overlap (white) between the brain activation during the observation (blue) and the feeling (red) of disgust. The olfactory and visual analyses were performed separately as random-effect analyses. The results are superimposed on parasagittal slices of a standard MNI brain.

The notion that the insula mediates both recognition and experience of disgust is supported by clinical studies showing that, following lesions of the insula, patients have a severe deficit in understanding disgust expressed by others (Calder et al. 2000; Adolphs et al. 2003). This deficit is accompanied by a blunted and reduced sensation of disgust. In addition, electrophysiological studies showed that sites in the anterior insula, whose electrical stimulation produced unpleasant sensations in the patient's mouth and throat, are activated by the observation of a face expressing disgust.

Taken together, these data strongly suggest that humans understand disgust, and most likely other emotions (see Carr et al. 2003; Singer et al. 2004), through a direct mapping mechanism. The observation of emotionally laden actions activates those structures that give a first-person experience of the same actions. By means of this activation, a bridge is created between others and us.

The hypothesis that we perceive emotion in others by activating the same emotion in ourselves has been advanced by various authors (e.g., Phillips et al. 1997; Adolphs 2003; Damasio 2003a; Calder et al. 2000; Carr et al. 2003; Goldman and Sripada 2004; Gallese et al. 2004). Particularly influential in this respect has been the work by Damasio and his coworkers (Adolphs et al. 2000; Damasio 2003a, b). According to these authors, the neural basis of empathy is the activation of an “as-if-loop,” the core structure of which is the insula (Damasio 2003a). These authors also attributed a role in the “as-if-loop” to somatosensory areas such as SI and SII, conceiving the basis of empathy to be the activation, in the observer, of those cortical areas where the body is represented.

Although this hypothesis is certainly possible, the crucial role of the insula, rather than of the primary somatosensory cortices, in emotion feeling strongly suggests that the neural substrate for emotions is not merely sensory. It is more likely that the activation of the insula representation of viscero-motor activity is responsible for the first-person feeling of disgust. As in the premotor cortex, it is plausible that in the insula there is a specific type of mirror neurons that match the visual aspect of disgust with its viscero-motor aspects. The activation of these (hypothetical) viscero-motor mirror neurons would underlie the first-person knowledge of what it means to be disgusted. The activation of these insular neurons need not produce the overt viscero-motor response; the overt response should depend on the strength of the stimuli and on other factors. A neurophysiological study of the properties of insula neurons could provide a direct test of this hypothesis.

Coda

The data reviewed in this essay show that the intuition of Adam Smith – that individuals are endowed with an altruistic mechanism that makes them share the “fortunes” of others – is strongly supported by neurophysiological data. When we observe others, we enact their actions inside ourselves and we share their emotions.

Can we deduce from this that the mirror mechanism is the mechanism from which altruistic behavior evolved? This is obviously a very hard question to answer. Yet, it is very plausible that the mirror mechanism played a fundamental role in the evolution of altruism. The mirror mechanism transforms what others do and feel into the observer's own experience. The disappearance of unhappiness in others means the disappearance of unhappiness in us and, conversely, the observation of happiness in others provides a similar feeling in ourselves. Thus, acting to render others happy – an altruistic behavior – is transformed into an egoistic behavior – we are happy.

Adam Smith postulated that the presence of this sharing mechanism renders the happiness of others “necessary” for human beings, “though he derives nothing from it except the pleasure of seeing it.” This, however, appears to be a very optimistic view. In fact, an empathic relationship between others and ourselves does not necessarily bring positive consequences for the others. The presence of an unhappy person may compel another individual to eliminate the unpleasant feeling determined by that presence, acting in a way that is not necessarily the most pleasant for the unhappy person.

To use the mirror mechanism – a biological mechanism – strictly in a positive way, a further – cultural – addition is necessary. It can be summarized in the prescription: “Therefore all things whatsoever ye would that men should do to you, do ye even so to them: for this is the law and the prophets” (Matthew 7, 12). This “golden rule,” which is present in many cultures besides ours (see Changeux and Ricoeur 1998), uses the positive aspects of a basic biological mechanism inherent in all individuals to give ethical norms that eliminate the negative aspects that are also present in the same biological mechanism.

Acknowledgment

The study was supported by EU Contract QLG3-CT-2002-00746, Mirror, EU Contract IST-2000-29689, Artesimit, by Cofin 2004, and by FIRB n. RBNE01SZB4.

References

Adolphs R (2003) Cognitive neuroscience of human social behaviour. Nature Rev Neurosci 4: 165–178.

Adolphs R, Damasio H, Tranel D, Cooper G, Damasio AR (2000) A role for somatosensory cortices in the visual recognition of emotion as revealed by three-dimensional lesion mapping. J Neurosci 20: 2683–2690.

Adolphs R, Tranel D, Damasio AR (2003) Dissociable neural systems for recognizing emotions. Brain Cogn 52: 61–69.

Allison T, Puce A, McCarthy G. (2000) Social perception from visual cues: role of the STS region. Trends Cogn Sci 4: 267–278.

Augustine JR (1996) Circuitry and functional aspects of the insular lobe in primates including humans. Brain Res Rev 22: 229–244.

Blakemore SJ, Decety J (2001) From the perception of action to the understanding of intention. Nature Rev Neurosci 2: 561.

Bruce C, Desimone R, Gross CG (1981) Visual properties of neurons in a polysensory area in superior temporal sulcus of the macaque. J Neurophysiol 46: 369–384.


Buccino G, Binkofski F, Fink GR, Fadiga L, Fogassi L, Gallese V, Seitz RJ, Zilles K, Rizzolatti G, Freund HJ (2001) Action observation activates premotor and parietal areas in a somatotopic manner: an fMRI study. Eur J Neurosci 13: 400–404.

Buccino G, Vogt S, Ritzl A, Fink GR, Zilles K, Freund HJ, Rizzolatti G (2004) Neural circuits underlying imitation of hand actions: an event related fMRI study. Neuron 42: 323–34.

Calder AJ, Keane J, Manes F, Antoun N, Young AW (2000) Impaired recognition and experience of disgust following brain injury. Nature Neurosci 3: 1077–1078.

Carey DP, Perrett DI, Oram MW (1997) Recognizing, understanding and reproducing actions. In: Jeannerod M, Grafman J (eds) Handbook of neuropsychology. Vol. 11: Action and cognition. Elsevier, Amsterdam.

Carr L, Iacoboni M, Dubeau MC, Mazziotta JC, Lenzi GL (2003) Neural mechanisms of empathy in humans: a relay from neural systems for imitation to limbic areas. Proc Natl Acad Sci USA 100: 5497–5502.

Changeux JP, Ricoeur P (1998) La nature et la règle. Odile Jacob, Paris.

Cochin S, Barthelemy C, Roux S, Martineau J (1999) Observation and execution of movement: similarities demonstrated by quantified electroencephalography. Eur J Neurosci 11: 1839–1842.

Craig AD (2002) How do you feel? Interoception: the sense of the physiological condition of the body. Nature Rev Neurosci 3: 655–666.

Damasio A (2003a) Looking for Spinoza. Harcourt Inc.

Damasio A (2003b) Feeling of emotion and the self. Ann NY Acad Sci 1001: 253–261.

Di Pellegrino G, Fadiga L, Fogassi L, Gallese V, Rizzolatti G (1992) Understanding motor events: A neurophysiological study. Exp Brain Res 91: 176–80.

Fadiga L, Fogassi L, Pavesi G, Rizzolatti G (1995) Motor facilitation during action observation: a magnetic stimulation study. J Neurophysiol 73: 2608–2611.

Fogassi L, Gallese V, Fadiga L, Rizzolatti G (1998) Neurons responding to the sight of goal directed hand/arm actions in the parietal area PF (7b) of the macaque monkey. Soc Neurosci Abs 24: 257.5.

Frith U, Frith CD (2003) Development and neurophysiology of mentalizing. Philos Trans R Soc Lond B Biol Sci 358: 459.

Frontera JG (1956) Some results obtained by electrical stimulation of the cortex of the island of Reil in the brain of the monkey (Macaca mulatta). J Comp Neurol 105: 365–394.

Gallagher HL, Frith CD (2003) Functional imaging of ‘theory of mind’. Trends Cogn Sci 7: 77.

Gallese V, Fogassi L, Fadiga L, Rizzolatti G (2002) Action representation and the inferior parietal lobule. In: Prinz W, Hommel B (eds) Attention & Performance XIX. Common mechanisms in perception and action. Oxford University Press, Oxford.

Gallese V, Keysers C, Rizzolatti G (2004) A unifying view of the basis of social cognition. Trends Cogn Sci 8: 396–403.

Gangitano M, Mottaghy FM, Pascual-Leone A (2001) Phase specific modulation of cortical motor output during movement observation. NeuroReport 12: 1489–1492.

Gangitano M, Mottaghy FM, Pascual-Leone A (2004) Modulation of premotor mirror neuron activity during observation of unpredictable grasping movements. Eur J Neurosci 20: 2193–2202.

Goldman AI, Sripada CS (2004) Simulationist models of face-based emotion recognition. Cognition 94: 193–213.

Grafton ST, Arbib MA, Fadiga L, Rizzolatti G (1996) Localization of grasp representations in humans by PET: 2. Observation compared with imagination. Exp Brain Res 112: 103–111.

Grèzes J, Costes N, Decety J (1998) Top-down effect of strategy on the perception of human biological motion: a PET investigation. Cogn Neuropsychol 15: 553–582.

Grèzes J, Armony JL, Rowe J, Passingham RE (2003) Activations related to “mirror” and “canonical” neurones in the human brain: an fMRI study. Neuroimage 18: 928–937.

Gross CG, Rocha-Miranda CE, Bender DB (1972) Visual properties of neurons in the inferotemporal cortex of the macaque. J Neurophysiol 35: 96–111.


Hari R, Forss N, Avikainen S, Kirveskari S, Salenius S, Rizzolatti G (1998) Activation of human primary motor cortex during action observation: a neuromagnetic study. Proc. Natl Acad Sci USA 95: 15061–15065.

Iacoboni M, Woods RP, Brass M, Bekkering H, Mazziotta JC, Rizzolatti G (1999) Cortical mechanisms of human imitation. Science 286: 2526–2528.

Iacoboni M, Koski LM, Brass M, Bekkering H, Woods RP, Dubeau MC, Mazziotta JC, Rizzolatti G (2001) Reafferent copies of imitated actions in the right superior temporal cortex. Proc Natl Acad Sci USA 98: 13995–13999.

Jeannerod M (1988) The neural and behavioural organization of goal-directed movements. Clarendon Press, Oxford.

Johnson-Frey SH, Maloof FR, Newman-Norlund R, Farrer C, Inati S, Grafton ST (2003) Actions or hand-objects interactions? Human inferior frontal cortex and action observation. Neuron 39: 1053–1058.

Kaada BR, Pribram KH, Epstein JA (1949) Respiratory and vascular responses in monkeys from temporal pole, insula, orbital surface and cingulate gyrus: a preliminary report. J Neurophysiol 12: 347–356.

Kohler E, Keysers C, Umiltà MA, Fogassi L, Gallese V, Rizzolatti G (2002) Hearing sounds, understanding actions: action representation in mirror neurons. Science 297: 846–848.

Koski L, Wohlschlager A, Bekkering H, Woods RP, Dubeau MC (2002) Modulation of motor and premotor activity during imitation of target-directed actions. Cereb Cortex 12: 847–855.

Koski L, Iacoboni M, Dubeau MC, Woods RP, Mazziotta JC (2003) Modulation of cortical activ-ity during different imitative behaviors. J Neurophysiol 89: 460–471.

Krolak-Salmon P, Henaff MA, Isnard J, Tallon-Baudry C, Guenot M, Vighetto A, Bertrand O, Mauguiere F (2003) An attention modulated response to disgust in human ventral anterior insula. Ann Neurol 53: 446–453.

Manthey S, Schubotz RI, von Cramon DY (2003) Premotor cortex in observing erroneous action: an fMRI study. Brain Res Cogn Brain Res 15: 296–307.

Mesulam MM, Mufson EJ (1982) Insula of the old world monkey. III: Efferent cortical output and comments on function. J Comp Neurol 212: 38–52.

Nishitani N, Hari R (2000) Temporal dynamics of cortical representation for action. Proc Natl Acad Sci USA 97: 913–918.

Nishitani N, Hari R (2002) Viewing lip forms: cortical dynamics. Neuron 36: 1211–1220.

Penfield W, Faulk ME (1955) The insula: further observations on its function. Brain 78: 445–470.

Perrett DI, Harries MH, Bevan R, Thomas S, Benson PJ, Mistlin AJ, Chitty AJ, Hietanen JK, Ortega JE (1989) Frameworks of analysis for the neural representation of animate objects and actions. J Exp Biol 146: 87–113.

Phillips ML, Young AW, Senior C, Brammer M, Andrew C, Calder AJ, Bullmore ET, Perrett DI, Rowland D, Williams SC, Gray JA, David AS (1997) A specific neural substrate for perceiving facial expressions of disgust. Nature 389: 495–498.

Phillips ML, Young AW, Scott SK, Calder AJ, Andrew C, Giampietro V, Williams SC, Bullmore ET, Brammer M, Gray JA (1998) Neural responses to facial and vocal expressions of fear and disgust. Proc R Soc Lond B Biol Sci 265: 1809–1817.

Puce A, Perrett D (2003) Electrophysiological and brain imaging of biological motion. Philosoph Trans Royal Soc Lond, Series B, 358: 435–445.

Rizzolatti G, Craighero L (2004) The mirror-neuron system. Annu Rev Neurosci 27: 169–192.

Rizzolatti G, Scandolara C, Matelli M, Gentilucci M (1981) Afferent properties of periarcuate neurons in macaque monkeys. I. Somatosensory responses. Behav Brain Res 2: 125–146.

Rizzolatti G, Fadiga L, Matelli M, Bettinardi V, Paulesu E, Perani D, Fazio F (1996) Localization of grasp representation in humans by PET: 1. Observation versus execution. Exp Brain Res 111: 246–252.

Rizzolatti G, Fogassi L, Gallese V (2001) Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Rev Neurosci 2: 661–670.


Royet JP, Plailly J, Delon-Martin C, Kareken DA, Segebarth C (2003) fMRI of emotional responses to odors: influence of hedonic valence and judgment, handedness, and gender. Neuroimage 20: 713–728.

Rozin P, Haidt J, McCauley CR (2000) Disgust. In: Lewis M, Haviland-Jones JM (eds) Handbook of emotions, 2nd edn. Guilford Press, New York, pp 637–653.

Saxe R, Carey S, Kanwisher N (2004) Understanding other minds: linking developmental psychology and functional neuroimaging. Annu Rev Psychol 55: 87–124.

Schienle A, Stark R, Walter B, Blecker C, Ott U, Kirsch P, Sammer G, Vaitl D (2002) The insula is not specifically involved in disgust processing: an fMRI study. Neuroreport 13: 2023–2026.

Showers MJC, Lauer EW (1961) Somatovisceral motor patterns in the insula. J Comp Neurol 117: 107–115.

Singer T, Seymour B, O’Doherty J, Kaube H, Dolan RJ, Frith CD (2004) Empathy for pain involves the affective but not the sensory components of pain. Science 303: 1157–1162.

Small DM, Gregory MD, Mak YE, Gitelman D, Mesulam MM, Parrish T (2003) Dissociation of neural representation of intensity and affective valuation in human gustation. Neuron 39: 701–711.

Smith A (1759) The theory of moral sentiments (ed. 1976). Clarendon Press, Oxford.

Sprengelmeyer R, Rausch M, Eysel UT, Przuntek H (1998) Neural structures associated with recognition of facial expressions of basic emotions. Proc R Soc Lond B Biol Sci 265: 1927–1931.

Strafella AP, Paus T (2000) Modulation of cortical excitability during action observation: a transcranial magnetic stimulation study. NeuroReport 11: 2289–2292.

Tanaka K (1996) Inferotemporal cortex and object vision. Annu Rev Neurosci 19: 109–140.

Tomasello M, Call J (1997) Primate cognition. Oxford University Press, Oxford.

Tremblay C, Robert M, Pascual-Leone A, Lepore F, Nguyen DK, Carmant L, Bouthillier A, Theoret H (2004) Action observation and execution: intracranial recordings in a human subject. Neurology 63: 937–938.

Umilta MA, Kohler E, Gallese V, Fogassi L, Fadiga L, Keysers C, Rizzolatti G (2001) “I know what you are doing”: a neurophysiological study. Neuron 32: 91–101.

Visalberghi E, Fragaszy D (2002) Do monkeys ape? Ten years after. In: Dautenhahn K, Nehaniv C (eds) Imitation in animals and artifacts. MIT Press, Boston, pp 471–500.

Wicker B, Keysers C, Plailly J, Royet JP, Gallese V, Rizzolatti G (2003) Both of us disgusted in my insula: the common neural basis of seeing and feeling disgust. Neuron 40: 655–664.

Yokochi H, Tanaka M, Kumashiro M, Iriki A (2003) Inferior parietal somatosensory neurons coding face-hand coordination in Japanese macaques. Somatosens Mot Res 20: 115–125.

Zald DH, Pardo JV (2000) Functional neuroimaging of the olfactory system in humans. Int J Psychophysiol 36: 165–181.

Zald DH, Donndelinger MJ, Pardo JV (1998) Elucidating dynamic brain interactions with across-subjects correlational analyses of positron emission tomographic data: the functional connectivity of the amygdala and orbitofrontal cortex during olfactory tasks. J Cereb Blood Flow Metab 18: 896–905.


Running head: Object Processing

Object Processing during a Joint Gaze Following Task

Carolin Theuring1, Gustaf Gredebäck2, & Petra Hauf1,3

1Max-Planck Institute for Human Cognitive and Brain Sciences, Munich, Germany

2Department of Psychology, Uppsala University, Uppsala, Sweden

3Institute of Psychology, J. W. Goethe University, Frankfurt, Germany

Key words: gaze following, object processing, eye tracking, infants, social cognition

Carolin Theuring, Max Planck Institute for Human Cognitive and Brain Sciences, Department of Psychology/Babylab, Amalienstr. 33, D-80799 München, Germany. [email protected]


Abstract

To investigate object processing in a gaze following task, 12-month-old infants were presented with 4 movies in which a female model turned and looked at one of two objects. This was followed by 2 test trials showing the two objects alone, without the model. Eye movements were recorded with a Tobii 1750 eye tracker. Infants reliably followed gaze and displayed longer looking times towards the attended object when the model was present. During the first test trial infants displayed a brief novelty preference for the unattended object. These findings suggest that enhanced object processing is a temporarily restricted phenomenon that has little effect on 12-month-old infants' long-term interaction with the environment.


Object processing during a Joint Gaze Following Task

Gaze following is an essential element of joint visual attention and an important ingredient of the development of early social cognition (Tomasello, 1999; Moore & Corkum, 1994). It enables an individual to obtain useful information about his/her surroundings by observing others' behavior. This ability is not restricted to humans but occurs in numerous other species, e.g. great apes or domestic dogs (e.g. Bräuer, Call, & Tomasello, 2005; Hare, Call, & Tomasello, 1999). Typically, gaze following has been studied in "live" social situations, in which an adult initially establishes eye contact with the infant and thereafter turns to and fixates one of several objects. Infants' tendency to follow others' gaze is typically measured by relating the infant's first eye turn to the direction of the adult's attention (D'Entremont, Hains, & Muir, 1997; Hood, Willen, & Driver, 1998; Johnson, Slaughter, & Carey, 1998; Moore & Corkum, 1998; Scaife & Bruner, 1975) or by comparing looking durations towards attended and unattended locations (Reid & Striano, 2005; von Hofsten, Dahlström, & Fredriksson, 2005).

Some authors argue that infants can follow others' gaze direction from 6 months of age (e.g. Butterworth & Cochran, 1980; D'Entremont, Hains, & Muir, 1997; Morales, Mundy, & Rojas, 1998). Others have not found evidence of gaze following until infants are between 9 and 12 months of age (Corkum & Moore, 1998; Flom, Deák, Phill, & Pick, 2004; Moore & Corkum, 1998; Morissette, Ricard, & Gouin Décarie, 1995). The exact onset of this ability might still be debated. However, it appears safe to say that infants at 12 months of age follow others' gaze, responding to the behavioral manifestations of others' psychological orientations (Moore & Corkum, 1994; see also Johnson et al., 1998; Woodward, 2003).

Despite this, little is known about how infants at this age process the object of others' attention. It is tempting to assume that an infant, by following the gaze of another person to an object, naturally also understands the psychological relationship between the person and the attended object, as if saying: "this specific thing was interesting for this person"; de facto, however, little is known about the development of the actual function of gaze following beyond target localization. This study aims at increasing our understanding of how infants' object processing is influenced by the direction of others' gaze.

In a study by Reid, Striano, Kaufman, and Johnson (2004), four-month-old infants were presented with a frontal face and one object, either to the left or to the right side; these events were presented on a video screen. After one second the eyes rotated either towards (cued) or away from (non-cued) the object and fixated that location for one additional second. Following this period the face disappeared and the previously presented object appeared in its place. Throughout the experiment EEG was recorded. The results demonstrate lower peak amplitudes for slow-wave ERPs in response to previously cued as opposed to non-cued objects. This effect was interpreted as a novelty preference for non-cued objects, resulting from enhanced object processing of the attended (cued) compared to the unattended (non-cued) object.

In a different experiment, Reid and Striano (2005) similarly presented four-month-olds with an adult face, but this time with two objects present, one on either side of the face. The adult's eyes were initially directed forward, shifting to one side and fixating one of the objects after one second. After a brief intermission the two objects were presented without the face. During this later phase infants' looking durations to each of the two objects were recorded. Infants looked reliably longer at the non-cued compared to the cued object; this novelty preference indicates that infants were influenced by the model's gaze shift and that the non-cued object was more interesting than the object previously attended by the adult face. This result again demonstrates enhanced object processing as a function of others' gaze direction.

These studies clearly demonstrate that infants at 4 months of age are sensitive to others' gaze direction. They are, however, unable to inform us about what impact this enhanced object processing has on infants' future interaction with the attended object, for two reasons. First of all, both studies were performed on an age group that is not frequently reported to follow others' gaze direction. Reid, Striano, Kaufman, and Johnson (2004) clearly state that "no infants overtly followed the gaze to the object location", suggesting that the reported effects are not the result of shifts in overt attention but rather result from subtle shifts in covert attention without behavioral correlates. A lack of eye movements is obviously a prerequisite of high-quality ERP data (reducing the noise level by minimizing eye movements); however, this gives us no indication that the studied infants could follow gaze direction. In the other study, Reid and Striano (2005) did not report whether infants followed the model's gaze direction or not.

In addition, it is still unclear what is meant by "enhanced object processing". One interpretation is that infants simply follow the model's gaze direction without attributing any meaning to the model's behavior. In this case infants' preference for non-attended objects should appear as a temporarily restricted novelty preference with little bearing on infants' long-term interaction with the attended object. Another interpretation is that infants attribute some kind of meaning to the attended object; perhaps infants understand the relationship between the model's attention and the object and, as a consequence thereof, differentiate this object from the other object over time.

If one wishes to gain a better understanding of how others' gaze direction influences infants' object processing, it is important to focus on an age group that has a well documented ability to follow others' gaze direction and to use a paradigm with a high spatial-temporal resolution that allows us to evaluate whether infants follow the model's gaze and to evaluate the temporal constraints of infants' object processing. The current study attempts to answer these questions by measuring gaze following and object processing in 12-month-old infants using high resolution eye tracking technology.


Method

Participants.

Sixteen healthy, full-term 12-month-old infants (mean age 11 months and 28 days, SD = 2 days) participated in this study. Four infants were eliminated because of refusal to remain seated or because of technical problems. Participants were contacted by mail based on birth records. As compensation for their participation each family received either two movie tickets or a CD voucher with a total value of 15 €.

Stimuli and apparatus.

Gaze was measured using a Tobii 1750 near-infrared eye tracker with an infant add-on (Tobii Technology; www.Tobii.se). The system records the reflection of near-infrared light in the pupil and cornea of both eyes at 50 Hz (accuracy = 0.5°, spatial resolution = 0.25°) as participants watch an integrated 17-inch monitor. During calibration infants were presented with a looming and sounding blue and white sphere (extended diameter = 3.3°; Gredebäck & von Hofsten, 2004). The calibration stimulus could be briefly exchanged for an attention-grabbing movie (horizontal and vertical extension = 5.7°) if infants temporarily looked away from the calibration stimulus.

The stimulus set consisted of 30 movies. Twenty included a female model who shifted her gaze (head and eyes) to one of two toys placed on either side of the model (Figure 1A). Each movie included two toys randomly selected from a set of 5 unique toy pairs. The toy pairs were counterbalanced with regard to which toy the model attended and their respective locations, which equals 5 x 2 x 2 = 20 variations. Ten additional movies without the model were also included (Figure 1B). Again the five toy pairs were counterbalanced for location, which equals 5 x 2 = 10 variations.
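The counterbalancing scheme described above can be made concrete with a short sketch. The following Python snippet is illustrative only: the original stimuli were video recordings, and the toy labels used here are hypothetical placeholders; it simply enumerates the 5 x 2 x 2 = 20 model movies and the 5 x 2 = 10 test movies.

from itertools import product

# Hypothetical labels for the five toy pairs; the actual toys are not named in the text.
toy_pairs = [("toyA1", "toyA2"), ("toyB1", "toyB2"),
             ("toyC1", "toyC2"), ("toyD1", "toyD2"), ("toyE1", "toyE2")]

# Twenty model movies: each pair crossed with which toy is attended and the attended side.
model_movies = [
    {"pair": pair, "attended_toy": pair[attended], "attended_side": side}
    for pair, attended, side in product(toy_pairs, (0, 1), ("left", "right"))
]

# Ten test movies without the model: each pair crossed with the left/right arrangement.
test_movies = [
    {"pair": pair, "left_toy": pair[0] if flip == 0 else pair[1]}
    for pair, flip in product(toy_pairs, (0, 1))
]

assert len(model_movies) == 20 and len(test_movies) == 10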


Movies that included the model lasted on average for 13 seconds. Each movie started with a 5-second period in which the model looked down at a table in front of her, with the two toys visible on either side (see Figure 1-A1). After 5 seconds the model moved her face upwards and fixated straight ahead (see Figure 1-A2 for the endpoint of the upward motion); this movement lasted 1.5 seconds. During these 6.5 seconds the model had not yet attended to either toy. This part is referred to as the Baseline block. The following 1.5 seconds involved a sideways and downwards movement of the model's head and eyes towards one of the toys, accompanied by a smile. The movie ended with a 5-second interval in which the model kept fixating the toy. These last 6.5 seconds of the sequence are referred to as the Gaze Direction block (see Figure 1-A3). The purpose of the current study is to investigate the depth of infants' object processing during a joint gaze following task. The stimulus material was therefore designed to simulate the natural interaction between an infant and an adult. As such, the model shifted her gaze and directed positive regard to a near object. For the same purpose the action was presented over the extended period of 13 seconds altogether. The infants should thus be given time to encode and react to the full situation, as in any playful situation with an adult. Movies that did not include a model lasted for 10 seconds. In these movies the two objects were displayed on their own (Figure 1B).

--- insert Figure 1 about here ---

Procedure.

Infants were seated in a safety car seat placed on the parent's lap, with their eyes about 60 cm from the eye tracker. During calibration the looming stimulus described above was presented at nine different locations in a random order. If infants' attention was drawn to other parts of the visual field during this procedure, the experimenter exchanged the calibration stimulus for an alternative attention-grabbing movie at the current location. If any points were unsuccessfully calibrated, these points were recalibrated.

Infants were always presented with four trials that included the model. During these presentations the model always attended to the same toy, but its location changed between presentations according to an ABBA design. For example, the attended toy might first appear to the left, followed by two movies in which the toy appeared to the right, and a final movie in which the attended toy appeared to the left. These presentations were followed by two test trials in which the toys appeared on their own, again at switched locations. Each infant was presented with one of two presentation orders, ABBA-BA or BAAB-AB.

This presentation order was necessary to ensure that the infants would interpret the model's behavior as object directed and not as directed to a particular side of their visual field. In addition, reversing the location of the objects between the last trial with the model and the first test trial (and between the first and second test trial) ensured that infants paid attention to the objects and not just to the disappearance of the model.

After the experiment parents were given the opportunity to look at a replay of the stimuli that included the infant's gaze. In general each family spent 15 to 20 minutes in the lab.

Data reduction.

To be included in the analysis, infants had to fixate the model's face and then shift their gaze to one of the toys after the model started moving her head on at least three trials. These infants further fixated each object location at least once during the Gaze Direction block. Gaze data were transformed to fixation data with a spatial-temporal filter covering 200 ms and 50 pixels (approximately 1 visual degree) in ClearView (Tobii). All analyses were performed using a custom-made analysis program in Matlab (MathWorks).
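For readers unfamiliar with this kind of spatial-temporal filtering, the sketch below illustrates the general idea with a simple dispersion-based fixation detector using the thresholds quoted above (200 ms, 50 pixels). It is not the ClearView implementation, whose exact algorithm is not described in the text, and the sample format and function names are assumptions.

MIN_DURATION_MS = 200
MAX_DISPERSION_PX = 50

def dispersion(samples):
    # Dispersion = horizontal extent + vertical extent of the gaze samples.
    xs = [s[1] for s in samples]
    ys = [s[2] for s in samples]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def detect_fixations(samples):
    """samples: list of (timestamp_ms, x_px, y_px) tuples, e.g. 50 Hz gaze data."""
    fixations = []
    i = 0
    while i < len(samples):
        # Find the smallest window starting at i that spans MIN_DURATION_MS.
        j = i
        while j < len(samples) and samples[j][0] - samples[i][0] < MIN_DURATION_MS:
            j += 1
        if j >= len(samples):
            break
        if dispersion(samples[i:j + 1]) <= MAX_DISPERSION_PX:
            # Grow the window as long as the dispersion threshold is still respected.
            while j + 1 < len(samples) and dispersion(samples[i:j + 2]) <= MAX_DISPERSION_PX:
                j += 1
            window = samples[i:j + 1]
            xs = [s[1] for s in window]
            ys = [s[2] for s in window]
            fixations.append({
                "start_ms": window[0][0],
                "duration_ms": window[-1][0] - window[0][0],
                "x": sum(xs) / len(xs),
                "y": sum(ys) / len(ys),
            })
            i = j + 1
        else:
            i += 1
    return fixations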

Data from each trial that included the model were separated into a Baseline block, including the beginning of each trial until the model had raised her head, and a Gaze Direction block, including the entire head movement and the model's subsequent fixation on an object. Data from the Test block were analyzed as a whole. Each trial of these blocks was analyzed separately. The blocks were used in an area-of-interest (AOI) analysis including only fixations within any one of three equally sized areas, covering the face, the left toy and the right toy. These analyses calculated effects in two ways: according to the location of each fixation and with respect to the AOIs covering the attended or unattended toy (see Figure 2 for the approximate size and location of the AOIs). Note that the AOI analysis provides more detailed information than typically used preferential looking paradigms, in which the screen would have been divided into a left and a right section only, by separating the face from the toys and by excluding fixations on other parts of the screen. Observe that during test trials the AOI analysis was based on the AOIs covering the attended or unattended toy (see Figure 2B).
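As an illustration of the AOI step, the short sketch below assigns each fixation (in the dictionary format produced by the filter sketch above) to one of the three areas of interest. The rectangle coordinates are hypothetical placeholders; the actual boundaries are only shown approximately in Figure 2.

# Hypothetical AOI rectangles in screen pixels: (x_min, y_min, x_max, y_max).
# The paper uses three equally sized areas covering the face and the two toys.
AOIS = {
    "face":      (341, 0, 683, 384),
    "left_toy":  (0, 384, 342, 768),
    "right_toy": (682, 384, 1024, 768),
}

def aoi_of(fixation):
    """Return the name of the AOI containing the fixation, or None if it lies outside all AOIs."""
    for name, (x0, y0, x1, y1) in AOIS.items():
        if x0 <= fixation["x"] < x1 and y0 <= fixation["y"] < y1:
            return name
    return None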

As infants differ in overall fixation duration, proportional fixation durations were calculated (attended = fixation duration attended / (fixation duration attended + fixation duration unattended); unattended = fixation duration unattended / (fixation duration attended + fixation duration unattended)). Only these proportional fixation durations are reported in the results section below.
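A minimal sketch of this proportional measure (illustrative Python, not the authors' Matlab code; function and variable names are assumptions):

def proportional_fixation(duration_attended_ms, duration_unattended_ms):
    # Each toy's looking time divided by the summed looking time to both toys.
    total = duration_attended_ms + duration_unattended_ms
    if total == 0:
        return None, None  # no fixations on either toy in this trial
    return duration_attended_ms / total, duration_unattended_ms / total

# Example: 1200 ms on the attended toy and 400 ms on the unattended toy
# give proportions of 0.75 and 0.25.
p_attended, p_unattended = proportional_fixation(1200, 400)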

--- insert Figure 2 about here ---

Focus was also directed at first fixations and individual gaze shifts. Infants' first fixation of the model's face and of the two toys during the Baseline block and the Gaze Direction block was registered (after infants had fixated the model's face for at least 200 ms during this block), together with the duration of this fixation. During the Baseline and Gaze Direction blocks the fixated target was registered, as well as whether the attended target was the one being fixated or not. Additionally, infants' first gaze shift from the model's face to a toy (as defined by the AOIs) during the Gaze Direction block was registered. During the Test block infants' proportional fixation duration was analyzed with respect to the AOIs covering the two toys (see Figure 2B), as well as the first fixation of a toy. The identity of the first fixated toy was also included.

In addition, a difference score (DS; cf. Corkum & Moore, 1998; Moore & Corkum, 1998) was calculated for each infant in order to investigate reliable gaze following during the Gaze Direction block. This DS is determined by the number of trials in which the infant's first gaze shifted from the face to the attended toy, minus the number of trials in which infants went to the unattended toy first, and thus could vary from +4 (shifting to the attended toy in all trials) to –4 (shifting to the unattended toy in all trials).
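The difference score can be written down directly; the sketch below (illustrative Python, with hypothetical function and variable names) mirrors the definition above.

def difference_score(first_shift_targets):
    """first_shift_targets: list of 'attended' or 'unattended', one entry per model trial."""
    return (first_shift_targets.count("attended")
            - first_shift_targets.count("unattended"))

# Example: shifting to the attended toy on three of four trials gives DS = 3 - 1 = 2.
ds = difference_score(["attended", "attended", "unattended", "attended"])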

Results

Baseline block.

The mean fixation duration during the four baseline trials within all three AOIs equalled 4405 ms (SE = 202 ms). More time was spent looking at the face (M = 49.48%, SE = 2.64%) than at either toy (M = 20.52%, SE = 2.17% for the left toy and M = 30.00%, SE = 2.92% for the right toy), F(2, 45) = 27.613, p < .001, η² = 0.551. Separate t-tests differentiated the face from both the left toy, t(46) = 7.505, p < .001, and the right toy, t(46) = 3.794, p = .0004; the two toys additionally differed from each other, t(46) = 2.146, p = .037. Note that even though infants spent more time looking at the right AOI than at the left AOI, there were no significant differences between the AOI that would later be attended (45.8%, SE = 4.7%) and the unattended AOI (54.2%, SE = 4.7%), t(47) = 0.887, p = .380 (see Figure 3).

--- insert Figure 3 about here ---


Gaze Direction block.

During the Gaze Direction block, 4625 ms (SE = 259 ms) were spent fixating the three AOIs on average. More time was spent fixating the face (M = 59.11%, SE = 3.04%) than either the left (M = 19.7%, SE = 2.8%) or the right (M = 21.2%, SE = 2.5%) toy, F(2, 45) = 34.806, p < .001, η² = 0.607. Post-hoc tests differentiated the face from both the left toy, t(46) = 7.609, p < .001, and the right toy, t(46) = 7.628, p < .001; the two toys did not differ from each other (p = .570; see Figure 3). In contrast to the Baseline block, the attended toy was now fixated to a higher degree (M = 67.5%, SE = 4.5%) than the unattended toy (M = 32.5%, SE = 4.5%), t(43) = 3.797, p = .0005.

Test block.

In total, infants spent 6588 ms (SE = 487 ms) fixating the two AOIs during the Test block. Infants looked equally long at the right (M = 50.88%, SE = 5.47%) and the left (M = 49.12%, SE = 5.47%) toy, t(23) = 0.164, p = .871. The proportional fixation of the toys during the Test block was analyzed by a 2 x 2 analysis of variance (ANOVA) with the within-subject factors test trial (trial 1 and trial 2) and toy (attended or unattended). The analysis yielded a marginally significant test trial x toy interaction, F(1, 11) = 3.981, p = .071, η² = 0.266, but no significant main effects (ps > .189). Even though infants' overall fixation times did not differ between the attended toy (M = 45.8%, SE = 5.4%) and the unattended toy (M = 54.2%, SE = 5.4%) during the entire Test block, t(23) = 0.795, p = .435, post-hoc tests differentiated the fixation durations of the first test trial from those of the second test trial (see Figure 3). The previously unattended toy was fixated to a higher degree (M = 66.2%, SE = 5.4%) than the previously attended toy (M = 33.8%, SE = 5.4%) during the first of the two test trials, t(11) = 3.166, p = .009. During the second test trial, however, infants fixated both toys equally long (previously unattended toy: M = 42.2%, SE = 8.4%; previously attended toy: M = 57.8%, SE = 8.4%; t(11) = 0.971, p = .352). In addition, the fixation duration during the first test trial did not correlate with the fixation duration during the Gaze Direction block, r = 0.103, p = .750.

The analysis of the fixation durations demonstrated that infants spent different amounts of time looking at the two toys during both the Gaze Direction block and the Test block. Interestingly, infants showed no preference at all during the Baseline block. A more detailed analysis was therefore conducted covering the first fixation and the order of presentation.

Analysis of first gaze shift and first fixation.

The first gaze shift from the model's face to a toy during the Gaze Direction block was analyzed separately. Infants' first gaze shifts were directed to the attended toy in 38 of 46 trials, χ²(1) = 19.565, p < .001, with an average DS of 2.17 (SE = 0.48), indicating that infants were reliably following the model's gaze. This first gaze shift did not correlate with the fixation duration during the first test trial, r = 0.089, p = .783. The first fixation of a toy was analyzed by a 3 x 2 ANOVA with the within-subject factors block (baseline, gaze direction, and test) and toy (attended or unattended). The analysis yielded a significant block x toy interaction, F(2, 10) = 6.297, p = .017, η² = 0.557, but no significant main effects (ps > .312), indicating that infants' first fixations to the attended toy differed among the blocks; see Figure 4 for the percentage of first fixations at the toys. Accordingly, separate t-tests revealed that the first fixation was significantly directed to the attended toy during the Gaze Direction block (M = 82.7%, SE = 6.8%), t(11) = 5.011, p = .0004, whereas no differences in first fixations occurred between the attended toy (M = 44.4%, SE = 8.5%) and the unattended toy (M = 55.6%, SE = 8.5%) during the Baseline block [t(11) = 0.689, p = .505] or the Test block [attended toy: M = 45.8%, SE = 10.1%; unattended toy: M = 54.2%, SE = 10.1%; t(11) = 0.432, p = .674]. In addition, the percentage of first fixations directed to the left toy was not significantly different from the percentage directed to the right toy in any block (all ps > .522).

--- insert Figure 4 about here ---

Analysis of the presentation order.

Each infant was presented with four trials in which the model always attended to the same toy but its location changed between trials according to an ABBA design (cf. Procedure). The subsequent test trial without the model presented the toys again in a changed location (B). In order to investigate the influence of the presentation order on the fixation times of both Baseline trials and Gaze Direction trials, difference scores were calculated for the fixation duration in each trial separately. Positive scores indicate longer fixations of the attended toy, whereas negative scores indicate longer fixations of the unattended toy (see Figure 5). The differences of these scores were analyzed by a one-way ANOVA over the changed locations (GD1 to B2, GD2 to B3, GD3 to B4). This analysis revealed no significant main effect, F(2, 10) = 1.907, p = .199, ruling out an overall influence of the presentation order. Of special interest were the alternations between GD1 and B2 (A followed by B) and between GD3 and B4 (B followed by A). Planned contrasts yielded a significant difference between the scores for GD1 (M = 0.49, SE = 0.17) and B2 (M = -0.20, SE = 0.18), t(11) = 3.728, p = .003, indicating location cueing at the beginning (see Figure 5). Nevertheless, this difference disappeared over trials. Accordingly, the difference scores no longer varied at the last change of location, between GD3 (M = 0.23, SE = 0.16) and B4 (M = 0.08, SE = 0.21), t(11) = 0.678, p = .512.

--- insert Figure 5 about here ---


Discussion

In order to investigate object processing in a gaze following task, we presented 12-month-old infants with different movies on an eye tracker. The first four movies showed a female model turning and looking at one of two objects; this was followed by two test trials showing only the two objects, without the model. Infants did not differentiate between the two objects during the Baseline block. In the Gaze Direction block infants consistently followed the model's gaze to the attended object and fixated the attended object longer than the unattended object. No overall difference was found during the Test block; there was, however, a significant difference in the first test trial, in which infants fixated the unattended object for a longer duration than the attended object.

First of all, it is important to emphasize that these results complement previous findings that infants follow gaze by the end of the first year of life (Carpenter et al., 1998; Corkum & Moore, 1995; Johnson et al., 1998; Scaife & Bruner, 1975; von Hofsten et al., 2005), which is an important precondition for our question at hand. In fact, even though the model was presented on a monitor (as opposed to "live"), infants reliably followed the model's gaze to the attended object. This suggests that movies can substitute for "live" interactions in order to take advantage of the high resolution provided by current eye tracking technology.

In addition to demonstrating the basic finding of gaze following, the high spatial-temporal resolution of the eye tracking system allowed us to obtain additional details about the infants' gaze following behaviour. We were able to differentiate the trials temporally into Baseline, Gaze Direction, and Test blocks, as well as spatially into the three specific areas of interest: face, attended and unattended objects. This made it possible to show that infants' first fixation of the attended toy differed among the blocks; only during the Gaze Direction block was it significantly directed towards the attended toy. The analysis of the AOIs revealed that the main location of attention was the model's face, where infants spent altogether up to 60% of their looking duration. The first gaze shift from the model's face to a toy during the Gaze Direction block was directed towards the attended toy in 83% of the trials.

Despite the fact that infants readily followed the model's gaze to an object, we found no evidence that following another person's gaze direction had a long-term effect on infants' object processing at 12 months of age. The only detectable difference in behaviour towards the attended and the unattended object in the Test block was a brief novelty preference for the previously unattended object. This effect was only evident in the first test trial. We suggest that this result represents a temporarily restricted novelty effect caused by prolonged looking durations to the attended object during the Gaze Direction block.

These results harmonize well with the enhanced object processing documented by Reid and colleagues (2004, 2005). Both studies demonstrate a novelty preference for the previously unattended object. These similarities exist despite differences in design. In the current study infants were presented with far more ecologically relevant stimuli that simulate the natural interaction between an infant and an adult.

At the same time, the current design allows us to look at the depth of this enhanced object processing by presenting infants with multiple test trials. We demonstrate that this novelty effect is brief and only influences infants' behaviour for a single test trial. The current finding suggests that infants do not attribute any strong meaning to the relationship between the model's gaze and the attended object, at least not to such a degree that it has a profound impact on infants' long-term interaction with the attended object. This suggests that a careful interpretation of the phrase "enhanced object processing" should be applied in future studies of the perception of gaze direction. According to our findings, 12-month-old infants follow others' gaze but are unable to represent the relationship between the model's gaze and the attended object, or alternatively are unable to attribute any specificity to the regarded object.

The notion that infants are unable to transfer the perceived relationship between the model and the object across trials receives further support from infants' performance during the Baseline block. With the exception of the first baseline, infants have previous experience with the model and have perceived her attend to one of the two objects. As such, later parts of the Baseline block could be considered an additional test of infants' object processing. However, as mentioned, no difference in fixation duration was found in this part of each trial. In addition, there was no significant correlation between fixation durations in the Gaze Direction and Baseline blocks, or between the first gaze shift in the Gaze Direction block and fixation durations during the first test trial. The missing correlation between the looking behaviour during the Gaze Direction block and the looking behaviour during the Test block further supports the argument that gaze following does not automatically go hand in hand with object processing.

It is of course possible that infants, by the end of the first test trial, had fixated the previously unattended object enough to cancel out the novelty effect from the first test trial. Such an alternative interpretation of the current results would produce a similar null effect in the second test trial. The current design is, however, unable to differentiate between the two.

Interestingly, a study by Itakura (2001) showed that 12-month-old infants looked significantly longer at one of two objects on a screen if their mother had previously drawn their attention to it through pointing and vocalizing. In this experiment the infants were sitting on their mother's lap and followed her pointing gesture, not her gaze. The mother's pointing thus had a clear impact on the infant's interest in the attended object.

Consequently, future studies should aim at describing how infants come to attribute a more profound meaning to objects that have been attended to by others, and should investigate at what age infants attribute sufficient meaning to others' gaze direction to display a stable behavioural change that enhances the acquisition of knowledge about the environment. Related studies (e.g. Itakura, 2001) suggest that additional gestures such as pointing might enhance the processing of the object of interest.

Conclusion


The results indicate that infants at the age of 12 months reliably follow others' gaze direction. Nevertheless, there is no indication that retrieving specific information about the objects during a gaze-following task is a fundamental property of gaze following at this age. Following a model's gaze does not influence infants' looking behaviour beyond a temporally restricted response. The study further demonstrates that eye-tracking technology represents a promising paradigm for obtaining detailed information about infant gaze following. Further investigations are required to establish at what age gaze following results in object processing that is stable enough to guide long-term interactions with attended objects.

References

Bräuer, J., Call, J., & Tomasello, M. (2005). All great ape species follow gaze to distant locations and around barriers. Journal of Comparative Psychology, 119(2), 145-154.

Butterworth, G., & Cochran, E. (1980). Towards a mechanism of joint visual attention in

human infancy. International Journal of Behavioral Development, 3, 253-272.

Carpenter, M., Nagell, K., & Tomasello, M. (1998). Social cognition, joint attention, and

communicative competence from 9 to 15 months of age. Monographs of the Society

for Research in Child Development, 63 (4, No. 255), pp. 1-143.

Corkum, V., & Moore, C. (1995). Development of joint visual attention in infants. In: C.

Moore, & P. Dunham (Eds.), Joint attention. Its origins and role in development. (pp.

61-83). Hillsdale, NJ: Lawrence Erlbaum Associates.

Corkum, V., & Moore, C. (1998). The origins of joint visual attention in infants.

Developmental Psychology, 34(1), 28-38.


D’Entremont, B., Hains, S.M.J., & Muir, D.W. (1997). A demonstration of gaze following in

3- to 6-month-olds. Infant Behavior and Development, 20(4), 569-572.

Flom, R., Deák, G.O., Phill,C.G., & Pick, A.D. (2004). Nine-month-olds’ shared visual

attention as a function of gesture and object location. Infant Behavior and

Development, 27, 181-194.

Gredebäck, G., & von Hofsten, C. (2004). Infants’ evolving representations of object motion

during occlusion: A longitudinal study of 6- to 12-month-old infants. Infancy, 6(2),

165-184.

Hare, B., Call, J., & Tomasello, M. (1999). Domestic dogs (Canis familiaris) use human and conspecific social cues to locate hidden food. Journal of Comparative Psychology, 113, 1-5.

Hood, B.M., Willen, J.D., & Driver, J. (1998). Adult’s eyes trigger shifts of visual attention in

human infants. Psychological Science, 9(2), 131-134.

Itakura, S. (2001). Attention to repeated events in human infants (Homo sapiens): effects of

joint visual attention versus stimulus change. Animal Cognition, 4, 281-284.

Johnson, S., Slaughter, V., & Carey, S. (1998). Whose gaze will infants follow? The

elicitation of gaze following in 12-month-olds. Developmental Science, 1(2), 233-238.

Moore, C., & Corkum, V. (1994). Social understanding at the end of the first year of life.

Developmental Review,14, 349-372.

Moore, C., & Corkum, V. (1998). Infants gaze following based on eye direction. British

Journal of Developmental Psychology, 16, 495-503.

Morales, M., Mundy, P., & Rojas, J. (1998). Following the direction of gaze and language

development in 6-month-olds. Infant Behavior and Development, 21(2), 373-377.


Morissette, P., Ricard, M., & Gouin Décarie, T. (1995). Joint visual attention and pointing in

infancy: A longitudinal study of comprehension. British Journal of Developmental

Psychology, 13, 163-175.

Reid, V. M., & Striano, T. (2005). Adult gaze influences infant attention and object

processing: Implications for cognitive neuroscience. European Journal of

Neuroscience, 21, 1763-1766.

Reid, V.M., Striano, T., Kaufman, J., & Johnson, M.H. (2004). Eye gaze cueing facilitates

neural processing of objects in 4-month-old infants. Neuroreport, 15(16), 2553-2555.

Scaife, M., & Bruner, J.S. (1975). The capacity for joint visual attention in the infant. Nature,

253(24), 265-266.

Tomasello, M. (1999). Joint attention and cultural learning. In: M. Tomasello (Ed.), The

cultural origins of human cognition (pp. 56-93). Cambridge, MA: Harvard University

Press.

von Hofsten, C., Dahlström, E., & Fredriksson, Y. (2005). 12-month-old infants' perception

of attention direction in static video images. Infancy, 8(3), 217-231.

Woodward, A. (2003). Infants’ developing understanding of the link between looker and

object. Developmental Science, 6(3), 297-311.


Figure captions

Figure 1. Timeline and sample stimuli with the model present (A) and during the Test block

(B). The Baseline includes the first 5 seconds (A1) and the duration when the model looks up

(up to the frame depicted in A2). The Gaze Direction block starts when the model indicates

the direction of gaze (following the frame depicted in A2) and ends with the termination of

the movie (A3).

Figure 2. White squares represent AOIs covering the face and any of the two toys during

trials with the model present (A) and during test trials (B). AOI squares do not represent the

exact size used for analysis.

Figure 3. Average proportional fixation durations to the attended (A+) and unattended (A-)

toy, based on AOI, in Baseline (B) and Gaze Direction (GD) blocks and the two test trials.

Error bars indicate standard error and (**) depict significant differences.

Figure 4. Average percentage scores of the first saccade directed to the attended (A+) and

unattended (A-) target based on AOI, in Baseline (B), Gaze Direction (GD) and Test blocks.

Error bars indicate standard error and (**) depict significant differences.

Figure 5. Average difference scores of the fixation duration to the attended and unattended

toy, based on AOI, in Baseline trials and Gaze Direction trials in reference to the presentation

order. Positive scores indicate longer fixation to the attended toy, negative scores to the

unattended toy, respectively. Error bars indicate standard error and (**) depict significant

differences.

Fig 1. [Figure image not reproduced; timeline panels A1-A3 with durations 5 seconds, ~3 seconds, and 5 seconds, and test panel B with duration 10 seconds.]

Fig 2. [Figure image not reproduced; panels A and B.]

Fig 3. [Figure image not reproduced.]

Fig 4. [Figure image not reproduced.]

Fig 5. [Figure image not reproduced; difference scores for the attended vs. unattended toy by presentation order (A followed by B; B followed by A), blocks GD1, B2, GD3, B4.]

Developmental Science, in press

Action in development Claes von Hofsten1

Uppsala University

1 Department of Psychology Uppsala University Box 1225, S-75142 Uppsala, Sweden tel. (46)184712133 email: [email protected] www.babylab.se

Abstract

It is argued that cognitive development has to be understood in the functional

perspective provided by actions. Actions reflect all aspects of cognitive development

including the motives of the child, the problems to be solved, and the constraints and

possibilities of the child’s body and sensori-motor system. Actions are directed into the future

and their control is based on knowledge of what is going to happen next. Such knowledge is

available because events are governed by rules and regularities. The planning of actions also

requires knowledge of the affordances of objects and events. An important aspect of cognitive

development concerns how the child acquires such knowledge.

Cognition and action are mutually dependent. Together they form functional systems,

driven by motives, around which adaptive behaviour develops (von Hofsten, 1994, 2003,

2004). From this perspective, the starting point of development is not a set of reflexes

triggered by external stimuli, but a set of action systems activated by the infant. Thus,

dynamic systems are formed in which the development of the nervous system and the

development of action mutually influence each other through activity and experience. With

development, the different action systems become increasingly future oriented and integrated

with each other and ultimately each action will engage multiple coordinated action systems.

Adaptive behaviour has to deal with the fact that events precede the feedback signals

about them. Relying on feedback is therefore non-adaptive. The only way to overcome this

problem is to anticipate what is going to happen next and use that information to control one's behaviour. Furthermore, most events in the outside world do not wait for us to act. Interacting with them requires us to move to specific places at specific times while being prepared to do

specific things. This entails foreseeing the ongoing stream of events in the surrounding world

as well as the unfolding of our own actions. Such prospective control is possible because

events are governed by rules and regularities. The most general ones are the laws of nature.

Inertia and gravity for instance apply to all mechanical motions and determine how they

evolve. Other rules are more task specific, like those that enable a child to grasp an object or

use a spoon. Finally, there are socially determined rules that we have agreed upon to facilitate

social behaviour and to enable us to communicate and exchange information with each other.

Knowledge of rules makes smooth and skilful actions possible. It is accessible to us through

our sensory and cognitive systems.

The kind of knowledge needed to control actions depends on the problems to be solved

and the goals to be attained. Therefore, motives are at the core of cognitive processes and

actions are organized by goals and not by the trajectories they form. A reach, for instance, can

be executed in an infinite number of ways. We still define it as the same action, however, if

the goal remains the same. Not only do we categorize movements in terms of goals, but the

same organizing principle is also valid for the way the nervous system represents movements

(Bizzi & Mussa-Ivaldi, 1998; Poggio & Bizzi, 2004). When performing actions, subjects

fixate goals and sub-goals of the movements before they are implemented, demonstrating that the goal state is already represented when actions are planned (Johansson, Westling, Bäckström & Flanagan, 2001).

The developmental primitives

An organism cannot develop without some built-in abilities. If all the abilities needed

during the lifetime are built in, however, then it does not develop either. There is an optimal

level for how much phylogeny should provide and how much should be acquired during the

lifetime. Most of our early abilities have some kind of built-in base. This base shows up in the

morphology of the body, the design of the sensory-motor system, and in the basic abilities to

perceive and conceive of the world. One of the greatest challenges of development is to find

out what those core abilities are, what kind of knowledge about the self and the outside world

they rely on, and how they interact with experience in developing action and cognition.

Some of the core abilities are present at birth. Although newborn infants have a rather

restricted behavioural repertoire, converging evidence shows that behaviours performed by

neonates are prospective and flexible goal-directed actions rather than primitive reflexes. For

instance, the rooting response is not elicited if infants touch themselves on the cheek or if they are not hungry (Rochat & Hespos, 1997). When sucking, neonates monitor the flow

of milk prospectively (Craig & Lee, 1999). They can visually control their arm movements in

space and aim them towards objects when fixating them (von Hofsten, 1982; van der Meer,

1997). Van der Meer (op.cit.) showed that when a beam of light was positioned so that the

light passed in front of the neonates without reflecting on their body, they controlled the

position, velocity and deceleration of their arms so as to keep them visible in the light beam.

The function of all these built-in skills, in addition to enabling the newborn child to act, I

suggest, is to provide activity-dependent input to the sensorimotor and cognitive systems. By

closing the perception-action loop the infant can begin to explore the relationship between

commands and movements, between vision and proprioception, and discover the possibilities

and constraints of their actions. It is important to note that the core abilities rarely appear as

ready-made skills but rather as something that facilitates the development of skills.

Actions directed at the outside world require knowledge about space, objects, and

people. For instance, we need some kind of mental map of the surrounding environment that enables us to move about and know where we are. We need to be able to parse the visual

array into relatively independent units with inner unity and outer boundaries that can be

handled and interacted with, i.e. objects, and we need to be able to distinguish the critical

features of other people that makes it possible to recognize them and their expressions,

communicate with them, and perceive the goal-directedness of their actions. Infants are

endowed with such knowledge providing a foundation for cognitive development.

Nevertheless, Spelke (2000) suggests that core knowledge systems are limited in a number of

ways: They are domain specific (each system represents only a small subset of the things and

events that infants perceive), task specific (each system functions to solve a limited set of

problems), and encapsulated (each system operates with a fair degree of independence from

other cognitive systems).

Motives

The development of an autonomous organism is crucially dependent on motives. They define

the goals of actions and provide the energy for getting there. The two most important motives

that drive actions and thus development are social and explorative. They both function from

birth and provide the driving force for action throughout life. The social motive puts the

subject in a broader context of other humans that provide comfort, security, and satisfaction.

From these others, the subject can learn new skills, find out new things about the world, and

exchange information through communication. The social motive is so important that it has

even been suggested that without it a person will stop developing altogether. The social

motive is expressed from birth in the tendency to fixate social stimuli, imitate basic gestures,

and engage in social interaction.

There are at least two exploratory motives. The first one has to do with finding out about

the surrounding world. New and interesting objects (regularities) and events attract infants’

visual attention, but after a few exposures they are not attracted any more. This fact has given

rise to a much used paradigm for the investigation of infant perception, the habituation

method. The second exploratory motive has to do with finding out about one's own action

capabilities. For example, before infants master reaching, they spend hours and hours trying

to get the hand to an object in spite of the fact that they will fail, at least to begin with. For the

same reason, children abandon established patterns of behaviour in favour of new ones. For

instance, infants stubbornly try to walk at an age when they can locomote much more

efficiently by crawling. In these examples there is no external reward. It is as if the infants

knew that sometime in the future they would be much better off if they could master the new

activities. The direct motives are, of course, different. It seems that expanding one's action

capabilities is extremely rewarding in itself. When new possibilities open up as a result of, for

example, the establishment of new neuronal pathways, improved perception, or

biomechanical changes, children are eager to explore them. At the same time, they are eager

to explore what the objects and events in their surrounding afford in terms of their new modes

of action (Gibson & Pick, 2000). The pleasure of moving makes the child less focused on

what is to be achieved and more on its movement possibilities. It makes the child try many

different procedures and introduces necessary variability into the learning process.

Development of prospective control of action

Actions are organized around goals, directed into the future and rely on information

about what is going to happen next. Such prospective control develops simultaneously with

the emergence of all new forms of behaviour. Here I will discuss posture and locomotion,

looking, manual actions, and social behaviour.

Posture and locomotion.

Basic orientation is a prerequisite for any other functional activity (Gibson, 1966; Reed,

1996) and purposeful movements are not possible without it. This includes balancing the body

relative to gravity and maintaining a stable orientation relative to the environment. Gravity is

a potent force and when body equilibrium is disturbed, posture becomes quickly

uncontrollable. Therefore, any reaction to a balance threat has to be very fast and automatic.

Several reflexes have been identified that serve that purpose. Postural reflexes, however, are

insufficient to maintain continuous control of balance during action. They typically interrupt

action. Disturbances to balance are better handled in a prospective way, because if the

disturbance can be foreseen there is no need for an emergency reaction and ongoing actions

can continue. Infants develop such anticipations in parallel with their mastery of the different

modes of postural control (Barela, Jeka, and Clark, 1999; Witherington et al., 2002).

Looking.

Although each perceptual system has its own privileged procedures for exploration, the

visual system has the most specialized one. The whole purpose of movable eyes is to enable

the visual system to explore the world more efficiently and to stabilize gaze on objects of

interest. The development of oculomotor control is one of the earliest-appearing skills and marks

a profound improvement in the competence of the young infant. It is of crucial importance for

the extraction of visual information about the world, for directing attention, and for the

establishment of social communication. Gaze control involves both head and eye movements

and is guided by at least three types of information: visual, vestibular, and proprioceptive.

Two kinds of task need to be mastered, moving the eyes to significant visual targets and

stabilizing gaze on these targets. Shifting gaze is done with high speed saccadic eye

movements and stabilizing them on points of interests is done with smooth pursuit eye

adjustments. The latter task is, in fact, the more complicated one. To avoid slipping away

from the target, the system is required to anticipate forthcoming events. When the subject is

moving relative to the target, which is almost always the case, the smooth eye movements

need to anticipate the upcoming movement in order to compensate for it correctly. When the

fixated object moves, the eyes must anticipate its forthcoming motion.

Maintaining a stable gaze while moving is primarily controlled by the vestibular system.

It functions at an adult level within a few weeks after birth. From at least 4 weeks of age the compensatory eye movements anticipate the upcoming body movements (Rosander & von Hofsten, 2000; von Hofsten & Rosander, 1996). From about 6 weeks of age, the smooth part

of the tracking improves rapidly. Von Hofsten and Rosander (1997) recorded eye and head

movements in unrestrained 1- to 5-month-old infants as they tracked a happy face moving

sinusoidally back and forth in front of them. They found that the improvement in smooth

pursuit tracking was very rapid and consistent between individual subjects. Smooth pursuit

starts to develop at around 6 weeks of age and reaches adult performance at around 14 weeks.

When attending to a moving object in the surroundings, the view of it is frequently

interrupted by other objects in the visual field. In order to maintain attention across such

occlusions and continue to track the object when it reappears, the spatio-temporal continuity

of the observed motion must somehow be represented over the occlusion interval. From about

4 months of age, infants show such ability (Rosander & von Hofsten, 2004; von Hofsten,

Kochukhova, & Rosander, 2006). The tracking is typically interrupted by the disappearance

of the object and, just before the object reappears, gaze moves to the reappearance position.

This behaviour is demonstrated over a large range of occlusion intervals suggesting that the

infants track the object behind the occluder in their "mind's eye". This representation might not, however, have anything to do with the notion of a permanent object that exists over time or with infants' conscious experience of where the occluded object is at a specific time behind the occluder. It could rather be expressed as a preparedness for when and where the object will reappear. In support of the hypothesis that infants represent the velocity of the occluded

object are the findings that object velocity is represented in the frontal eye field (FEF) of

rhesus monkeys during the occlusion of a moving object (Barborica & Ferrera, 2003).

Reaching and manipulation.

In the act of reaching for an object there are several problems that need to be dealt with

in advance, if the encounter with the object is going to be smooth and efficient. The reaching

hand needs to adjust to the orientation, form, and size of the object. The securing of the target

must be timed in such a way that the hand starts to close around the object in anticipation of

and not as a reaction to the encounter. Such timing has to be planned and can only occur

under visual control. Infants do this from the age at which they begin to successfully reach for objects, at around 4-5 months of age (von Hofsten and Rönnqvist, 1988).

From the age when infants start to reach for objects they have been found to adjust the

orientation of the hand to the orientation of an elongated object reached for (Lockman,

Ashmead, & Bushnell, 1984; von Hofsten, & Fazel-Zandy, 1984). When reaching for a

rotating rod, infants prepare the grasping of it by aligning the orientation of the hand to a

future orientation of the rod (von Hofsten and Johansson, 2006). Von Hofsten and Rönnqvist

(1988) found that 9- and 13-month-old infants, but not 5-month-olds, adjusted the opening of

the hand to the size of the object reached for. They also monitored the timing of the grasps.

For each reach it was determined when the distance between thumb and index finger started to

diminish and when the object was encountered. It was found that all the infants studied began

to close the hand around the object before it was encountered. For the 5- and 9-month-old

infants the hand first moved to the vicinity of the target and then started to close around it. For

the 13-month-olds, however, the grasping action typically started during the approach, well

before touch. In other words, at this age grasping started to become integrated with the reach

to become one continuous reach-and-grasp act.

A remarkable ability of infants to time their manual actions relative to an external

event is demonstrated in early catching behavior (von Hofsten, 1980, 1983; von Hofsten,

Vishton, Spelke, Feng & Rosander, 1998). Von Hofsten (1980) found that infants

reached successfully for moving objects at the very age they began mastering reaching for

stationary ones. Eighteen-week-old infants were found to catch an object that moved at 30

cm/sec. The reaches were aimed towards the meeting point with the object and not towards

the position where the object was seen at the beginning of the reach. In experiments by von

Hofsten et al. (1998) 6-month-old infants were presented with an object that approached

them on one of four trajectories. On two of them the trajectories were linear: The object

began from a position above and to the left or right of the infant, it moved downward on a

diagonal path through the center of the display, and it entered the infant’s reaching space on

the side opposite to its starting point. The other two trajectories were nonlinear: The object

moved to the center as on the linear trials and then turned abruptly and moved into the

infant’s reaching space on the same side as its starting point. Because the object was out of

reach until after it had crossed the central point where the paths intersected, infants could

only reach for the object by aiming appropriately to the left or right side of their reaching

space. Because the object moved rapidly, however, infants could only hope to attain it if

they began their reach before the object arrived at the central intersection point. In this

situation, 6-month-old infants reached predictively by extrapolating a linear path of object

motion. On linear trials that began on the left, for example, infants began to reach for the

object when it was still on the left side of the display, aiming their reach to a position on the

right side of reaching space and timing the reach so that their hand intersected the object as

it arrived on that side of reaching space. On nonlinear trials that began on the left, infants

showed the same pattern of rightward reaching and aiming until about 200 ms after the

object had turned leftward, and therefore they typically missed the object: a pattern that

provides further evidence that the infants assumed that the object would continue to move

on a linear path.
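As an illustration of the extrapolation principle described above, the following minimal sketch (in Python) predicts where and when a linearly moving object would enter reaching space from two earlier sightings. It is only a schematic rendering of the idea of linear extrapolation, not a model used in the cited experiments; the positions, times, and the reach boundary are hypothetical values.

# Minimal sketch of linear extrapolation of object motion (illustrative only;
# the numbers and the reach boundary below are hypothetical, not from the experiments).
def extrapolate_crossing(p1, t1, p2, t2, boundary_x):
    # p1, p2: (x, y) positions observed at times t1 and t2 (seconds).
    vx = (p2[0] - p1[0]) / (t2 - t1)
    vy = (p2[1] - p1[1]) / (t2 - t1)
    if vx == 0 or (boundary_x - p2[0]) / vx < 0:
        return None                       # not moving towards the boundary
    dt = (boundary_x - p2[0]) / vx        # time remaining until the crossing
    return t2 + dt, p2[1] + vy * dt       # predicted time and height of the crossing

# Example: object seen at (-40, 30) cm and (-20, 20) cm half a second apart,
# reaching space assumed to start at x = 0 cm.
print(extrapolate_crossing((-40.0, 30.0), 0.0, (-20.0, 20.0), 0.5, 0.0))
# -> (1.0, 10.0): predicted to enter reaching space at t = 1.0 s at height y = 10 cm.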

Manipulation.

When manipulating objects, the subject needs to imagine the goal state of the manipulation and the procedure for getting there. Örnkloo and von Hofsten (2006) studied how infants

develop their ability to insert blocks into apertures. The task was to insert elongated objects

with various cross-sections (circular, square, rectangular, elliptic, and triangular) into

apertures in which they fitted snugly. All objects had the same length and the difficulty was

manipulated by using different cross sections. The objects were both presented standing up

and lying down. It was found that although infants 18 months and younger understood the

task of inserting the blocks into the apertures and tried very hard, they had no idea how to do

it. They did not even raise the elongated blocks upright, but just put them on the aperture and tried to press them in. The 22-month-old children, however, systematically raised the horizontally

placed objects when transporting them to the aperture and the 26-month-olds turned the

objects before arriving at the aperture, in such a way that they approximately fit the aperture.

This achievement is the end point of several important developments that include motor

competence, perception of the spatial relationship between the object and the aperture, mental

rotation, anticipation of goal states, and an understanding of means-end relationships. The

results indicate that a pure feedback strategy does not work for this task. The infants need to

have an idea of how to reorient the objects to make them fit. Such an idea can only arise if the

infants can mentally rotate the manipulated object into the fitting position. The ability to

imagine objects at different positions and in different orientations greatly improves the child’s

action capabilities. It enables them to plan actions on objects more efficiently, to relate objects

to each other, and plan actions involving more than one object.

Development of social abilities

There is an important difference between social actions and those used for negotiating

the physical world. The fact that one’s own actions affect the behaviour of the person towards

whom they are directed creates a much more dynamic situation than when actions are directed

towards objects. Intentions and emotions are readily displayed by elaborate and specific

movements, gestures, and sounds which become important to perceive and control. Some of

these abilities are already present in newborn infants and reflect their preparedness for social

interaction. Neonates are very attracted to people, especially to the sounds, movements, and

features of the human face (Johnson & Morton, 1991; Maurer, 1985). They also engage in

social interaction and turn-taking that, among other things, is expressed in their imitation of

facial gestures. Finally, they perceive and communicate emotions such as pain, hunger and

disgust through their innate display systems (Wolff, 1987). These innate dispositions give

social interaction a flying start and open up a window for the learning of the more intricate

regularities of human social behaviour.

Important social information is provided by vision. Primarily, it has to do with

perceiving the facial gestures of other people. Such gestures convey information about

emotions, intentions, and direction of attention. Perceiving what another person is looking at

is an important social skill. It facilitates referential communication. One can comment on

objects and immediately be understood by other people, convey information about them, and

communicate emotional attitudes towards them (see e.g. Corkum & Moore, 1998;

D'Entremont, Hains, & Muir, 1997; Striano & Tomasello, 2001). Most researchers agree that

infants reliably follow gaze from 10-12 months of age (see e.g. Corkum and Moore, 1998;

Deák, Flom, & Pick, 2000; von Hofsten, Dahlström, & Fredriksson, 2005). Gaze following,

however, seems to be present much earlier. Farroni, Mansfield, Lai & Johnson (2003)

showed that 4-month-olds moved gaze in the same direction as a model face shown in front of

them.

The understanding of other people’s actions seems to be accomplished by the

same neural system by which we understand our own actions. A specific set of neurons,

‘mirror neurons’, are activated when perceiving as well as when performing an action

(Rizzolatti & Craighero, 2004). They were first demonstrated in rhesus monkeys. These

neurons are specific to the goals of actions. They fire equally when a monkey observes a hand

reaching for a visible object or for an object hidden behind an occluder (Umiltà et al., 2001).

However, the same neurons did not fire when the hand mimed the reaching action, that is, the

movement was the same but there was no object to be reached for. When humans perform an

action, like moving an object from one position to another, they consistently shift gaze to the

goals of these actions before the hand arrives there with the object (Johansson et al., 2001; Land, Mennie & Rusted, 1999). When observing someone else perform the actions, adults

(Flanagan & Johansson, 2003) move gaze predictively to the goal in the same manner as

when they perform the actions themselves. Such predictive gaze behaviour is not present

when moving objects are observed outside the context of an agent (Flanagan & Johansson,

2003).

The ability to predict other people’s action goals by looking there ahead of time seems

to develop during the first year of life. Falck-Ytter, Gredebäck & von Hofsten, (2006) found

that 12-month-old but not 6-month-old infants predict other people’s manual actions in this

way. Thus, infants, like adults, seem to understand others' actions by mapping them onto their

motor representations of the same actions. The suggestion that cognition is anchored in action

at an inter-individual level early in life has important implications for the development of

such complex and diverse social competences as imitation, language and empathy.

Conclusions

Cognitive development cannot be understood in isolation. It has to be related to the

motives of the child, the action problems to be solved, and the constraints and possibilities of

the child’s body and sensori-motor system. Actions cannot be constructed ad-hoc. They must

be planned. From a functional perspective, cognitive development has to do with expanding

the prospective control of actions. This is done by extracting more appropriate information

about what is going to happen next, by getting to understand the rules that govern events, and

by becoming able to represent events that are not directly available to our senses. Cognitive

development, however, has to be understood in a still wider context. We need to relate our

own actions to the actions of other people. Recent research shows that we spontaneously

perceive the movements of other people as actions, that specific areas in the brain encode our

own and other people’s actions alike, and that this forms a basis for understanding how the

actions of others are carried out as well as the goals and motives that drive them.

References

Barela, J.A., Jeka, J.J. and Clark, J.E. (1999). The use of somatosensory information during

the acquisition of independent upright stance. Infant Behavior and Development, 22, 87-

102.

Barborica, A. and Ferrera, V.P. (2003) Estimating invisible target speed from neuronal

activity in monkey frontal eye field. Nature Neuroscience, 6, 66–74.

Bizzi, E. and Mussa-Ivaldi, F.A. (1998) Neural basis of motor control and its cognitive

implications. Trends in Cognitive Science, 2, 97-102.

Corkum, V., & Moore, C. (1998). The origins of joint visual attention in infants.

Developmental Psychology, 34, 28–38.

Craig, C.M. and Lee, D.N. (1999) Neonatal control of sucking pressure: evidence for an

intrinsic tau-guide. Experimental Brain Research, 124, 371–382.

Deák, G. O., Flom, R. A.,&Pick, A. D. (2000). Effects of gesture and target on 12- and 18-

month-olds’ joint visual attention to objects in front of or behind them. Developmental

Psychology, 36, 511–523.

D’Entremont, B., Hains, S. M. J., & Muir, D. W. (1997). A demonstration of gaze following

in 3- to 6-month-olds. Infant Behavior and Development, 20, 569–572.

Falck-Ytter T., Gredebäck G. & von Hofsten C.(2006) Infants predict other people’s action

goals. Nature Neuroscience, 9, 878 – 879.

Farroni, T., Mansfield, E.M., Lai, C. and Johnson, M.H. (2003). Infants perceiving and acting on the eyes: Tests of an evolutionary hypothesis. Journal of Experimental Child Psychology, 85, 199-212.

Flanagan, J.R. and Johansson, R.S. (2003). Action plans used in action observation. Nature, 424, 769-771.

Gibson, J.J. (1966) The senses considered as perceptual systems. New York: Houghton

Mifflin.

Gibson, E.J. and Pick, A. (2000) An Ecological Approach to Perceptual Learning and

Development, Oxford University Press

Jacobsen, K., Magnussen, S. and Smith, L. (1997). Hidden visual capabilities in mentally

retarded subjects diagnosed as deaf-blind. Vision Research, 37, 2931-2935.

Johansson, RS, Westling G, Bäckström A, Flanagan J R. (2001) Eye-hand coordination in

object manipulation. Journal of Neuroscience, 21, 6917-6932.

Johnson, M.H. and Morton, J. (1991). Biology and Cognitive Development: The case of face

recognition. Oxford: Blackwell.

Land M.F., Mennie N., Rusted J. (1999). The roles of vision and eye movements in the

control of activities of daily living. Perception 28, 1311 - 1328.

Lockman, J.J., Ashmead, D.H. & Bushnell, E.W. (1984) The development of anticipatory

hand orientation during infancy. Journal of Experimental Child Psychology, 37, 176 –

186.

Maurer, D. (1985). Infants’ perception of faceness. In T.N. Field and N. Fox (Eds.) Social

Perception in Infants. (pp. 37-66). Hillsdale, N.J.: Lawrence Erlbaum Associates.

Örnkloo, H. & von Hofsten, C. (2006) Fitting objects into holes: on the development of

spatial cognition skills. Developmental Psychology, in press.

Poggio, T. Bizzi, E. (2004) Generalization in vision and motor control. Nature, 431, 768-774.

Reed, E. S. (1996) Encountering the world: Towards an Ecological Psychology. New York:

Oxford University Press.

Rizzolatti, G. and Craighero, L. (2004) The mirror-neuron system. Annual Review of

Neuroscience, 27, 169–92

Rochat, P. and Hespos, S.J. (1997) Differential rooting responses by neonates: evidence for an

early sense of self. Early Development and Parenting 6, 105–112.

Rosander, K. and von Hofsten, C. (2002) Development of gaze tracking of small and large

objects. Experimental Brain Research, 146, 257-264.

Rosander, K. and von Hofsten, C. (2004) Infants' emerging ability to represent object motion.

Cognition, 91, 1-22.

Spelke, E.S. (2000) Core knowledge. American Psychologist, 55, 1233-1243.

Striano, T., & Tomasello, M. (2001). Infant development: Physical and social cognition. In

International encyclopedia of the social and behavioral sciences (pp. 7410–7414).

Umiltà, M.A., Kohler, E., Gallese, V., Fogassi, L., Fadiga, L., et al. (2001). "I know what you are doing": a neurophysiological study. Neuron, 32, 91-101.

van der Meer, A.L.H. (1997) Keeping the arm in the limelight: advanced visual control of arm

movements in neonates. Eur. J. Paediatr. Neurol. 4, 103–108

von Hofsten, C. (1980) Predictive reaching for moving objects by human infants. Journal of

Experimental Child Psychology, 30, 369-382.

von Hofsten, C. (1982). Eye-hand coordination in newborns. Developmental Psychology , 18,

450-461.

von Hofsten, C. (1983). Catching skills in infancy. Journal of Experimental Psychology:

Human Perception and Performance, 9, 75-85.

von Hofsten, C. (1993) Prospective control: A basic aspect of action development. Human

Development, 36, 253-270.

von Hofsten, C., Dahlström, E., & Fredriksson, Y. (2005) 12-month-old infants' perception

of attention direction in static video images. Infancy, 8, 217-231.

von Hofsten, C. and Fazel-Zandy, S.(1984). Development of visually guided hand orientation

in reaching. Journal of Experimental Child Psychology, 38, 208-219.

von Hofsten, C. & Johansson, K. (2006) Infants’ emerging ability to capture a rotating rod.

Unpublished manuscript.

von Hofsten, C., Kochukhova, O., & Rosander, K. (2006) Predictive occluder tracking in 4-

month-old infants. Submitted manuscript.

von Hofsten, C. and Rosander, K. (1996) The development of gaze control and predictive

tracking in young infants. Vision Research, 36, 81-96.

von Hofsten, C. and Rosander, K. (1997) Development of smooth pursuit tracking in young

infants. Vision Research, 37, 1799-1810.

von Hofsten, C. and Rönnqvist, L. (1988) Preparation for grasping an object: A

developmental study. Journal of Experimental Psychology: Human Perception and

Performance, 14, 610-621.

von Hofsten, C., Vishton, P., Spelke, E.S., Feng, Q., and Rosander, K. (1998) Predictive

action in infancy: Tracking and reaching for moving objects. Cognition, 67, 255-285.

von Hofsten, C. (2003) On the development of perception and action. In J. Valsiner and K. J.

Connolly (Eds.) Handbook of Developmental Psychology. London: Sage. pp: 114-140.

von Hofsten, C. (2004) An action perspective on motor development. Trends in Cognitive

Science, 8, 266-272.

Witherington, D. C., von Hofsten, C., Rosander, K., Robinette, A., Woollacott, M. H., and

Bertenthal, B. I. (2002) The Development of Anticipatory Postural Adjustments in

Infancy. Infancy, 3, 495-517.

Wolff, P.H. (1987). The development of behavioral states and the expression of emotions in

early infancy. Chicago: Chicago University Press.


Developmental Science, in press

Predictive tracking over occlusions by 4-month-old infants

Claes von Hofsten, Olga Kochukhova, and Kerstin Rosander* Department of Psychology, Uppsala University

* Address for correspondence: Department of Psychology, Uppsala University, Box1225, S-75142 Uppsala, Sweden. Email: [email protected], [email protected], [email protected]


Abstract
Two experiments investigated how 16-20-week-old infants visually track an object that

oscillated on a horizontal trajectory with a centrally placed occluder. To determine the

principles underlying infants’ tendency to shift gaze to the exiting side before the object

arrives, occluder width, oscillation frequency, and motion amplitude were manipulated

resulting in occlusion durations between 0.20 and 1.66 s. Through these manipulations, we

were able to distinguish between several possible modes of behavior underlying ‘predictive’

actions at occluders. Four such modes were tested. First, if passage-of-time determines when

saccades are made, the tendency to shift gaze over the occluder is expected to be a function of

time since disappearance. Secondly, if visual salience of the exiting occluder edge determines

when saccades are made, occluder width would determine the pre-reappearance gaze shifts

but not oscillation frequency, amplitude, or velocity. Thirdly, if memory of the duration of the

previous occlusion determines when the subjects shift gaze over the occluder, it is expected

that the gaze will shift after the same latency at the next occlusion irrespective of whether

occlusion duration is changed or not. Finally, if infants base their pre-reappearance gaze shifts on their ability to represent object motion (cognitive mode), it is expected that the latency of the gaze shifts over the occluder is scaled to occlusion duration. Eye and head movements

as well as object motion were measured at 240 Hz. In 49% of the passages, the infants shifted

gaze to the opposite side of the occluder before the object arrived there. The tendency to make

such gaze shifts could not be explained by the passage of time since disappearance. Neither

could it be fully explained in terms of visual information present during occlusion, i.e.

occluder width. On the contrary, it was found that the latency of the pre-reappearance gaze

shifts was determined by the time of object reappearance and that it was a function of all three

factors manipulated. The results suggest that object velocity is represented during occlusion

and that infants track the object behind the occluder in their "mind's eye".


Introduction

The observation of objects that move in a cluttered environment is frequently

interrupted as they pass behind other objects. To keep track of them requires that their motion

is somehow represented over such periods of non-visibility. The present article asks whether

4-month-old infants have this ability, and if so, how it is accomplished.

Piaget (1954) was one of the first to reflect on the problem of early object representation

and he hypothesized that it would start to evolve from about 4 months of age. Later research

supports and expands Piaget's assumptions. One frequently used paradigm has studied looking times at displays in which spatio-temporal continuity is violated, on the assumption that infants will look longer at such displays. These experiments strongly suggest

that 4-month-olds and maybe even younger infants maintain some kind of representation of

the motion of an occluded object. For instance, Baillargeon and associates (Baillargeon &

Graber, 1987; Baillargeon & deVos, 1991, Aguiar & Baillargeon, 1999) habituated infants to

a tall and a short rabbit moving behind a solid screen. This screen was then replaced by one

with a gap in the top. The tall rabbit should have appeared in the gap but did not. Infants from

2.5 months of age looked longer at the tall rabbit event, suggesting that they had detected a discrepancy between the expected and the actual event in that display. These studies indicate that infants are somehow aware of the motion of a temporarily occluded moving object, but not how this awareness is accomplished.

Visual tracking studies are also informative about infants’ developing ability to

represent object motion. The eyes are not subject to drift under normal circumstances, not even

in young infants, and smooth gradual eye movements only appear when a moving object is

pursued (von Hofsten & Rosander, 1997; Rosander & von Hofsten, 2002). Tracking is

interrupted when the motion is stopped. Muller & Aslin (1978) found that 2-, 4-, and 6-month-old infants stopped tracking when an object moving on a horizontal path stopped just in front of an occluder. When the pursued object is occluded, the smooth eye movements

get effectively interrupted (Rosander & von Hofsten, 2004). This is also valid for visual tracking in adults (Lisberger, Morris, & Tychsen, 1987; Kowler, 1990). To continue tracking when the object reappears, subjects shift gaze saccadically across the occluder. As smooth

pursuit and saccades are radically different kinds of eye movements requiring different motor

programs, the saccades cannot be accounted for by a continuation of the smooth pursuit (see

e.g. Leigh & Zee, 1999). If they are elicited before the object reappears, they have

to be guided by some kind of anticipation of the object’s continuing motion. In the past, only

a few studies have measured infants’ eye movements to determine at what age such

anticipations appear. Van der Meer, van der Weel and Lee (1994) found that 5-month-old

infants observing a temporarily occluded, linearly moving object looked towards the

reappearance side in an anticipatory way. In their study the object was occluded for 0.3-0.6 s.

Two recent studies have addressed the question of when such ability appears and

the role of learning in its establishment (Johnson, Amso & Slemmer, 2003; Rosander &

von Hofsten, 2004). Rosander & von Hofsten (2004) examined the tracking over occlusion

in 7-, 9-, 12-, 17- and 21-week-old infants. The infants viewed an object oscillating on a

horizontal track that either had an occluder placed at its center where the object

disappeared for 0.3 s, or one at one of the trajectory end points where it became occluded

for 0.6 s. In the former case, the object reappeared on the opposite side of the occluder and

in the latter case on the same side. The level of performance in the central occluder

condition improved rapidly between the ages studied and several of the 17-week-old

infants moved gaze to the reappearing side of the occluder from the first passage in a trial.

When the object was occluded at the endpoint of the trajectory, there was an increasing

tendency with age to shift gaze over the occluder. These results indicate that 4-5-month-old


infants have anticipations of where and when an occluded moving object will reappear

when moving on a linear trajectory. All the infants in Rosander & von Hofsten (2004)

experienced two trials with un-occluded motion before the occlusion trials began.

Johnson, Amso & Slemmer (2003) examined the eye tracking of 4- and 6-month-old

infants who viewed an object that oscillated back and forth behind an occluder. Some of the

infants experienced an initial period of un-occluded motion and others received no experience

of such a stimulus. Johnson et al. (2003) reported that the 4-month-old infants who saw the un-occluded motion showed anticipatory eye movements over the occluder, but not the infants who did not see the un-occluded motion.

One problem with studies of anticipatory gaze shifts at temporary occlusions is that

anticipations do not occur on every trial. In Rosander & von Hofsten (2004) the 4-month-old

infants moved gaze over the occluder ahead of time in 47% of the trials and in Johnson et al.

(2003) it happened in 29-46% depending on condition and age. It is possible that infants stop tracking at the occluder edge when the object disappears, wait there for a while, and then shift gaze anyway. Some of those spontaneous gaze shifts might arrive at the other side of the

occluder before the object reappears there. Thus, they appear to be predictive although they

are not.

In experiments with only one occlusion duration, it is not possible to determine whether

the saccades arriving at the reappearance edge before the object are truly predictive or just

appear to be so. In order to determine the principles underlying infants’ tendency to shift gaze

to the reappearing side of an occluder before the object arrives there, a different set of studies

is required where occlusion duration is systematically varied. Such manipulations are able to

distinguish between several modes of behavior underlying ‘predictive’ actions at occluders. It

is the gaze shifts that arrive before the reappearing object that are informative in this respect.

They are here called pre-reappearance gaze shifts (PRGS). Those that arrive later need to be excluded from this analysis because they may have been elicited by the sight of the reappearing object.
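The PRGS criterion can be stated compactly as a classification rule. The short sketch below (in Python, with hypothetical example timings that are not data from the study) simply makes explicit that only gaze shifts arriving at the exit side before the object reappears count.

# Sketch of the PRGS classification rule described above (illustrative;
# the passage timings are hypothetical examples, not data from the study).
def is_prgs(gaze_arrival_time, object_reappearance_time):
    # A gaze shift is a pre-reappearance gaze shift (PRGS) only if it reaches
    # the exit side of the occluder before the object becomes visible again.
    return gaze_arrival_time < object_reappearance_time

# (time gaze reaches the exit edge, time the object reappears), in seconds
passages = [(0.45, 0.61), (0.70, 0.61), (0.30, 0.22)]
print(["PRGS" if is_prgs(g, r) else "excluded" for g, r in passages])
# -> ['PRGS', 'excluded', 'excluded']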

First, let’s entertain the possibility that infants stop their gaze at the occluder edge

when the object disappears, linger a little there and then shift attention to the other occluder

edge. If passage-of-time determines when saccades are made, the tendency to shift gaze over

the occluder is expected to be a function of time since disappearance. An implication of this

mode is that the longer the occlusion duration, the larger the number of these saccades that will be counted as predictive, purely because a longer interval leaves more time for a spontaneous gaze shift to arrive before the object does.

Secondly, it is possible that subjects move gaze over the occluder before the object

reappears because of the visual salience of the exiting occluder edge. The greater the visual

salience of a stimulus, the earlier will gaze be shifted to this stimulus. It is expected that

visual salience is greater for high contrast than for low contrast stimuli. As contrast sensitivity

decreases with increasing eccentricity in the visual field, it is possible that PRGS in the

presence of a wide occluder will have a longer latency, not because the subject expects the

object to reappear later, but because the visual salience of the exiting occluder edge is then

lower. Thus, varying just the width of the occluder will not distinguish the visual salience

hypothesis from a more cognitive one. Another problem with a salient occluder is that it may

attract attention to such a degree that tracking is interrupted altogether and the infant instead fixates the

salient occluder (Muller & Aslin, 1978). To avoid effects of visual salience of the occluder it

is therefore of utmost importance to use occluders that deviate from the background as little

as possible. In the present experiments both the occluder and the background had the same

white colour.

If the salience of the exiting occluder edge varies depending on how far into the

peripheral visual field it is presented, it is expected that this would also be valid for any

salient visual stimulus. The colour and luminance of the reappearing object deviated quite


distinctly from the background and saccadic latency to it should therefore be related to its

eccentricity in the visual field. If it is, the visual salience hypothesis would be strengthened.

If, on the other hand, non-visual factors such as oscillation frequency and object velocity

affect the latency of the PRGS, it means that the attractiveness of the exiting occluder edge is,

at least, moderated by these other variables. Thus the visual salience hypothesis cannot be

properly tested unless both occluder width and object velocity are manipulated. This was done

in the present study by varying oscillation frequency and motion amplitude.

Another possible explanation of why PRGS are performed is that they may just be a

contingency effect without any anchoring in continuous physical events. The subjects could

just remember the duration of the previous occlusion and expect the object to be absent for the

same period at the next occlusion (the memory hypothesis). Haith and associates (Canfield &

Haith, 1991; Haith, 1994; Haith, Hazan, & Goodman, 1988) investigated infants’ anticipatory

saccades to when and where the next picture in a predictable left–right sequence was going to

be shown. Saccades arriving at the position of the next picture within 200 ms of its

presentation were considered anticipatory. From about 3 months of age, such anticipations

were found to be significantly more frequent for a predictable than for an unpredictable

sequence (Canfield & Haith, 1991). As the pictures were distinctly different, it is doubtful

whether they were perceived as one continuous event. In addition, the left–right sequence of

pictures followed an arbitrary rule rather than a physical principle. In spite of this, temporal

regularities were rapidly learnt suggesting that infants are sensitive to spatio-temporal

contingencies and use them to predict what is going to happen next. Whether infants’ gaze

shifts over the occluder are determined in this way cannot be tested with a single duration of

occlusion. If several durations are included in the experiment, however, it is expected that the

subjects will systematically misjudge the duration every time it is changed. When duration is

made shorter, the subjects will overestimate it and consequently shift gaze too late. If the


duration is made longer, they will underestimate duration and shift gaze over the occluder too

early.
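To make the memory hypothesis concrete, the short sketch below shows the prediction it makes when occlusion duration changes between passages; the durations used are hypothetical example values, not data from the experiments.

# Sketch of the memory hypothesis: expect the object to stay hidden for as long
# as it did on the previous passage (durations are hypothetical examples, in seconds).
def memory_based_wait(previous_occlusion_duration):
    return previous_occlusion_duration

previous_duration = 0.6   # duration experienced on the last occlusion
new_duration = 0.3        # actual duration after the change (e.g. a faster object)
error = memory_based_wait(previous_duration) - new_duration
print(round(error, 2))    # 0.3: under this hypothesis the gaze shift arrives 0.3 s too late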

Finally, the PRGS might be based on information about occlusion duration derived

from oscillation frequency and amplitude, object velocity, and occluder width. This will be

called the cognitive hypothesis. Object velocity is linearly related to oscillation frequency

and motion amplitude. If frequency or amplitude is doubled, so is velocity. Occlusion

duration is obtained by dividing occluder width by object velocity. Instead of having

explicit knowledge of this relationship, infants could simply maintain a representation of the

object motion while the object is occluded and shift gaze to the other side of the occluder

when the conceived object is about to arrive there. In support of this alternative are the

findings that object velocity is represented in the frontal eye field (FEF) of rhesus monkeys

during the occlusion of a moving object (Barborica & Ferrera, 2003). It is only by

simultaneously varying object velocity and occluder width that such a strategy can be

revealed.
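For concreteness, the relationship just outlined can be written down directly: occlusion duration equals occluder width divided by object velocity, and for a triangular (constant-speed) oscillation the velocity scales with both motion amplitude and oscillation frequency. The sketch below assumes that "amplitude" denotes the full sweep of the trajectory, and all numerical values are hypothetical; the actual widths, frequencies, and amplitudes are those reported in the Method sections.

# Sketch of the occlusion-duration relationship (assumptions: triangular motion;
# "amplitude" taken as the full sweep of the trajectory; all numbers hypothetical).
def occlusion_duration(occluder_width_deg, frequency_hz, amplitude_deg):
    velocity = 2.0 * amplitude_deg * frequency_hz   # deg/s; doubles if frequency or amplitude doubles
    return occluder_width_deg / velocity            # seconds spent behind the occluder

print(occlusion_duration(10.0, 0.25, 40.0))    # 0.5 s   (object speed 20 deg/s)
print(occlusion_duration(10.0, 0.375, 40.0))   # ~0.33 s (speed 30 deg/s: frequency raised)
print(occlusion_duration(20.0, 0.25, 40.0))    # 1.0 s   (occluder width doubled)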

In the past, very few developmental studies have varied both object velocity and occluder width. Muller & Aslin (1978) showed 6-month-old infants a sphere (8 cm in diameter) that moved at either 10 or 20 cm/s on a horizontal trajectory and disappeared temporarily behind a 20 or 40 cm wide occluder. They had no means of measuring the exact timing of the gaze shifts over the occluder, but found that the instances of tracking interruptions longer than 1 s increased dramatically when the complete occlusion duration increased from 0.6 to 1.2 s (i.e. partial occlusion increased from 1 to 2 s). This indicates that

the length of the tracking interruptions was related to occlusion duration. Gredebäck, von

Hofsten & Boudreau (2002) studied 9-month-old infants who tracked an object moving on a circular trajectory. The results showed that the infants' performance was determined by occlusion duration, that is, by the joint relationship between object velocity and occluder width. For a range of occlusion durations from 0.25 to 5 s, gaze was shifted over the occluder

at around 2/3 of the occlusion time. Similar results were obtained by Gredebäck & von

Hofsten (2004) with 6-month-old infants and older. In that study, however, occluder width

was not varied and the joint effect of these two variables could therefore not be evaluated. The

purpose of the present study was to determine how 4-month-old infants behave in a situation

with temporarily occluded objects when occluder width and object velocity are

simultaneously manipulated.

Two experiments were performed in which infants were presented with objects

oscillating on a horizontal trajectory. We created a set of occlusion durations by systematic

manipulation of oscillation frequency, occluder width (Experiments 1 and 2), and motion

amplitude (Experiment 2). In order to minimize the effect of visual salience of the exiting

occluder edge, the occluder had the same white colour as the background (the inside of the

cylinder). Three oscillation frequencies and 3 occluders were used in Experiment 1 creating 7

different occlusion durations between 0.22 and 0.61 s. In Experiment 2, oscillation amplitude

in addition to oscillation frequency and occluder width was varied creating 8 different

occlusion durations between 0.2 and 1.66 s. The object velocities used ranged from 15 to

30°/s in Experiment 1 and from 8 to 30°/s in Experiment 2. Earlier research has shown that

infants track objects optimally within this interval (von Hofsten & Rosander, 1996; Mareschal, Harris & Plunkett, 1997). The infants were between 16 and 20 weeks of age. Our earlier studies indicate that PRGS emerge during this age period (Rosander & von Hofsten, 2004).

General method

Subjects.


Twenty-three 16-20-week-old infants participated in the study. They were all healthy and

born within 2 weeks of the expected date. The study was in accordance with the ethical

standards specified in the 1964 Declaration of Helsinki.

Apparatus.

The experiments were performed with the same apparatus as earlier described in detail (von

Hofsten & Rosander, 1996, 1997; Rosander & von Hofsten, 2002, 2004). The infants were

placed in a specially designed infant chair positioned at the center of a cylinder, 1 m in

diameter and 1 m high. This cylinder shielded off essentially all potentially distracting stimuli

around the infant. The chair was positioned in the cylinder in such a way that its axis

corresponded approximately to the body axis of the infant. His or her head was lightly

supported with pads. In the experiments, the setup was inclined at a comfortable angle of 40

deg. Figure 1 shows a subject in the cylinder.

------------------------Insert Figure 1 about here-----------------------

Stimuli.

In front of the infant the cylinder had a narrow horizontal slit where the movable object was

placed. A circular orange colored "happy" face was used as the stimulus to be tracked. It was

5 cm in diameter corresponding to a visual angle of approximately 7°, with black eyes, a

black smiling mouth, and a black wide contour around it. In the middle of the schematic face,

at the position of the nose, a mini video camera (Panasonic WV-KS152) was placed. Its black

front had a diameter of 15 mm or 2.2° of visual angle. Below the face, a red LED was placed.

It could be blinked to attract the infant’s gaze. The motion of the object was controlled by a

function generator (Hung Chang, model 8205A). In the experimental trials, the "happy face"

object oscillated in front of the infant according to a triangular function (i.e. constant velocity

that abruptly reversed direction at the endpoints of the trajectory). To minimize visual

salience of the non-moving parts of the setup, the moving object was occluded by a

rectangular piece of cardboard at the center of its trajectory with the same white color as the


inner side of the cylinder (see Figure 1). The edges of the slit in which the object moved were

also white. Finally, from the infant’s point of view, the background behind the slit was the

white ceiling of the experimental room. Different sized occluders were used in the different

conditions. They were placed over the motion-trajectory at a distance of 2 cm from the

cylinder wall, to which they were attached by means of Velcro.

Measurements

Eye movements: Electrooculogram (EOG) was used to measure eye movements. The

electrodes were of miniature type (Beckman) and had been soaked in physiological saline for

at least 30 minutes before use. They were then filled with conductive electrode cream

(Synapse, Med-Tek Corp.) and attached to the outer canthi of each eye. The ground electrode,

a standard EEG child electrode, was attached to the ear lobe. The signals from the two

electrodes were fed via 10 cm long cables into a preamplifier attached to the head of the infant

(see Figure 1). The preamplifier was introduced to eliminate the impedance problems of

traditional EOG measurements associated with long wires (Westling, 1992). The amplifier

was powered by a battery and completely isolated from the rest of the equipment for safety

reasons. The signal was relayed to the outer system by means of optical switches. The EOG

signal was low-pass filtered at 20 Hz by means of a third-order Butterworth filter. The

position accuracy of the filtered EOG data was ±0.4° visual angle or better. The filtering introduced a

constant time delay of 0.015 s that was compensated for in the data analysis. As EOG

measurements are based on the difference in potential level between the electrodes, the signal is

subject to a slow drift over time. This drift was eliminated before analysis. The drift, however,

could cause the amplified potential to reach saturation values in the recording window.

Therefore, the base level of EOG was reset between trials. This does not affect the calibration

of the EOG or distort the data in any other way (cf. Juhola, 1987). For more detailed


information about the EOG setup, see Rosander & von Hofsten (2000, 2002, 2004) and von

Hofsten & Rosander (1996, 1997).
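The filtering step just described can be illustrated with a short sketch. The paper's analysis programs were written in MATLAB; the following Python/SciPy version is only an illustration under stated assumptions (the shift-by-samples delay compensation and the synthetic test signal are assumptions of this sketch), and it omits the drift removal and between-trial baseline resets described above.

```python
# Minimal sketch (Python/SciPy), not the authors' code: 20 Hz third-order Butterworth
# low-pass filtering of the EOG with the reported 0.015 s constant delay compensated.

import numpy as np
from scipy.signal import butter, lfilter

FS = 240.0          # sampling rate of the EOG and motion data (Hz)
CUTOFF = 20.0       # low-pass cutoff (Hz)
DELAY_S = 0.015     # constant filter delay reported in the text (s)

def filter_eog(raw_eog):
    """Low-pass filter the raw EOG trace and shift it forward to undo the filter
    delay (the shift-by-samples correction is an assumption of this sketch)."""
    b, a = butter(3, CUTOFF, btype="low", fs=FS)
    filtered = lfilter(b, a, raw_eog)
    shift = int(round(DELAY_S * FS))            # about 4 samples at 240 Hz
    compensated = np.empty_like(filtered)
    compensated[:-shift] = filtered[shift:]     # advance the trace
    compensated[-shift:] = filtered[-1]         # pad the tail
    return compensated

if __name__ == "__main__":
    # One second of synthetic, EOG-like data just to exercise the function.
    t = np.arange(0, 1, 1 / FS)
    raw = np.sin(2 * np.pi * 2 * t) + 0.1 * np.random.randn(t.size)
    print(filter_eog(raw)[:5])
```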

Head and object motions: A motion analyzing system, “Proreflex (Qualisys)”, was used to

measure the movements in 3-D space. This system uses high-speed cameras that register the

position of passive reflective markers illuminated by infrared light positioned around the lens

of each camera. In the present setup, 3 cameras were used, each of which had an infrared light

source. Three markers were placed on the head to make it possible to record its translational

as well as rotational movements and one marker was placed at the center of the moving

object. The position accuracy of each measured marker was 0.5 mm. The markers had a

diameter of 4 mm and can be seen in Figure 1. The EOG and the head movement data were

recorded in synchrony at 240 Hz.

Procedure

Two experiments with identical procedures were carried out. First, the accompanying parent

was informed about the experiment and signed the consent form. Then it was ensured that

the infant had been recently fed and was in an alert state. The video camera at the center of

the stimulus monitored the face of the infant so that the parent and the experimenter could

observe the infant during the experiment. As the camera was always directed towards the face

of the infant, it gave excellent records of whether the infant attended to the object or not in

between occlusions. Each experimental session was video-recorded with this device. If the

infant fussed or fell asleep during a trial, the experiment was interrupted temporarily, after

which the last trial was repeated and the experiment continued. Such interruptions were

uncommon. Before each trial the stimulus parameters were set to the appropriate values. This

included changing the occluder and setting the amplitude and frequency of the oscillations to

be used on the function generator. The duration of each trial was 25 s and the whole

experimental session had a duration of around 7 minutes.


Calibration of EOG: This procedure has been described earlier (Rosander & von Hofsten,

2002, 2004; von Hofsten & Rosander, 1996). The experimenter moved the object swiftly from

one extreme position to the middle of the path, and then to the extreme distance on the other

side (Finocchio, Preston, & Fuchs, 1990). Each position was defined and measured by the

motion analyzing system (Qualisys). Each stop lasted 1-2 s and during that time the

experimenter flashed the red LED placed below the face. This attracted the infant’s gaze. The

flashing of the LED also made it possible to determine if the infant fixated the object

(Rosander & von Hofsten, 2002). During the 25 s period of calibration, the experimenter

covered about 4 cycles of motion corresponding to 16 fixation stops.

Data analysis

The video records were inspected to determine whether the subjects attended to the

object both before and after the occlusion of the object at each passage. Gaze shifts over the

occluder launched earlier than 0.3 s before total occlusion or later than 1 s after the object had

reappeared were regarded as unrelated to the occlusion event and were therefore not included

in the analysis. They constituted merely 2.5% of the total number of gaze shifts. Data were

analyzed with programs written in the MATLAB environment. The statistical analysis was

performed in SPSS. A specific trial was considered to have started when the infant first

directed its gaze toward the object. Thus, if the infant looked somewhere else after the object

had started to move and therefore missed one occluder passage, this passage was not included

in the analysis of the trial. Instead, the second passage was considered the first. The gaze

direction was calculated as a sum of head and eye direction, where the head direction was

calculated from the relative positions of the three markers on the infant's head.

Calculation of gaze direction: All calculations were performed in the plane of rotation (xz-

plane) of the cylinder which had its origin at the axis of rotation (the y-axis), see Figure 2.


The position of the object on the cylinder was transferred to angular coordinates. We can find

the time-dependent angles of the head (αhead) and the object (βobject) using Equations 1 and 2:

$$\alpha_{head} = \arctan\left\{\frac{X_A - \tfrac{1}{2}(X_B + X_C)}{Z_A - \tfrac{1}{2}(Z_B + Z_C)}\right\} + \arctan\left\{\frac{X_B - X_C}{Z_B - Z_C}\right\} \qquad (1)$$

and

$$\beta_{object} = \arctan\left\{\frac{X_D - \tfrac{1}{2}(X_B + X_C)}{Z_D - \tfrac{1}{2}(Z_B + Z_C)}\right\} \qquad (2)$$

where X and Z are the time-dependent coordinates of the markers A, B, C and D (see Figure 2).

--------------------Figure 2 about here--------------------------------------

The calibration factor of the EOG in mV/deg was obtained by dividing the

difference between the values of the EOG signal at different calibration positions by the

difference in degrees between head and object at those positions. A linear relation was then

assumed between the EOG signal and the radial displacement according to Davis & Shackel

(1960), Finocchio, et al. (1990), and von Hofsten & Rosander (1996). Eye movements were

calibrated in the same angular reference system as the head, i.e. in terms of how much head

movement would be required to shift gaze an equal amount. Amplitudes of eye movements

were regarded as proportional to the corresponding head movements. If the eyes are displaced

6 cm relative to the rotation axis of the head (this value was estimated from measurements on

a few of the participating infants) and the object is positioned 44-46 cm from head axis as was

the case in the present experiment, the exact relationship between head and eye angles

deviates from proportionality with less than 0.5% within the interval used (±24°). The gaze

was estimated as the sum of eye and head positions. Angular velocities were estimated as the

difference between consecutive co-ordinates. The velocities of the eyes, head, and gaze were

routinely calculated in this way.
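A compact sketch of this gaze computation, in Python rather than the authors' MATLAB, is given below: marker coordinates in the xz-plane are converted to angles relative to the midpoint of head markers B and C (following the reconstruction of Equations 1 and 2 above), the EOG is scaled by a calibration factor in mV/deg, and gaze is taken as the sum of eye and head angles, with velocities obtained by differencing consecutive samples. The function names and the use of the B-C midpoint are illustrative assumptions.

```python
# Minimal sketch (Python/NumPy) of the gaze computation described above; this is an
# illustration under assumptions, not the original analysis code.

import numpy as np

FS = 240.0  # sampling rate of the EOG and motion data (Hz)

def angle_from_bc_midpoint(x, z, xb, zb, xc, zc):
    """Angle (deg) of a point (x, z) in the xz-plane relative to the midpoint of
    head markers B and C (assumed reference point, cf. Equations 1 and 2)."""
    xm, zm = 0.5 * (xb + xc), 0.5 * (zb + zc)
    return np.degrees(np.arctan2(x - xm, z - zm))

def eog_calibration_factor(eog_at_stops_mv, angle_at_stops_deg):
    """Calibration factor (mV/deg): differences in EOG level between calibration
    stops divided by the corresponding differences in degrees."""
    return np.diff(eog_at_stops_mv).mean() / np.diff(angle_at_stops_deg).mean()

def gaze_trace(eog_mv, calib_mv_per_deg, head_angle_deg):
    """Gaze (deg) as the sum of the calibrated eye-in-head angle and the head angle."""
    return eog_mv / calib_mv_per_deg + head_angle_deg

def angular_velocity(angle_deg):
    """Velocity (deg/s) as the difference between consecutive samples."""
    return np.diff(angle_deg) * FS
```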

Analysis of visual tracking over the occluder: First, the occluder boundaries were identified

from the object motion record. This happened when the reflective marker on the object

became occluded and dis-occluded. Then the interval where the object was completely

occluded was identified. As the marker on the object was placed at its center, the position in

space-time of the full occlusions corresponded to a marker position half an object width inside

the occluder (see Figure 3). Gaze was considered to have arrived at the other side of the

occluder when it was within 2° of its opposite boundary. The boundary will then be within the

foveal region of the infants’ visual field (Yudelis & Hendrickson, 1986). The time when gaze

arrived at the exiting occluder edge relative to the time when the object started to reappear

was defined as the lag or lead of the gaze shift. The time from the total disappearance of the

object to the time when gaze arrived at the opposite boundary was defined as the gaze shift

latency. When the lag was less than 0.2 s the gaze shift was defined as a pre-reappearance

gaze shift (PRGS), as this is the minimum time required to program a saccade to an

unexpected event in adults (Engel, Anderson, & Soechting, 1999). As PRGS reflect a different

mode of responding than reacting to the seen object after it has reappeared, separate analyses

were conducted on each of these two subsets of data.
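The classification rule described in this paragraph can be summarized in a short Python sketch; the function below is a hypothetical illustration (names and the sample-scanning logic are not from the original analysis), applying the 2° arrival criterion, the 0.3 s/1 s exclusion window, and the 0.2 s PRGS threshold.

```python
# Minimal sketch (Python) of the gaze-shift classification described above; an
# illustration under stated assumptions, not the authors' code.

import numpy as np

def classify_gaze_shift(gaze_deg, time_s, exit_edge_deg, t_occluded, t_reappear,
                        tolerance_deg=2.0, prgs_margin_s=0.2,
                        max_lead_s=0.3, max_lag_s=1.0):
    """Return (label, latency) for one occluder passage, where latency is the time
    from complete disappearance to gaze arrival at the exiting occluder edge."""
    # Consider only gaze samples from 0.3 s before complete occlusion onwards.
    window = time_s >= (t_occluded - max_lead_s)
    near_edge = window & (np.abs(gaze_deg - exit_edge_deg) <= tolerance_deg)
    if not near_edge.any():
        return "excluded", None
    t_arrival = time_s[near_edge][0]
    lag = t_arrival - t_reappear          # negative: gaze arrived before the object
    if lag > max_lag_s:
        return "excluded", None           # unrelated to the occlusion event
    latency = t_arrival - t_occluded
    label = "PRGS" if lag < prgs_margin_s else "reactive"
    return label, latency
```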

----------------Figure 3 about here -----------------

Data were evaluated with correlation analyses and general linear model repeated-measures ANOVAs. To control for violations of sphericity, the Geisser-Greenhouse correction was used when needed. Individual subject means for each condition were used as input.
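For readers who want to reproduce this kind of analysis outside SPSS, a minimal Python sketch using statsmodels' AnovaRM on a toy long-format table of per-subject condition means is shown below; the data values are purely illustrative, and the Geisser-Greenhouse correction used in the paper would have to be applied separately, since AnovaRM does not provide it.

```python
# Minimal sketch (Python/statsmodels), not the original SPSS analysis: a
# repeated-measures ANOVA on per-subject condition means.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Illustrative (made-up) long-format table: one mean latency per subject and condition.
df = pd.DataFrame({
    "subject":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "occluder":  ["narrow", "narrow", "wide", "wide"] * 3,
    "frequency": ["low", "high", "low", "high"] * 3,
    "latency":   [0.30, 0.22, 0.55, 0.40, 0.35, 0.25, 0.60, 0.45, 0.28, 0.20, 0.50, 0.38],
})

result = AnovaRM(df, depvar="latency", subject="subject",
                 within=["occluder", "frequency"]).fit()
print(result)  # F statistics for the two within-subject factors and their interaction
```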


Experiment 1

The purpose of Experiment 1 was to investigate how 3.5 to 4.5-month-old infants are

able to take occlusion duration into account when making pre-reappearance gaze shifts

(PRGS). Duration was varied by orthogonal manipulation of Oscillation Frequency and

Occluder Width. This also made it possible to evaluate each of the four possible strategies outlined in the introduction: the passage-of-time, visual salience, memory, and cognitive hypotheses. Apart from being a determining factor of occlusion duration, object velocity

might be of importance for another reason. A higher velocity gives rise to more rapid

accretion and deletion of the visible object at the occluder edges. Gradual accretion and

deletion has been shown to be of importance for being able to perceive continued motion

behind an occluder (Scholl & Pylyshyn, 1999, Kaufman, Csibra & Johnson, 2005). It is

therefore possible that the strength and clarity of the represented motion improves when the

deletion is made more gradual.

Although earlier research (Johnson et al., 2003; Rosander & von Hofsten, 2004)

presented un-occluded motion at the beginning of the experiment, this was not done in the

present experiment. Piloting indicated that presenting a trial with un-occluded motion at the

beginning of the experiment was not necessary for attaining predictive tracking over

occlusion in 4-month-old infants. In experimental situations, young infants tend to get rapidly

bored and therefore we wanted to keep the experiment as short as possible. It is important to

point out that the calibration procedure does not correspond to a learning trial with un-

occluded motion. Instead of being moved with constant velocity as in the experimental trials,

the object was moved rapidly between positions where it stopped for 1-2 s. For instance, it

stopped at the center of the trajectory. If the infants learned the temporal pattern of this

procedure, they would expect the object to stop behind the occluder. However, although the

calibration procedure did not provide a learning opportunity for the temporal properties of the

oscillatory motion, it made the infants familiar with the horizontal trajectory used.


Method

Experiment 1 included 5 girls and 6 boys. Their age varied from 15:6 to 19:5 weeks

with a mean age of 18 weeks. The amplitude of the oscillating motion was always the same

(24° visual angle). Occluder Width was 8.2, 11, or 14 cm corresponding to 11.6°, 15.5° and

20° of visual angle at a distance of 45 cm from the eyes. Oscillation Frequency was 0.15, 0.21

or 0.30 Hz. Before each trial these values were adjusted by hand on the function generator and

occlusion durations could therefore deviate somewhat at a given condition for the different

subjects (±8%). The resulting angular velocities of the object were 15°/s, 21.4°/s and 30°/s at

45 cm viewing distance.

Seven combinations of Oscillation Frequency and Occluder Width were included in

the experiment instead of the possible 9. The largest occluder was not combined with the

slowest velocity and the smallest occluder was not combined with the fastest velocity. The

decision to exclude those two conditions was made in order to reduce the number of

conditions. It was judged that 9 conditions were going to make the infants too tired and

deteriorate their performance. The measured duration of total occlusion was, on the average

0.22, 0.28, 0.31, 0.39, 0.43, 0.56, and 0.61 s. These figures are given in Table 1. The

conditions were presented in randomized order. The slowest Oscillation Frequency (0.15 Hz)

included 7-8 occluder events per trial, the middle Oscillation Frequency (0.22 Hz) 11-12

occluder events, and the fastest Oscillation Frequency (0.30 Hz) 14-15 occluder events.

Results

One boy did not complete the experiment because of inattention, leaving 10 subjects to

be analyzed (5 girls and 5 boys). Seven of the 70 individual trials from these subjects were

also excluded for the same reason (4 subjects missed one trial and one missed 3 trials). In the

remaining trials, the infants attended to the object both before and after the occlusion in 559

passages (83%), while at 113 passages they were distracted and looked away. In 294 cases,

gaze arrived at the reappearance side of the occluder more than 0.2 s after the object had started

to reappear and those were considered reactive saccades. The reactive saccades arrived at the

reappearance side on the average 0.45 s (sd = 0.16) after the object started to appear.

In 265 cases (47.4 %) the gaze shifts over the occluder arrived at the opposite side less

than 0.2 s after the object arrived there (PRGS). In 234 of these passages (88%) the gaze

stopped at the occluder edge until making saccadic shift(s) over to the other side. In the 31

remaining cases (12%) the subject made a gaze shift over to the other side of the occluder and

then returned to the original edge where gaze stayed until making a final shift over the

occluder. In those cases, it was the second gaze shift that was analyzed. Table 1 shows the

average duration of occlusion, number of presented occluder passages, number of passages

analyzed, number and percentage of PRGS, and mean latency of these gaze shifts in

Experiment 1.

-------------------Insert Table 1 about here-----------------

Determinants of gaze shifts over the occluder

The passage-of-time hypothesis: Infants did not show an overall greater tendency to shift gaze

over the occluder during occlusion with the passage of time. There was no correlation

between occlusion duration and proportion of PRGS for individual subjects (r = -0.02, n=63).

Neither were there significant correlations between proportion of PRGS and Occluder Width

(r= -0.192, n=27) for the middle frequency or between proportion of PRGS and Oscillation

Frequency for the middle Occluder Width (r= 0.156, n=27).

------------------Figure 4 about here--------------------

The memory hypothesis: To test whether the saccade latencies were based on the remembered

duration of the previous occlusion, we compared the performance on the last occlusion in one

condition with the first occlusion in the next condition. If infants follow this rule, they should

overestimate the occlusion duration and arrive too late every time the duration is made


shorter, and underestimate the duration and arrive too early every time the duration is made

longer. When the duration of occlusion increased between conditions there were 12 cases of

PRGS on the first passage and 14 cases of post-reappearance gaze shifts. When the duration

of occlusion decreased between the conditions there were 13 cases of PRGS and 15 cases of

post-reappearance ones on the first passage. Thus, altogether there were 27 cases in

accordance with the carry-over hypothesis and 27 against that hypothesis. On the average,

gaze arrived 0.22 s after object reappearance at the first passage after occlusion was made

longer and 0.21 s after object reappearance on the first passage after occlusion was made

shorter.

The visual salience and the cognitive hypotheses: The effects of occluder width and

oscillation frequency on PRGS are related to both of these hypotheses. The PRGS were not

determined by the time of disappearance but rather by the time of reappearance of the

occluded object. This can be seen in Figure 4. The correlation between individual mean

latencies for the different conditions and occlusion duration was 0.658 (p<0.001, n=62). The

effects of Occluder Width and Oscillation Frequency were tested in two separate general

linear model repeated measurements ANOVAs. Due to the design of the experiment, each

variable was tested for the middle condition of the other variable. We tested the effect of

Occluder Width on the timing of PRGS for the middle Oscillation Frequency (21°/s). It was

found that the effect of Occluder Width was statistically significant (F(2,18) = 13.68, p<0.01,

η2 = 0.60). The effect of Oscillation Frequency was tested for the middle Occluder Width

(16°). It was found that the effect of Oscillation Frequency was statistically significant

(F(2,18) = 5.539, p<0.02, η2 = 0.38).

The reactive saccades: To decide whether visual salience of a stimulus presented at the exiting

occluder edge affected the saccade latency to it, a general linear model repeated

measurements ANOVA based on the mean individual latencies of post reappearance saccades


for the different Occluder Widths was performed. Only the middle velocity (21°/s) was

included in this analysis. It indicated that the eccentricity in the visual field of the exiting

occluder edge had no systematic effect on the latency of these saccades (F(1,9) <1.0). A

similar analysis of the effect of velocity for the middle Occluder Width (16°) gave similar

non-significant results (F(1,9)<1.0).

Discussion

The results of Experiment 1 suggested that neither passage-of-time, nor visual salience

of the opposite occluder edge, nor memory of the previous occlusion duration determined

PRGS. The infants did not show a greater tendency to shift gaze over the occluder before the

object reappeared when the occlusion duration was longer. Two kinds of evidence weakened

the hypothesis that visual salience of the exiting occluder edge determines the timing of the

PRGS. First, it was found that Oscillation Frequency (i.e. Object Velocity) had a significant

effect on the timing of PRGS and this effect cannot be derived from visual salience. In

addition, the reappearance position in the peripheral visual field of the salient moving object

did not have a significant effect on the timing of the post reappearance saccades. Memory of

the previous occlusion neither affected the number of PRGS on the first occlusion in the next

trial nor the latency of the saccades to the exiting side of the occluder. In summary, the results

of Experiment 1 suggest that the PRGS were determined by the combination of Occluder

Width and Object Velocity, i.e. by the Occlusion Duration.

The factorial design of Experiment 1 was incomplete. It was judged that the young

infants would not be able to keep optimal attention for more than 7 trials. The results,

however, showed that failing attention was not a problem with the present experimental setting

and design. Therefore a complete factorial design was used in Experiment 2.

Experiment 2


The purpose of Experiment 2 was to expand the range of occlusion durations to confirm

the relationships found in Experiment 1 (see Figure 4) and expand them to include faster and

slower motions and thus shorter and longer durations. This was made possible by

manipulating motion amplitude in addition to oscillation frequency and occluder width. By

varying amplitude, object velocity and oscillation frequency could also be disentangled.

Furthermore, it is possible that subjects need to view a moving object for a certain distance

and/or time in order to form stable representations that would last over the temporary

occlusion. Sekuler, Sekuler, & Sekuler (1990) measured response times (RTs) to changes in

target motion direction following initial trajectories of varying time and distance. When the

initial trajectories were at least 500 ms, the directional uncertainty ceased to affect RTs.

Infants may require more time. Two amplitudes were used in Experiment 2. The larger was

the same as in Experiment 1 and the smaller was 50% of it. It should be added that the object

was never fully visible when the smaller amplitude was used in combination with the larger

occluder.

Method

Experiment 2 included 6 girls and 7 boys ranging in age from 17:1 to 20:3 weeks, with a

mean age of 18:6 weeks. Altogether 8 stimuli were shown to each subject in a factorial design

with Occluder Width, Oscillation Frequency, and Motion Amplitude as factors. Occluder Width

was either 8.2 or 14 cm, corresponding to 11.6° and 20° of visual angle at 45 cm viewing

distance. Oscillation Frequency was 0.15 or 0.30 Hz. Motion Amplitude was 24° or 12.8° at

45 cm viewing distance. Both the oscillation frequency and the motion amplitude were

adjusted manually on a function generator before each trial and occlusion durations could

therefore deviate somewhat at a given condition for the different subjects (±12%). At the

higher amplitude, the velocities were 15 and 30°/s and at the lower amplitude they were 8 and

16°/s on the average.


Results

Three boys and one girl provided data in less than 6 of the 8 conditions and were

therefore excluded from the data analysis. The data from the remaining 9 subjects (5 girls and

4 boys) were analyzed. Three of the 72 individual trials of these subjects were excluded

because of fussing or inattention. On the remaining occluder passages, infants attended to the

object both before and after the occlusion in 559 cases (77%), while in 167 cases they were

distracted and looked away. There were 287 gaze shifts over the occluder that arrived more

than 0.2 s after the object had started to reappear.

There were 280 cases of PRGS (50%). In 249 of these passages (89%) the gaze stopped

at the occluder until a saccadic shift was made over to the other side. In the 31 remaining

cases (11%) the subject made a gaze shift over to the other side of the occluder just after the

object disappeared and then returned to the original edge where gaze stayed until making a

final shift over the occluder. In these cases, the second shift was analyzed. Table 2 shows the

average occlusion duration, number of presented occluder passages, number of passages

analyzed, number and percentage of PRGS and mean latency of these gaze shifts for the

different conditions in Experiment 2.

------------------------ Insert Table 2 about here ----------------------

Determinants of gaze shifts over the occluder.

The passage-of time hypothesis was not supported by Experiment 2 (see Table 2). There

was no significant correlation between occlusion duration and proportion of PRGS (r = 0.116,

n = 69, p=0.34). The proportion of PRGS was not greater for the shorter occlusions than for

the longer ones. The visual salience hypothesis stating that the timing of PRGS is determined

by Occluder Width was further weakened. Strong effects of Oscillation Frequency and

Motion Amplitude on the timing of PRGS were obtained. Furthermore, it was shown that

there was an effect of visual salience of the reappearing moving object on the saccadic


reaction time to it, but that this effect was small compared to the effect of occluder width on

PRGS. The memory hypothesis stating that PRGS is determined by the memory of the

previous occlusion duration did not receive any support by the results of Experiment 2. The

proportion of PRGS at the first occlusion when its duration was increased was not greater

than when it was decreased and the timing relative to object reappearance was unaffected.

Finally, the cognitive hypothesis was strengthened. It was found that the timing of the PRGS

was strongly determined by the time of reappearance of the object. Figure 5 shows the

relationship between PRGS and occlusion duration for individual subjects. The correlation

between individual mean latencies for the different conditions and occlusion duration was

very high (r = 0.902 p<0.0001, n = 66). The analysis of the relative timing of PRGS

(independent of occlusion duration) showed that subjects shifted gaze to the exiting side after, on average, 89% of the occlusion interval. The analysis of the relative timing also

showed that PRGS were elicited relatively earlier for the longer than for the shorter

occlusions (r = -0.35, p<0.01, n = 66). The statistical analyses of these effects are further

elaborated below.

The passage-of-time hypothesis: A general linear model repeated measurements ANOVA

showed that there were no main effects of either Occluder Width, Oscillation Frequency, or

Motion Amplitude (all Fs < 1.0). At the high and low frequency, the proportion of PRGS was 0.48 and 0.53, respectively. At the high and low amplitude the proportion of PRGS was 0.50 and 0.51, respectively. Finally, for the narrow and wide occluder the proportion of PRGS was 0.49

and 0.53, respectively. The same analysis showed that there were no interactions except for

the one between Oscillation Frequency and Motion Amplitude (F(1,8)=8.01, p<0.05, η2 =

0.50). At the lower frequency, there was a greater average proportion of PRGS for the higher

(0.55) than for the lower amplitude (0.51). At the higher frequency, there was a greater

proportion of PRGS for the lower (0.54) than for the higher amplitude (0.42).


-----------------Insert Figure 5 about here------------------

The memory hypothesis: The results of Experiment 2 did not support the memory hypothesis.

Gaze did not arrive at the exiting side of the occluder systematically later when the duration

was made shorter and systematically earlier when it was made longer. When the duration of

occlusion increased between conditions there were 18 PRGS on the first passage and 16 post-

reappearance gaze shifts. When the duration of occlusion decreased between the conditions

there were 13 PRGS and 14 cases of post-reappearance ones on the first passage. Thus,

altogether there were 32 cases in accordance with the carry-over hypothesis and 29 against it.

On the average, gaze arrived 0.21 s after object reappearance when occlusion was made longer

and 0.20 s after object reappearance when occlusion was made shorter. This difference was

not statistically significant (p=0.9).

The visual salience and the cognitive hypotheses are both addressed by the analyses of

absolute and relative latency of the PRGS. Because the proportions of PRGS for the different

conditions were not significantly different, it was possible to compare their timing (with the

exception for the interaction between Oscillation Frequency and Motion Amplitude). A

general linear model repeated measurements ANOVA was performed on the PRGS latencies.

Significant effects were obtained of Occluder Width (F(1,8) = 120.2, p < 0.0001, η2 = 0.94),

Motion Amplitude (F(1,8)=87.35, p<0.0001, η2 = 0.92), and Oscillation Frequency (F(1,8) =

15.20, p<0.005, η2 = 0.66). The average latency of PRGS for the narrow and wide occluder

was 0.33 s and 0.79 s, respectively. Corresponding figures for the high and low frequency were

0.46 and 0.66, and for the high and low amplitude 0.39 and 0.73. There was a significant

interaction between Occluder Width and Oscillation Frequency (F(1,8) = 16.35, p < 0.005, η2

= 0.67). While the latency of the PRGS for the wider occluder decreased with oscillation

frequency, this was not the case for the narrow occluder. There was also an interaction between


Motion Amplitude and Occluder Width (F(1,6) = 28.28, p <0.001, η2 = 0.78). The effect of

amplitude was greater for the wide than for the narrow occluder. Finally, there was an

interaction between Oscillation Frequency and Motion Amplitude (F(1,6) = 9.483, p <0.02, η2

= 0.54). The effect of Oscillation Frequency was smaller for the large amplitude than for the

small one.

To determine how the PRGS were distributed within the occlusion interval, their relative

timing was analyzed by dividing PRGS latency by occlusion duration. This set of analyses shows the

relative distribution of PRGS independent of occlusion duration. If the PRGS were

completely determined by the reappearance time of the object, the relative latencies would be

the same independently of the condition. This was not the case. A general linear model

repeated measurements ANOVA tested these relative latencies. Significant effects were

obtained of Occluder Width (F(1,8) = 9.296, p < 0.02, η2 = 0.54) and Oscillation Frequency

(F(1,8) = 29.07, p<0.001, η2 = 0.78), but not for motion amplitude. On the average, gaze

arrived at the reappearance side of the occluder after 0.96 of the occlusion duration for the

narrow and at 0.83 for the wider occluder. Corresponding figures were 0.72 for the lower

frequency and 1.06 for the higher one. An interaction was obtained between Occluder Width

and Oscillation Frequency (F(1,8) = 9.779, p < 0.02, η2 = 0.55). For the low frequency, the

relative latency of PRGS was similar for the wide and the narrow occluder, but for the

high frequency, the relative latency was larger for the narrow occluder than for the wide one.

This is depicted in Figure 6. There was also an interaction between Occluder Width and

Motion Amplitude (F(1,8) = 5.791, p <0.05, η2 = 0.42). For the small amplitude, the relative

latency of PRGS was the same (0.85) for both the wide and the narrow occluder, but for the large amplitude the relative latency was larger for the narrow (1.06) than for the wide occluder (0.80). Another way to describe these interactions is to say that the shortest occlusion

durations produced proportionally longer relative latencies.


--------------------- Insert Figure 6 about here ------------------------

The reactive saccades: A general linear model repeated measurements ANOVA was

performed on the reactive saccades (latencies more than 0.2 s longer than the occlusion duration).

Significant effects were only obtained of Occluder Width (F(1,8) = 11.46, p<0.01, η2 = 0.56).

The average reaction time for the narrow occluder was 0.45 s and for the wider one 0.54 s.

The difference in reaction time for the narrow and wide occluder was much smaller than the

difference in PRGS for the same occluders (F(1,8)=71.72, p<0.0001, η2 = 0.90).

General Discussion

When 4-month-old infants visually track an object that moves behind an occluder, they

tend to shift gaze to the reappearing side ahead of the object. The present experiments asked

whether the PRGS reflect an ability to preserve information about the moving object while it

is hidden or if they are just an expression of low-level visual processing or a randomly

determined process. The PRGS cannot just be a continuation of the pre-occlusion visual

tracking, because the smooth pursuit used to track a visible object and the saccades used to

reconnect to the object on the other side of the occluder are radically different modes of eye

movements (see e.g. Lisberger, Morris, & Tyschen, 1987; Leigh, R. J., & Zee, D.S., 1999).

None of the infants pursued the object smoothly over the occluder. They all used saccades to

shift gaze to the other side.

The determinants of PRGS

The passage-of-time hypothesis: The PRGS could not be explained in terms of passage of

time. The subjects’ gaze did not just linger at the disappearing edge after the object

disappeared and then, after a random time interval, turn to the next thing along the path – the

reappearing edge. If this had been the case, proportionally more saccades would end up on the

other side of the occluder within the occlusion interval for the longer occlusions. On the

contrary, the proportion of PRGS showed no such relationship with occlusion duration in


either Experiment 1 or 2. Furthermore, there was no indication that either Occluder Width

and Oscillation Frequency or Occluder Width and Motion Amplitude had opposite effects that

counteracted each other.

Visual salience hypothesis: What is the likelihood that Occluder Width has an effect on

“predictive saccades” purely because of the visual salience of the exiting occluder edge and

not because a wider occluder implies a longer occlusion? This is not easily tested because

both hypotheses have the same implications on the latency of saccades over the occluder but

for different reasons. The most important precaution to ensure that salience does not

determine saccade latency is to make the occluder as non-salient as possible. In the reported

experiments this was done by matching the white color of the occluder to the white

background (the inside of the cylinder). Thus, although it can be argued that the exiting

occluder edge is visible, it was not salient.

One argument against the visual salience hypothesis comes from the reactive saccades.

They are by definition elicited by the detection of the reappearing object in the periphery of

the visual field. The results indicate that the latency of those saccades is influenced by the

eccentricity in the visual field of the reappearing stimulus. It can thus be argued that although

efforts were made to make the occluder edges as non-salient as possible, the visibility of the

exiting occluder edge could have influenced the effects of Occluder Width on PRGS.

However, the difference in reactive saccade latency for the two occluders in Experiment 2

was small (0.45 s for the narrow and 0.54 for the wide) in comparison to the difference in the

latency of the PRGS (0.33 s for the narrow and 0.79 for the wide occluder). It is therefore

unlikely that visual salience of the exiting occluder edge is the only determinant of the

difference in saccade latency for the different occluder widths. One can, of course, argue that

a non-salient stimulus in the periphery of the visual field like the occluder edge will take

longer to detect than a salient one like the reappearing object. However, the results show that


the latency of PRGS for the narrow occluder is shorter than the reaction time to the salient

reappearing object in the same condition.

A second argument against the visual salience hypothesis comes from the relative

latencies of PRGS. They showed that gaze arrived proportionally later in the occlusion

interval for the narrow occluder than for the wide occluder. This is contrary to what would be

expected from the visual salience hypothesis. Visual salience of the occluder may also have a

different effect on tracking over occlusion. Muller and Aslin (1978) found that a salient

occluder made infants interrupt tracking and look at the occluder instead.

Finally and most importantly, visual salience could not be the only determinant of

PRGS. The effects of oscillation frequency and motion amplitude were found to be just as

important. Motion amplitude and oscillation frequency refer to variables that are not visually

present during occlusion; it follows that some information from the seen

pre-occlusion motion is preserved during occlusion.

The memory hypothesis: Neither of the two experiments supported the hypothesis that infants

remembered the duration of the previous occlusion and expected the object to be absent for

the same period at the next occlusion. The saccade latency at the first passages of a new

condition was not influenced in a systematic way by the last occlusion of the previous

condition with a different duration.

The cognitive hypothesis: This hypothesis states that the subjects time their PRGS on the basis of

information that is not visually present during the occlusion. The cognitive hypothesis was

strongly supported by the results. There was not a single stimulus variable like Occluder

Width, Oscillation Frequency, or Motion Amplitude that determined the latency of the PRGS.

On the contrary, the timing of the PRGS was determined by a combination of these variables

that resulted in a rather close fit between the latency of PRGS and occlusion duration.

Occlusion duration could be obtained by dividing occluder width by object velocity, but it


seems more parsimonious to assume that the infants maintained a representation of the object

motion while the object was occluded and shifted gaze to the other side of the occluder when

the conceived object was about to arrive there. Thus, instead of tracking the moving object

visually, the infants could track it in their “mind’s eye”. If object velocity is represented in this

way while the object is occluded, the effects of Occluder Width, Oscillation Frequency, as

well as Motion Amplitude could all be explained. However, this representation might not

have anything to do with the notion of a permanent object that exists over time or with

infants’ conscious experience of where the occluded object is at a specific time behind the

occluder. Object velocity may rather be represented by some kind of buffer during visible as

well as non-visible parts of the motion. This would also explain why the timing of PRGS

could be so good, irrespective of whether the occlusion duration was 0.2 or 1.6 s. In support

of this hypothesis are the findings that object velocity is represented in the frontal eye field

(FEF) of rhesus monkeys during the occlusion of a moving object (Barborica & Ferrera,

2003). Four-month-old infants might represent object motion during occlusion in a similar way.

The critical stimulus variable for preserving a representation of object motion over occlusion

could be accretion-deletion of the seen object at the occluder edges as suggested by Kaufman,

Csibra & Johnson, (2005). If this is the case, the accretion and deletion used in the present

experiments was never too rapid or too slow to be perceived by the infants. The most rapid

accretion-deletion had a duration of 0.23 s and the slowest one a duration of 0.87 s in the

present experiments. The results also suggest that such a representation is by no means

perfect. There was a clear tendency to overestimate the duration of short occlusions and

underestimate the duration of long ones.

Other issues

The reactive gaze shifts: The subjects failed to predict the reappearance of the object on

about half the occlusions. On some of those passages the subjects might have expected the


object to reappear at a later time. However, the fact that the proportions of reactive gaze shifts

were not systematically greater for the short occlusions than for the long ones argues against

this suggestion. On the average, the reactive gaze shifts arrived almost half a second after

reappearance of the object. This is in accordance with other reaction time studies on infants

(Aslin & Salapatek, 1975; Canfield, Smith, Brezsnyak, and Snow, 1997; Gredebäck, Örnkloo

& von Hofsten, 2006).

Motion amplitude: The low motion amplitude in Experiment 2 did not result in impaired

timing of the PRGS. It should be noted that in the conditions with low amplitude, the object

was only visible for 50% of the total viewing time, corresponding to about 0.8 and 1.6 s

exposure on each side of the occluder. In the small amplitude and large occluder condition the

object was never fully visible. The results suggest that infants do not require much more

prolonged exposure to the moving object than adults do in order to anticipate its reappearance

(compare Sekuler et al., 1990).

Effects of earlier exposure to visual motion: Johnson et al. (2003) reported that such

experience was of critical importance for 4-month-old infants’ ability to predict the

reappearance of an occluded object. In the present study, the subjects’ only prior experience with the stimulus was during the calibration procedure, where the object was rapidly

transferred between the extreme positions of the trajectory and the center of it. If this in any

way primed the infants, it would be to expect the object to stop at the middle of the trajectory,

i.e. behind the occluder like it did at the calibration. Thus, it is unlikely that presentation of

un-occluded motion before a set of occlusion trials is necessary for eliciting PRGS in 4-

month-old infants.

Attention effects: Comparing the present results with that of Johnson et al. (2003) strongly

suggests that their experimental situation was more distracting than the present one. First, the

proportion of passages included and the proportion of predictive gaze shifts over the occluder


were much lower in their study. The proportion of ‘predictive’ gaze shifts in their baseline

condition was 29% compared to 47% in Experiment 1 and 50% in Experiment 2 of the

present study. The obtained proportions are comparable to those found for older infants.

Gredebäck, von Hofsten & Boudreau (2002) found that 9-month-olds tracked predictively

over occlusion in 52% of the trials. Secondly, the number of predictive gaze shifts over the

occluder decreased substantially over the experiment in Johnson et al. (2003). Such a

deterioration of performance is a clear sign of distraction and it was not observed in the

present study. At the transition age to predictive tracking over occlusion, it is expected that

the expression of the ability is crucially dependent on the experimental setting. The present

procedure of placing the infant in a sheltered environment apparently provided more optimal

conditions for attentive tracking than in Johnson et al. (2003). Another difference that might

have an effect on attention is the fact that Johnson et al. (2003) used 2D images while the

present study used real objects.

Future research: The next sets of questions that need to be answered concern the role of

learning and neural maturation in the development of these abilities. While experience, no

doubt, plays an important role in the development of predictive tracking over occlusion, it is

also important to investigate the importance of the maturation of specific neural structures

associated with this ability. In adults, the representation of occluded objects involves several

specific parts of the cerebral cortex. These areas include the frontal eye field (Barborica &

Ferrera, 2003), the temporal area (Baker, Keysers, Jellema, Wicker & Perrett, 2001), and

especially the middle temporal region (MT/MST) (Olson, Gatenby, Leung, Skudlarski, &

Gore, 2004), the intraparietal sulcus (IPS) (Assad & Maunsell, 1995) and the lateral

intraparietal area (LIP) (Tian & Lynch, 1996). When and how do the different parts of this

network become engaged in the neural processing of occlusion? Kaufman et al. (2003, 2005)


have led the way by showing that the temporal areas of 6-month-old infants are activated

when they view object occlusion.

Conclusions

When 4-month-old infants track an object over occlusion and shift gaze to the exiting

side of the occluder ahead of time, the latency of these gaze shifts is primarily determined by

the duration of occlusion. It may be argued that because PRGS latencies are never longer than the occlusion interval (plus 0.2 s), they are inevitably related to it. There are several arguments

against such a conclusion. The first has to do with the proportions of PRGS in the different

conditions. If the tendency to shift gaze ahead of time to the exiting side of the occluder was

independent of occlusion duration, there would have been many more PRGS for the long

occlusions than for the short ones. The results show that the proportions of PRGS were

similar over a range of occlusion durations from 0.2 to 1.6 s. It is, of course, possible that all

PRGS have shorter latencies than the shortest occlusion interval. Then the proportions would

be independent of occlusion duration. The results show that this was not the case. On the

contrary, the PRGS arrived close to the reappearance time independently of occlusion

duration (r = 0.902). This tendency could not be explained by the visual salience of the exiting

occluder edge. In addition, Oscillation Frequency and Motion Amplitude were equally

important determinants of PRGS latency. Neither could it be explained in terms of a memory

for the previous occlusion duration. This suggests that object velocity is represented during occlusion and that the infants track the object behind the occluder in their “mind’s eye”. This

brings continuity to events in a world where vision of moving objects is frequently obstructed

by other objects. Seeing the object for an extended distance before each occlusion did not

make these gaze shifts more precisely timed.

When does such intelligent behaviour appear in development? Rosander & von Hofsten

(2004) found that smooth pursuit gain and the timing of saccades when tracking a temporary


occluded object were highly correlated (r = 0.85). Thus the developmental onset of predictive

smooth pursuit and predictive saccadic tracking might both be related to the onset of the ability to

represent occluded motion. Rosander & von Hofsten (2002) found that by 12 weeks of age

both modes of tracking are predictive. The present results show that it takes at least another

month before these predictive abilities manifest themselves in the launching of predictive

saccades over occlusion.

Finally, the present results demonstrate the advantages of using parametric studies in

infant research. A design with only one occlusion condition could not have determined

whether the pre-reappearance saccades over the occluder were predictive or not. Neither could

a design where only occluder width is manipulated. To answer the questions posed, both

occluder width and object velocity had to be manipulated.


Acknowledgement: We would like to thank the enthusiastic parents and infants who made this

research possible. The research was supported by grants to CvH from The Swedish

Research Council, the Tercentennial Fund of the Bank of Sweden, and EU integrated project

FP6-004370 (Robotcub).


References

Aguiar, A. & Baillargeon, R. (2002) Developments in young infants’ reasoning about

occluded objects. Cognitive Psychology, 45, 267-336.

Aslin, R. N. & Salapatek, P. (1975). Saccadic localization of visual targets by the very young

human infant. Perception and Psychophysics, 17(3), 293-302.

Assad, J.A. & Maunsell, J.H.R. (1995) Neural correlates of inferred motion in primate

posterior parietal cortex. Nature, 373, 167-170.

Baillargeon, R. & Graber, M. (1987) Where’s the rabbit? 5.5-month-old infants’

representation of the height of a hidden object. Cognitive Development, 2, 375-392.

Baillargeon, R. & deVos, J. (1991) Object permanence in young infants: Further evidence.

Child Development, 62, 1227-1246.

Baker C.I., Keysers C., Jellema T., Wicker B., Perrett D.I. (2001) Neuronal representation of

disappearing and hidden objects in temporal cortex of the macaque. Experimental Brain

Research, 140, 375-81.

Barborica, A. & Ferrera, V.P. (2003) Estimating invisible target speed from neuronal activity

in monkey frontal eye field. Nature Neuroscience, 6, 66-74.

Canfield, R. L., & Haith, M. M. (1991). Young infants’ visual expectations for symmetric and

asymmetric stimulus sequences. Developmental Psychology, 27, 198–208.

Canfield, R. L., Smith, E. G., Brezsnyak, M. P., & Snow K. L. (1997) Information processing

through the first year of life: A longitudinal study using the visual expectation paradigm.

Monographs of the Society for Research in Child Development, No. 250, Vol. 62.

Davis, J.R., & Shackel, B. (1960) Changes in the electro-oculogram potential level. Brit. J.

Ophthalmology, 44, 606-614.

Engel, K. C., Anderson, J. H., & Soechting, J. F. (1999) Oculomotor tracking in two

dimensions. Journal of Neurophysiology, 81, 1597-1602.


Finocchio, D., Preston,K.L., & Fuchs,A.F. (1990) Obtaining a quantitative measure of eye

movements in human infants: A method of calibrating the electro-oculogram. Vision

Research, 30, 1119-1128.

Gredebäck, G., von Hofsten, C., & Boudreau, J. P. (2002) Infants' visual tracking of

continuous circular motion under conditions of occlusion and non-occlusion. Infant

Behavior and Development, 25, 161 - 182.

Gredebäck, G., Örnkloo, H., & von Hofsten, C. (in press) The development of reactive

saccade latencies. Experimental Brain Research.

Gredebäck, G., & von Hofsten, C. (2004) Infants' evolving representation of moving objects

between 6 and 12 months of age. Infancy, 6,165-184.

Haith, M. M. (1994). Visual expectation as the first step toward the development of future-

oriented processes. In M. M. Haith, J. B. Benson, R. J. Roberts, Jr., & B. Pennington

(Eds.), The development of future oriented processes. Chicago, IL: University of Chicago

Press.

Haith, M. M., Hazan, C., & Goodman, G. S. (1988). Expectation and anticipation of dynamic

visual events by 3.5-month-old babies. Child Development, 59, 467–479.

Johnson, S. P., Amos, D., & Slemmer, J. A. (2003) Development of object concepts in

infancy. Evidence for early learning in an eye tracking paradigm. Proceedings of the

National Academy of Sciences, 100, 10568-10573.

Juhola, M. (1987) On computer analysis of digital signals applied to eye movements. Report

A 48, University of Turku, Department of Mathematical Sciences, Computer Science, SF-20500 Turku, Finland.

Kaufman, J., Csibra, G., & Johnson, M.H. (2003) Representing occluded objects in the human

infant brain. Proc. R. Soc. Lond. B (Suppl.), 270, 140-143.


Kaufman, J., Csibra, G., & Johnson, M.H. (2005) Oscillatory activity in the infant brain

reflects object maintenance. Proceedings of the National Academy of Sciences, 102,

15271-15274.

Kowler, E. (1990) The role of visual and cognitive processes in the control of eye movement.

In E. Kowler (Ed.) Eye movements and their role in visual and cognitive processes.

Reviews of Oculomotor Research, vol. 4. New York: Elsevier, pp. 1-63.

Leigh, R. J., & Zee, D.S. (1999). The neurology of eye movements (3rd ed). New York, NY:

Oxford University Press.

Lisberger, S.G., Morris, E.J., & Tyschen, L. (1987). Visual motion processing and sensory-

motor integration for smooth pursuit eye movements. Annual Review of Neuroscience,

10, 97-129.

Mareschal, D., Harris, P., & Plunkett, K. (1997) Effects of linear and angular velocity on 2-,

4-, and 6-month-olds' visual pursuit behaviours. Infant Behavior and Development, 20,

435-448.

Muller, A.A. and Aslin, R.N. (1978) Visual tracking as an index of the object concept. Infant

Behavior and Development, 1, 309 – 319.

Olson, I.R., Gatenby, J.C., Leung, H-C, Skudlarski, P., & Gore, J.C. (2004) Neuronal

representation of occluded objects in the human brain. Neuropsychologia, 42, 95-104.

Piaget, J. (1954). The construction of reality in the child. New York: Basic Books.

Rosander, K. and von Hofsten, C. (2000) Visual-vestibular interaction in early infancy.

Experimental Brain Research, 133, 321-333.

Rosander, K. and von Hofsten, C. (2002) Development of gaze tracking of small and large

objects. Experimental Brain Research, 146, 257-264.

Rosander, K. and von Hofsten, C. (2004) Infants' emerging ability to represent object motion.

Cognition, 91, 1-22.


Scholl, B.J. & Pylyshyn, Z.W. (1999) Tracking multiple items through occlusion: Clues to

visual objecthood. Cognitive Psychology, 38, 259-290.

Sekuler, A. B., Sekuler, R. & Sekuler, E.B. (1990) How the visual system detects change in

the direction of moving targets. Perception, 19, 181-195.

Tian, J.R. & Lynch, J.C. (1996) Corticocortical input to the smooth and saccadic eye

movement subregions of the frontal eye field in Cebus monkeys. Journal of

Neurophysiology, 76, 2754-2771.

van der Meer, A. L. H., van der Weel, F. & Lee, D.N. (1994) Prospective control in catching

by infants. Perception. 23, 287-302.

von Hofsten, C. and Rosander, K. (1996) The development of gaze control and predictive

tracking in young infants. Vision Research, 36, 81-96.

Von Hofsten, C. and Rosander, K. (1997) Development of smooth pursuit tracking in young

infants. Vision Research, 37, 1799-1810.

Westling, G. (1992) Wiring diagram of EOG amplifiers. Department of Physiology, Umeå University.

Yudelis, C., & Hendrickson, A. (1986) A qualitative and quantitative analysis of the human

fovea during development. Vision Research 26 (6), 847-855.


Table 1. Average occlusion duration, number of presented occluder passages, number of passages analyzed, number and percentage of pre-reappearance gaze shifts (PRGS) over the occluder, and mean latency of these gaze shifts for the different conditions in Experiment 1.

Occluder width 11.6°
  15°/s (0.15 Hz):   occlusion duration 0.31 s; 64 passages; 54 analyzed; 27 PRGS (50%); mean latency 0.26 s
  21.4°/s (0.21 Hz): occlusion duration 0.22 s; 110 passages; 89 analyzed; 45 PRGS (50.6%); mean latency 0.19 s

Occluder width 15.5°
  15°/s (0.15 Hz):   occlusion duration 0.57 s; 72 passages; 55 analyzed; 30 PRGS (54.6%); mean latency 0.41 s
  21.4°/s (0.21 Hz): occlusion duration 0.39 s; 85 passages; 67 analyzed; 42 PRGS (62.7%); mean latency 0.29 s
  30°/s (0.30 Hz):   occlusion duration 0.28 s; 128 passages; 115 analyzed; 50 PRGS (43.5%); mean latency 0.26 s

Occluder width 20°
  21.4°/s (0.21 Hz): occlusion duration 0.61 s; 100 passages; 83 analyzed; 31 PRGS (37.3%); mean latency 0.49 s
  30°/s (0.30 Hz):   occlusion duration 0.43 s; 113 passages; 96 analyzed; 40 PRGS (41.7%); mean latency 0.35 s


Table 2. Average duration of occlusion, number of presented occluder passages, number of passages analyzed, number and percentage of pre-reappearance gaze shifts (PRGS) over the occluder, and mean latency of those gaze shifts for the different conditions in Experiment 2.

Occluder width 11.6°
  8°/s (0.15 Hz, small amplitude):  occlusion duration 0.64 s; 64 passages; 49 analyzed; 24 PRGS (49%); mean latency 0.35 s
  15°/s (0.15 Hz, large amplitude): occlusion duration 0.35 s; 65 passages; 60 analyzed; 35 PRGS (58%); mean latency 0.25 s
  16°/s (0.30 Hz, small amplitude): occlusion duration 0.34 s; 112 passages; 88 analyzed; 55 PRGS (63%); mean latency 0.36 s
  30°/s (0.30 Hz, large amplitude): occlusion duration 0.20 s; 126 passages; 117 analyzed; 50 PRGS (43%); mean latency 0.27 s

Occluder width 20°
  8°/s (0.15 Hz, small amplitude):  occlusion duration 1.66 s; 63 passages; 38 analyzed; 22 PRGS (58%); mean latency 1.37 s
  15°/s (0.15 Hz, large amplitude): occlusion duration 0.88 s; 58 passages; 49 analyzed; 33 PRGS (67%); mean latency 0.59 s
  16°/s (0.30 Hz, small amplitude): occlusion duration 0.86 s; 126 passages; 74 analyzed; 31 PRGS (42%); mean latency 0.77 s
  30°/s (0.30 Hz, large amplitude): occlusion duration 0.48 s; 112 passages; 84 analyzed; 30 PRGS (36%); mean latency 0.43 s


Figure captions

Figure 1. The experimental situation. The inside of the cylinder and the occluders had

matched white colors. The borders of the slit in which the object moved were also white. The

slit was 60 cm long and had a movable object placed in it. Note the EOG preamplifier and the

reflective markers on the infant’s head and the reflective marker on the “happy-face” object.

The object holder visible through the slit in the picture was not visible from the infant’s point

of view.

Figure 2. Illustration of how the positions of the head (markers A,B,C) and object (marker D)

were calculated, referring to Figure 1 and equations 1 and 2.

Figure 3. An example of a trial from Experiment 1 showing gaze tracking plotted together

with the object motion. The outer horizontal lines in the figure signify the occluder

boundaries. As a centrally placed marker measured the position of the object, only half the

object was occluded when the marker passed the occluder boundaries. Therefore the inner

horizontal lines are needed to signify the position of the object marker when the object was

totally occluded. These are thus positioned half the object width inside the occluder. The

horizontal lines in between the inner and outer ones signify the 2° tolerance outside which

gaze was considered to have arrived at the other side of the occluder.

Figure 4. The relationship between occlusion duration and PRGS for individual subjects in

Experiment 1. In the figure, the latencies for the following adjacent occlusion durations are

averaged (0.28 and 0.31 s, 0.39 and 0.43 s, and 0.57 and 0.61 s) to make the individual curves

easier to distinguish. The dashed line shows the hypothetical relationship with saccade latency

equal to occlusion duration.

Figure 5. The relationship between occlusion duration and PRGS for individual subjects in

Experiment 2. In the figure the latencies for the following adjacent occlusion durations are

averaged (0.34 and 0.35 s, 0.48 and 0.64 s, and 0.86 and 0.88 s) to make the individual curves


easier to distinguish. The dashed line shows the hypothetical relationship with saccade latency

equal to occlusion duration.

Figure 6. The relative latency of PRGS for the combinations of occluder width and

oscillation frequency.

Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.