
The encoding and decoding of complex visual stimuli: a neural model to optimize and read out a temporal population code.

Andre Luiz Luvizotto

TESI DOCTORAL UPF / 2012

Director de la tesi

Prof. Dr. Paul Verschure,

Department of Information and Communication Technologies


By the author and licensed under the

Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported

You are free to Share – to copy, distribute and transmit the work – under the following conditions:

• Attribution – You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

• Noncommercial – You may not use this work for commercial purposes.

• No Derivative Works – You may not alter, transform, or build upon this work.

With the understanding that:

Waiver – Any of the above conditions can be waived if you get permission from the copyright holder.

Public Domain – Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.

Other Rights – In no way are any of the following rights affected by the license:

• Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;

• The author's moral rights;

• Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.

Notice – For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page.

The PhD committee was appointed by the rector of the Universitat Pompeu Fabra on .............................................., 2012.

Chairman:

Secretary:

Member:

The doctoral defense was held on ......................................................., 2012, at the Universitat Pompeu Fabra and scored as ...................................................


To my parents...

Acknowledgements

This dissertation has been developed in the Department of Technology of the Universitat Pompeu Fabra, in the research lab SPECS. My first and sincere thanks goes to Prof. Paul Verschure, who from the beginning constantly encouraged and supported me in the development of my research ideas and goals with his supervision. Thanks also to all SPECS members for many useful insights, discussions and feedback.

My deepest gratitude to my great friend Cesar Renno-Costa for the extensive support, discussions and valuable overall collaboration that contributed crucially to this thesis. I would like to thank Zenon Mathews and Martí Sanchez-Fibla for invaluable support over these years. And my great friend Prof. Jonatas Manzolli for linking me with Paul and SPECS.

I'm deeply thankful to my parents, Ismael and Angela, and my brothers George and Mauly for all the love and support throughout this journey. They have always been there for me. And also my "youngest parents", Tila and Giancarlo. You have both contributed so much with your love and care!

My deepest and heartfelt thanks to my beloved Sassi. I don't have enough words to express how important you have been to me. Your support, company, love, lecker food and patience made everything easier and much more beautiful! Thanks and I love you.

Finally, I thank my heroes John, Paul, George, Ringo, Hendrix, Slash, Vaughan, EVH, Mr. Vernon and many others for providing me with the rock'n'roll stamina to write this dissertation!

Abstract

The mammalian visual system has a remarkable capacity to process a large amount of information within milliseconds, under widely varying conditions, into invariant representations. Recently, a model of the primary visual system exploited the unique feature of dense local excitatory connectivity of the neo-cortex to match these criteria. The model rapidly generates invariant representations by integrating the activity of spatially distributed model neurons into a so-called Temporal Population Code (TPC). In this thesis, we first investigate an issue that has persisted since the introduction of TPC: how to extend the concept to a biologically compatible readout stage. We propose a novel neural readout circuit based on the wavelet transform that decodes the TPC over different frequency bands. We show that, in comparison with the purely linear readouts used previously, the proposed system provides a robust, fast and highly compact representation of the visual input. We then generalize this optimized encoding-decoding paradigm to a number of robotic applications in real-world tasks to investigate its robustness. Our results show that complex stimuli such as human faces, hand gestures and environmental cues can be reliably encoded by TPC, which provides a powerful, biologically plausible framework for real-time object recognition. In addition, our results suggest that the representation of sensory input can be built into a spatio-temporal code that is interpreted and parsed into a series of wavelet-like components by higher visual areas.


Resum

The mammalian visual system has a remarkable capacity to process information on a timescale of milliseconds under highly variable conditions and to acquire invariant representations of this information. Recently, a model of the primary visual cortex exploited the dense local excitatory connectivity of the neocortex to model these capacities. The model rapidly integrates the spatially distributed activity of the neurons and generates invariant encodings called Temporal Population Codes (TPC). Here we investigate a question that has persisted since the introduction of TPC: to study a biologically plausible process capable of reading out these encodings. We propose a new neural readout circuit based on the wavelet transform that decodes the TPC signal over different frequency bands. We show that, compared with the purely linear readouts used previously, the proposed system provides a robust, fast and compact representation of the visual input. We also present a generalization of this optimized encoding-decoding paradigm, which we apply to different computer vision tasks and to vision in the context of robotics. The results of our study suggest that representations of complex visual scenes, such as human faces, hand gestures and environmental cues, can be encoded by TPC, which can be considered a powerful biological framework for real-time object recognition. Moreover, our results suggest that the representation of sensory input can be integrated into a spatio-temporal code interpreted and parsed into a series of wavelet components by higher visual areas.


Publications

Included in the thesis

Peer-reviewed

• Andre Luvizotto, Cesar Renno-Costa, and Paul F.M.J. Verschure. A wavelet based neural model to optimize and read out a temporal population code. Frontiers in Computational Neuroscience, 6(21):14, 2012c.

• Andre Luvizotto, Cesar Renno-Costa, and Paul F.M.J. Verschure. A framework for mobile robot navigation using a temporal population code. In Springer Lecture Notes in Computer Science - Living Machines, page 12, 2012b.

• Andre Luvizotto, Maxime Petit, Vasiliki Vouloutsi, et al. Experimental and Functional Android Assistant: I. A Novel Architecture for a Controlled Human-Robot Interaction Environment. In IROS 2012 (submitted), 2012a.

• Andre Luvizotto, Cesar Renno-Costa, Ugo Pattacini, and Paul Verschure. The encoding of complex visual stimuli by a canonical model of the primary visual cortex: temporal population coding for face recognition on the iCub robot. In IEEE International Conference on Robotics and Biomimetics, page 6, Thailand, 2011.

Other publications as co-author

• Cesar Renno-Costa, Andre Luvizotto, Alberto Betella, Marti Sanchez-Fibla, and Paul F.M.J. Verschure. Internal drive regulation of sensorimotor reflexes in the control of a catering assistant autonomous robot. In Lecture Notes in Artificial Intelligence: Living Machines, 2012.

• Maxime Petit, Stephane Lallee, Jean-David Boucher, Gregoire Pointeau, Pierrick Cheminade, Dimitri Ognibene, Eris Chinellato, Ugo Pattacini, Ilaria Gori, Giorgio Metta, Uriel Martinez-Hernandez, Hector Barron, Martin Inderbitzin, Andre Luvizotto, Vicky Vouloutsi, Yannis Demiris, and Peter Ford Dominey. The Coordinating Role of Language in Real-Time Multi-Modal Learning of Cooperative Tasks. IEEE Transactions on Autonomous Mental Development (TAMD), 2012.

• Cesar Renno-Costa, Andre L. Luvizotto, Encarni Marcos, Armin Duff, Martí Sanchez-Fibla, and Paul F.M.J. Verschure. Integrating Neuroscience-based Models Towards an Autonomous Biomimetic Synthetic. In 2011 IEEE International Conference on Robotics and Biomimetics (IEEE-ROBIO 2011), Phuket Island, Thailand, 2011. IEEE.

• Armin Duff, Cesar Renno-Costa, Encarni Marcos, Andre Luvizotto, Andrea Giovannucci, Marti Sanchez-Fibla, Ulysses Bernardet, and Paul Verschure. From Motor Learning to Interaction Learning in Robots, volume 264 of Studies in Computational Intelligence. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-05180-7. doi: 10.1007/978-3-642-05181-4. URL http://www.springerlink.com/content/v348576tk12u628h.

• Sylvain Le Groux, Jonatas Manzolli, Marti Sanchez, Andre Luvizotto, Anna Mura, Aleksander Valjamae, Christoph Guger, Robert Prueckl, Ulysses Bernardet, and Paul Verschure. Disembodied and Collaborative Musical Interaction in the Multimodal Brain Orchestra. In Proceedings of the International Conference on New Interfaces for Musical Expression, 2010. URL http://www.citeulike.org/user/slegroux/article/8492764.

Contents

Abstract xiv

Resum xv

Publications xvii

List of Figures xxiv

List of Tables xxxiii

1 Introduction 1
1.1 Early visual areas 3
1.2 Higher order visual areas: the extrastriate areas and object recognition in the brain 9
1.3 Coding strategies and mechanisms used by the neo-cortex to provide invariant representation of the visual information 11
1.4 How can the key components encapsulated in a temporal code be decoded by different cortical areas involved in the visual process? 15
1.5 Can TPC be used in the recognition of multi-modal sensory input such as human faces and gestures to provide form perception primitives? 18

2 A wavelet based neural model to optimize and read out a temporal population code 21
2.1 Introduction 22
2.2 Material and methods 27
2.3 Results 37
2.4 Discussion 46
2.5 Acknowledgments 52

3 Temporal Population Code for Face Recognition on the iCub Robot 53
3.1 Introduction 54
3.2 Methods 56
3.3 Results 67
3.4 Discussion 71

4 Using a temporal population code to recognize gestures on the humanoid robot iCub 75
4.1 Introduction 75
4.2 Material and methods 77
4.3 Results 85
4.4 Discussion 87

5 A framework for mobile robot navigation using a temporal population code 89
5.1 Introduction 90
5.2 Methods 92
5.3 Results 98
5.4 Discussion 100

6 Conclusion 103

A Distributed Adaptive Control (DAC) Integrated into a Novel Architecture for Controlled Human-Robot Interaction 109
A.1 Introduction 110
A.2 Methods 112
A.3 Results 127
A.4 Conclusions 130

Bibliography 133

List of Figures

1.1 Parallel pathways in the early visual system. Redrawn based on (Nassi and Callaway, 2009). 5

1.2 Schematics of receptive field responses in simple and complex cells: the V1 cells are orientation selective, i.e. the cell's activity increases or decreases when the orientation of edges and bars matches the preferred orientation φ of the cell. 8

1.3 Overview of the Temporal Population Code architecture. In the model proposed by Wyss et al. (2003b), the input stimulus passes through an edge-detection stage that approximates the receptive field characteristics of LGN cells. In the next stage, the LGN output is continuously projected to a network of laterally connected integrate-and-fire neurons with properties found in the primary visual cortex. Due to the recurrent connections, the visual stimulus becomes encoded over the network's activity trace. 13

2.1 The TPC encoding model. In a first step, the input image is projected to the LGN stage, where its edges are enhanced. In the next stage, the LGN output passes through a set of Gabor filters that resemble the orientation selectivity characteristics found in the receptive fields of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After the image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC. 28

2.2 The TPC encoding paradigm. The stimulus, here represented by a star, is projected topographically onto a map of interconnected cortical neurons. When a neuron spikes, its action potential is distributed over a neighborhood of a given radius. The lateral transmission delay of these connections is 1 ms/unit. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC representation is defined by the spatial average of the population activity over a certain time window. The invariances that the TPC encoding renders are defined by the local excitatory connections. 29

2.3 Computational properties of the two types of neurons used in the simulations: regular (RS) and burst spiking (BS). The RS neuron shows a mean inter-spike interval of about 25 ms (40 Hz). The BS type displays a similar inter-burst interval, with a within-burst inter-spike interval of approximately 7 ms (140 Hz) every 35 ms (28 Hz). 33

2.4 a) Neuronal readout circuit based on wavelet decomposition. The buffer cells B1 and B2 integrate, in time, the network activity, performing a low-pass approximation of the signal over two adjacent time windows given by the asynchronous inhibition received from cell A. The differentiation performed by the excitatory and inhibitory connections to W gives rise to a band-pass filtering process analogous to the wavelet detail levels. b) An example of band-pass filtering performed by the wavelet circuit, where only the frequency range corresponding to the resolution level Dc3 is kept in the spectrum. 34

2.5 The stimulus classes used in the experiments after the edge enhancement of the LGN stage. 36

2.6 The stimulus set. a) Image-based prototypes (no jitter applied to the vertices) and the globally most different exemplars, with normalized distance equal to one. The distortions can be very severe, as in the case of class number one. b) Histogram of the normalized Euclidean distances between the class exemplars and the class prototypes in the spatial domain. 38

2.7 Baseline classification ratio using the Euclidean distance among the images from the stimulus set in the spatial domain. 39

2.8 Comparison of the correct classification ratios for different resonance frequencies of the wavelet filters, for both types of neurons, RS and BS. The frequency bands of the TPC signal are represented by the wavelet coefficients Dc1 to Ac5 in a multi-resolution scheme. The network time window is 128 ms. 40

2.9 Speed of encoding. Number of bits encoded by the network's activity trace as a function of time. The RS-TPC and BS-TPC curves represent the bits encoded by the network's activity trace without the wavelet circuit. The RS-Wav and BS-Wav curves correspond to the bits encoded by the wavelet coefficients, using the Dc3 resolution level for RS neurons and the Dc5 level for BS neurons, respectively. For a time window of 128 ms, the Dc3 level has 16 coefficients and the Dc5 level has only 4 coefficients. The dots in the figure represent the moments in time where the coefficients are generated. 42

2.10 Single-sided amplitude spectrum of the wavelet prototype for each stimulus class used in the simulations. The signals x(t) were reconstructed in time using the wavelet coefficients from the Dc3 and Dc5 levels for RS and BS neurons, respectively. The shaded areas show the optimal frequency response of the Dc3 level (62 Hz to 125 Hz) and of the Dc5 level (15.5 Hz to 31 Hz). The less pronounced responses around 400 Hz are aliasing effects due to the signal reconstruction used to calculate the Fourier transform (see discussion). 45

2.11 Prototype-based classification hit matrices. For each class in the training set, we average the wavelet coefficients to form class prototypes. In the classification process, the Euclidean distance between the classification set and the prototypes is calculated. A stimulus is assigned to the class whose prototype has the smallest Euclidean distance. 46

2.12 Distribution of errors in the wavelet-based prototype classification in relation to the Euclidean distances within the prototyped classes. 47

3.1 Schematic of the lateral propagation using the Discrete Fourier Transform (DFT). The lateral propagation is performed as a filtering operation in the frequency domain. The matrix of spikes generated at time t is multiplied by the filters of lateral propagation Fd over the next time steps t + 1 and t + 2. The output is given by the inverse DFT of the result. 60

3.2 BASSIS is a multi-scale biomimetic architecture organized at three different levels of control: reactive, adaptive and contextual. It is based on the well-established DAC architecture. See text for further details. 62

3.3 Model overview. The faces are detected and cropped from the input provided by the iCub's camera image. The cropped faces are resized to a fixed resolution of 128x128 pixels and convolved with the orientation-selective filters. The output of each orientation layer is processed by separate neural populations, as explained above. The spike activity is summed over a specific time window, rendering the Temporal Population Code, or TPC. 66

3.4 Cropped faces from the Yale face database B used as a benchmark to compare TPC with other face recognition methods available in the literature (Georghiades et al., 2001; Lee et al., 2005). 67

3.5 Speed of encoding. Average correct classification ratio given by the network's activity trace as a function of time for different values of the input threshold Ti. 68

3.6 Response clustering. The entries of the hit matrix represent the number of times a stimulus class is assigned to a response class over the optimal time window of 21 ms and Ti of 0.6. 69

3.7 Face data set. Training set (green), classification set (blue) and misclassified faces (red). 70

3.8 Classification ratio using a spatial subsampling strategy where the network activity is read over multiple subregions. 70

3.9 Classification ratio using the cropped faces from the Yale database. 71

4.1 The gestures used in the Rock-Paper-Scissors game. In the game, the players usually count to three before showing their gestures. The objective is to select a gesture that defeats that of the opponent: rock breaks scissors, scissors cut paper and paper covers rock. Unlike with a truly random selection method, such as coin flipping or dice, in the Rock-Paper-Scissors game it is possible to recognize and predict the behavior of an opponent. 77

4.2 Visual model diagram. In the first step, the input image is color filtered in the hue, saturation and value (HSV) space and segmented from the background. The segmented image is then projected to the LGN stage, where its edges are enhanced. In the next stage, the LGN output passes through a set of Gabor filters that resemble the orientation selectivity characteristics found in the receptive fields of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After the image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC. The TPC output is then read out by a wavelet readout system. 78

4.3 Schematic of the encoding paradigm. The stimulus, a human hand, is continuously projected onto the network of laterally connected integrate-and-fire model neurons. The lateral transmission delay of these connections is 1 ms/unit. When the membrane potential crosses a certain threshold, a spike occurs and its action potential is distributed over a neighborhood of a given radius. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC signal is generated by summing the total population activity over a certain time window. 79

4.4 Computational properties, after frequency adaptation, of the integrate-and-fire neuron used in the simulations. The spikes in this regular spiking neuron model occur approximately every 25 ms (40 Hz). 80

4.5 Wavelet transform time-scale representation. At each resolution level, the number of wavelet coefficients drops by a factor of 2 (dyadic representation), as does the frequency range of the low-pass signal given by the approximation coefficients. The detail coefficients can be interpreted as a band-pass signal, with a frequency range equal to the difference in frequency between the current and previous approximation levels. 83

4.6 The stimulus classes used in the experiments. Here 12 exemplars of each class are shown. 84

4.7 Classification hit matrix for the baseline. In the classification process, the correlation between the classification set (25% of the stimuli) and the training set (75%) is calculated. A stimulus is assigned to the class with the highest mean correlation to the respective class in the training set. 86

4.8 Classification hit matrix. In the classification process, the Euclidean distance between the classification set and the prototypes is calculated. A stimulus is assigned to the class whose prototype has the smallest Euclidean distance. 87

5.1 Architecture scheme. The visual system exchanges information with both the attention system and the hippocampus model to support navigation. The attention system is responsible for determining which parts of the scene are considered for the image representation. The final position representation is given by the formation of place cells that show high rates of firing whenever the robot is in a specific location in the environment. 93

5.2 Visual model overview. The salient regions are detected and cropped from the input image provided by the camera. The cropped areas, subregions of the visual field, have a fixed resolution of 41x41 pixels. Each subregion is convolved with a difference-of-Gaussians (DoG) operator that approximates the properties of the receptive fields of LGN cells. The output of the LGN stage is processed by the simulated cortical neural population, and their spiking activity is summed over a specific time window, rendering the Temporal Population Code. 95

5.3 Experimental environment. a) The indoor environment was divided into a 5x5 grid of 0.6 m² square bins, comprising 25 sampling positions. b) For every sampling position, a 360-degree panorama was generated to simulate the rat's visual input. A TPC response is calculated for each salient region, 21 in total. The collection of TPC vectors for the environment is clustered into 7 classes. Finally, a single image is represented by the cluster distribution of its TPC vectors using a histogram of 7 bins. 98

5.4 Pairwise correlation of the TPC histograms in the environment. We calculate the correlation among all possible combinations of two positions in the environment. We average the correlation values into distance intervals, according to the sampling distances used in the image acquisition. This result suggests that we can produce a linear curve of distance based on the correlation values calculated using the TPC histograms. 99

5.5 Place cells acquired by the combination of TPC-based visual representations and the E%-max WTA model of the hippocampus. 100

A.1 The iCub ready to play with the objects that are on the Reactable. The table provides an appealing atmosphere for social interactions between humans and the iCub. 113

A.2 BASSIS is a multi-scale biomimetic architecture organized at three different levels of control: reactive, adaptive and contextual. It is based on the well-established DAC architecture. See text for further details. 115

A.3 Reactable system overview. See text for further explanation. 117

A.4 PMP architecture: at the lower level, the pmpServer module computes the trajectory, targets and obstacles in the space, feeding the Cartesian Interface with respect to the trajectory that has to be followed. The user relies on a pmpClient in order to ask the server to move, add and remove objects, and eventually to start new reaching tasks. At the highest level, the pmpActions library provides a set of complex actions, such as "Push" or "Move the object", simply implemented as combinations of primitives. 121

A.5 A cartoon of the experimental scenario. The distances for the left (L), middle (M) and right (R) positions are given with respect to the robot. 127

A.6 Box plots of GEx and GEy for the grasp experiments. The error in placing an object on the table increases from left to right with respect to the x axis. No significant difference was observed for y. 128

A.7 Box plots for the place experiment for the different conditions analyzed: source and target. Both conditions have an impact on the error. 129

A.8 Average time needed for the shared-plan tasks. The learning box represents the time needed for the human to create the task. The execution box represents the time needed in the interaction to perform the learned action. 131

List of Tables

2.1 Parameters used for the simulations. 32

3.1 Parameters used for the simulations. See text for further explanation. 59

3.2 Comparison of TPC with other face recognition methods. The results were extracted from Lee et al. (2005), where the reader can find the references for each method. 72

4.1 Parameters used for the simulations. 84

5.1 Parameters used for the simulations. 96

A.1 Excerpt of available OPC properties. 118

Chapter 1

Introduction

In our daily life we are constantly exposed to situations where recognizing different objects, faces, colors or even estimating the speed of a moving object plays a key role in the way we behave. To recognize different scenes under widely varying conditions, the visual system must solve the hard problem of building invariant representations of the available sensory information. Invariant representation of visual input is a key element for a number of tasks in different species. For example, it has been shown in foraging experiments with rats that the animal relies on visual landmarks to explore the environment and to produce an internal representation of the world. This strategy allows not only rats, but a number of other animals, to successfully navigate over different territories while searching for food (Tamara and Timberlake, 2011; Zoccolan et al., 2009).

Furthermore, the invariant representation of visual content must be performed very efficiently, capturing all the rich details necessary to distinguish among different situations in a fast and non-redundant way, so that prototypes of a class can emerge and be efficiently stored in memory and/or serve on-going action. Most work on invariance in the visual cortex is related to the primary visual cortex. Despite intense research in the field of vision and neuroscience, it remains unclear what the underlying cortical mechanisms responsible for building invariant representations are (Pinto et al., 2008). This question has been of great interest to the field of computational neuroscience and also to roboticists building artificial systems acting in real-world situations (Riesenhuber and Poggio, 2000), because biological systems far outperform advanced artificial systems in solving real-world visual tasks.

In classical models of visual perception, invariant representations emerge in the form of activity patterns at the highest level of a hierarchical multi-layer network of spatial feature detectors (Fukushima, 1980a; Riesenhuber and Poggio, 1999; Serre et al., 2007). These models are extensions of the Hubel and Wiesel simple-to-complex cell hierarchy. In this approach, invariances are achieved at the cost of increasing the number of connections between the different layers in the hierarchy. However, these models seem to be based on a fundamental assumption that is not consistent with cortical anatomy.

Another inconsistency of traditional artificial neural networks, such as the perceptron and multi-layer perceptrons (Rosenblatt, 1958; Rumelhart et al., 1986), is their time independence. These models were designed to deal with purely static spatial patterns of input. Time is treated as just another dimension, which makes those models extremely hard-wired. From the biological perspective, it is clear that time is not treated as another spatial dimension at the input level. Thus a central question in this discussion is how time can be incorporated into the representation of a spatial input.

In this dissertation we address this issue by proposing a plausible circuitry for cortical processing of sensory input. We develop a model of the early visual system that aims to capture the key ingredients necessary to build a robust, invariant and compact representation of the visual world. The proposed model is fully grounded in physiological and anatomical properties of the mammalian visual system and gives a functional interpretation to the dense intra-cortical connectivity observed in the cortex. In this chapter, we start with a brief review of the brain areas involved in the representation and categorization of objects, defining in detail the questions and the proposed solutions.

1.1 Early visual areas

Retina

The sensory input organ for the visual system is the eye, where the first steps of seeing begin in the retina (Nassi and Callaway, 2009). The light that hits the eye passes through the cornea and the lens before arriving at the retina. The retina contains a dense array of photoreceptors, called rods and cones, that encode the intensity of light as a function of 2D position, wavelength and time. The rods are in charge of scotopic vision, or night vision, and get saturated in daylight. They are very sensitive to light and can respond to a single photon. The cones are much less sensitive to light and are thus responsible for daylight vision. Cones come in three subtypes with distinct spectral absorption functions: short (S), medium (M), and long (L), which form the basis for chromatic vision. A single cone by itself is color blind in the sense that its activation depends both on the wavelength and on the intensity of the stimulus. The signals from different classes of photoreceptors must be compared in order to access the incoming information about color (Conway et al., 2010). The existence of cone-opponent retinal ganglion cells (RGC) that perform such comparisons is well established in primates.

The axons of RGCs form the optic nerve connecting the eye to the brain. Three types of RGC have been particularly well characterized. Morphologically defined ON and OFF midget RGCs are the origin of the parvocellular pathway; they provide an L versus M color-opponent signal to the parvocellular layers of the lateral geniculate nucleus (LGN), which is often associated with red-green colors. Parasol RGCs constitute the magnocellular pathway and convey a broadband, achromatic signal to the magnocellular layers of the LGN. Cells in this pathway have large receptive fields that are highly sensitive to contrast and spatial frequency. Finally, small and large bistratified RGCs compose the koniocellular pathway and convey an S versus L+M, or blue-yellow, color-opponent signal to the koniocellular layers of the LGN (Nassi and Callaway, 2009).

In both areas, retina and LGN, the cells have center-surround receptive fields (Lee et al., 2000). These neurons show ON-center/OFF-surround responses to light intensity, or vice versa: an ON-center cell increases its activity when light hits the center of its receptive field and decreases its activity when light hits the surround, and vice versa for OFF-center cells.

Most researchers argue that the retina's principal function is to convey the visual image through the optic nerve to the brain. However, recent results suggest that already at this early stage of the visual pathway the chromatic or achromatic characteristics of a stimulus are encoded by a complex temporal response of RGCs (Gollisch and Meister, 2008, 2010).

Lateral geniculate nucleus

The output of these specialized channels is projected to the LGN of the thalamus. The primate LGN has a laminar organization, divided into 6 layers that receive information from the RGCs. The most ventral layers (layers 1-2) receive input from the magnocellular pathway, while the remaining layers receive input from the parvocellular pathway. Wiesel and Hubel (1966) found that monkey neurons in the magnocellular layers of the LGN were largely color-blind, with achromatic, large and highly contrast-sensitive receptive fields, whereas the neurons in the parvocellular layers had smaller, colour-opponent and poorly contrast-sensitive receptive fields. Further work revealed that the koniocellular pathway connects in segregated stripes between the parvocellular and magnocellular layers of the LGN (reviewed by Callaway (2005) and Shapley and Hawken (2011)) (Fig. 1.1 a).

Figure 1.1: Parallel pathways in the early visual system. Redrawn based on (Nassi and Callaway, 2009). [A) Midget, small bistratified and parasol ganglion cells carry red-green colour-opponent, blue-yellow colour-opponent and luminance signals from the retina to the parvocellular, koniocellular and magnocellular layers of the LGN, respectively. B) LGN projections reach V1 layers 2/3, 4A, 4B, 4Cα, 4Cβ, 5 and 6.]

In the primary visual cortex, the output of the parvo-, magno- and koniocellular cells from the LGN remains segregated. Parvo cells terminate in layer 4Cβ, whereas magno cells innervate layer 4Cα. Konio cells provide the input to layers 2/3. These distinct anatomical projections persuaded early investigators that the parvo and magno channels remain functionally isolated in V1, which is still a source of controversy (Nassi and Callaway, 2009) (Fig. 1.1 a).

The receptive fields of LGN cells exhibit an ON-OFF response to light intensity. Modeling studies have used filters based on a difference of Gaussians (DoG) to approximate the receptive field characteristics of LGN cells (Einevoll and Plesser, 2011; Rodieck and Stone, 1965). From an image processing perspective, DoG filters can perform edge enhancement over different spatial frequencies.
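To make the DoG idea concrete, the following sketch builds such a filter and applies it to a toy image; the kernel size and the two Gaussian widths are illustrative assumptions, not the parameters used in the thesis.

```python
import numpy as np
from scipy.signal import convolve2d

def dog_kernel(size=15, sigma_center=1.0, sigma_surround=2.0):
    """Difference-of-Gaussians kernel approximating an ON-center/
    OFF-surround LGN receptive field (widths are illustrative)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    center = np.exp(-r2 / (2 * sigma_center**2)) / (2 * np.pi * sigma_center**2)
    surround = np.exp(-r2 / (2 * sigma_surround**2)) / (2 * np.pi * sigma_surround**2)
    return center - surround

def lgn_stage(image):
    """Edge-enhance an image, analogous to the LGN preprocessing stage."""
    return convolve2d(image, dog_kernel(), mode="same", boundary="symm")

# Toy usage: a bright square on a dark background; the filter response
# is strongest along the square's edges, as expected for a DoG.
img = np.zeros((64, 64))
img[20:44, 20:44] = 1.0
edges = lgn_stage(img)
```

Varying the ratio of the two widths shifts the band of spatial frequencies that the filter enhances.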

Primary visual cortex

In the visual cortex, different properties of objects - their distances, shapes, colors and direction of motion - are segregated into separate cortical areas (Zeki and Shipp, 1988). The earliest stage is the primary visual cortex (V1, striate cortex or area 17) that is characterized by columnar organization, where neurons with similar response properties are grouped vertically (Mountcastle, 1957).

The primary visual cortex shows the stereotypical layered structure observed in the neo-cortex. It is structured into 6 layers, numbered from 1 to 6 from the surface to the depth (Fig. 1.1 b). The projection from LGN to V1 follows a non-random relationship between an object's position in visual space and its position in the cortex, in retinal coordinates. This topological arrangement, where nearby regions on the retina project to nearby cortical regions, is called retinotopic (Cowey and Rolls, 1974).

In V1, there are two types of excitatory cells, known as pyramidal and stellate. Both of these cell types receive direct feedforward input into layer 4C from cells in the magnocellular layers of the thalamus. Pyramidal cells have large apical dendrites that pass above layer 4B and into layer 2/3. The apical dendrites of pyramidal cells receive connections from the parvocellular pathway in layer 4Cβ of V1 and also project into layer 2/3. Indeed, most of the feed-forward input into layers 2/3 originates from layers 4C and 4B. Only a few inputs come from the koniocellular layers of the LGN.

Neurons in V1 exhibit selectivity for the orientation of a visual stimulus (Hubel and Wiesel, 1962; Ringach et al., 2002; Nauhaus et al., 2009). They are distributed in orientation maps, where some neurons lie in regions of fairly homogeneous orientation preference, while others lie in regions with a variety of preferences, called pinwheel centers (Nauhaus et al., 2008).

V1 neurons are divided into two classes according to the properties of their receptive fields: simple cells and complex cells (Hubel and Wiesel, 1962). Similarly to retinal ganglion and geniculate cells, simple cells show distinct excitatory and inhibitory subdivisions. They respond best to elongated bars or edges in a preferred direction. Complex cells are more concerned with the orientation of a stimulus than with its exact position in the receptive field (Fig. 1.2).
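Simple-cell orientation selectivity of this kind is commonly modeled with Gabor filters, and the TPC models in later chapters likewise feed the V1 stage with a bank of orientation-selective Gabor filters. The sketch below is a minimal, illustrative version of such a filter; all parameter values are assumptions.

```python
import numpy as np

def gabor_kernel(size=21, wavelength=8.0, theta=0.0, sigma=4.0, phase=0.0):
    """Gabor filter: a sinusoidal grating under a Gaussian envelope, the
    standard model of a V1 simple cell preferring orientation theta."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)    # rotate into the
    yr = -xx * np.sin(theta) + yy * np.cos(theta)   # filter's frame
    envelope = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength + phase)
    return envelope * carrier

# A small orientation bank, one filter per preferred orientation; an edge
# drives most strongly the filter whose orientation matches it.
bank = [gabor_kernel(theta=t) for t in np.deg2rad([0, 45, 90, 135])]
```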

The excitatory circuitry in the primary visual cortex is characterized by dense local connectivity (Callaway, 1998; Briggs and Callaway, 2001). It is estimated that 95% of the synapses in V1 are local or feedback connections from other visual areas (Sporns and Zwi, 2004). Within the same area, the projections can link neurons over distances of several millimeters, spatially distributed into clusters of the same stimulus preference (Stettler et al., 2002). It has been proposed that V1 selectivity to stimuli with similar orientation preference is mediated by long-range horizontal connections intrinsic to V1 (Stettler et al., 2002). This rule has established the notion of a like-to-like pattern of connectivity (Gilbert and Wiesel, 1989). However, some recent experiments have found conflicting results: pyramidal neurons lying close to a population of diffuse orientation selectivity (pinwheel centers) reportedly connect laterally to orientation columns in a cross-oriented or non-selective way (Yousef et al., 2001).

Figure 1.2: Schematics of receptive field responses in simple and complex cells. The V1 cells are orientation selective, i.e. the cell's activity increases or decreases when the orientation of edges and bars matches the preferred orientation φ of the cell. [Panels: spike responses of a simple-cell and a complex-cell receptive field to an oriented bar, stimulus on between 0.25 and 0.75 sec.]

There are also massive feedback projections from higher visual areas, for example V2. Compared with the feedforward and lateral connections, the feedback projections to V1 have received less attention from the scientific community (Sincich and Horton, 2005). Feedback connections from higher cortical areas provide more diffuse and divergent input to V1 (Salin et al., 1989). At first, anatomical studies suggested that V2 and V1 cells are preferentially connected when they share similar preferred orientations (Gilbert and Wiesel, 1989). In contradiction, recordings in area V1 of the macaque monkey show that the inactivation of V2 has no effect on the center/surround interactions of V1 neurons. The only effect observed upon inactivation of V2 feedback is a decrease in the response to a single oriented bar in about 10% of V1 neurons (Hupe et al., 2001). Indeed, recent modeling studies confirm that tuning to oriented stimuli is highly prevalent in V1 only if it operates in a regime of strong local recurrence (Shushruth et al., 2012).


Studies in monkeys have investigated the possible influence of horizontal connections and feedback from higher cortical areas on contour integration. The results suggest that V1 intrinsic horizontal connections provide a more likely substrate for contour integration than feedback projections from V2 to V1. However, the exact mechanism is still unclear.

In recent years, much has been learned about V1; however, the functional role of the lateral connectivity (horizontal or feedback) remains unknown, especially in terms of stimulus encoding and its relationship with the dense local connectivity observed in this area.

1.2 Higher order visual areas: the extrastriate areas and object recognition in the brain

The extrastriate cortex comprises all the visual areas beyond the striate cortex, or V1. It includes areas V2, V3, V4, IT (inferior temporal cortex) and MT (medial temporal cortex). The earlier visual areas, up to V4, appear to have clear retinotopic maps, with cells tuned to different features such as orientation, color or spatial frequency. The consecutive stages reveal receptive fields with increasingly complex selectivity. This is particularly evident in IT, which is thought to play a major role in object recognition and categorization (Logothetis and Sheinberg, 1996).

Early lesion studies showed that a complete removal of both temporal lobes in monkeys resulted in a collection of strange symptoms, most remarkably the inability to recognize objects visually. More specifically, lesions in the inferior temporal lobe produced severe and permanent deficits in learning and remembering to recognize stimuli, resulting in visual agnosias. For example, after bilateral or right-hemisphere damage to IT cortex, humans reported difficulties in recognizing faces (prosopagnosia), colors (achromatopsia) and other more specific objects.


Area IT receives visual information from V1 through a serial pathway called the ventral visual pathway (V1-V2-V4-IT). IT projects to various brain areas outside the visual cortex, including the prefrontal cortex, the perirhinal cortex (areas 35 and 36), the amygdala, and the striatum of the basal ganglia (Tanaka, 1996).

The neurons in IT have some properties that help to explain the crucial role that this area plays in pattern recognition. IT cells respond only to visual stimuli. Their receptive fields always include the center of gaze and tend to be larger than in V1, leading to stimulus generalization within the receptive field. IT neurons usually respond more to complex than to simple shapes, but also to color. A small percentage of IT neurons are selective for facial images. Recent fMRI studies in macaque monkeys found a specific area in the temporal lobe that is activated much more by faces than by non-face objects (Tsao et al., 2003). Single-unit recordings subsequently confirmed this (Tsao et al., 2006). Middle face patch neurons detect and differentiate faces using a strategy that is both part-based (constellations of face parts) and holistic (presence of a whole, upright face) (Freiwald et al., 2009).

These findings point towards the concept of grandmother cells or gnostic units, i.e., hypothetical neurons that respond only to a highly complex, specific, and meaningful stimulus, such as the image of one's grandmother (Gross, 2002). However, the idea of localist representations in the brain has been controversial. On one side of the spectrum, those who advocate in favor of localist representations claim that these theories of perception and cognition are more consistent with neuroscience (reviewed by Bowers (2009)). On the other side, parallel distributed processing theories of cognition claim that knowledge is coded in a distributed manner in mind and brain (reviewed by O'Reilly (1998)). First, it would be far too risky for the nervous system to rely too heavily on selectivity: well-placed damage to a small number of cells could make one never recognize one's grandmother again. In addition, it is now well established that higher information-processing areas return information to the lower ones, so that information travels in both directions between the modules, and not just upward.

In area IT, neurons are usually invariant to changes in contrast, stimulus size, color and position within the retina (Tanaka, 1996).

1.3 Coding strategies and mechanisms used by the neo-cortex to provide invariant representation of the visual information

In classical models that mimic natural vision systems, invariant representations emerge in the form of activity patterns at the highest level of the network, by virtue of spatial averaging across the feature detectors in the preceding layers. These models are based on the so-called Neocognitron (Fukushima, 1980b, 2003; Poggio and Bizzi, 2004; Chikkerur et al., 2010), a hierarchical multilayer network of detectors with varying feature and spatial tuning. In this approach, recognition is based on a large dictionary of features stored in memory. Filters selective to specific features, or combinations of features, are distributed over different network layers.

The underlying coding mechanism used in hierarchical models is rate coding, i.e., the mean firing rate of a neuron is directly related to its selectivity for the applied stimulus. In this way, hierarchical models rest on the assumption that a cortical cell is tuned to a local and specific feature using rate coding. A neuron in a rate-coding network has a binary response to a given stimulus, encoding one bit of information. Thus, in this approach, invariances to, for instance, position, scale and orientation are achieved at the high cost of increasing the number of neurons and connections between layers to cover all possible kinds of inputs the network can receive. Given the dimensionality of the human visual space, the total number of visual features in the world would require far more cells than are available in the cortex.


Even under the assumption that neurons do use rate coding, more information could be provided by the network using a population of neurons. In particular, a group of cells, each responsive to a different feature, can cooperatively code for a complete subspace of sensory inputs. In the motor cortex, for instance, the direction of movement was found to be uniquely predicted by the action of a population of motor cortical neurons (Georgopoulos et al., 1986). These models have been criticized because the recorded neural substrate probably reflects only a fraction of the patterns present in the brain, and also because this scheme requires a minimum degree of heterogeneity in the neuronal substrate, which might not be anatomically true.
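As a rough illustration of how a population can jointly code a complete stimulus subspace even under rate coding, the sketch below implements the classical population-vector reading of Georgopoulos et al. (1986); the cosine tuning curves and all numerical values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 100
preferred = rng.uniform(0, 2 * np.pi, n_cells)  # preferred directions

def rates(stim_dir, baseline=10.0, gain=8.0):
    """Cosine tuning: each cell fires most for its preferred direction."""
    return baseline + gain * np.cos(stim_dir - preferred)

def population_vector(r):
    """Decode direction as the rate-weighted sum of preferred directions."""
    return np.arctan2((r * np.sin(preferred)).sum(),
                      (r * np.cos(preferred)).sum())

true_dir = np.deg2rad(130.0)
decoded = population_vector(rates(true_dir))
# decoded approximates true_dir: no single cell identifies the direction,
# but the population as a whole does.
```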

An alternative for such a coding system is to include time as an extra coding dimension. It has been shown that the precise spike timing of single neurons can be successfully used to encode information (reviewed by Tiesinga et al. (2008)). This encoding scheme has been called temporal coding. The difference between a rate and a temporal code can become blurred if the spiking rates are computed over very small time bins, but the conceptual idea is clear. It is intuitive that, for a single neuron, the potential information content of precise and reliable spike times is many times larger than that contained in the firing rate, which is averaged across a typical interval of a hundred milliseconds (Reinagel and Reid, 2000).

Recently, a model of the primary visual cortex used the concept of a temporal code within a population of neurons to show that the temporal dynamics of a recurrently coupled neural network can encode the spatial features of a 2D stimulus (Wyss et al., 2003b,a). This novel idea introduced a new type of coding in which spatial information is translated into a temporal population code (TPC). The model gives a functional interpretation to the dense excitatory local connectivity found in V1 and suggests an encoding paradigm that is invariant to a number of visual transformations.

In this approach, the input stimulus is initially processed by an LGN-like circuit and topographically projected onto a V1 network of neurons organized in a bi-dimensional Cartesian space with dense local and symmetrical connectivity. The output of the network is a compressed representation of the stimulus, captured in the temporal evolution of the population's spike activity (Fig. 1.3).

Figure 1.3: Overview of the Temporal Population Code architecture. In the model proposed by Wyss et al. (2003b), the input stimulus passes through an edge-detection stage that approximates the receptive field characteristics of LGN cells. In the next stage, the LGN output is continuously projected to a network of laterally connected integrate-and-fire neurons with properties found in the primary visual cortex. Due to the recurrent connections, the visual stimulus becomes encoded in the network's activity trace. [Diagram: input image → retina → LGN → retinotopic projection onto modeled V1 neurons with lateral spike spreading → spike activity over time → TPC readout.]
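The sketch below gives a minimal, illustrative rendering of this encoding loop: a grid of leaky integrate-and-fire units with local lateral connections and distance-dependent delays of 1 ms per grid unit (the delay stated for the model), whose spatially summed spike count per millisecond forms the TPC. Grid size, time constants, weights and the toroidal boundary are simplifying assumptions, not the parameters of Wyss et al. (2003b).

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, radius = 32, 128, 3                  # grid size, window (ms), reach
tau, v_thresh, v_reset = 10.0, 1.0, 0.0    # illustrative LIF parameters
w_lat = 0.05                               # assumed lateral weight

stimulus = (rng.random((N, N)) > 0.8).astype(float)  # stand-in LGN output

# history[k] holds the spike map emitted k+1 ms ago, so input from a
# neighbour at grid distance d arrives after a delay of d ms (1 ms/unit).
history = np.zeros((radius, N, N))
v = np.zeros((N, N))
tpc = np.zeros(T)

offsets = [(dy, dx)
           for dy in range(-radius, radius + 1)
           for dx in range(-radius, radius + 1)
           if 0 < max(abs(dy), abs(dx)) <= radius]

for t in range(T):
    lateral = np.zeros((N, N))
    for dy, dx in offsets:
        d = max(abs(dy), abs(dx))          # distance sets the delay
        lateral += w_lat * np.roll(np.roll(history[d - 1], dy, axis=0),
                                   dx, axis=1)      # toroidal wrap-around
    v += (-v + stimulus + lateral) / tau   # leaky integration
    spikes = (v >= v_thresh).astype(float)
    v = np.where(spikes > 0, v_reset, v)   # reset the neurons that fired
    history = np.roll(history, 1, axis=0)  # advance the delay line
    history[0] = spikes
    tpc[t] = spikes.sum()                  # spatial sum -> temporal code
```

Because the readout is a spatial sum, translating or rotating the stimulus leaves the temporal trace largely unchanged, which is the source of the invariances discussed next.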

An intrinsic characteristic of TPC is its independence of the precise position of the stimulus within the network. The dynamics of a neural population do not depend on which neurons are involved and are invariant to their topological arrangement. Therefore, the spatial-to-temporal transformation of TPC provides a high-capacity encoding that is invariant to position and rotation. The TPC coding scheme captures the relationship among the core geometrical components of the visual input. Up to a certain degree, distortions in form are smoothed out by the lateral interactions, which makes the encoding robust to the small geometrical deformations found, for instance, in handwritten characters (Wyss et al., 2003a).

From a theoretical perspective, the TPC architecture is fully generic with respect to the input stimulus, because the network comprises both the coding substrate and the mechanisms needed to encapsulate local rules of computation. In this sense, there is no need to re-wire the network for different inputs. The wiring-independence of TPC makes it far more general than purely hierarchical models: the network can receive arbitrary types of input without incorporating filters selective to the incoming features.

From a biological perspective, there are a number of physiological studies that support the encoding scheme of TPC. Work with salamanders reported that, already in the retina, certain ganglion cells encode the spatial structure of a briefly presented image in the relative timing of their first spikes (Gollisch and Meister, 2008). In an experiment where eight stimuli were used (perfect encoding would correspond to 3 bits), the spike latency of a ganglion cell transmitted up to 2 bits of information on a single trial. Indeed, by simply plotting the recorded differential spike latencies as a gray-scale code, Gollisch and Meister (2008) could obtain a rather faithful neural representation of a raw natural image presented to the retina of the animal.
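A schematic reading of such a first-spike latency code can be sketched as follows (this is an illustration of the principle, not the model of Gollisch and Meister (2008); the time scale and the linear intensity-to-latency mapping are assumptions).

```python
import numpy as np

def encode_latency(image, t_max=100.0):
    """First-spike latency code: brighter pixels spike earlier."""
    return t_max * (1.0 - np.clip(image, 0.0, 1.0))   # latency in ms

def decode_latency(latencies, t_max=100.0):
    """Read the differential latencies back as gray levels."""
    return 1.0 - latencies / t_max

rng = np.random.default_rng(2)
img = rng.random((8, 8))
recovered = decode_latency(encode_latency(img))
assert np.allclose(img, recovered)   # the latencies preserve the image

# If latency is quantized into 4 distinguishable bins, a single spike time
# carries up to log2(4) = 2 bits per trial, the order of magnitude reported
# for the salamander ganglion cells.
bins = np.digitize(encode_latency(img).ravel(), np.linspace(0, 100, 5)[1:-1])
```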

A recent work investigated how populations of neurons in area V1 represent time-changing sensory input through a spatio-temporal code (Benucci et al., 2009). The results showed that the population activity is attracted towards the orientation of consecutive stimuli. Therefore, using a simple linear decoder based on the weighted sum of past stimulus orientations, higher cognitive areas can predict with high accuracy the population responses to changes in orientation.

A study with monkeys found that in the prefrontal cortex information about stimulus categories is encoded by a temporal population code (Barak et al., 2010). Monkeys were trained to distinguish among different trials, separated by a time delay, of vibro-tactile stimuli applied to their fingertips. Surprisingly, the population state consistently reflected the vibrational frequency during the stimulus. Moreover, substantial variations of the population state were observed between the stimulus and the end of the delay. These findings challenge the standard view that information in working memory is exclusively encoded by the spiking activity of dedicated neuronal populations.

Population codes have been associated with the oscillatory discrimination of sensory input in a number of species. In the auditory cortex of birds, the temporal responses of neuron populations allow for the intensity-invariant discrimination of songs (Billimoria et al., 2008). Similarly, in the early stages of sensory processing of electric fish (Marsat and Maler, 2010), discharges of electrical signals are used to communicate courtship (big chirps, around 300 Hz up to 900 Hz) or aggressive encounters (small chirps, around 100 Hz), and these are encoded by different populations of neurons. Big chirps are accurately described by a population of pyramidal neurons using a linear representation of their temporal features, whereas small chirps are encoded by synchronous bursting of populations of pyramidal neurons. In the insect olfactory system, the neurons of the antennal lobe display stimulus-induced temporal modulations of their firing rate in a temporal code (Carlsson et al., 2005; Knusel et al., 2007).

1.4 How can the key components encapsulated in a temporal code be decoded by different cortical areas involved in the visual process?

In recent years, a growing number of results have supported the idea that the brain uses temporal signals, in the form of oscillations, to link ongoing processes across different brain areas (Buzsaki, 2006).

The hippocampus and prefrontal cortex (PFC) are structures that make use of these channels of information to communicate with other brain circuits. These areas are hubs of communication that orchestrate the signals originating in many cortical and subcortical areas, promoting cognitive functions such as working memory, memory acquisition and consolidation, and decision making (Benchenane et al., 2011). In a recent study, Battaglia and McNaughton (2011) showed that during memory maintenance PFC oscillations are in phase with hippocampal oscillations. However, when there are no working memory demands, coherence breaks down and no consistent phase relationship is observed.

Both the hippocampus and the PFC receive converging input from higher sensory/associative areas. Gamma oscillations have been mostly studied and understood in the early and intermediate visual cortex. In primates, these fast oscillations play a key role in the attentional process underlying stimulus selection. Attention causes both firing rate increases and synchrony changes through a top-down mechanism mediated by the PFC (reviewed by Noudoost et al. (2010) and Tiesinga and Sejnowski (2009a)). For example, the reception of information from higher visual areas such as V4 can be tuned in on a V1 column whose receptive field contains a relevant stimulus, resulting in a mechanism that steers visual attention (Womelsdorf et al., 2007).

The TPC signal intrinsically carries the visual information through a set of encapsulated oscillations over different frequency ranges. The temporal code is in tune with the notion that sensory information travels throughout the brain at different wavelengths. In this context, if the TPC plays a role in encoding global stimulus features in a compact temporal representation, it is relevant to understand what its key coding features are and how the signal can be decoded into different components by higher visual areas through an efficient readout system.

In the past years, different solutions for reading out the TPC were proposed. Recently, a TPC readout based on the so-called Liquid State Machine, or LSM (Doetsch, 2000), was developed by Knusel et al. (2004). LSM networks are examples of reservoir computing networks in which the dense local circuits of the cerebral cortex are implemented as a large set of nearly randomly defined filters. The approach of Knusel et al. (2004), however, showed itself to be inefficient compared to traditional linear decoders based purely on correlations. Moreover, an additional layer of hundreds of integrate-and-fire neurons is required for the readout. Furthermore, the LSM strategy did not account for the different frequencies of oscillation present in the TPC. Consequently, the readout system has remained an open issue.

In this dissertation, we followed a different strategy to solve this issue. We proposed a biologically plausible readout circuit based on wavelet transforms (Luvizotto et al., 2012c). The underlying hypothesis is that a population of readout neurons tuned to different spectral bands can extract the different content encoded in the temporal signal. Our hypothesis is in line with recent findings suggesting that frequency bands of coherent oscillations constitute spectral fingerprints of the canonical computational mechanisms involved in cognitive processes (Siegel et al., 2012).

In the model we use a synchronous inhibition system oscillating in the gamma frequency range to simulate the effect of a Haar wavelet transform. Over windows of low inhibitory conductance, the TPC signal is integrated and differentiated, producing a band-pass filter response. The frequency response of the output is controlled by varying the synchronous inhibition time. In the proposed system, the readout can thus be tuned to specific frequency ranges, for instance the gamma band. Also in this scenario, interference added by oscillations in different frequency bands can be filtered out, enhancing the quality of the information that is captured. In this sense, the wavelet readout could account for the demultiplexing of different neuronal responses, for example of motion and orientation, that are encoded in V1 (Onat et al., 2011).

In addition, the mammalian visual system has a remarkable capacity for processing a large amount of visual information within dozens of milliseconds (Thorpe et al., 1996; Fabre-Thorpe et al., 1998). In this sense, the readout system must be fast, compact and therefore non-redundant. These are well-known advantages of wavelet transforms (Mallat, 1998). The proposed circuit reads out the signal using only a few coefficients. In particular, the simplicity and orthogonality of the Haar wavelet used in the readout circuit leads to a very economic neuronal substrate: the implementation requires only four neurons.

In chapter 2, we present all the details of the proposed neuronal wavelet readout circuit, together with the methods and results addressing its strengths and limitations.

1.5 Can TPC be used in the recognition of multi-modal sensory input such as human faces and gestures to provide form perception primitives?

In this thesis, the original TPC model was revisited. We fully investigate, under realistic and natural conditions, the capabilities that the TPC showed previously in simulations (Wyss et al., 2003b,a), virtual environments (Wyss and Verschure, 2004b) and a controlled robot arena (Wyss et al., 2006).

If the TPC encoding scheme provides a good representation of the visual input, it is relevant to investigate what its boundaries of robustness and reliability are in complex real-world environments. Another relevant issue for real-time applications is the computational requirement of processing the massive number of recurrent connections in the network. Can robustness in the encoding be achieved at a reasonable processing speed? We know from a theoretical point of view that the TPC is built around a generic encoding mechanism. Can the TPC be used as a unified and canonical model of cortical computation that can be extended to other stimulus modalities to generate perception primitives?

Using different and more realistic types of modeled neurons and an overall new, optimized architecture configuration, the original model was completely re-designed. To address these questions we first exposed the new TPC network to one of the most complex stimuli we are frequently exposed to: human faces (Zhao et al., 2003; Martelli et al., 2005; Sirovich and Meytlis, 2009).

In chapter 3, we present the details of the face recognition system¹ based on the TPC model that was implemented on the humanoid robot iCub (Luvizotto et al., 2011). To allow for real-time computation we resorted to a standard method of signal processing to develop a novel way of calculating the lateral interactions in the network. The lateral connections were interpreted as discrete finite impulse response (FIR) filters and the spreading of activity was calculated using 2D convolutions. In addition, all the convolutions were calculated as multiplications in the frequency domain. This simple signal processing technique yields much faster processing times and also allows much larger networks in comparison to pre-assigned maps of connectivity stored in memory.
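To make this optimization concrete, the sketch below shows one step of lateral spreading computed as a multiplication of FFTs. It is a minimal illustration, not the library code: the names are ours, the connectivity kernel is an instantaneous disc of radius 7 and weight 0.4, and the per-distance transmission delays of the full model are ignored.

    import numpy as np

    def lateral_spread_fft(spikes, kernel):
        # Linear 2D convolution of the spike map with the connectivity
        # profile, computed as a product of FFTs (zero-padded to avoid
        # circular wrap-around), then cropped back to the input size.
        H, W = spikes.shape
        kh, kw = kernel.shape
        fh, fw = H + kh - 1, W + kw - 1
        out = np.fft.irfft2(np.fft.rfft2(spikes, s=(fh, fw)) *
                            np.fft.rfft2(kernel, s=(fh, fw)), s=(fh, fw))
        r0, c0 = (kh - 1) // 2, (kw - 1) // 2
        return out[r0:r0 + H, c0:c0 + W]

    rng = np.random.default_rng(0)
    spikes = (rng.random((80, 80)) < 0.05).astype(float)         # one layer's spike map
    yy, xx = np.mgrid[-7:8, -7:8]
    disc = ((xx ** 2 + yy ** 2) <= 7 ** 2).astype(float) * 0.4   # radius 7, weight 0.4
    conductance = lateral_spread_fft(spikes, disc)

Zero-padding to the full linear-convolution size avoids the wrap-around artifacts of a circular convolution before cropping back to the network size.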

The model was benchmarked using images from the iCub's camera and from a standard database, the Yale Face Database B (Georghiades et al., 2001; Lee et al., 2005). This way we could compare the performance of the TPC with standard face recognition methods available in the literature. This work was awarded as one of the top five papers presented at the 2011 edition of the IEEE International Conference on Robotics and Biomimetics (ROBIO). Another contribution of this work is a complete TPC library, developed and made publicly available², that can be freely used by the robotics community.

With the face recognition results, we could address the questions regarding the robustness of the TPC in real-time and real-world scenarios. To finally address the generality of the TPC, we designed two more test scenarios involving different tasks and a different robot platform.

In chapter 4, we extended the experiments performed with faces on the iCub to gesture recognition. The experiments were motivated by the game Rock-Paper-Scissors. In this human-robot interaction scenario, the robot has to recognize the gesture thrown by the human to reveal who the winner is. The results show that, with the same network, the robot can reliably recognize the gestures used in the game. Furthermore, the wavelet circuit was fully integrated into the C++ library, showing a remarkable gain in recognition rate versus compression.

¹iCub never forgets a face: a video of the face recognition system on iCub. http://www.robotcompanions.eu/blog/2012/01/2820/
²efaa.sourceforge.net

To further stress the generality of the TPC, in chapter 5 the proposed model is implemented in a mobile robot and used for navigation. Based on the camera input, the model represents the distance relationships of the robot over different positions in a real-world arena. The TPC representation feeds a hippocampus model that accounts for the formation of place fields (Luvizotto et al., 2012b).

These results finally demonstrate that the TPC encoding is truly generic: the same network implementation can feed the working memory with different stimulus classes without any need for rewiring, while remaining fast and invariant to most transformations found in real-world environments.

Chapter 2

A wavelet based neural model to optimize and read out a temporal population code

It has been proposed that the dense excitatory local connectivity of the

neo-cortex plays a specific role in the transformation of spatial stimulus

information into a temporal representation or a temporal population code

(TPC). TPC provides for a rapid, robust and high-capacity encoding of

salient stimulus features with respect to position, rotation and distortion.

The TPC hypothesis gives a functional interpretation to a core feature of

the cortical anatomy: its dense local and sparse long-range connectivity.

Thus far, the question of how the TPC encoding can be decoded in down-

stream areas has not been addressed. Here, we present a neural circuit that

decodes the spectral properties of the TPC using a biologically plausible

implementation of a Haar transform. We perform a systematic investi-

gation of our model in a recognition task using a standardized stimulus

set. We consider alternative implementations using either regular spiking

or bursting neurons and a range of spectral bands. Our results show that

our wavelet readout circuit provides for the robust decoding of the TPC

and further compresses the code without losing speed or quality of decoding. We show that in the TPC signal the relevant stimulus information is

present in the frequencies around 100 Hz. Our results show that the TPC

is constructed around a small number of coding components that can be

well decoded by wavelet coefficients in a neuronal implementation. The

solution to the TPC decoding problem proposed here suggests that cor-

tical processing streams might well consist of sequential operations in which spatio-temporal transformations at lower levels form compact stimulus encodings using the TPC that are subsequently decoded back to a spatial representation using wavelet transforms. In addition, the results presented

here show that different properties of the stimulus might be transmitted

to further processing stages using different frequency components that are

captured by appropriately tuned wavelet based decoders.

2.1 Introduction

The encoding of sensory stimuli requires robust compression of salient features (Hung et al., 2005). This compression must support representations of the stimulus that are invariant to a range of transformations caused, in the case of vision, by varying viewing angles, different scene configurations and deformations. Invariance and compression of information can be achieved by moving across different representation domains, i.e. from spatial to temporal representations.

In earlier work we proposed an encoding paradigm that makes use of this

strategy called the Temporal Population Code (TPC) (Wyss et al., 2003a).

In this approach the input stimulus is topographically projected onto a

network of neurons organized in a bi-dimensional Cartesian space with

dense local connectivity. The output of the network is a compressed rep-

resentation of the stimulus captured in the temporal evolution of the pop-

ulation spike activity. The space to time transformation of TPC provides

for a high-capacity encoding, invariant to position and image deformations

that has been successfully applied to real world tasks such as hand-written

character recognition (Wyss et al., 2003c), spatial navigation (Wyss and Verschure, 2004b) and face recognition in a humanoid robot (Luvizotto

et al., 2011). TPC shows that the dense excitatory local connectivity

found in the primary sensory areas of the mammalian neo-cortex can play

a specific role in the rapid and robust transformation and compression of

spatial stimulus information that can be transmitted over a small number

of projections to subsequent areas. This wiring scheme is consistent with

the anatomy of the neo-cortex where about 95% of all connections found

in a cortical volume also originate in it (Sporns and Zwi, 2004).

In classical models of visual perception, invariant representations emerge in the form of activity patterns at the highest level of a hierarchical multi-layer network of spatial feature detectors (Fukushima, 1980a; Riesenhuber and Poggio, 1999; Serre et al., 2007). In this approach, invariance is achieved at the cost of increasing the number of connections between the different layers of the hierarchy. However, these models seem to be based on a fundamental assumption that is not consistent with cortical anatomy. In comparison to these hierarchical models of object recognition, the TPC architecture has the significant advantage of being both compact and wire-independent, thus providing for the multiplexing of information.

Recently, direct support for the TPC as a substrate for stimulus encoding has been found in a number of physiological studies. For instance, the physiology of the mammalian visual system shows dynamics consistent with the TPC in orientation discrimination (Samonds and Bonds, 2004; Benucci et al., 2009; MacEvoy et al., 2009) and spatial selectivity regulation (Benucci et al., 2007), in particular showing stimulus-specific modulation of the phase relationships among active neurons. In the bird auditory system, the temporal responses of neuron populations allow for the intensity-invariant discrimination of songs (Billimoria et al., 2008). Similarly, in the inferior temporal and prefrontal cortices, information about stimulus categories is encoded by the temporal response of populations of neurons (Meyers et al., 2008; Barak et al., 2010). Signatures of the TPC have also been found in the insect olfactory system, where the glomeruli and the projection neurons of the antennal lobe display stimulus-induced temporal modulations of their firing rate at a scale of hundreds of milliseconds (Carlsson et al., 2005; Knusel et al., 2007).

If the TPC plays a role in stimulus encoding it is relevant to understand

what its key coding features are and how these features can be subse-

quently decoded in areas downstream from the encoder. The readout by

the decoder must be fast and compact, extracting the key characteristics of

the original input stimulus in a compressed way. These key features must

be captured in a non-redundant fashion so that prototypes of a class can

emerge and be efficiently stored in memory and/or serve on-going action.

In a hierarchical model of sensory processing based on the notion of TPC,

the encoded temporal information provided by primary areas is mapped

back onto the spatial domain allowing higher order structures to further

process the stimulus. Hence, a TPC decoder is required to generate a

spatially structured response from the temporal population code of the

encoder. Taking into account these requirements, our question is how a cortical circuit can retrieve the features encapsulated in the TPC.

In the past years, different strategies for decoding temporal information have been suggested. A recent proposal is the so-called Liquid State Machine, or LSM (Doetsch, 2000), which is an example of a larger class of models called reservoir computing (Lukosevicius and Jaeger, 2009). In this approach the dense local circuits of the cerebral cortex are seen as implementing a large set of practically randomly defined filters. When applied to reading out the TPC, we have reported a lower performance compared to using the Euclidean distance, as a result of the LSM's noise sensitivity (Knusel et al., 2004). In addition to being less effective than a linear decoder, the LSM is computationally expensive, requiring an additional layer of hundreds of integrate-and-fire neurons, while its performance strongly depends on the specific parameter settings, which compromises generality. Given that the TPC is consistent with current physiology, we want to know whether an alternative approach can be defined that is more tuned to the specific properties of the TPC, i.e. its temporal structure.


A readout mechanism for temporal codes such as the TPC could also be based on an analysis of the temporal signal over different frequency bands and resolutions. A population of readout neurons tuned to different spectral bands could possibly implement such a readout stage. In this scheme, the temporal information of the TPC is mapped back into a spatial representation by cells responsive to different frequency bands, and thus to the spectral properties of their inputs. A suitable framework for modeling such a readout stage is the wavelet decomposition: a spectral analysis technique that divides the frequency spectrum into a desired number of bands using variable-sized regions (Mallat, 1998). Higher processing stages in the neo-cortex could make use of such a scheme in order to capture information compressed and multiplexed in different frequency bands by preceding areas.

The wavelet transform is a biologically plausible candidate and has already been used extensively for modeling cortical circuits in different areas (Stevens, 2004; Chi et al., 2005). The classic description of image processing in V1 is based on a two-dimensional Gabor wavelet transform (Daugman, 1980). Recently, two alternative wavelet-based models approximating the receptive field properties of V1 neurons in the discrete domain have been proposed, which show additional desirable features such as orthogonality (Saul, 2008; Willmore et al., 2008).

A one-dimensional wavelet transform can be interpreted as a strategy for reading out the different spectral components of the TPC that is equivalent to the wavelet-based models of V1 receptive fields (Jones and Palmer, 1987; Ringach, 2002), thus providing for a general encoding-decoding model that can be generalized to the whole of the neo-cortex, given its relatively uniform anatomical organization. Furthermore, from both the representation and the implementation perspectives, orthogonal wavelets are a compact way of decomposing a signal in which the frequency spectrum is divided in a dyadic manner: at each resolution level of the filtering process a new frequency band emerges, represented by half of the wavelet coefficients present in the previous resolution level. This meets one of the fundamental requirements of an efficient readout system: compactness.
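As a small numerical illustration of this dyadic halving, the following sketch (our names, assuming the standard orthonormal Haar recursion) decomposes a 128-sample signal into five detail bands of 64, 32, 16, 8 and 4 coefficients plus a 4-coefficient approximation:

    import numpy as np

    def haar_levels(signal, levels=5):
        # Orthonormal Haar recursion: pairwise (sum, difference) / sqrt(2).
        approx, details = np.asarray(signal, dtype=float), []
        for _ in range(levels):
            even, odd = approx[0::2], approx[1::2]
            approx, detail = (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)
            details.append(detail)
        return approx, details

    ac5, dcs = haar_levels(np.random.randn(128))
    print([d.size for d in dcs], ac5.size)    # [64, 32, 16, 8, 4] 4  (Dc1..Dc5, Ac5)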

Here, we combine the encoding mechanism of the TPC with a decoder based on a one-dimensional, orthogonal and discrete wavelet transform implemented by a biologically plausible circuit. We show that the information provided by the TPC generated at an earlier neuronal processing level can be decoded in a compressed way by this wavelet readout circuit. Furthermore, we show that these wavelet transforms can be performed by a plausible neuronal mechanism that implements the so-called Haar wavelet (Haar, 1911; Papageorgiou et al., 1998; Viola and Jones, 2001). The simplicity and orthogonality of the Haar wavelet makes this readout process fast and compact, in an implementation that requires only four neurons.

To investigate the validity of our hypothesis we first define a baseline for benchmarking the network's performance in a classification task. The benchmarking is done using a stimulus set of images based on artificially generated geometric shapes. To test the readout performance we evaluate how the information extracted across different sets of wavelet coefficients, covering orthogonal regions of the frequency spectrum, influences classification performance. The simulations are performed using two types of implementations: regular spiking and bursting neurons. We consider these two types of models in order to address the effects of spiking dynamics on the encoding and decoding performance of the model. These two types of spiking behavior have also been observed in V1 pyramidal neurons (Hanganu et al., 2006; Iurilli et al., 2011).

We also investigate the speed of encoding-decoding of the proposed wavelet-based circuit in comparison to the method used in previous TPC studies, which was based purely on linear classifiers. In the last experiments, we explore the generality that the wavelet coefficients hold in forming prototypical representations of an object class that can be stored in a working memory for fast object recognition tasks. In particular, we are concerned with the question of how high-level information generated by sensory processing streams can be flexibly stored and retrieved in long-term and working memory systems (Verschure et al., 2003; Duff et al., 2011).

One option for the memory storage problem would be a labelled-line code in which specific axon/synapse complexes are dedicated to specific stimuli and their components (Chandrashekar et al., 2006; Nieder and Merten, 2007). This approach, however, faces capacity limitations both in the amount of information stored and in the physical location where it can be processed. Alternatively, a purely temporal code such as the TPC would in this respect be independent of the spatial organization of the physical substrate and allow the multiplexing of high-level information. We show that this latter scenario is feasible and can be realized with simple, biologically plausible neuronal components.

Our results suggest that sensory processing hierarchies might well comprise sequences of spatio-temporal transformations that encode combinations of local stimulus features into perceptual classes, using sequences of TPC encoding and wavelet decoding back to a spatial domain.

2.2 Material and methods

The model is divided into two stages: a model of the lateral geniculate nucleus (LGN) and a topographic map of laterally connected spiking neurons with properties found in the primary visual cortex V1 (Fig. 2.1) (Wyss et al., 2003c,a). In the first stage we calculate the response of the receptive fields of LGN cells to the input stimulus, a gray-scale image that covers the visual field. The receptive field characteristics are approximated by convolving the input image with a difference-of-Gaussians (DoG) operator followed by a positive half-wave rectification. The positively rectified DoG operator resembles the properties of ON center-surround LGN cells (Rodieck and Stone, 1965; Einevoll and Plesser, 2011). The LGN stage is a mathematical abstraction of known properties of this brain area and performs an edge enhancement of the input image. In the simulations we use a kernel ratio of 4:1, with a size of 10x10 pixels and variance σ = 1.5 (for the smaller Gaussian).
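A minimal sketch of this stage, assuming the 4:1 ratio refers to the widths of the surround and center Gaussians (the function names are ours), is:

    import numpy as np
    from scipy.signal import convolve2d

    def dog_kernel(size=10, sigma=1.5, ratio=4.0):
        # Difference of Gaussians: center minus a `ratio`-times-wider surround.
        ax = np.arange(size) - (size - 1) / 2.0
        xx, yy = np.meshgrid(ax, ax)
        r2 = xx ** 2 + yy ** 2
        gauss = lambda s: np.exp(-r2 / (2 * s ** 2)) / (2 * np.pi * s ** 2)
        return gauss(sigma) - gauss(sigma * ratio)

    def lgn_stage(image):
        # Convolve with the DoG and half-wave rectify (keep the positive part),
        # mimicking ON center-surround responses.
        response = convolve2d(image, dog_kernel(), mode="same", boundary="symm")
        return np.maximum(response, 0.0)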

Figure 2.1: The TPC encoding model. In a first step, the input image is projected to the LGN stage where its edges are enhanced. In the next stage, the LGN output passes through a set of Gabor filters that resemble the orientation selectivity characteristics found in the receptive fields of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After the image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC.

The LGN signal is projected onto the V1 spiking model, whose coding concept is illustrated in Fig. 2.2. The network is an array of NxN model neurons, each connected to a circular neighborhood with synapses of equal strength and instantaneous excitatory conductance. The transmission delays are related to the Euclidean distance between the positions of the pre- and postsynaptic neurons. The stimulus is continuously presented to the network and the spatially integrated spreading activity of the V1 units, as a sum of their action potentials, results in the so-called TPC signal.

In the network, each neuron is approximated using the spiking model proposed by Izhikevich (2003). These model neurons are biologically plausible and as computationally efficient as integrate-and-fire models (Izhikevich, 2004). Relying on only four parameters, our network can reproduce both regular (RS) and bursting (BS) spiking behavior using a system of ordinary differential equations of the form:


Figure 2.2: The TPC encoding paradigm. The stimulus, here represented by a star, is projected topographically onto a map of interconnected cortical neurons. When a neuron spikes, its action potential is distributed over a neighborhood of a given radius. The lateral transmission delay of these connections is 1 ms/unit. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC representation is defined by the spatial average of the population activity over a certain time window. The invariances that the TPC encoding renders are defined by the local excitatory connections.

v′ = 0.04v² + 5v + 140 − u + I    (2.1)

u′ = a(bv − u)    (2.2)

with the auxiliary after-spike resetting:

if v ≥ 30 mV, then v ← c and u ← u + d    (2.3)

Here, v and u are dimensionless variables and a, b, c and d are dimensionless parameters that determine the spiking or bursting behavior of the neuron unit, with ′ = d/dt, where t is time. The parameter a describes the time scale of the recovery variable u. The parameter b describes the sensitivity of the recovery variable u to the sub-threshold fluctuations of the membrane potential v. The parameter c accounts for the after-spike reset value of v caused by the fast high-threshold K+ conductances, and d for the after-spike reset of the recovery variable u caused by the slow high-threshold Na+ and K+ conductances. The mathematical analysis of the model can be found in (Izhikevich, 2006).

The excitatory input I in eq. 2.1 consists of two components: first, a constant driving excitatory input gi, and second, the synaptic conductances given by the lateral interaction of the units, gc(t). So

I(t) = gi + gc(t)    (2.4)

For the simulations, we used the parameters suggested in (Izhikevich, 2004)

to reproduce RS and BS spiking behavior (Fig. 2.3). All the parameters

used in the simulations are summarized in table 2.1.
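For illustration, the following is a direct transcription of eqs. 2.1 to 2.3 using the table 2.1 values; the function name and the 200 ms constant-input test are ours:

    def izhikevich(I, a=0.02, b=0.2, c=-65.0, d=8.0, v0=-70.0, u0=-16.0, dt=1.0):
        # Euler integration of eqs. 2.1-2.3 at dt = 1 ms (table 2.1 values).
        v, u, spikes = v0, u0, []
        for step, I_t in enumerate(I):
            v += dt * (0.04 * v ** 2 + 5 * v + 140 - u + I_t)
            u += dt * a * (b * v - u)
            if v >= 30.0:                    # after-spike resetting, eq. 2.3
                spikes.append(step)
                v, u = c, u + d
        return spikes

    rs_spikes = izhikevich([20.0] * 200)                    # RS: c = -65, d = 8
    bs_spikes = izhikevich([20.0] * 200, c=-55.0, d=4.0)    # BS: c = -55, d = 4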

The network architecture is composed of 24 populations of orientation-selective neurons, where a bank of Gabor filters is used to reproduce the characteristics of V1 receptive fields (Fig. 2.1). The filters are divided into layers of eight orientations Θ ∈ {0, π/8, 2π/8, ..., π} and three scales denoted by δ. The distance of the central frequency among the scales is 1/2 octave, with a maximum frequency Fmax = 1/10 cycles/pixel. The convolution with G(δ,Θ) is computed at each time step and the output is truncated according to a threshold Ti ∈ [0, 1], where the values above Ti are set to a constant driving excitatory input gi. Each unit can be characterized by its orientation selectivity angle Θ, its scale δ and a two-dimensional vector x ∈ R² specifying the location of its receptive center within the input plane. So a column is denoted by u(x, Θ, δ).
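A sketch of such a filter bank follows; only the eight orientations, the three half-octave scales and Fmax = 1/10 cycles/pixel are taken from the model, while the kernel size and envelope width are illustrative assumptions:

    import numpy as np

    def gabor_kernel(theta, freq, size=21, sigma=4.0):
        # One orientation- and scale-selective V1-like filter.
        ax = np.arange(size) - size // 2
        xx, yy = np.meshgrid(ax, ax)
        x_rot = xx * np.cos(theta) + yy * np.sin(theta)
        envelope = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
        return envelope * np.cos(2 * np.pi * freq * x_rot)

    thetas = [k * np.pi / 8 for k in range(8)]          # eight orientations
    freqs = [0.1 / 2 ** (s / 2) for s in range(3)]      # Fmax = 1/10 cyc/px, 1/2-octave steps
    bank = [gabor_kernel(th, f) for th in thetas for f in freqs]   # 24 filters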

The lateral connectivity between V1 units is exclusively excitatory, with strength w. A unit ua(x, φ, δ) connects with ub if all of the following conditions are met:

1. they are in the same population: Θa = Θb and δa = δb

2. they have different center positions: xa ≠ xb

3. ub lies within a region of a certain radius: ‖xb − xa‖ < r

According to recent physiological studies, intrinsic V1 intra-cortical connections cover distances that represent regions of the visual space up to eight times the size of single receptive fields of V1 neurons (Stettler et al., 2002). In our model we set the connectivity radius r to 7 units. The lateral synapses are of equal strength w and the transmission delays τa are proportional to ‖xb − xa‖, with 1 ms/cell.

The TPC is generated by summing the network activity over a time window of 128 ms. Finally, the output TPC vectors from the different orientation and scale layers are read out by the proposed wavelet circuit, forming the decoded TPC vector used for the statistical analysis. In discrete time, all the equations are integrated with Euler's method using a temporal resolution of 1 ms.
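In code, this final step is a single spatial summation per time bin; a toy sketch with hypothetical names, for one 80x80 layer and the 128 ms window:

    import numpy as np

    rng = np.random.default_rng(1)
    raster = rng.random((6400, 128)) < 0.02    # spikes of one 80x80 layer over 128 ms
    tpc = raster.sum(axis=0)                   # population activity per ms: the TPC trace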

Neuronal wavelet circuit

The proposed neuronal wavelet circuit is based on the discrete multi-resolution decomposition, in which each resolution level reflects a different spectral range, and uses the Haar wavelet as basis (Mallat, 1998). The approximation coefficients, or AC, are the high-scale, low-frequency components of the signal spectrum, obtained by convolving the signal with the scale function φ. The detail coefficients, or DC, are the low-scale, high-frequency components, given by the wavelet function ψ. Each component has a time resolution matched to the wavelet scale, which works as a filter.


Variable   Description                                    Value
N          network dimension                              80x80 neurons
a          time scale of recovery                         0.02
b          sensitivity of recovery                        0.2
c_rs       after-spike reset value of v for RS neurons    -65
c_bs       after-spike reset value of v for BS neurons    -55
d_rs       after-spike reset value of u for RS neurons    8
d_bs       after-spike reset value of u for BS neurons    4
v          membrane potential at rest                     -70
u          membrane recovery variable at rest             -16
g_i        excitatory input conductance                   20
T_i        minimum V1 input threshold                     0.4
r          lateral connectivity radius                    7 units
w          synapse strength                               0.4

Table 2.1: Parameters used for the simulations.

The Haar wavelet ψ at time t is defined as:

ψ(t) ≡ 1 for 0 ≤ t < 1/2,  −1 for 1/2 < t ≤ 1,  0 otherwise    (2.5)

and its associated scale function φ as:

φ(t) ≡ 1 for 0 ≤ t < 1,  0 otherwise    (2.6)

In a biologically plausible implementation, the wavelet decomposition can be performed based on the activity of two short-term buffer cells, B1 and B2, inhibited by an asymmetric delayed connection from cell A (Fig. 2.4 a). The buffer cells integrate rapid changes over a certain amount of time, analogous to the scale function φ of eq. 2.6. In our model, the buffer cells are modeled as discrete low-pass finite impulse response (FIR) filters; they are equivalent to the scale function φ in the discrete domain.


Buffer cells have been reported recently in other cortical areas such as the

prefrontal cortex (Koene and Hasselmo, 2005; Sidiropoulou et al., 2009).

Figure 2.3: Computational properties of the two types of neurons used in the simulations: regular (RS) and burst spiking (BS). The RS neuron shows a mean inter-spike interval of about 25 ms (40 Hz). The BS type displays a similar inter-burst interval, with a within-burst inter-spike interval of approximately 7 ms (140 Hz) every 35 ms (28 Hz).

In our model, the buffer cells must receive an inhibitory input from A that defines the integration envelope. The inhibition has to span a time period t, with a phase shift of π/2 between A and B2. Therefore, B1 and B2 have their integration profiles shifted in time by t. When the inhibition is synchronized in time with the integration profile of the buffer cells, the period 2t determines the low-frequency cutoff, i.e. the resolution level l associated with the Haar scale function φ of eq. 2.6 (low-pass filter). If both B1 and B2 are excitatory, the projection to cell W gives rise to the approximation level Al.


Figure 2.4: a) Neuronal readout circuit based on wavelet decomposition. The buffer cells B1 and B2 integrate the network activity in time, performing a low-pass approximation of the signal over two adjacent time windows given by the asynchronous inhibition received from cell A. The differentiation performed by the excitatory and inhibitory connections to W gives rise to a band-pass filtering process analogous to the wavelet detail levels. b) An example of band-pass filtering performed by the wavelet circuit, where only the frequency range corresponding to the resolution level Dc3 is kept in the spectrum.

On the other hand, if one buffer cell is inhibitory, as in the example of Fig. 2.4 a), the detail level D(l+1) is obtained by cell W, as performed by the Haar wavelet function itself (eq. 2.5). In our model, the inhibition is modeled as discrete high-pass FIR filters. The combination of low-pass and high-pass filters in cascade produces band-pass filters. Therefore, the readout can be optimized to specific ranges of the frequency spectrum (Fig. 2.4 b).
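The following sketch mimics the circuit in the discrete domain. It is a simplification with our own names: each buffer cell becomes a box-car sum over one of two adjacent half-windows of length t, and their sum and difference yield unnormalized approximation and detail coefficients.

    import numpy as np

    def wavelet_readout(tpc, t):
        # B1 and B2: box-car integration over two adjacent windows of length t.
        n = len(tpc) // (2 * t)
        x = np.asarray(tpc[: n * 2 * t], dtype=float).reshape(n, 2, t).sum(axis=2)
        approx = x[:, 0] + x[:, 1]   # both buffers excitatory -> level Al (low-pass)
        detail = x[:, 0] - x[:, 1]   # one buffer inhibitory   -> level Dl+1 (band-pass)
        return approx, detail

    tpc = np.random.randn(128)            # one 128 ms TPC trace
    ac, dc = wavelet_readout(tpc, t=4)    # 16 coefficients over 128 ms

With a 1 kHz sampling rate, t = 4 yields 16 coefficients over 128 ms, corresponding to the Dc3 band discussed below.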

Stimulus set

The stimulus set is based on abstract geometric forms, as used previously (Wyss et al., 2003c). On a circular path with a diameter of 40 pixels, 5 uniformly distributed vertices can connect to each other with equal probability, defining the shape of a stimulus class (Fig. 2.5). The different objects forming a class are generated by jittering the position of the vertices and the default line thickness of 4 pixels. We defined a total of 10 classes for the experiments, with 50 exemplars per class.

For the experiments we subdivide the data set into three subsets of increasing complexity by varying the amount of jitter in the vertices' positions and in the thickness of the connecting line segments. The jitter values are given by uniformly distributed random factors with zero mean and standard deviations equal to 0.03, 0.04 and 0.05 for the vertices' positions and 0.018, 0.021 and 0.025 for the thickness of each subset respectively. We used subset 1, with 50 stimuli per class, as training and classification set. In the case where subset 1 is used for classification, 50% of the stimuli are randomly assigned to the training set and the other half is used for classification. Subsets 2 and 3 are only used as classification sets; in this case subset 1 is entirely used for training. Therefore training stimuli are never used for classification and vice-versa.
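A generation sketch consistent with this description is given below; the 0.5 connection probability between vertex pairs and the multiplicative form of the jitter are our reading of the text, and all names are hypothetical:

    import numpy as np

    rng = np.random.default_rng(42)

    def class_vertices(n=5, diameter=40):
        # Five vertices uniformly spaced on a circular path; connected pairs
        # (assumed probability 0.5) define the class geometry.
        angles = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
        pts = (diameter / 2) * np.c_[np.cos(angles), np.sin(angles)]
        edges = [(i, j) for i in range(n) for j in range(i + 1, n)
                 if rng.random() < 0.5]
        return pts, edges

    def exemplar(pts, sd_pos=0.03, thickness=4.0, sd_thick=0.018):
        # Zero-mean uniform jitter with the requested standard deviation:
        # U(-a, a) has sd a / sqrt(3), hence a = sd * sqrt(3).
        a_pos, a_th = sd_pos * np.sqrt(3), sd_thick * np.sqrt(3)
        return (pts * (1 + rng.uniform(-a_pos, a_pos, pts.shape)),
                thickness * (1 + rng.uniform(-a_th, a_th)))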

For estimating the degree of similarity among the images we used the normalized Euclidean distance of the pixels. The normalization is done as follows: a given stimulus has a distance equal to zero if it is identical to its class prototype and equal to one if it is the globally most distant exemplar over all subsets. The image prototype is defined by the five vertices that define the geometry of a class, with no jitter applied.

Figure 2.5: The stimulus classes used in the experiments after the edge enhancement of the LGN stage (subsets 1 to 3).

Cluster algorithm

For the classification we used the following algorithm. The network's responses to stimuli from C stimulus classes S1, S2, ..., SC are assigned to C response classes R1, R2, ..., RC of the training set, yielding a C × C hit matrix N(Sα, Rβ), whose entries denote the number of times that a stimulus from class Sα elicits a response in class Rβ. Initially, the matrix N(Sα, Rβ) is set to zero. For each response r ∈ Sα, we calculate the Euclidean distance of r to the responses r′ ≠ r elicited by stimuli of class Sγ:

ρ(r, Sγ) = ⟨ρ(r, r′)⟩ over r′ elicited by Sγ    (2.7)

where ⟨.⟩ denotes the average over the temporal Euclidean distances ρ(r, r′) between r and r′. The response r is classified into the response class Rβ for which ρ(r, Sβ) is minimal, and N(Sα, Rβ) is incremented by one. The overall classification ratio, in percent, is calculated by summing the diagonal of N(Sα, Rβ) and dividing by the total number of elements in the classification set R. We chose the same metric that was used in previous studies to establish a direct comparison between the results over different scenarios.
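A compact sketch of the procedure (our names; it assumes disjoint training and classification sets, which makes the r′ ≠ r restriction automatic):

    import numpy as np

    def hit_matrix(train, test):
        # train/test: dict class -> (n_exemplars x n_features) response arrays.
        classes = sorted(train)
        N = np.zeros((len(classes), len(classes)), dtype=int)
        for a, s_alpha in enumerate(classes):
            for r in test[s_alpha]:
                # mean Euclidean distance of r to every training class (eq. 2.7)
                rho = [np.mean(np.linalg.norm(train[s_gamma] - r, axis=1))
                       for s_gamma in classes]
                N[a, int(np.argmin(rho))] += 1
        return N

    # overall classification ratio (%): 100 * np.trace(N) / N.sum()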


2.3 Results

We start by analyzing the properties of the proposed stimulus set detailed in section 2.2. Then, we run network simulations in order to establish a baseline classification ratio in a stimulus detection task. In the following step we use the wavelet circuit to read out the TPC signal over different frequency bands and compare the classification results to the previously established baseline. In order to address the effect of the spiking modality on the decoding mechanism we run separate simulations with two different kinds of neurons: regular and burst spiking (eqs. 2.1 and 2.2). In the subsequent experiment, we investigate how the speed of encoding is affected by reading out the TPC using a neuronal implementation of the wavelet circuit. Subsequently, we show how the dense wavelet representation provided by our decoding circuit can be used to create class prototypes that can flexibly define the content of a memory system. Finally, we perform an analysis of the similarity relationships between the TPC encoding and the representation of the stimuli in the wavelet and spatial domains respectively. The model parameters used for all simulations are specified in section 2.2 and in table 2.1.

Stimulus set similarity properties

We use an algorithmic approach to parametrically define our stimulus classes. Every class is defined around a prototype (Fig. 2.6a, upper row). We measure how similar the exemplars from the classification sets are to the respective image class prototype (see methods, section 2.2). The median Euclidean distances of stimulus sets 1 to 3 are 0.59, 0.64 and 0.70 respectively. This increasing median translates into an increasing difficulty in the classification of the stimulus sets.


Figure 2.6: The stimulus set. a) Image-based prototypes (no jitter applied to the vertices) and the globally most different exemplars, with normalized distance equal to one. The distortions can be very severe, as in the case of class number one. b) Histograms of the normalized Euclidean distances between the class exemplars and the class prototypes in the spatial domain, for subsets 1 to 3.

Baseline and network performance

As a reference for further experiments we first establish a baseline classification ratio. The baseline is defined by applying the clustering algorithm described previously (section 2.2) directly in the spatial domain of the stimulus set. In this scenario, the classification is performed over the pixel intensities of the centered and edge-enhanced images (Fig. 2.7). As is to be expected, the classification performance decreases with an increase in the geometric variability of the three subsets used in the experiments. For subsets one, two and three the classification ratio reaches 91%, 88% and 82% respectively.

Wavelet circuit readout

We now want to assess the classification performance and compression capacity of the wavelet circuit we have proposed (section 2.2).

Figure 2.7: Baseline classification ratio using the Euclidean distance among the images from the stimulus set in the spatial domain (subsets 1 to 3).

We consider a range of frequency resolutions in a dyadic manner using the wavelet resolution levels Ac5, corresponding to 0 to 15.5 Hz, Dc5, from 15.5 to 31 Hz, Dc4, from 31 to 62 Hz, Dc3, from 62 to 125 Hz, Dc2, from 125 to 250 Hz, and finally Dc1, from 250 to 500 Hz. In the simulations, we increase the inhibition and integration time of the cells A, B1 and B2 (Fig. 2.4) in order to explore the network's performance in the stimulus classification task.
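These bands follow directly from the sampling rate. With the 1 ms integration step the TPC is sampled at Fs = 1000 Hz, and the detail level Dc_l of a dyadic decomposition spans

    [ Fs / 2^(l+1) , Fs / 2^l ]

so Dc1 = [250, 500] Hz, Dc3 = [62.5, 125] Hz and Dc5 = [15.6, 31.25] Hz, with Ac5 covering the remaining [0, 15.6] Hz (the values listed above are rounded).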

The results using RS neurons show that the classification performance has a peak in the frequency range from 62 Hz to 125 Hz, equivalent to the so-called Dc3 level in the wavelet domain, where 91%, 83% and 74% of the TPC-encoded stimuli are correctly classified for subsets one, two and three respectively (Fig. 2.8). In comparison, using BS neurons the classification performance has a peak in the frequency range from 15 Hz to 31 Hz, equivalent to the so-called Dc5 level in the wavelet domain, where 92%, 82% and 74% of the responses are correctly classified for subsets one, two and three respectively (Fig. 2.8). Reading out the TPC without the wavelet circuit, i.e. using the integrated spiking activity over time without the wavelet representation, we achieve a classification ratio for subsets one, two and three of 88%, 79% and 75% for RS neurons and 87%, 80% and 74% for BS neurons.

Figure 2.8: Comparison of the correct classification ratios for different resonance frequencies of the wavelet filters for both types of neurons, RS and BS, over subsets 1 to 3. The frequency bands of the TPC signal are represented by the wavelet coefficients Dc1 to Ac5 in a multi-resolution scheme (from 4 coefficients for Ac5 up to 64 for Dc1). The network time window is 128 ms.

Thus, the wavelet circuit adds a marginal improvement to the readout of the TPC signals as compared to the control condition for the RS neurons, in particular for the easier stimulus set, while the BS version of the model does not show a marked increase in classification performance.

However, while keeping the classification performance nearly the same, the dyadic property of the discrete wavelet transform compresses the length of the temporal signal by a factor of 8 and 32 using the Dc3 and Dc5 levels for RS and BS neurons respectively. So the information encoded over the 128 ms of stimulus presentation is captured, in a compressed way, by only a few wavelet coefficients.


In comparison with the benchmark results (Fig. 2.7), the wavelet circuit readout provides a slightly lower classification ratio. However, if we look at the BS network numbers, the TPC wavelet readout can generate a reliable representation of the input image with 24x4 coefficients. Compared to the 80x80 pixels of the original input image used for the benchmark, this is an extreme compression: the compression factor is about 66 times (6400 pixels versus 96 coefficients) without a significant loss in classification performance. Therefore the wavelet coefficients provide a compact representation of the stimuli in a specific region of the frequency spectrum.

Classification speed

A key feature of the TPC is its encoding speed. It has been shown in previous TPC studies that the speed of encoding is compatible with the speed of processing observed in the mammalian visual system (Thorpe et al., 1996; Wyss et al., 2003c; Tollner et al., 2011). Here we investigate how fast the information transmitted by the TPC is captured by the wavelet coefficients. We use the mutual information measure (Victor and Purpura, 1999) to quantify the amount of transmitted information for a varying length of the signal interval of all the 24 network layers used to calculate the Euclidean distance (Fig. 2.9). The mutual information calculation is performed using the wavelet coefficients generated by the readout circuit.
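The computation follows the method of Victor and Purpura (1999); as a rough, hedged stand-in, the information transmitted by a classifier can be estimated from the normalized hit matrix, for example:

    import numpy as np

    def transmitted_bits(N):
        # Treat the normalized hit matrix as a joint distribution p(s, r)
        # and compute the mutual information between stimulus and response class.
        p = N / N.sum()
        ps = p.sum(axis=1, keepdims=True)    # stimulus marginal
        pr = p.sum(axis=0, keepdims=True)    # response-class marginal
        nz = p > 0
        return float(np.sum(p[nz] * np.log2(p[nz] / (ps @ pr)[nz])))

For ten equiprobable, perfectly classified classes this estimator saturates at log2(10), about 3.3 bits.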

We also compare the speed of encoding of the TPC signal in the temporal domain with that of the readout version based on the wavelet coefficients. For RS neurons, the wavelet coefficients that lead to maximum classification performance are localized in the frequency interval from 62 Hz to 125 Hz, equivalent to the resolution level Dc3. For BS neurons, the frequency interval is in a lower range, from 15 Hz to 31 Hz, equivalent to the resolution level Dc5. In these frequency ranges the maximum classification performance is achieved, as shown in the previous section (Fig. 2.8).

We observe that the number of bits encoded over the time window of 128 ms is nearly the same when comparing the non-filtered TPC signals (RS-TPC and BS-TPC, Fig. 2.9) and the signals captured by the wavelet readout (RS-Wav and BS-Wav, Fig. 2.9).

Figure 2.9: Speed of encoding. Number of bits encoded by the network's activity trace as a function of time, for subsets 1 to 3. The RS-TPC and BS-TPC curves represent the bits encoded by the network's activity trace without the wavelet circuit. The RS-Wav and BS-Wav curves correspond to the bits encoded by the wavelet coefficients using the Dc3 resolution level for RS neurons and the Dc5 level for BS neurons respectively. For a time window of 128 ms the Dc3 level has 16 coefficients and the Dc5 level has only 4 coefficients. The dots in the figure represent the moments in time where the coefficients are generated.

However, in the case of the BS neurons the speed of encoding is slower when the signal is decoded by the wavelet circuit. This effect is due to the longer time constant needed by the buffer cells B1 and B2 to integrate and differentiate the signal at this resolution level and thus to compute the wavelet coefficients: the buffer cells need 32 ms to compute the first wavelet coefficient (Fig. 2.9). In the case of the RS neurons, more than 90% of the total information is captured within the second wavelet coefficient, or 16 ms after stimulus onset. Thus the effect of the neuronal wavelet circuit on the speed of encoding depends both on the spiking behavior of the encoders and on the frequency range at which the signal is read out.

Prototype-based classification

In the last step, we investigate whether the wavelet representation can be generalized to the generation of prototypes from the stimulus classes. The aim of the experiment is to create prototypes learned from the training set that can be stored in memory and retrieved in a future classification task. To construct such representations we build class prototypes based on the wavelet coefficients of the N stimuli making up the training set. For each of the ten stimulus classes we calculate the median over the wavelet coefficients of the 50 response exemplars that define a training class (subset one). Hence, for each class we have a vector of wavelet coefficients that defines the class-specific prototype, i.e. a representation of the class in the wavelet domain. Based on the classification experiments, we use the coefficients from the Dc3 level for RS neurons and from the Dc5 level for BS neurons. The Fourier transform of the prototypes reveals the frequency components that comprise the prototypes for the two different neuron models we consider (Fig. 2.10).

We present the classification set to the network and calculate the Euclidean distance between the output responses of the wavelet network and the ten previously created prototypes. For the classification a simple criterion is adopted: the smallest Euclidean distance defines the class to which the stimulus is assigned.

The results for the RS neurons show that 86%, 82% and 75% of the responses are correctly classified (diagonal entries) by the wavelet prototypes for subsets one, two and three respectively. For the BS neurons we observe that 91%, 81% and 72% of the responses are correctly classified for subsets one, two and three respectively (Fig. 2.11). The classification ratios are consistent with the results previously presented in section 2.3 using the cluster algorithm (see methods, section 2.2). However, the number of calculations in the classification stage is drastically reduced, because each class is represented by a single prototype vector of wavelet coefficients instead of a collection of vectors. This result suggests that, with a simple algorithm, the wavelet representation can be integrated into a compact description of a complex spatially organized stimulus. Therefore, the information provided by densely coupled cortical neurons can be learned and efficiently stored in memory independently of the details of their spiking behavior.
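A sketch of the two steps (hypothetical names: the median prototype per class, then nearest-prototype assignment):

    import numpy as np

    def build_prototypes(train):
        # Median wavelet-coefficient vector of the 50 training exemplars per class.
        return {c: np.median(coeffs, axis=0) for c, coeffs in train.items()}

    def nearest_prototype(response, prototypes):
        # Smallest Euclidean distance decides the class assignment.
        return min(prototypes,
                   key=lambda c: np.linalg.norm(response - prototypes[c]))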

In the second part of the experiment, we present a geometric and spatio-temporal analysis of the underlying neural code. We perform a correlation analysis of the stimuli misclassified by the wavelet prototypes in both the wavelet and the spatial domain. We want to understand whether the geometric deformations applied in the spatial domain are directly translated to the temporal representation captured by the wavelet coefficients. This analysis is performed using the exemplars that were misclassified by the wavelet prototype approach. First, we calculate the normalized Euclidean distances in the spatial domain between each stimulus of the misclassified set and the prototype of each class. Second, we apply the same distance measure, but using the wavelet representations of the misclassified stimuli and the prototypes. Finally, we correlate the distance values (Fig. 2.12).


Figure 2.10: Single-sided amplitude spectrum of the wavelet prototype for each of the ten stimulus classes used in the simulations, for RS and BS neurons. The signals x(t) were reconstructed in time using the wavelet coefficients from the Dc3 and Dc5 levels for RS and BS neurons respectively. The shaded areas show the optimal frequency response of the Dc3 level (62 Hz to 125 Hz) and of the Dc5 level (15.5 Hz to 31 Hz). The less pronounced responses around 400 Hz are aliasing effects due to the signal reconstruction used to calculate the Fourier transform (see discussion).

Figure 2.11: Prototype-based classification hit matrices for RS and BS neurons over subsets 1 to 3. For each class in the training set we average the wavelet coefficients to form class prototypes. In the classification process, the Euclidean distance between the classification set and the prototypes is calculated. A stimulus is assigned to the class with the smallest Euclidean distance to the respective class prototype.

The results show a positive correlation between the Euclidean distances in the two domains, suggesting that the amount of geometric deformation in the spatial domain is directly translated to the wavelet representation of the temporal code. The correlation is higher for RS neurons, with a value of 0.62 against 0.50 for BS neurons (p < 0.001). The positive correlation between both domains validates the wavelet prototypes, and therefore the overall TPC transformation structure, as an equivalent representation of the stimulus classes that conserves the relevant spatial information.

2.4 Discussion

We have shown previously that in a model of the sensory cortex, the

representation of a static stimulus can be generated using the temporal

dynamics of a neuronal population or Temporal Population Code. Here


[Figure 2.12 shows scatter plots of distance in the wavelet domain versus distance in the spatial domain: Rho = 0.62, p < 0.001 for RS neurons and Rho = 0.50, p < 0.001 for BS neurons.]

Figure 2.12: Distribution of errors in the wavelet-based prototype classification with relation to the Euclidean distances within the prototyped classes.

we have shown that this temporal code has a specific signature in the

phase relationships among the active neurons of the underlying substrate.

This signal efficiently conveys a complete yet compact representation of the stimulus that can be decoded in downstream areas through a subset of wavelet coefficients.

The TPC is a relevant hypothesis on the encoding of sensory events given

its consistency with cortical anatomy and recent physiology. Since its introduction, however, a persistent problem has been to extend this concept

to a readout stage that would be neurobiologically compatible. A priori

it was not clear whether a wavelet transform would be suitable because it

implies a specific structure in the TPC representation itself.

We have shown that decoding of the TPC can be based on wavelet trans-

forms. Using a systematic spectral investigation of the TPC signal we

observed that the Haar basis seems to be a feasible choice providing ro-

bust and reliable decoding. The achieved results associated with the Haar

wavelet are consistent with previous studies where Haar filters are used to

preprocess neuronal data prior to a linear discriminant analysis (Laubach,

2004). For regular spiking neurons, the neuronal wavelet circuit proposed here showed the same encoding performance as the original algorithmic TPC readout. However, for

bursting neurons the wavelet readout had a slower speed of encoding in

comparison with the original TPC. Therefore the details of the readout

mechanism such as its integration time constant and its wavelet resolution

level depend on the spiking dynamics.

The specific geometric characteristics of each stimulus class could be cap-

tured in a very compact way using the wavelet filters. We showed that a

visual stimulus can be represented and further classified using a strongly

compressed signal based on the wavelet coefficients. Reading out the net-

work based on regular spiking neurons using wavelet coefficients yielded a

compression factor of 16.6 relative to the original image size. In the case of the bursting neurons the compression ratio was even higher, reaching a factor of 66. Therefore, the spatial to temporal transformation

of the TPC model combined with the efficient wavelet readout circuit can

provide for a robust and compact representation of sensory inputs in the

cortex for different spiking modalities.

We have performed a detailed analysis of the misclassified stimuli in or-

der to better understand the interesting feature of similarity conserving

misclassifications we observe. We found a positive correlation between the geometric distortions in the spatial domain and those in the temporal domain, as represented by the wavelet coefficients. These findings suggest that the defor-

mations in the spatial domain were directly translated into the wavelet

domain and therefore responsible for the misclassifications observed. This

result reinforces the direct relationship present between the geometric and

spatio-temporal portions of the underlying neural code and its decoding.

Our results also suggest that specific axon/synapse complexes dedicated

to specific features are not needed to successfully encode visual stimuli.

We have shown that the efficient structure of an orthogonal basis like the

Haar wavelet, can be implemented by a neuronal circuit (see Fig. 2.4),

with low computational cost, low latency and thus in a real-time system.

The wavelet filters as implemented with buffer neurons can be changed

on line depending on what kind of information needs to be retrieved and

its specific frequency range. This property allows multiplexing high-level

information from the visual input and flexible storage and retrieval of in-

formation by a memory system. Indeed, recent experiments have shown

that distinct activity patterns overlaid in primary visual cortex individ-

ually signal motion direction, speed, and orientation of object contours

within the same network at the same time (Onat et al., 2011). We spec-

ulate that in a non-static scenario other stimulus characteristics such as

motion speed and direction could be extracted from other frequency bands

using the same wavelet circuit.

The optimal resolution level for the wavelet readout was determined through

a systematic investigation based on the correlation between compression

and performance (Fig. 2.8). In the case of regular spiking neurons we ob-

served a maximum classification performance in the Dc3 resolution level,

or in the frequency range from 62 Hz up to 125 Hz, while for the bursting neurons it falls in the range of 15.5 Hz to 31 Hz. The wavelet circuit itself

does not define the choice of the wavelet resolution. However, we could

show that in the TPC framework the proposed readout circuit can capture

different properties of visual stimuli that travel through the sensory pro-

cessing stream at different frequency ranges. In our case the sensitivity of


the wavelet circuit will depend on the feed-forward receptive fields com-

bined with the phase relationship imposed by the inhibitory units. The

mechanisms used by higher cognitive areas to manipulate the frequency

ranges and the kind of information carried in these temporal information

channels are currently not clear and are the subject of follow-up studies.

In comparison to the LSM model previously applied to read out the TPC,

the wavelet circuit is computationally inexpensive and requires only four

neurons to be implemented. Although the optimal readout performance also depends on specific parameters that set the readout frequency range, the generality of the model is not affected, in contrast to the case of the LSM. Our

results suggest that this optimal frequency range is determined by the

spiking behavior of the neurons in the network.

From a technical perspective, one issue related to orthogonal wavelets is

the aliasing effect (Chen and Wang, 2001) which could insert redundant

spectral content in the TPC signals leading to reduced classification per-

formance. This issue can be addressed by increasing the number of vanishing moments of the wavelet basis, which smooths the effects of aliasing and increases the orthogonality between the spectral sub-bands. However, the filter-

ing process would get more sophisticated and would require more than two

buffer cells. In contrast, the Haar based readout circuit is computationally

efficient.

To evaluate the effects of different spiking behaviors on the proposed read-

out circuit, we used a different and more physiologically constrained neuron model than in previous TPC studies. The overall dynamics of the previous neuron model are significantly different from those of the model used here. For instance, the model used in previous studies (Wyss

et al., 2003c) has a strong onset response with about 50% more spikes in

the first 20 ms after stimulus onset as compared to the model used in the

current study that includes mechanisms for spike adaptation. In addition,

we observed significant differences in the sub-threshold fluctuations and

the membrane potential envelope between these two neuron models. How-


ever, the overall results reported here are compatible with those previously

reported (Wyss et al., 2003a,c). Based on that, we conclude that TPC and

the proposed readout mechanism are robust with respect to the details of

the spiking dynamics and the overall biophysical properties of the mem-

brane potential envelope. We are not aware of any other encoding-decoding

model of cortical dynamics that shows a similar generality.

The performance measure we use, a classification score based on Euclidean distance, is a well-established standard in the literature (Victor and Purpura, 1999). We looked both at classification perfor-

mance and information encoded. In order to develop the specific point

of this paper, the decoding of the TPC using wavelets, we adhere to this

standard. Hence, the results should be seen as relative to those established

and published for the TPC, contributing to the unresolved issue of how

a biologically plausible decoding can take place. We demonstrate that a

Wavelet-like transform can fulfill the requirements for an efficient readout

mechanism, thus generating a specific hypothesis on the role of the sparse

inter-areal connectivity found in the neo-cortex.

We have shown that the dense local connectivity of the neo-cortex can

transform the spatial organization of its inputs into a compact Tempo-

ral Population Code (TPC). By virtue of its multiplexing capability this

code can be transmitted to downstream decoding areas using the sparse

long-range connections of the cortex. We have shown that in these down-

stream areas the TPC can be decoded and further compressed using a

wavelet based readout system. Our results show that the TPC informa-

tion is organized in a specific subset of frequency space creating virtual

communication channels that can serve working memory systems. Thus,

TPC does not only provide a functional hypothesis on the specifics of cor-

tical anatomy but it also provides a computational substrate for a func-

tional neuronal architecture that is organized in time rather than in space

(Buzsaki, 2006).


2.5 Acknowledgments

This work was supported by EU FP7 projects EFAA (FP7-ICT-270490)

and GOAL-LEADERS (FP7-ICT-97732).

Chapter 3

Temporal Population Code for Face Recognition on the iCub

Robot

The connectivity of the cerebral cortex is characterized by dense local

and sparse long-range connectivity. It has been proposed that this con-

nection topology provides a rapid and robust transformation of spatial

stimulus information into a temporal population code (TPC). TPC is a

canonical model of cortical computation whose topological requirements

are independent of the properties of the input stimuli and, therefore, can

be generalized to the processing requirements of all cortical areas. Here we

propose a real-time implementation of TPC for classifying faces, a complex natural stimulus that mammals are constantly confronted with. The model consists of a primary visual cortex (V1) network of laterally connected integrate-and-fire neurons implemented on the humanoid robot platform iCub. The experiment was performed using human faces

presented to the robot under different angles and position of light inci-

dence. We show that the TPC-based model can recognize faces with a

correct ratio of 97 % without any face-specific strategy. Additionally, the

speed of encoding is consistent with the mammalian visual system, suggest-



ing that the representation of natural static visual stimulus is generated

based on the combined temporal dynamics of multiple neuron populations.

Our results show that, without any input-dependent wiring, TPC can be efficiently used for encoding local features in a highly complex task such as face recognition.

3.1 Introduction

The mammalian brain has a remarkable ability to recognize objects under

widely varying conditions. To perform this task, the visual system must

solve the problem of building invariant representations of the objects that

are the sources of the available sensory information. Since the break-

through work on V1 by Hubel and Wiesel (Hubel and Wiesel, 1962) there

has been an increasing number of models of the visual cortex trying to

reproduce the performance of natural vision systems.

A number of these models have been developed to reproduce characteristics

of the visual system such as invariance to shifts in position, rotation, and

scaling. Most of these models are based on the so-called Neocognitron

(Fukushima, 1980b, 2003; Poggio and Bizzi, 2004; Chikkerur et al., 2010),

a hierarchical multilayer network of detectors with varying feature and

spatial tuning.

In this classical framework invariant representations emerge in the form

of activity patterns at the highest level of the network by virtue of the

spatial averaging across the feature detectors at the preceding layers. The

recognition process is based on a large dictionary of features stored in

memory where model neurons in the different layers act as filters tuned to

specific features or feature combinations. In this approach invariances to,

for instance, position, scale and orientation, are achieved at the high cost

of increasing the number of connections between the layers of the network.

In recent years a novel model of object recognition has been proposed

based on the specific connectivity template found in the visual cortex


combined with the temporal dynamics of neural populations. The so-called Temporal Population Code, or TPC, emphasizes the property of

densely coupled networks to rapidly encode the geometric organization of

its inputs into the temporal structure of its population response (Wyss

et al., 2003a,c). In the TPC architecture the spatial information provided

by an input stimulus is transformed into a temporal representation. The

encoding is defined by the spatially averaged spikes of a population of

integrate and fire neurons over a certain time window.

In this paper we make use of the TPC scheme to encode one of the most

complex stimuli mammals are challenged by: faces (Zhao et al., 2003;

Martelli et al., 2005; Sirovich and Meytlis, 2009). In the brain, the recog-

nition process of known faces takes place through a complex set of local

and distributed processes that interact dynamically (Barbeau et al., 2008).

Recent results suggest the existence of specialized areas populated by neu-

rons selective to specific faces. For instance, in the macaque brain the

middle face patch has been identified where neurons detect and differenti-

ate faces using a strategy based on the distinct constellations of face parts

(Freiwald et al., 2009). Nevertheless, these higher cortical areas where

curves and spots are assembled into coherent objects must be fed by a

canonical computational principle. Our model is based on the hypothesis

that the TPC is the canonical feature encoder of the visual cortex that

feeds higher areas involved in the recognition process.

The model is implemented and tested in a real-time context. The robot

platform used for the experiments is the humanoid robot iCub (Metta

et al., 2008). In the face recognition task, we demonstrate that the TPC

model can recognize faces under different natural conditions. We also show

that the speed of encoding is compatible with the human visual system.

Finally, we present a comparison between the TPC face recognition and

other models available in the literature using a standard face database.

In the next section we give a full description of the model and the parameters used for the simulations.


3.2 Methods

The model we propose here consists of a retinotopic map of spiking neu-

rons with properties found in the primary visual cortex V1 (Wyss et al.,

2003c,a). The network consists of NxN neurons interconnected with a

circular neighborhood with synapses of equal strength and instantaneous

excitatory conductances. The transmission delays are related to the Eu-

clidean distance between the positions of the pre- and postsynaptic neurons

in the map. The stimuli are continuously presented to the network and

the spreading spatial activity of the V1 units is integrated over a time window, as a sum of their action potentials. The outcome is the so-called TPC, as shown in the previous chapter (Fig. 2.1).

The TPC network

In the network, each modeled neuron is approximated by a leaky integrate-

and-fire unit. The time course of its membrane voltage is given by:

$C_m \frac{dV}{dt} = -(I_e(t) + I_k(t) + I_l(t))$.  (3.1)

$C_m$ is the membrane capacitance and $I$ represents the transmembrane currents: the excitatory current ($I_e$), the spike-triggered potassium current ($I_k$) and the leak current ($I_l$). Each current is computed by multiplying its conductance $g$ by the driving force, as follows:

$I(t) = g(t)(V(t) - V_r)$  (3.2)

where $V_r$ is the reversal potential of the conductance. The unit's activity at time $t$ is given by:

$A(t) = a\,H(V(t) - \theta)$  (3.3)


where $a \in [0, 1]$ is the spike amplitude of the unit, and $V(t)$ is determined by the total input composed of the external stimulus and internal connections. $H$ denotes the Heaviside function and $\theta$ is the firing threshold. After a spike is generated, the unit's potential is reset to $V_r$. The time course of the potassium conductance $g_k$ is given by:

$\tau_K \frac{dg_k}{dt} = -(g_k(t) - g_k^{peak} A(t))$  (3.4)

where $A(t) = H(V(t) - \theta)$.

The excitatory input is composed of the sum of two components: a constant driving excitatory input $g_i$ and the synaptic conductances given by the lateral interaction of the units, $g_c(t)$:

$g_e(t) = g_i + g_c(t)$  (3.5)
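A minimal Python sketch can make Eqs. (3.1)-(3.5) concrete. It uses the parameter values of Table 3.1 and, as assumptions on our part, reads $\tau_K$ as 1 ms and the post-spike reset potential as the leak reversal potential.

```python
# Minimal sketch of a single leaky integrate-and-fire unit, Eqs. (3.1)-(3.5).
# Parameter values from Table 3.1; tau_K read as 1 ms and the post-spike
# reset potential as V_r^l -- both assumptions on our part.
Cm, theta = 0.2e-9, -55e-3                  # capacitance [F], threshold [V]
Vr_e, Vr_k, Vr_l = 60e-3, -90e-3, -70e-3    # reversal potentials [V]
g_l, g_k_peak, tau_k = 20e-9, 200e-9, 1e-3  # conductances [S], tau_K [s]
g_i, dt = 5e-9, 1e-3                        # driving input [S], 1 ms step

V, g_k, spikes = Vr_l, 0.0, []
for t in range(100):                        # 100 ms of simulated time
    g_c = 0.0                               # lateral input, zero in this sketch
    g_e = g_i + g_c                         # Eq. (3.5)
    I_e = g_e * (V - Vr_e)                  # Eq. (3.2), one term per current
    I_k = g_k * (V - Vr_k)
    I_l = g_l * (V - Vr_l)
    V += dt * -(I_e + I_k + I_l) / Cm       # Euler step of Eq. (3.1)
    A = 1.0 if V >= theta else 0.0          # Eq. (3.3) with a = 1, H = step
    g_k += dt * -(g_k - g_k_peak * A) / tau_k   # Eq. (3.4)
    if A:
        spikes.append(t)
        V = Vr_l                            # reset after the spike
```

With this drive the membrane depolarizes from rest to threshold within a few milliseconds, consistent with the onset timing discussed below.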

In the visual cortex it has been observed that many cells produce strong

activity for a specific optimal input pattern. De Valois (De Valois et al.,

1982) showed that the receptive field of simple cells is sensitive to specific

position, orientation and spatial frequency. This behavior can be formal-

ized as a multidimensional tuning response that can be approximated by

a Gaussian-like template-matching operation.

In our model each unit is characterized by an orientation selectivity angle $\phi \in \{0, \pi/4, 2\pi/4, 3\pi/4\}$ and a bi-dimensional vector $x \in \mathbb{R}^2$ specifying its receptive field's center location within the input plane. So each unit in the network is defined by $u(x, \phi)$, which we envision as approximating the functional notion of a cortical column.

The receptive field of each unit is approximated using the second derivative of an isotropic Gaussian filter $G_\sigma$. The input image is convolved with $G_\sigma$ to reproduce the V1 receptive field's characteristics for orientation selectivity. The input stimulus is a retinal gray scale image that fills the whole visual field.


The convolution with $\partial^2_{x(\phi)} G_\sigma$ is computed by twice applying a sampled first derivative of a Gaussian ($\sigma = 1.5$, size 11x11 pixels) over the four orientations $\phi$ (Bayerl and Neumann, 2004). The filter output is truncated according to a threshold $T_i \in [0, 1]$, where the values above $T_i$ are set to $E_x$. Each orientation selective output is projected onto a specific network layer.
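The sketch below builds an equivalent oriented kernel analytically (the second directional derivative of an isotropic Gaussian) instead of applying the sampled first derivative twice; that shortcut, and the response normalization before thresholding, are our assumptions for illustration.

```python
import numpy as np
from scipy.signal import convolve2d

# Hedged sketch of the oriented V1 input stage: build the second directional
# derivative of an isotropic Gaussian along angle phi and threshold at Ti.
def oriented_filter(phi, sigma=1.5, size=11):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(phi) + y * np.sin(phi)            # coordinate along phi
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return (u**2 / sigma**4 - 1 / sigma**2) * g      # d^2 G / d u^2

def v1_input(image, phi, Ti=0.6, Ex=1.0):
    out = np.abs(convolve2d(image, oriented_filter(phi), mode="same"))
    out /= out.max() + 1e-12                         # normalize to [0, 1]
    return np.where(out > Ti, Ex, 0.0)               # truncate at Ti

phis = [0, np.pi / 4, 2 * np.pi / 4, 3 * np.pi / 4]  # the four orientations
image = np.random.rand(128, 128)                     # stand-in retinal input
layers = [v1_input(image, phi) for phi in phis]      # one layer per phi
```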

The lateral connectivity between V1 units is exclusively excitatory, and a unit $u_a(x, \phi)$ connects to $u_b$ if the following conditions are met (see the sketch after the next paragraph):

1. they are in the same population: $\phi_a = \phi_b$;

2. they have different center positions: $x_a \neq x_b$;

3. $u_b$ lies within a region of a certain diameter: $\|x_b - x_a\| < r$.

The synaptic efficacies are of equal strength $w$ and the transmission delays $\tau_a$ are proportional to $\|x_b - x_a\|$ with 1 ms/cell. The connectivity diameter is equal to $d_{max}$. In the discrete-time case, all the equations are integrated with Euler's method using a temporal resolution of 1 ms, so one simulation time step is equivalent to 1 ms. All the simulations were done using the parameters of Table 3.1. When a unit $u$ receives the maximum excitatory input $E_x$ at onset time $t = 0$ ms, the first spike occurs at 6 ms.
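A small sketch of the wiring rule within one orientation layer, so condition 1 holds by construction. The radius is taken as $d_{max}/2$, our reading of the stated connectivity diameter, and the grid is shrunk for illustration.

```python
import numpy as np

# Sketch of the lateral wiring rule (conditions 1-3 above) inside one
# orientation layer. Radius taken as d_max / 2 -- our reading of the stated
# diameter -- and a 16x16 grid stands in for the 128x128 network.
N, d_max, w = 16, 7, 0.1e-9
centers = [(i, j) for i in range(N) for j in range(N)]

connections = []   # (pre, post, weight, delay in ms)
for a, (ia, ja) in enumerate(centers):
    for b, (ib, jb) in enumerate(centers):
        dist = np.hypot(ib - ia, jb - ja)
        if a != b and dist < d_max / 2:              # conditions 2 and 3
            connections.append((a, b, w, int(round(dist))))  # 1 ms per cell
```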

The TPC is generated by summing the network’s activity over a time

window of T ms. Finally, the output vectors from different orientation

layers are summed forming the TPC vector. In this way the temporal

representation of the TPC is naturally invariant to position. Theoretically, rotation invariance is achieved by increasing the number of orientation selective layers, approaching a fully invariant architecture as the number of orientation layers grows.


Variable    Description                                     Value
N           network dimension                               128x128 neurons
C_m         membrane capacitance                            0.2 nF
V_r^e       excitatory reversal potential                   60 mV
V_r^k       spike-triggered potassium reversal potential    -90 mV
V_r^l       leak reversal potential                         -70 mV
θ           firing threshold                                -55 mV
τ_K         time constant                                   1 ms/cell
g_k^peak    peak potassium conductance                      200 nS
g_l         leak conductance                                20 nS
a           spike amplitude                                 1
g_i         excitatory input conductance                    5 nS
T_i         minimum V1 input threshold                      0.6
d_max       lateral connectivity diameter                   7 units
w           synapse strength                                0.1 nS

Table 3.1: Parameters used for the simulations. See text for further explanation.

Optimizing the lateral spreading of activation in the network

The key part of the model is the lateral propagation of activity, denoted by $g_c(t)$ in equation (3.5). Since hundreds of spikes can occur at each time step, an efficient and fast real-time implementation is required. This is achieved by performing consecutive multiplications in the frequency domain.

The lateral propagation of activity is performed after each spike at every time step, i.e. every 1 ms in our simulations. The lateral spread can be seen as consecutive square regions of propagation $f_d$, or filters of lateral propagation, with increasing scale over a short period of time (Fig. 3.1). The maximum scale of $f$ is given by the lateral connection diameter $d_{max}$. The magnitude of the spreading activation is given by the synapse strength $w$. For the simulations $d_{max} = 7$, thus the $f_d$ are of scale 3x3, 5x5 and 7x7. They are

generated and stored in the frequency domain (denoted by Fd) when the

network is initialized. In the implementation the Discrete Fourier Trans-


form (DFT) operations were done using the FFTW3 library (Frigo and

Johnson, 2005).

[Figure 3.1 diagram: the matrix of spikes at time t is transformed to the frequency domain, multiplied by the propagation filters F3 and F5 at t+1 and t+2, and inverse-transformed (iDFT) to yield the lateral propagation g_c(t).]

Figure 3.1: Schematic of the lateral propagation using the Discrete Fourier Transform (DFT). The lateral propagation is performed as a filtering operation in the frequency domain. The matrix of spikes generated at time t is multiplied by the filters of lateral propagation Fd over the next time steps t+1 and t+2. The output is given by the inverse DFT of the result.

At time $t$ we calculate the DFT of the spike matrix, denoted by $M$. The matrix $M$ is composed of zeros and ones (ones where spikes occurred). The propagation of activity is achieved by multiplying $M$ by $F_d$ for $d = 3, 5$ and $7$ over the time steps $t+1$, $t+2$ and $t+3$ respectively. Finally the inverse DFT is applied. Using a 2.66 GHz Intel Core i7 processor under Linux, each time step takes approximately 9 ms to process.
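A hedged numpy sketch of this scheme follows; the kernel construction and the omitted scaling by the synapse strength $w$ are our simplifications.

```python
import numpy as np

# Sketch of the frequency-domain propagation: precompute the spectra F_d of
# the d x d propagation kernels, multiply them with the DFT of the spike
# matrix M, and invert (Fig. 3.1). Scaling by w is omitted here.
N = 128

def filter_spectrum(d, grid):
    kernel = np.zeros((grid, grid))
    kernel[:d, :d] = 1.0                          # d x d square of ones
    kernel = np.roll(kernel, (-(d // 2), -(d // 2)), axis=(0, 1))  # center it
    return np.fft.fft2(kernel)

F = {d: filter_spectrum(d, N) for d in (3, 5, 7)}  # built at initialization

rng = np.random.default_rng(1)
M = (rng.random((N, N)) < 0.01).astype(float)      # spike matrix at time t
M_hat = np.fft.fft2(M)

# contribution of these spikes to g_c at t+1, t+2 and t+3 respectively
g_c = [np.real(np.fft.ifft2(M_hat * F[d])) for d in (3, 5, 7)]
```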


Clustering algorithm used in the classifier

The network’s responses to stimuli from C stimulus classes S1, S2, ..., SC

are assigned to C response classes R1, R2, ..., RC of the training set in a

supervised manner. The result is a C × C hit matrix N(Sα, Rβ), whose

entries denote the number of times that a stimulus from class Sα elicits a

response in class Rβ. Initially, the matrix N(Sα, Rβ) is set to zero. For

each response r ∈ Sα, we calculate the mean Pearson correlation coefficient

over r and the responses r′ 6= r elicitepd by stimuli of class Sγ :

ρ(r, Sγ) = 〈‖ ρ(r, r′) ‖〉r′ elicited by Sγ (3.6)

where 〈.〉 denotes the average among the correlation between r and r′

denoted by ρ(r, r′). The response r is classified into the response-class

Rβ for that ρ(r, Sβ) is maximal, and N(Sα, Rβ) is incremented by one.

The overall classification ratio in percentage is calculated by summing the diagonal of $N(S_\alpha, R_\beta)$ and dividing by the total number of elements

in the classification set R. For the statistics reported here, the data was

generated in real-time and analyzed off-line.
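The rule of Eq. (3.6) can be sketched in a few lines of Python; the responses below are synthetic stand-ins, and the absolute value of the Pearson correlation is our reading of the norm in the equation.

```python
import numpy as np

# Sketch of the clustering rule of Eq. (3.6) on synthetic stand-in responses:
# each response r is assigned to the class whose training responses have the
# highest mean |Pearson correlation| with r.
rng = np.random.default_rng(2)
C, n_train, T = 3, 5, 64
train = [[rng.random(T) for _ in range(n_train)] for _ in range(C)]
test = [[rng.random(T) for _ in range(4)] for _ in range(C)]

def rho(r, class_responses):
    # mean |Pearson correlation| between r and one class's responses
    return np.mean([abs(np.corrcoef(r, rp)[0, 1]) for rp in class_responses])

hits = np.zeros((C, C), dtype=int)                 # hit matrix N(S_a, R_b)
for alpha in range(C):
    for r in test[alpha]:
        beta = int(np.argmax([rho(r, train[g]) for g in range(C)]))
        hits[alpha, beta] += 1
ratio = 100.0 * np.trace(hits) / hits.sum()        # classification ratio [%]
```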

TPC and DAC architecture

The face recognition system was included in a larger architecture called

BASSIS. We have recently used this architecture as a framework for human-

robot interaction (Luvizotto et al., 2012a)1. BASSIS is fully grounded in

the Distributed Adaptive Control (DAC) architecture (Verschure and Al-

thaus, 2003; Duff and Verschure, 2010). DAC consists of three tightly

coupled layers: reactive, adaptive and contextual. At each level of organi-

zation increasingly more complex and memory dependent mappings from

sensory states to actions are generated dependent on the internal state

1 This work was done in collaboration with partners from the European project EFAA.


of the agent. The BASSIS architecture expands DAC with many new

components. It incorporates the repertoire of capabilities available for the

human to interact with the humanoid robot iCub. BASSIS is divided into four hierarchical layers: Contextual, Adaptive, Reactive and Soma. The differ-

ent modules responsible for implementing the relevant competences of the

humanoid are distributed over these layers (Fig. 3.2).

[Figure 3.2 diagram: the Contextual layer contains the OPC and the Supervisor; the Adaptive layer contains the Face Recognition module; the Reactive layer contains faceDetector, tpcNetwork and pmpActions; the Soma layer contains cartesianControl and gazeControl.]

Figure 3.2: BASSIS is a multi-scale biomimetic architecture organized at three different levels of control: reactive, adaptive and contextual. It is based on the well-established DAC architecture. See text for further details.

The Soma and the Reactive layers are the bottom layers of the BASSIS diagram. They are in charge of coding the perceptual inputs through the faceDetector module, which acquires the position of the faces in the visual field, and the tpcNetwork module, which encodes the cropped face collected using the infor-


mation given by the faceDetector. The Reactive layer also deals with the

robot motor primitives. The pmpActions package provides motor control

mechanisms that allow the robot to perform complex manipulation tasks.

It relies on a revised version of the so-called Passive Motion Paradigm (Mo-

han et al., 2009) that uses the cartesian controller developed by Pattacini

et al. (2010) to solve the inverse kinematics problem while fulfilling additional constraints such as joint limits.

In the Adaptive layer, the classification of the encoded face is performed

using the cluster algorithm detailed in section 3.2.

At the top of the BASSIS diagram lies the Contextual layer, which is composed of two main entities: the Supervisor and the Objects Properties Collector (OPC). The Supervisor module essentially manages the spoken language system in the interaction with the human partner, as well as the synthesis of the robot's voice. The OPC reflects the current knowledge the system is able to acquire from the environment, collecting data such as object coordinates in multi-reference frames (both in image and robot domains). The OPC thus plays a central role as the database where a large variety of information is stored from the beginning of the experiments.

The iCub

The iCub is a humanoid robot one meter tall, with dimensions similar to a 3.5-year-old child. It has 53 actuated degrees of freedom distributed over

the hands, arms, head and legs (Metta et al., 2008). The sensory inputs are

provided by artificial skin, cameras and microphones. The robot is equipped

with novel artificial skin covering the hands (fingertips and the palm) and

the forearms (Cannata et al., 2008). The stereo vision is provided by

stereo cameras in a swivel mounting. The stereo sound capturing is done

using two microphones. Both cameras and microphones are located where

eyes and ears would be placed in a human. The facial expressions, mouth

and eyebrows, are projected from behind the face panel using lines of red


LEDs. It also has the sense of proprioception (body configuration) and

movement (using accelerometers and gyroscopes).

The different software modules in the architecture are interconnected us-

ing YARP (Metta et al., 2006). This framework supports distributed computation with a focus on robot control and efficiency. The communi-

cation between two modules using YARP happens between objects called

”ports”. The ports can exchange data over different network protocols

such as TCP and UDP.

Currently, the iCub is capable of reliably coordinating reach and grasp motions, expressing emotions through facial expressions, controlling forces by exploiting its force/torque sensors, and gazing at points in the visual field. The interaction

can be performed using spoken language. With all these features, the

iCub is an outstanding platform for research in social interaction between

humans and humanoid robots.

Object Properties Collector (OPC)

The OPC module collects and stores the data produced during the interaction. The module consists of a database where a large variety of information is stored, starting at the beginning of the experiment and growing as the interaction with humans evolves over time. In this setup, the data provided by the face recognition model, and possibly information about other objects, feed the memory of the robot represented by the OPC.

OPC reflects what the system knows about the environment and what

is available. The data collected range from object coordinates in multi-

reference frames (both in the image and robot domains) to the human’s

emotional states or position in the interaction process.

Entities in OPC are addressed with unique identifiers and managed dy-

namically, meaning that they can be added, modified or removed at run-time

safely and easily from multiple sources located anywhere in the network,

practically without limitations. Entities consist of all the possible elements


that enter or leave the scene: objects, table cursors, matrices used for geo-

metric projection, as well as the robot and the human themselves. All the

properties related to the entities can be manipulated and used to populate

the centralized database. A property in the OPC vocabulary is identified

by a tag specified with a string (a sequence of characters). The values that

can be assigned to a tag can be either single strings or numbers, or even

a nested list of strings and numbers.

Spoken Language Interaction and Supervisor

The spoken language interaction is managed by the Supervisor. It is im-

plemented in the CSLU Toolkit (Sutton et al., 1998) Rapid Application

Development (RAD) in the form of a finite-state dialogue system, where

the speech synthesis is done by Festival and the recognition by Sphinx-II.

These state modules provide functions such as conditional transition to new states based on the words and sentences recognized, and thus conditional execution of code based on the current state and the spoken input from the user. The recognition and the extraction of the useful informa-

tion from the recognized sentence is based on a simple grammar that can

be developed according to each application (Lallee et al., 2010).

The stimulus set used in the face recognition experiments

Real-time

The images used as input to the network were acquired using the robot’s

camera with a resolution of 640x480 pixels (Fig. 3.3). The faces were

detected and cropped using a cascade of boosted classifiers based on haar-

like features (Li et al., 2004). After the detection process the images are

rescaled to a resolution of 128x128 pixels covering the whole scene.
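A hedged sketch of this acquisition stage is shown below. The cascade file is OpenCV's stock frontal-face model, not the thesis's own detector, and the blank frame stands in for the iCub camera image.

```python
import numpy as np
import cv2

# Hedged sketch of the acquisition stage: detect a face with an OpenCV
# boosted Haar cascade, crop it and rescale it to the 128x128 network input.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = np.zeros((480, 640, 3), dtype=np.uint8)      # 640x480 stand-in frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:                           # empty on the blank frame
    face = cv2.resize(gray[y:y + h, x:x + w], (128, 128))  # network input
```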

We ran the experiments reported here with faces from 6 subjects, a sample size sufficient for most social robotics tasks such as games, interaction or


[Figure 3.3 pipeline: iCub camera → face detection → orientation filters → V1 network → TPC.]

Figure 3.3: Model overview. The faces are detected and cropped from the input provided by the iCub's camera image. The cropped faces are resized to a fixed resolution of 128x128 pixels and convolved with the orientation selective filters. The output of each orientation layer is processed by separate neural populations as explained above. The spike activity is summed over a specific time window, rendering the Temporal Population Code, or TPC.

human assistance. The stimulus set consists of 6 classes of faces, one per

subject, where each class has 7 variations of the same face under different

poses. For the statistics 2 faces are used as training set and 5 faces as

classification set for each class. The training stimuli are not used for

classification and vice-versa.

Standard face database

In order to compare the proposed system to standard face recognition

methods available in the literature, we use the cropped Yale Face Database

B (Georghiades et al., 2001; Lee et al., 2005). In this dataset, there are

10 classes with different human subjects. Each class has variations of the

same face under different angles and positions of light incidence, divided into four subsets of increasing difficulty (Fig. 3.4). In this quantitative study we use

lt = 7 faces of the subset 1 and lc = 12 faces of the subset 2 as training set.

The subsets 2, 3 and 4 are used as classification set. The training stimuli


are not used for classification and vice-versa, except when the subset 2

is used for classification. All images are in original resolution and aspect

ratio.

Figure 3.4: Cropped faces from the Yale Face Database B used as a benchmark to compare TPC with other methods of face recognition available in the literature (Georghiades et al., 2001; Lee et al., 2005).

3.3 Results

In the first experiment we investigate the capabilities of our model in a face recognition task using the iCub. We also evaluate the optimal number of time steps required for the recognition process, a free parameter that determines how many iterations the encoding process needs to support reliable classification. This parameter directly affects both the processing time, and therefore the overall system performance, and the length of the TPC vector, and thus the performance/compression ratio of the data. We calculate the classification ratio over the network's activity trace for different time windows after stimulus onset, ranging from 2 to 130 ms in steps of 1 ms, used to compute the correlation between the TPCs of the training and test sets.

The simulations were performed with the parameters of Table 3.1. We do not use any light normalization mechanism in the model; for that reason we performed the simulations for different values of the input threshold $T_i$ (see methods).


The classification performance shows a maximum of 87 % within 21 ms

after stimulus onset using Ti = 0.6 (Fig. 3.5). This speed of optimal

stimulus encoding is compatible with the speed of processing observed in

the mammalian visual system (Thorpe et al., 1996). A TPC vector over

this time window a represents a face with 22 points covering about 1%

of the total number of pixels in a 128x128 image. The misclassified faces

show strong deviations in position or light conditions (Fig. 3.7 and Fig.

3.6).

[Figure 3.5 plot: classification [%] versus time [ms] (0-150 ms) for Ti = 0.2, 0.4, 0.6 and 0.8.]

Figure 3.5: Speed of encoding. Average correct classification ratio given by the network's activity trace as a function of time, for different values of the input threshold Ti.

In the second experiment we show that the recognition performance can be enhanced by changing the readout mechanism. Here the network spike activity is read out over $S_r$ adjacent square sub-regions of the same aspect ratio, and the TPC vectors from the different sub-regions are concatenated to form the overall representation of the face. We used four different configurations of sub-regions with a time window of 21 ms (Fig. 3.8). This operation increases the size of the TPC vector that represents a face by a factor of $S_r$.
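The sub-region readout can be sketched as follows; the spike array and window length are synthetic stand-ins for the network's activity.

```python
import numpy as np

# Sketch of the sub-region readout: spike activity is summed over Sr adjacent
# square regions and the per-region TPC vectors are concatenated.
def subregion_tpc(activity, n_side):
    # activity: (N, N, T) array of spikes; n_side**2 = Sr sub-regions
    N, step = activity.shape[0], activity.shape[0] // n_side
    vecs = [activity[i:i + step, j:j + step].sum(axis=(0, 1))
            for i in range(0, N, step) for j in range(0, N, step)]
    return np.concatenate(vecs)                        # length Sr * T

rng = np.random.default_rng(3)
A = (rng.random((128, 128, 21)) < 0.02).astype(float)  # 21 ms window
tpc_16 = subregion_tpc(A, 4)                           # Sr = 16 sub-regions
```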

The results show a maximum performance of 97%. They also suggest an optimal performance/compression ratio with 16 sub-regions, i.e. 16 x 22 points per vector.

Figure 3.6: Response clustering. The entries of the hit matrix represent the number of times a stimulus class is assigned to a response class over the optimal time window of 21 ms and Ti of 0.6.

In the last part, we show the results obtained with the TPC model us-

ing cropped faces from the Yale Face Database B. We followed the same

procedure, reading out the network over Sr adjacent square sub-regions

of the same aspect ratio and also varying the values of the input thresh-

old $T_i$ (Fig. 3.9). In Table 3.2, we summarize the TPC performance in comparison with other face recognition methods. The TPC results presented in the table were calculated using 64 sub-regions, totaling 64 x 22 = 1408 points per vector. This corresponds to approximately 5% of the number of pixels (192x168 = 32256) of the input image. The results show that both performance and compression ratio are fully competitive with state-of-the-art methods for face recognition.


Figure 3.7: Face data set. Training set (green), classification set (blue) and misclassified faces (red).

[Figure 3.8 plot: classification [%] (50-100) versus number of sub-regions (1, 4, 16, 64).]

Figure 3.8: Classification ratio using a spatial subsampling strategy where the network activity is read over multiple sub-regions.


[Figure 3.9: four panels (1, 4, 16 and 64 sub-regions) plotting classification performance [%] against subsets 2-4 for Ti = 0.2, 0.4, 0.6 and 0.8.]

Figure 3.9: Classification ratio using the cropped faces from the Yale database.

3.4 Discussion

In this paper, we addressed the question of whether we could reliably recognize faces using a biologically plausible cortical neural network exploiting the

Temporal Population Code. Moreover, we investigated whether it could

be performed in real time using the iCub robot. We have shown that in

our cortical model, the representation of a static visual stimulus is gener-

ated based on the temporal dynamics of neuronal populations given by a

specific signature in the phase relationships of the neuronal responses.

In the specific benchmark evaluated here, the model showed a good per-

formance both in terms of the ratio of correct classification and processing


Table 3.2: Comparison of TPC with other face recognition methods. The results were extracted from Lee et al. (2005), where the reader can find the references for each method.

Method                                Classification Ratio (%) vs. Illum.
                                      Subset 1&2   Subset 3   Subset 4
Correlation                              100.0        76.7       23.7
Eigenfaces                               100.0        74.2       24.3
Eigenfaces w/o 1st 3                     100.0        80.8       33.6
Linear Subspace                          100.0       100.0       85.0
Cones-attached                           100.0       100.0       91.4
Harmonic images (no cast shadows)        100.0       100.0       96.4
9PL (simulated images)                   100.0       100.0       97.2
Harmonic images (with cast shadows)      100.0       100.0       97.3
TPC                                      100.0       100.0       98.0
Gradient Angle                           100.0       100.0       98.6
Cones-Cast                               100.0       100.0      100.0
9PL (real images)                        100.0       100.0      100.0

time. Importantly, we showed that the TPC model can recognize faces

with a correct ratio of 97 % without any face-specific mechanism in the

network. Indeed, similar filter properties have been used for character

recognition and spatial cognition tasks. The processing time needed for

classifying an input of 128x128 pixels was about 200 ms. In comparison, generic face recognition methods such as correlation and eigenfaces show reduced recognition performance (Lee et al., 2005).

In the implementation reported here we used only one frequency resolution

to model the receptive field characteristics of simple cells. This reduces the number of operations and therefore improves the recognition speed. The

optimal frequency response was based on previous studies of V1 (Bayerl

and Neumann, 2004).

We expect that the performance and generality of the model can be further improved in a future implementation using multiple processors in parallel, especially since the features could be extracted by Gaussian filters optimally arranged over different scales, similar to the SIFT transform (Lowe, 2004). With such a scheme we could achieve scale invariance without

relying on re-scaling the cropped face image. However, the ideal architec-

ture for TPC is a neuromorphic implementation where the integrate and

fire units can work fully in parallel, which calls for more specialized parallel

hardware.

Our results confirm that TPC is a relevant hypothesis on the encoding of

sensory events by the brain of both vertebrates and invertebrates given its

performance, speed of processing and consistency with cortical anatomy

and physiology. Moreover, it reinforced the main characteristic of the

TPC: its independence of the exact detailed properties of the input stim-

ulus. The network does not need to be rewired to encode a stimulus as complex as a human face, and thus it qualifies as a canonical cortical input processor.

Chapter 4

Using a temporal population to recognize gestures on the

humanoid robot iCub.

4.1 Introduction

Gesture recognition is a core human ability that improves our capacity for interaction and communication. Through gestures we can reveal unspoken thoughts that are crucial for verbal communication (Mitra et al., 2012) or even interact with machines (Kurtenbach and Hulteen, 1990). Gesture also plays a role in shaping our knowledge during the learning phase of our childhood (Goldin-Meadow, 2009).

Several studies have been dedicated to understanding and categorizing human gestures from a psychological point of view. According to Cadoz (1984) and McNeill (1996), gestures can be used to communicate meaningful information (semiotic), to manipulate the physical world and create artifacts (ergotic), or to learn from the environment through tactile or haptic exploration (epistemic).



Gonzalez Rothi et al. (1991) suggest the existence of different neural mechanisms for imitating meaningful and meaningless actions. Meaningless gestures can be imitated using a direct route that connects the visual analysis to the innervatory patterns, whereas meaningful gestures can also be imitated using a semantic route that comprises different processing stages.

Independent of the kind of gesture, meaningful or not, higher cognitive

areas require a sophisticated representation of the visual content. Fur-

thermore, this representation must be invariant to a number of transfor-

mations caused by varying viewing angles, different scene configurations,

light sources and deformations in form.

In earlier work we proposed a model of the primary visual cortex that

makes use of known cortical properties to perform a spatial to temporal transformation into a representation invariant to rotation, translation and

scale. We showed that the dense excitatory local connectivity found in the

primary visual cortex of the mammalian brain can play a specific role in

the rapid and robust transformation of spatial stimulus information into

a, so called, Temporal Population Code (or TPC) (Wyss et al., 2003a).

In this study, we propose to apply the high-capacity encoding provided

by TPC to a gesture recognition task. We want to evaluate whether the performance of the TPC, both in terms of recognition rate and computation speed, is feasible for real-time applications. To our knowledge, there are very few models of object recognition that take into account the anatomy of biological systems. Most current systems try to reproduce the characteristics of human gesture recognition abilities through purely algorithmic solutions (Maurer et al., 2005).

The model proposed here is benchmarked using the TPC library developed

for the iCub robot (Luvizotto et al., 2011). The task is motivated by the

Rock-Paper-Scissors hand game where the human player competes with

the robot. Here we show the initial results of the proposed system using

images obtained from the robot’s camera in a real game scenario. We


Figure 4.1: The gestures used in the Rock-Paper-Scissors game. In the game, the players usually count to three before showing their gestures. The objective is to select a gesture which defeats that of the opponent: rock breaks scissors, scissors cut paper and paper covers rock. Unlike a truly random selection method, such as flipping a coin or throwing a die, in the Rock-Paper-Scissors game it is possible to recognize and predict the behavior of an opponent.

present a detailed description of the model used in the experiments and

the statistical analysis of the recognition performance using a standard

cluster algorithm. The results are then compared to a predefined baseline

classification rate. We then finish the chapter with a discussion of the

main findings of this ongoing study.

4.2 Material and methods

The model consists of a topographic map of laterally connected spiking

neurons with properties found in the primary visual cortex V1 (Fig. 4.2)

(Wyss et al., 2003c,a). In the first stage we segment the hand from the

background using color segmentation. The input image is transformed

into hue, saturation and value (HSV) space and then thresholded. The

threshold values for the three parameters H, S and V can be set at runtime

using a graphical interface.
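A hedged sketch of this segmentation stage is shown below; the bounds are illustrative skin-tone values, not the runtime-tuned thresholds mentioned in the text, and the blank frame stands in for the iCub camera image.

```python
import numpy as np
import cv2

# Hedged sketch of the color segmentation stage: threshold the frame in HSV
# space to isolate the hand. The H/S/V bounds are illustrative only.
frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in camera image
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
lower = np.array([0, 40, 60], dtype=np.uint8)
upper = np.array([25, 255, 255], dtype=np.uint8)
mask = cv2.inRange(hsv, lower, upper)            # 255 where H, S, V in range
hand = cv2.bitwise_and(frame, frame, mask=mask)  # segmented hand pixels
```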


[Figure 4.2 pipeline: iCub camera → color segmentation → orientation filters → V1 network → TPC → wavelet readout.]

Figure 4.2: Visual model diagram. In the first step, the input image is color filtered in the hue, saturation and value (HSV) space and segmented from the background. The segmented image is then projected to the LGN stage where its edges are enhanced. In the next stage, the LGN output passes through a set of Gabor filters that resemble the orientation selectivity characteristics found in the receptive field of V1 neurons. Here we show the output response of one Gabor filter as input for the V1 spiking model. After the image onset, the sum of the V1 network's spiking activity over time gives rise to a temporal representation of the input image. This temporal signature of the spatial input is the so-called temporal population code, or TPC. The TPC output is then read out by a wavelet readout system.

The segmented hand is then projected to the V1 spiking model, where

the coding concept is illustrated in Fig. 4.3. The network is an array of

NxN model neurons connected to a circular neighborhood with synapses

of equal strength and instantaneous excitatory conductance. The trans-

mission delays are related to the Euclidean distance between the positions

of the pre-and postsynaptic neurons. The stimulus is continuously pre-

sented to the network and the spatially integrated spreading activity of

the V1 units, as a sum of their action potentials, results in the so-called

TPC signal.

In the network, each neuron is approximated using the simple spiking

model proposed by Izhikevich (Izhikevich, 2003). These modeled neurons


[Figure 4.3 diagram: the hand stimulus projects onto a laterally connected network of model neurons; spikes propagate laterally within a connectivity radius and the summed spikes over time form the TPC.]

Figure 4.3: Schematic of the encoding paradigm. The stimulus, a human hand, is continuously projected onto the network of laterally connected integrate-and-fire model neurons. The lateral transmission delay of these connections is 1 ms/unit. When the membrane potential crosses a certain threshold, a spike occurs and its action potential is distributed over a neighborhood of a given radius. Because of these lateral intra-cortical interactions, the stimulus becomes encoded in the network's activity trace. The TPC signal is generated by summing the total population activity over a certain time window.

are biologically plausible and as computationally efficient as integrate-and-

fire models. Relying only on four parameters, the model can reproduce

spiking behavior of known types of cortical neurons (Izhikevich, 2004)

using a system of ordinary differential equations of the form:

$v' = 0.04v^2 + 5v + 140 - u + I$  (4.1)

$u' = a(bv - u)$  (4.2)

with the auxiliary after-spike resetting:

if $v \geq 30$ mV, then $v \leftarrow c$ and $u \leftarrow u + d$.  (4.3)


Here, $v$ and $u$ are dimensionless variables and $a$, $b$, $c$ and $d$ are dimensionless parameters that determine the spiking or bursting behavior of the neuron unit, with $' = d/dt$, where $t$ is the time. The parameter $a$ describes the time scale of the recovery variable $u$. The parameter $b$ describes the sensitivity of the recovery variable $u$ to the sub-threshold fluctuations of the membrane potential $v$. The parameter $c$ accounts for the after-spike reset value of $v$ caused by the fast high-threshold K+ conductances, and $d$ for the after-spike reset of the recovery variable $u$ caused by slow high-threshold Na+ and K+ conductances. The mathematical analysis of the model can be found in (Izhikevich, 2006).

The excitatory input $I$ in (4.1) consists of two components: a constant driving excitatory input $g_i$ and the synaptic conductances given by the lateral interaction of the units, $g_c(t)$:

$I(t) = g_i + g_c(t)$  (4.4)

For the simulations, we used the parameters suggested in (Izhikevich, 2004) to reproduce regular spiking (RS) behavior (Fig. 4.4). All the parameters used in the simulations are summarized in Table 4.1.
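A minimal Euler sketch of Eqs. (4.1)-(4.3) with the RS parameters of Table 4.1 follows; the two 0.5 ms substeps per step follow Izhikevich's published integration scheme, and the constant drive stands in for Eq. (4.4) with $g_c = 0$.

```python
# Euler sketch of Eqs. (4.1)-(4.3) with the regular-spiking parameters of
# Table 4.1. The lateral input g_c(t) of Eq. (4.4) is omitted here.
a, b, c, d = 0.02, 0.2, -65.0, 8.0
v, u = -70.0, -16.0            # initial values from Table 4.1
I = 20.0                       # constant drive (g_i from Table 4.1)
spike_times = []
for t in range(200):           # 200 ms of simulated time
    for _ in range(2):         # two 0.5 ms substeps of Eq. (4.1)
        v += 0.5 * (0.04 * v**2 + 5 * v + 140 - u + I)
    u += a * (b * v - u)       # Eq. (4.2)
    if v >= 30.0:              # after-spike resetting, Eq. (4.3)
        spike_times.append(t)
        v, u = c, u + d
```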

[Figure 4.4 plot: membrane potential trace (mV) of the regular spiking (RS) unit.]

Figure 4.4: Computational properties, after frequency adaptation, of the integrate-and-fire neuron used in the simulations. The spikes in this regular spiking neuron model occur approximately every 25 ms (40 Hz).


In our model each unit is characterized by an orientation selectivity angle $\phi \in \{0, \pi/6, 2\pi/6, \ldots, \pi\}$ and a bi-dimensional vector $x \in \mathbb{R}^2$ specifying its receptive field's center location within the input plane. So each unit in the network is defined by $u(x, \phi)$, which we envision as approximating the functional notion of a cortical column.

The receptive field of each unit is approximated using the second derivative of an isotropic Gaussian filter $G_\sigma$. The input image is convolved with $G_\sigma$ to reproduce the V1 receptive field's characteristics for orientation selectivity. The input stimulus is a retinal gray scale image that fills the whole visual field.

The convolution with $\partial^2_{x(\phi)} G_\sigma$ is computed by twice applying a sampled first derivative of a Gaussian ($\sigma = 0.7$, size 5x5 pixels) over the six orientations $\phi$ (Bayerl and Neumann, 2004). The filter output is truncated according to a threshold $T_i \in [0, 1]$, where the values above $T_i$ are set to $E_x$. Each orientation selective output is projected onto a specific network layer.

The lateral connectivity between V1 units is exclusively excitatory with strength $w$. A unit $u_a(x, \phi)$ connects with $u_b$ if all of the following conditions are met:

1. they are in the same population: $\phi_a = \phi_b$;

2. they have different center positions: $x_a \neq x_b$;

3. $u_b$ lies within a region of a certain radius: $\|x_b - x_a\| < r$.

In our model we set the connectivity radius $r$ to 11 units, as used in previous TPC studies (Luvizotto et al., 2011). The lateral synapses are of equal strength $w$ and the transmission delays $\tau_a$ are proportional to $\|x_b - x_a\|$ with 1 ms/cell.

The TPC is generated by summing the network’s activity over a time

window of 64 ms. Finally, the output vectors from different orientation


layers are summed forming the TPC vector. In this way the temporal

representation of the TPC is naturally invariant to position. Theoretically, rotation invariance is achieved by increasing the number of orientation selective layers, approaching a fully invariant architecture as the number of orientation layers grows.

Wavelet Readout

The wavelet transform is a spectral analysis technique that uses variable-sized regions as the kernel of the transformation. In comparison with a standard Short Time Fourier Transform (STFT), the wavelet transform does not use a time-frequency region, but rather a time-scale region (Mallat, 1998). One major advantage is the ability to perform local analysis with optimal resolution in both the time and frequency domains, rendering a time-frequency representation (Fig. 4.5).

In the model presented here, we used a neuronal implementation of the Haar wavelet transform. This circuit was recently proposed in a TPC study (Luvizotto et al., 2012c) and is detailed in chapter 2. In this circuit, the TPC signal with $N = 64$ samples (sampling rate of 1 kHz) passes through a set of filters which slice the frequency spectrum into a desired number of bands, i.e., the approximation and detail levels. The approximations are the high-scale, low-frequency components of the signal spectrum; the details are the low-scale, high-frequency components. At each level $l$ of the filtering process a new frequency band emerges and is represented by $N/2^l$ wavelet coefficients. A compact representation is achieved by selecting the coefficients that carry the greatest amount of information to reconstruct the signal, without unwanted redundancies and artifacts.
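The dyadic analysis can be sketched directly in numpy (a plain Haar filter bank, not the thesis's neuronal circuit; the input signal is a synthetic stand-in):

```python
import numpy as np

# Sketch of the dyadic Haar analysis of a 64-sample TPC signal: each level
# halves the signal into approximation (averages) and detail (differences)
# coefficients, so after five levels each band holds only 2 coefficients.
def haar_level(x):
    pairs = x.reshape(-1, 2)
    s = np.sqrt(2.0)
    return (pairs[:, 0] + pairs[:, 1]) / s, (pairs[:, 0] - pairs[:, 1]) / s

rng = np.random.default_rng(4)
signal = rng.random(64)                 # stand-in TPC signal, 1 kHz sampling
approx, details = signal, []
for level in range(5):                  # 64 -> 32 -> 16 -> 8 -> 4 -> 2
    approx, detail = haar_level(approx)
    details.append(detail)
# details[-1] holds the 2 level-5 coefficients of the 15-31 Hz band used in
# the experiments; with 6 orientation layers this yields a 12-sample readout.
```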

Stimulus set

The stimulus set used in the experiment consists of the three different

hand gesture’s classes used in the Rock-Paper-Scissors game (Fig. 4.2).


[Figure 4.5 diagram: the 64-sample TPC signal (level 0) is split at each level into approximation (A) and detail (D) coefficients, down to level 5 with N = 2, covering frequency bands at roughly 500, 250, 125, 62, 31 and 15 Hz.]

Figure 4.5: Wavelet transform time-scale representation. At each resolution level, the number of wavelet coefficients drops by a factor of 2 (dyadic representation), as does the frequency range of the low-pass signal given by the approximation coefficients. The detail coefficients can be interpreted as a band-pass signal, with frequency range equal to the difference in frequency between the current and previous approximation levels.

For each class, 40 stimuli with natural variations of the hand's posture are used, drawn from 4 subjects in total. In the experiments, 75% of the images are used for training and the remaining 25% are used as the classification set. Given the compact representation of the TPC, the training set composed of 90 pictures can be generated offline and stored in memory


Variable   Description                        Value
N          network dimension                  96x72 neurons
a          time scale of the recovery         0.02
b          sensitivity of the recovery        0.2
c_rs       after-spike reset value of v       -65
d_rs       after-spike reset value of u       8
v          membrane potential                 -70
u          membrane recovery                  -16
g_i        excitatory input conductance       20
T_i        minimum V1 input threshold         0.4
r          lateral connectivity radius        11 units
w          synapse strength                   0.4

Table 4.1: Parameters used for the simulations.

(only 20 KB) for online recognition.

Figure 4.6: The stimulus classes (Rock, Paper and Scissors) used in the experiments. Here 12 exemplars for each class are shown.

Cluster algorithm

For the classification we used the following algorithm. The network's responses to stimuli from C stimulus classes S1, S2, ..., SC are assigned to C response classes R1, R2, ..., RC of the training set, yielding a C × C hit matrix N(Sα, Rβ), whose entries denote the number of times that a stimulus from class Sα elicits a response in class Rβ. Initially, the matrix N(Sα, Rβ) is set to zero. For each response r elicited by a stimulus of class Sα, we calculate the mean Euclidean distance of r to the responses r′ ≠ r elicited by stimuli of class Sγ:


ρ(r, Sγ) = ⟨ρ(r, r′)⟩_{r′ elicited by Sγ}        (4.5)

where ⟨·⟩ denotes the average over the temporal Euclidean distances ρ(r, r′) between r and the responses r′. The response r is classified into the response class Rβ for which ρ(r, Sβ) is minimal, and N(Sα, Rβ) is incremented by one. The overall classification ratio, in percent, is calculated by summing the diagonal of N(Sα, Rβ) and dividing by the total number of elements in the classification set R. We chose the same metric used in previous studies in order to establish a direct comparison of the results over different scenarios.
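This procedure can be sketched as follows (Python/NumPy; the function and variable names are our own and the responses are assumed to be stored as vectors):

    import numpy as np

    def hit_matrix(responses, labels, C):
        # N(S_alpha, R_beta): each response is assigned to the class whose
        # other responses are closest on average (eq. 4.5), and the
        # corresponding entry of the hit matrix is incremented.
        hits = np.zeros((C, C), dtype=int)
        for i, (r, alpha) in enumerate(zip(responses, labels)):
            mean_dist = [np.mean([np.linalg.norm(r - r2)
                                  for j, r2 in enumerate(responses)
                                  if labels[j] == gamma and j != i])
                         for gamma in range(C)]
            hits[alpha, int(np.argmin(mean_dist))] += 1
        return hits

    # overall ratio: diagonal hits over the size of the classification set
    # ratio = 100.0 * np.trace(hits) / hits.sum()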

4.3 Results

In the first part of the experiments, we established a classification baseline against which to compare and evaluate the classification performance provided by TPC. We defined the baseline by applying the cluster algorithm (see section 4.2 for details) to the pixel intensity values of the gray-scale images used in the experiments. In this situation, an image of 96x72 pixels is represented by a vector of size 6912. Using 75% of the images as the training set, the results show a correct classification ratio of 70% (Fig. 4.7). In the next step, we ran the simulations using the TPC to compare the classification, speed and compression ratio of the model against this baseline.

In our simulations we used the wavelet readout circuit at approximation level 5 (Dc5) to capture the intrinsic geometric properties of the stimulus. At this level, the signal ranges from 15 Hz to 31 Hz and is represented by only 2 wavelet coefficients. As we have 6 signals provided by the different orientation layers, the total size of the readout signal is 12.

The simulations were performed with the parameters of table 4.1, using a TPC library developed in C++. The performance reaches a maximum correct classification ratio of 93%.


[Figure 4.7: 3x3 hit matrix, response class versus stimulus class, for the baseline.]

Figure 4.7: Classification hit matrix for the baseline. In the classification process, the correlation between the classification set (25% of the stimuli) and the training set (75%) is calculated. A stimulus is assigned to the class with the highest mean correlation to the respective class in the training set.

The misclassifications occur mostly for class one, i.e. the Rock gesture (Fig. 4.8). This gesture overlaps to a great extent with the other two and is therefore more prone to misclassification. This classification ratio is compatible with previous TPC work, for instance on face and handwritten character recognition (Wyss et al., 2003c; Luvizotto et al., 2011). On a 2.67 GHz Core i7 processor, the processing time per hand is about 0.3 s. In comparison to the baseline, the TPC representation shows a gain in correct classification of 23% while reducing the representation by a remarkable factor of 576, i.e. from 6912 pixels at a resolution of 96x72 to 12 Dc5 wavelet coefficients.


8

2

10

10

Stimulus Class

Res

pons

e C

lass

Figure 4.8: Classification hit matrix. In the classification process, the euclideandistance between the classification set and the prototypes are calculated. A stim-ulus is assigned to a the class with smaller euclidean distance to the respectiveclass prototype.

4.4 Discussion

We have shown that in a cortical model, the representation of a static

visual stimulus can be generated using the temporal dynamics of a neu-

ronal population. This temporal encoding has a specific signature in the

phase relationships between the active neurons of the underlying encoding

substrate.

The TPC is a relevant hypothesis on the encoding of visual information. It is consistent with known cortical characteristics, both anatomically and physiologically. Moreover, the final representation given by the wavelet circuit compresses the signal, removing redundant and non-orthogonal components and greatly improving the trade-off between performance and size.


In conclusion, simple hand gestures can be reliably classified using the TPC approach. With the highly optimized C++ library we developed for the experiments, the computational time needed for the TPC computation is entirely feasible for gesture recognition. Hence, TPC has proved to be robust and fast enough to serve as a gesture recognition mechanism on the iCub without compromising the flow of the game.

Chapter 5

A framework for mobile robot navigation using a temporal population code

Recently, we have proposed that the dense local and sparse long-range

connectivity of the visual cortex accounts for the rapid and robust trans-

formation of visual stimulus information into a temporal population code,

or TPC. In this paper, we combine the canonical cortical computational

principle of the TPC model with two other systems: an attention sys-

tem and a hippocampus model. We evaluate whether the TPC encoding

strategy can be efficiently used to generate a spatial representation of the

environment. We benchmark our architecture using stimulus input from

a real-world environment. We show that the mean correlation of the TPC

representation in two different positions of the environment has a direct

relationship with the distance between these locations. Furthermore, we

show that this representation can lead to the formation of place cells. Our

results suggest that TPC can be efficiently used in a high-complexity task

such as robot navigation.



5.1 Introduction

Biological systems have an extraordinary capacity for performing simple and complex tasks in a fast and reliable way. These capabilities are largely due to the robust processing of sensory information in an invariant manner. In the mammalian brain, representations of dynamic scenes in the visual world are invariant to a number of deformations such as perspective, lighting conditions, rotations and scales. This robustness underlies successful strategies for optimal foraging: a foraging animal has to explore the environment, search for food, escape from possible predators and build an internal representation of the world that allows it to navigate successfully.

Performing navigation with mobile robots using only visual input from a camera is a complex task (Bonin-Font et al., 2008). Recently, a number of models have been proposed to solve it (Jun et al., 2009; Koch et al., 2010). Unlike the visual system, most of these methods are based on brute-force feature matching, which is computationally expensive and stimulus dependent.

In contrast, we have proposed a model of the visual cortex that can generate reliable invariant representations of visual information. The temporal population code, or TPC, is based on the unique anatomical characteristics of the neo-cortex: dense local connectivity and sparse long-range connectivity (Binzegger et al., 2004). Physiological studies have estimated that only a few percent of the synapses that make up cortical circuits originate outside of the local volume (Schubert et al., 2007; Liu et al., 2007; Nauhaus et al., 2009). The TPC proposal has demonstrated the capacity of densely coupled networks to rapidly encode the geometric organization of the population response, for instance as induced by external stimuli, into a robust and high-capacity code (Wyss et al., 2003a,c).

It has been shown that TPC provides an efficient encoding and can generalize to navigation tasks in virtual environments. In a previous study, the TPC model applied to a simulated robot in a virtual arena could account for the formation of place cells (Wyss et al., 2006). It has also been shown that TPC can generalize to realistic tasks such as handwritten character classification (Wyss et al., 2003a). The key parameters that control this encoding, and the invariances that it can capture, are the topology and the transmission delays of the laterally coupled neurons that constitute an area. In this respect TPC emphasizes that the neo-cortex exploits specific network topologies, as opposed to the randomly connected networks found in examples of so-called reservoir computing (Lukosevicius and Jaeger, 2009).

In this paper, we evaluate whether the TPC can be generalized to a navigation task in a natural environment. In particular we assess whether it can be used to extract spatial information from the environment. In order to achieve this generalization we combine the TPC, as a model of the ventral visual system, with a feedforward attention system and a hippocampus-like model of episodic memory. The attention system uses early simple visual features to determine which salient regions in the scene the TPC framework will encode. The spatial representation is then generated by a model of the hippocampus that develops place cells. Our model has been developed to mimic the foraging capabilities of a rat. Rodents are capable of optimally exploring the environment and its resources using distal visual cues for orientation. The images used in the experiments were acquired in a real-world environment by the camera of a mobile robot called the Synthetic Forager (SF) (Renno-Costa et al., 2011).

Our results show that the combination of TPC with a saliency-based attention mechanism can be used to develop robust place cells. These results directly support the hypothesis that the computational principle proposed by TPC leads to a spatio-temporal coding of visual features that can reliably account for position information in a real-world environment. In contrast to previous TPC experiments with navigation, here we stress the capabilities of position representation in a real-world environment.


In the next section we present a description of the three systems we use in the experiments, and detail how the environment was set up and how the images for the experiments were generated. In the following section we present the main results obtained in the experiments, and we conclude by discussing the core findings of the study.

5.2 Methods

Architecture overview

The architecture presented here comprises three areas (Fig. 5.1): an attention system, an early visual system and a model of the hippocampus. The attention system exchanges information with the visual system in a bottom-up fashion. It receives bottom-up input of simple visual features and provides salience maps (Itti et al., 1998). The saliency maps determine the subregions rich in features such as colors, intensity and orientations that pop out from the visual field and will therefore be encoded by the visual system.

The visual system also provides input to the hippocampal model in a bottom-up way. The subregions of interest determined by the attention system are translated into temporal representations that feed the hippocampal input stage. The temporal representation of the visual input is generated using the so-called Temporal Population Code, or TPC (Wyss et al., 2003a,c; Luvizotto et al., 2011). The hippocampus model translates the visual representation of the environment into a spatial representation based on the activity of place cells (de Almeida et al., 2009a; Renno-Costa et al., 2010); these granule cells respond only to one or a few positions in the environment. In the following sections we present the details and modeling strategies used for each area.


[Figure 5.1 diagram: the input image feeds early visual features, from which the attention system builds a salience map (colors, intensity, orientations); the salient subregions are encoded over time by the TPC in the visual system and passed, bottom-up, to the hippocampus, where place cells (cells #1-#4 shown) form the position representation; a top-down pathway links the attention and visual systems.]

Figure 5.1: Architecture scheme. The visual system exchanges information with both the attention system and the hippocampus model to support navigation. The attention system is responsible for determining which parts of the scene are considered for the image representation. The final position representation is given by place cells, which fire at high rates whenever the robot is in a specific location in the environment.

Attention System

In our context, attention is the process of focusing sensory resources on a specific point in space. We propose a deterministic attention system driven by the local points of maximum saliency in the input image. To extract the subregions of interest, the attention system uses the saliency-based model proposed and described in detail in (Itti et al., 1998). The system makes use of image features extracted by the early primate visual system, combining multi-scale image features into a topographical saliency map (bottom-up). The output saliency map has the same aspect ratio as the input image. To select the center points of the subregions, we calculate the points of local maxima p1, p2, ..., p21 of the saliency map. The minimum accepted distance between peaks in the local-maxima search is 20 pixels. This competitive process among the salient points can be interpreted as a top-down push-pull mechanism (Mathews et al., 2011).

A subregion of size 41 × 41 is determined by the neighborhood of 20 pixels surrounding a local maximum p. Once a region is determined, the points inside this area are not used in the search for the next peak; this constraint avoids processing redundant neighboring regions with overlaps bigger than 50% in area. For the experiments, the first 21 local maxima are used by the attention system; these 21 points are the centers of the subregions of interest cropped from the input image and sent to the TPC network. We used a Matlab implementation¹ of the saliency maps (Itti et al., 1998). In the simulations we use the default parameters, except for the map width, which is set to the same size as the input image: 967 pixels.

¹The scripts can be found at: http://www.klab.caltech.edu/~harel/share/gbvs.php
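The greedy selection of salient peaks can be sketched as follows (Python/NumPy, our own illustrative code); each accepted maximum masks its 20-pixel neighborhood so that the next search cannot return a strongly overlapping region.

    import numpy as np

    def select_peaks(saliency, n_peaks=21, radius=20):
        # Pick the n_peaks strongest local maxima of the saliency map,
        # excluding a square neighborhood around each accepted peak.
        s = saliency.astype(float)
        peaks = []
        for _ in range(n_peaks):
            y, x = np.unravel_index(np.argmax(s), s.shape)
            peaks.append((y, x))
            s[max(0, y - radius):y + radius + 1,
              max(0, x - radius):x + radius + 1] = -np.inf
        return peaks   # centers of the 41x41 subregions sent to the TPC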

Visual System

The early visual system consists of two stages: a model of the lateral geniculate nucleus (LGN) and a topographic map of laterally connected spiking neurons with properties found in the primary visual cortex V1 (Fig. 5.2) (Wyss et al., 2003c,a). In the first stage we calculate the response of the receptive fields of LGN cells to the input stimulus, a gray-scale image that covers the visual field. The receptive-field characteristics are approximated by convolving the input image with a difference of Gaussians (DoG) operator followed by a positive half-wave rectification. The positive rectified DoG operator resembles the properties of ON center-surround LGN cells (Rodieck and Stone, 1965; Einevoll and Plesser, 2011). The LGN stage is a mathematical abstraction of known properties of this brain area and performs an edge enhancement on the input image. In the simulations we use a kernel ratio of 4:1, with a size of 7×7 pixels and variance σ = 1 (for the smaller Gaussian).
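A minimal version of this stage, using SciPy's Gaussian filtering as a stand-in for the explicit 7×7 kernel of the model, could read:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def lgn_response(image, sigma=1.0, ratio=4.0):
        # Difference of Gaussians (center minus surround) followed by a
        # positive half-wave rectification, approximating ON
        # center-surround LGN receptive fields.
        center = gaussian_filter(image.astype(float), sigma)
        surround = gaussian_filter(image.astype(float), ratio * sigma)
        return np.maximum(center - surround, 0.0)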

[Figure 5.2 diagram: input subregion → LGN (DoG) → V1 array of spiking neurons → spikes summed over time → TPC.]

Figure 5.2: Visual model overview. The salient regions are detected and cropped from the camera image. The cropped areas, subregions of the visual field, have a fixed resolution of 41x41 pixels. Each subregion is convolved with a difference of Gaussians (DoG) operator that approximates the receptive-field properties of LGN cells. The output of the LGN stage is processed by the simulated cortical neural population, and the spiking activity is summed over a specific time window, rendering the Temporal Population Code.

The LGN signal is projected to the V1 spiking model, which uses the same network based on Izhikevich neurons presented in chapters 2 and 4.

For the simulations, we used the parameters suggested in (Izhikevich, 2004) to reproduce regular spiking (RS) behavior. All the parameters used in the simulations are summarized in table 5.1. The lateral connectivity between V1 units is exclusively excitatory, with strength w. A unit u_a connects with u_b if they have different center positions x_a ≠ x_b and if they are within a region of a certain radius, ‖x_b − x_a‖ < r.

The temporal population code is generated by summing the network activity in a time window of 128 ms; a TPC is thus a vector with the spike count of the network at each time step. In discrete time, all the equations are integrated with Euler's method using a temporal resolution of 1 ms.
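For illustration, the Izhikevich dynamics and the spike-count readout can be sketched as follows (Python/NumPy; the lateral coupling is collapsed into a fixed input current for brevity, and the initial values follow Table 5.1):

    import numpy as np

    def run_tpc(I_in, T=128, a=0.02, b=0.2, c=-65.0, d=8.0):
        # Euler integration (dt = 1 ms) of Izhikevich RS neurons; the TPC
        # is the population spike count at each of the T time steps.
        v = np.full(I_in.size, -70.0)   # membrane potential (Table 5.1)
        u = np.full(I_in.size, -16.0)   # membrane recovery (Table 5.1)
        tpc = np.zeros(T, dtype=int)
        for t in range(T):
            fired = v >= 30.0           # spike detection
            tpc[t] = fired.sum()
            v[fired] = c                # after-spike reset of v
            u[fired] += d               # after-spike reset of u
            v = v + (0.04 * v**2 + 5.0 * v + 140.0 - u + I_in)
            u = u + a * (b * v - u)
        return tpc

    tpc = run_tpc(10.0 * np.random.rand(41 * 41))   # one 41x41 subregion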


Variable   Description                                   Value
N          network dimension                             41x41 neurons
a          scale of the recovery                         0.02
b          sensitivity of the recovery                   0.2
c_rs       after-spike reset value of v (RS neurons)     -65
c_bs       after-spike reset value of v (BS neurons)     -55
v          membrane potential (initial)                  -70
u          membrane recovery (initial)                   -16
g_i        excitatory input conductance                  20
T_i        minimum V1 input threshold                    0.4
r          lateral connectivity radius                   7 units
w          synapse strength                              0.3

Table 5.1: Parameters used for the simulations.

Hippocampus model

The hippocampus model is an adaptation of a recent study on rate remapping in the dentate gyrus (de Almeida et al., 2009a,b; Renno-Costa et al., 2010). In our implementation, the granule cells receive excitatory input from the grid cells of the entorhinal cortex. The place cells that are active for a given position in the environment are then determined by the interaction of the summed excitation and inhibition, using a rule based on the percentage of maximal suprathreshold excitation: the E%-max winner-take-all (WTA) process. The excitatory input received by the ith place cell from the grid cells is given by:

Ii(r) = Σ_{j=1}^{n} Wij Gj(r)        (5.1)

where Wij is the synaptic weight of each input. In our implementation, the weights are either random values in the range [0, 1] (the maximum bin size among normalized histograms) or 0 in the case of no connection, and we use the TPC histograms as the information provided by the grid cells Gj(r). The activity of the ith place cell is given by:


Fi(r) = Ii(r) · H(Ii(r) − (1 − k) max_j Ij(r))        (5.2)

where H(·) is the Heaviside step function. The constant k, which varies in the range of 5% to 15% (de Almeida et al., 2009b), is referred to as E%-max and determines which cells fire. Specifically, the rule states that a cell fires if its feedforward excitation is within E% of that of the cell receiving maximal excitation.
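Equations 5.1 and 5.2 reduce to a few lines of code (Python/NumPy sketch; the weight sparsity and k = 10% are illustrative choices within the stated range):

    import numpy as np

    def place_cell_activity(W, G, k=0.10):
        I = W @ G                             # eq. 5.1: summed excitation
        gate = I >= (1.0 - k) * I.max()       # Heaviside term of eq. 5.2
        return I * gate                       # only near-maximal cells fire

    rng = np.random.default_rng(0)
    W = np.where(rng.random((100, 7)) < 0.5,  # random weights in [0, 1],
                 rng.random((100, 7)), 0.0)   # zero where unconnected
    F = place_cell_activity(W, rng.random(7)) # G: a 7-bin TPC histogram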

Experimental environment

For the experiments we produced a dataset of pictures from an indoor square area of 3 m × 3 m. The area was divided into a 5x5 grid of 0.6 m square bins, comprising 25 sampling positions (Fig. 5.3a). The distance to objects and walls ranged from 30 cm to 5 m. To emulate the sensory input we approximated the rat's view at each position with 360-degree panoramic pictures. Rats use distal visual cues for orientation (Hebb, 1932), and such information is widely available to the rat given the spaced positioning of its eyes, resulting in a wide field of view and low binocular vision (Block, 1969). The pictures were built using the software Hugin (D'Angelo, 2010) to stitch 8 pictures with equally spaced orientations (45-degree steps) at each position. The images were acquired using the SF robot (Renno-Costa et al., 2011). Each indoor image was segmented into 21 subregions of 41 × 41 pixels by the attention system, as described earlier. The maximum overlap among the subregions is 50% (Fig. 5.3b).

Image representation using TPC

For every image, 21 TPC vectors are generated, one for each subregion selected by the attention system. In total, we acquired 25 panoramic images of the environment, leading to a matrix of 525 TPC vectors (25x21). We cluster the TPC matrix into 7 classes using the k-means clustering algorithm. A panoramic image of the environment is therefore represented by the cluster distribution of its TPC vectors, a histogram with 7 bins. These histograms are used as the representation of each position in the environment (Fig. 5.3b) and are the inputs to the hippocampus model.
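The construction of this position representation can be sketched as follows (Python with SciPy's k-means; the array shapes follow the text, while the data here are random stand-ins):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    tpcs = np.random.rand(525, 64)            # 25 positions x 21 subregions
    _, labels = kmeans2(tpcs, 7, minit='++')  # cluster into 7 classes

    # one 7-bin histogram of cluster labels per panoramic image
    histograms = np.array([np.bincount(labels[i * 21:(i + 1) * 21],
                                       minlength=7) for i in range(25)])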

[Figure 5.3 diagram: a) the 3 m × 3 m arena with its 5x5 grid of 0.6 m bins and sampling positions 1-25; b) the panoramic image of position 2 with its 21 salient regions and the resulting histogram of TPCs.]

Figure 5.3: Experimental environment. a) The indoor environment was divided into a 5x5 grid of 0.6 m square bins, comprising 25 sampling positions. b) For every sampling position, a 360-degree panorama was generated to simulate the rat's visual input. A TPC response is calculated for each salient region, 21 in total. The collection of TPC vectors for the environment is clustered into 7 classes. Finally, a single image is represented by the cluster distribution of its TPC vectors using a histogram of 7 bins.

5.3 Results

In the first experiment, we explored whether the TPC response preserves bounded invariance in the observation of natural scenes, i.e. whether the representation is tolerant to certain degrees of visual deformation without losing specificity. More specifically, we investigate whether the correlation between the TPC histograms of visual observations at two positions can be associated with their distance. We calculate the mean pairwise correlation among the TPC histograms of the 25 sampling positions used for image acquisition in the environment, and sort the correlation values into distance intervals of 0.6 m, the same intervals used for sampling the environment.

Our results show that the TPC transformation can successfully account for position information in a real-world scenario. The mean correlation between two positions in the environment decays monotonically with increasing distance (Fig. 5.4). As expected, the error in the correlation measure increases with distance.

[Figure 5.4 plot: mean correlation (0 to 1) versus distance in meters (0 to 3 m, in 0.6 m bins), decreasing monotonically.]

Figure 5.4: Pairwise correlation of the TPC histograms in the environment. We calculate the correlation among all possible pairs of positions in the environment and average the correlation values into distance intervals, according to the sampling distances used in the image acquisition. This result suggests that a near-linear estimate of distance can be produced from the correlation values of the TPC histograms.
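The analysis behind Fig. 5.4 amounts to correlating the histograms of all position pairs and averaging within 0.6 m distance bins, roughly as in this sketch (Python/NumPy; the histograms are random stand-ins):

    import numpy as np

    coords = np.array([(0.6 * (i % 5), 0.6 * (i // 5)) for i in range(25)])
    hists = np.random.rand(25, 7)      # stand-in TPC histograms

    corr = np.corrcoef(hists)          # pairwise correlation of histograms
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)

    # mean correlation inside each 0.6 m distance interval
    edges = np.arange(0.0, dist.max() + 0.6, 0.6)
    idx = np.digitize(dist.ravel(), edges)
    for b in np.unique(idx):
        print(edges[b - 1], corr.ravel()[idx == b].mean())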

In the second experiment, we use the TPC histograms as input for the hippocampus model to acquire place cells. Neural representations of position, as observed in place cells, present a quasi-Gaussian transition between inactive and active locations, which makes bounded invariance a key property for a smooth reconstruction of position from visual input as observed in nature. Previous work has shown that the conjunction of TPC responses to synthetic figures in a virtual environment preserves bounded invariance and that, with the use of reinforcement learning, place cells could emerge from these responses (Wyss and Verschure, 2004a).

We performed the simulations using 100 cells with random weights. The results show that, of the 100 cells used, 12 have significant place-related activity (Fig. 5.5). These cells show high activity in specific areas of the environment, their place fields. Thus the TPC representation of natural scenes can successfully lead to the formation of place fields.

[Figure 5.5: firing-rate maps over the 3 m × 3 m arena for the 12 place cells: cells #3, #7, #10, #14, #42, #66, #70, #74, #79, #88, #91 and #97.]

Figure 5.5: Place cells acquired by the combination of TPC-based visual representations and the E%-max WTA model of the hippocampus.

5.4 Discussion

We addressed the question of whether the properties of TPC could be combined with an attention system and a hippocampus model to reliably provide position information in a navigation scenario. We have shown that in our cortical model, the TPC, the representation of a static visual stimulus is reliably generated from the temporal dynamics of neuronal populations. The TPC is a relevant hypothesis on the encoding of sensory events, given its consistency with cortical anatomy and with contemporary physiology.

The information carried by this so-called Temporal Population Code is complete yet dense. The encoding of visual information operates on a subset of high-saliency regions that pop out from the visual field. Combining the TPC encoding strategy with a hippocampus model, we have shown that this approach provides a robust representation of position in a natural environment.

Panoramic images have been proposed in other studies of robot navigation (Zeil et al., 2003). Similarly, the TPC strategy proposed here also makes use of pixel differences between panoramic images to provide a reliable cue to location in real-world scenes. In both cases, differences in the direction of illumination and environmental motion can seriously degrade the position representation. In comparison, however, the representation provided by TPC is extremely compact: a subregion of 41x41 pixels is transformed into a TPC vector of 64 points, and the visual field, an image of 96,000 pixels, is represented by a histogram of 7 bins.

In the specific benchmark evaluated here, the model showed that distance information could be reliably recovered from the TPC. Importantly, we showed that the model can produce reliable position information based on the mean correlation of TPC-based histograms. Furthermore, this method does not use any mechanism specific to the stimuli used in the experiments; it is fully generic. Also, the processing time needed for classifying an input of 41x41 pixels was about 50 ms using an offline Matlab implementation, so we conclude that our model could be efficiently used in a real-time task. Hence, our model shows that the brain might use the smooth degradation of natural images represented by TPCs as an estimate of distance.

Acknowledgment

This work was supported by EU FP7 projects GOAL-LEADERS (FP7-

ICT-97732) and EFAA (FP7-ICT-270490).

Chapter 6

Conclusion

In this dissertation, we have investigated the computational principles that the early visual system employs in order to provide an invariant representation of the visual world. Our starting point was the Temporal Population Code (TPC) model proposed by Wyss et al. (2003b). We first investigated a central open issue: how the intrinsic elements that form the temporal representation provided by the TPC could be accessed and read out by further cognitive areas involved in visual processing. Indeed, since the introduction of the TPC, a neurobiologically compatible readout stage had been a persistent problem. In this thesis, we solved this issue by proposing a novel and biologically consistent readout system based on the Haar wavelet transform. We showed that the efficient representation provided by the TPC signal can be decoded by further areas through a subset of wavelet coefficients.

The implicit concept behind the proposed readout is that the brain uses different frequency bands to convey multiple types of information, multiplexed into a temporal signal (Buzsaki, 2006). Indeed, recent results have shown that distinct network activity observed in primary visual cortex individually signals different properties of the input stimulus, such as motion direction, speed and the orientation of object contours, at the same time (Onat et al., 2011).


Therefore, the decoder must be able to select the suitable frequency range in order to access the information needed for each situation. We investigated this idea through a systematic spectral analysis of the TPC signal, using a simple yet powerful Haar basis that proved to be a feasible choice, providing robust and reliable decoding.

In comparison with the original TPC readout, the wavelet circuit proposed here showed the same performance in the object recognition experiments. Although for bursting neurons the wavelet readout is slower than the original TPC readout, it could extract the key geometric properties of synthetic shapes without compromising the speed of encoding.

The fine details of the mechanism, such as its integration time constant and its wavelet resolution level, depend on the network dynamics. How higher areas in the visual pathways dynamically control the readout frequency range remains a subject for further investigation. This interaction between bottom-up sensory information and top-down modulation could be related to an attentional mechanism that synchronizes the inhibitory neurons in the neuronal wavelet circuit, similar to the interneuron gamma mechanism (Tiesinga and Sejnowski, 2009b).

These findings make a strong point about how the neo-cortex deals with sensory information. Our results show that stimuli can be represented, and further classified, using a strongly compressed temporal signal based on wavelet coefficients. This strategy allows the multiplexing of high-level information from the visual input and flexible storage and retrieval of information by a memory system.

In the next part of this dissertation, we explored the TPC as a candidate for a canonical computational mechanism used by the neo-cortex to provide invariant representations of visual information in different scenarios. In chapter 3, we specifically addressed this question using human faces as a test case. Indeed, we investigated whether the TPC model could be used to provide a real-time face recognition system for the iCub robot. In the specific benchmark evaluated here using human faces, our results confirmed that the representation of a static visual stimulus can be generated based on the temporal activity of neuronal populations.

One of the main issues solved was the reduction of the computational cost due to the recurrent connections in the large-scale spiking neural network. We optimized the TPC architecture with respect to the lateral connectivity among the modeled neurons using 2D convolutions performed in the frequency domain. Also, instead of using two different filters to simulate the LGN and V1 receptive-field characteristics, we redesigned the model, merging these two stages into a single step based on derivatives of Gaussians. With this strategy we could achieve optimal orientation-selective edge enhancement in a single filtering operation, which considerably improves performance. Although the scenario was built around human faces, the TPC network used is fully generic: the model did not rely on any pre-built dictionary of features related to any specific stimulus set.

The developed TPC library was integrated into the brain-based cognitive architecture DAC (Verschure et al., 2003). The tightly coupled layered structure of DAC consists of reactive, adaptive and contextual memory mappings. At each level, more complex and memory-dependent mappings from sensory states are built, depending on the internal state of the agent. Here the TPC plays a major role in building a compact representation of sensory and form primitives. In this sense, the TPC/DAC integration can serve as a unified perception system for the humanoid robot iCub. Indeed, we further explored the capabilities and generality of this architecture in an additional scenario: gesture recognition. The experiments were motivated by the game Rock-Paper-Scissors, where a human plays with the robot through a set of three hand gestures symbolizing these three elements. The results showed that the robot can reliably recognize gestures under realistic conditions with a precision of 92%. The only assumption at the moment is that the hands are shown against a black background; this way we can use the color information to indicate the time of stimulus onset.


This promising work has not been published yet, as we are now implementing the motor behavior of the robot associated with the game. However, preliminary tests indicate that TPC is a highly feasible solution, both in terms of performance and processing time, that allows the iCub to play Rock-Paper-Scissors with humans. This scenario will provide a rich platform to study the interaction between humans and humanoid robots. The challenge of winning may have an important impact on the engagement of humans who play the game against the robot. Thus a reliable gesture recognition system is a crucial element in the assessment of interaction and engagement.

Indeed, given the speed of encoding, the computational cost, the classification performance and the consistency with cortical anatomy and physiology, TPC has proved to be a relevant hypothesis on the encoding of sensory events. Moreover, it reinforces the main aspect of the question we want to address: the generation of a robust representation independent of the exact detailed properties of the input stimulus.

Subsequently we moved on to exploit the generality of TPC in a completely different scenario. If TPC is a reliable way of encoding visual sensory input in a generic way, the representation should also account for bounded invariances, and therefore for a smooth reconstruction of position from visual input as observed in nature.

In chapter 5, we proposed a framework that combines the TPC with an attention system and a hippocampus model to provide position estimates for navigation tasks. The proposed framework was built to be used on mobile robots relying purely on single-camera input.

In a previous study, the TPC model was applied to a simulated robot in a virtual arena, accounting for the formation of place fields (Wyss et al., 2006). Our results here extended these achievements to a real-world situation, where images of a realistic environment were used as input to a fairly realistic hippocampus model (Renno-Costa et al., 2010). The images captured by the camera were processed to emulate the sensory input of a rat: at each position in the arena, 360-degree panoramic pictures were used (Hebb, 1932). In a bottom-up fashion, the attention system exchanged information with the visual system, receiving simple visual features and providing salience maps of the panoramic images (Itti et al., 1998). The saliency maps provided a map of subregions rich in features such as colors, intensity and orientations that could be compactly encoded in the visual system by the TPC. The subregions of interest determined by the attention system, translated into temporal representations, fed the hippocampal input stage with the TPC signal, leading to the final position representation expressed in the activity of place cells (de Almeida et al., 2009a).

This strategy, based on distal cues from panoramic images, has been proposed in other studies of robot navigation (Zeil et al., 2003). However, the compression-versus-performance trade-off obtained with the proposed framework is remarkable: the visual field, an image of 96,000 pixels, could be represented by a histogram of 7 bins. With such a compact representation we could achieve a monotonically decreasing correlation curve over distances in different parts of the arena. The smooth reconstruction of distance relationships indeed fulfilled the needs of the hippocampus input in order to generate place fields. This framework suggests that the brain might use the smooth degradation of natural images represented by TPCs to estimate distances and generate a representation of position using place cells.

In summary, we proposed that a spatial-to-temporal transformation based on a Temporal Population Code can account for the generation of a robust, compact and non-redundant invariant representation of visual sensory input. We have extensively shown the capabilities of the proposed strategy over different scenarios, in real-world tasks and on different platforms, ranging from humanoid and mobile robots to purely computational simulations using artificial objects.

The results presented in this doctoral thesis not only strongly support the idea that time is indispensable in the construction of invariant representations of sensory inputs, but also suggest a clear hypothesis of how time can be used in parallel with spatial information. In recent neuroscience studies there is increasing interest in the field of temporal population coding and in how time can be included in models of sensory encoding, as suggested by other authors (see (Buonomano and Maass, 2009) for a comprehensive review). In this sense, we are confident that this thesis will contribute to this discussion, adding novel methods and results to the fields of vision research, computational neuroscience and robotics.

Appendix A

Distributed Adaptive Control (DAC) Integrated into a Novel Architecture for Controlled Human-Robot Interaction

Robots will become increasingly integrated in our society, which will require high levels of social competence. Understanding the core aspects required for social interaction is therefore a crucial topic in the field of robotics. In this article we propose an architecture and an experimental paradigm which allow a humanoid robot to engage humans in well-defined and quantified social interactions. The BASSIS architecture is a four-layer structure of increasingly more complex and memory-dependent mappings from sensory states to action. At its core, BASSIS allows the iCub robot to sustain dyadic interactions with humans using the Reactable, a tangible tabletop musical instrument. In this setup, the iCub receives from the Reactable a complete set of data about the objects that are on the table. This combination avoids issues regarding object categorization and provides a rich environment where humans can play music or games with the robot. We benchmark the system performance in a collaborative object manipulation and planning task using spoken language. The results show that the architecture proposed here provides reliable and complex interaction with controlled and precise action execution. We show that the iCub can successfully interact with humans, grasping and placing objects on the table with a precision of 2 cm. The software implementation of the architecture is available online.

A.1 Introduction

We are at the start of a period of radical social change in which robots will enter more and more into our daily lives (Bekey and Yuh, 2007; Bar-Cohen and Hanson, 2009). The interactions between humans and robots will become more omnipresent and will take on more roles in our houses, offices, hospitals and schools, requiring the creation of machines socially compatible with humans (Thrun, 2004; Dautenhahn, 2007).

Some studies suggest that humans prefer humanoid robots to other artificial systems in social interactions (Breazeal et al., 2006). Nowadays, this kind of machine is being developed by many research groups around the world and by a number of large corporations and smaller specialist robotics companies (Bekey and Yuh, 2007). In Europe, the FP6 RobotCub project made a major stride forward in providing a means for integrative humanoid research via the iCub platform (Metta et al., 2008). The iCub robot has rich motor and perceptual features that provide all the core capabilities needed to interact efficiently with humans.

In the development of experimental setups for investigating social interaction with humanoid robots, we still face many unsolved technical issues in the treatment of sensory inputs that greatly interfere with the development of motor control processes. The categorization of objects is a good example of a hard task which is still not completely solved by current technologies (Dickinson et al., 2009; Pinto et al., 2011). Often the interaction scenario must be simplified to an unrealistic level in order to be technically feasible. Therefore, to better understand and realize the key elements of the interaction between humans and robots, we need to overcome these technical obstacles.

In this paper, we propose a setup in which humans can interact with a robot in a highly controlled environment. Together with this environment, we present the technical and methodological aspects of the Biomimetic Architecture for Situated Social Intelligence Systems, called BASSIS. This architecture is based on the DAC architecture (Verschure and Althaus, 2003) and incorporates and implements a set of capabilities which allow the robot to engage humans in social interactions, such as games and cognitive tasks.

BASSIS allows the humanoid robot iCub to interact with humans in a controlled environment using the Reactable¹, an electronic music instrument with a tangible user interface. The Reactable, conceptualized as a musical instrument (Geiger et al., 2010), provides a complete platform where humans and robots can interact, playing music or games and manipulating physical objects that are on the table. The different properties of the objects are displayed on the tabletop in a visually alluring graphical interface (Fig. A.1).

In the proposed setup, the iCub is connected to the Reactable and receives precise information about the different properties of objects placed on the table directly from the Reactable software. In this way, issues regarding object properties such as identity and position are cleanly avoided and the experiments can be focused entirely on the interaction itself. Another important component of BASSIS is the spoken language module that allows the human and the robot to interact using natural language.

All the parameters describing the interaction can be stored in log files for further analysis, making this a novel architecture for benchmarking experiments in human-robot interaction.

¹The Reactable is a trademark of Reactable Systems, 2009-2012. http://www.reactable.com/


Here we provide a complete description of all the components of the proposed setup. We present in detail all the modules that compose the proposed architecture. Notably, we have made all the software publicly available², which allows the whole robotics community to benefit from the proposed architecture, to test it on their own systems, and to benchmark our results.

In the results section, we present a detailed analysis of the overall system performance. We benchmark the performance of the robot in a collaborative object manipulation task, providing statistics on the precision of grasping and of the position representation provided by the BASSIS architecture. We also benchmark the setup over 4 different types of interactions between humans and the iCub, performed using the spoken language system to achieve cooperative actions. Finally, in the last section, the highlights and drawbacks of the proposed system are discussed.

A.2 Methods

BASSIS architecture overview

BASSIS is an extended version of the Distributed Adaptive Control (DAC) architecture (Verschure and Althaus, 2003; Duff and Verschure, 2010). DAC consists of three tightly coupled layers: reactive, adaptive and contextual. At each level of organization, increasingly more complex and memory-dependent mappings from sensory states to actions are generated, dependent on the internal state of the agent. The BASSIS architecture expands DAC with many new components and incorporates the repertoire of capabilities available for the human to interact with the humanoid robot iCub. BASSIS is divided into 4 hierarchical layers: Contextual, Adaptive, Reactive and Soma. The different modules responsible for implementing the relevant competences of the humanoid are distributed over these layers (Fig. A.2).

²Available at: http://efaa.sourceforge.net/


Figure A.1: The iCub ready to play with the objects that are on the Reactable. The table provides an appealing atmosphere for social interactions between humans and the iCub.


The Soma and Reactive layers are the bottom layers of the BASSIS diagram. They are in charge of coding the robot's motor primitives through the attentionSelector and pmpActions packages, and they acquire useful sensory information from the tactile sensors mounted on the iCub's hands through the tactileInterface module.

In the Adaptive layer, a real-time affine transformation is performed in order to map the coordinates received from the Reactable into the robot's frame of reference. The iCub needs to be given the Cartesian coordinates within its own frame to execute any motor action on an object. This is accomplished by the ObjectLocationTransformer of the Adaptive layer.


At the top of the BASSIS diagram lies the Contextual layer, which is composed of two main entities: the Supervisor and the Objects Properties Collector (OPC). The Supervisor module manages the spoken language system in the interaction with the human partner, as well as the synthesis of the robot's voice. The OPC reflects the current knowledge the system is able to acquire from the environment, collecting data that range from object coordinates in multiple reference frames (both the Reactable and robot domains) to the human's emotional state and position in the interaction process. A central role is played by the OPC module, which represents the database where a large variety of information is stored from the beginning of the experiments. Finally, the reactable2OPC module handles the communication interface required to exchange the information provided by the Reactable with the OPC.

In the following sections, we elaborate on each component of the BASSIS architecture.

The iCub

The iCub is a humanoid robot one meter tall, with dimensions similar to those of a 3.5-year-old child. It has 53 actuated degrees of freedom distributed over the hands, arms, head and legs (Metta et al., 2008).

The sensory inputs are provided by artificial skin, cameras and microphones. The robot is equipped with novel artificial skin covering the hands (fingertips and palms) and the forearms (Cannata et al., 2008). Stereo vision is provided by two cameras in a swivel mounting, and stereo sound capture by two microphones; cameras and microphones are located where the eyes and ears would be in a human. The facial expressions, mouth and eyebrows, are projected from behind the face panel using lines of red LEDs. The robot also has a sense of proprioception (body configuration) and movement (using accelerometers and gyroscopes).

The different software modules in the architecture are interconnected using YARP (Metta et al., 2006).

[Figure A.2 diagram: the BASSIS layers (Contextual, Adaptive, Reactive, Soma) with their modules: OPC, Supervisor, AttentionSelector, ObjectLocationTransformer, pmpActions, tactileInterface, cartesianControl, gazeControl, and the reactable2OPC interface.]

Figure A.2: BASSIS is a multi-scale biomimetic architecture organized over four hierarchical layers (Contextual, Adaptive, Reactive, Soma), based on the well-established DAC architecture. See text for further details.

This framework supports distributed computation, focusing on robot control and efficiency. The communication between two modules using YARP happens through objects called "ports", which can exchange data over different network protocols such as TCP and UDP.

Currently, the iCub is capable of reliably coordinating reach and grasp motions, producing facial expressions to express emotions, exploiting its force/torque sensors for force control, and gazing at points in the visual field. The interaction can be performed using spoken language. With all these features, the iCub is an outstanding platform for research on social interaction between humans and humanoid robots.

Reactable

Conceptualized as a new musical instrument, the Reactable is a tabletop tangible interface (Geiger et al., 2010). With the Reactable the user manipulates digital information by interacting with real-world objects: it gives a physical shape to the digital information. The instrument is composed of a round table with a translucent top on which objects and fingertips (cursors) are used to control its parameters. It has a back projector that displays the properties of the objects through the translucent top and a camera used to track the different objects (Fig. A.3).

Object recognition is performed using the reacTIVision tracking system (Kaltenbrunner and Bencina, 2007), which can track a large number of objects in real time using fiducial markers (Bencina and Kaltenbrunner, 2005). Each object receives an identification number (ID) according to its fiducial marker. The software provides precise information about the (x, y) position of the objects, rotation angle, speed and acceleration.

The tracking system uses the TUIO protocol for exchanging data (Kaltenbrunner et al., 2005). This protocol has been specifically designed to meet the requirements of a general, versatile communication interface for tabletop tangible interfaces. The information from the reacTIVision software can be read by any software using TUIO.

Object Properties Collector (OPC)

The OPC module collects and stores the data produced during the interaction. The module consists of a database where a large variety of information is stored, starting at the beginning of the experiment and evolving as the interaction with humans unfolds over time. In this setup, the data provided by the Reactable feed the memory of the robot, represented by the OPC.


[Figure A.3 diagram: tagged objects and fingertip pointing on the translucent tabletop are tracked by a camera running reacTIVision, which sends (x, y, angle, ..., speed) data over the TUIO protocol to a TUIO client application; a projector displays feedback through the tabletop.]

Figure A.3: Reactable system overview. See text for further explanation.

The OPC reflects what the system knows about the environment and what is available. The data collected range from object coordinates in multiple reference frames (both the Reactable and robot domains) to the human's emotional state or position in the interaction process.

Entities in the OPC are addressed with unique identifiers and managed dynamically, meaning that they can be added, modified and removed at run time, safely and easily, from multiple sources located anywhere in the network, practically without limitation. Entities consist of all the possible elements that enter or leave the scene: objects, table cursors, matrices used for geometric projection, as well as the robot and the human themselves. All the properties related to the entities can be manipulated and used to populate the centralized database. A property in the OPC vocabulary is identified by a tag specified as a string (a sequence of characters). The values that can be assigned to a tag can be single strings or numbers, or even a nested list of strings and numbers. Table A.1 shows some examples of the tags used.


Tag                 Description        Type      Examples
"entity"            type of entity     string    "object", "table"
"name"              object name        string    "drums", "guitar"
onTable             presence/absence   integer   0, 1
rt position x       x-table            double    [xmin, xmax] m
rt position y       y-table            double    [ymin, ymax] m
robot position x    x-robot            double    [xmin, xmax] m
robot position y    y-robot            double    [ymin, ymax] m
rt id               unique table ID    integer   0

Table A.1: Excerpt of available OPC properties.

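For illustration only, an OPC entity can be pictured as a flat tag-value record mirroring the tags of Table A.1; this hypothetical snippet is not the actual OPC storage format.

    # hypothetical entity record built from the tags of Table A.1
    opc_entry = {
        "entity": "object",          # type of entity
        "name": "drums",             # object name
        "onTable": 1,                # presence/absence flag
        "rt position x": 0.12,       # meters, Reactable frame
        "rt position y": -0.34,
        "robot position x": -0.42,   # meters, iCub frame
        "robot position y": 0.05,
        "rt id": 0,                  # unique table ID
    }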

Reactable connection with OPC

The reactable2OPC module provides the required communication interface between the Reactable and the OPC. It is implemented as a TUIO client module that receives the data from the reacTIVision software and re-sends it over a YARP port.

The OPC is populated with the properties of the tagged objects that are on the table. Such properties include the object ID, the (x, y) position of each object on the table within the Reactable coordinate frame, the angle of rotation and the speed. When an object is removed from the table, the module keeps a memory of its ID and reassigns the same ID to the same object when it reappears.

Coordinate frame transformation based on homography

In order for the iCub to explore the world and interact with it, it not only needs to use its sensors but also to coordinate perception and action across several frames of reference. The iCub has its own coordinate frame, so a transformation between the table-centered and robot-centered coordinate frames (and vice versa) is required in order to match objects to actions. In the iCub's reference frame, the x axis points backward, the y axis points laterally and is chosen according to the right-hand rule, and the z axis is parallel to gravity but points upwards.

With BASSIS, the objectLocationTransformer solves this mathematical problem using a coordinate frame transformation based on homography, a technique commonly used in computer vision to map between two coordinate frames and for camera calibration (Zhang, 2000). The kernel of the transformation is a 3x3 matrix H that rotates and translates the points B received from the table into the points A in the frame of the robot. The transformation can be formalized as A = H × B.

To obtain the positions A and B needed to compute H, we perform an a priori calibration session. In this phase, the iCub points at 3 random positions on the table with its fingertips, using the tactileInterface. Currently, the center of the palm is the known position at which the iCub points; we use the lengths of the iCub's fingers and the joint angles to calculate the fingertip position through a computation of the forward kinematics (Siciliano et al., 2011), and thereby estimate A. The points B are the corresponding coordinates in the Reactable frame, obtained directly from the OPC.

Once the system is calibrated, the transformation between the two domains can be performed in real time by a simple multiplication by H, which is stored in the OPC as well.
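In essence, the calibration solves a small least-squares problem. A sketch in Python/NumPy follows (our own illustrative code, with made-up calibration points); with exactly 3 correspondences the recovered H is an affine map, while more points make the fit overdetermined.

    import numpy as np

    def fit_transform(B, A):
        # Least-squares 3x3 matrix H with A_h = H B_h in homogeneous
        # coordinates, solved row-wise as Bh @ X = Ah with X = H^T.
        Bh = np.column_stack([B, np.ones(len(B))])
        Ah = np.column_stack([A, np.ones(len(A))])
        X, *_ = np.linalg.lstsq(Bh, Ah, rcond=None)
        return X.T

    B = np.array([[0.10, 0.20], [0.40, 0.10], [0.30, 0.50]])    # table (m)
    A = np.array([[-0.35, 0.10], [-0.30, -0.18], [-0.50, -0.05]])
    H = fit_transform(B, A)

    p = H @ np.array([0.25, 0.30, 1.0])   # map a new table point
    robot_xy = p[:2] / p[2]               # into the robot frame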

The Passive Motion Paradigm (PMP) and the CartesianInterface

The PmpActionsModule provides motor control mechanisms that allow the

robot to perform complex manipulation tasks. It relies on a revised version

120distributed adaptive control (dac) integrated into a novel

architecture for controlled human-robot interaction

of the so-called Passive Motion Paradigm (PMP). PMP is a biomimetic

synergy formation model (Mohan et al., 2009) conceived for solving redun-

dancy in human movements. The neurobiological background is related

to the Equilibrium Point Hypothesis, developed in the 60s by Feldman

and then by Bizzi. In the model (Mussa Ivaldi et al., 1988) goals and

constraints are expressed as force fields simultaneously in different mo-

tor spaces. In this way the model can carry out kinematic inversion by

means of a differential approach that relies on Jacobians but never com-

putes their inverse explicitly. In the current implementation the model

was modified by breaking it into two modules: 1) a module where virtual

force fields are used in the workspace in order to perform reaching tasks

while avoiding obstacles as conceived by Khatib in the 80s (Khatib, 1985);

2) a module that carries out kinematic inversion via a modified pseudo-

inverse/transposed Jacobian. Within the PMP framework it is possible to

describe objects of the perceived world either as obstacles or as targets,

and to consequently generate proper repulsive or attractive force fields,

respectively. The PMP module can also include internal constraints, for

example related to bimanual coordination.

According to the composition of all the active fields, the robot's end-effector

is eventually driven towards the selected target while bypassing the identi-

fied obstacles. The behavior and performance strictly depend on the mu-

tual relationship among the tunable field parameters. This framework

allows defining a trajectory in Cartesian Space, which needs to be properly

converted into joint space through a robust inverse kinematics solver.

Specifically, we resort to the Cartesian Controller (Pattacini et al., 2010)

to solve the inverse kinematics problem while ensuring that additional
constraints such as joint limits are fulfilled. The non-linear optimization al-

gorithm used, IPOPT, allows solving the inverse kinematics problem ro-

bustly (Wachter and Biegler, 2005). The Cartesian Controller also pro-
vides a reliable control scheme that extends the Multi-Referential Dynamical

Systems approach (Hersch and Billard, 2008) synchronizing two dynam-

ical controllers, one in joint space and the other one in task space. It


ensures that the robot actually follows the trajectory.
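To make the field composition concrete, the following sketch (a simplified illustration of the idea, not the PmpActionsModule itself; all gains and positions are made up) sums a spring-like attractive field toward the target with Khatib-style repulsive fields around the obstacles and integrates the resulting velocity:

    import numpy as np

    def field_velocity(x, target, obstacles, k_att=1.0, k_rep=0.001, rho0=0.15):
        """End-effector velocity as the sum of a spring-like attractive
        field toward the target and Khatib-style repulsive fields that act
        only within rho0 metres of each obstacle."""
        v = k_att * (target - x)
        for obs in obstacles:
            d = np.linalg.norm(x - obs)
            if 0.0 < d < rho0:
                v += k_rep * (1.0 / d - 1.0 / rho0) * (x - obs) / d**3
        return v

    # Integrating the field drives the end-effector to the target (the
    # virtual equilibrium point) while bending around the obstacle.
    x = np.array([0.0, 0.0, 0.0])
    target = np.array([0.30, 0.20, 0.10])
    obstacles = [np.array([0.15, 0.05, 0.05])]
    for _ in range(500):
        x = x + 0.02 * field_velocity(x, target, obstacles)
    print(np.round(x, 3))   # close to the target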

In order to reproduce human-like manipulation tasks, we built a library,

namely the pmpActions, modeling complex actions as a combination of

simple action primitives, defining high-level actions such as
Reach, Grasp and Release; in fact, if we assume that, when an object is
grasped, it becomes an appendage of the robot's body structure, complex
actions can be performed as a sequence of simple primitives (Fig. A.4).

[Figure A.4: block diagram of the PMP stack: the pmpActions library issues requests for adding/removing objects and for grasp and release to a pmpClient; the pmpServer performs field and trajectory generation from target/obstacle positions and sizes; the CartesianInterface performs the inverse kinematics on 3D point positions and poses.]

Figure A.4: PMP architecture: at the lower level the pmpServer module computes the trajectory, targets and obstacles in the space, feeding the Cartesian Interface with the trajectory that has to be followed. The user relies on a pmpClient in order to ask the server to move, add and remove objects, and eventually to start new reaching tasks. At the highest level the pmpActions library provides a set of complex actions, such as "Push" or "Move the object", simply implemented as combinations of primitives.


Spoken Language Interaction and Supervisor

The spoken language interaction is managed by the Supervisor. It is im-

plemented in the CSLU Toolkit (Sutton et al., 1998) Rapid Application

Development (RAD), in the form of a finite-state dialogue system where

the speech synthesis is done by Festival and the recognition by Sphinx-II.

These state modules provide functions including conditional transitions
to new states based on the words and sentences recognized, and thus
conditional execution of code based on the current state and the spoken input
from the user. The recognition and extraction of the useful informa-

tion from the recognized sentence is based on a simple grammar that we

have developed. For instance, a sentence giving an action order follows
this structure:

1. $action1 = grasp | point ;

2. $action2 = put | drop ;

3. $object = drums | synthesizer | guitar | box ;

4. $where_rel = my | your ;

5. $location = here | $where_rel left | middle | $where_rel right;

6. $actionOrder = $action1 [*sil%%|the%%] $object | $action2 [the%%]

$object [on%% the%%] $location;

$action1, $action2, $object, $where_rel and $location are variables which
can match several words. The final recognized sentence will have the

grammar of $actionOrder. The words between square brackets are optional,

and the ones followed by a %% sign are recognized but not stored.
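As an illustration, a rough Python rendering of this grammar (our sketch only; the actual system uses the CSLU RAD grammar format, and the optional *sil%% and the%% tokens are simplified to optional words here):

    import re

    OBJ = r"(?P<obj>drums|synthesizer|guitar|box)"
    LOC = r"(?P<loc>here|(?:my|your) left|middle|(?:my|your) right)"

    # One pattern per branch of $actionOrder (grasp/point vs. put/drop).
    ORDER1 = re.compile(rf"^(?P<act>grasp|point)(?: the)? {OBJ}$")
    ORDER2 = re.compile(rf"^(?P<act>put|drop)(?: the)? {OBJ}(?: on)?(?: the)? {LOC}$")

    for sentence in ("grasp the drums", "put the guitar on the middle"):
        m = ORDER1.match(sentence) or ORDER2.match(sentence)
        print(m.groupdict())   # extracted action, object and location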

The Supervisor allows the human to interact with the robot. It requires
access to and control over several modules and is thus in charge of the YARP
connections, establishing connections with the following components:


1. attentionSelector, to be able to control where the iCub is looking,

providing a non-verbal way of communication by gazing at the tar-

gets while it is speaking (e.g. looking at the drums when it says
the name "drums"),

2. objectsPropertiesCollector, to obtain all the knowledge of the iCub,

what it knows about the world in order to answer the human (e.g.

to know if the synthesizer is on the table),

3. pmpActions, to move its hands, executing orders from the human
that require it to manipulate objects (e.g. grasping the guitar

after the proper command spoken by the human),

4. tactileInterface, to carry out the calibration process between the Re-
actable and the iCub, in which the human participates
(e.g. when the iCub asks them to touch the table).

Visualizations in real-time

Displaying the state of the world and of the iCub is important for debug-

ging purposes since it provides a useful tool to visualize in real-time the

elements available in the memory of the robot. All objects detected by

the Reactable or stored as static locations are loaded from the OPC and

displayed in a graphical user interface, the iCubGUI. The robot is also

rendered in real-time, which gives the user an accurate visual represen-

tation of the whole scenario in the coordinate frame of the robot. The

objLocationTransformer module is responsible for sending the data directly

to the iCubGUI. This classification allows us to customize the display for

different entity types.

Experimental scenario

We performed 2 different experiments in order to benchmark the archi-

tecture’s capabilities to maintain social interaction. We first defined 3


fixed positions on the table: left, middle and right (Fig. A.5).

They are used as spatial reference for placing the objects the robot has to

interact with.

In the initial experiments, we evaluate the precision in grasping tasks

performed by the iCub. The robot had to perform the following tasks:

grasp an object, lift it from the table and put it back at a desired location.

We use an object of dimensions 4x6x7 cm (Fig. A.1).

In the first experiment, we divided the task in two: namely the grasp
and the place tasks. In the grasp task, the iCub grasps an object placed on

one of the three fixed positions and then releases it at the exact same
point where it picked it up, thereby assessing the robot's capability of

object manipulation as well as precision. For each of these three positions,

the robot had to perform 4 grasps. The robot executed a total of 36
grasping actions, equally divided among the right, middle and left positions.

We removed 4 samples corresponding to unsuccessful grasps, 3 at the middle

and 1 at the left position. At these positions the robot slightly hit the

object after placing it on the table. As the data collection is performed

after the arm is back at the rest position, the values recorded do not
reflect the precision of the grasping. To measure the precision, we define

the grasping error (GE) as the difference between the target position and

the final position where the object is placed.

To further test the robot’s ability to move an object from one point to

another, we designed the place task. In this task, the robot had to grasp

and move an object from one spatial point to another. We designed the

experiment to cover all possible combinations among right, middle and

left. The robot executed a total of 72 grasping actions, 12 repetitions of

the 6 possible combinations of source-target (where source and target are

necessarily different). The place error (PE) is calculated in the same way
as used for the grasping, i.e., the difference between the target and the final position.
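The 6 combinations follow directly from the ordered pairs of the three positions; a quick illustrative check in Python:

    from itertools import permutations

    # Every ordered source-target pair of the three fixed positions.
    pairs = list(permutations(["left", "middle", "right"], 2))
    print(len(pairs), pairs)   # 6 pairs; 12 repetitions each gives 72 grasps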

The system was tested on a cluster with 4 machines using quad-core Intel
Core i7-820QM processors (1.73 GHz).


In the last experiment, we test the iCub’s learning capabilities as well as

the spoken language interaction. For the scope of this experiment, we

have used shared plans. Also called cooperative plans, they correspond to
interlaced actions which are scheduled by two cooperating agents (here a
human and the iCub) to achieve a common goal (Tomasello
et al., 2005).

The shared plan can be taught in real time by the human. After announc-
ing to the iCub that they want to do a shared plan by telling it "Shared Plan",
the Supervisor enters the sub-dialogue node of the cooperative plan. The
user can describe the plan they want to execute with the iCub and can
execute it if the iCub knows it; otherwise the robot will ask the human to
teach it. For the learning part, the user has to describe the sequence of
actions one by one, letting the iCub ask for confirmation each
time. When all the actions have been told, the user ends the teaching by
saying "finished" (Petit et al., 2012). The newly learnt shared plan is then
saved in a file in the following form:

• Shared plan name, e.g. "Swap"

• Steps, e.g. "{I put {{drums middle}}} {You put {{guitar right}}} {I put {{drums left}}}"

• Subjects, e.g. "I You"

• Arguments, e.g. "drums guitar"

The subjects and arguments are written so that they can be parsed and replaced
in the shared plan according to the sentence told by the user when invoking
the plan. Thus, one could learn for instance the swap plan using
"I and You swap the drums and the guitar" but execute it with "You
and I swap the drums and the guitar": in this way, there is a role
reversal and the robot performs the actions originally planned to be the ones
executed by the human, and vice versa: the role reversal test is passed


(Dominey and Warneken, 2009). Indeed, this aspect is very important

because this capability is the signature of a shared plan (Tomasello et al.,

2005; Carpenter et al., 2005).

Moreover, the arguments could also be replaced, and with the same origi-

nal shared plan learnt, the iCub and the user could resolve the cooperative

plan ”You and I swap the synthesizer and the box”, showing the capacity

to generalize, in this example over objects, in the cooperative plan:
it has learnt how to swap something with something, not just the drums
and the guitar (Lallee et al., 2010).
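A minimal Python sketch of this re-binding (our illustration under a hypothetical data layout, not the thesis implementation) makes the mechanism explicit: substituting new subjects yields role reversal, and substituting new arguments yields object generalization:

    # Sketch of re-binding the stored subjects and arguments at execution time.
    plan = {
        "steps": [("I", "put", "drums", "middle"),
                  ("You", "put", "guitar", "right"),
                  ("I", "put", "drums", "left")],
        "subjects": ["I", "You"],
        "arguments": ["drums", "guitar"],
    }

    def instantiate(plan, subjects, arguments):
        """Map the taught subjects/arguments onto those of the new command."""
        s_map = dict(zip(plan["subjects"], subjects))
        a_map = dict(zip(plan["arguments"], arguments))
        return [(s_map[s], act, a_map[obj], loc)
                for s, act, obj, loc in plan["steps"]]

    # Taught as "I and You swap the drums and the guitar"; executed as
    # "You and I swap the synthesizer and the box": role reversal plus
    # object generalization in one call.
    for step in instantiate(plan, ["You", "I"], ["synthesizer", "box"]):
        print(step)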

For the experiment, we designed 4 different shared plans. We repeated
the learning and execution of all four shared plans with three non-naive humans.
Here we present the 4 shared plans in both the teaching and the execution phases,
exploring both role reversal and generalization of objects:

Teaching:

1. You and I swap drums guitar

• You put drums left and I put guitar right

2. You and I split drums guitar

• You put drums right and I put guitar left

3. You and I group guitar drums

• You put guitar middle and I put drums middle

4. You and I pass guitar left

• You put guitar middle and I put guitar left

Execution:

1. You and I swap drums guitar


2. I and You split drums guitar

3. You and I group synthesizer guitar

4. I and You pass guitar right

[Figure A.5: the iCub facing the Reactable, with the left (L), middle (M) and right (R) positions at x = -35 cm and y = 20, 0 and -20 cm in the robot's frame.]

Figure A.5: A cartoon of the experimental scenario. The distances for the left (L), middle (M) and right (R) positions are given with respect to the robot.

A.3 Results

In the first part of this section, we present an analysis of error in the

grasping tasks. We perform the analysis separately for the two axes x and

y. Initially, we perform a Shapiro-Wilk test to check for normality in GEx

and GEy. The results show a p-value = 0.68 for GEx and p-value = 0.85

for GEy; hence the data are normally distributed. We further perform a t-test
to evaluate whether the distributions have zero mean. The mean values are
-1.20 cm and 1.57 cm for GEx and GEy, respectively. This suggests that the
robot has a tendency to place the objects to the front and right of
the target.
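The analysis pipeline can be sketched as follows (illustrative Python with synthetic error samples, not the original analysis scripts; both tests are provided by scipy.stats):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    GEx = rng.normal(-1.20, 0.8, size=32)   # synthetic grasping errors [cm]
    GEy = rng.normal(1.57, 0.8, size=32)    # 36 grasps minus 4 discarded

    # Shapiro-Wilk: a large p-value means normality is not rejected.
    print(stats.shapiro(GEx).pvalue, stats.shapiro(GEy).pvalue)

    # One-sample t-tests against a zero-mean error.
    print(stats.ttest_1samp(GEx, 0.0).pvalue, stats.ttest_1samp(GEy, 0.0).pvalue)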


Analyzing individually the 3 different conditions, we see that the mean in

the horizontal error GEx decreases from right to left (Fig. A.6). That is,
the horizontal error is higher on the side corresponding to the arm used in
the task (the right arm). In our interpretation, this reflects the geometrical
arrangement of the arm in relation to the torso when the target position is
on the side opposite to the arm used for grasping. On the left side, the error
in x enters only through its cosine component. Therefore, the same deviation caused
by possible imprecisions in the movement of the torso represents a smaller
error on the left side than in the middle or on the right side.

We confirm this observation by performing a multiple regression analysis

to investigate the impact of the different positions on those deviations. The
only significant impact is on GEx when the action is performed at the
left side (with p-value = 2.55 × 10−7). No impact of the source was observed

in GEy. Hence, we conclude that the iCub is significantly more precise

grasping objects that are on the left side.

[Figure A.6: box plots of the grasping error in x (left panel) and in y (right panel), in cm, for the left, middle and right positions on the table.]

Figure A.6: Box plots of GEx and GEy for the grasp experiments. The error in placing an object on the table increases from left to right with respect to the x axis. No significant difference was observed for y.

In the second part, we present an analysis of error in the place experiment (PE). The Shapiro-Wilk test

gives p-values of 0.25 and 0.0015, respectively, for PEx and PEy. In this

case, PEy is not normally distributed. For this reason, we use parametric

tests for PEx and non-parametric for PEy.

A linear regression is performed in order to see if the distance of the

trajectory, defined as one from middle to right/left and two from left to
right, has an impact on the error. The analysis gives a p-value of 0.80,
suggesting that the distance has no impact on the error (Fig. A.7).

[Figure A.7: box plots of the place error in x and y (cm), grouped by target position (top) and by source position (bottom).]

Figure A.7: Box plots for the place experiment for the different conditions analyzed: source and target. Both conditions have an impact on the error.

We go further and perform another linear regression using the source and

the target as variables, followed by an ANOVA. The analysis gives an

impact for both parameters (p-value equals 0.020 and 0.001 for source

and target respectively). However, the system is still very precise, with errors ranging
from -1 to 1 cm.

For the analysis of PEy we use Kruskal-Wallis tests. The tests reveal no

significant impact from the distance (p-value = 0.099), but a significant

influence from the source (p-value = 4.8× 10−10) and the target (p-value

= 2.06 × 10−9). The precision along this axis is lower than along the x axis,


but still very acceptable. The errors are about -1 to 3 cm when moving a
4x6 cm object over a distance of at least 20 cm (or 40 cm between left and right).
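The non-parametric counterpart can be sketched the same way (again with synthetic samples; scipy.stats.kruskal implements the Kruskal-Wallis H-test):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Hypothetical PEy samples [cm] grouped by source position.
    pey_left, pey_middle, pey_right = (rng.normal(m, 0.7, size=24)
                                       for m in (-0.5, 1.0, 2.0))
    print(stats.kruskal(pey_left, pey_middle, pey_right).pvalue)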

For the second experiment, three subjects were asked to teach the robot 4

different shared plans and then execute them. In total, the robot learned

and executed 12 shared plans. To evaluate the learning capabilities of

the iCub as well as it’s performance in executing each plan, we measured

the time of completion for each shared plan. Learning time is measured

from the beginning of the human command to the indication of ”finished”,

which represents the completion of a learning sequence. Accordingly, ex-

ecution time is measured from the start of the human command to the

execution of the last action, followed by a next request by the iCub.

The iCub was able to learn a shared plan of two actions in approximately

half a minute and execute the learned plan in a minute (Fig. A.8). During

the learning phase, only two deficits in speech recognition occurred, whilst

during the execution phase three occurred. These
deficits are mainly attributed to the Sphinx-II speech recognizer, and
were therefore not taken into account. During the execution of the 12

shared plans with N=18 distinct actions by the iCub, no physical errors

occurred. Finally, it is worth noticing that a shared plan execution takes

roughly the same amount of time with a role reversal, with replaced arguments,
or with both, stressing the robustness

of the system.

A.4 Conclusions

In this paper, we proposed an interaction paradigm and architecture for

human-robot interaction, called BASSIS, designed to provide a novel and

controlled environment for research in social interaction with humanoid

robots. The core part of BASSIS is the integration between the iCub

robot and the tangible interface Reactable. This combination provides a

rich context for the execution of cooperative plans between the iCub and

the human with games or collaborative music generation.

[Figure A.8: box plot of the time in seconds (roughly 20-60 s) for the learning and execution phases of the shared plan tasks.]

Figure A.8: Average time needed for the shared plan tasks. The learning box represents the time needed for the human to teach the task; the execution box represents the time needed for the interaction to perform the learned actions.

We presented a

detailed technical description of all the components that build the 4 layers

of BASSIS. Moreover, we exercised the system's capacities in 2 different
kinds of experiments.

In our results we have assessed both the precision of the overall system and

its usability in the context of acquiring and executing shared plans.

We showed that the position of the objects provided by the combination

of the Reactable, objectLocationTransformer, pmpActions and cartesian-

Control is reliably mapped to the robot’s frame allowing precise motor

actions. In a series of grasping tasks, the setup showed a maximum error

of 2 cm between pre-defined targets and the final positions to which the objects
were displaced.

In the experiments, we observed an impact of the position of the objects on the
grasping precision. For simple grasping over the same position, i.e., taking
an object off the table and putting it back on the same spot, the precision was
significantly better on the left side of the table (the side opposite to the
arm used). Assuming that the robot has perfect bilateral symmetry,
this difference could be compensated for by using both arms.

The processes ran smoothly without affecting the dynamics of the interac-
tion. Over 40 hours of tests and 360 grasp executions, we had no
issues with any hardware or software.

In the next version of the architecture, 2 new modules that are already in a
test phase will be integrated. The first is based on the integration of the
HAMMER model (Sarabia et al., 2011) (Hierarchical Attentive Multiple
Models for Execution and Recognition of actions) with the PmpActions-
Module. With HAMMER the system will be able to understand human
movements by internally simulating the robot's own actions and predicting
future human states. The other module will add face recognition capabil-
ities to the robot. It is based on a canonical model of cortical computation
called the Temporal Population Code (Luvizotto et al., 2011). In addition we
will add a multi-scale bottom-up and top-down attention system (Math-
ews et al., 2011). With these new features we expect to increase the level
of social connection between the human and the robot during the interaction.

Bibliography

Each reference indicates the pages where it appears.

Yoseph Bar-Cohen and David Hanson. The Coming Robot Revolution: Ex-

pectations and Fears About Emerging Intelligent, Humanlike Machines.

Springer, 2009. ISBN 0387853480. 110

Omri Barak, Misha Tsodyks, and Ranulfo Romo. Neuronal popula-

tion coding of parametric working memory. The Journal of neuro-

science, 30(28):9424–30, July 2010. ISSN 1529-2401. doi: 10.1523/

JNEUROSCI.1875-10.2010. URL http://www.jneurosci.org/cgi/

content/abstract/30/28/9424. 14, 23

Emmanuel J Barbeau, Margot J Taylor, Jean Regis, Patrick Marquis,

Patrick Chauvel, and Catherine Liegeois-Chauvel. Spatio temporal dy-

namics of face recognition. Cerebral cortex (New York, N.Y. : 1991), 18

(5):997–1009, May 2008. ISSN 1460-2199. doi: 10.1093/cercor/bhm140.

URL http://cercor.oxfordjournals.org/cgi/content/abstract/

18/5/997. 55

Francesco P Battaglia and Bruce L McNaughton. Polyrhythms of the

brain. Neuron, 72(1):6–8, October 2011. ISSN 1097-4199. doi: 10.1016/

j.neuron.2011.09.019. URL http://dx.doi.org/10.1016/j.neuron.

2011.09.019. 16

Pierre Bayerl and Heiko Neumann. Disambiguating visual motion through

contextual feedback modulation. Neural computation, 16(10):2041–66,

October 2004. ISSN 0899-7667. doi: 10.1162/0899766041732404. URL


http://www.citeulike.org/user/toerst/article/4199040. 58, 72,

81

G. Bekey and J. Yuh. The Status of Robotics Report on the WTEC

international study: Part I. IEEE Robotics and Automation Magazine,

14(4):76–81, 2007. 110

Karim Benchenane, Paul H Tiesinga, and Francesco P Battaglia. Oscil-

lations in the prefrontal cortex: a gateway to memory and attention.

Current opinion in neurobiology, 21(3):475–85, June 2011. ISSN 1873-

6882. doi: 10.1016/j.conb.2011.01.004. URL http://dx.doi.org/10.

1016/j.conb.2011.01.004. 16

Ross Bencina and Martin Kaltenbrunner. The Design and Evolution of

Fiducials for the reacTIVision System. In 3rd international conference

on generative systems in the electronic arts, 2005. 116

Andrea Benucci, Robert A Frazor, and Matteo Carandini. Standing waves

and traveling waves distinguish two circuits in visual cortex. Neuron, 55

(1):103–117, 2007. ISSN 0896-6273 (Print). doi: 10.1016/j.neuron.2007.

06.017. 23

Andrea Benucci, Dario L Ringach, and Matteo Carandini. Coding of stim-

ulus sequences by population responses in visual cortex. Nature neuro-

science, 12(10):1317–24, October 2009. ISSN 1546-1726. doi: 10.1038/

nn.2398. URL http://www.pubmedcentral.nih.gov/articlerender.

fcgi?artid=2847499&tool=pmcentrez&rendertype=abstract. 14, 23

Cyrus P Billimoria, Benjamin J Kraus, Rajiv Narayan, Ross K Maddox,

and Kamal Sen. Invariance and sensitivity to intensity in neural discrim-

ination of natural sounds. The Journal of neuroscience, 28(25):6304–

6308, 2008. ISSN 1529-2401 (Electronic). doi: 10.1523/JNEUROSCI.

0961-08.2008. 15, 23

Tom Binzegger, Rodney J Douglas, and Kevan A C Martin. A quan-

titative map of the circuit of cat primary visual cortex. The Jour-

nal of neuroscience : the official journal of the Society for Neu-

roscience, 24(39):8441–53, September 2004. ISSN 1529-2401. doi:

10.1523/JNEUROSCI.1400-04.2004. 90


M Block. A note on the refraction and image formation of the rat’s

eye. Vision Research, 9(6):705–711, June 1969. ISSN 00426989. doi:

10.1016/0042-6989(69)90127-8. URL http://dx.doi.org/10.1016/

0042-6989(69)90127-8. 97

Francisco Bonin-Font, Alberto Ortiz, and Gabriel Oliver. Visual Naviga-

tion for Mobile Robots: A Survey. J. Intell. Robotics Syst., 53(3):263–

296, November 2008. ISSN 0921-0296. doi: 10.1007/s10846-008-9235-4.

90

Jeffrey S. Bowers. On the biological plausibility of grandmother cells:

Implications for neural network theories in psychology and neuroscience.

Psychological Review, 116(1):220, 2009. 10

Cynthia Breazeal, Matt Berlin, Andrew Brooks, Jesse Gray, and Andrea L.

Thomaz. Using perspective taking to learn from ambiguous demonstra-

tions. Robotics and Autonomous Systems, 54(5):385–393, May 2006.

ISSN 09218890. doi: 10.1016/j.robot.2006.02.004. 110

Farran Briggs and Edward M. Callaway. Layer-Specific Input to Distinct

Cell Types in Layer 6 of Monkey Primary Visual Cortex. J. Neurosci.,

21(10):3600–3608, May 2001. URL http://www.jneurosci.org/cgi/

content/abstract/21/10/3600. 7

Dean V Buonomano and Wolfgang Maass. State-dependent computa-

tions: spatiotemporal processing in cortical networks. Nature reviews.

Neuroscience, 10(2):113–25, February 2009. ISSN 1471-0048. doi:

10.1038/nrn2558. URL http://dx.doi.org/10.1038/nrn2558. 108

G Buzsaki. Rhythms of the brain. Oxford University Press, New York,

New York, USA, 2006. ISBN 9780195301069. URL http://books.

google.es/books?id=ldz58irprjYC. 15, 51, 103

C. Cadoz. Le geste canal de communication homme/machine: la commu-

nication instrumentale. TSI. Technique et science informatiques, 13(1):

31–61, 1984. ISSN 0752-4072. URL http://cat.inist.fr/?aModele=

afficheN&cpsidt=3595521. 75

Edward M. Callaway. Local circuits in primary visual cortex of the

macaque monkey. Annual Review of Neuroscience, 21:47–74, November


1998. URL http://www.annualreviews.org/doi/abs/10.1146/

annurev.neuro.21.1.47?prevSearch=Local+circuits+in+primary+

visual+cortex+of+the+macaque+monkey&searchHistoryKey=. 7

Edward M Callaway. Structure and function of parallel pathways in

the primate early visual system. The Journal of physiology, 566(Pt

1):13–9, July 2005. ISSN 0022-3751. doi: 10.1113/jphysiol.2005.

088047. URL http://www.pubmedcentral.nih.gov/articlerender.

fcgi?artid=1464718&tool=pmcentrez&rendertype=abstract. 5

Giorgio Cannata, Marco Maggiali, Giorgio Metta, and Giulio Sandini. An

embedded artificial skin for humanoid robots. In 2008 IEEE Interna-

tional Conference on Multisensor Fusion and Integration for Intelligent

Systems, pages 434–438. IEEE, August 2008. ISBN 978-1-4244-2143-5.

doi: 10.1109/MFI.2008.4648033. 63, 114

Mikael A Carlsson, Philipp Knusel, Paul F M J Verschure, and Bill S

Hansson. Spatio-temporal Ca2+ dynamics of moth olfactory projection

neurones. European Journal of Neuroscience, 22(3):647–657, 2005. ISSN

0953-816X (Print). doi: 10.1111/j.1460-9568.2005.04239.x. 15, 24

Malinda Carpenter, Michael Tomasello, and Tricia Striano. Role Reversal

Imitation and Language in Typically Developing Infants and Children

With Autism. Infancy, 8(3):253–278, November 2005. ISSN 15250008.

doi: 10.1207/s15327078in0803\ 4. 126

Jayaram Chandrashekar, Mark A Hoon, Nicholas J P Ryba, and Charles S

Zuker. The receptors and cells for mammalian taste. Nature, 444(7117):

288–94, November 2006. ISSN 1476-4687. doi: 10.1038/nature05401.

URL http://dx.doi.org/10.1038/nature05401. 27

Shi-Huang Chen and Jhing-Fa Wang. Extraction of pitch information in

noisy speech using wavelet transform with aliasing compensation. In

Acoustics, Speech, and Signal Processing, 2001. Proceedings., volume 1,

pages 89–92 vol.1, Salt Lake City, USA, 2001. doi: 10.1109/ICASSP.

2001.940774. 50

Taishih Chi, Powen Ru, and Shihab A Shamma. Multiresolution spec-

trotemporal analysis of complex sounds. The Journal of the Acoustical


Society of America, 118(2):887–906, 2005. doi: 10.1121/1.1945807. URL

http://link.aip.org/link/?JAS/118/887/1. 25

S Chikkerur, T Serre, C Tan, and T Poggio. What and where: a Bayesian

inference theory of attention. Vision research, 50(22):2233–47, October

2010. ISSN 1878-5646. doi: 10.1016/j.visres.2010.05.013. URL http:

//www.ncbi.nlm.nih.gov/pubmed/20493206. 11, 54

Bevil R Conway, Soumya Chatterjee, Greg D Field, Gregory D Horwitz,

Elizabeth N Johnson, Kowa Koida, and Katherine Mancuso. Advances

in color science: from retina to behavior. The Journal of neuroscience

: the official journal of the Society for Neuroscience, 30(45):14955–63,

November 2010. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.4348-10.

2010. URL http://www.jneurosci.org/cgi/content/abstract/30/

45/14955. 3

A. Cowey and E.T. Rolls. Human cortical magnification factor and its

relation to visual acuity. Experimental Brain Research, 21(5), December

1974. ISSN 0014-4819. doi: 10.1007/BF00237163. URL http://www.

springerlink.com/content/q7v12n66365h123h/. 6

Pablo D’Angelo. Hugin, 2010. URL http://hugin.sourceforge.net/.

97

J Daugman. Two-dimensional spectral analysis of cortical receptive field

profiles. Vision Research, 20(10):847–856, 1980. ISSN 00426989. doi:

10.1016/0042-6989(80)90065-6. URL http://dx.doi.org/10.1016/

0042-6989(80)90065-6. 25

Kerstin Dautenhahn. Socially intelligent robots: dimensions of human-

robot interaction. Philosophical transactions of the Royal Society of

London. Series B, Biological sciences, 362(1480):679–704, April 2007.

ISSN 0962-8436. doi: 10.1098/rstb.2006.2004. 110

Licurgo de Almeida, Marco Idiart, and John E Lisman. The input-output

transformation of the hippocampal granule cells: from grid cells to place

fields. The Journal of neuroscience : the official journal of the Society

for Neuroscience, 29(23):7504–12, June 2009a. ISSN 1529-2401. doi: 10.

1523/JNEUROSCI.6048-08.2009. URL http://www.jneurosci.org/


cgi/content/abstract/29/23/7504. 92, 96, 107

Licurgo de Almeida, Marco Idiart, and John E Lisman. A second function

of gamma frequency oscillations: an E%-max winner-take-all mechanism

selects which cells fire. The Journal of neuroscience : the official journal

of the Society for Neuroscience, 29(23):7497–503, June 2009b. ISSN

1529-2401. doi: 10.1523/JNEUROSCI.6044-08.2009. URL http://www.

jneurosci.org/cgi/content/abstract/29/23/7497. 96, 97

R L De Valois, D G Albrecht, and L G Thorell. Spatial frequency selec-

tivity of cells in macaque visual cortex. Vision research, 22(5):545–59,

January 1982. ISSN 0042-6989. URL http://www.ncbi.nlm.nih.gov/

pubmed/7112954. 57

Sven J. Dickinson, Ales Leonardis, Bernt Schiele, and Michael J. Tarr,

editors. Object Categorization. Cambridge University Press, Cambridge,

2009. ISBN 9780511635465. doi: 10.1017/CBO9780511635465. 110

G S Doetsch. Patterns in the brain. Neuronal population coding in the

somatosensory system. Physiology and Behavior, 69(1-2):187–201, 2000.

ISSN 0031-9384 (Print). 16, 24

Peter Ford Dominey and Felix Warneken. The basis of shared intentions

in human and robot cognition. New Ideas in Psychology, 29(3):260–274,

December 2009. ISSN 0732118X. doi: 10.1016/j.newideapsych.2009.07.

006. 126

Armin Duff and Paul F.M.J. Verschure. Unifying perceptual and be-

havioral learning with a correlative subspace learning rule. Neuro-

computing, 73(10-12):1818–1830, June 2010. ISSN 09252312. doi:

10.1016/j.neucom.2009.11.048. 61, 112

Armin Duff, Cesar Renno-Costa, Encarni Marcos, Andre Luvizotto, An-

drea Giovannucci, Marti Sanchez-Fibla, Ulysses Bernardet, and Paul

Verschure. From Motor Learning to Interaction Learning in Robots,

volume 264 of Studies in Computational Intelligence. Springer Berlin

Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-05180-7. doi:

10.1007/978-3-642-05181-4. URL http://www.springerlink.com/

content/v348576tk12u628h.


Armin Duff, Marti Sanchez Fibla, and Paul F M J Verschure. A biologi-

cally based model for the integration of sensory-motor contingencies in

rules and plans: A prefrontal cortex based extension of the Distributed

Adaptive Control architecture. Brain research bulletin, 85(5):289–304,

June 2011. ISSN 1873-2747. doi: 10.1016/j.brainresbull.2010.11.008.

URL http://www.ncbi.nlm.nih.gov/pubmed/21138760. 27

Gaute T. Einevoll and Hans E. Plesser. Extended difference-of-Gaussians

model incorporating cortical feedback for relay cells in the lateral genic-

ulate nucleus of cat. Cognitive Neurodynamics, pages 1–18, Novem-

ber 2011. ISSN 1871-4080. doi: 10.1007/s11571-011-9183-8. URL

http://www.springerlink.com/content/g64572342p61vk40/. 6, 27,

94

M Fabre-Thorpe, G Richard, and S J Thorpe. Rapid categorization of

natural images by rhesus monkeys. Neuroreport, 9(2):303–8, January

1998. ISSN 0959-4965. URL http://www.ncbi.nlm.nih.gov/pubmed/

9507973. 17

Winrich A Freiwald, Doris Y Tsao, and Margaret S Livingstone. A face

feature space in the macaque temporal lobe. Nature neuroscience, 12

(9):1187–96, September 2009. ISSN 1546-1726. doi: 10.1038/nn.2363.

URL http://dx.doi.org/10.1038/nn.2363. 10, 55

M. Frigo and S.G. Johnson. The Design and Implementation of FFTW3.

Proceedings of the IEEE, 93(2):216–231, February 2005. ISSN 0018-

9219. doi: 10.1109/JPROC.2004.840301. URL http://ieeexplore.

ieee.org/xpl/freeabs_all.jsp?arnumber=1386650. 60

K Fukushima. Neocognitron: a self organizing neural network model for a

mechanism of pattern recognition unaffected by shift in position. Biolog-

ical cybernetics, 36(4):193–202, January 1980a. ISSN 0340-1200. URL

http://www.ncbi.nlm.nih.gov/pubmed/7370364. 2, 23

K Fukushima. Neocognitron for handwritten digit recognition. Neu-

rocomputing, 51(1):161–180, 2003. ISSN 09252312. doi: 10.1016/

S0925-2312(02)00614-8. URL http://linkinghub.elsevier.com/

retrieve/pii/S0925231202006148. 11, 54


Kunihiko Fukushima. Neocognitron: A self-organizing neural network

model for a mechanism of pattern recognition unaffected by shift in

position. Biological Cybernetics, 36(4):193–202, April 1980b. ISSN 0340-

1200. doi: 10.1007/BF00344251. URL http://www.springerlink.

com/content/r6g5w3tt54528137/. 11, 54

G Geiger, N Alber, S Jorda, and M Alonso. The Reactable: A Col-

laborative Musical Instrument for Playing and Understanding Music.

Her&Mus. Heritage & Museography, 2(4):36 – 43, 2010. 111, 116

A S Georghiades, P N Belhumeur, and D J Kriegman. From Few to

Many: Illumination Cone Models for Face Recognition under Variable

Lighting and Pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23

(6):643–660, 2001. xxviii, 19, 66, 67

A. Georgopoulos, A. Schwartz, and R. Kettner. Neuronal population

coding of movement direction. Science, 233(4771):1416–1419, Septem-

ber 1986. ISSN 0036-8075. doi: 10.1126/science.3749885. URL

http://www.sciencemag.org/content/233/4771/1416.abstract. 12

CD Gilbert and TN Wiesel. Columnar specificity of intrinsic horizon-

tal and corticocortical connections in cat visual cortex. J. Neurosci.,

9(7):2432–2442, July 1989. URL http://www.jneurosci.org/cgi/

content/abstract/9/7/2432. 7, 8

Susan Goldin-Meadow. How gesture promotes learning through-

out childhood. Child development perspectives, 3(2):106–111, Au-

gust 2009. ISSN 1750-8606. doi: 10.1111/j.1750-8606.2009.00088.

x. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?

artid=2835356&tool=pmcentrez&rendertype=abstract. 75

Tim Gollisch and Markus Meister. Rapid neural coding in the retina with

relative spike latencies. Science (New York, N.Y.), 319(5866):1108–11,

February 2008. ISSN 1095-9203. doi: 10.1126/science.1149639. URL

http://www.sciencemag.org/content/319/5866/1108.abstract. 4,

14

Tim Gollisch and Markus Meister. Eye smarter than scientists believed:

neural computations in circuits of the retina. Neuron, 65(2):150–64,


January 2010. ISSN 1097-4199. doi: 10.1016/j.neuron.2009.12.009. URL

http://www.ncbi.nlm.nih.gov/pubmed/20152123. 4

Leslie J. Gonzalez Rothi, Cynthia Ochipa, and Kenneth M. Heilman.

A Cognitive Neuropsychological Model of Limb Praxis. Cognitive

Neuropsychology, 8(6):443–458, November 1991. ISSN 0264-3294.

doi: 10.1080/02643299108253382. URL http://dx.doi.org/10.1080/

02643299108253382. 75

Charles G Gross. Genealogy of the ”grandmother cell”. The Neuroscientist

: a review journal bringing neurobiology, neurology and psychiatry, 8(5):

512–8, October 2002. ISSN 1073-8584. URL http://www.ncbi.nlm.

nih.gov/pubmed/12374433. 10

Alfred Haar. Zur Theorie der orthogonalen Funktionensysteme. Math-

ematische Annalen, 71(1):38–53, March 1911. ISSN 0025-5831. doi:

10.1007/BF01456927. URL http://www.springerlink.com/content/

q1n417x61q8r2827/. 26

Ileana L Hanganu, Yehezkel Ben-Ari, and Rustem Khazipov. Reti-

nal waves trigger spindle bursts in the neonatal rat visual cortex.

The Journal of neuroscience, 26(25):6728–36, June 2006. ISSN 1529-

2401. doi: 10.1523/JNEUROSCI.0752-06.2006. URL http://www.

jneurosci.org/cgi/content/abstract/26/25/6728. 26

D O Hebb. Studies of the organization of behavior. I. Behavior of the rat

in a field orientation. Journal of Comparative Psychology, 25:333–353,

1932. 97, 107

Micha Hersch and Aude G. Billard. Reaching with multi-referential dy-

namical systems. Autonomous Robots, 25(1-2):71–83, January 2008.

ISSN 0929-5593. doi: 10.1007/s10514-007-9070-7. 120

D H Hubel and T N Wiesel. Receptive fields, binocular interaction

and functional architecture in the cat’s visual cortex. The Jour-

nal of physiology, 160:106–54, January 1962. ISSN 0022-3751. URL

http://www.ncbi.nlm.nih.gov/pubmed/14449617. 7, 54

Chou P Hung, Gabriel Kreiman, Tomaso Poggio, and James J Di-

Carlo. Fast readout of object identity from macaque inferior tempo-


ral cortex. Science, 310(5749):863–6, November 2005. ISSN 1095-

9203. doi: 10.1126/science.1117593. URL http://www.sciencemag.

org/content/310/5749/863.abstract. 22

J M Hupe, A C James, P Girard, and J Bullier. Response modulations

by static texture surround in area V1 of the macaque monkey do not

depend on feedback connections from V2. Journal of neurophysiology,

85(1):146–63, January 2001. ISSN 0022-3077. URL http://www.ncbi.

nlm.nih.gov/pubmed/11152715. 8

L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention

for rapid scene analysis. IEEE Transactions on Pattern Analysis and

Machine Intelligence, 20(11):1254–1259, 1998. ISSN 01628828. doi: 10.

1109/34.730558. URL http://ieeexplore.ieee.org/xpl/freeabs_

all.jsp?arnumber=730558. 92, 93, 94, 107

Giuliano Iurilli, Fabio Benfenati, and Paolo Medini. Loss of Visually

Driven Synaptic Responses in Layer 4 Regular-Spiking Neurons of Rat

Visual Cortex in Absence of Competing Inputs. Cerebral cortex, pages

bhr304–, November 2011. ISSN 1460-2199. doi: 10.1093/cercor/bhr304.

URL http://cercor.oxfordjournals.org/cgi/content/abstract/

bhr304v1. 26

E M Izhikevich. Simple model of spiking neurons. IEEE transactions on

neural networks, 14(6):1569–72, January 2003. ISSN 1045-9227. doi:

10.1109/TNN.2003.820440. URL http://ieeexplore.ieee.org/xpl/

freeabs_all.jsp?arnumber=1257420. 28, 78

Eugene Izhikevich. Dynamical Systems in Neuroscience: The Geometry

of Excitability and Bursting. The MIT Press, Cambridge, USA, 1 edi-

tion, 2006. ISBN 0262090430. URL http://www.citeulike.org/user/

sdaehne/article/1396879. 30, 80

Eugene M Izhikevich. Which model to use for cortical spiking

neurons? IEEE transactions on neural networks, 15(5):1063–

70, September 2004. ISSN 1045-9227. doi: 10.1109/TNN.2004.

832719. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?

arnumber=1333071. 28, 30, 79, 80, 95


J P Jones and L A Palmer. An evaluation of the two-dimensional Gabor

filter model of simple receptive fields in cat striate cortex. Journal of

Neurophysiology, 58(6):1233–58, December 1987. ISSN 0022-3077. URL

http://www.ncbi.nlm.nih.gov/pubmed/3437332. 25

Sewoong Jun, Youngouk Kim, and Jongbae Lee. Difference of wavelet

SIFT based mobile robot navigation. In 2009 IEEE International Con-

ference on Control and Automation, pages 2305–2310. IEEE, December

2009. ISBN 978-1-4244-4706-0. doi: 10.1109/ICCA.2009.5410318. 90

Martin Kaltenbrunner and Ross Bencina. reacTIVision: a computer-vision

framework for table-based tangible interaction. In Proceedings of the 1st

international conference on Tangible and embedded interaction - TEI

’07, page 69, New York, New York, USA, February 2007. ACM Press.

ISBN 9781595936196. doi: 10.1145/1226969.1226983. 116

Martin Kaltenbrunner, Till Bovermann, Ross Bencina, and Enrico

Costanza. TUIO - A Protocol for Table Based Tangible User Inter-

faces. In Proceedings of the 6th International Workshop on Gesture

in Human-Computer Interaction and Simulation (GW 2005), Vannes,

France, 2005. 116

O. Khatib. Real-time obstacle avoidance for manipulators and mobile

robots. In IEEE International Conference on Robotics and Automa-

tion, volume 2, pages 500–505. Institute of Electrical and Electronics

Engineers, 1985. doi: 10.1109/ROBOT.1985.1087247. 120

Philipp Knusel, Reto Wyss, Peter Konig, and Paul F M J Ver-

schure. Decoding a temporal population code. Neural Computa-

tion, 16(10):2079–100, October 2004. ISSN 0899-7667. doi: 10.1162/

0899766041732459. URL http://portal.acm.org/citation.cfm?id=

1119706.1119711. 16, 24

Philipp Knusel, Mikael A Carlsson, Bill S Hansson, Tim C Pearce, and

Paul F M J Verschure. Time and space are complementary encoding

dimensions in the moth antennal lobe. Network, 18(1):35–62, 2007. ISSN

0954-898X (Print). doi: 10.1080/09548980701242573. 15, 24

Olivier Koch, Matthew R Walter, Albert S Huang, and Seth Teller.


Ground robot navigation using uncalibrated cameras. In 2010 IEEE In-

ternational Conference on Robotics and Automation, pages 2423–2430.

IEEE, May 2010. ISBN 978-1-4244-5038-1. doi: 10.1109/ROBOT.

2010.5509325. URL http://ieeexplore.ieee.org/lpdocs/epic03/

wrapper.htm?arnumber=5509325. 90

Randal A Koene and Michael E Hasselmo. An integrate-and-fire model of

prefrontal cortex neuronal activity during performance of goal-directed

decision making. Cerebral Cortex, 15(12):1964–81, December 2005.

ISSN 1047-3211. doi: 10.1093/cercor/bhi072. URL http://cercor.

oxfordjournals.org/cgi/content/abstract/15/12/1964. 33

Gord Kurtenbach and Eric Hulteen. Gestures in Human-Computer Com-

munication. In Brenda Laurel, editor, The Art and Science of Interface

Design., pages 309–317. Addison-Wesley Publishing Co., Boston, 1 edi-

tion, 1990. 75

Stephane Lallee, Severin Lemaignan, Alexander Lenz, Chris Melhuish,

Lorenzo Natale, S Skachek, T Van Der Zant, Felix Warneken, and P F

Dominey. Towards a Platform-Independent Cooperative Human-Robot

Interaction System: I. Perception. Proceedings of the IEEERSJ Interna-

tional Conference on Intelligent Robots and Systems, pages 4444–4451,

2010. 65, 126

M Laubach. Wavelet-based processing of neuronal spike trains prior

to discriminant analysis. Journal Neuroscience Methods, 134(2):

159–168, April 2004. ISSN 0165-0270. doi: http://dx.doi.org/10.

1016/j.jneumeth.2003.11.007. URL http://dx.doi.org/10.1016/j.

jneumeth.2003.11.007. 48

Sylvain Le Groux, Jonatas Manzolli, Marti Sanchez, Andre Luvizotto,

Anna Mura, Aleksander Valjamae, Christoph Guger, Robert Prueckl,

Ulysses Bernardet, and Paul Verschure. Disembodied and Collabora-

tive Musical Interaction in the Multimodal Brain Orchestra. In Pro-

ceedings of the international conference on New Interfaces for Musical

Expression, 2010. URL http://www.citeulike.org/user/slegroux/

article/8492764.


A B Lee, B Blais, H Z Shouval, and L N Cooper. Statistics of lateral genic-

ulate nucleus (LGN) activity determine the segregation of ON/OFF

subfields for simple cells in visual cortex. Proceedings of the National

Academy of Sciences of the United States of America, 97(23):12875–9,

November 2000. ISSN 0027-8424. doi: 10.1073/pnas.97.23.12875. URL

http://www.pnas.org/cgi/content/abstract/97/23/12875. 4

K C Lee, J Ho, and D Kriegman. Acquiring Linear Subspaces for Face

Recognition under Variable Lighting. IEEE Trans. Pattern Anal. Mach.

Intelligence, 27(5):684–698, 2005. xxviii, xxxiii, 19, 66, 67, 72

Dong Zhang Li, S. Z. Li, and Daniel Gatica-perez. Real-Time Face Detec-

tion Using Boosting in Hierarchical Feature Spaces, 2004. URL http://

citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.2088. 65

Bao-hua Liu, Guangying K Wu, Robert Arbuckle, Huizhong W Tao, and

Li I Zhang. Defining cortical frequency tuning with recurrent excitatory

circuitry. Nature neuroscience, 10(12):1594–600, December 2007. ISSN

1097-6256. doi: 10.1038/nn2012. URL http://dx.doi.org/10.1038/

nn2012. 90

N K Logothetis and D L Sheinberg. Visual object recognition. Annual

review of neuroscience, 19:577–621, January 1996. ISSN 0147-006X.

doi: 10.1146/annurev.ne.19.030196.003045. URL http://www.ncbi.

nlm.nih.gov/pubmed/8833455. 9

David G. Lowe. Distinctive Image Features from Scale-Invariant Key-

points. International Journal of Computer Vision, 60(2):91–110,

November 2004. ISSN 0920-5691. doi: 10.1023/B:VISI.0000029664.

99615.94. URL http://portal.acm.org/citation.cfm?id=993451.

996342. 73

Mantas Lukosevicius and Herbert Jaeger. Reservoir computing ap-

proaches to recurrent neural network training. Computer Science Re-

view, 3(3):127–149, 2009. ISSN 15740137. doi: 10.1016/j.cosrev.

2009.03.005. URL http://linkinghub.elsevier.com/retrieve/pii/

S1574013709000173. 24, 91

Andre Luvizotto, Cesar Renno-Costa, Ugo Pattacini, and Paul Verschure.


The encoding of complex visual stimuli by a canonical model of the

primary visual cortex: temporal population coding for face recognition

on the iCub robot. In IEEE International Conference on Robotics and

Biomimetics, page 6, Thailand, 2011. 19, 23, 76, 81, 86, 92, 132

Andre Luvizotto, Maxime Petit, Vasiliki Vouloutsi, et al. Exper-
imental and Functional Android Assistant: I. A Novel Architecture
for a Controlled Human-Robot Interaction Environment. In IROS 2012

(Submitted), 2012a. 61

Andre Luvizotto, Cesar Renno-Costa, and Paul F.M.J. Verschure. A

framework for mobile robot navigation using a temporal population

code. In Springer Lecture Notes in Computer Science - Living Ma-

chines, page 12, 2012b. 20

Andre Luvizotto, Cesar Renno-Costa, and Paul F.M.J. Verschure. A

wavelet based neural model to optimize and read out a temporal popu-

lation code. Frontiers in Computational Neuroscience, 6(21):14, 2012c.

17, 82

Sean P MacEvoy, Thomas R Tucker, and David Fitzpatrick. A precise form

of divisive suppression supports population coding in the primary visual

cortex. Nature Neuroscience, 12(5):637–45, May 2009. ISSN 1546-1726.

doi: 10.1038/nn.2310. URL http://dx.doi.org/10.1038/nn.2310. 23

Stephane Mallat. A Wavelet Tour of Signal Processing. Academic Press,

Burlington, USA, 1998. 17, 25, 31, 82

Gary Marsat and Leonard Maler. Neural heterogeneity and efficient

population codes for communication signals. Journal of neurophys-

iology, 104(5):2543–55, November 2010. ISSN 1522-1598. doi: 10.

1152/jn.00256.2010. URL http://jn.physiology.org/cgi/content/

abstract/104/5/2543. 15

Marialuisa Martelli, Najib J. Majaj, and Denis G. Pelli. Are faces

processed like words? A diagnostic test for recognition by parts. Journal

of Vision, 5(1):58–70, 2005. 19, 55

Zenon Mathews, Sergi Bermudez i Badia, and Paul F.M.J. Verschure.

PASAR: An integrated model of prediction, anticipation, sensation, at-


tention and response for artificial sensorimotor systems. Information

Sciences, 186(1):1–19, March 2011. ISSN 00200255. doi: 10.1016/j.ins.

2011.09.042. 94, 132

Andre Maurer, Micha Hersch, and Aude G Billard. Extended hopfield

network for sequence learning: Application to gesture recognition. Pro-

ceedings of ICANN, 1, 2005. URL http://citeseerx.ist.psu.edu/

viewdoc/summary?doi=10.1.1.68.7770. 76

David McNeill. Hand and Mind: What Gestures Reveal about Thought.

University Of Chicago Press, 1996. ISBN 0226561348. URL

http://www.amazon.com/Hand-Mind-Gestures-Reveal-Thought/

dp/0226561348. 75

Giorgio Metta, Paul Fitzpatrick, and Lorenzo Natale. YARP: Yet Another

Robot Platform. International Journal of Advanced Robotic Systems, 3

(1), 2006. URL http://eris.liralab.it/yarp/. 64, 115

Giorgio Metta, Giulio Sandini, David Vernon, Lorenzo Natale, and

Francesco Nori. The iCub humanoid robot: an open platform for re-

search in embodied cognition. In Proceedings of the 8th Workshop on

Performance Metrics for Intelligent Systems - PerMIS ’08, page 50,

New York, New York, USA, 2008. ACM Press. doi: 10.1145/1774674.

1774683. 55, 63, 110, 114

Ethan M Meyers, David J Freedman, Gabriel Kreiman, Earl K Miller,

and Tomaso Poggio. Dynamic population coding of category infor-

mation in inferior temporal and prefrontal cortex. Journal of Neuro-

physiology, 100(3):1407–19, September 2008. ISSN 0022-3077. doi: 10.

1152/jn.90248.2008. URL http://jn.physiology.org/cgi/content/

abstract/100/3/1407. 23

Vikramjit Mitra, Hosung Nam, Carol Espy-Wilson, Elliot Saltzman, and

Louis Goldstein. Recognizing articulatory gestures from speech for ro-

bust speech recognition. The Journal of the Acoustical Society of Amer-

ica, 131(3):2270–87, March 2012. ISSN 1520-8524. doi: 10.1121/1.

3682038. URL http://www.ncbi.nlm.nih.gov/pubmed/22423722. 75

V. Mohan, P. Morasso, G. Metta, and G. Sandini. A biomimetic, force-field


based computational model for motion planning and bimanual coordi-

nation in humanoid robots. Autonomous Robots, 27(3):291–307, August

2009. ISSN 0929-5593. doi: 10.1007/s10514-009-9127-x. 63, 120

V B Mountcastle. Modality and topographic properties of single neurons

of cat’s somatic sensory cortex. Journal of neurophysiology, 20(4):408–

34, July 1957. ISSN 0022-3077. URL http://www.ncbi.nlm.nih.gov/

pubmed/13439410. 6

F.A. Mussa Ivaldi, P. Morasso, and R. Zaccaria. Kinematic networks

distributed model for representing and regularizing motor redundancy.

Biological Cybernetics, 60(1):1–16, November 1988. ISSN 0340-1200.

120

Jonathan J Nassi and Edward M Callaway. Parallel processing strategies

of the primate visual system. Nature Reviews Neuroscience, 10(5):360–

372, April 2009. ISSN 1471-003X. doi: 10.1038/nrn2619. URL http:

//dx.doi.org/10.1038/nrn2619. xxiv, 3, 4, 5, 6

Ian Nauhaus, Andrea Benucci, Matteo Carandini, and Dario L Ringach.

Neuronal selectivity and local map structure in visual cortex. Neuron, 57

(5):673–9, March 2008. ISSN 1097-4199. doi: 10.1016/j.neuron.2008.01.

020. URL http://www.cell.com/neuron/fulltext/S0896-6273(08)

00105-0. 7

Ian Nauhaus, Laura Busse, Matteo Carandini, and Dario L Ringach. Stim-

ulus contrast modulates functional connectivity in visual cortex. Na-

ture Neuroscience, 12(1):70–6, January 2009. ISSN 1546-1726. doi:

10.1038/nn.2232. URL http://dx.doi.org/10.1038/nn.2232. 7, 90

Andreas Nieder and Katharina Merten. A labeled-line code for small and

large numerosities in the monkey prefrontal cortex. The Journal of

Neuroscience, 27(22):5986–93, May 2007. ISSN 1529-2401. doi: 10.

1523/JNEUROSCI.1056-07.2007. URL http://www.jneurosci.org/

cgi/content/abstract/27/22/5986. 27

Behrad Noudoost, Mindy H Chang, Nicholas A Steinmetz, and Tirin

Moore. Top-down control of visual attention. Current opinion in neu-

robiology, 20(2):183–90, April 2010. ISSN 1873-6882. doi: 10.1016/


j.conb.2010.02.003. URL http://dx.doi.org/10.1016/j.conb.2010.

02.003. 16

Selim Onat, Nora Nortmann, Sascha Rekauzke, Peter Konig, and Dirk

Jancke. Independent encoding of grating motion across stationary fea-

ture maps in primary visual cortex visualized with voltage-sensitive dye

imaging. NeuroImage, 55(4):1763–70, April 2011. ISSN 1095-9572.

doi: 10.1016/j.neuroimage.2011.01.004. URL http://dx.doi.org/10.

1016/j.neuroimage.2011.01.004. 17, 49, 103

R C O’Reilly. Six principles for biologically based computational models of

cortical cognition. Trends in cognitive sciences, 2(11):455–62, November

1998. ISSN 1364-6613. URL http://www.ncbi.nlm.nih.gov/pubmed/

21227277. 10

C.P. Papageorgiou, M. Oren, and T. Poggio. A general frame-

work for object detection. Narosa Publishing House, Bombay, In-

dia, January 1998. ISBN 81-7319-221-9. doi: 10.1109/ICCV.1998.

710772. URL http://www.computer.org/portal/web/csdl/doi/10.

1109/ICCV.1998.710772. 26

U Pattacini, F Nori, L Natale, G Metta, and G Sandini. An experi-

mental evaluation of a novel minimum-jerk cartesian controller for hu-

manoid robots. In 2010 IEEE/RSJ International Conference on Intelli-

gent Robots and Systems, pages 1668–1674. IEEE, October 2010. ISBN

978-1-4244-6674-0. doi: 10.1109/IROS.2010.5650851. 63, 120

Maxime Petit, Stephane Lallee, Jean-David Boucher, Gregoire Pointeau,

Pierrick Cheminade, Dimitri Ognibene, Eris Chinellato, Ugo Pattacini,

Ilaria Gori, Giorgio Metta, Uriel Martinez-Hernandez, Hector Barron,

Martin Inderbitzin, Andre Luvizotto, Vicky Vouloutsi, Yannis Demiris,

and Peter Ford Dominey. The Coordinating Role of Language in Real-

Time Multi-Modal Learning of Cooperative Tasks. IEEE Transactions

on Autonomous Mental Development (TAMD), 2012. 125

N Pinto, Y Barhomi, D D Cox, and J J DiCarlo. Comparing state-of-the-

art visual features on invariant object recognition tasks. In Applications

of Computer Vision (WACV), 2011 IEEE Workshop on, pages 463–470,


2011. doi: 10.1109/WACV.2011.5711540. 110

Nicolas Pinto, David D Cox, and James J DiCarlo. Why is real-world

visual object recognition hard? PLoS computational biology, 4(1):e27,

January 2008. ISSN 1553-7358. doi: 10.1371/journal.pcbi.0040027. URL

http://dx.plos.org/10.1371/journal.pcbi.0040027. 2

Tomaso Poggio and Emilio Bizzi. Generalization in vision and motor con-

trol. Nature, 431(7010):768–74, October 2004. ISSN 1476-4687. doi: 10.

1038/nature03014. URL http://dx.doi.org/10.1038/nature03014.

11, 54

P Reinagel and R C Reid. Temporal coding of visual information in the

thalamus. The Journal of neuroscience : the official journal of the

Society for Neuroscience, 20(14):5392–400, July 2000. ISSN 0270-6474.

URL http://www.ncbi.nlm.nih.gov/pubmed/10884324. 12

Cesar Renno-Costa, John E Lisman, and Paul F M J Verschure. The

mechanism of rate remapping in the dentate gyrus. Neuron, 68(6):1051–

8, December 2010. ISSN 1097-4199. doi: 10.1016/j.neuron.2010.11.

024. URL http://www.cell.com/neuron/fulltext/S0896-6273(10)

00940-2. 92, 96, 106

Cesar Renno-Costa, Andre L Luvizotto, Encarni Marcos, Armin Duff,

Martı Sanchez-Fibla, and Paul F M J Verschure. Integrating

Neuroscience-based Models Towards an Autonomous Biomimetic Syn-

thetic. In 2011 IEEE International Conference on RObotics and

BIOmimetics (IEEE-ROBIO 2011), Phuket Island, Thailand, 2011.

IEEE, IEEE. 91, 97

Cesar Renno-Costa, Andre Luvizotto, Alberto Betella, Marti Sanchez Fi-

bla, and Paul F. M. J. Verschure. Internal drive regulation of sensori-

motor reflexes in the control of a catering assistant autonomous robot.

In Lecture Notes in Artificial Intelligence: Living Machines, 2012.

M Riesenhuber and T Poggio. Hierarchical models of object recognition

in cortex. Nature Neuroscience, 2(11):1019–25, November 1999. ISSN

1097-6256. doi: 10.1038/14819. URL http://www.ncbi.nlm.nih.gov/

pubmed/10526343. 2, 23


M Riesenhuber and T Poggio. Models of object recognition. Nature

neuroscience, 3 Suppl:1199–204, November 2000. ISSN 1097-6256.

doi: 10.1038/81479. URL http://www.nature.com/neuro/journal/

v3/n11s/full/nn1100_1199.html#B47. 2

Dario L Ringach. Spatial Structure and Symmetry of Simple-Cell Recep-

tive Fields in Macaque Primary Visual Cortex. Journal of Neurophys-

iology, 88(1):455–463, 2002. URL http://jn.physiology.org/cgi/

content/abstract/88/1/455. 25

Dario L. Ringach, Robert M. Shapley, and Michael J. Hawken. Orientation

Selectivity in Macaque V1: Diversity and Laminar Dependence. J.

Neurosci., 22(13):5639–5651, July 2002. URL http://www.jneurosci.

org/cgi/content/abstract/22/13/5639. 7

R. W. Rodieck and J. Stone. Analysis of receptive fields of cat retinal ganglion cells. Journal of Neurophysiology, 28(5):833, 1965. ISSN 0022-3077. URL http://jn.physiology.org/cgi/reprint/28/5/833.pdf. 6, 27, 94

F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, November 1958. ISSN 0033-295X. URL http://www.ncbi.nlm.nih.gov/pubmed/13602029. 2

David E. Rumelhart, James L. McClelland, and PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations (Volume 1). A Bradford Book, 1986. ISBN 0262181207. URL http://www.amazon.com/Parallel-Distributed-Processing-Explorations-Microstructure/dp/0262181207. 2

P. A. Salin, J. Bullier, and H. Kennedy. Convergence and divergence in the afferent projections to cat area 17. The Journal of Comparative Neurology, 283(4):486–512, May 1989. ISSN 0021-9967. doi: 10.1002/cne.902830405. URL http://www.ncbi.nlm.nih.gov/pubmed/2745751. 8

Jason M. Samonds and A. B. Bonds. From another angle: Differences in cortical coding between fine and coarse discrimination of orientation. Journal of Neurophysiology, 91(3):1193–202, March 2004. ISSN 0022-3077. doi: 10.1152/jn.00829.2003. URL http://www.ncbi.nlm.nih.gov/pubmed/14614106. 23

Miguel Sarabia, Raquel Ros, and Yiannis Demiris. Towards an open-source social middleware for humanoid robots. In 11th IEEE-RAS International Conference on Humanoid Robots, pages 670–675, Bled, 2011. 132

Alan B. Saul. Temporal receptive field estimation using wavelets. Journal of Neuroscience Methods, 168(2):450–464, 2008. ISSN 0165-0270. doi: 10.1016/j.jneumeth.2007.11.014. 25

Dirk Schubert, Rolf Kötter, and Jochen F. Staiger. Mapping functional connectivity in barrel-related columns reveals layer- and cell type-specific microcircuits. Brain Structure & Function, 212(2):107–19, September 2007. ISSN 1863-2653. doi: 10.1007/s00429-007-0147-z. URL http://www.ncbi.nlm.nih.gov/pubmed/17717691. 90

Thomas Serre, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber, and Tomaso Poggio. Robust Object Recognition with Cortex-Like Mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:411–426, 2007. ISSN 0162-8828. doi: 10.1109/TPAMI.2007.56. 2, 23

Robert Shapley and Michael J. Hawken. Color in the cortex: single- and double-opponent cells. Vision Research, 51(7):701–17, April 2011. ISSN 1878-5646. doi: 10.1016/j.visres.2011.02.012. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3121536/. 5

S. Shushruth, Pradeep Mangapathy, Jennifer M. Ichida, Paul C. Bressloff, Lars Schwabe, and Alessandra Angelucci. Strong recurrent networks compute the orientation tuning of surround modulation in the primate primary visual cortex. The Journal of Neuroscience, 32(1):308–21, January 2012. ISSN 1529-2401. doi: 10.1523/JNEUROSCI.3789-11.2012. URL http://www.jneurosci.org/content/32/1/308.full. 8

Bruno Siciliano, Lorenzo Sciavicco, Luigi Villani, and Giuseppe Oriolo. Robotics: Modelling, Planning and Control (Advanced Textbooks in Control and Signal Processing). Springer, 2011. ISBN 1846286417. 119

Kyriaki Sidiropoulou, Fang-Min Lu, Melissa A. Fowler, Rui Xiao, Christopher Phillips, Emin D. Ozkan, Michael X. Zhu, Francis J. White, and Donald C. Cooper. Dopamine modulates an mGluR5-mediated depolarization underlying prefrontal persistent activity. Nature Neuroscience, 12(2):190–199, January 2009. ISSN 1097-6256. doi: 10.1038/nn.2245. URL http://dx.doi.org/10.1038/nn.2245. 33

Markus Siegel, Tobias H. Donner, and Andreas K. Engel. Spectral fingerprints of large-scale neuronal interactions. Nature Reviews Neuroscience, 13(2):121–34, February 2012. ISSN 1471-0048. doi: 10.1038/nrn3137. URL http://dx.doi.org/10.1038/nrn3137. 17

Lawrence C. Sincich and Jonathan C. Horton. The circuitry of V1 and V2: integration of color, form, and motion. Annual Review of Neuroscience, 28:303–26, January 2005. ISSN 0147-006X. doi: 10.1146/annurev.neuro.28.061604.135731. URL http://www.ncbi.nlm.nih.gov/pubmed/16022598. 8

Lawrence Sirovich and Marsha Meytlis. Symmetry, probability, and recognition in face space. Proceedings of the National Academy of Sciences of the United States of America, 106(17):6895–9, April 2009. ISSN 1091-6490. doi: 10.1073/pnas.0812680106. URL http://www.pnas.org/cgi/content/abstract/106/17/6895. 19, 55

Olaf Sporns and Jonathan D. Zwi. The small world of the cerebral cortex. Neuroinformatics, 2(2):145–62, January 2004. ISSN 1539-2791. doi: 10.1385/NI:2:2:145. URL http://www.ncbi.nlm.nih.gov/pubmed/15319512. 7, 23

Dan D. Stettler, Aniruddha Das, Jean Bennett, and Charles D. Gilbert. Lateral Connectivity and Contextual Interactions in Macaque Primary Visual Cortex. Neuron, 36(4):739–750, November 2002. ISSN 0896-6273. doi: 10.1016/S0896-6273(02)01029-2. URL http://dx.doi.org/10.1016/S0896-6273(02)01029-2. 7, 31

Charles F. Stevens. Preserving properties of object shape by computations in primary visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 101(43):15524–15529, 2004. doi: 10.1073/pnas.0406664101. URL http://www.pnas.org/content/101/43/15524.abstract. 25

Stephen Sutton, Ronald Cole, Jacques De Villiers, Johan Schalkwyk, Pieter Vermeulen, Mike Macon, Yonghong Yan, Ed Kaiser, Brian Rundle, Khaldoun Shobaki, Paul Hosom, Alex Kain, Johan Wouters, Dominic Massaro, and Michael Cohen. Universal Speech Tools: The CSLU Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), 7:3221–3224, 1998. 65, 122

Carolina Tamara and William Timberlake. Route and landmark learning by rats searching for food. Behavioural Processes, 86(1):125–32, January 2011. ISSN 1872-8308. doi: 10.1016/j.beproc.2010.10.007. URL http://dx.doi.org/10.1016/j.beproc.2010.10.007. 1

K. Tanaka. Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19:109–39, January 1996. ISSN 0147-006X. doi: 10.1146/annurev.ne.19.030196.000545. URL http://www.annualreviews.org/doi/abs/10.1146/annurev.ne.19.030196.000545. 10, 11

S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381(6582):520–2, June 1996. ISSN 0028-0836. doi: 10.1038/381520a0. URL http://www.ncbi.nlm.nih.gov/pubmed/8632824. 17, 41, 68

Sebastian Thrun. Toward a framework for human-robot interaction. Human-Computer Interaction, 19, 2004. 110

Paul Tiesinga and Terrence J. Sejnowski. Cortical enlightenment: are attentional gamma oscillations driven by ING or PING? Neuron, 63(6):727–32, September 2009. ISSN 1097-4199. doi: 10.1016/j.neuron.2009.09.009. URL http://dx.doi.org/10.1016/j.neuron.2009.09.009. 16, 104

Paul Tiesinga, Jean-Marc Fellous, and Terrence J. Sejnowski. Regulation of spike timing in visual cortical circuits. Nature Reviews Neuroscience, 9(2):97–107, February 2008. ISSN 1471-0048. doi: 10.1038/nrn2315. URL http://dx.doi.org/10.1038/nrn2315. 12

Thomas Töllner, Michael Zehetleitner, Klaus Gramann, and Hermann J. Müller. Stimulus saliency modulates pre-attentive processing speed in human visual cortex. PLoS ONE, 6(1):e16276, January 2011. ISSN 1932-6203. doi: 10.1371/journal.pone.0016276. URL http://dx.plos.org/10.1371/journal.pone.0016276. 41

Michael Tomasello, Malinda Carpenter, Josep Call, Tanya Behne, and Henrike Moll. Understanding and sharing intentions: the origins of cultural cognition. The Behavioral and Brain Sciences, 28(5):675–91; discussion 691–735, October 2005. ISSN 0140-525X. doi: 10.1017/S0140525X05000129. 125, 126

Doris Y. Tsao, Winrich A. Freiwald, Tamara A. Knutsen, Joseph B. Mandeville, and Roger B. H. Tootell. Faces and objects in macaque cerebral cortex. Nature Neuroscience, 6(9):989–95, September 2003. ISSN 1097-6256. doi: 10.1038/nn1111. URL http://dx.doi.org/10.1038/nn1111. 10

Doris Y. Tsao, Winrich A. Freiwald, Roger B. H. Tootell, and Margaret S. Livingstone. A cortical region consisting entirely of face-selective cells. Science, 311(5761):670–4, February 2006. ISSN 1095-9203. doi: 10.1126/science.1119983. URL http://www.sciencemag.org/content/311/5761/670.abstract. 10

P. F. M. J. Verschure and P. Althaus. A real-world rational agent: Unifying old and new AI. Cognitive Science, 27:561–590, 2003. 61, 111, 112

Paul F. M. J. Verschure, Thomas Voegtlin, and Rodney J. Douglas. Environmentally mediated synergy between perception and behaviour in mobile robots. Nature, 425(6958):620–4, October 2003. ISSN 1476-4687. doi: 10.1038/nature02024. URL http://dx.doi.org/10.1038/nature02024. 27, 105

J. Victor and K. Purpura. Estimation of information in neuronal responses. Trends in Neurosciences, 22(12):543, 1999. ISSN 0166-2236. 41, 51

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, HI, USA, 2001. IEEE Computer Society. ISBN 0-7695-1272-0. doi: 10.1109/CVPR.2001.990517. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=990517. 26

Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, April 2005. ISSN 0025-5610. doi: 10.1007/s10107-004-0559-y. 120

T. N. Wiesel and D. H. Hubel. Spatial and chromatic interactions in the lateral geniculate body of the rhesus monkey. Journal of Neurophysiology, 29(6):1115–56, November 1966. ISSN 0022-3077. URL http://www.ncbi.nlm.nih.gov/pubmed/4961644. 4

Ben Willmore, Ryan J. Prenger, Michael C.-K. Wu, and Jack L. Gallant. The Berkeley wavelet transform: a biologically inspired orthogonal wavelet transform. Neural Computation, 20(6):1537–1564, 2008. ISSN 0899-7667. doi: 10.1162/neco.2007.05-07-513. 25

Thilo Womelsdorf, Jan-Mathijs Schoffelen, Robert Oostenveld, Wolf Singer, Robert Desimone, Andreas K. Engel, and Pascal Fries. Modulation of neuronal interactions through neuronal synchronization. Science, 316(5831):1609–12, June 2007. ISSN 1095-9203. doi: 10.1126/science.1139597. URL http://www.sciencemag.org/content/316/5831/1609.abstract. 16

Reto Wyss and Paul F. M. J. Verschure. Bounded Invariance and the Formation of Place Fields. In Advances in Neural Information Processing Systems 16, Vancouver, B.C., Canada. MIT Press, 2004. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.4379. 18, 22, 99

Reto Wyss, Peter König, and Paul F. M. J. Verschure. Invariant representations of visual patterns in a temporal population code. Proceedings of the National Academy of Sciences of the United States of America, 100(1):324–9, January 2003a. ISSN 0027-8424. doi: 10.1073/pnas.0136977100. URL http://www.pnas.org/cgi/content/abstract/100/1/324. 12, 13, 18, 22, 27, 51, 55, 56, 76, 77, 90, 91, 92, 94

Reto Wyss, Paul F. M. J. Verschure, and Peter König. Properties of a temporal population code. Reviews in the Neurosciences, 14(1-2):21–33, January 2003b. ISSN 0334-1763. URL http://www.ncbi.nlm.nih.gov/pubmed/12929915. xxiv, 12, 13, 18, 22, 27, 35, 41, 50, 51, 55, 56, 77, 86, 90, 92, 94, 103

Reto Wyss, Peter König, and Paul F. M. J. Verschure. A model of the ventral visual system based on temporal stability and local memory. PLoS Biology, 4(5):e120, May 2006. ISSN 1545-7885. doi: 10.1371/journal.pbio.0040120. URL http://dx.plos.org/10.1371/journal.pbio.0040120. 18, 91, 106

T. Yousef, E. Tóth, M. Rausch, U. T. Eysel, and Z. F. Kisvárday. Topography of orientation centre connections in the primary visual cortex of the cat. Neuroreport, 12(8):1693–9, July 2001. ISSN 0959-4965. URL http://www.ncbi.nlm.nih.gov/pubmed/11409741. 7

Jochen Zeil, Martin I. Hofmann, and Javaan S. Chahl. Catchment areas of panoramic snapshots in outdoor scenes. Journal of the Optical Society of America A, 20(3):450, March 2003. ISSN 1084-7529. doi: 10.1364/JOSAA.20.000450. URL http://josaa.osa.org/abstract.cfm?URI=josaa-20-3-450. 101, 107

S. Zeki and S. Shipp. The functional logic of cortical connections. Nature, 335(6188):311–7, September 1988. ISSN 0028-0836. doi: 10.1038/335311a0. URL http://dx.doi.org/10.1038/335311a0. 6

Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000. ISSN 0162-8828. doi: 10.1109/34.888718. 119

W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys (CSUR), 35(4), 2003. ISSN 0360-0300. URL http://portal.acm.org/citation.cfm?id=954342. 19, 55

Davide Zoccolan, Nadja Oertelt, James J. DiCarlo, and David D. Cox. A rodent model for the study of invariant visual object recognition. Proceedings of the National Academy of Sciences of the United States of America, 106(21):8748–53, May 2009. ISSN 1091-6490. doi: 10.1073/pnas.0811583106. URL http://www.pnas.org/cgi/content/abstract/106/21/8748. 1