Sensory-motor Coordination in Gaze Control

G. de Croon, E.O. Postma, and H.J. van den Herik

IKAT, Universiteit Maastricht, P.O. Box 616, 6200 MD, Maastricht, The Netherlands

Voice: 0031-433883477 Fax: 0031-433884897
e-mail: [email protected], [email protected], [email protected]

http://www.cs.unimaas.nl

Abstract. In the field of artificial intelligence, there is a considerable interest in the notion of sensory-motor coordination as an explanation for intelligent behaviour. However, there has been little research on sensory-motor coordination in tasks that go beyond low-level behavioural tasks. In this paper we show that sensory-motor coordination can also enhance performance on a high-level task: artificial gaze control for gender recognition in natural images. To investigate the advantage of sensory-motor coordination, we compare a non-situated model of gaze control (incapable of sensory-motor coordination) with a situated model of gaze control (capable of sensory-motor coordination). The non-situated model of gaze control shifts the gaze according to a fixed set of locations, optimised by an evolutionary algorithm. The situated model of gaze control determines gaze shifts on the basis of local inputs in a visual scene. An evolutionary algorithm optimises the model's gaze control policy. From the experiments performed, we may conclude that sensory-motor coordination contributes to artificial gaze control for the high-level task of gender recognition in natural images: the situated model outperforms the non-situated model. The mechanism of sensory-motor coordination establishes dependencies between multiple actions and observations that are exploited to optimise categorisation performance.

1 Introduction

In the field of artificial intelligence there is a considerable interest in situated models of intelligence that employ sensory-motor coordination to solve specific tasks [1, 2]. A situated model of intelligence is a model in which motor actions co-determine future sensory inputs. Together, the sensory inputs and the motor actions form a closed loop. Sensory-motor coordination exploits this closed loop in such a way that the performance on a particular task is optimised.

Several studies have investigated the mechanism of sensory-motor coordination [3–6]. For instance, they show that sensory-motor coordination can simplify the execution of tasks, so that the performance is enhanced. However, until now, research on sensory-motor coordination has only examined low-level tasks, e.g., categorising geometrical forms [3–7]. It is unknown to what extent sensory-motor coordination can contribute to high-level tasks.

So, the research question in this subdomain of AI research reads: Can sensory-motor coordination contribute to the performance of situated models on high-level tasks? In this paper we restrict ourselves to the analysis of two models both performing the same task, viz. gaze control for gender recognition in natural images. The motivation for the choice of this task is two-fold: (1) it is a challenging task, to which no situated gaze-control models have been applied before; (2) it enables the comparison of two models that are identical, except for their capability to coordinate sensory inputs and motor actions. Thus, we will compare a situated with a non-situated model of gaze control. If the situated model's performance is better, we focus on a second research question: How does the mechanism of sensory-motor coordination enhance the performance of the situated model on the task? We explicitly state that we are interested in the relative performance of the models and in the cause of a possible difference in performance. It is not our intention to build the gender-recognition system with the best categorisation performance. Our only requirement is that the models perform above chance level (say 60% to 80%), so that a comparison is possible.

The rest of the paper is organised as follows. In Sect. 2 we describe the non-situated and the situated model of gaze control. In Sect. 3 we outline the experiment used to compare the two models of gaze control. In Sect. 4 we show the experimental results and analyse the gaze control policies involved. In Sect. 5 we discuss the relevance of the results. Finally, we draw our conclusions in Sect. 6.

2 Two Models of Gaze Control

Below, we describe the non-situated model of gaze control (Sect. 2.1) and the situated model of gaze control (Sect. 2.2). Then we discuss the adaptable parameters of both models (Sect. 2.3).

2.1 Non-situated Model of Gaze Control

The non-situated model consists of three modules. The first module receives the sensory input and extracts input features, given the current fixation location. The second module consists of a neural network that determines a categorisation based on the extracted input features. The third module controls the gaze shifts. Figure 1 shows an overview of the model. The three modules are illustrated by the dashed boxes, labelled 'I', 'II', and 'III'. The current fixation location is indicated by an 'x' in the face.

Fig. 1. Overview of the non-situated model of gaze control.

Module I receives the raw input from the window with centre x as sensory input. In Fig. 1 the raw input is shown on the left in box I; it contains a part of the face. From that window, input features are extracted (described later). These input features serve as input to module II, a neural network. The input layer of the neural network is illustrated by the box 'input layer'. Subsequently, the neural network calculates the activations of the hidden neurons in the 'hidden layer' and of the output neuron in the 'output layer'. There is one output neuron that indicates the category of the image. The third module determines the next fixation location, where the process is repeated.

Below we describe the three modules of the non-situated model of gaze control in more detail.

Module I: Sensory Input. Here, we focus on the extraction procedure of the input features. For our research, we adopt the set of input features as introduced in [8], but we apply them differently.

An input feature represents the difference in mean light intensity between two areas in the raw input window. These areas are determined by the feature's type and location. Figure 2 shows eight different types of input features (top row) and nine differently sized locations in the raw input window from which the input features can be extracted (middle row, left). The sizes vary from the whole raw input window to a quarter of the raw input window. In total, there are 8 × 9 = 72 different input features. In the figure, two example input features are given (middle row, right). Example feature 'L' is a combination of the first type and the second location, example feature 'R' of the third type and the sixth location. The bottom row of the figure illustrates how an input feature is calculated. We calculate an input feature by subtracting the mean light intensity in the image covered by the grey surface from the mean light intensity in the image covered by the white surface. The result is a real number in the interval [−1, 1]. In the case of example feature L, only the left half of the raw input window is involved in the calculation. The mean light intensity in the raw input window of area 'A' is subtracted from the mean light intensity of area 'B'.
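To make this computation concrete, the following sketch (a minimal illustration in Python; the normalisation of intensities to [0, 1] and the particular masks chosen to mimic example feature 'L' are assumptions, since the paper only specifies the subtraction of mean intensities) computes one input feature as the difference of the two area means:

```python
import numpy as np

def input_feature(window: np.ndarray,
                  white_area: np.ndarray,
                  grey_area: np.ndarray) -> float:
    """Difference in mean light intensity between two areas of the raw
    input window. `window` holds intensities in [0, 1]; the boolean masks
    select the white and grey areas defined by a feature's type and
    location. The result lies in the interval [-1, 1]."""
    return float(window[white_area].mean() - window[grey_area].mean())

# Illustration only: masks roughly mimicking example feature 'L', which
# compares two vertical strips within the left half of a 100 x 100 window.
window = np.random.rand(100, 100)
white = np.zeros((100, 100), dtype=bool)
grey = np.zeros((100, 100), dtype=bool)
white[:, 0:25] = True    # area 'B'
grey[:, 25:50] = True    # area 'A'
print(input_feature(window, white, grey))
```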

Fig. 2. An input feature consists of a type and a location.

Module II: Neural Network. The second module is a neural network that takes the extracted input features as inputs. It is a fully-connected feedforward neural network with h hidden neurons and one output neuron. The hidden and output neurons all have sigmoid activation functions: a(x) = tanh(x), a(x) ∈ (−1, 1). The activation of the output neuron, o1, determines the categorisation (c) as follows.

c = Male   if o1 > 0,
    Female if o1 ≤ 0.                                   (1)
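As an illustration of Module II (a sketch only, not the authors' implementation; the weight shapes and the omission of bias terms are assumptions), the following code maps ten input features through h = 3 tanh hidden neurons to a single tanh output and applies the decision rule of (1):

```python
import numpy as np

def categorise(features, W_hidden, W_out):
    """Feedforward pass with tanh activations, followed by Eq. (1):
    Male if o1 > 0, Female otherwise."""
    hidden = np.tanh(W_hidden @ features)          # h = 3 hidden activations
    o1 = float(np.tanh(W_out @ hidden)[0])         # single output neuron
    return "Male" if o1 > 0 else "Female"

rng = np.random.default_rng(0)
features = rng.uniform(-1, 1, size=10)             # ten input features
W_hidden = rng.uniform(-1, 1, size=(3, 10))        # weights constrained to [-1, 1]
W_out = rng.uniform(-1, 1, size=(1, 3))
print(categorise(features, W_hidden, W_out))
```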

Module III: Fixation Locations. The third module controls the gaze in such a way that for every image the same locations in the image are fixated. It contains coordinates that represent all locations that the non-situated model fixates. The model first shifts its gaze to location (x1, y1) and categorises the image. Then, it fixates the next location, (x2, y2), and again categorises the image. This process continues, so that the model fixates all locations from (x1, y1) to (xT, yT) in sequence, assigning a category to the image at every fixation. The performance is based on these categorisations (see Sect. 3.2). Out of all locations in an image, an evolutionary algorithm selects the T fixation locations. Selecting the fixation locations also implies selecting the order in which they are fixated.
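One way to read Module III (a hypothetical outline; `extract_features` and `categorise` stand in for Modules I and II and are not defined in the paper) is a loop over the evolved, image-independent fixation sequence, categorising at every fixation:

```python
def run_non_situated(image, fixations, extract_features, categorise):
    """Non-situated control: the same evolved fixation sequence
    [(x1, y1), ..., (xT, yT)] is used for every image, so the image
    content never influences where the gaze goes next."""
    categorisations = []
    for (x, y) in fixations:
        features = extract_features(image, x, y)       # Module I
        categorisations.append(categorise(features))   # Module II
    return categorisations  # one categorisation per time step
```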

2.2 Situated Model of Gaze Control

The situated model of gaze control (inspired by the model in [7]) is almost identical to the non-situated model of gaze control. The only difference is that the gaze shifts of the situated model are not determined by a third module, but by the neural network (Fig. 3). Therefore, the situated model has only two modules. Consequently, the current neural network has three output neurons. The first output neuron indicates the categorisation as in (1). The second and the third output neurons determine a gaze shift (∆x, ∆y) as follows.

∆x = ⌊m · o2⌋   (2)

∆y = ⌊m · o3⌋,   (3)

where oi, i ∈ {2, 3}, are the activations of the second and third output neurons. Moreover, m is the maximum number of pixels that the gaze can shift in the x- or y-direction. As a result, ∆x and ∆y are expressed in pixels. If a shift results in a fixation location outside of the image, the fixation location is repositioned to the nearest possible fixation location. In Fig. 3 'x' represents the current fixation location, and 'o' represents the new fixation location as determined by the neural network.
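A sketch of the situated gaze shift of (2)–(3), including the repositioning of out-of-image fixations (clamping to the nearest valid pixel is our reading of 'nearest possible fixation location', and the image dimensions are taken from Sect. 3.1):

```python
import math

def gaze_shift(o2, o3, x, y, width=600, height=800, m=500):
    """Eqs. (2)-(3): dx = floor(m * o2), dy = floor(m * o3). The new
    fixation is clamped to the nearest location inside the image."""
    dx = math.floor(m * o2)
    dy = math.floor(m * o3)
    new_x = min(max(x + dx, 0), width - 1)
    new_y = min(max(y + dy, 0), height - 1)
    return new_x, new_y

# Example: a shift computed from the image centre.
print(gaze_shift(o2=0.3, o3=-0.8, x=300, y=400))   # -> (450, 0)
```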

Fig. 3. Overview of the situated model of gaze control.

2.3 Adaptable Parameters

In subsections 2.1 and 2.2 we described the non-situated and the situated model of gaze control. Four types of parameter values define specific instantiations of both models. We refer to these instantiations as agents. The four types of parameters are: the input features, the scale of the raw input window from which features are extracted, the neural network weights, and, for the non-situated model, the coordinates of all fixation locations. An evolutionary algorithm generates and optimises the agents (i.e., parameter values) by evaluating their performance on the gaze-control task.

3 Experimental Setup

In this section, we describe the gender-recognition task on which we compare the non-situated and the situated model of gaze control (Sect. 3.1). In addition, we discuss the evolutionary algorithm that optimises the models' adaptable parameters (Sect. 3.2). Finally, we mention the experimental settings (Sect. 3.3).

3.1 Gender-Recognition Task

Below, we motivate our choice for the task of gender recognition. Then we describe the data set used for the experiment. Finally, we outline the procedure of training and testing the two types of gaze-control models.

We choose the task of gender recognition in images containing photos of female or male faces, since it is a challenging and well-studied task [9]. There are many differences between male and female faces that can be exploited by gender-recognition algorithms [10, 11]. State-of-the-art algorithms use global features, extracted in a non-situated manner. So far, none of the current algorithms is based on gaze control with a local fixation window.

The data set for the experiment consists of images from J.E. Litton of the Karolinska Institutet in Sweden. It contains 278 images with angry-looking and happy-looking human subjects. These images are converted to gray-scale images and resized to 600 × 800 pixels.

One half of the image set serves as a training set for both the non-situated and the situated model of gaze control. Both models have to determine whether an image contains a photo of a male or a female, based on the input features extracted from the gray-scale images. For the non-situated model, the sequence of T fixation locations is optimised by an evolutionary algorithm. For the situated model, the initial fixation location is defined to be the centre of the image and the subsequent T − 1 fixation locations are determined by the gaze-shift output values of the neural network (outputs o2 and o3). At every fixation, the models have to assign a category to the image. After optimising categorisation on the training set, the remaining half of the image set is used as a test set to determine the performance of the optimised gaze-control models. Both training set and test set consist of 50% males and 50% females.

3.2 Evolutionary Algorithm

As stated in subsection 2.3, an evolutionary algorithm optimises the parameter values that define the non-situated and the situated agents, i.e., instantiations of the non-situated and situated model, respectively. We choose an evolutionary algorithm as our training paradigm, since it allows self-organisation of the closed loop of actions and inputs.

In our experiment, we perform 15 independent 'evolutionary runs' to obtain a reliable estimate of the average performance. Each evolutionary run starts by creating an initial population of M randomly initialised agents. Each agent operates on every image in the training set, and its performance is determined by the following fitness function:

f(a) = tc,I / (IT),   (4)

in which a represents the agent, tc,I is the number of time steps at which the agent correctly classified images from the training set, I is the number of images in the training set, and T is the total number of time steps (fixations) per image. We note that the product IT is a constant that normalises the performance measure. The M/2 agents with the highest performance are selected to form the population of the next generation. Their adaptable parameter sets are mutated with probability Pf for the input feature parameters and Pg for the other parameters, e.g., representing coordinates or network weights. If mutation occurs, a feature parameter is perturbed by adding a random number drawn from the interval [−pf, pf]. For other types of parameters, this interval is [−pg, pg]. For every evolutionary run, the selection and reproduction operations are performed for G generations.
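A rough sketch of one generation of the evolutionary algorithm (assuming an agent object with separate `feature_params` and `other_params` lists, and a `fitness` function implementing Eq. (4); refilling the population with one mutated copy per survivor is an assumption, since the text does not spell out the reproduction scheme):

```python
import copy
import random

def next_generation(population, fitness, Pf=0.02, Pg=0.10, pf=0.5, pg=0.1):
    """Keep the M/2 fittest agents and refill the population with mutated
    copies. Feature parameters mutate with probability Pf by a perturbation
    from [-pf, pf]; other parameters with probability Pg from [-pg, pg]."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:len(population) // 2]
    offspring = []
    for parent in survivors:
        child = copy.deepcopy(parent)
        for i in range(len(child.feature_params)):
            if random.random() < Pf:
                child.feature_params[i] += random.uniform(-pf, pf)
        for i in range(len(child.other_params)):
            if random.random() < Pg:
                child.other_params[i] += random.uniform(-pg, pg)
        offspring.append(child)
    return survivors + offspring   # population of size M again
```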

3.3 Experimental Settings

In our experiment the models use ten input features. Furthermore, the neural networks of both models have 3 hidden neurons, h = 3. All weights of the neural networks are constrained to a fixed interval [−r, r]. Since preliminary experiments showed that evolved weights were often close to 0, we have chosen the weight range to be [−1, 1], r = 1. The scale of the window from which the input features are extracted ranges from 50 to 150 pixels. Preliminary experiments showed that this range of scales is large enough to allow gender recognition, and small enough for local processing, which requires intelligent gaze control. The situated model's maximal gaze shift m is set to 500, so that the model can reach almost all locations in the image in one time step.

For the evolutionary algorithm we have chosen the following parameter settings: M = 30, G = 300, and T = 5. The choice of T turns out not to be critical to the results with respect to the difference in performance of the two models (see Sect. 4.2). The mutation parameters are: Pf = 0.02, Pg = 0.10, pf = 0.5, and pg = 0.1.

4 Results

In this section, we show the performances of both models (Sect. 4.1). Then we analyse the best situated agent to gain insight into the mechanism of sensory-motor coordination (Sect. 4.2).

4.1 Performance

Table 1 shows the mean performances on the test set (and standard deviation) of the best agents of the 15 evolutionary runs. Performance is expressed as the proportion of correct categorisations. The table shows that the mean performance of the best situated agents is 0.15 higher than that of the best non-situated agents. Figure 4 shows the histograms of the best performances obtained in the 15 runs for non-situated agents (white) and for situated agents (gray). Since both distributions of the performances are highly skewed, we applied a bootstrap method [12] to test the statistical significance of the results. It revealed that the difference between the mean performances of the two types of agents is significant (p < 0.05).

Table 1. Mean performance (f) and standard deviation (σ) of the performance on the test set of the best agents of the evolutionary runs.

                f      σ
Non-situated    0.60   0.057
Situated        0.75   0.055

Fig. 4. Histograms of the best fitness of each evolutionary run. White bars are for non-situated agents, gray bars for situated agents.

4.2 Analysis

In this subsection, we analyse the evolved gaze-control policy of the best situated agent of all evolutionary runs. The analysis clarifies how sensory-motor coordination optimises performance on the gender-recognition task.

The first part of the analysis shows that for each category the situated agent controls the gaze in a different way. This evolved behaviour aims at optimising performance by fixating suitable categorisation locations. The second part of the analysis shows that for individual images, too, the situated agent controls the gaze in different ways to fixate suitable categorisation locations.

Gaze Control per Category. Depending on the category, the situated agent fixates different locations. Below, we analyse per category the gaze path of the situated agent when it receives inputs that are typical of that category. The fixations take place at locations that are suitable for categorisation.

To find the suitable categorisation locations per category, we look at the situated agent's categorisation performance on the training set at all positions of a 100 × 100 grid superimposed on the image. At every position we determine the categorisation ratio for both classes. For the category 'male', the categorisation ratio is cm(x, y)/Im, where cm(x, y) is the number of correctly categorised male images at (x, y), and Im is the total number of male images in the training set. The left part of Fig. 5 shows a picture of the categorisation ratios represented as intensity for all locations. The highest intensity represents a categorisation ratio of 1. The left part of Fig. 6 shows the categorisation ratios for images containing females. The figures show that dark areas in Fig. 5 tend to have high intensity in Fig. 6 and vice versa. Hence, there is an obvious trade-off between good categorisation of males and good categorisation of females¹. The presence of a trade-off implies that categorisation of males and females should ideally take place at different locations.
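The categorisation-ratio map for the male category could be computed along the following lines (a sketch with hypothetical names; `categorise_at` would force the agent to fixate a given location and return its categorisation, which the paper does not define explicitly):

```python
import numpy as np

def male_categorisation_ratios(agent, male_images, categorise_at,
                               grid=(100, 100), size=(600, 800)):
    """Ratio c_m(x, y) / I_m of correctly categorised male training images
    for every position of a 100 x 100 grid superimposed on the image."""
    rows, cols = grid
    width, height = size
    ratios = np.zeros(grid)
    for i in range(rows):
        for j in range(cols):
            x = int((i + 0.5) * width / rows)     # grid cell centre
            y = int((j + 0.5) * height / cols)
            correct = sum(categorise_at(agent, img, x, y) == "Male"
                          for img in male_images)
            ratios[i, j] = correct / len(male_images)
    return ratios
```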

Fig. 5. Categorisation ratios of male images in the training set.

If we zoom into the area in which the agent fixates, we can see that it always moves its fixation location to an area in which it is better at categorising the presumed category. The right part of Fig. 5 zooms in on the categorisation ratios and shows the gaze path that results when the agent receives average male inputs at all fixation locations. The first fixation location is indicated by an 'o'-sign, the last fixation location by an arrow. Intermediate fixations are represented with the 'x'-sign.

¹ Note that the images are not inverted copies: in locations where male and female inputs are very different, good categorisation for both classes can be achieved.

Fig. 6. Categorisation ratios of female images in the training set.

The black lines in Fig. 5 connect the fixation locations. The agent moves from a region with categorisation ratio 0.8 to a region with categorisation ratio 0.9. The right part of Fig. 6 shows the same information for images containing females, revealing a movement from a region with a categorisation ratio of 0.76 through a region with a ratio of 0.98. Both figures show that the situated agent takes misclassifications into account: it avoids areas in which the categorisation ratios for the other category are too low. For example, if we look at the right part of Fig. 5, we see that the agent fixates locations to the bottom left of the starting fixation, while the categorisation ratios are even higher to the bottom right. The reason for this behaviour is that in that area, the categorisation ratios for female images are very low (Fig. 6).

Non-situated agents cannot exploit the trade-off in categorisation ratios. They cannot select fixation locations depending on a presumed category, since the fixation locations are determined in advance for all images.

Gaze Control per Specific Image. The sensory-motor coordination of the situated agents goes further than selecting sensory inputs depending on the presumed category. The categorisation ratios do not explain the complete performance of situated agents. In this section we demonstrate that in specific images, situated agents often deviate from the exemplary gaze paths shown in Figs. 5 and 6 to search for facial properties that enhance categorisation performance.

To see that the categorisation ratios do not explain the complete performance of situated agents, we compare the actual performance of situated agents over time with a predicted performance over time that is based on the categorisation ratios. For the prediction we assume that the categorisation ratios are conditional categorisation probabilities: the categorisation ratio cm(x, y)/Im approximates Px,y(c = M | M), and cf(x, y)/If approximates Px,y(c = F | F). In addition, we assume that the conditional probabilities at different locations are independent of each other. We determine the predicted performance of a situated agent by tracking its fixation locations over time for all images and averaging over the conditional categorisation probabilities at those locations. Figure 7 shows both the actual performance (solid lines) and the predicted performance (dotted lines) over time, averaged over all situated agents (squares) and for the best situated agent in particular (circles).
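Under this independence assumption, the predicted performance can be obtained by averaging the conditional categorisation probabilities along each image's fixation path, roughly as follows (a sketch; `ratio_map_male` and `ratio_map_female` would be grids of categorisation ratios and `paths` the recorded grid positions per image, all hypothetical names):

```python
import numpy as np

def predicted_performance(paths, labels, ratio_map_male, ratio_map_female):
    """Average of P_{x,y}(c = M | M) or P_{x,y}(c = F | F) over all fixated
    grid positions of all images, assuming independence between locations."""
    probs = []
    for path, label in zip(paths, labels):       # one fixation path per image
        ratio_map = ratio_map_male if label == "Male" else ratio_map_female
        for (i, j) in path:                      # grid indices of each fixation
            probs.append(ratio_map[i, j])
    return float(np.mean(probs))
```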

For the last three time steps the actual performances of the situated agents are higher than the predicted performances. The cause of this discrepancy is that the predicted performance is based on the assumption that the conditional categorisation probabilities at different positions are independent of each other. This assumption can be violated, for example, in the case of two adjacent locations.

Fig. 7. Actual performance (solid lines) and predicted performance (dotted lines) over time, averaged over all agents (squares) and for the best situated agent in particular (circles).

The situated agent exploits the dependencies by using input features to shift the gaze to fixation locations that are well suited for the task of gender recognition. For example, the best situated agent bases its categorisation partly on the eyebrows of a person. If the eyebrows of a male are lifted higher than usual, the agent occasionally fixates a location to the right of and above the starting fixation. This area is generally not good for male categorisation (cm(x, y)/Im = 0.57, see Fig. 5), since eyebrows in our training set are usually not lifted. However, for some specific images it is a good area, because it contains a (part of a) lifted eyebrow.

The gaze-control policy of the situated agents results in the optimisation of (actual) performance over time. Figure 7 shows that the actual performance increases after t = 1. The fact that performance generally increases over time reveals that sensory-motor coordination establishes dependencies between multiple actions and observations that are exploited to optimise categorisation performance. As mentioned in Sect. 3.3, other settings of T (T > 1) lead to similar results. Finally, we remark that for large T (T > 20), the performance of the non-situated model deteriorates due to the increased search space of fixation locations.

5 Discussion

We expect our results to generalise to other image classification tasks. Further analysis or empirical verification is necessary to confirm this expectation. Our results may be relevant to two research areas.

First, the results may be relevant to the research area of computer vision. Most research on computer vision focuses on improving pre-processing (i.e., finding appropriate features) and on classification (i.e., mapping the features to an appropriate class) [13]. However, a few studies focus on a situated model (or 'closed-loop model') [14, 15]. Our study extends the application of a situated model using a local input window to the high-level task of gender recognition.

Second, the results are related to research on human gaze control. Of course there is an enormous difference between the sensory-motor apparatus and neural apparatus of the situated model and that of a real human subject. Nonetheless, there might be parallels between the gaze-control policies of the situated model and those of human subjects. There are a few other studies that focus explicitly on the use of situated computational models in gaze control [16, 17], but they also rely on simplified visual environments. Our model may contribute to a better understanding of gaze control in realistic visual environments.

6 Conclusion

We may draw two conclusions as an answer to the research questions posed in the introduction. First, we conclude that sensory-motor coordination contributes to the performance of situated models on the high-level task of artificial gaze control for gender recognition in natural images. Second, we conclude that the mechanism of sensory-motor coordination optimises categorisation performance by establishing useful dependencies between multiple actions and observations; situated agents search for adequate categorisation areas in the image by determining fixation locations that depend on the presumed image category and on specific image properties.

References

1. Pfeifer, R., Scheier, C.: Understanding Intelligence. MIT Press, Cambridge, MA (1999)
2. O'Regan, J.K., Noe, A.: A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences 24:5 (2001) 883–917
3. Nolfi, S.: Power and the limits of reactive agents. Neurocomputing 42 (2002) 119–145
4. Nolfi, S., Marocco, D.: Evolving robots able to visually discriminate between objects with different size. International Journal of Robotics and Automation 17:4 (2002) 163–170
5. Beer, R.D.: The dynamics of active categorical perception in an evolved model agent. Adaptive Behavior 11:4 (2003) 209–243
6. van Dartel, M.F., Sprinkhuizen-Kuyper, I.G., Postma, E.O., van den Herik, H.J.: Reactive agents and perceptual ambiguity. Adaptive Behavior (in press)
7. Floreano, D., Kato, T., Marocco, D., Sauser, E.: Coevolution of active vision and feature selection. Biological Cybernetics 90:3 (2004) 218–228
8. Viola, P., Jones, M.J.: Robust real-time object detection. Cambridge Research Laboratory, Technical Report Series (2001)
9. Bruce, V., Young, A.: In the Eye of the Beholder. Oxford University Press (2000)
10. Moghaddam, B., Yang, M.H.: Learning gender with support faces. IEEE Transactions on Pattern Analysis and Machine Intelligence 24:5 (2002) 707–711
11. Calder, A.J., Burton, A.M., Miller, P., Young, A.W., Akamatsu, S.: A principal component analysis of facial expressions. Vision Research 41:9 (2001) 1179–1208
12. Cohen, P.: Empirical Methods for Artificial Intelligence. MIT Press, Cambridge, MA (1995)
13. Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall, New Jersey (2003)
14. Koppen, M., Nickolay, B.: Design of image exploring agent using genetic programming. In: Proc. IIZUKA'96, Iizuka, Japan (1996) 549–552
15. Peng, J., Bhanu, B.: Closed-loop object recognition using reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 20:2 (1998) 139–154
16. Schlesinger, M., Parisi, D.: The agent-based approach: A new direction for computational models of development. Developmental Review 21 (2001) 121–146
17. Sprague, N., Ballard, D.: Eye movements for reward maximization. In: Thrun, S., Saul, L., Scholkopf, B., eds.: Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA (2004)