Navigation with Memory in a Partially Observable Environment


A. Montesanto, G. Tascini, P. Puliti, P. Baldassarri

Dipartimento di Elettronica, Intelligenza Artificiale e Telecomunicazioni, Università Politecnica delle Marche, Ancona, Italy
E-mail: [email protected]

Abstract
We describe a hierarchical layered architecture for robot navigation that uses an unsupervised learning approach based on reinforcement learning. The aim of the agent is to navigate from a known or unknown zone to a region known through a previous localization procedure. During its movement the agent must avoid contact with the obstacles. The agent is endowed with two kinds of sensors: visual sensors and contact sensors. It uses both kinds when it moves along a rectilinear path, while it uses the contact sensors only when the path is curvilinear. The environment is therefore considered partially observable, and the analysed system can be modelled as a Partially Observable Markov Decision Process (POMDP) with a continuous state space. The problem is solved by modifying the Q-learning method. When necessary, the agent is given a memory of the previous observations and actions, providing it with an evaluation of the otherwise unknown states.

Keywords: POMDP, Self-localization, Navigation, Q-learning.

1. Introduction
The autonomous navigation of mobile robots has recently been the subject of intense research. The capability of the robot to navigate successfully in an environment is

strictly related to the quality of its perceptions, which in turn depends on the richness and reliability of its sensors and of the effectors that control the movement of the robot. Robotic navigation based on visual servoing requires understanding how to process the visual input in order to associate an action with a particular visual datum. From a systemic point of view, according to Varela [20], the sensorial and motor processes, which underlie perception and action respectively, evolve together; they are therefore considered inseparable and jointly contribute to the growth of the whole cognitive system. In this paper we analyse a problem related to the navigation of an agent in an environment. In particular, the agent must go from a point of departure that can be unknown to a point of arrival that should be recognizable through localization. Thus the agent goes from a known region (point of breaking) to a different, determined known region, or equally from an unknown region to any known region. During all its movements the agent must avoid knocking against the obstacles.
In this work we consider a partially observable environment, so we have to solve a problem known in robotics applications as a Partially Observable Markov Decision Process (POMDP) [4,8,9,12,13]. In this situation the state is not fully observable and, as a consequence, the current state of the system is not always known, so it is necessary to introduce a model of observations. The procedure for the Markov Decision Process [8] can be adapted to partially observable environments by evaluating the current state of the robot and applying a policy that maps belief states into actions. Several reinforcement learning (RL) methods have been proposed in the literature to solve Partially Observable Markov Decision Processes

[5,16,18]. In RL [7], the agent receives as input a scalar index, called the reinforcement signal or reward, which is an evaluation of the completed action. The behaviour of the agent is the result of a sequence of steps; at each step the agent receives an input that represents the state of the system. In the present work we consider an approach that solves the POMDP problem using the RL method and endowing the agent with a short-term memory. In this way we intend to avoid the limitations of traditional RL, which cannot solve the problem of hidden states [11]. Hidden states arise when task-relevant aspects of the world state are not observable at every moment but must be inferred by also considering past observations [3].
We realized an architecture developed into two hierarchical levels [15,19]. The lower level contains the set of possible behaviours that the agent can perform in each particular state, while the upper level coordinates the behaviours of the lower level, establishing which kind of action must be executed in the current state. Moreover, the visual observations obtained by two cameras mounted on the body of the agent are first processed by a set of Self-Organizing Maps (SOM) of Kohonen [10]. In particular, we adopt a variant of the SOM known as Learning Vector Quantization (LVQ2), which has been particularly successful in various pattern classification problems such as the one addressed in this work: the classification of the input signal into a finite number of classes, each identified by several nodes of the network.

1.1 Partially Observable Markov Decision Process (POMDP)
In the real world the environment can be partially observable (PO) because of a lack of information. We suppose that the environment is in state s(t) at time step t. The agent that performs action a(t) does not know the

current environment state, but it receives an observation o(t), which is a function of s(t). Moreover, the agent receives special information known as the reward r(t), which associates a scalar value with an action executed in a particular state. The behaviour of the agent is therefore controlled by three important functions: the state transition function (S), the observation function (O) and the reward function (R). The goal of the agent is to learn a policy that maps observations into actions so as to maximize the value of the policy.
In this kind of process one has to consider the problem of indistinguishable states, which arises when different states can generate the same observation. In such a situation the agent cannot identify the current state from the current observation alone. In order to solve this problem we consider a modified version of the Q-learning approach [21,22,23], endowing the agent with a memory. The ambiguity about the current state can be overcome by building a state estimator that takes the previous observations and actions and the current observation as input and produces an evaluation of the current state as output. The agent thus has a memory of the previous observations and actions, renouncing a purely reactive approach.
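As an illustration of this idea, the following minimal sketch (not the authors' implementation) shows how a short observation-action history can be attached to the current observation, so that states producing identical observations map to different keys; the history length and the tuple-based key are illustrative assumptions.

    from collections import deque

    class MemoryStateEstimator:
        def __init__(self, history_len=2):
            # Short-term memory: the last few (observation, action) pairs.
            self.history = deque(maxlen=history_len)

        def estimate(self, observation):
            # The "estimated state" is the current observation together with the
            # remembered past; two ambiguous observations are told apart whenever
            # their recent histories differ.
            return (observation, tuple(self.history))

        def update(self, observation, action):
            # Called after the agent acts, so the next estimate can use this step.
            self.history.append((observation, action))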

1.2 Reinforcement Learning
When we talk about learning we consider, for example, the acquisition of knowledge through experience [6]. In this work we use a particular kind of learning known as Reinforcement Learning (RL). The main characteristic of this learning is its capability to acquire knowledge through the advice or criticism of a supervisor. The task of the supervisor is to provide a scalar reward or punishment based on the agent's behaviour. In this way the supervisor provides some measure of whether an aim is being achieved; it does not indicate how to

achieve the task. In RL an agent sequentially senses and acts upon its environment over time while receiving a reward. The goal is to learn a behaviour, also called an action policy, that maximises some function of the reward. This behaviour is typically expressed as a mapping from environment states to actions. As previously discussed, the environment state is not always observable at any moment, but has to be inferred by also considering past observations. Such problems require agents with some memory for storing past observations in order to optimise performance.
In this paper we use the Q-learning method [2], which is one of the most popular RL methods. The basic idea behind Q-learning is that the learning algorithm learns the optimal evaluation function over the entire state-action space S×A. The Q function provides a mapping Q: S×A → V, where V is the expected future reward of performing action a in state s. The following equation defines the update of the expected future reward Q(s,a):

Q_{t+1}(s,a) = Q_t(s,a) + α [ R + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s,a) ]        (1)

where R is the reward or punishment resulting from the action a taken in state s, γ is the discount factor and α is the learning rate, which controls the convergence.
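A minimal tabular sketch of update (1), assuming a dictionary-based Q-table and a hypothetical action set; the values alpha = 0.45 and gamma = 0.8 are those reported later in the experiments.

    from collections import defaultdict

    Q = defaultdict(float)                             # Q[(state, action)] -> expected future reward
    ACTIONS = ["forward", "turn_left", "turn_right"]   # hypothetical action set

    def q_update(s, a, reward, s_next, alpha=0.45, gamma=0.8):
        # Equation (1): move Q(s,a) towards reward + gamma * best Q-value of the next state.
        best_next = max(Q[(s_next, a_next)] for a_next in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])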

2. The system architecture
The actual system simulates the navigation of a mobile agent endowed with sensors and actuators. There are two visual sensors, in particular two cameras mounted on the body of the robot, and four contact sensors that detect contact with the wall. The wall represents the only obstacle in the environment, delimiting the path available to the agent. As we can see in figure 1, the implemented

system is subdivided into modules. The core of the navigation system has been developed into two hierarchical levels, the upper and the lower level, described in detail later. The navigation process is preceded by a phase of self-localization, since a key capability of a mobile agent system is its ability to localize itself.

Fig. 1 The system architecture

The first part is dedicated to supervised learning of some paths in the environment and of the information necessary for a raw localization [14]. The learning of the possible paths consists in the memorization of particular visual information associated with the paths.


First, the original stereo images obtained by the two cameras, with an initial resolution of 100×100 pixels, are reduced to a 10×10 pixel bitmap through a bilinear filter, performing an integrated scaling. During the pre-processing each RGB Splitting block (figure 1) separates the pixels of each image into its RGB (Red, Green and Blue) components. Thus, for each image we obtain three images (three colour planes) with the same resolution as the original image, but containing the information of one colour component only. Each component has an integer value between 0 and 255.
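A hedged sketch of this pre-processing step, using the Pillow library as an assumed tool (the paper does not specify the implementation):

    from PIL import Image

    def preprocess(path):
        img = Image.open(path).convert("RGB")          # original 100x100 camera frame
        small = img.resize((10, 10), Image.BILINEAR)   # bilinear reduction to 10x10
        r, g, b = small.split()                        # three single-colour 10x10 planes
        return [list(plane.getdata()) for plane in (r, g, b)]   # integer values in 0..255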

2.1 The Processing Image Module
The output of the RGB Splitting modules is the input of the block called Processing Image Module (PIM), described in detail in figure 2.1 and figure 2.2. The first block, called “Region Classifier”, is a neural system consisting of eight Learning Vector Quantization (LVQ2) networks, as proposed by Kohonen et al. [10]. The main feature of the LVQ2 algorithm is that two reference vectors of the network are modified at the same time. This happens when the nearest class to an input vector (according to a Euclidean distance) is incorrect and the second-nearest class is correct. Consequently the reference vectors are modified when a given training vector satisfies the following three conditions: the nearest class to the given vector must be incorrect, the next-nearest class must be correct, and the training vector must fall inside a small symmetric window defined around the midpoint between the incorrect and the correct reference vectors.
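The following sketch illustrates one LVQ2 update step under these three conditions; the learning rate, the window width w and the window test used here are conventional choices, not values taken from the paper.

    import numpy as np

    def lvq2_step(x, label, protos, proto_labels, lr=0.05, w=0.3):
        d = np.linalg.norm(protos - x, axis=1) + 1e-12   # distances to all reference vectors
        i, j = np.argsort(d)[:2]                          # nearest and second-nearest prototypes
        in_window = min(d[i] / d[j], d[j] / d[i]) > (1 - w) / (1 + w)
        # Update only if the nearest class is wrong, the runner-up is right and
        # x falls inside the symmetric window around the midpoint of the two.
        if proto_labels[i] != label and proto_labels[j] == label and in_window:
            protos[i] -= lr * (x - protos[i])   # push the incorrect reference vector away
            protos[j] += lr * (x - protos[j])   # pull the correct reference vector closer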


Figure 2.1 The Region Classifier of the Processing Image Module (PIM)

As shown in figure 2.1, the module is organized into three layers. The first layer consists of six structurally identical LVQ2 networks: three for each stereo image, one for each RGB component. It spans the space of possible vectors, classifying the images according to their levels of Red, Green and Blue. The output of the first layer consists of the coordinates of the six winner nodes, one for each network of this layer. The intermediate layer processes the two stereo images and the three colour levels simultaneously in order to obtain the class of the particular point of view; its output is the coordinates of the winner node in the network of this layer. This layer performs a second categorization, guaranteeing good immunity to noise. The last layer outputs the label of the “class of region” from which the pair of stereo images was extracted.
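A hedged sketch of how the three layers could be chained; the find_winner method and the region_labels lookup table are assumed interfaces, not the authors' API.

    def classify_region(planes_left, planes_right, first_layer, second_layer, region_labels):
        # planes_left / planes_right: the three 10x10 colour planes of each stereo image.
        # first_layer: six trained LVQ2 networks; second_layer: the intermediate LVQ2 network.
        winners = []
        for net, plane in zip(first_layer, planes_left + planes_right):
            winners.extend(net.find_winner(plane))       # (x, y) coordinates of the winner node
        view_winner = second_layer.find_winner(winners)  # second categorization of the stereo pair
        return region_labels[view_winner]                # third layer: label of the class of region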


Figure 2.2 The View Classifier of the Processing Image Module (PIM)

Since the processing of the visual input could lead to an excess of generalization, we introduced a new module, shown in figure 2.2 and called “View Classifier”, which adds information to the raw localization. Within a class of region, using the stereo images we can compute the distances profile. This is a feature vector classified by an unsupervised Kohonen VQ. The module learns the classification off-line and gives as output the “class of view”. These classes depend both on the direction of the agent and on its distance from the wall, even within the range of images belonging to the same class of region. In order to help the agent to understand which behaviour should be activated, each class of view can be associated with a particular action.

2.2 The System of Navigation
During the navigation the agent must change its position while avoiding contact with the wall. To do this the agent can use not only the information acquired during the self-localization but also that derived from the contact sensors. Whenever the agent goes along a rectilinear path, like a corridor, it navigates “at sight”, receiving information about the environment both from the visual sensors, through the “classes of view”, and from the contact sensors, through the “classes of contact”. We speak of “classes of contact” because we cluster the different contacts against the walls: the collisions can differ, for example the agent can bump into one wall, into two walls at the same time, or not bump at all.
When the path is curvilinear, for instance when it turns to the right or to the left, the agent navigates “blindly”. In this situation it receives information both from the

“classes of contact” and from a priori knowledge of the corresponding trajectory.
The part of the system that represents the core of the navigation has been developed into two hierarchical levels, which communicate with each other. The lower level of the architecture includes the set of possible behaviours of the agent, while the upper level coordinates the actions of the agent in order to produce an overall behaviour. Here, behaviours or policies are the high-level actions that the upper level can activate. In addition, the upper level can deactivate the current behaviour, providing a reward value as an evaluation of the effect of that behaviour. For example, the current behaviour receives a negative reward if the agent goes against the wall; otherwise the reward is positive. The behaviours derive from a servoing based on the visual input and on the contact input.
The behaviours are based on the Q-learning method previously described, which makes use of the definition of a Q function. This function can be implemented through a table, called the Q-table, that memorizes the information coming from the sensors. The function Q then associates a Q value with each possible state-action combination.
In order to solve the problem of partially observable environments, the agent is endowed with a memory for the past observations and actions. In particular, the system maintains a scalar value whenever a behaviour is activated. This value is updated during the simulation and, when necessary, it represents the memory of the past observations and actions. The value is a real scalar number, initialised to zero at the beginning of each behaviour and called the “turned angle”. When the agent turns around its rotational axis, the turned angle is updated with the rotation angle due to the action. In order to introduce this value into the Q-table, we subdivided the 360-degree angle into an equal number of arcs: the “turned angle” belongs to a given arc and, as a consequence, to a “turned-angle class”.
Each combination of “class of view”, “class of contact” and “class of turned-angle” identifies a vector in the Q-table. This vector corresponds to the possible actions that can be executed in the active behaviour. The selected action modifies the state of the actuators and updates the value and the corresponding “class of turned angle”.
The principal disadvantage of the Q-learning approach is that the Q function converges slowly. In order to improve the learning process, we modified the Q-learning method: in particular, whenever a new action is selected we simultaneously update several cells of the Q-table, not only one cell, a procedure already used in Dyna [17].
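A small sketch of how the turned angle could be discretised and combined with the other two classes into a Q-table key; the number of arcs is an illustrative assumption.

    N_ARCS = 8   # illustrative number of equal arcs subdividing the 360-degree angle

    def turned_angle_class(turned_angle_deg):
        # Map the accumulated turned angle onto one of the N_ARCS arcs.
        return int((turned_angle_deg % 360.0) // (360.0 / N_ARCS))

    def q_table_key(class_of_view, class_of_contact, turned_angle_deg):
        # Each (view, contact, turned-angle class) triple indexes one vector of
        # Q-values in the Q-table, one entry per action of the active behaviour.
        return (class_of_view, class_of_contact, turned_angle_class(turned_angle_deg))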

3. Experimental set-up
In order to evaluate the performance of the realised architecture, we considered a complex indoor environment consisting of corridors, curves to the right or to the left, and crossings of three or four ways. The walls delimit the movement of the agent.
The agent simulates the movement of a robot endowed with actuators and sensors controlled by a computer mounted on the body of the robot.
The configuration of the agent is a differential drive (like a tank), which allows the agent to turn with a small radius of curvature [1]. The agent controls its movement by adjusting, independently, the rotation speed of the two motors (right and left). The two motors determine the rotation speed of the two wheels.
The information about the surrounding environment derives from two different kinds of sensors: the visual sensors and the contact sensors.
The visual sensors constitute the main source of knowledge about the environment; we use two cameras mounted on the body of the robot as visual sensors, arranged very

close to each other in order to provide stereoscopic vision; in this way they can detect depth characteristics of the acquired images using the parallax concept.
The tactile sensors detect physical contact with an object, in the present case the wall. We use four contact sensors, mounted on the external side of the robot and arranged in such a way that the agent can detect any contact against the walls. Each sensor gives a binary input: the input is equal to 1 if the sensor is in contact with one or more walls, otherwise it is equal to 0. The responses of the four sensors are synthesized into a single value, ranging from 5 to 10, which determines the state of the sensors.
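The exact synthesis of the four readings into a value between 5 and 10 is not detailed; the sketch below only illustrates the grouping into the “classes of contact” introduced in section 2.2 (no contact, one wall, two walls), under the assumption that the grouping follows the number of active sensors.

    def class_of_contact(sensor_bits):
        # sensor_bits: the four binary contact readings (0 or 1 each).
        hits = sum(sensor_bits)
        if hits == 0:
            return "no_contact"
        return "one_wall" if hits == 1 else "two_or_more_walls"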

The indoor environment we considered is shown in figure 3 and can be divided into regions that are identical except for a translation or a rotation. The set of possible behaviours that the upper level can activate depends on the characteristics of the class of region the agent has entered.

Fig. 3 Representation of the environment in which the agent navigates

In general the environment can be divided into four kinds of region:

- Corridor (labels 2,4,6,8,10,12,13,15,17,18): in this case there is only one behaviour, “cross corridor”, because the objective is to go along the corridor.

- Simple curve (labels 1,3,9,11,16): there are three possible behaviours. “Turn to the right” is activated when the agent arrives at a simple curve and must turn right; this behaviour receives a positive reward if the agent reaches the correct region and a negative reward if it hits the walls. The second behaviour, “turn to the left”, is analogous. The last one, “go away from the curve”, is activated whenever the agent hits the wall.

- Crossing of three ways (labels 7,14): in this situation there are 7 possible behaviours. We have to consider each way of entry and, for each of them, evaluate two possible ways of exit. Moreover, it is necessary to consider a “go away from” behaviour, as in the previous situation.

- Crossing of four ways (label 5): a case very similar to the previous one. The total number of possible behaviours is 4; in this case it is not necessary to consider the single entries independently.

The total number of possible behaviours is 15; they are memorized in the lower level and coordinated by the upper level.
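To summarise the list above, here is a sketch of the behaviour repertoire per region class as it could be stored in the lower level; the names used for the crossing behaviours are hypothetical placeholders.

    BEHAVIOURS = {
        "corridor":        ["cross corridor"],
        "simple curve":    ["turn to the right", "turn to the left", "go away from the curve"],
        # Three-way crossing: two possible exits for each of the three entries,
        # plus the recovery behaviour "go away from" (7 behaviours in total).
        "three-way cross": ["entry %d, exit %d" % (i, j) for i in range(3) for j in range(2)]
                           + ["go away from the cross"],
        "four-way cross":  ["cross behaviour %d" % k for k in range(4)],
    }

    assert sum(len(v) for v in BEHAVIOURS.values()) == 15   # matches the total reported above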

4. Experimental results
The behaviour of the implemented model has been tested by means of simulations. In all the experiments we chose, during learning, the parameters that gave the best performance of the implemented system. In the following, some results obtained for some of the activated behaviours are presented.
During the learning we chose fixed values for the parameters α and γ used for updating the Q-table: the learning rate α is equal to 0.45, while the discount factor γ is equal to 0.8. Moreover, the learning phase lasts 10000 epochs, while the learning factor has a variable value e^(-n), where n refers to the current iteration number. During the learning the movement of the agent is initially random, until the Q-table has a rather high number of non-zero values. During the exploration, the behaviour most often activated is “cross corridor”, since the corridor is the most frequent region. The behaviour “cross corridor” obtains a reward value whenever it is deactivated.

Fig.4 Learning of the behaviour "cross corridor".

Figure 4 shows the trend of this reward during the learning phase. The negative rewards are initially predominant (as shown by the first 20 activations), while later the rewards are mainly positive, except for some occasional negative values. In the last part of figure 4 the rewards are positive and high.

Fig.5 Number of contacts during the learning of the behaviour “cross-corridor”

Figure 5 shows the number of collisions against the wall during the activation of the behaviour “cross corridor” in the learning phase. Whenever the agent knocks against the wall, the total reward can be positive or negative: the negative reward obtained after the collision can be compensated by the positive rewards due to moving away from the wall. Even if the total reward is positive after a contact, the agent is trained to avoid the collision the next time. The reward is zero if there are no collisions against the wall.
During testing, in order to verify the performance of the developed architecture, the agent starts from a random position and then navigates following the previously learned rules.

Fig.6 Testing of the behaviour “cross corridor”

Figure 6 shows the results of the testing, in particular for the activation of the behaviour “cross corridor”. In this phase the results are rather encouraging, since the rewards are mainly positive and high.

Fig.7 Number of contacts during the testing of the behaviour “cross-corridor”

Figure 7 illustrates the contacts during the behaviour “cross corridor” in the testing phase. We can see that the agent hits the wall, but the contacts are isolated.

Fig.8 Learning of the behaviour "curve to the right".

Figure 8 shows the progress of the reward value during the activation of the behaviour “curve to the right”. In this case the learning initially shows, for a long time, negative reward values, unlike the behaviour previously analysed (figure 4). In fact, during the activation of the behaviour “cross corridor” the agent receives information about the surrounding environment both from the visual sensors and from the contact sensors, whereas when it curves to the right or to the left it navigates “blindly”, receiving information only from the contacts against the wall, i.e. only from the contact sensors. In the last part of figure 8 the tendency towards a positive reward is evident. After an initial phase in which the agent seems to have no collisions against the wall, there is a rather long sequence of negative rewards (between 90 and 120); this series is necessary to stabilise the behaviour “curve to the right” at high values.

Conclusions
This work realizes a system for the navigation of a mobile agent in a partially observable environment. The agent is endowed with visual and contact sensors, which provide a good description of the scene. The stereo images acquired by the two cameras allow a primary raw localization of the agent in the environment. The first part of the realized architecture consists of a module that processes the images, providing the information necessary for the self-localization.
The part related to the actual navigation is developed into two hierarchical levels (the lower and the upper level), which interact with each other. The upper level chooses the high-level action to be executed; the lower level performs this action. In addition, the upper level deactivates the high-level action and provides an evaluation of the achieved situation. This value is the reward signal for the learning.
The ambiguity about the state due to the intrinsic problems of partial observability has been solved by using, when necessary, a memory for the past actions and observations. In this way we abandoned a purely reactive approach.

The experimental results showed that, using reinforcement learning with the 15 behaviours necessary for navigation in the represented environment, we could obtain interesting successes. They also showed that the learning differs depending on the behaviour: behaviours driven by both the visual and the contact sensors are learned more quickly than the other ones. The encouraging results indicate that the realized architecture allows the navigation of an agent in a partially observable environment.

Bibliography

[1] I.J. Cox and G.T. Wilfong, “Autonomous Robot Vehicles”, Springer-Verlag, New York, 1990.

[2] C. Gaskett, L. Fletcher and A. Zelinsky, “Reinforcement learning for a vision based mobile robot”, Robotic Systems Laboratory, Department of Systems Engineering, RSISE, Canberra, Australia.

[3] M.R. Glickman and K. Sycara, “Evolutionary Search, Stochastic Policies with Memory, and Reinforcement Learning with Hidden State”, Proceedings of the Eighteenth International Conference on Machine Learning, 2001.

[4] E.A. Hansen, “Solving POMDPs by Searching in Policy Space”, Proceedings of the 14th International Conference on Uncertainty in Artificial Intelligence, Madison, Wisconsin, July 1998.

[5] T. Jaakkola, S.P. Singh and M.I. Jordan, “Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems”, in G. Tesauro, D.S. Touretzky and T.K. Leen (eds.), Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, MA, 1995, pp. 345-352.

[6] L.P. Kaelbling, “Foundations of Learning in Autonomous Agents”, Robotics and Autonomous Systems, Vol. 8, 1991, pp. 131-144.

[7] L.P. Kaelbling, M.L. Littman and A.W. Moore, “Reinforcement Learning: A Survey”, Journal of Artificial Intelligence Research, Vol. 4, 1996.

[8] L.P. Kaelbling, M.L. Littman and A.R. Cassandra, “Planning and Acting in Partially Observable Stochastic Domains”, Artificial Intelligence, Vol. 101, 1998.

[9] M. Kearns, Y. Mansour and A.Y. Ng, “Approximate Planning in Large POMDPs via Reusable Trajectories”, NIPS*99.

[10] T. Kohonen, “The Self-Organizing Map”, Proceedings of the IEEE, Vol. 78, No. 9, September 1990.

[11] L.J. Lin and T.M. Mitchell, “Reinforcement Learning with Hidden States”, Proceedings of the Second International Conference on Simulation of Adaptive Behavior: From Animals to Animats.

[12] M.L. Littman, A.R. Cassandra and L.P. Kaelbling, “Efficient Dynamic-Programming Updates in Partially Observable Markov Decision Processes”, Computer Science Technical Report CS-95-19, Brown University, 1996.

[13] M.L. Littman, A.R. Cassandra and L.P. Kaelbling, “Learning Policies for Partially Observable Environments: Scaling Up”, Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, 1995, pp. 362-370.

[14] P. Maes and R.A. Brooks, “Learning to Coordinate Behaviors”, Proceedings of the Eighth National Conference on Artificial Intelligence, San Francisco, 1990, pp. 796-802.

[15] S. Mahadevan and J. Connell, “Automatic Programming of Behavior-Based Robots Using the Subsumption Architecture”, Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, 1991.

[16] S.P. Singh, T. Jaakkola and M.I. Jordan, “Learning Without State-Estimation in Partially Observable Markovian Decision Processes”, Machine Learning: Proceedings of the Eleventh International Conference, 1994.

[17] R.S. Sutton, “Integrated Architectures for Learning, Planning and Reacting Based on Approximating Dynamic Programming”, Proceedings of the Seventh International Conference on Machine Learning, Austin, TX, Morgan Kaufmann, 1990.

[18] R.S. Sutton, A.G. Barto and R.J. Williams, “Reinforcement Learning is Direct Adaptive Optimal Control”, Proceedings of the American Control Conference, Boston, 1991, pp. 2143-2146.

[19] G. Theocharous, K. Rohanimanesh and S. Mahadevan, “Learning Hierarchical Partially Observable Markov Decision Process Models for Robot Navigation”, Department of Computer Science and Engineering, Michigan State University, East Lansing.

[20] F.J. Varela, E. Thompson and E. Rosch, “The Embodied Mind: Cognitive Science and Human Experience”, MIT Press, Cambridge, MA, 1991.

[21] C.J.C.H. Watkins and P. Dayan, “Q-learning”, Machine Learning, Vol. 8, 1992, pp. 279-292.

[22] S.D. Whitehead, “Complexity and Cooperation in Q-learning”, Proceedings of the Eighth International Workshop on Machine Learning, Evanston, 1991.

[23] M. Wiering and J. Schmidhuber, “HQ-learning”, Adaptive Behavior, Vol. 6, No. 2, 1997, pp. 219-246.

Anna Montesanto, Ph.D. She received the Laurea degree in Psychology from the University of Rome "La Sapienza" (Rome, Italy) in 1995, the MSc degree in "Cognitive Psychology and Neural Networks" from the University of Rome "La Sapienza" in 1996, and the Ph.D. degree in "Artificial Intelligent Systems" from the University of Ancona (Ancona, Italy) in 2001. Currently she is a researcher at the Institute of Computer Science, University of Ancona. From 1992 to 1996 Dr. Montesanto was with the Artificial Intelligence Group at the Faculty of Psychology, University of Rome "La Sapienza". She was a visiting scientist at the Robotics Lab, Department of Computer Science, University of Manchester, in 2000. Her current interests are the development of hybrid architectures for vision-based reactive robotic navigation, the development of new methodological approaches to interface usability evaluation, and a theory of colour similarity.

Prof. Guido Tascini, born in Perugia (Italy), is professor of Artificial Intelligence at the University of Ancona. He received the MS degree in Electronic Engineering from the University of Pisa, Italy. He was professor of Operating Systems at the University of Ancona during the period '81-'92 and professor of Artificial Intelligence during the period '92-'97. He has visited several international research centres and spent a period of study at ICSI (Berkeley), California, in 1995. He has taken part in several European and national projects on artificial intelligence, parallel computing and robotics. He is a member of various scientific associations: IEEE Computer Society, Systems, Man and Cybernetics Society, SPIE, IAPR. He is a delegate at AEI, a founding member of AI*IA and chairman of the Perception group of AI*IA. His research interests concern artificial intelligence, computer vision, perception, learning and multimodal informatics. His main research themes over the years have been: analysis of information systems, analysis and automatic recognition of signals, image segmentation, 3D vision, spatio-temporal knowledge representation, analysis and understanding of images, recognition of hand-written characters, learning in pattern recognition and computer vision, applications of neural networks in computer vision, and applications of evolutionary techniques in mobile robotics and in computer vision.

Prof. Paolo Puliti has been Associate Professor at the University of Ancona since 1983. Previously he was professor with a temporary appointment and researcher at the University of Ancona. His research interests concern computer vision, image processing, knowledge representation, artificial intelligence and distributed artificial intelligence. He was previously interested in physics. He has been responsible for several MPI 60% projects and has collaborated on MPI 40% and CNR research projects; he is currently responsible for scientific research of the University (ex MPI 60%). He is a promoter of AI*IA (Associazione Italiana per l'Intelligenza Artificiale) and a member of the Italian Chapter of the IAPR (International Association for Pattern Recognition).

Paola Baldassarri received the Laurea degree in Electronics Engineering from the University of Ancona (Italy) in 1999. Currently she is a Ph.D. student in "Artificial Intelligent Systems" at the Institute of Computer Science, University of Ancona. Her research interests include the study of hybrid architectures for vision-based reactive robotic localization, unsupervised learning theory for the analysis of electromyography signals, and vision and image processing by neural networks.